Author Archives: Conor Wild

Fragment Based Drug Discovery with Crystallographic Fragment Screening at XChem and Beyond

Disclaimer: I’m a current PhD student working on PanDDA 2 for Frank von Delft and Charlotte Deane, sponsored by Global Phasing, and some of this is my opinion. If a claim isn’t obviously from one of the references, I probably said it, so take it with a pinch of salt.

Fragment Based Drug Discovery

Principle

Fragment based drug discovery (FBDD) is a technique for finding lead compounds for medicinal chemistry. In FBDD a protein target of interest is identified for inhibition and a small library, typically of a few hundred compounds, is screened against it. Though these compounds typically bind weakly, they can be used as a starting point for chemical elaboration towards something more lead-like. This approach is primarily contrasted with high throughput screening (HTS), in which an enormous number of larger, more complex molecules are screened in order to find ones which bind. The key idea is recognizing that the molecules in these HTS libraries can typically be broken down into a much smaller number of common substructures (fragments), so screening these ought to be more informative: between them they describe more of the “chemical space” which interacts with the protein. Since it first appeared about 25 years ago, FBDD has delivered four drugs for clinical use and over 40 molecules to clinical trials.

Continue reading

Model validation in Crystallographic Fragment Screening

Fragment based drug discovery is a powerful technique for finding lead compounds for medicinal chemistry. Crystallographic fragment screening is particularly useful because it shows not just whether a fragment binds, but also how it binds. This information allows for rational elaboration and merging of fragments.

However, this comes with a unique challenge: the confidence in the experimental readout, if and how a fragment binds, is tied to the quality of the crystallographic model that can be built. This intimately links crystallographic fragment screening to the general statistical idea of a “model”, and the statistical ideas of goodness of fit and overfitting.

Continue reading

Lessons in Scientific Code Deployment

So, I recently deployed my first piece of scientific code. Well, sort of. I made a GitHub repository with instructions on how to download, install and run it.

And then everyone broke it.

So, now having been on tech support duty for a few weeks, it seemed like a good idea to have a think about what I’ve learned.

Now, there is a big preface to this: the first and most important thing I learned is that I should do some reading on how to do this well. I have not yet done that reading, so this post isn’t so much going to offer any advice as catalogue my mistakes. Mistakes that will probably look extremely silly to anyone who has any familiarity with deployment, but might be interesting to anyone who doesn’t.

A surprising number of people really don’t want to touch the command line

As a programmer who spends the vast majority of their time on the command line, I find invoking programs from there very natural. As such, I very much underestimated the obstacle that even installing Anaconda, a few packages, and cloning the source code would be, even with instructions to copy and paste.

The issue is, if anything goes wrong, there is a good chance they don’t know whether it is my code or their environment breaking, which probably means they need to contact me about it (more on environments later). 

Really, I probably could have saved myself an awful lot of support by making it installable, and more still with a GUI to guide people through using the program.

Python is a pain

So, the first thing I learned was something I’d kind of been warned about: deploying Python code is a pain in the butt. Especially for people who aren’t familiar with Python, managing Python environments is both tricky and an overwhelmingly easy way to break code. Run a Python script from the wrong environment and it is going to fail: if you are lucky, with a failure to import a module; if you are unlucky, with a cryptic error due to, say, changes between Python versions.
Speaking of Python versions, developing in 3.9, not testing in 3.7, and then telling people to install 3.7 can result in a surprising number of surprisingly difficult bugs.
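One cheap guard against the wrong-interpreter problem is to check the Python version at startup and fail with a readable message instead of a cryptic one. This is only a minimal sketch: the name check_python and the minimum version of 3.7 are assumptions for illustration, not what my code actually does.

```python
import sys

# Assumed lowest version the code has actually been tested on.
MIN_PYTHON = (3, 7)


def check_python(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return an error message if the interpreter is too old, else None."""
    current = tuple(version_info[:2])
    if current < minimum:
        return (
            "This program needs Python %d.%d or newer, but it is running on %d.%d. "
            "Check which environment is active." % (minimum + current)
        )
    return None


if __name__ == "__main__":
    problem = check_python()
    if problem:
        # Exit before any imports that might fail in confusing ways.
        sys.exit(problem)
```

A check like this only tells people which version they are running; it does not replace actually testing on the oldest version you ask them to install.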

The instructions weren’t clear enough

Scientific code I think generally caters an awful lot to expert users, people who really understand the model and even are willing to open the source code to figure out the implementation.

My first stab at documentation managed to be unclear both to the people who didn’t want to touch the command line and to those who were willing to open the source code because they wanted to do something spicy.

So yeah, good documentation is an acquired skill.

Distributed computing is a nightmare

In principle, distribution is terrific: get a library that reduces running arbitrary Python code on multiple nodes to a simple map-like interface. On big clusters, like the ones a lot of scientists use, this can mean speedups of anywhere from 10 to even 1000 times.
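To show what a map-like interface looks like, here is a minimal sketch using the standard library’s thread pool as a local stand-in. It is not Dask and it does not touch a cluster; process_dataset and run_all are hypothetical names invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def process_dataset(dataset_id):
    """Hypothetical stand-in for an expensive per-dataset computation."""
    return dataset_id ** 2


def run_all(dataset_ids, max_workers=4):
    """Apply process_dataset to every id, returning results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map looks like the builtin map, but the calls run on workers.
        return list(pool.map(process_dataset, dataset_ids))


if __name__ == "__main__":
    print(run_all(range(5)))
```

The appeal is that the calling code barely changes when the pool is swapped for something that talks to a cluster scheduler. The pain, as described below, is in everything that swap hides.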

The only problem is, everyone’s cluster is a special snowflake, and you can’t access most of them to fix things. This can make iteration with a non-programmer painfully slow. 

Libraries don’t help as much as I’d have thought either: indeed, my experience of Dask and Dask Jobqueue has been a consistently uphill battle. Between the fact that my workload likes individual nodes sharing lots of memory and a few CPUs and some truly arcane errors (one that broke in the msgpack code), I have seriously considered (and even started) writing my own code to do this.

Active development doesn’t reach people

Code that is being updated several times a day in response to bugfixes can be great – but if people aren’t pulling and installing it, no-one is going to benefit. I’m seriously tempted to write some code to either auto-update on running or at least let folk know it has been updated.
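The “let folk know” half of that idea could be as small as comparing two version strings at startup. A minimal sketch, assuming plain dotted numeric versions like 0.1.2; how the latest version is looked up (a file in the repository, a tag, and so on) is deliberately left out, and both function names are hypothetical.

```python
def parse_version(text):
    """Turn a dotted version string such as '0.1.10' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def update_message(installed, latest):
    """Return a notice if the installed version is older than the latest one."""
    if parse_version(installed) < parse_version(latest):
        return (
            "A newer version is available (%s, you have %s). "
            "Please pull and reinstall." % (latest, installed)
        )
    return None


if __name__ == "__main__":
    print(update_message("0.1.2", "0.1.10"))
```

Comparing tuples of integers rather than raw strings matters here: as strings, "0.1.10" sorts before "0.1.2".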

Summary

In summary, a lot went wrong in my first stab at this. I have very much come to appreciate that a good deployment is an art form, and I’ve got an awful lot of reading to do. In particular, the above problem areas really have eaten a lot of time that probably could have been spent doing actual science with the code, so there is a good incentive to get it right.

Hidden Markov Models in Python: A simple Hidden Markov Model with Known Emission Matrix fitted with hmmlearn

The Hidden Markov Model

Consider a sensor which tells you whether it is cloudy or clear, but is wrong with some probability. Now, the weather *is* cloudy or clear, we could go and see which it was, so there is a “true” state, but we only have noisy observations on which to attempt to infer it.  

We might model this process (with the assumption of sufficiently precious weather), and attempt to make inferences about the true state of the weather over time, the rate of change of the weather and how noisy our sensor is by using a Hidden Markov Model. 

The Hidden Markov Model describes a hidden Markov Chain which at each step emits an observation with a probability that depends on the current state. In general both the hidden state and the observations may be discrete or continuous.
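As a rough illustration of that description, the cloudy/clear sensor can be simulated with two numbers: the probability that the weather stays the same from one step to the next, and the probability that the sensor reports the wrong state. This is only a sketch under assumed values (p_stay=0.9, p_error=0.2); it generates data and does no fitting.

```python
import random

STATES = ("clear", "cloudy")


def other(state):
    """Return whichever of the two states was not given."""
    return STATES[1] if state == STATES[0] else STATES[0]


def simulate(n_steps, p_stay=0.9, p_error=0.2, seed=0):
    """Simulate the hidden weather and the noisy sensor readings."""
    rng = random.Random(seed)
    state = rng.choice(STATES)
    hidden, observed = [], []
    for _ in range(n_steps):
        hidden.append(state)
        # Emission: the sensor reports the wrong state with probability p_error.
        observed.append(other(state) if rng.random() < p_error else state)
        # Transition: the weather changes with probability 1 - p_stay.
        if rng.random() >= p_stay:
            state = other(state)
    return hidden, observed


if __name__ == "__main__":
    hidden, observed = simulate(10)
    for truth, reading in zip(hidden, observed):
        print(truth, reading)
```

The inference problem is the reverse of this function: given only the observed list, recover the hidden list and the two probabilities.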

But for simplicity’s sake let’s consider the case where both the hidden and observed spaces are discrete. Then, the Hidden Markov Model is parameterised by two matrices: 

Continue reading

Calculating symmetrised small molecule RMSDs using graph automorphisms in python with GEMMI and NetworkX

When a ring flips, how do we calculate RMSD?

This surprisingly simple question leads to a very interesting problem! If we take a benzene molecule, say, and rotate it 180 degrees, then we have the exact same molecule, but if we have a data structure in which our atoms are labelled, and we apply the same transformation to the atomic positions, the numbering does not reflect that symmetry. If we were then naively to calculate the RMSD it would be huge, despite the fact that the molecule is, chemically speaking, identical.
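The effect is easy to reproduce on an idealised benzene ring. The sketch below assumes a flat regular hexagon of carbons with a radius of 1.39 Å and uses plain Python rather than GEMMI or NetworkX. Trying every cyclic relabelling of the ring is a brute-force stand-in covering only the ring rotations, not a general graph automorphism search.

```python
import math


def ring(radius=1.39, n=6):
    """Atom positions of a flat regular ring, labelled 0..n-1 in order."""
    return [
        (radius * math.cos(2 * math.pi * i / n), radius * math.sin(2 * math.pi * i / n))
        for i in range(n)
    ]


def rotate(points, angle):
    """Rotate 2D points about the origin, keeping their labels (list order)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def rmsd(a, b):
    """Naive RMSD: atom i in a is compared with atom i in b."""
    total = sum((ax - bx) ** 2 + (ay - by) ** 2 for (ax, ay), (bx, by) in zip(a, b))
    return math.sqrt(total / len(a))


def ring_rmsd(a, b):
    """Smallest RMSD over all cyclic relabellings of b (ring rotations only)."""
    return min(rmsd(a, b[k:] + b[:k]) for k in range(len(b)))


reference = ring()
flipped = rotate(reference, math.pi)  # 180 degree rotation in the ring plane

if __name__ == "__main__":
    print("naive RMSD:      %.2f" % rmsd(reference, flipped))
    print("relabelled RMSD: %.2f" % ring_rmsd(reference, flipped))
```

After the rotation every atom sits exactly where the opposite atom used to be, so the naive RMSD is a full ring diameter even though the molecule is unchanged, while the relabelled RMSD is zero up to rounding.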

How can we make our RMSD calculations reflect these symmetries?

Continue reading

Real Space Correlation Coefficient

Introduction

In crystallography we are often faced with the question of how well a part of our model fits the data. Now crystallography has well developed probability models for the reflection amplitudes given the entire fitted model, but these do not provide a metric for “how much of the ligand is inside the blob”. This is because the reflection based models are inherently global.
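The metric in the title is usually taken to be the Pearson correlation between the observed density and the density calculated from the model, evaluated only at the grid points near the atoms of interest. A minimal sketch of that calculation, assuming the two sets of density values have already been sampled at the same grid points and flattened into lists; reading maps and choosing the grid points is not shown.

```python
import math


def rscc(observed, calculated):
    """Pearson correlation between two equally long lists of density values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_c = sum(calculated) / n
    covariance = sum((o - mean_o) * (c - mean_c) for o, c in zip(observed, calculated))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_c = sum((c - mean_c) ** 2 for c in calculated)
    return covariance / math.sqrt(var_o * var_c)


if __name__ == "__main__":
    # Made-up density values at five grid points around a ligand.
    observed = [0.1, 0.8, 1.2, 0.9, 0.2]
    calculated = [0.0, 0.7, 1.0, 0.8, 0.1]
    print(round(rscc(observed, calculated), 3))
```

Because only the selected grid points enter the sums, the score is local to the ligand in a way the reflection based statistics are not.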

Continue reading

GEMMI: A Python Cookbook

General MacroMolecular I/O, or GEMMI, is a C++11 header only library for low level crystallography.

Because it’s header only, it is certainly the easiest low level crystallographic C++ library to access and use. However, GEMMI also comes with Python bindings via pybind11, making it arguably the easiest low level crystallographic library to access and use in Python as well!

What follows is a cookbook of useful Python code that uses GEMMI to accomplish macromolecular crystallographic tasks.

Continue reading

Functional Programming in Python

Introduction

The difficulty of reasoning about the behaviour of stateful programs, especially in concurrent environments, has led to increased interest in a programming paradigm called functional programming. This style emphasises the connection between programs and mathematics, encouraging code that is easy to understand and, in some critical cases, even possible to prove properties of.

Continue reading

Property based testing in Python with Hypothesis : how to break your own code before someone else does

Traceback (most recent call last):
ZeroDivisionError: integer division or modulo by 0

We’ve all been there. You’ve written your code, tested it out on some toy data, and then when you make the move to the real data, there is something you didn’t expect.

Maybe some samples have been truncated to zero. Maybe the input arrays are the wrong shape. Suddenly your code comes crashing down around you, and you’re left thinking: well, how could I have known that was going to happen? I can’t test everything.
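The idea behind property based testing can be sketched without any library: generate lots of inputs, edge cases first, and check that a property holds for each of them. The hand-rolled loop below is a stand-in written with the standard library only; it is not the Hypothesis API, and the mean function and the property checked are assumptions made up for the example.

```python
import random


def mean(values):
    """Integer mean of a list; written without thinking about empty input."""
    return sum(values) // len(values)


def find_counterexample(func, trials=100, seed=0):
    """Return (input, error) for the first input that breaks func, else None."""
    rng = random.Random(seed)
    # Try the boring edge cases first, then random lists of small integers.
    edge_cases = [[], [0], [0, 0]]
    random_cases = [
        [rng.randint(-10, 10) for _ in range(rng.randint(0, 5))]
        for _ in range(trials)
    ]
    for case in edge_cases + random_cases:
        try:
            result = func(case)
        except Exception as error:
            return case, error
        # Property: a mean must lie between the smallest and largest value.
        if case and not min(case) <= result <= max(case):
            return case, None
    return None


if __name__ == "__main__":
    case, error = find_counterexample(mean)
    print("failing input:", case)
    print("error:", repr(error))
```

Here the very first generated input, the empty list, already triggers a ZeroDivisionError of the kind shown in the traceback above. Nobody had to think of the empty list in advance, which is the whole point.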

Continue reading