Monthly Archives: June 2022

Exploring topological fingerprints in RDKit

Finding a way to express the similarity of irregular and discrete molecular graphs to enable quantitative algorithmic reasoning in chemical space is a fundamental problem in data-driven small molecule drug discovery.

Virtually all algorithms that are widely and successfully used in this setting boil down to extracting and comparing (multi-)sets of subgraphs, differing only in the space of substructures they consider and the extent to which they are able to adapt to specific downstream applications.

A large body of recent work has explored approaches centred around graph neural networks (GNNs), which can often maximise both of these considerations. However, the subgraph-derived embeddings learned by these algorithms may not always perform well beyond the specific datasets they are trained on, and for many generic or resource-constrained applications more traditional “non-parametric” topological fingerprints may still be a viable, and often preferable, choice.

This blog post gives an overview of the topological fingerprint algorithms implemented in RDKit. In general, they count the occurrences of a certain family of subgraphs in a given molecule and then represent this set/multiset as a bit/count vector, which can be compared to other fingerprints with the Jaccard/Dice similarity metric or further processed by other algorithms.
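As a concrete example (my own illustration, not code from the post): one of the most widely used topological fingerprints, the Morgan/circular fingerprint, can be computed and compared in RDKit in a few lines.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

mol1 = Chem.MolFromSmiles('Cc1ccccc1')   # toluene
mol2 = Chem.MolFromSmiles('CCc1ccccc1')  # ethylbenzene

# Morgan (circular) fingerprints with radius 2, folded into 2048-bit vectors
fp1 = AllChem.GetMorganFingerprintAsBitVect(mol1, 2, nBits=2048)
fp2 = AllChem.GetMorganFingerprintAsBitVect(mol2, 2, nBits=2048)

# compare the bit vectors with the Jaccard (Tanimoto) and Dice metrics
print(DataStructs.TanimotoSimilarity(fp1, fp2))
print(DataStructs.DiceSimilarity(fp1, fp2))
```

The resulting `ExplicitBitVect` objects can also be converted to numpy arrays and fed into downstream machine learning models.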

Continue reading

Tackling horizontal and vertical limitations

A blog post about reviewing papers and preparing papers for publication.

We start with the following premise: all papers have limitations. There is not a single paper without limitations. A method may not be generally applicable, a result may not be completely justified by the data or a theory may make restrictive assumptions. To cover all limitations would make a paper infinitely long, so we must stop somewhere.

A lot of limitations fall into the following scenario: the results or methods are presented, but they could have been extended in some way. Suppose we obtain results on a particular cell type using an immortalized cell line. Would the results still hold if we performed the experiments on primary or patient-derived cells? If the signal from the original cells was sufficiently robust, we would hope so. However, we cannot be one hundred percent sure. A similar example is a method that can be applied to a certain type of data. It may be possible to extend the method to other data types, but this may require new methodology. I call this flavor of limitations vertical limitations. They are vertical in the sense that they build upon an already developed result in the manuscript. Certain journals will require that you tackle vertical limitations by adapting the original idea or method, to demonstrate broad appeal or that the idea could permeate multiple fields. Most of the time, however, the premise of an approach is not to keep extending it. It works. Leave it alone. Do not ask for more. An idea done well does not need more.

Continue reading

Oxford MRC DTP Symposium 2022

The Oxford Medical Research Council Doctoral Training Partnership (MRC DTP), the program through which my DPhil is funded, hosts an annual Symposium to highlight research being conducted by DTP students and offer insights into the career paths of external speakers.

This year, I was on the committee organising the Symposium and was involved in selecting student presenters, as well as deciding on and inviting external speakers. It was a great experience!

Panel on careers in biotech featuring Loïc Roux, Ochre Bio (centre); Helena Meyer-Berg, Sirion Biotech (centre right); and Claire Shingler, Oxford BioEscalator (right).

Here are my key takeaways from the Symposium:

Continue reading

Entering a Stable Relationship with your Neural Network

Over the past year, I have been working on building a graph-based paratope (antibody binding site) prediction tool – Paragraph. Fortunately, I have had moderate success with this and you can now check out the preprint of this work here.

However, for a long time, I struggled with a highly unstable network, where different random seeds yielded very different results. I believe this instability was largely due to the high class imbalance in my data – only ~10% of all residues in the Fv (variable region of the antibody) belong to the paratope.

I tried many different things in an attempt to stabilise my training, most of which failed. I will share all of these ideas with you though – successful or not – as what works for one person/network is never guaranteed to work for another. I hope the ideas below give others facing similar issues something to try out. Where possible, I also provide some example hyperparameter values that could act as sensible starting points.
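To give a flavour of the kind of lever involved (an illustrative PyTorch sketch, not Paragraph's actual code): with ~10% positive residues, one common starting point is to up-weight the positive class in the loss by roughly the negative-to-positive ratio.

```python
import torch
import torch.nn as nn

# ~10% of residues are paratope, so up-weight positives by roughly
# n_negative / n_positive ≈ 9 (a starting point, worth tuning)
loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([9.0]))

logits = torch.randn(32)                  # stand-in per-residue predictions
labels = (torch.rand(32) < 0.1).float()   # stand-in labels, ~10% positive
loss = loss_fn(logits, labels)
print(loss.item())
```

Here the loss on positive residues counts ~9x as much as on negatives, which discourages the network from collapsing to the trivial "predict non-paratope everywhere" solution.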

Continue reading

AIRR Community Meeting VI, May 17-19

Eve, Brennan and I were delighted to attend the sixth AIRR (adaptive immune receptor repertoire) Community Meeting: Exploring New Frontiers in San Diego. Eve and I had been awaiting this meeting for a mere 3 years, since it was announced during the last in-person AIRR Community Meeting back in 2019. Fortunately, San Diego did not disappoint. 

After a rocky start (featuring many hours stuck in traffic on the M40, one missed flight and one delayed flight), we made it to California! The three-day conference had ~230 participants (remote and in-person) and featured great talks from academia and industry. We particularly enjoyed keynote talks from Dennis Burton on rational vaccine design using broadly neutralising antibodies, Gunilla Karlsson Hedestam on functional consequences of allelic variation, Shane Crotty on COVID and HIV vaccine design, and Atul Butte on uses of electronic health record data and how we should all found start-ups.

We had fun delivering a tutorial on OPIG antibody tools and, most importantly, we all won AIRR t-shirts in the raffle (potentially we were the only people who noticed how to enter on the conference app). Highlights outside of the conference included paddle boarding and seeing hummingbirds, pelicans, sealions, seals, ‘Garibaldi’ the state fish, and meeting Bob the golden retriever at a surfing shop. We’re now off to find jobs on the West Coast so we can live at the beach….

The AIRR Community has many webinars and talks available on their YouTube channel: https://www.youtube.com/c/AIRRCommunity

Sarah, Eve & Brennan

Visualise with Weights and Biases

Understanding what’s going on when you’ve started training your shiny new ML model is hard enough. Will it work? Have I got the right parameters? Is it the data? Probably. Any tool that can help with that process is a godsend. Weights and Biases is a great tool to help you visualise and track your model throughout your production cycle. In this blog post, I’m going to detail some basics on how you can initialise and use it to visualise your next project.

Installation

To use Weights and Biases (wandb), you need to make an account. It is free for individuals; however, you will have to pay for team-oriented features. Wandb can then be installed using pip or conda.

$ conda install -c conda-forge wandb

or

$ pip install wandb

To initialise your project, import the package, sign in, and then run the following with your chosen project name (and username, if you want):

import wandb

wandb.login()
wandb.init(project='project1')

In addition to your project, you can also initialise a config dictionary with starting parameter values:

Continue reading

Sharing Data Responsibly: The FAIR Principles

So you’ve submitted your paper, made your code publicly available, and maybe even provided documentation to ensure somebody can reproduce your work. But what about the data your work is based on? Is that readily available to your readers, too?

Maybe it’s too large to put on GitHub alongside your code. Maybe it’s sensitive, or subject to GDPR restrictions, so you can’t just stick a download link on your website. Maybe it’s in a proprietary format that needs non-open software to read. There are many reasons sharing data can be less straightforward than sharing code, and often it’s not entirely clear what ‘best practices’ are for a given situation. Data management is a complicated topic, and to do it justice would require far more than a quick blog post. Instead, I’d like to focus on a single source of guidance that serves as a useful starting point for thinking about responsible data management: the FAIR principles.

Continue reading

Viewing fragment elaborations in RDKit

As a reasonably new RDKit user, I was relieved to find that its built-in functionality for generating basic images from molecules is quite easy to use. However, over time I have picked up some additional tricks to make the images generated slightly more pleasing on the eye!

The first of these (which I definitely stole from another blog post at some point…) is to ask it to produce SVG images rather than png:

#ensure the molecule visualisation uses svg rather than png format
from rdkit.Chem.Draw import IPythonConsole
IPythonConsole.ipython_useSVG = True

Now for something slightly more interesting: as a fragment elaborator, I often need to look at a long list of elaborations that have been made to a starting fragment. As these have usually been docked, they don’t look particularly nice when loaded straight into RDKit and drawn:

from rdkit import Chem
from rdkit.Chem import Draw

#load several mols from a single sdf file using SDMolSupplier
#and add these to a list
elabs = [mol for mol in Chem.SDMolSupplier('frag2/elabsTestNoRefine_Docked_0.sdf')]

#get list of ligand efficiencies (docking score per heavy atom)
#so these can be displayed alongside the molecules
LEs = [float(mol.GetProp('Gold.PLP.Fitness')) / mol.GetNumHeavyAtoms() for mol in elabs]

Draw.MolsToGridImage(elabs, legends=[str(LE) for LE in LEs])
Fig. 1: Images generated without doing any tinkering

Two quick changes that will immediately make this image more useful are aligning the elaborations by a supplied substructure (here I supplied the original fragment so that it’s always in the same place) and calculating the 2D coordinates of the molecules so we don’t see the twisty business happening in the bottom right of Fig. 1:
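The full details are behind the link, but the two changes map onto RDKit calls roughly like this (a sketch using a hypothetical indole fragment as the template; the post itself uses its own fragment and docked elaborations):

```python
from rdkit import Chem
from rdkit.Chem import AllChem, Draw

# hypothetical starting fragment used as the alignment template
template = Chem.MolFromSmiles('c1ccc2[nH]ccc2c1')
AllChem.Compute2DCoords(template)

# stand-in elaborations of that fragment
mols = [Chem.MolFromSmiles(s) for s in ('Cc1ccc2[nH]ccc2c1', 'OCc1ccc2[nH]ccc2c1')]
for mol in mols:
    # generate clean 2D coordinates, constrained so the template
    # substructure sits in the same place in every depiction
    AllChem.GenerateDepictionMatching2DStructure(mol, template)

img = Draw.MolsToGridImage(mols)
```

Because `GenerateDepictionMatching2DStructure` recomputes 2D coordinates as part of the alignment, it deals with both problems at once: the docked 3D "twistiness" disappears and the shared fragment no longer jumps around between grid cells.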

Continue reading

Antibodies as Drugs: Keystone Symposia

Between the 27th April and 1st of May, I was very fortunate to be able to attend the Antibodies as Drugs Keystone Symposium and give my first conference talk internationally, in which I spoke about the methods our group has developed for using structure to make predictions about where an antibody binds relative to other antibodies. This included paratyping [1], Ab-Ligity [2] and most recently SPACE [3].

I will preface this by saying that lots of the work people spoke about was unpublished, which was exciting but makes for a difficult blog post to write. To avoid any possibility of putting my foot in my mouth, I will keep the science very surface-level. The conference was held at the Keystone resort in Colorado, and the science combined with a kind of landscape I have never experienced before made for an extremely cool experience. This meeting was originally combined with a protein design meeting, and the two were split by COVID – this meant that in-silico methods were the minority in the program, but I didn’t mind that, as the computational work that was presented was quite diverse, so it was still a good representation of the field. I also really enjoyed the large number of infectious disease talks, in which we got a good range of the major human pathogens – ebolaviruses, SARS-CoV-2 of course, dengue, hantaviruses, metapneumovirus, HIV, TB and malaria all featured. The bispecific session was another highlight for me. The conference was very well organised and I liked how we were all asked to share a fun fact about ourselves – one speaker shared that he is a Christmas tree farmer in his spare time (I won’t share his name in case he is keeping that under wraps). That made me reconsider how fun I can truly consider myself…

Without turning this into a travel blog, I also want to add that Keystone was insanely beautiful, and to make you look at some pics I got.

We got to experience snow
Continue reading

Making better plots with matplotlib.pyplot in Python3

The default plots made by Python’s matplotlib.pyplot module are almost always insufficient for publication. With ~20 extra lines of code, however, you can generate high-quality plots suitable for inclusion in your next article.

Let’s start with code for a very default plot:

import matplotlib.pyplot as plt
import numpy as np

np.random.seed(1)
d1 = np.random.normal(1.0, 0.1, 1000)
d2 = np.random.normal(3.0, 0.1, 1000)
xvals = np.arange(1, 1000+1, 1)

plt.plot(xvals, d1, label='data1')
plt.plot(xvals, d2, label='data2')
plt.legend(loc='best')
plt.xlabel('Time, ns')
plt.ylabel('RMSD, Angstroms')
plt.savefig('bad.png', dpi=300)

The result of this will be:

Plot generated with matplotlib.pyplot defaults

The fake data I generated for the plot look something like Root Mean Square Deviation (RMSD) versus time for a converged molecular dynamics simulation, so let’s pretend they are. There are a number of problems with this plot: it’s overall ugly, the color scheme is not very attractive and may not be color-blind friendly, the y-axis range of the data extends outside the range of the tick labels, etc.

We can easily convert this to a much better plot:
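The improved version is behind the link, but as a rough sketch of the kinds of changes involved (not the post's exact code), one might do something like:

```python
import matplotlib.pyplot as plt
import numpy as np

# same fake RMSD-vs-time data as before
np.random.seed(1)
d1 = np.random.normal(1.0, 0.1, 1000)
d2 = np.random.normal(3.0, 0.1, 1000)
xvals = np.arange(1, 1001)

fig, ax = plt.subplots(figsize=(4, 3), dpi=300)

# colour-blind-friendly colours and thinner lines
ax.plot(xvals, d1, color='#0072B2', lw=0.8, label='data1')
ax.plot(xvals, d2, color='#D55E00', lw=0.8, label='data2')

# axis labels with units, and axis ranges that actually cover the data
ax.set_xlabel('Time (ns)')
ax.set_ylabel('RMSD (Å)')
ax.set_xlim(0, 1000)
ax.set_ylim(0, 4)

ax.legend(frameon=False)
fig.tight_layout()
fig.savefig('better.png')
```

The object-oriented interface (`fig, ax = plt.subplots()`) is also generally easier to extend later, e.g. to multi-panel figures, than the implicit `plt.plot` state machine used in the default version.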

Continue reading