Sharing Data Responsibly: The FAIR Principles

So you’ve submitted your paper, made your code publicly available, and maybe even provided documentation to ensure somebody can reproduce your work. But what about the data your work is based on? Is that readily available to your readers, too?

Maybe it’s too large to put on GitHub alongside your code. Maybe it’s sensitive, or subject to GDPR restrictions, so you can’t just stick a download link on your website. Maybe it’s in a proprietary format that needs non-open software to read. There are many reasons sharing data can be less straightforward than sharing code, and often it’s not entirely clear what ‘best practices’ are for a given situation. Data management is a complicated topic, and to do it justice would require far more than a quick blog post. Instead, I’d like to focus on a single source of guidance that serves as a useful starting point for thinking about responsible data management: the FAIR principles.

Continue reading

Viewing fragment elaborations in RDKit

As a reasonably new RDKit user, I was relieved to find that its built-in functionality for generating basic images from molecules is quite easy to use. However, over time I have picked up some additional tricks to make the images generated slightly more pleasing on the eye!

The first of these (which I definitely stole from another blog post at some point…) is to ask it to produce SVG images rather than PNG:

#ensure the molecule visualisation uses svg rather than png format
from rdkit.Chem.Draw import IPythonConsole
IPythonConsole.ipython_useSVG=True

Now for something slightly more interesting: as a fragment elaborator, I often need to look at a long list of elaborations that have been made to a starting fragment. As these have usually been docked, they don’t look particularly nice when loaded straight into RDKit and drawn:

from rdkit import Chem
from rdkit.Chem import Draw

#load several mols from a single sdf file using SDMolSupplier
#add these to a list, skipping any entries RDKit fails to parse (returned as None)
elabs = [mol for mol in Chem.SDMolSupplier('frag2/elabsTestNoRefine_Docked_0.sdf') if mol is not None]

#get list of ligand efficiencies so these can be displayed alongside the molecules
LEs = [(float(mol.GetProp('Gold.PLP.Fitness'))/mol.GetNumHeavyAtoms()) for mol in elabs]

Draw.MolsToGridImage(elabs, legends = [str(LE) for LE in LEs])
Fig. 1: Images generated without doing any tinkering

Two quick changes that will immediately make this image more useful are aligning the elaborations by a supplied substructure (here I supplied the original fragment so that it’s always in the same place) and calculating the 2D coordinates of the molecules so we don’t see the twisty business happening in the bottom right of Fig. 1:
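A minimal sketch of those two changes is below. It is not necessarily the code from the full post: the helper name align_to_fragment and the SMILES strings are made up for illustration, standing in for the docked elaborations and the starting fragment.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def align_to_fragment(mols, fragment):
    """Return 2D copies of mols, drawn with the shared fragment in a fixed position."""
    # Give the reference fragment 2D coordinates that act as the drawing template
    template = Chem.Mol(fragment)
    AllChem.Compute2DCoords(template)

    aligned = []
    for mol in mols:
        flat = Chem.Mol(mol)  # work on a copy so the docked 3D pose is left untouched
        if flat.HasSubstructMatch(template):
            # Lay the molecule out in 2D with the matched atoms pinned to the template
            AllChem.GenerateDepictionMatching2DStructure(flat, template)
        else:
            # No match: fall back to a plain 2D layout
            AllChem.Compute2DCoords(flat)
        aligned.append(flat)
    return aligned

# Hypothetical stand-ins for the starting fragment and its elaborations
fragment = Chem.MolFromSmiles('c1ccccc1O')
elabs = [Chem.MolFromSmiles(s) for s in ('Cc1ccccc1O', 'OC(=O)c1ccccc1O', 'CCO')]
aligned = align_to_fragment(elabs, fragment)
```

The aligned list can then be passed to Draw.MolsToGridImage in place of the original molecules, with the same legends as before.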

Continue reading

Antibodies as Drugs: Keystone Symposia

Between the 27th of April and the 1st of May, I was very fortunate to be able to attend the Antibodies as Drugs Keystone Symposium and give my first international conference talk, in which I spoke about the methods our group has developed for using structure to predict where an antibody binds relative to other antibodies. This included paratyping [1], Ab-Ligity [2] and most recently SPACE [3].

I will preface this by saying that lots of the work people spoke about was unpublished, which was so exciting, but makes for a difficult blog post to write. To avoid any possibility of putting my foot in my mouth I will keep the science very surface level. The conference was held at the Keystone resort in Colorado, and the science combined with a kind of landscape I have never experienced before made for an extremely cool experience. This meeting was originally combined with a protein design meeting, and the two were split by COVID – this meant that in-silico methods were the minority in the program, but I didn’t mind that as the computational work that was presented was quite diverse so it was definitely a good representation of the field still. I also really enjoyed the large number of infectious disease talks in which we got a good range of the major human pathogens – ebolaviruses, SARS-CoV-2 of course, dengue, hantaviruses, metapneumovirus, HIV, TB and malaria all featured. The bispecific session was another highlight for me. The conference was very well organised and I liked how we were all asked to share a fun fact about ourselves – one speaker shared that he is a Christmas tree farmer in his spare time (I won’t share his name in case he is keeping that under wraps). That made me reconsider how fun I can truly consider myself…

Without turning this into a travel blog, I also want to add that Keystone was insanely beautiful and make you look at some pics I got. 

We got to experience snow
Continue reading

Making better plots with matplotlib.pyplot in Python3

The default plots made by Python’s matplotlib.pyplot module are almost always insufficient for publication. With ~20 extra lines of code, however, you can generate high-quality plots suitable for inclusion in your next article.

Let’s start with code for a very default plot:

import matplotlib.pyplot as plt
import numpy as np

np.random.seed(1)
d1 = np.random.normal(1.0, 0.1, 1000)
d2 = np.random.normal(3.0, 0.1, 1000)
xvals = np.arange(1, 1000+1, 1)

plt.plot(xvals, d1, label='data1')
plt.plot(xvals, d2, label='data2')
plt.legend(loc='best')
plt.xlabel('Time, ns')
plt.ylabel('RMSD, Angstroms')
plt.savefig('bad.png', dpi=300)

The result of this will be:

Plot generated with matplotlib.pyplot defaults

The fake data I generated for the plot look something like Root Mean Square Deviation (RMSD) versus time for a converged molecular dynamics simulation, so let’s pretend they are. There are a number of problems with this plot: it’s overall ugly, the color scheme is not very attractive and may not be color-blind friendly, the y-axis range of the data extends outside the range of the tick labels, etc.

We can easily convert this to a much better plot:
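One possible version is sketched below. It is not the code from the full post: the colours (taken from the Okabe-Ito colour-blind-friendly palette), axis limits, font sizes and output filename are my own assumptions, chosen to address the problems listed above.

```python
import matplotlib
matplotlib.use('Agg')  # render straight to file, no display needed
import matplotlib.pyplot as plt
import numpy as np

# Same fake data as in the default plot
np.random.seed(1)
d1 = np.random.normal(1.0, 0.1, 1000)
d2 = np.random.normal(3.0, 0.1, 1000)
xvals = np.arange(1, 1000 + 1, 1)

fig, ax = plt.subplots(figsize=(6, 4))

# Two colours from the Okabe-Ito palette (blue and vermillion)
ax.plot(xvals, d1, color='#0072B2', linewidth=1.0, label='data1')
ax.plot(xvals, d2, color='#D55E00', linewidth=1.0, label='data2')

# Set the axis ranges explicitly so the data sit inside the labelled ticks
ax.set_xlim(0, 1000)
ax.set_ylim(0.0, 4.0)
ax.set_yticks(np.arange(0.0, 4.0 + 0.5, 1.0))

ax.set_xlabel('Time (ns)', fontsize=12)
ax.set_ylabel('RMSD (\u00c5)', fontsize=12)
ax.tick_params(direction='out', labelsize=10)

# Remove the top and right borders to reduce clutter
for side in ('top', 'right'):
    ax.spines[side].set_visible(False)

ax.legend(loc='center right', frameon=False, fontsize=10)
fig.tight_layout()
fig.savefig('better.png', dpi=300)
```

Using the fig, ax interface rather than bare plt calls makes it easier to tweak individual elements such as spines and ticks.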

Continue reading

Dealing with multiple compilers

I don’t know about you, but when I am compiling a complicated program and everything goes smoothly I feel a mixture of joy and surprise. Let’s face it, compiling can be quite frustrating, and if you need to compile something relatively old, chances are that you will spend hours and hours trying to understand the compiler error messages.

Several such compiler errors, which in many cases can be quite convoluted, tell you that your program requires an older compiler version, so you first need to install it. I am going to assume that you have sudo rights; otherwise, we will be playing the game of compiling a compiler, something that I recommend you do at least once and at most once in your life.

In common Linux distributions, installing an older compiler is as easy as using the package manager, such as apt on Ubuntu or yum on CentOS:

#Ubuntu
$ sudo apt install build-essential
$ sudo apt install gcc-7 g++-7
Continue reading

Benford’s law and OAS

Benford’s law is an observation that in numerical data (produced by many kinds of process), the leading digit tends to be small. Wikipedia tells you that in datasets obeying Benford’s law, the number 1 appears as the leading digit about 30% of the time while 9 appears less than 5% of the time (p(n) = log10(1+1/n) where n is the leading digit). Wikipedia further lists multiple kinds of data where this tends to be true such as electricity bills, population numbers and physical and mathematical constants, and particularly where data can be described by a power law.

Power laws and antibodies have been co-discussed in reference to network descriptions of antigen-experienced BCR repertoires [1], which are often described as scale-free to use the network terminology (following a power law). This means a few highly-connected nodes in the network and lots of nodes with few or no connections. This is an obvious candidate for Benford’s law.

This is of no practical relevance, but I wondered if I could see Benford’s law in other kinds of data besides clone counts in the Observed Antibody Space (OAS). For example, I looked at the leading digit in the number of sequences in all of the data units in OAS. It looks like a good fit for Benford’s law (though with more density at the smaller leading digits) and has a chi-squared value of 0.007 (Figure 1A).
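For anyone wanting to try this on their own counts, here is a minimal sketch of the comparison. The function names are hypothetical, the example data are powers of two rather than anything from OAS, and the chi-squared statistic is computed on proportions, which is an assumption and not necessarily how the value quoted above was calculated.

```python
import math
from collections import Counter

def benford_expected():
    """Expected leading-digit probabilities, p(n) = log10(1 + 1/n)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(n):
    """First digit of a non-zero integer."""
    return int(str(abs(int(n)))[0])

def leading_digit_frequencies(counts):
    """Observed proportion of each leading digit, ignoring zero counts."""
    digits = [leading_digit(c) for c in counts if int(c) != 0]
    tally = Counter(digits)
    return {d: tally.get(d, 0) / len(digits) for d in range(1, 10)}

def chi_squared(observed, expected):
    """Chi-squared statistic on proportions (an assumption, see above)."""
    return sum((observed[d] - expected[d]) ** 2 / expected[d] for d in range(1, 10))

# Hypothetical example data: powers of two, a classic Benford-like sequence
example_counts = [2 ** k for k in range(1, 101)]

expected = benford_expected()
observed = leading_digit_frequencies(example_counts)
statistic = chi_squared(observed, expected)
```

Swapping example_counts for a list of sequence counts, one per data unit, would give the kind of comparison described above.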

Continue reading

Congratulations to Prof. Charlotte Deane, MBE

A while back, we read about the pivotal role Prof. Deane played at UKRI during the height of the COVID-19 pandemic.

Prof. Charlotte M. Deane

Her Majesty the Queen’s Birthday Honours list, released ahead of her Platinum Jubilee, included Prof. Charlotte M. Deane, who was awarded an MBE (Member of the Most Excellent Order of the British Empire):

Professor Charlotte Mary Deane. Deputy Executive Chair, UK Research and Innovation. For services to Covid-19 Research. (Oxford, Oxfordshire)

The Queen’s Birthday Honours 2022

Congratulations, Charlotte!

How to turn a SMILES string into a vector of molecular descriptors using RDKit

Molecular descriptors are quantities associated with small molecules that specify physical or chemical properties of interest. They can be used to numerically describe many different aspects of a molecule such as:

  • molecular graph structure,
  • lipophilicity (logP),
  • molecular refractivity,
  • electrotopological state,
  • druglikeness,
  • fragment profile,
  • molecular charge,
  • molecular surface.

Vectors whose components are molecular descriptors can be used (amongst other things) as high-level feature representations for molecular machine learning. In my experience, molecular descriptor vectors tend to fall slightly short of more low-level molecular representation methods such as extended-connectivity fingerprints or graph neural networks when it comes to predictive performance on large and medium-sized molecular property prediction data sets. However, one advantage of molecular descriptor vectors is their interpretability; there is a reasonable chance that the meaning of a physicochemical descriptor can be intuitively understood by a chemical expert.

A wide variety of useful molecular descriptors can be automatically and easily computed via RDKit purely on the basis of the SMILES string of a molecule. Here is a code snippet to illustrate how this works:
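A minimal sketch of one way to do this is below. It is not necessarily the snippet from the full post: the function name smiles_to_descriptor_vector and the example descriptor subset are my own choices, and aspirin is used purely as an illustrative input.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.ML.Descriptors import MoleculeDescriptors

def smiles_to_descriptor_vector(smiles, descriptor_names=None):
    """Return (names, values) of RDKit descriptors for one SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f'Could not parse SMILES: {smiles}')

    # Default to every descriptor RDKit registers in Descriptors.descList
    if descriptor_names is None:
        descriptor_names = [name for name, _ in Descriptors.descList]

    calculator = MoleculeDescriptors.MolecularDescriptorCalculator(descriptor_names)
    values = list(calculator.CalcDescriptors(mol))
    return list(descriptor_names), values

# Illustrative subset of descriptors, computed for aspirin
names, values = smiles_to_descriptor_vector(
    'CC(=O)Oc1ccccc1C(=O)O',
    ['MolWt', 'MolLogP', 'MolMR', 'TPSA', 'NumHDonors', 'NumHAcceptors'],
)
```

Calling the function without descriptor_names computes the full set of descriptors, giving a longer vector that can be fed into a machine learning model after the usual scaling.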

Continue reading

GitHub actions can be useful

GitHub actions is a (relatively) novel GitHub feature that allows you to run code on GitHub when a predefined event is triggered. The most widespread use case for GitHub actions is for Continuous Integration, because it allows you to automatically test your code on any machine immediately after each push. For a great tutorial on how to use it for this see here.

But you can do so much more with them!! You can set up any workflow to run after any event. An event is a specific activity happening on GitHub, while a workflow is the script you want to run after that event has happened. For a full list of the events you can use see here. Workflow scripts are written in a .yml file and should be saved within the .github/workflows directory within your repository. I am incapable of writing a better tutorial for these than what is already in their documentation, but I will show a copy of a workflow script I recently put together and walk you through it.

In one of my previous blog posts I wrote about how to upload your code to PyPI. Hopefully I convinced you that this is quite easy, but it does require a few steps that you may not want to be doing every time you come up with a new feature (or find a bug) and have to re-upload it. Luckily, you don’t have to!! Just stick the code into a GitHub actions workflow so it will automatically re-upload it for you. Here is the script I use for this:

Continue reading

CryoEM is now the dominant technique for solving antibody structures

Last year, the Structural Antibody Database (SAbDab) listed a record-breaking 894 new antibody structures, driven in no small part by the continued efforts of researchers to understand SARS-CoV-2.

Fig. 1: The aggregate growth in antibody structure data (all methods) over time. Taken from http://opig.stats.ox.ac.uk/webapps/newsabdab/sabdab/stats/ on 25th May 2022.

In this blog post I wanted to highlight the major driving force behind this curve – the huge increase in cryo electron microscopy (cryoEM) data – and the implications of this for the field of structure-based antibody informatics.

Continue reading