Making better plots with matplotlib.pyplot in Python3

The default plots made by Python’s matplotlib.pyplot module are almost always insufficient for publication. With ~20 extra lines of code, however, you can generate high-quality plots suitable for inclusion in your next article.

Let’s start with code for a very default plot:

import matplotlib.pyplot as plt
import numpy as np

np.random.seed(1)
d1 = np.random.normal(1.0, 0.1, 1000)
d2 = np.random.normal(3.0, 0.1, 1000)
xvals = np.arange(1, 1000+1, 1)

plt.plot(xvals, d1, label='data1')
plt.plot(xvals, d2, label='data2')
plt.legend(loc='best')
plt.xlabel('Time, ns')
plt.ylabel('RMSD, Angstroms')
plt.savefig('bad.png', dpi=300)

The result of this will be:

Plot generated with matplotlib.pyplot defaults

The fake data I generated for the plot look something like Root Mean Square Deviation (RMSD) versus time for a converged molecular dynamics simulation, so let’s pretend they are. There are a number of problems with this plot: it’s overall ugly, the color scheme is not very attractive and may not be color-blind friendly, the y-axis range of the data extends outside the range of the tick labels, etc.

We can easily convert this to a much better plot:
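As a rough sketch of the kind of changes that address the problems listed above (not necessarily the exact code from this post), the version below sets an explicit figure size, uses two colors from the color-blind-friendly Okabe-Ito palette, and fixes the axis limits so the data stay inside the labelled tick range. The specific sizes, colors, limits and the output file name better.png are assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Same fake data as in the default plot
np.random.seed(1)
d1 = np.random.normal(1.0, 0.1, 1000)
d2 = np.random.normal(3.0, 0.1, 1000)
xvals = np.arange(1, 1000 + 1, 1)

# Explicit figure size (inches); an assumption, adjust to your journal's column width
fig, ax = plt.subplots(figsize=(6, 4))

# Blue and vermillion from the Okabe-Ito color-blind-friendly palette
ax.plot(xvals, d1, label='data1', color='#0072B2', linewidth=1.0)
ax.plot(xvals, d2, label='data2', color='#D55E00', linewidth=1.0)

# Axis limits and ticks chosen so that all data fall within labelled ticks
ax.set_xlim(0, 1000)
ax.set_ylim(0, 4)
ax.set_yticks(np.arange(0, 4.5, 1.0))

# Larger labels, inward ticks, and a legend without a frame
ax.set_xlabel('Time, ns', fontsize=12)
ax.set_ylabel('RMSD, Angstroms', fontsize=12)
ax.tick_params(direction='in', labelsize=10)
ax.legend(loc='center right', frameon=False)

fig.tight_layout()
fig.savefig('better.png', dpi=300)
```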


Dealing with multiple compilers

I don’t know about you, but when I am compiling a complicated program and everything goes smoothly, I feel a mixture of joy and surprise. Let’s face it, compiling can be quite frustrating, and if you need to compile something relatively old, chances are that you will spend hours and hours trying to understand the compiler error messages.

Many of these error messages, which can be quite convoluted, boil down to your program requiring an older compiler version, so you first need to install it. I am going to assume that you have sudo rights; otherwise, we will be playing the game of compiling a compiler, something that I recommend you do at least once and at most once in your life.

In common Linux distributions, installing an older compiler is as easy as using the package manager, such as apt on Ubuntu or yum on CentOS:

#Ubuntu
$ sudo apt install build-essential
$ sudo apt install gcc-7 g++-7
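
Once gcc-7 and g++-7 are installed next to the default compiler, one common way to select them for a single build is to set the CC and CXX environment variables, which most build systems (make, CMake, configure scripts) respect. The snippet below is a minimal sketch of that idea; the helper name compiler_env is hypothetical and the commented-out make call is only a placeholder for your actual build command.

```python
import os
import shutil

def compiler_env(version, base_env=None):
    """Return a copy of the environment with CC/CXX set to gcc-<version>/g++-<version>."""
    env = dict(os.environ if base_env is None else base_env)
    env["CC"] = f"gcc-{version}"
    env["CXX"] = f"g++-{version}"
    return env

env = compiler_env("7")

# Report whether the versioned compilers are actually available on PATH
for tool in (env["CC"], env["CXX"]):
    print(tool, "->", shutil.which(tool) or "not found on PATH")

# Placeholder: run the build with the selected compilers, e.g.
# import subprocess
# subprocess.run(["make"], env=env, check=True)
```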

Benford’s law and OAS

Benford’s law is the observation that in numerical data produced by many kinds of process, the leading digit tends to be small. Wikipedia tells you that in datasets obeying Benford’s law, the number 1 appears as the leading digit about 30% of the time while 9 appears less than 5% of the time (p(n) = log10(1+1/n), where n is the leading digit). Wikipedia further lists multiple kinds of data where this tends to be true, such as electricity bills, population numbers and physical and mathematical constants, and particularly data that can be described by a power law.
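
To make the formula concrete, here is a short sketch that evaluates p(n) = log10(1+1/n) for each possible leading digit. The function name benford_probability is just an illustrative choice.

```python
import math

def benford_probability(n):
    """Expected frequency of leading digit n (1-9) under Benford's law."""
    return math.log10(1 + 1 / n)

probs = {n: benford_probability(n) for n in range(1, 10)}

# Print the expected frequency of each leading digit
for n, p in probs.items():
    print(f"{n}: {p:.3f}")
```

The nine probabilities sum to 1, with roughly 0.301 for a leading 1 and roughly 0.046 for a leading 9.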

Power laws and antibodies have been co-discussed in reference to network descriptions of antigen-experienced BCR repertoires [1], which are often described as scale-free to use the network terminology (following a power law). This means a few highly-connected nodes in the network and lots of nodes with few or no connections. This is an obvious candidate for Benford’s law.

This is of no practical relevance, but I wondered if I could see Benford’s law in other kinds of data besides clone counts in the Observed Antibody Space (OAS). For example, I looked at the leading digit in the number of sequences in all of the data units in OAS. It looks like a good fit for Benford’s law (though with more density at the smaller leading digits) and has a chi-squared value of 0.007 (Figure 1A).
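
A comparison of this kind can be sketched as follows. The counts here are placeholder values (powers of two, which are known to follow Benford’s law closely), not OAS data, and computing the chi-squared statistic on relative frequencies rather than raw counts is an assumption about how such a value might be obtained.

```python
import math

def leading_digit(count):
    """Leading decimal digit of a positive integer."""
    return int(str(count)[0])

# Expected Benford frequencies for digits 1..9
benford = [math.log10(1 + 1 / d) for d in range(1, 10)]

# Placeholder data standing in for "number of sequences per data unit"
counts = [2 ** k for k in range(1, 201)]

# Observed relative frequency of each leading digit
digits = [leading_digit(c) for c in counts]
observed = [digits.count(d) / len(digits) for d in range(1, 10)]

# Chi-squared statistic on relative frequencies (an assumption, see above)
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, benford))
print(f"chi-squared = {chi2:.4f}")
```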


Congratulations to Prof. Charlotte Deane, MBE

A while back, we read about the pivotal role Prof. Deane played at UKRI during the height of the COVID-19 pandemic.

Prof. Charlotte M. Deane

Her Majesty the Queen’s Birthday Honours list, released ahead of her Platinum Jubilee, included Prof. Charlotte M. Deane, who was awarded an MBE (Member of the Most Excellent Order of the British Empire):

Professor Charlotte Mary Deane. Deputy Executive Chair, UK Research and Innovation. For services to Covid-19 Research. (Oxford, Oxfordshire)

The Queen’s Birthday Honours 2022

Congratulations, Charlotte!

How to turn a SMILES string into a vector of molecular descriptors using RDKit

Molecular descriptors are quantities associated with small molecules that specify physical or chemical properties of interest. They can be used to numerically describe many different aspects of a molecule such as:

  • molecular graph structure,
  • lipophilicity (logP),
  • molecular refractivity,
  • electrotopological state,
  • druglikeness,
  • fragment profile,
  • molecular charge,
  • molecular surface.

Vectors whose components are molecular descriptors can be used (amongst other things) as high-level feature representations for molecular machine learning. In my experience, molecular descriptor vectors tend to fall slightly short of more low-level molecular representation methods such as extended-connectivity fingerprints or graph neural networks when it comes to predictive performance on large and medium-sized molecular property prediction data sets. However, one advantage of molecular descriptor vectors is their interpretability; there is a reasonable chance that the meaning of a physicochemical descriptor can be intuitively understood by a chemical expert.

A wide variety of useful molecular descriptors can be automatically and easily computed via RDKit purely on the basis of the SMILES string of a molecule. Here is a code snippet to illustrate how this works:
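
As a minimal sketch of the idea (not necessarily the snippet from this post), the function below parses a SMILES string and evaluates every descriptor listed in RDKit’s Descriptors.descList, which holds (name, function) pairs. The function name smiles_to_descriptor_vector and the example molecule (ethanol) are illustrative choices.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors

def smiles_to_descriptor_vector(smiles):
    """Compute all descriptors in Descriptors.descList for one SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    # Each entry of descList is a (descriptor_name, descriptor_function) pair
    return np.array([func(mol) for _, func in Descriptors.descList], dtype=float)

# Descriptor names, in the same order as the vector components
names = [name for name, _ in Descriptors.descList]

vec = smiles_to_descriptor_vector("CCO")  # ethanol
print(len(vec), "descriptors computed")
print("MolWt =", vec[names.index("MolWt")])
```

Keeping the list of names alongside the vector is what preserves interpretability: each component can be traced back to a named descriptor.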


GitHub actions can be useful

GitHub Actions is a (relatively) novel GitHub feature that allows you to run code on GitHub when a predefined event is triggered. The most widespread use case for GitHub Actions is Continuous Integration, because it allows you to automatically test your code on any machine immediately after each push. For a great tutorial on how to use it for this, see here.

But you can do so much more with them! You can set up any workflow to run after any event. An event is a specific activity that happens on GitHub, while a workflow is the script you want to run once that event has happened. For a full list of the events you can use, see here. Workflow scripts are written in a .yml file and should be saved in the .github/workflows directory of your repository. I am incapable of writing a better tutorial for these than what is already in their documentation, but I will show a copy of a workflow script I recently put together and walk you through it.

In one of my previous blog posts I wrote about how to upload your code to PyPI. Hopefully I convinced you that this is quite easy, but it does require a few steps that you may not want to repeat every time you add a new feature (or find a bug) and have to re-upload it. Luckily, you don’t have to! Just put the upload steps into a GitHub Actions workflow and it will automatically re-upload the package for you. Here is the script I use for this:
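
A workflow of this kind might look roughly like the sketch below. It is an illustration under several assumptions rather than the script from this post: it triggers when a GitHub release is published, builds the package with the build tool, and uploads it with the pypa/gh-action-pypi-publish action. The secret name PYPI_API_TOKEN is hypothetical and would need to be defined in your repository settings.

```yaml
# Saved as a .yml file inside .github/workflows/
name: Publish to PyPI

# Event: run whenever a release is published on GitHub (an assumed trigger)
on:
  release:
    types: [published]

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      # Fetch the repository contents
      - uses: actions/checkout@v4
      # Install a Python interpreter on the runner
      - uses: actions/setup-python@v5
        with:
          python-version: "3.x"
      # Build the source distribution and wheel into dist/
      - name: Build the distribution
        run: |
          python -m pip install --upgrade build
          python -m build
      # Upload everything in dist/ to PyPI
      - name: Upload to PyPI
        uses: pypa/gh-action-pypi-publish@release/v1
        with:
          password: ${{ secrets.PYPI_API_TOKEN }}  # hypothetical secret name
```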


CryoEM is now the dominant technique for solving antibody structures

Last year, the Structural Antibody Database (SAbDab) listed a record-breaking 894 new antibody structures, driven in no small part by the continued efforts of researchers to understand SARS-CoV-2.

Fig. 1: The aggregate growth in antibody structure data (all methods) over time. Taken from http://opig.stats.ox.ac.uk/webapps/newsabdab/sabdab/stats/ on 25th May 2022.

In this blog post I wanted to highlight the major driving force behind this curve – the huge increase in cryo electron microscopy (cryoEM) data – and the implications of this for the field of structure-based antibody informatics.


Why can a man not lift himself by pulling up on his bootstrap hypothesis test?

This blogpost highlights a typical mistake when performing the bootstrap hypothesis test. Bootstrapping is a method of resampling data to estimate measures of variability, such as confidence intervals or variance. 

In the simplest form of the bootstrap, assume you have a set of values 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. You want to estimate the mean and variability of the mean using these data. The recipe is as follows:
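
A minimal sketch of the simplest bootstrap for these ten values is given below: resample the data with replacement many times, compute the mean of each resample, and use the spread of those means as the estimate of variability. The number of resamples (10,000), the fixed random seed and the 95% percentile interval are assumptions made for illustration.

```python
import numpy as np

data = np.arange(1, 11)  # the values 1, 2, ..., 10
n_resamples = 10_000     # assumed number of bootstrap resamples
rng = np.random.default_rng(0)

# Draw resamples of the same size as the data, with replacement
resamples = rng.choice(data, size=(n_resamples, data.size), replace=True)

# One mean per resample
boot_means = resamples.mean(axis=1)

# Variability of the mean: standard error and a 95% percentile interval
std_error = boot_means.std(ddof=1)
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(f"sample mean      = {data.mean():.2f}")
print(f"bootstrap SE     = {std_error:.2f}")
print(f"95% percentile CI = ({ci_low:.2f}, {ci_high:.2f})")
```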


Linux Horror stories vol II: Automatic drivers update

As promised, I will tell you about another Linux Horror Story: The Nvidia driver automatic update that breaks your machine. This is a recurrent problem that I have suffered so many times that I tend to disable all Nvidia updates just to avoid it. Unfortunately, I forgot to do so on my new laptop, so it happened once more. 

It all started when I tried to connect my second monitor to my laptop, as I have been doing for the last 8 months. But the OS did not recognize the monitor. After unplugging and replugging the monitor a few times and rebooting my machine several times, I started to suspect a driver-related problem, so I ran the command nvidia-smi to check whether the GPU drivers were working. A familiar error message confirmed my fears:

NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

If you are lucky, this is just a consequence of the driver update and rebooting the machine will make it work again. Unfortunately, that was not the case for me, so I started the process of uninstalling and reinstalling the drivers. To do so on an Ubuntu machine, you only need the following two commands.


Make your code do more, with less

When you wrangle data for a living, you start to wonder why everything takes so darn long. Through five years of introspection, I have come to conclude that two simple factors limit every computational project. One is, of course, your personal productivity. Your time of focused work, minus distractions (and yes, meetings figure here), times your energy and mental acuity. All those things you have little control over, unfortunately. But the second is the productivity of your code and tools. And this, in principle, is a variable that you have full control over.

Even quick calculations, when applied to tens of millions of sequences, can take quite some time!

This is a post about how to increase your productivity, by helping you navigate all those instances when the progress bar does not seem to go fast enough. I want to discuss actionable tools to make your code run faster, and generate more results, with less effort, in less time. Instructions to tinker less and think more, so you can do the science that you truly want to be doing. And, above all, I want to give out advice that is so counter-intuitive that you should absolutely consider following it.
