Monthly Archives: June 2022

Dealing with multiple compilers

I don’t know you, but when I am compiling a complicated program and everything goes straightforward I feel a mixture of joy and surprise. Let’s face it, compiling can be quite frustrating, and if you need to compile something relatively old, chances are that you will spend hours and hours trying to understand the compiler error messages.

Several such compiler errors, that in many cases can be quite convoluted, tell you that your program requires an older version, so you first need to install it. I am going to assume that you have sudo rights, otherwise, we will be playing the game of compiling a compiler, something that I recommend you to do at least and at most once in your life.

In common Linux distributions like Ubuntu, installing an older compiler is as easy as using apt or yum:

#Ubuntu
$ sudo apt install build-essential
$ sudo apt install gcc-7 g++-7
Continue reading

Benford’s law and OAS

Benford’s law is an observation that in numerical data (produced by many kinds of process), the leading digit tends to be small. Wikipedia tells you that it in datasets obeying Benford’s law, the number 1 appears as the leading digit about 30% of the time while 9 appears less than 5% of the time (p(n) = log10(1+1/n) where n is the leading digit). Wikipedia further lists multiple kinds of data where this tends to be true such as electricity bills, population numbers and physical and mathematical constants, and particularly where data can be described by a power law.

Power laws and antibodies have been co-discussed in reference to network descriptions of antigen-experienced BCR repertoires [1], which are often described as scale-free to use the network terminology (following a power law). This means a few highly-connected nodes in the network and lots of nodes with few or no connections. This is an obvious candidate for Benford’s law.

This is of no practical relevance, but I wondered if I could see Benford’s law in other kinds of data besides clone counts in the Observed Antibody Space (OAS). For example, I looked at the leading digit in the number of sequences in all of the data units in OAS. It looks like a good fit for Benford’s law (though with more density at the smaller leading digits) and has a chi-squared value of 0.007 (Figure 1A).

Continue reading

Congratulations to Prof. Charlotte Deane, MBE

A while back, we read about the pivotal role Prof. Deane played at UKRI during the height of the COVID-19 pandemic.

Image of Prof. Charlotte M. Deane.
Prof. Charlotte M. Deane

Her Majesty the Queen’s Birthday Honours list, released ahead of her Platinum Jubilee, included Prof. Charlotte M. Deane, who was awarded an MBE (Member of the Most Excellent Order of the British Empire):

Professor Charlotte Mary Deane. Deputy Executive Chair, UK Research and Innovation. For services to Covid-19 Research. (Oxford, Oxfordshire)

The Queen’s Birthday Honours 2022

Congratulations, Charlotte!

How to turn a SMILES string into a vector of molecular descriptors using RDKit

Molecular descriptors are quantities associated with small molecules that specify physical or chemical properties of interest. They can be used to numerically describe many different aspects of a molecule such as:

  • molecular graph structure,
  • lipophilicity (logP),
  • molecular refractivity,
  • electrotopological state,
  • druglikeness,
  • fragment profile,
  • molecular charge,
  • molecular surface,

Vectors whose components are molecular descriptors can be used (amongst other things) as high-level feature representations for molecular machine learning. In my experience, molecular descriptor vectors tend to fall slightly short of more low-level molecular representation methods such as extended-connectivity fingerprints or graph neural networks when it comes to predictive performance on large and medium-sized molecular property prediction data sets. However, one advantage of molecular descriptor vectors is their interpretability; there is a reasonable chance that the meaning of a physicochemical descriptor can be intuitively understood by a chemical expert.

A wide variety of useful molecular descriptors can be automatically and easily computed via RDKit purely on the basis of the SMILES string of a molecule. Here is a code snippet to illustrate how this works:

Continue reading