Category Archives: Statistical Modelling

A Simple Way to Quantify the Similarity Between Two Sets of Molecules

When designing machine learning algorithms with the aim of accelerating the discovery of novel and more effective therapeutics, we often care deeply about their ability to generalise to new regions of chemical space and accurately predict the properties of molecules that are structurally or functionally dissimilar to the ones we have already explored. To evaluate the performance of algorithms in such an out-of-distribution setting, it is essential that we are able to quantify the data shift that is induced by the train-test splits that we rely on to decide which model to deploy in production.

For our recent ICML 2023 paper Drug Discovery under Covariate Shift with Domain-Informed Prior Distributions over Functions, we chose to quantify the distributional similarity between two sets of molecules through the Maximum Mean Discrepancy (MMD).
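As a rough illustration of the idea, here is a minimal sketch of a biased estimate of the squared MMD between two sets of molecules. It assumes each molecule is represented as the set of on-bit indices of a binary fingerprint and uses the Tanimoto similarity as the kernel; both choices are assumptions made for this example and are not necessarily the ones used in the paper. The function names and the toy fingerprints are hypothetical.

```python
from itertools import product


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def mean_kernel(xs, ys) -> float:
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    return sum(tanimoto(x, y) for x, y in product(xs, ys)) / (len(xs) * len(ys))


def mmd_squared(xs, ys) -> float:
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy "fingerprints" (made-up data): each molecule is a set of on-bit indices.
train = [frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({1, 3, 4})]
test_near = [frozenset({1, 2, 4}), frozenset({2, 3, 4})]
test_far = [frozenset({7, 8, 9}), frozenset({8, 9, 10})]

print("train vs train:    ", mmd_squared(train, train))
print("train vs test_near:", mmd_squared(train, test_near))
print("train vs test_far: ", mmd_squared(train, test_far))
```

A larger value indicates a bigger shift between the two sets, so a train-test split that shares no substructure with the training set should score higher than one that overlaps with it.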


9th Joint Sheffield Conference on Cheminformatics

Over the next few days, researchers from around the world will be gathering in Sheffield for the 9th Joint Sheffield Conference on Cheminformatics. As one of the organizers (wearing my Molecular Graphics and Modeling Society ‘hat’), I can say we have an exciting array of speakers and sessions:

  • De Novo Design
  • Open Science
  • Chemical Space
  • Physics-based Modelling
  • Machine Learning
  • Property Prediction
  • Virtual Screening
  • Case Studies
  • Molecular Representations

It has traditionally taken place every three years and, despite the global pandemic, it is returning this year, once again in person in the excellent conference facilities at The Edge. You can download the full programme in iCal format, and here is the conference calendar:


Chained or Unchained: Markov, Nekrasov and Free Will

A Markov Chain moving between two states A and B. Animation by Devin Soni

Markov chains are simple probabilistic models of sequences of related events through time. In a Markov chain, the event at the present time depends only on the previous event in the sequence. The example above shows a dynamical system with two states, A and B, in which each event is either a move between the two states or staying put.

More formally, a Markov chain is a model of any sequence of events with the following relationship

P(X_t=x|X_{t-1}=x_{t-1},X_{t-2}=x_{t-2},\dots,X_1=x_1)=P(X_t=x|X_{t-1}=x_{t-1}).

That is, the event that the sequence \{X_t\}_{t} is in state x at time t is conditionally independent of all earlier states given the state at time t-1. This simple relationship between past and present is a useful simplifying assumption that lets us model many real-world systems to a surprising degree of accuracy. These range from air particles diffusing through a room, to the migration patterns of insects, to the evolution of your genome, and even your web browser activity. Given their broad use in describing natural phenomena, it is curious that Markov invented the Markov chain to settle a dispute in Mathematical Theology, one in which the atheist Markov was pitted against the devoutly Orthodox Pavel Nekrasov.
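To make the Markov property concrete, here is a minimal sketch that simulates a two-state chain like the one in the animation. The transition probabilities are made up for illustration and are not taken from the animation; the names `P` and `simulate` are hypothetical.

```python
import random

# Hypothetical transition probabilities: P[current][next].
P = {
    "A": {"A": 0.6, "B": 0.4},
    "B": {"A": 0.3, "B": 0.7},
}


def simulate(start: str, n_steps: int, seed: int = 0) -> list[str]:
    """Sample a path; each step looks only at the current state."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(n_steps):
        current = path[-1]
        options = list(P[current])
        weights = list(P[current].values())
        path.append(rng.choices(options, weights=weights)[0])
    return path


path = simulate("A", 10_000)
print("".join(path[:20]))
print("fraction of time in A:", path.count("A") / len(path))
```

Note that `simulate` never inspects anything earlier than `path[-1]`, which is exactly the conditional independence stated above. For these particular probabilities the long-run fraction of time spent in A should settle near 3/7, the stationary probability of A.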


Former OPIGlets – where are they now?

Since OPIG began in 2003, 53 students* have managed to escape. But where are these glorious people now? I decided to find out, using my best detective skills (aka LinkedIn, Google and Twitter).

* I’m only including full members who have left the group, as per the former members list on the OPIG website

Where are they?

Firstly, the countries. OPIGlets are mostly still residing in the UK, primarily in the ‘golden triangle’ of London, Oxford and Cambridge. The US comes in second, followed closely by Germany. (Note: one former OPIGlet is in Malta, which is too small to be recognised in Geopandas, so just imagine it is shown on the world map below.)


How do I do regression when my predictors have multicollinearity?

A quick summary of the key idea of principal components regression (PCR), its advantages and extensions.

Sometimes we find ourselves in a dire situation. We have measured some response y and a set of predictors W. Unfortunately, W is a short but wide matrix, say 10×100 or, worse, 10×100000: we’ve made only 10 observations of far more predictors. Standard regression is simply not going to work, because W^T W is singular. Some would say p is bigger than n.

So what can we do? Many of us would jump to LASSO or ridge regression. However, there is another way that is often overlooked.
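Going by the post's summary line, the overlooked alternative is principal components regression. Here is a minimal sketch of the idea, assuming NumPy is available: project the centred predictors onto their leading principal components, regress y on those few scores, then map the coefficients back to the original predictors. The function name `pcr_fit`, the number of components and the simulated data are all hypothetical choices for illustration.

```python
import numpy as np


def pcr_fit(W, y, n_components):
    """Regress y on the leading principal components of W."""
    w_mean, y_mean = W.mean(axis=0), y.mean()
    Wc = W - w_mean
    # Rows of Vt are the principal directions of the centred predictors.
    _, _, Vt = np.linalg.svd(Wc, full_matrices=False)
    V = Vt[:n_components].T             # shape (p, k)
    Z = Wc @ V                          # component scores, shape (n, k)
    gamma, *_ = np.linalg.lstsq(Z, y - y_mean, rcond=None)
    beta = V @ gamma                    # coefficients on the original p predictors
    intercept = y_mean - w_mean @ beta
    return beta, intercept


rng = np.random.default_rng(0)
n, p, k = 10, 100, 3                    # 10 observations, 100 predictors (toy sizes)
latent = rng.normal(size=(n, k))        # a few hidden factors drive every predictor
W = latent @ rng.normal(size=(k, p)) + 0.01 * rng.normal(size=(n, p))
y = latent @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=n)

beta, intercept = pcr_fit(W, y, n_components=k)
residual = y - (W @ beta + intercept)
print("coefficients:", beta.shape, "largest residual:", float(np.abs(residual).max()))
```

The regression itself only ever involves k columns, so it stays well posed even though p is far larger than n. How many components to keep is a modelling choice that this sketch does not address.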


CAML: Courses in Applied Machine Learning

*Shameless self-promotion klaxon!! Have a look at my new website!*

I’m excited to share a project I’ve been working on for the past few months! One of the biggest challenges of working on an interdisciplinary research project is getting to grips with the core principles of the disciplines which you don’t have much formal training in. For me, that means learning the basics of Medicinal Chemistry and Structural Biology so that when someone mentions pi-stacking I don’t think they’re talking about the logistics of managing a bakery; for people coming from Bio/Chem backgrounds it can mean understanding the Maths and Statistics necessary to make sense of the different algorithms which are central to their work.


No labels, no problem! A quick introduction to Gaussian Mixture Models

Statistical Modelling Big Data Analytics™ is in vogue at the moment, and there’s nothing quite so fashionable as the neural network. Capable of capturing complex non-linear relationships and scalable to high-dimensional datasets, neural networks are here to stay.

For your garden-variety neural network, you need two things: a set of features, X, and a label, Y. But what do you do if labelling is prohibitively expensive or your expert labeller goes on holiday for 2 months and all you have in the meantime is a set of features? Happily, we can still learn something about the labels, even if we might not know what they are!
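As a taste of how this can work, here is a minimal sketch of fitting a Gaussian mixture model with the EM algorithm, using only the Python standard library. It is deliberately restricted to one feature and two components, and the data are simulated; in practice you would more likely reach for a library implementation such as scikit-learn's `GaussianMixture`. The function names and the crude initialisation are assumptions made for this example.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def fit_gmm_1d(xs, n_iter=100):
    """EM for a two-component 1D Gaussian mixture.

    Returns (weights, means, sigmas, responsibilities).
    """
    weights, means, sigmas = [0.5, 0.5], [min(xs), max(xs)], [1.0, 1.0]
    resp = []
    for _ in range(n_iter):
        # E-step: how responsible is each component for each point?
        resp = []
        for x in xs:
            p = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sigmas)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate the parameters from the soft assignments.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sigmas[k] = max(math.sqrt(var), 1e-3)  # floor to avoid collapse
    return weights, means, sigmas, resp


# Simulated unlabelled features from two hidden groups (made-up data).
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(5.0, 1.0) for _ in range(200)]

weights, means, sigmas, resp = fit_gmm_1d(xs)
labels = [0 if r[0] >= r[1] else 1 for r in resp]  # hard labels from soft ones
print("estimated means:", [round(m, 2) for m in means])
print("points assigned to component 0:", labels.count(0))
```

The model never sees which group a point came from, yet the responsibilities give a soft guess at the missing label for every point.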
