Out-of-distribution generalisation and scaffold splitting in molecular property prediction

The ability to apply previously acquired knowledge to novel and unfamiliar situations is one of the main hallmarks of successful learning and general intelligence. This capability to generalise effectively is amongst the most desirable properties a prediction model (or a mind, for that matter) can have.

In supervised machine learning, the standard way to evaluate the generalisation power of a prediction model for a given task is to randomly split the available data set X into two sets – a training set X_{\text{train}} and a test set X_{\text{test}}. The model is then trained on the examples in X_{\text{train}}, and its predictive abilities are afterwards measured on the untouched examples in X_{\text{test}} via a suitable performance metric.
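As a minimal sketch of this procedure, such a uniform random split can for instance be generated with scikit-learn (assuming scikit-learn is available; the toy data, the 80/20 ratio and all variable names below are merely illustrative):

# minimal sketch: uniform random train/test split with scikit-learn (assumed to be installed)

from sklearn.model_selection import train_test_split

# toy data: feature vectors and associated labels
X = [[0.1], [0.4], [0.35], [0.8], [0.6], [0.2], [0.9], [0.5]]
y = [0, 1, 1, 0, 1, 0, 0, 1]

# hold out 20% of the examples, chosen uniformly at random
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)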

Since in this scenario the model has never seen any of the examples in X_{\text{test}} during training, its performance on X_{\text{test}} must be indicative of its performance on novel data X_{\text{new}} which it will encounter in the future. Right?

No.

In practice, one regularly observes that a machine learning model which performs well on a randomly selected test set X_{\text{test}} fails spectacularly when confronted with novel data X_{\text{new}} collected at a later point in time, by a different lab, in a different environment, or in some other context that differs from the one in which the initial data set X was collected. The reason for this lies in the distributional shift between X_{\text{train}} and X_{\text{new}} which frequently occurs when the data collection context (and thus the data-generating process) is altered in some way.

If the data split for the initial data set X into training set X_{\text{train}} and test set X_{\text{test}} is done uniformly at random (as is usual), then both X_{\text{train}} and X_{\text{test}} follow the same distribution. This uniform random data split is very much in accordance with the framework of classical statistical learning theory [1], where one assumes that a learning model is built to deal with training and test examples that have all been sampled independently from the same underlying probability distribution.

Unfortunately, a uniform random data split is rarely a good simulation of practical reality: a newly collected data set X_{\text{new}} that is fed into a machine learning model to obtain predictions almost never follows the distribution of the data set X_{\text{train}} on which the model was originally trained. This distributional shift between X_{\text{train}} and X_{\text{new}} normally leads to a substantial drop in the model's performance on X_{\text{new}} compared to its performance on a test set X_{\text{test}} that follows the same distribution as X_{\text{train}}. Splitting the initial data set X uniformly at random into a training set X_{\text{train}} and a test set X_{\text{test}} therefore often leads to overoptimistic estimates of the predictive abilities of a machine learning model in a practical setting.

To get a more reliable picture of the real-world predictive capabilities of a trained machine learning model one must find a way to model a meaningful distributional shift and build it into the test set X_{\text{test}}. Evaluating the model on X_{\text{test}} can then provide a measure for the out-of-distribution generalisation abilities of the model.

Measuring out-of-distribution generalisation is of particular relevance in the field of molecular property prediction where distributional shifts tend to be large and difficult to handle for machine learning models. Different molecular data sets obtained by distinct pharmaceutical companies and research groups often contain compounds from vastly different areas of chemical space that exhibit high structural heterogeneity. An elegant solution for the modelling of such distributional shifts in chemical space is given by the idea of scaffold splitting.

The notion of a (two-dimensional) molecular scaffold is described in the article by Bemis and Murcko [2]. A molecular scaffold reduces the chemical structure of a compound to its core components by removing all side chains and keeping only ring systems and the linkers that connect them. An additional option for making molecular scaffolds even more general is to “forget” the identities of the bonds and atoms by replacing all atoms with carbons and all bonds with single bonds.

Bemis-Murcko scaffolds can be automatically generated in RDKit via the following Python code:

# how to extract the Bemis-Murcko scaffold of a molecular compound via RDKit

# import packages (the last two imports are only needed for the inline molecule drawings in a Jupyter notebook)
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold
from rdkit.Chem.Draw import IPythonConsole  # enables 2D depictions of mol objects in Jupyter
from IPython.display import display

# define compound via its SMILES string
smiles = "CN1CCCCC1CCN2C3=CC=CC=C3SC4=C2C=C(C=C4)SC"

# convert SMILES string to RDKit mol object 
mol = Chem.MolFromSmiles(smiles)

# create RDKit mol object corresponding to Bemis-Murcko scaffold of original compound
mol_scaffold = MurckoScaffold.GetScaffoldForMol(mol)

# make the scaffold generic by replacing all atoms with carbons and all bonds with single bonds
mol_scaffold_generic = MurckoScaffold.MakeScaffoldGeneric(mol_scaffold)

# convert the generic scaffold mol object back to a (canonical) SMILES string
smiles_scaffold_generic = Chem.MolToSmiles(mol_scaffold_generic)

# display compound and its generic Bemis-Murcko scaffold
display(mol)
print(smiles)
display(mol_scaffold_generic)
print(smiles_scaffold_generic)

If we now have a molecular data set X, we can map each compound in X to its respective scaffold. Let us assume that a total of s pairwise distinct scaffolds appear in X and that these scaffolds are numbered consecutively from 1 to s. We can then define an equivalence relation on X by calling two compounds equivalent if they share the same scaffold. The associated equivalence classes are compound sets X_{1}, ..., X_{s}, where the set X_{k} contains all compounds in X that share the k-th scaffold. It is not hard to see that the sets X_{1}, ..., X_{s} form a partition of the original data set X. Without loss of generality, we assume that the equivalence classes X_{1}, ..., X_{s} are ordered by size in descending order, i.e. X_{1} contains at least as many molecules as X_{2}, and so on.
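As an illustrative sketch (not a standard RDKit routine), such a scaffold-based partition could be computed by grouping compounds according to the canonical SMILES of their scaffolds; the function and variable names below are hypothetical:

# sketch: partition a list of SMILES strings into equivalence classes defined by their Bemis-Murcko scaffolds

from collections import defaultdict
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_classes(smiles_list):
    # map the canonical SMILES of each scaffold to the list of compounds sharing that scaffold
    classes = defaultdict(list)
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        scaffold = MurckoScaffold.GetScaffoldForMol(mol)
        classes[Chem.MolToSmiles(scaffold)].append(smiles)
    # return the equivalence classes X_1, ..., X_s ordered by size in descending order
    return sorted(classes.values(), key=len, reverse=True)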

One appropriate way to produce a scaffold split of the molecular data set X into a training set X_{\text{train}} and a test set X_{\text{test}} for machine learning is to define X_{\text{train}} as the union of the first (larger) sets X_{1}, ..., X_{c} and X_{\text{test}} as the union of the last (smaller) sets X_{c+1}, ..., X_{s}. Here c is a custom index parameter which can be used to control the respective sizes of X_{\text{train}} and X_{\text{test}}; frequently c is chosen such that X_{\text{train}} contains approximately 80% of the examples in X.
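Building on the hypothetical scaffold_classes function from the sketch above, the split itself could then look roughly as follows (train_fraction and all other names are again merely illustrative):

# sketch: scaffold split that assigns the larger scaffold classes to the training set

def scaffold_split(smiles_list, train_fraction=0.8):
    classes = scaffold_classes(smiles_list)  # equivalence classes X_1, ..., X_s, largest first
    train_set, test_set = [], []
    for compound_class in classes:
        # the first (larger) classes fill up the training set until it holds roughly
        # train_fraction of the data; all remaining (smaller) classes go into the test set
        if len(train_set) < train_fraction * len(smiles_list):
            train_set.extend(compound_class)
        else:
            test_set.extend(compound_class)
    return train_set, test_set

Note that the resulting training fraction will only approximately match train_fraction, since whole equivalence classes are never split across the two sets.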

While a scaffold split is certainly not perfect, it is already a lot better than a uniform random split at providing a relevant measure of the practical utility of a molecular property prediction model. It mimics a situation where the training set X_{\text{train}} was sampled from a structurally different area of chemical space than the test set X_{\text{test}}. This creates a distributional shift between X_{\text{train}} and X_{\text{test}} which is comparable to the distributional shifts commonly observed between real chemical data sets. Evaluating a molecular machine learning model with a scaffold split rather than a uniform random split thus leads to considerably more realistic performance estimates.

References:

[1] Cucker, Felipe, and Steve Smale. “On the mathematical foundations of learning.” Bulletin of the American Mathematical Society 39.1 (2002): 1-49.

[2] Bemis, Guy W., and Mark A. Murcko. “The properties of known drugs. 1. Molecular frameworks.” Journal of Medicinal Chemistry 39.15 (1996): 2887-2893.
