A Simple Way to Quantify the Similarity Between Two Sets of Molecules

When designing machine learning algorithms to accelerate the discovery of novel and more effective therapeutics, we often care deeply about their ability to generalise to new regions of chemical space and to accurately predict the properties of molecules that are structurally or functionally dissimilar to the ones we have already explored. To evaluate algorithms in such an out-of-distribution setting, it is essential that we can quantify the data shift induced by the train-test splits we rely on to decide which model to deploy in production.

For our recent ICML 2023 paper, "Drug Discovery under Covariate Shift with Domain-Informed Prior Distributions over Functions", we chose to quantify the distributional similarity between two sets of molecules through the Maximum Mean Discrepancy (MMD).

The MMD (Gretton et al., 2012) is a non-parametric test statistic that allows us to quantify how likely two sets of molecules are to have been drawn from the same distribution over chemical space. It works by first embedding each molecule $x$ into a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ defined by a feature map $\varphi$ and similarity function (kernel) $k$:

$$k(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal{H}},$$

and then quantifying distributional similarity as the distance between the mean embeddings of the training distribution $P$ and the test distribution $Q$:

$$\mathrm{MMD}(P, Q) = \bigl\| \mathbb{E}_{x \sim P}[\varphi(x)] - \mathbb{E}_{y \sim Q}[\varphi(y)] \bigr\|_{\mathcal{H}},$$

the square of which can be easily and unbiasedly estimated from a training set $\{x_i\}_{i=1}^{m}$ and a test set $\{y_j\}_{j=1}^{n}$ as:

$$\widehat{\mathrm{MMD}}^2 = \frac{1}{m(m-1)} \sum_{i \neq j} k(x_i, x_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(y_i, y_j) - \frac{2}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} k(x_i, y_j).$$

While some technical restrictions apply to which embeddings and similarity functions yield valid estimators, these are conveniently fulfilled by standard Morgan/extended-connectivity fingerprints and the Jaccard/Tanimoto kernel. This means that, given an all-by-all Tanimoto similarity matrix sim_mat and a boolean train-test split index train_idx, the distributional similarity of the training and test sets can be easily quantified as:

import numpy as np

def distributional_similarity(train_idx, sim_mat):
    """
    Unbiased estimator of the maximum mean discrepancy (MMD) for a given
    boolean training index and kernel similarity matrix. Assumes sim_mat
    has a unit diagonal (true for Tanimoto self-similarities).
    """
    n_train = train_idx.sum()
    n_test = (~train_idx).sum()

    # Within-set kernel means, excluding the diagonal (self-similarity)
    # entries to obtain the unbiased estimator.
    train_mean = (sim_mat[np.ix_(train_idx, train_idx)].sum() - n_train) / (n_train * (n_train - 1))
    test_mean = (sim_mat[np.ix_(~train_idx, ~train_idx)].sum() - n_test) / (n_test * (n_test - 1))

    # Cross-set kernel mean; there are no diagonal entries to exclude here.
    train_test_mean = sim_mat[np.ix_(train_idx, ~train_idx)].mean()

    # The unbiased squared-MMD estimate can be slightly negative when the
    # two sets are very similar, so clamp at zero before taking the root.
    mmd_squared = train_mean + test_mean - 2 * train_test_mean
    return np.sqrt(max(mmd_squared, 0.0))
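
For completeness, the inputs to this function can be assembled with RDKit along the following lines. This is only a minimal sketch: the example SMILES, the fingerprint radius of 2 (ECFP4), and the 2048-bit size are illustrative assumptions rather than choices taken from the paper.

import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

# Placeholder molecules; in practice these come from your dataset.
smiles = ["CCO", "CCN", "c1ccccc1", "c1ccccc1O", "CC(=O)O", "CCCC"]
mols = [Chem.MolFromSmiles(s) for s in smiles]

# 2048-bit Morgan/ECFP4 fingerprints (radius and size are assumptions).
fps = [AllChem.GetMorganFingerprintAsBitVect(m, radius=2, nBits=2048) for m in mols]

# All-by-all Tanimoto similarity matrix.
sim_mat = np.array([DataStructs.BulkTanimotoSimilarity(fp, fps) for fp in fps])

# Boolean mask marking the molecules assigned to the training set.
train_idx = np.array([True, True, True, False, False, False])

print(distributional_similarity(train_idx, sim_mat))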

Using such a rigorous and easily computable statistic allowed us to (1) investigate the extent to which standard train-test splits induce practically relevant covariate shift, (2) construct a maximally extrapolative evaluation setup, and (3) develop methods with improved out-of-distribution robustness and generalisation. Feel free to check out the paper for more details and background.
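
To give a flavour of point (2), the general idea can be illustrated with a toy procedure that scores candidate splits by their MMD and retains the most shifted one. This is only a sketch reusing sim_mat and distributional_similarity from above, not the split-construction procedure actually used in the paper:

import numpy as np

# Toy illustration (not the paper's procedure): among a few random
# 50/50 splits, keep the one with the largest train-test MMD.
rng = np.random.default_rng(0)
n = sim_mat.shape[0]
candidate_splits = [rng.permutation(n) < n // 2 for _ in range(100)]
most_shifted = max(candidate_splits, key=lambda idx: distributional_similarity(idx, sim_mat))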
