When designing machine learning algorithms with the aim of accelerating the discovery of novel and more effective therapeutics, we often care deeply about their ability to generalise to new regions of chemical space and accurately predict the properties of molecules that are structurally or functionally dissimilar to the ones we have already explored. To evaluate the performance of algorithms in such an out-of-distribution setting, it is essential that we are able to quantify the data shift that is induced by the train-test splits that we rely on to decide which model to deploy in production.
For our recent ICML 2023 paper Drug Discovery under Covariate Shift with Domain-Informed Prior Distributions over Functions, we chose to quantify the distributional similarity between two sets of molecules through the Maximum Mean Discrepancy (MMD).
Continue reading