Category Archives: Machine Learning

Featurisation is Key: One Version Change that Halved DiffDock’s Performance

1. Introduction 

Molecular docking with graph neural networks works by representing molecules as featurised graphs. In DiffDock, each ligand becomes a graph of atoms (nodes) and bonds (edges), with features assigned to every atom using chemical properties such as atom type, implicit valence and formal charge. 
 
We recently discovered that a change in RDKit version significantly reduces performance on the PoseBusters benchmark, due to changes in the “implicit valence” feature. This post walks through: 

  • How DiffDock featurises ligands 
  • What happened when we upgraded RDKit 2022.03.3 → 2025.03.1 
  • Why training with zero-only features and testing on non-zero features is so bad 

TL;DR: Use the dependencies listed in the environment.yml file, especially in the case of DiffDock, or your performance could halve! 

2. Graph Representation in DiffDock 

DiffDock turns a ligand into input for a graph neural net by: 

  1. Loading the ligand from an SDF file via RDKit. 
  2. Stripping all hydrogens to keep heavy atoms only. 
  3. Featurising each atom into a 16-dimensional vector (see the sketch after this list): 

0: Atomic number 
1: Chirality tag 
2: Total bond degree 
3: Formal charge 
4: Implicit valence 
5: Number of implicit Hs 
6: Radical electrons 
7: Hybridisation 
8: Aromatic flag 
9–15: Ring-membership flags (rings of size 3–8) 

  4. Building a PyG HeteroData object containing the node features and bond edges. 
  5. Randomising position, orientation and torsion angles before inputting to the model for inference. 
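
To make the featurisation concrete, below is a minimal sketch of how such a 16-dimensional atom vector can be built with RDKit. It is an illustration rather than DiffDock’s actual code (which encodes each property as an index into lists of allowed values), and the file path is a placeholder.

    from rdkit import Chem

    def featurise_atom(atom: Chem.Atom) -> list:
        """Build the 16-dimensional feature vector listed above (illustrative)."""
        ring_info = atom.GetOwningMol().GetRingInfo()
        return [
            atom.GetAtomicNum(),            # 0: atomic number
            int(atom.GetChiralTag()),       # 1: chirality tag
            atom.GetTotalDegree(),          # 2: total bond degree
            atom.GetFormalCharge(),         # 3: formal charge
            atom.GetImplicitValence(),      # 4: implicit valence (the culprit)
            atom.GetNumImplicitHs(),        # 5: number of implicit Hs
            atom.GetNumRadicalElectrons(),  # 6: radical electrons
            int(atom.GetHybridization()),   # 7: hybridisation
            int(atom.GetIsAromatic()),      # 8: aromatic flag
            # 9-15: ring-membership flags for rings of size 3-8
            *[int(ring_info.IsAtomInRingOfSize(atom.GetIdx(), size)) for size in range(3, 9)],
        ]

    mol = Chem.RemoveHs(Chem.MolFromMolFile("ligand.sdf"))  # placeholder path; heavy atoms only
    node_features = [featurise_atom(atom) for atom in mol.GetAtoms()]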

3. PoseBusters Benchmark & RDKit Version Bump 

Using the supplied evaluation.py script, which docks into the protein chains that the ground-truth ligand is bound to, we evaluated the 428-complex PoseBusters set with two different RDKit versions: 

RDKit version | Success rate (RMSD < 2 Å) 
2022.03.3 | 50.89 % 
2025.03.1 | 23.72 % 

With no changes other than the RDKit version, the success rate dropped by over half. 

Having checked the evaluation and conformer-generation steps, we took a more detailed look at the preprocessed data being fed into the model under each RDKit version. Everything was identical except implicit valence: 
– RDKit 2022.03.3: implicit valence = 0 for every atom 
– RDKit 2025.03.1: implicit valence ranges from 0 to 3 

Relevant Changes to RDKit’s GetImplicitValence() 

Between 2022.03.3 and 2025.03.1, RDKit was refactored so that implicit hydrogen counts are recomputed rather than permanently zeroed out after stripping explicit H’s. 

Old 2022.03.3 behavior: 

  • RemoveHs() deletes all explicit hydrogens and sets each heavy atom’s internal flag df_noImplicit = true, keeping only a heavy atom representation. 
  • Once df_noImplicit is set, asking for implicit valence always returns 0, even if you re-run sanitization. 

New 2025.03.1 behavior: 

  • RemoveHs() deletes explicit hydrogens but does not flag df_noImplicit = true, allowing recomputation of implicit valence. 
  • Sanitization calculates implicit valence = allowed valence – sum of explicit bonds 
  • GetImplicitValence() returns the correct implicit valence, even after stripping all H’s. 

These changes mean: 
Old (2022.03.3): RemoveHs() → df_noImplicit → GetImplicitValence() always 0 
New (2025.03.1): RemoveHs() (flag untouched) → sanitization recomputes → GetImplicitValence() returns the correct implicit-H count 
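
A quick way to see the difference is to load a ligand, strip its hydrogens, and print the implicit valences under each RDKit build. This is a minimal sketch: the file path is a placeholder and the exact values depend on the ligand.

    from rdkit import Chem

    # Load a ligand keeping its explicit hydrogens, then strip them as DiffDock does.
    supplier = Chem.SDMolSupplier("ligand.sdf", removeHs=False)  # placeholder path
    mol = Chem.RemoveHs(supplier[0])

    print([atom.GetImplicitValence() for atom in mol.GetAtoms()])
    # RDKit 2022.03.3: all zeros (df_noImplicit was set by RemoveHs)
    # RDKit 2025.03.1: recomputed implicit-H counts (0-3 for the PoseBusters ligands)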
 
Because DiffDock was only ever trained on zeros at that index, suddenly inputting non-zero values at inference caused this collapse in performance. 

We force-zeroed that column and recovered performance under the new RDKit, confirming that this feature caused the drop in performance: 

- implicit_valence = atom.GetImplicitValence()
+ implicit_valence = 0

RDKit build | Success rate (RMSD < 2 Å) 
2022.03.3 (baseline) | 50.89 % 
2025.03.1 (unpatched) | 23.72 % 
2025.03.1 (patched) | 50.26 % 

4. Why Zero-Trained → Non-Zero-Tested Is So Bad  

Consider a single unit acting on the implicit-valence feature: a weight w controls how much the feature influences the network, and there is also a bias b and an activation function ϕ. Together they compute: 
 
    output = ϕ(w·v + b) 

where v is the implicit-valence feature. 

What Happens When You Train on Only Zeros? 

  • Implicit valence (v) = 0 every time you train. 
  • Since the input is always zero, the gradient with respect to w is zero, so there is no signal telling w to move. In the absence of an explicit mechanism pushing the weights towards zero, such as weight decay, they will remain non-zero. 
  • Effectively, the model never learns anything about the implicit-valence column, and w remains at its random starting point. 

What happens at test time? 

  • The implicit valence feature (v) might be 1, 2, or 3 now. 
  • The unchanged, random w multiplies this new v, producing unpredictable activations ϕ(w_random·v + b). 
  • These activations propagate through the downstream layers to the final prediction (see the toy sketch below). 
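
The effect is easy to reproduce with a toy example. The sketch below (not DiffDock code) trains a single unit with its input feature fixed at zero and then feeds it non-zero values at test time; the weight never receives a gradient during training, so the test-time outputs are driven by its random initialisation.

    import torch

    torch.manual_seed(0)
    w = torch.randn(1, requires_grad=True)  # weight for the implicit-valence feature
    b = torch.randn(1, requires_grad=True)
    optimiser = torch.optim.SGD([w, b], lr=0.1)

    # Training: v is always zero, so the gradient of the loss with respect to w is always zero.
    for _ in range(100):
        v = torch.zeros(32)
        loss = ((torch.tanh(w * v + b) - 0.5) ** 2).mean()
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()

    print(w)  # unchanged: still at its random initialisation
    # Test time: non-zero v produces activations governed by the untrained, random w.
    print(torch.tanh(w * torch.tensor([0.0, 1.0, 2.0, 3.0]) + b))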

5. Conclusion 

Featurisation is very important – in the case of DiffDock, one library tweak changed one feature column and halved the performance! The fix was easy once it was found, but remember: 

  1. Featurisation is key. 
  2. Particularly in the case of DiffDock, use the listed dependency versions! 
  3. If you see a sudden large change in performance, it might be worth checking the package versions and the features… 

Happy docking! 

NVIDIA Reimagines CUDA for Python Developers

According to GitHub’s Open Source Survey, Python officially became the world’s most popular programming language in 2024, ultimately surpassing JavaScript. Due to its exceptional popularity, NVIDIA announced Python support for its CUDA toolkit at last year’s GTC conference, marking a major leap in the accessibility of GPU computing. With the latest update (https://nvidia.github.io/cuda-python/latest/), developers can for the first time write Python code that runs directly on NVIDIA GPUs without the need for intermediate C or C++ code.

Historically tied to C and C++, CUDA has found its way into Python code through third-party wrappers and libraries. Now, the arrival of native support means a smoother, more intuitive experience.

This paradigm shift opens the door for millions of Python programmers – including our scientific community – to build powerful AI and scientific tools without having to switch languages or learn legacy syntax.

Continue reading

Confidence in ML models

Recently, I have been interested in adding a confidence metric to the predictions made by a machine learning model I have been working on. In this blog post, I will outline a few strategies I have been exploring to do this. Powerful deep learning models like AlphaFold are great not only for the predictions they make, but also because they generate confidence measures that give the user a sense of how much to trust each prediction.

Continue reading

Combining Multiple Comparisons Similarity plots for statistical tests

Following on from my previous blopig post, Garrett gave the very helpful suggestion of combining Multiple Comparisons Similarity (MCSim) plots to reduce information redundancy. For example, this is an MCSim plot from my previous blog post:

This plot shows effect sizes from a statistical test (specifically Tukey HSD) between mean absolute error (MAE) scores for different molecular featurization methods on a benchmark dataset. Red shows that the method on the y-axis has a greater average MAE score than the method on the x-axis; blue shows the inverse. There is redundancy in this plot, as the same information is displayed in both the upper and lower triangles. Instead, we could plot both the effect size and the p-values from a test on the same MCSim.
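
One way to sketch that combined plot (with made-up numbers rather than the data from the post) is to place the effect sizes in the lower triangle and the p-values in the upper triangle of a single matrix before plotting:

    import numpy as np
    import matplotlib.pyplot as plt

    methods = ["Method A", "Method B", "Method C"]  # hypothetical featurisation methods
    effect_size = np.array([[ 0.00,  0.12, -0.05],
                            [-0.12,  0.00, -0.17],
                            [ 0.05,  0.17,  0.00]])  # pairwise differences in mean MAE
    p_values = np.array([[1.00, 0.03, 0.40],
                         [0.03, 1.00, 0.01],
                         [0.40, 0.01, 1.00]])        # Tukey HSD p-values

    combined = np.tril(effect_size, k=-1) + np.triu(p_values, k=1)  # lower: effect, upper: p
    fig, ax = plt.subplots()
    im = ax.imshow(combined, cmap="coolwarm")
    ax.set_xticks(range(len(methods)), labels=methods)
    ax.set_yticks(range(len(methods)), labels=methods)
    fig.colorbar(im, ax=ax)
    plt.show()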

Continue reading

Narrowing the gap between machine learning scoring functions and free energy perturbation using augmented data

I’m delighted to report our collaboration (Ísak Valsson, Matthew Warren, Aniket Magarkar, Phil Biggin, & Charlotte Deane), on “Narrowing the gap between machine learning scoring functions and free energy perturbation using augmented data”, has been published in Nature’s Communications Chemistry (https://doi.org/10.1038/s42004-025-01428-y).


During his MSc dissertation project in the Department of Statistics, University of Oxford, OPIG member Ísak Valsson developed an attention-based GNN to predict protein-ligand binding affinity called “AEV-PLIG”. It featurizes a ligand’s atoms using Atomic Environment Vectors to describe the Protein-Ligand Interactions found in a 3D protein-ligand complex. AEV-PLIG is free and open source (BSD 3-Clause), available from GitHub at https://github.com/oxpig/AEV-PLIG, and forked at https://github.com/bigginlab/AEV-PLIG.

Continue reading

Estimating the Generalisability of Machine Learning Models in Drug Discovery

Machine learning (ML) has significantly advanced key computational tasks in drug discovery, including virtual screening, binding affinity prediction, protein-ligand structure prediction (co-folding), and docking. However, the extent to which these models generalise beyond their training data is often overestimated due to shortcomings in benchmarking datasets. Existing benchmarks frequently fail to account for similarities between the training and test sets, leading to inflated performance estimates. This issue is particularly pronounced in tasks where models tend to memorise training examples rather than learning generalisable biophysical principles. The figure below demonstrates two examples of model performance decreasing with increased dissimilarity between training and test data, for co-folding (left) and binding affinity prediction (right).

Continue reading

Baby’s First NeurIPS: A Survival Guide for Conference Newbies

There’s something very surreal about stepping into your first major machine learning conference: suddenly, all those GitHub usernames, paper authors, and protagonists of heated twitter spats become real people, the hallways are buzzing with discussions of papers you’ve been meaning to read, and somehow there are 17,000 other people trying to navigate it all alongside you. That was my experience at NeurIPS this year, and despite feeling like a microplankton in an ocean of ML research, I had a grand time. While some of this success was pure luck, much of it came down to excellent advice from the group’s ML conference veterans and lessons learned through trial and error. So, before the details fade into a blur of posters and coffee breaks, here’s my guide to making the most of your first major ML conference.

Continue reading

Generating Haikus with Llama 3.2

At the recent OPIG retreat, I was tasked with writing the pub quiz. The quiz included five rounds, and it’s always fun to do a couple “how well do you know your group?” style rounds. Since I work with Transformers, I thought it would be fun to get AI to create Haiku summaries of OPIGlet research descriptions from the website.

AI isn’t as funny as it used to be, but it’s a lot easier to get it to write something coherent. There are also lots of knobs you can turn, like temperature, top_p, and the details of the prompt. I decided to use Meta’s new Llama 3.2-3B-Instruct model, which is publicly available on Hugging Face. I ran it locally using vllm and instructed it to write a haiku for each member’s description using a short script that parses the HTML from the website.
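
For flavour, the core of such a script might look something like the sketch below; the prompt wording and sampling settings are my own illustrative choices, and the HTML parsing is omitted.

    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct")
    params = SamplingParams(temperature=0.8, top_p=0.9, max_tokens=64)

    description = "..."  # a member's research description, parsed from the website
    prompt = f"Write a haiku summarising this research description:\n\n{description}\n\nHaiku:"
    print(llm.generate([prompt], params)[0].outputs[0].text)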

Continue reading

Visualising and validating differences between machine learning models on small benchmark datasets

Author: Sam Money-Kyrle

Introduction

An epidemic is sweeping through cheminformatics (and machine learning) research: ugly results tables. These tables are typically bloated with metrics (such as regression and classification metrics next to each other), vastly differing tasks, erratic bold text, and many models. As a consequence, results become difficult to analyse and interpret. Additionally, it is rare to see convincing evidence, such as statistical tests, for whether one model is ‘better’ than another (something Pat Walters has previously discussed). Tables are a practical way to present results and are appropriate in many cases; however, this practicality should not come at the cost of clarity.

The terror of ugly tables extends to benchmark leaderboards, such as Therapeutic Data Commons (TDC). These leaderboard tables do not show:

  1. whether differences in metrics between methods are statistically significant,
  2. whether methods use ensembles or single models,
  3. whether methods use classical (such as Morgan fingerprints) or learned (such as Graph Neural Networks) representations,
  4. whether methods are pre-trained or not,
  5. whether pre-trained models are supervised, self-supervised, or both,
  6. the data and tasks that pre-trained models are pre-trained on.

This lack of context makes meaningful comparisons between approaches challenging, obscuring whether performance discrepancies are due to variance, ensembling, overfitting, exposure to more data, or novelties in model architecture and molecular featurisation. Confirming the statistical significance of performance differences (under consistent experimental conditions!) is crucial in constructing a more lucid picture of machine learning in drug discovery. Using figures to share results in a clear, non-tabular format would also help.

Statistical validation is particularly relevant in domains with small datasets, such as drug discovery, as the small number of test samples leads to high variance in performance between different splits. Recent work by Ash et al. (2024) sought to alleviate the lack of statistical validation in cheminformatics by sharing a helpful set of guidelines for researchers. Here, we explore implementing some of the methods they suggest (plus some others) in Python.
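
As a small taste of what that looks like in practice, the sketch below (toy scores, not code from the post) compares per-split MAE values for three hypothetical models with Tukey’s HSD:

    import numpy as np
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    rng = np.random.default_rng(0)
    scores = {                           # hypothetical MAE per cross-validation split
        "model_A": rng.normal(0.60, 0.05, 10),
        "model_B": rng.normal(0.55, 0.05, 10),
        "model_C": rng.normal(0.62, 0.05, 10),
    }
    maes = np.concatenate(list(scores.values()))
    labels = np.repeat(list(scores.keys()), 10)
    print(pairwise_tukeyhsd(maes, labels))  # pairwise differences with adjusted p-values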

Continue reading