Category Archives: Molecular Recognition

Featurisation is Key: One Version Change that Halved DiffDock’s Performance

1. Introduction 

Molecular docking with graph neural networks works by representing the molecules as featurised graphs. In DiffDock, each ligand becomes a graph of atoms (nodes) and bonds (edges), with features assigned to every atom using chemical properties such as atom type, implicit valence and formal charge. 
 
We recently discovered that a change between RDKit versions significantly reduces DiffDock's performance on the PoseBusters benchmark, due to changes in the "implicit valence" feature. This post walks through: 

  • How DiffDock featurises ligands 
  • What happened when we upgraded RDKit 2022.03.3 → 2025.03.1 
  • Why training with zero-only features and testing on non-zero features is so bad 

TL;DR: Use the dependencies listed in the environment.yml file, especially in the case of DiffDock, or your performance could halve!  

2. Graph Representation in DiffDock 

DiffDock turns a ligand into input for a graph neural net by 

  1. Loading the ligand from an SDF file via RDKit. 
  2. Stripping all hydrogens to keep heavy atoms only. 
  3. Featurising each atom into a 16-dimensional vector (a code sketch follows this list): 

0: Atomic number 
1: Chirality tag 
2: Total bond degree 
3: Formal charge 
4: Implicit valence 
5: Number of implicit H's 
6: Radical electrons 
7: Hybridisation 
8: Aromatic flag 
9–15: Ring-membership flags (rings of size 3–8) 

  4. Building a PyG HeteroData object containing the node features and bond edges. 
  5. Randomising the ligand's position, orientation and torsion angles before passing it to the model for inference. 
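
To make this concrete, here is a minimal sketch of the per-atom featurisation with RDKit. It follows the feature indices listed above ("ligand.sdf" stands in for any ligand file; DiffDock's real featuriser index-encodes each property against lookup tables and has a slightly different ring-feature layout), so treat it as illustrative rather than DiffDock's actual source:

    from rdkit import Chem

    def featurise_atom(atom):
        """Illustrative per-atom feature vector following the indices above."""
        ring = atom.GetOwningMol().GetRingInfo()
        return [
            atom.GetAtomicNum(),              # 0: atomic number
            int(atom.GetChiralTag()),         # 1: chirality tag
            atom.GetTotalDegree(),            # 2: total bond degree
            atom.GetFormalCharge(),           # 3: formal charge
            atom.GetImplicitValence(),        # 4: implicit valence (the problem feature)
            atom.GetNumImplicitHs(),          # 5: number of implicit H's
            atom.GetNumRadicalElectrons(),    # 6: radical electrons
            int(atom.GetHybridization()),     # 7: hybridisation
            int(atom.GetIsAromatic()),        # 8: aromatic flag
        ] + [
            int(ring.IsAtomInRingOfSize(atom.GetIdx(), size))  # ring-membership flags
            for size in range(3, 9)
        ]

    # Load a ligand, strip hydrogens, featurise every heavy atom
    mol = Chem.RemoveHs(next(Chem.SDMolSupplier("ligand.sdf")))
    features = [featurise_atom(atom) for atom in mol.GetAtoms()]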

3. PoseBusters Benchmark & RDKit Version Bump 

Using the supplied evaluation.py script, which docks into whichever protein chains the ground-truth ligand is bound to, we evaluated DiffDock on the 428-complex PoseBusters set with two different RDKit versions: 

RDKit version   < 2 Å RMSD success rate 
2022.03.3       50.89 % 
2025.03.1       23.72 % 

With no changes other than the RDKit version, the success rate dropped by over half. 

Having checked the evaluation and conformer-generation steps, we took a more detailed look at the preprocessed data being fed into the model under the two RDKit versions. Everything was identical except implicit valence: 
– RDKit 2022.03.3: implicit valence = 0 for every atom 
– RDKit 2025.03.1: implicit valence ranges from 0–3  

Relevant Changes to RDKit’s GetImplicitValence() 

Between 2022.03.3 and 2025.03.1, RDKit was refactored so that implicit hydrogen counts are recomputed rather than permanently zeroed out after stripping explicit H’s. 

Old 2022.03.3 behavior: 

  • RemoveHs() deletes all explicit hydrogens and sets each heavy atom’s internal flag df_noImplicit = true, keeping only a heavy atom representation. 
  • Once df_noImplicit is set, asking for implicit valence always returns 0, even if you re-run sanitization. 

New 2025.03.1 behavior: 

  • RemoveHs() deletes explicit hydrogens but does not flag df_noImplicit = true, allowing recomputation of implicit valence. 
  • Sanitization recomputes implicit valence = default allowed valence − sum of explicit bond orders (e.g. a carbon with two single bonds gets 4 − 2 = 2 implicit hydrogens). 
  • GetImplicitValence() returns the correct implicit valence, even after stripping all H’s. 

These changes mean: 
Old (2022.03.3): RemoveHs() → df_noImplicit set → GetImplicitValence() always returns 0 
New (2025.03.1): RemoveHs() (flag untouched) → sanitization recomputes → GetImplicitValence() returns the correct implicit-H count 
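
You can see the difference directly by stripping the hydrogens from any ligand and printing its implicit valences (a minimal sketch; "ligand.sdf" stands in for whichever ligand file you are working with):

    from rdkit import Chem

    # Load a ligand with its hydrogens, then strip them as DiffDock does
    mol = next(Chem.SDMolSupplier("ligand.sdf", removeHs=False))
    mol = Chem.RemoveHs(mol)

    for atom in mol.GetAtoms():
        print(atom.GetSymbol(), atom.GetImplicitValence())

    # RDKit 2022.03.3: prints 0 for every heavy atom (df_noImplicit pins it to zero)
    # RDKit 2025.03.1: prints the recomputed implicit-H counts (0-3 for typical ligands)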
 
Because DiffDock was only ever trained on zeros at that index, suddenly inputting non-zero values at inference caused this collapse in performance. 

We force-zeroed that column and recovered performance under the new RDKit, confirming that this feature caused the drop: 

- implicit_valence = atom.GetImplicitValence() 
+ implicit_valence = 0 

RDKit build             Success rate 
2022.03.3 (baseline)    50.89 % 
2025.03.1 (unpatched)   23.72 % 
2025.03.1 (patched)     50.26 % 

4. Why Zero-Trained → Non-Zero-Tested Is So Bad  

Consider a single input weight w that controls how much the implicit valence feature influences the network, together with a bias b and an activation function ϕ. Together they compute: 
 
    output = ϕ(wv + b) 

where v is the implicit valence feature. 

What Happens When You Train on Only Zeros? 

  • The implicit valence feature v is 0 in every training example. 
  • Since the input is always zero, there is no gradient signal telling w to move. In the absence of an explicit mechanism pushing weights towards zero, such as weight decay, w stays non-zero. 
  • Effectively, the model never learns anything about the implicit valence column, and w remains at its random starting point. 

What happens at test time? 

  • The implicit valence feature (v) might be 1, 2, or 3 now. 
  • The unchanged, random w multiplies this new v, producing unpredictable activations ϕ(w_random v + b). 
  • These activations continue through downstream layers to the final prediction output. 
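
A toy single-neuron example in PyTorch makes this failure mode easy to reproduce (a self-contained sketch, not DiffDock code):

    import torch

    torch.manual_seed(0)

    # One input feature (implicit valence), one output: output = phi(w*v + b)
    layer = torch.nn.Linear(1, 1)
    phi = torch.nn.ReLU()
    opt = torch.optim.SGD(layer.parameters(), lr=0.1)

    # Training: the feature is always zero, so the gradient with respect to w
    # is always zero and w never moves from its random initialisation.
    v_train = torch.zeros(100, 1)
    target = torch.zeros(100, 1)
    for _ in range(200):
        opt.zero_grad()
        loss = ((phi(layer(v_train)) - target) ** 2).mean()
        loss.backward()
        opt.step()

    print("w after training:", layer.weight.item())  # still its random initial value

    # Test time: the feature suddenly takes values 1-3, and the random weight
    # turns them into arbitrary activations that feed the downstream layers.
    for v in [0.0, 1.0, 2.0, 3.0]:
        print(v, "->", phi(layer(torch.tensor([[v]]))).item())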

5. Conclusion 

Featurisation is very important – in the case of DiffDock, one library tweak changed one feature column and halved the performance! The fix was easy once it was found, but remember: 

  1. Featurisation is key. 
  2. Particularly in the case of DiffDock, use the listed dependency versions! 
  3. If you see a sudden large change in performance, it might be worth checking the package versions and the features… 

Happy docking! 

Narrowing the gap between machine learning scoring functions and free energy perturbation using augmented data

I’m delighted to report our collaboration (Ísak Valsson, Matthew Warren, Aniket Magarkar, Phil Biggin, & Charlotte Deane), on “Narrowing the gap between machine learning scoring functions and free energy perturbation using augmented data”, has been published in Nature’s Communications Chemistry (https://doi.org/10.1038/s42004-025-01428-y).


During his MSc dissertation project in the Department of Statistics, University of Oxford, OPIG member Ísak Valsson developed an attention-based GNN to predict protein-ligand binding affinity called “AEV-PLIG”. It featurizes a ligand’s atoms using Atomic Environment Vectors to describe the Protein-Ligand Interactions found in a 3D protein-ligand complex. AEV-PLIG is free and open source (BSD 3-Clause), available from GitHub at https://github.com/oxpig/AEV-PLIG, and forked at https://github.com/bigginlab/AEV-PLIG.

Continue reading

Making Peace with Molecular Entropy

I first stumbled upon OPIG blogs through a post on ligand-binding thermodynamics, which refreshed my understanding of some thermodynamics concepts from undergrad, bringing me face-to-face with the concept that made most molecular physics students break out in cold sweats: Entropy. Entropy is that perplexing measure of disorder and randomness in a system. In the context of molecular dynamics simulations (MD), it calculates the conformational freedom and disorder within protein molecules which becomes particularly relevant when calculating binding free energies.

In MD, MM/GBSA and MM/PBSA are fancy terms for trying to predict how strongly molecules stick together and are the go-to methods for binding free energy calculations. MM/PBSA uses the Poisson–Boltzmann (PB) equation to account for solvent polarisation and ionic effects accurately but at a high computational cost, while MM/GBSA approximates PB using the Generalised Born (GB) model, offering faster calculations suitable for large systems, though with reduced accuracy. Consider MM/PBSA as the careful accountant who considers every detail but takes forever, while MM/GBSA is its faster, slightly less accurate coworker who gets the job done when you’re in a hurry.

Like many before me, I made the classic error of ignoring entropy, assuming that entropy changes that were similar across systems being compared would have their terms cancel out and could be neglected. This would simplify calculations and ease computational constraints (in other words it was too complicated, and I had deadlines breathing down my neck). This worked fine… until it didn’t. The wake-up call came during a project studying metal-isocitrate complexes in IDH1. For context, IDH1 is a homodimer with a flexible ‘hinge’ region that becomes unstable without its corresponding subunit, giving rise to very high fluctuations. By ignoring entropy in this unstable system, I managed to generate binding free energy results that violated several laws of thermodynamics and would make Clausius roll in his grave.

Continue reading

Conference Summary: MGMS Adaptive Immune Receptors Meeting 2024

On 5th April 2024, over 60 researchers braved the train strikes and gusty weather to gather at Lady Margaret Hall in Oxford and engage in a day full of scientific talks, posters and discussions on the topic of adaptive immune receptor (AIR) analysis!

Continue reading

Inverse Vaccines

One of the nice things about OPIG, is that you can talk about something which is outside of your wheelhouse without feeling that the specialists in the group are going to eat your lunch. Last week, I gave an overview of the Hubbell group‘s Nature paper Synthetically glycosylated antigens for the antigen-specific suppression of established immune responses. I am not an immunologist by any stretch of the imagination, but sometimes you come across a piece of really interesting science and just want to say to people: Have you seen this, look at this, it’s really clever!

Continue reading

9th Joint Sheffield Conference on Cheminformatics

Over the next few days, researchers from around the world will be gathering in Sheffield for the 9th Joint Sheffield Conference on Cheminformatics. As one of the organizers (wearing my Molecular Graphics and Modelling Society ‘hat’), I can say we have an exciting array of speakers and sessions:

  • De Novo Design
  • Open Science
  • Chemical Space
  • Physics-based Modelling
  • Machine Learning
  • Property Prediction
  • Virtual Screening
  • Case Studies
  • Molecular Representations

It has traditionally taken place every three years, but despite the global pandemic it is returning this year, once again in person in the excellent conference facilities at The Edge. You can download the full programme in iCal format, and here is the conference calendar:

Continue reading

histo.fyi: A Useful New Database of Peptide:Major Histocompatibility Complex (pMHC) Structures

pMHCs are set to become a major target class in drug discovery; unusual peptide fragments presented by MHC can be used to distinguish infected/cancerous cells from healthy cells more precisely than over-expressed biomarkers. In this blog post, I will highlight a prototype resource: Dr. Chris Thorpe’s new database of pMHC structures, histo.fyi.

histo.fyi provides a one-stop shop for data on (currently) around 1400 pMHC complexes. Similar to our dedicated databases for antibody/nanobody structures (SAbDab) and T-cell receptor (TCR) structures (STCRDab), histo.fyi will scrape the PDB on a weekly basis for any new pMHC data and process these structures in a way that facilitates their analysis.

Continue reading

CryoEM is now the dominant technique for solving antibody structures

Last year, the Structural Antibody Database (SAbDab) listed a record-breaking 894 new antibody structures, driven in no small part by the continued efforts of researchers to understand SARS-CoV-2.

Fig. 1: The aggregate growth in antibody structure data (all methods) over time. Taken from http://opig.stats.ox.ac.uk/webapps/newsabdab/sabdab/stats/ on 25th May 2022.

In this blog post I wanted to highlight the major driving force behind this curve – the huge increase in cryo electron microscopy (cryoEM) data – and the implications of this for the field of structure-based antibody informatics.

Continue reading

New review on BCR/antibody repertoire analysis out in MAbs!

In our latest immunoinformatics review, OPIG has teamed up with experienced antibody consultant Dr. Anthony Rees to outline the evidence for BCR/antibody repertoire convergence on common epitopes post-pathogen exposure, and all the ways we can go about detecting it from repertoire gene sequencing data. We highlight the new advances in the repertoire functional analysis field, including the role for OPIG’s latest tools for structure-aware antibody analytics: Structural Annotation of AntiBody repertoires+ (SAAB+), Paratyping, Ab-Ligity, Repertoire Structural Profiling & Structural Profiling of Antibodies to Cluster by Epitope (‘SPACE’).

Continue reading

Automated intermolecular interaction detection using the ODDT Python Module

Detecting intermolecular interactions is often one of the first steps when assessing the binding mode of a ligand. This usually involves the human researcher opening up a molecular viewer and checking the orientations of the ligand and protein functional groups, sometimes aided by the viewer’s own interaction-detection functionality. For looking at single-digit numbers of structures, this approach works fairly well, especially as more experienced researchers can spot cases where the automated interaction detection has failed. When analysing tens or hundreds of binding sites, however, an automated way of detecting and recording interaction information for downstream processing is needed. When I had to do this recently, I used an open-source Python module called ODDT (Open Drug Discovery Toolkit; its full documentation can be found here).

My use case was fairly standard: starting with a list of holo protein structures as PDB files and their corresponding ligands in SDF format, I wanted to detect any hydrogen bonds between a ligand and its native protein crystal structure. Specifically, I needed the number and name of the interacting residue, its chain ID, and the name of the protein atom involved in the interaction. A general example of how to do this can be found in the ODDT documentation. Below, I show how I have used the code on PDB structure 1a9u.
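
As a flavour of what that involves, here is a minimal sketch of the kind of ODDT call used for hydrogen-bond detection; the file names are placeholders, and the structured-array field names ('resname', 'resid', 'atomtype') are assumptions you should verify against atom_dict.dtype.names for your ODDT version:

    import oddt
    from oddt.interactions import hbonds

    # Load the protein and its ligand (placeholder file names)
    protein = next(oddt.toolkit.readfile('pdb', '1a9u_protein.pdb'))
    protein.protein = True   # flag as protein so residue information is populated
    ligand = next(oddt.toolkit.readfile('sdf', '1a9u_ligand.sdf'))

    # hbonds() returns the interacting protein atoms, ligand atoms and a strictness mask
    prot_atoms, lig_atoms, strict = hbonds(protein, ligand)

    # The returned arrays are numpy structured arrays; inspect prot_atoms.dtype.names
    # to see which residue/atom fields are available in your ODDT version
    for pa, ok in zip(prot_atoms, strict):
        if ok:
            print(pa['resname'], pa['resid'], pa['atomtype'])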

Continue reading