Category Archives: AI

Featurisation is Key: One Version Change that Halved DiffDock’s Performance

1. Introduction 

Molecular docking with graph neural networks works by representing the molecules as featurised graphs. In DiffDock, each ligand becomes a graph of atoms (nodes) and bonds (edges), with features assigned to every atom using chemical properties such as atom type, implicit valence and formal charge. 
 
We recently discovered that a change in RDKit versions significantly reduces performance on the PoseBusters benchmark, due to changes in the “implicit valence” feature. This post walks through: 

  • How DiffDock featurises ligands 
  • What happened when we upgraded RDKit 2022.03.3 → 2025.03.1 
  • Why training with zero-only features and testing on non-zero features is so bad 

TL;DR: Use the dependencies listed in the environment.yml file, especially in the case of DiffDock, or your performance could halve!  

2. Graph Representation in DiffDock 

DiffDock turns a ligand into input for a graph neural net in five steps (sketched in code after the list): 

  1. Loading the ligand from an SDF file via RDKit. 
  2. Stripping all hydrogens to keep heavy atoms only. 
  3. Featurising each atom into a 16-dimensional vector: 

    0: Atomic number 
    1: Chirality tag 
    2: Total bond degree 
    3: Formal charge 
    4: Implicit valence 
    5: Number of implicit H’s 
    6: Radical electrons 
    7: Hybridisation 
    8: Aromatic flag 
    9–15: Ring-membership flags (rings of size 3–8) 

  4. Building a PyG HeteroData containing node features and bond edges. 
  5. Randomising position, orientation and torsion angles before inputting to the model for inference. 
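
To make the featurisation concrete, below is a minimal sketch of how such a 16-dimensional atom feature vector could be assembled with RDKit. It illustrates the recipe above rather than reproducing DiffDock’s actual code; the file name and the featurise_atom helper are hypothetical.

    # Minimal sketch of the atom featurisation described above (not DiffDock's actual code).
    # "ligand.sdf" and featurise_atom() are illustrative; index 4 is the implicit-valence slot.
    from rdkit import Chem

    RING_SIZES = range(3, 9)  # ring-membership flags for rings of size 3-8

    def featurise_atom(atom, ring_info):
        idx = atom.GetIdx()
        return [
            atom.GetAtomicNum(),               # 0: atomic number
            int(atom.GetChiralTag()),          # 1: chirality tag
            atom.GetTotalDegree(),             # 2: total bond degree
            atom.GetFormalCharge(),            # 3: formal charge
            atom.GetImplicitValence(),         # 4: implicit valence (the feature discussed below)
            atom.GetNumImplicitHs(),           # 5: number of implicit H's
            atom.GetNumRadicalElectrons(),     # 6: radical electrons
            int(atom.GetHybridization()),      # 7: hybridisation
            int(atom.GetIsAromatic()),         # 8: aromatic flag
        ] + [int(ring_info.IsAtomInRingOfSize(idx, s)) for s in RING_SIZES]  # 9-15: ring flags

    mol = Chem.RemoveHs(Chem.MolFromMolFile("ligand.sdf"))  # load ligand, keep heavy atoms only
    node_features = [featurise_atom(atom, mol.GetRingInfo()) for atom in mol.GetAtoms()]

These per-atom vectors are what end up as the node features in the HeteroData graph.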

3. PoseBusters Benchmark & RDKit Version Bump 

Using the supplied evaluation.py script, which docks into whichever protein chains the ground-truth ligand is bound to, we evaluated on the 428-complex PoseBusters set with two different RDKit versions: 

RDKit version    < 2 Å RMSD success rate 
2022.03.3        50.89 % 
2025.03.1        23.72 % 

With no changes other than the RDKit version, the success rate dropped by over half. 

Having checked the evaluation and conformer-generation steps, I took a more detailed look at the preprocessed data being fed into the model using the different RDKit versions. Everything was identical except implicit valence: 
– RDKit 2022.03.3: implicit valence = 0 for every atom 
– RDKit 2025.03.1: implicit valence ranges from 0-3  

Relevant Changes to RDKit’s GetImplicitValence() 

Between 2022.03.3 and 2025.03.1, RDKit was refactored so that implicit hydrogen counts are recomputed rather than permanently zeroed out after stripping explicit H’s. 

Old 2022.03.3 behavior: 

  • RemoveHs() deletes all explicit hydrogens and sets each heavy atom’s internal flag df_noImplicit = true, keeping only a heavy atom representation. 
  • Once df_noImplicit is set, asking for implicit valence always returns 0, even if you re-run sanitization. 

New 2025.03.1 behavior: 

  • RemoveHs() deletes explicit hydrogens but does not flag df_noImplicit = true, allowing recomputation of implicit valence. 
  • Sanitization calculates implicit valence = allowed valence – sum of explicit bonds 
  • GetImplicitValence() returns the correct implicit valence, even after stripping all H’s. 

These changes mean: 
Old (2022.03.3): RemoveHs() → df_noImplicit set → GetImplicitValence() always returns 0 
New (2025.03.1): RemoveHs() (flag untouched) → sanitization recomputes → GetImplicitValence() returns the correct implicit-H count 
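
A quick way to see the difference is the snippet below. It is a hedged illustration of the behaviour described above rather than DiffDock’s preprocessing code, and “ligand.sdf” is a hypothetical input file containing explicit hydrogens.

    # Illustration of the version-dependent behaviour described above (not DiffDock's code).
    from rdkit import Chem

    mol = Chem.MolFromMolFile("ligand.sdf", removeHs=False)  # hypothetical ligand with explicit H's
    mol = Chem.RemoveHs(mol)                                 # strip hydrogens, as DiffDock does
    print([atom.GetImplicitValence() for atom in mol.GetAtoms()])
    # RDKit 2022.03.3: all zeros (df_noImplicit is set by RemoveHs)
    # RDKit 2025.03.1: recomputed values, e.g. in the 0-3 range for a typical drug-like ligand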
 
Because DiffDock was only ever trained on zeros at that index, suddenly inputting non-zero values at inference caused this collapse in performance. 

We force-zeroed that column and recovered performance under the new RDKit, validating that this feature caused the drop: 

- implicit_valence = atom.GetImplicitValence() 
+ implicit_valence = 0 

RDKit build            Success rate 
2022.03.3 baseline     50.89 % 
2025.03.1 unpatched    23.72 % 
2025.03.1 patched      50.26 % 

4. Why Zero-Trained → Non-Zero-Tested Is So Bad  

Consider a single first-layer weight w that controls how much “implicit valence” influences the network, together with a bias b and an activation function ϕ. Together they compute: 
 
    output = ϕ(w·v + b) 

where v is the implicit valence feature. 

What Happens When You Train on Only Zeros? 

  • Implicit valence (v) = 0 for every training example. 
  • Since the input is always zero, the gradient with respect to w is zero, so there is no signal telling w to move. In the absence of an explicit mechanism pushing the weights towards zero, such as weight decay, they remain non-zero. 
  • Effectively, the model never learns anything about the implicit valence column, and w stays at its random starting point. 

What happens at test time? 

  • The implicit valence feature (v) might be 1, 2, or 3 now. 
  • The unchanged, random w multiplies this new v, producing unpredictable activations ϕ(w·v + b). 
  • These activations propagate through the downstream layers to the final prediction, as the toy example below illustrates. 
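
The toy example below makes this concrete with a single weight and a ReLU activation; the numbers and the choice of activation are arbitrary illustrations, not DiffDock’s architecture.

    # Toy illustration: a weight that never received a gradient signal keeps its random
    # initial value, so a non-zero feature at test time shifts the activation arbitrarily.
    import numpy as np

    rng = np.random.default_rng(0)
    w, b = rng.normal(), rng.normal()  # random initialisation; w is never updated because v = 0 in training

    def phi(x):                        # ReLU, used here as an example activation
        return max(0.0, x)

    print(phi(w * 0 + b))              # training regime: v = 0, so the output depends only on b
    print(phi(w * 3 + b))              # test time: v = 3 now multiplies the untrained, random w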

5. Conclusion 

Featurisation is very important – in the case of DiffDock, one library tweak changed one feature column and halved the performance! The fix was easy once it was found, but remember: 

  1. Featurisation is key. 
  2. Particularly in the case of DiffDock, use the listed dependency versions! 
  3. If you see a sudden large change in performance, it might be worth checking the package versions and the features… 

Happy docking! 

AI generated linkers™: a tutorial

In molecular biology, cutting and tweaking a protein construct is an often under-appreciated but essential operation. Some proteins have unwanted extra bits. Some proteins may require a partner to be in the correct state, which would ideally be expressed as a fusion protein. Some proteins need parts replacing. Some proteins disfavour a desired state. Half a decade ago, toolkits existed to attempt to tackle these problems; now, with the advent of de novo protein generation, new, powerful, precise and far less painful methods are here. Therefore, herein I will discuss how to generate de novo inserts and more with RFdiffusion and other tools, in order to quickly launch a project into the right orbit.
Furthermore, even when newer methods come out, these design principles will still apply, so ignore the name of the de novo tool used.


AI in Academic Writing: Ethical Considerations and Best Practices

I don’t need to tell you how popular AI, and LLMs in particular, have become in recent years. Alongside this rapid growth comes uncharted territory, especially with respect to plagiarism and integrity. As we adapt to a rapidly changing technological climate, we become increasingly reliant on AI. Need some help phrasing an email? Ask ChatGPT. Need to create a packing list for an upcoming camping trip? Get an AI-based task manager. So naturally, when we’re faced with the daunting, and admittedly tedious, task of writing papers, we question whether we can offload some of that work to a machine. As with many things, the question is not simply whether or not you should use AI in your writing process; it’s how you choose to do so.

When To Use AI

  1. Grammar and readability
    Particularly useful for those who are writing in a second language, or for those who struggle to write in their first language (my high school English teachers would place me firmly in this category), LLMs can be used beneficially to identify awkward phrasing, flag excessively complex sentences, and identify and fix grammatical errors.
  2. Formatting and structure
    LLMs can take care of some of the tedious, repetitive work with respect to formatting and structure. They are particularly useful for generating LaTeX templates for figures, tables, equations, and general structure. You can also use them to check that you’re matching a specific journal’s standards. For example, you can give an LLM a sample of articles from a target publication, and ask it to note the structure of these papers. Then, give it your work and ask it to make general, larger-scale suggestions to ensure that your work aligns structurally with articles typical of that journal.
  3. Reference management
    Although the references should be read and cited by an author, various management tasks like creating properly formatted BibTeX entries can be handled by LLMs. Additionally, you can use LLMs to do a sanity check to ensure that your references are an accurate reflection of the source material they are referring to. However, they should not be used to summarise the source and create references on their own. 
  4. Summarising large volumes of literature
    If you’re reviewing large volumes of literature, LLMs can help summarise papers efficiently and point you in the right direction. Although you should always cite and refer back to the original source, LLMs can distill key points from long, dense papers, organise notes, and extract important takeaways from datasets and figures.

Regardless of how you use AI, it is important to keep a record of all instances of such AI use throughout your research, including use during coding. Some journals will make you explicitly declare the use of AI tools, but even if it is not required, this kind of record-keeping is considered good practice. 

When Not to Use AI

  1. Big-picture thinking and narrative development
    Academic papers are not solely about presenting information; they are about constructing an argument, building a narrative flow, and presenting a compelling case. LLMs are not particularly good at replicating human creativity; that work is best left to the authors. Additionally, it is dishonest to claim these important aspects of writing as your own if they are not written directly by you.
  2. Direct copy-paste
    Although AI tools may suggest minor edits, you should never directly copy-and-paste larger selections of AI-generated text. If the ethical concerns described in (1) do not persuade you as they should, there are now plenty of tools used by academic institutions and journals to detect AI-generated text. Although some scholars do tend to lean on AI as a more collaborative tool, transparency is key. 
  3. Source of knowledge
    LLMs don’t actually “know” anything; they generate responses based on probability. As a result, they have a tendency to “hallucinate”, presenting false information as fact, to misrepresent or oversimplify complex concepts, and to lack precise technical accuracy. They may also be biased by the sources they were trained on. Peer-reviewed sources should be the go-to for actual information. If you use LLMs to summarise something, always refer back to the original text when using that information in your work.
  4. Full citation generation
    As discussed above, although AI can be used to summarise sources, it is not a reliable source of direct citations. All references should be created by hand and verified manually.
  5. General over-reliance
    From the research design process to the final stages of writing and editing, you should generally refrain from an over-reliance on AI. Although LLMs can be powerful tools that can be used to automate various lower-level tasks, they are not a substitute for critical thinking, originality, or domain expertise, and they are not a full-fledged co-author of your work. The intellectual contribution and ownership of ideas remain in the hands of the human authors. 

For further and more official guidance, check out the ethical framework for the use of AI in academic research published in Nature Machine Intelligence. This framework outlines three criteria for the responsible use of LLMs in scholarship, summarised as follows:

  1. Human vetting to guarantee the accuracy and integrity of the work 
  2. Substantial human contribution across all areas of the work
  3. Acknowledgement and transparency of AI use

Confidence in ML models

Recently, I have been interested in adding a confidence metric to the predictions made by a machine learning model I have been working on. In this blog post, I will outline a few strategies I have been exploring to do this. Powerful deep learning models like AlphaFold are great not only for the predictions they make, but also for the confidence measures they generate, which give the user a sense of how much to trust the prediction.


Geometric Deep Learning meets Forces & Equilibrium

Introduction

Graphs provide a powerful mathematical framework for modelling complex systems, from molecular structures to social networks. In many physical and geometric problems, nodes represent particles, and edges encode interactions, often acting like springs. This perspective aligns naturally with Geometric Deep Learning, where learning algorithms leverage graph structures to capture spatial and relational patterns.

Understanding energy functions and the forces derived from them is fundamental to modelling such systems. In physics and computational chemistry, harmonic potentials, which penalise deviations from equilibrium positions, are widely used to describe elastic networks, protein structures, and even diffusion processes. The Laplacian matrix plays a key role in these formulations, linking energy minimisation to force computations in a clean and computationally efficient way.

By formalising these interactions using matrix notation, we gain not only a compact representation but also a foundation for more advanced techniques such as Langevin dynamics, normal mode analysis, and graph-based neural networks for physical simulations.
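
As a small, self-contained illustration of that link (a textbook formulation on a toy graph, not necessarily the notation used later in the post), the harmonic energy of a unit-spring network can be written with the graph Laplacian, and its negative gradient gives the forces.

    # Toy spring network: harmonic energy E(x) = 1/2 * x^T L x and force F = -L x,
    # where L is the graph Laplacian. Positions are treated one coordinate at a time.
    import numpy as np

    A = np.array([[0, 1, 0],
                  [1, 0, 1],
                  [0, 1, 0]], dtype=float)  # adjacency of a small path graph 0-1-2
    L = np.diag(A.sum(axis=1)) - A          # graph Laplacian

    x = np.array([0.0, 0.5, 2.0])           # one coordinate of the node positions
    energy = 0.5 * x @ L @ x                # equals 1/2 * sum over edges of (x_i - x_j)^2
    force = -L @ x                          # negative gradient of the energy
    print(energy, force)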


The Good (and limitations) of using a Local CoPilot with Ollama

Interactive code editors have been around for a while now, and tools like GitHub Copilot have woven their way into most development pipelines, and for good reason. They’re easy to use, exceptionally helpful (at certain tasks), and have undeniably made life as a developer smoother. Recently, I decided to switch away from relying on GitHub Copilot in favour of a local model for a few key reasons. While I don’t use it all the time, it has proven to be a useful option in many situations. In this blog post, I’ll go over why I made the switch, how I set it up, and share a bit about my experience so far.


Narrowing the gap between machine learning scoring functions and free energy perturbation using augmented data

I’m delighted to report our collaboration (Ísak Valsson, Matthew Warren, Aniket Magarkar, Phil Biggin, & Charlotte Deane), on “Narrowing the gap between machine learning scoring functions and free energy perturbation using augmented data”, has been published in Nature’s Communications Chemistry (https://doi.org/10.1038/s42004-025-01428-y).


During his MSc dissertation project in the Department of Statistics, University of Oxford, OPIG member Ísak Valsson developed an attention-based GNN called “AEV-PLIG” to predict protein-ligand binding affinity. It featurizes a ligand’s atoms using Atomic Environment Vectors to describe the Protein-Ligand Interactions found in a 3D protein-ligand complex. AEV-PLIG is free and open source (BSD 3-Clause), available from GitHub at https://github.com/oxpig/AEV-PLIG, and forked at https://github.com/bigginlab/AEV-PLIG.


LLM Coding Tools – An Overview

We’ve come a long way since GitHub Copilot first showed us what AI-assisted coding could look like. These days, there’s a whole ecosystem of LLM coding tools out there, each with their own strengths and approaches. In this blog, I’ll give you a quick overview to help you figure out which one might work best for your workflow.

Level 1: Interactive Code Assistance


Baby’s First NeurIPS: A Survival Guide for Conference Newbies

There’s something very surreal about stepping into your first major machine learning conference: suddenly, all those GitHub usernames, paper authors, and protagonists of heated twitter spats become real people, the hallways are buzzing with discussions of papers you’ve been meaning to read, and somehow there are 17,000 other people trying to navigate it all alongside you. That was my experience at NeurIPS this year, and despite feeling like a microplankton in an ocean of ML research, I had a grand time. While some of this success was pure luck, much of it came down to excellent advice from the group’s ML conference veterans and lessons learned through trial and error. So, before the details fade into a blur of posters and coffee breaks, here’s my guide to making the most of your first major ML conference.


A tougher molecular data split – spectral split

Scaffold splits, which involve identifying the chemical scaffolds in a data set and ensuring that scaffolds present in the train and test sets do not overlap, have been widely used in molecular machine learning. However, two very similar molecules can have differing scaffolds. In his article on splitting chemical data last month, Pat Walters provides an example where two molecules differ by just a single atom and thus have a very high Tanimoto similarity score of 0.66. Yet they have different scaffolds (figure below).

In this case, if one of the molecules were in the train set and the other in the test set, predicting the test molecule would be quite trivial, as there is data leakage. Therefore, we need a better splitting method such that there is minimal overlap between the train and test sets. In this blog post, I will be discussing the spectral split, a splitting method introduced by our fellow OPIG member in Klarner et al. (2023).

Spectral split

Spectral split, or spectral clustering, is based on the spectral graph partitioning algorithm. The basic idea is as follows: the data points are first represented as vectors in R^n. An affinity matrix is then defined using a kernel, which can be domain-specific. Following that, the graph Laplacian is computed from the affinity matrix and eigendecomposed, and the k eigenvectors corresponding to the k lowest (or highest, depending on the formulation) eigenvalues are selected, giving each data point a k-dimensional embedding. Finally, the clusters are formed by running k-means on these embeddings.

In the context of molecular data splitting, one could use the Tanimoto similarity metric to construct a similarity matrix between all the molecules in the dataset. A spectral clustering method can then be used to partition the similarity matrix such that the similarity within each cluster is maximized while the similarity between clusters is minimized. Spectral split showed the least overlap between train (blue) and test (red) set molecules compared to scaffold splits (figure from Klarner et al. (2024) below).
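
As a rough sketch of the procedure (not the implementation from Klarner et al.), the snippet below builds a Tanimoto similarity matrix from Morgan fingerprints, clusters it with an off-the-shelf spectral clustering routine, and assigns whole clusters to the train or test set; the SMILES strings are just a toy dataset.

    # Rough sketch of a spectral split on a toy dataset (not the Klarner et al. implementation).
    import numpy as np
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem
    from sklearn.cluster import SpectralClustering

    smiles = ["CCO", "CCN", "CCC", "c1ccccc1", "c1ccccc1O", "c1ccccc1N"]  # toy dataset
    mols = [Chem.MolFromSmiles(s) for s in smiles]
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]

    # Tanimoto similarity matrix, used directly as the (domain-specific) affinity matrix.
    n = len(fps)
    affinity = np.ones((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            affinity[i, j] = affinity[j, i] = DataStructs.TanimotoSimilarity(fps[i], fps[j])

    # Spectral clustering on the precomputed affinity; whole clusters go to train or test.
    labels = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0).fit_predict(affinity)
    train_idx = [i for i, c in enumerate(labels) if c == 0]
    test_idx = [i for i, c in enumerate(labels) if c == 1]
    print("train:", train_idx, "test:", test_idx)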

In addition to spectral splits, one could attempt other tougher splits, such as the UMAP splits suggested by Guo et al. (2024). For a detailed comparison between UMAP splits and other commonly used splits, please refer to Pat Walters’ article on splitting chemical data.