Category Archives: Molecular Recognition

No Pretraining, No Equivariant Architecture – Learning MLIPs without Explicit Equivariance

Paper
🤗 TransIP-L checkpoint
Code

Machine-learned interatomic potentials (MLIPs) have become a cornerstone of modern computational chemistry, enabling simulations that approach quantum accuracy at a fraction of the cost of traditional methods such as density functional theory (DFT). However, a central challenge in designing MLIPs lies in respecting the fundamental symmetries of molecular systems, especially rotational and translational invariance, while maintaining scalability and flexibility.

In our recent work, we introduced TransIP, a novel framework that formulates how symmetry is incorporated into molecular models by learning symmetry directly in the latent space of an atomic transformer model, in which we treat atoms as tokens, instead of hard-coding equivariance into the neural network architecture.

At the core of TransIP is a simple yet powerful idea: instead of enforcing SO(3) equivariance through specialized layers, the model is trained with a contrastive objective that aligns representations of rotated molecular configurations. A learned transformation network maps latent embeddings under rotations, encouraging the model to discover symmetry-consistent representations implicitly. This design preserves the flexibility and scalability of standard Transformers while still capturing the geometric structure of molecular systems.

Continue reading

SigmaDock: untwisting molecular docking with fragment-based SE(3) diffusion

Alvaro Prat, Leo Zhang, Charlotte Deane, Yee Whye Teh, & Garrett M. Morris
International Conference On Learning Representations (ICLR 2026)

Molecular docking sits at the heart of structure-based drug discovery. If we can reliably predict how a small molecule binds in a protein pocket, we can prioritize compounds faster, reason about interactions more clearly, and build better pipelines for hit discovery and lead optimization. But in practice, docking is still a difficult problem: classical methods are often robust but imperfect, while recent deep learning approaches have sometimes looked promising on headline metrics without consistently producing chemically plausible poses.

SigmaDock was built to address exactly that gap. Instead of treating docking as a problem of directly diffusing on torsion angles or unconstrained atomic coordinates, SigmaDock represents ligands as collections of rigid fragments and learns how to reassemble them inside the binding pocket using diffusion on SE(3)\text{SE}(3). In plain English: rather than trying to “wiggle” every flexible degree of freedom in a tangled way, SigmaDock breaks the ligand into chemically meaningful rigid pieces and learns where those pieces should go, and how they should reorient, to recover a valid bound pose.

Figure 1: Illustration of SigmaDock using PDB 1V4S and ligand MRK. We create an initial conformation of a query ligand where we define our mm rigid body fragments (colour coded). The corresponding forward diffusion process operates in SE(3)m\text{SE}(3)^m via independent roto-translations.
Continue reading

A more robust way to split data for protein-ligand tasks?

As I was recently reading through the paper on the PLINDER dataset while preparing for my next project, one of the aspects of the dataset that caught my attention was how the dataset splits were done to ensure minimal leakage for various protein-ligand tasks that PLINDER could be used for. They had task-specific splits as the notion of data leakage differed from task to task. For instance, in rigid body docking, having a similar protein in the train and test may not be considered leakage if the binding pocket location, conformation, or pocket interactions with a ligand are significantly different. On the other hand, in the case of co-folding, having similar proteins in the train and test sets would be considered data leakage, as predicted protein structures play a significant role in accuracy scoring. The effort that went into creating task-specific splits resonates strongly with OPIG’s view on ensuring minimal data leakage for validating the generalisability of protein-ligand models. However, it may become tedious to create task-specific dataset splits for every protein-ligand task when dealing with a large suite of such tasks. This had me thinking of potential avenues to streamline the dataset split process across the tasks, and one way to do this is by using protein-ligand interaction fingerprints or PLIFs.

Continue reading

Featurisation is Key: One Version Change that Halved DiffDock’s Performance

1. Introduction 

Molecular docking with graph neural networks works by representing the molecules as featurized graphs. In DiffDock, each ligand becomes a graph of atoms (nodes) and bonds (edges), with features assigned to every atom using chemical properties such as atom type, implicit valence and formal charge. 
 
We recently discovered that a change in RDKit versions significantly reduces performance on the PoseBusters benchmark, due to changes in the “implicit valence” feauture. This post walks through: 

  • How DiffDock featurises ligands 
  • What happened when we upgraded RDKit 2022.03.3 → 2025.03.1 
  • Why training with zero-only features and testing on non-zero features is so bad 

TL:DR: Use the dependencies listed in the environment.yml file, especially in the case of DiffDock, or your performance could half!  

Continue reading

Narrowing the gap between machine learning scoring functions and free energy perturbation using augmented data

I’m delighted to report our collaboration (Ísak Valsson, Matthew Warren, Aniket Magarkar, Phil Biggin, & Charlotte Deane), on “Narrowing the gap between machine learning scoring functions and free energy perturbation using augmented data”, has been published in Nature’s Communications Chemistry (https://doi.org/10.1038/s42004-025-01428-y).


During his MSc dissertation project in the Department of Statistics, University of Oxford, OPIG member Ísak Valsson developed an attention-based GNN to predict protein-ligand binding affinity called “AEV-PLIG”. It featurizes a ligand’s atoms using Atomic Environment Vectors to describe the Protein-Ligand Interactions found in a 3D protein-ligand complex. AEV-PLIG is free and open source (BSD 3-Clause), available from GitHub at https://github.com/oxpig/AEV-PLIG, and forked at https://github.com/bigginlab/AEV-PLIG.

Continue reading

Making Peace with Molecular Entropy

I first stumbled upon OPIG blogs through a post on ligand-binding thermodynamics, which refreshed my understanding of some thermodynamics concepts from undergrad, bringing me face-to-face with the concept that made most molecular physics students break out in cold sweats: Entropy. Entropy is that perplexing measure of disorder and randomness in a system. In the context of molecular dynamics simulations (MD), it calculates the conformational freedom and disorder within protein molecules which becomes particularly relevant when calculating binding free energies.

In MD, MM/GBSA and MM/PBSA are fancy terms for trying to predict how strongly molecules stick together and are the go-to methods for binding free energy calculations. MM/PBSA uses the Poisson–Boltzmann (PB) equation to account for solvent polarisation and ionic effects accurately but at a high computational cost. While MM/GBSA approximates PB, using the Generalised Born (GB) model, offering faster calculations suitable for large systems, though with reduced accuracy. Consider MM/PBSA as the careful accountant who considers every detail but takes forever, while MM/GBSA is its faster, slightly less accurate coworker who gets the job done when you’re in a hurry.

Like many before me, I made the classic error of ignoring entropy, assuming that entropy changes that were similar across systems being compared would have their terms cancel out and could be neglected. This would simplify calculations and ease computational constraints (in other words it was too complicated, and I had deadlines breathing down my neck). This worked fine… until it didn’t. The wake-up call came during a project studying metal-isocitrate complexes in IDH1. For context, IDH1 is a homodimer with a flexible ‘hinge’ region that becomes unstable without its corresponding subunit, giving rise to very high fluctuations. By ignoring entropy in this unstable system, I managed to generate binding free energy results that violated several laws of thermodynamics and would make Clausius roll in his grave.

Continue reading

Conference Summary: MGMS Adaptive Immune Receptors Meeting 2024

On 5th April 2024, over 60 researchers braved the train strikes and gusty weather to gather at Lady Margaret Hall in Oxford and engage in a day full of scientific talks, posters and discussions on the topic of adaptive immune receptor (AIR) analysis!

Continue reading

Inverse Vaccines

One of the nice things about OPIG, is that you can talk about something which is outside of your wheelhouse without feeling that the specialists in the group are going to eat your lunch. Last week, I gave an overview of the Hubbell group‘s Nature paper Synthetically glycosylated antigens for the antigen-specific suppression of established immune responses. I am not an immunologist by any stretch of the imagination, but sometimes you come across a piece of really interesting science and just want to say to people: Have you seen this, look at this, it’s really clever!

Continue reading

9th Joint Sheffield Conference on Cheminformatics

Over the next few days, researchers from around the world will be gathering in Sheffield for the 9th Joint Sheffield Conference on Cheminformatics. As one of the organizers (wearing my Molecular Graphics and Modeling Society ‘hat’), I can say we have an exciting array of speakers and sessions:

  • De Novo Design
  • Open Science
  • Chemical Space
  • Physics-based Modelling
  • Machine Learning
  • Property Prediction
  • Virtual Screening
  • Case Studies
  • Molecular Representations

It has traditionally taken place every three years, but despite the global pandemic it is returning this year, once again in person in the excellent conference facilities at The Edge. You can download the full programme in iCal format, and here is the conference calendar:

Continue reading

histo.fyi: A Useful New Database of Peptide:Major Histocompatibility Complex (pMHC) Structures

pMHCs are set to become a major target class in drug discovery; unusual peptide fragments presented by MHC can be used to distinguish infected/cancerous cells from healthy cells more precisely than over-expressed biomarkers. In this blog post, I will highlight a prototype resource: Dr. Chris Thorpe’s new database of pMHC structures, histo.fyi.

histo.fyi provides a one-stop shop for data on (currently) around 1400 pMHC complexes. Similar to our dedicated databases for antibody/nanobody structures (SAbDab) and T-cell receptor (TCR) structures (STCRDab), histo.fyi will scrape the PDB on a weekly basis for any new pMHC data and process these structures in a way that facilitates their analysis.

Continue reading