In this post I’ll walk through how to set up the CCDC Python API and use the CSD Geometry Analyser to evaluate the geometric quality of molecules from three representative structure-based de novo design models. I’ve put together a small GitHub repo with the full analysis code where we look at bond lengths, angles, torsions, and ring conformations across the three methods, and compare these against their PoseBusters validity scores to see what each metric is really capturing.
Category Archives: AI
Peering Inside the Black Box: A Beginner’s Introduction to Mechanistic Interpretability
Over the last few years, large language models (LLMs) have gone from being curiosities tucked away in research labs to something most of us interact with on a daily basis, whether for drafting emails, debugging code, or simply pondering the meaning of life at 2am. And yet, for all our reliance on these systems, a rather inconvenient truth lingers in the background: nobody, not even the people who built them, can fully explain what is going on inside.
This is where mechanistic interpretability comes in.
In essence, mechanistic interpretability is the approach of explaining complex machine learning systems through the behaviour of their functional units (Kästner and Crook, 2024) by reverse-engineering them into their more elementary computations (Rai et al., 2025). The aim is not simply to know that a model gives the right answer, but to pull apart the underlying machinery and uncover the causal relationships between input and output. Think of it as neuroscience for neural networks, except we can read every neuron at any moment, rewind, replay, and intervene mid-thought.
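To make "read every neuron, rewind, replay, and intervene" concrete, here is a minimal sketch of one common intervention, often called activation patching. The two-neuron network, its weights, and the inputs are all invented for illustration; nothing here comes from a real model. The idea is to run the network on a "clean" and a "corrupted" input, then copy one hidden activation from the clean run into the corrupted run and see how much of the output it restores.

```python
def relu(value):
    return max(0.0, value)

# Hypothetical toy weights: 2 inputs -> 2 hidden neurons -> 1 output.
W_HIDDEN = [[1.0, 0.0], [0.0, 1.0]]
W_OUT = [2.0, 0.1]

def forward(x, patch=None):
    """Run the toy network; optionally overwrite one hidden activation."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in W_HIDDEN]
    if patch is not None:
        index, value = patch
        hidden[index] = value  # the intervention: replace a neuron's activation
    output = sum(w * h for w, h in zip(W_OUT, hidden))
    return output, hidden

clean_out, clean_hidden = forward([1.0, 1.0])
corrupt_out, _ = forward([0.0, 0.0])

# Patch each hidden neuron in turn with its value from the clean run and
# record how much of the clean output is recovered.
restored = [
    forward([0.0, 0.0], patch=(i, clean_hidden[i]))[0] - corrupt_out
    for i in range(len(clean_hidden))
]
print(clean_out, corrupt_out, restored)
```

In this toy setting, patching the first neuron recovers almost all of the output and the second recovers very little, which is the kind of causal evidence used to argue that a particular unit matters for a behaviour.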
A timeline of sampling methods of diffusion models
When approaching the methods used in de novo protein design, one is quickly confronted with a plethora of overlapping formulations of what looks superficially like “the same thing”. One paper trains an ε-prediction network with a simple MSE loss; another trains a score network with a stochastic-differential-equation justification; a third trains a clean-data predictor under yet another schedule. Each formulation carries its own notation, its own variance schedule, and its own sampler. Qualitatively, this zoo of formulations is doing the same thing: it starts from some unstructured noise and iteratively refines it to eventually produce a protein structure similar (but different!) to other proteins we have experimentally determined in the past. What is not immediately obvious to a newcomer is that all of these formulations are historical descendants of a small number of foundational ideas, and that essentially every architectural and algorithmic decision in a modern protein-design diffusion model has a specific paper of origin and a specific motivation for being there.
This post is my attempt to put these formulations onto a single timeline. I trace the trajectory of the field through four foundational works: DDPM (Ho et al., 2020), DDIM (Song et al., 2021a), the score-based SDE unification (Song et al., 2021b), and EDM (Karras et al., 2022), explaining at each step what specific problem with the previous formulation the next paper was attacking and how the new formulation generalises or simplifies the old one. The goal is coherent motivation rather than exhaustive coverage; the reader interested in implementation details is referred to the original papers and the references at the end.
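As a brief sketch of why the noise-prediction, clean-data-prediction, and score formulations are interchangeable, the relations below use the standard DDPM notation (the symbols $\bar\alpha_t$, $\epsilon_\theta$, and $\hat{x}_0$ are assumptions introduced here, not notation taken from this post). Given the forward noising rule, a network that predicts the noise implicitly defines a clean-data estimate and a score estimate, up to a known rescaling.

```latex
% Forward noising at step t (variance-preserving, DDPM convention)
x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

% "Simple" MSE loss for a noise-prediction network
L_{\text{simple}} = \mathbb{E}_{x_0, \epsilon, t}
  \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2

% The same network read as a clean-data predictor
\hat{x}_0(x_t, t) = \frac{x_t - \sqrt{1 - \bar\alpha_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}

% The same network read as a score estimate
\nabla_{x_t} \log p_t(x_t) \approx -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar\alpha_t}}
```

The differences between papers then come down largely to which of these quantities the network outputs, how the loss is weighted across noise levels, and which sampler is used at inference time.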
Will TurboQuant save us from the RAM apocalypse?
The LLM boom is causing a global shortage of the very same computer memory it needs to sustain itself. Reports suggest OpenAI’s Stargate project alone could consume up to 40% of global DRAM output. Frontier labs like Google DeepMind need to make their models more memory-efficient.
One such technique is TurboQuant, released by Google. TurboQuant is an example of an online “quantisation” method. LLMs represent information using large tensors of numerical values, where each number typically uses 32 or 16 bits. However, many values do not require full numerical precision, so we can “round” them using fewer bits and less memory. We can see this in the example below:
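As a minimal sketch of the rounding idea, here is plain uniform quantisation to 4-bit signed integers with a single scale factor. This is a generic textbook scheme chosen for illustration and is not TurboQuant's algorithm; the function names and the sample values are made up.

```python
def quantise(values, bits=4):
    """Symmetric uniform quantisation: floats -> small signed integer codes."""
    qmax = 2 ** (bits - 1) - 1                 # largest code, e.g. 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def dequantise(codes, scale):
    """Map integer codes back to approximate floats."""
    return [c * scale for c in codes]

values = [0.12, -0.98, 0.45, 0.70, -0.33]      # illustrative tensor entries
codes, scale = quantise(values, bits=4)
restored = dequantise(codes, scale)
max_error = max(abs(a - b) for a, b in zip(values, restored))
print(codes, restored, max_error)
```

Each value is now stored as a 4-bit code plus one shared scale, and the reconstruction error is bounded by half a quantisation step. Practical methods differ mainly in how they choose scales and handle outliers.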

Some quantisation methods are applied offline before inference begins. TurboQuant is ‘online’ because it compresses the KV cache dynamically during inference.
Three Resources I Keep Coming Back to for Learning Deep Learning
There is no shortage of AI content online, but over time I have found myself returning to the same handful of resources again and again, and I wanted to share the three that have helped me the most.
This one I would recommend to anyone who is earlier in their journey. AI Summer at theaisummer.com is a free platform run by Sergios Karagiannakos and Nikolas Adaloglou, and it covers everything from the basics of neural networks through to building and deploying real ML systems. The tone is friendly and practical, and there are proper code examples throughout. It is one of those rare resources that manages to be beginner-friendly without feeling watered down.
Analyzing AlphaFold 3’s Diffusion Trajectory
A useful way to understand AlphaFold 3’s sampling behavior is to look not only at the final predicted structure, but at what happens along the reverse diffusion trajectory itself. If we track quantities such as the physical energy of samples, noise scale, and update magnitude over time, a very clear pattern emerges: structures remain physically imperfect for most of sampling, and only take proper global shape in the final low-noise steps.
This behavior is a result of the diffusion procedure implemented in Algorithm 18, Sample Diffusion, which follows an EDM-style sampler with churn. Rather than simply marching monotonically from noise to structure, the sampler repeatedly perturbs the current coordinates, denoises them, and then takes an Euler-like update step. Because of the churn mechanism, AlphaFold 3 deliberately injects additional noise during part of the trajectory, which encourages exploration but also delays local geometric convergence. This mechanism is shown in steps 4-7 of the Sample Diffusion algorithm from the AlphaFold 3 Supplementary Information.
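The perturb, denoise, Euler-step loop can be sketched as below. This is a generic first-order EDM-style sampler with churn, not a transcription of Algorithm 18: the toy denoiser, the noise schedule, the values of `gamma0` and `gamma_min`, and the choice to gate churn on the current noise level are all assumptions made for illustration.

```python
import math
import random

def toy_denoiser(x_noisy, sigma, target):
    # Hypothetical stand-in for the trained network: the ideal denoiser
    # when every training structure sits exactly at `target`.
    return list(target)

def sample_with_churn(target, sigmas, gamma0=0.8, gamma_min=1.0, seed=0):
    rng = random.Random(seed)
    x = [sigmas[0] * rng.gauss(0.0, 1.0) for _ in target]  # start from pure noise
    history = []
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        # Churn: temporarily raise the noise level while sigma is still large.
        gamma = gamma0 if sigma > gamma_min else 0.0
        sigma_hat = sigma * (1.0 + gamma)
        extra_std = math.sqrt(sigma_hat ** 2 - sigma ** 2)
        x_noisy = [xi + extra_std * rng.gauss(0.0, 1.0) for xi in x]
        # Denoise, then take an Euler step from sigma_hat down to sigma_next.
        x_denoised = toy_denoiser(x_noisy, sigma_hat, target)
        direction = [(xn - xd) / sigma_hat for xn, xd in zip(x_noisy, x_denoised)]
        dt = sigma_next - sigma_hat
        x = [xn + dt * d for xn, d in zip(x_noisy, direction)]
        update_size = max(abs(dt * d) for d in direction)
        history.append((sigma, gamma, update_size))
    return x, history

target = [1.0, -2.0, 0.5]
sigmas = [80.0, 40.0, 20.0, 10.0, 5.0, 2.0, 1.0, 0.5, 0.1, 0.0]
final, history = sample_with_churn(target, sigmas)
for sigma, gamma, update_size in history:
    print(f"sigma={sigma:6.2f}  gamma={gamma:.1f}  max update={update_size:.4f}")
```

Logging noise scale, churn factor, and update magnitude per step in this way is one simple route to the kind of trajectory analysis described above. With a real denoiser the samples would also be scored for physical plausibility at each step.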

No Pretraining, No Equivariant Architecture – Learning MLIPs without Explicit Equivariance

Machine-learned interatomic potentials (MLIPs) have become a cornerstone of modern computational chemistry, enabling simulations that approach quantum accuracy at a fraction of the cost of traditional methods such as density functional theory (DFT). However, a central challenge in designing MLIPs lies in respecting the fundamental symmetries of molecular systems, especially rotational and translational invariance, while maintaining scalability and flexibility.
In our recent work, we introduced TransIP, a novel framework that reformulates how symmetry is incorporated into molecular models: instead of hard-coding equivariance into the neural network architecture, it learns symmetry directly in the latent space of an atomic transformer model in which atoms are treated as tokens.
At the core of TransIP is a simple yet powerful idea: instead of enforcing SO(3) equivariance through specialized layers, the model is trained with a contrastive objective that aligns representations of rotated molecular configurations. A learned transformation network maps latent embeddings under rotations, encouraging the model to discover symmetry-consistent representations implicitly. This design preserves the flexibility and scalability of standard Transformers while still capturing the geometric structure of molecular systems.
SigmaDock: untwisting molecular docking with fragment-based SE(3) diffusion
Molecular docking sits at the heart of structure-based drug discovery. If we can reliably predict how a small molecule binds in a protein pocket, we can prioritize compounds faster, reason about interactions more clearly, and build better pipelines for hit discovery and lead optimization. But in practice, docking is still a difficult problem: classical methods are often robust but imperfect, while recent deep learning approaches have sometimes looked promising on headline metrics without consistently producing chemically plausible poses.
SigmaDock was built to address exactly that gap. Instead of treating docking as a problem of directly diffusing on torsion angles or unconstrained atomic coordinates, SigmaDock represents ligands as collections of rigid fragments and learns how to reassemble them inside the binding pocket using diffusion on SE(3). In plain English: rather than trying to “wiggle” every flexible degree of freedom in a tangled way, SigmaDock breaks the ligand into chemically meaningful rigid pieces and learns where those pieces should go, and how they should reorient, to recover a valid bound pose.

Building a “Second Brain” – A Functional Knowledge Stack with Obsidian

Whilst I always enjoy the acquisition of knowledge, I’ve always struggled with depositing it usefully. From pen and paper notes with a 20 colour theme which lost value with each additional colour, to OneNote or iPad GoodNotes based emulations of pen and paper, it’s been a constant quest for the optimal note taking schema. Personally there are 3 key objectives I need my note taking to achieve:
- It must be digitally compatible and accessible from any device.
- It must comfortably handle math and images.
- It must be something I look forward to – the software needs to be aesthetically clean, lightweight with none of the chunkiness of Microsoft apps, and highly customisable.
For me the solution to this was Obsidian, the perhaps more cultified sibling to Notion. Obsidian is a note-taking application that uses markdown with a surprising amount of flexibility, including the ability to partner it with an LLM, which I’ll explore in this blog alongside my vault organisation do-or-dies and favourite customisations.
New DPhil/PhD Programme in Pharmaceutical Science Joint with GSK!
Many OPIGlets found their way into a DPhil in Protein Informatics through our Systems Approaches to Biomedical Sciences (SABS) Industrial Doctoral Landscape Award, which was open to applicants from 2009 to 2024. This innovative course, based at the MPLS Doctoral Training Centre (DTC), offered six months of intensive taught modules prior to starting PhD-level research, allowing students to upskill across a diverse range of subjects (coding, mathematics, structural biology, etc.) and to go on to do research in areas significantly distinct from their formal undergraduate training. All projects also benefited from direct co-supervision from researchers working in the pharmaceutical industry, ensuring that DPhil projects were in areas with drug discovery translation potential. Regrettably, having twice successfully applied for renewal of funding, we were unsuccessful in our bid to re-fund SABS in 2024.
Happily though, we can now formally announce that our bid for a direct successor to SABS, the Transformative Technologies in Pharmaceutical Sciences IDLA, has been backed by the BBSRC, and we will shortly be opening for applications for entry this October [2026]. As someone who benefited from the interdisciplinary training and industry-adjacency of SABS, I’m thrilled to be a co-director of this new Programme and to help deliver this course to a new generation of talented students.

