Category Archives: Machine Learning

New DPhil/PhD Programme in Pharmaceutical Science Joint with GSK!

Many OPIGlets found their way into a DPhil in Protein Informatics through our Systems Approaches to Biomedical Sciences (SABS) Industrial Doctoral Landscape Award, which was open to applicants 2009-2024. This innovative course, based at the MPLS Doctoral Training Centre (DTC), offered six months of intensive taught modules prior to starting PhD-level research, allowing students to upskill across a diverse range of subjects (coding, mathematics, structural biology, etc.) and to go on to do research in areas significantly distinct from their formal undergraduate training. All projects also benefited from direct co-supervision from researchers working in the pharmaceutical industry, ensuring that DPhil projects sat in areas with potential for translation into drug discovery. Regrettably, having twice successfully applied for renewal of funding, we were unsuccessful in our bid to re-fund SABS in 2024.

Happily though, we can now formally announce that our bid for a direct successor to SABS, the Transformative Technologies in Pharmaceutical Sciences IDLA, has been backed by the BBSRC, and we will shortly be opening for applications for entry this October [2026]. As someone who benefited from the interdisciplinary training and industry-adjacency of SABS, I’m thrilled to be a co-director of this new Programme and to help deliver this course to a new generation of talented students.

Continue reading

Democratising the Dark Arts: Writing Triton Kernels with Claude

Why would you ever want to leave the warm, fuzzy embrace of torch.nn? It works, it’s differentiable, and it rarely causes your entire Python session to segfault without a stack trace.

The answer usually comes down to the “Memory Wall.” Modern deep learning is often less bound by how fast your GPU can do math (FLOPS) and more bound by how fast it can move data around (Memory Bandwidth). When you write a sequence of simple PyTorch operations, something like x = x * 2 + y, the GPU often reads x from memory, multiplies it, writes it back, reads it again to add y, and writes it back again. It’s the computational equivalent of making five separate trips to the grocery store because you forgot the eggs, then the milk, then the bread.

Writing a custom kernel lets you “fuse” these operations. You load the data once, perform a dozen mathematical operations on it while it sits in the ultra-fast chip registers, and write it back once. The performance gains can be massive (often 2x-10x for specific layers). But traditionally, the “cost” of accessing those gains (learning C++, understanding warp divergence, and managing memory manually) was just too high for most researchers. That equation is finally changing.
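To put rough numbers on the fusion argument above, here is a minimal back-of-the-envelope sketch in plain Python. It is not a Triton kernel and it does not measure anything; it only counts the bytes that would cross the memory bus for x * 2 + y under a simplified model. The function names, the float32 element size and the assumption that nothing is served from cache are all mine, not the post's.

```python
def unfused_bytes(n, itemsize=4):
    """Bytes moved when x * 2 and + y run as two separate kernels.

    Simplified model (an assumption): every operand comes from global
    memory and every result goes back to it, with no caching.
    """
    multiply = n * itemsize * 2  # read x, write the temporary
    add = n * itemsize * 3       # read the temporary, read y, write the result
    return multiply + add


def fused_bytes(n, itemsize=4):
    """Bytes moved when both steps run inside one kernel."""
    return n * itemsize * 3      # read x, read y, write the result


n = 1 << 20  # about a million float32 elements (an arbitrary size)
print("unfused:", unfused_bytes(n), "bytes")
print("fused:  ", fused_bytes(n), "bytes")
print("ratio:  ", unfused_bytes(n) / fused_bytes(n))
```

Under this model the two-operation example only saves a factor of 5/3 in traffic. The larger gains quoted above would come from longer chains of operations, where every extra unfused step adds another round trip for an intermediate that a fused kernel could keep in registers.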

Continue reading

What Molecular ML Can Learn from the Vision Community’s Representation Revolution

Something remarkable happened in computer vision in 2025: the fields of generative modeling and representation learning, which had developed largely independently, suddenly converged. Diffusion models started leveraging pretrained vision encoders like DINOv2 to dramatically accelerate training. Researchers discovered that aligning generative models to pretrained representations doesn’t just speed things up—it often produces better results.

As someone who works on generative models for (among other things) molecules and proteins, I’ve been watching this unfold with great interest. Could we do the same thing for molecular ML? We now have foundation models like MACE that learn powerful atomic representations. Could aligning molecular generative models to these representations provide similar benefits?

In this post, I’ll summarize what happened in vision (organized into four “phases”), and then discuss what I think are the key lessons for molecular machine learning. The punchline: many of these ideas are already starting to appear in our field, but we’re still in the early stages compared to vision.

For a more detailed treatment of the vision developments with full references and figures, see the extended blog post on my website.

Continue reading

Chemical Languages in Machine Learning

For more than a century, chemists have been trying to squeeze the beautifully messy, quantum-smeared reality of molecules into tidy digital boxes: “formats” such as line notations, connection tables, coordinate files, or even the vaguely hieroglyphic Wiswesser Line Notation. These formats weren’t designed for machine learning; some weren’t even designed for computers. And yet, they’ve become wedged into the backbone of modern drug discovery, materials design and computational chemistry.

The emerging use of large language models and natural language processing in chemistry poses an immediate question: What does it mean for a molecule to have a “language,” and how should machines speak it?

If molecules are akin to words and sentences, what alphabet and grammatical rules should they follow?

What follows is a tour through the evolving world of chemical languages: why we use them, why our old representations keep breaking our shiny new models, and what might replace them.

Continue reading

An Introduction to the Basics of Reinforcement Learning

Reinforcement learning (RL) is pretty simple in theory: “take actions, get rewards, increase likelihood of high-reward actions”. However, we can quickly run into subtle problems that don’t show up in standard supervised learning. The aim of this post is to give a gentle, concrete introduction to what RL actually is, why we might want to use it instead of (or alongside) supervised learning, and some of the headaches (Figure 1) that come with it: sparse rewards, credit assignment, and reward shaping.

Figure 1: I’d like to help take you from confusion/headache 🙁 (left) to having at least some clarity 🙂 (right) with regard to what reinforcement learning is and where it’s useful.

Rather than starting with Atari or robot arms, we’ll work through a small toy environment: a paddle catching falling balls. It’s simple enough to understand visually, but rich enough to show how different reward designs can lead to completely different behaviours, even when the underlying environment and objective are the same. Along the way, we’ll connect the code to the standard RL formalism (MDPs, returns, policy gradients), so you can see how the equations map onto something you can actually run.
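As a small taste of how the formalism maps onto code, here is a minimal sketch of the discounted return, G_t = r_t + gamma * r_(t+1) + gamma^2 * r_(t+2) + ..., which is the quantity a policy-gradient method weights each action by. The function name, the discount factor and the four-step reward sequence are illustrative assumptions of mine, not taken from the post's paddle environment.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t for every time step of one episode.

    Works backwards through the episode using the recursion
    G_t = r_t + gamma * G_(t+1), with G = 0 after the final step.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical sparse-reward episode: nothing happens for three steps,
# then the paddle catches the ball and receives a reward of 1.
episode_rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_returns(episode_rewards, gamma=0.5))
# [0.125, 0.25, 0.5, 1.0]
```

Even in this tiny case the credit assignment problem is visible: the only signal arrives at the last step, and the discount decides how much of it reaches the earlier actions that set up the catch.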

Continue reading

Is the molecule in the computer?

The Molecular Graphics and Modelling Society began life as the Molecular Graphics Society. It’s hard to imagine a time without computer graphics, but yes, it existed. The MGS was formed by the pioneers who made molecular graphics commonplace.

In 1994, the MGS organized an Art and Video Show (Goodsell et al., 1995), and I submitted some of my own work. One of the other images, inspired by Magritte’s “Ceci n’est pas une pipe”, depicts a molecule with a remarkable similarity to a pipe, and to a molecule… It was submitted by Mike Hann (of GSK):

“Ceci n’est pas une molecule”, image by Mike Hann, 1994.
Continue reading

I Prompt, Therefore I Am: Is Artificial Intelligence the End of Human Thought? 

Welcome to a slightly different blog post than usual. Today I am sharing an insight into my life at Keble College, Oxford. I am the Chair of Cheese and Why?, which is a talk series we host in our common room during term. The format is simple: I provide cheese and wine, and a guest speaker provides the “why”—a short, thought-provoking talk to spark discussion for the evening.

To kick off the series, I opened with the question of artificial intelligence replacing human thought. I am sharing my spoken essay below. The aim of a Cheese and Why? talk is to generate questions rather than deliver answers, so I hope you’ll forgive me if what follows doesn’t quite adhere to the rigorous structure of a traditional Oxford humanities essay. For best reading, I recommend a glass of claret and a wedge of Stilton, to recreate the full Oxford common-room experience.

Continue reading

Is attention all you need for protein folding?

Researchers from Apple have released SimpleFold, a protein structure prediction model which uses exclusively standard Transformer layers. The results seem to show that SimpleFold is a little less accurate than methods such as AlphaFold2, but much faster and easier to integrate into standard LLM-like workflows. SimpleFold also shows very good scaling performance, in line with other Transformer models like ESM2. So what is powering this seemingly simple development?

Continue reading

Accelerating AlphaFold 3 for high-throughput structure prediction

Introduction

Recently, I have been working on a project in which I need to predict the structures of a dataset comprising a few thousand protein sequences using AlphaFold 3. With a naive approach, each entry took an hour or two to produce a predicted structure. With a few thousand structures, it looked as though the full run would take months…
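To make the "months" estimate concrete, here is a rough arithmetic sketch. The figures are placeholders I have picked from within the ranges quoted above (3,000 entries at 1.5 hours each, run one after another); the real dataset size and per-entry time are not stated more precisely, and the function names are hypothetical.

```python
def sequential_days(n_entries, hours_per_entry):
    """Wall-clock days if every prediction runs one after another."""
    return n_entries * hours_per_entry / 24


def required_speedup(current_days, target_days):
    """Overall speedup factor needed to finish within target_days."""
    return current_days / target_days


# Assumed values: "a few thousand" entries, "an hour or two" each.
naive = sequential_days(n_entries=3000, hours_per_entry=1.5)
print(naive)  # 187.5 days, i.e. roughly six months

# Speedup needed to bring the whole run down to one week.
print(round(required_speedup(naive, target_days=7), 1))
```

With these assumed numbers, finishing in under a week needs a combined speedup somewhere in the mid-twenties, whether it comes from faster individual steps, running entries concurrently, or both.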

In this blog post, I will go through some tips I found to help accelerate the structure predictions and make all of the predictions I needed in under a week. In general, following the tips in the AlphaFold 3 performance documentation is a useful starting place. Most of the tips I provide are related to accelerating the MSA generation portion of the predictions because this was the biggest bottleneck in my case.

Continue reading

How reliable are affinity datasets in practice?

The Data Bottleneck in AI-Powered Drug Discovery

The pharmaceutical industry is undergoing a profound transformation, driven by the promise of Artificial Intelligence (AI) and Machine Learning (ML). These technologies offer the potential to escape the industry’s persistent challenges of high costs, protracted development timelines, and staggering failure rates. From accelerating the identification of novel biological targets to optimizing the properties of lead compounds, AI is poised to enhance the precision and efficiency of drug discovery at nearly every stage.

Yet, this revolutionary potential is constrained by a fundamental dependency. The power of modern AI, particularly the deep learning (DL) models that excel at complex pattern recognition, is directly proportional to the volume, diversity, and quality of the data they are trained on. This creates a critical bottleneck: the high-quality experimental data required to train these models—specifically, the protein-ligand binding affinity values that quantify the strength of an interaction—are notoriously scarce, expensive to generate, and often of inconsistent quality or locked within proprietary databases.

Continue reading