Category Archives: Small Molecules

New DPhil/PhD Programme in Pharmaceutical Science Joint with GSK!

Many OPIGlets found their way into a DPhil in Protein Informatics through our Systems Approaches to Biomedical Sciences Industrial Doctoral Landscape Award, which was open to applicants 2009-2024. This innovative course, based at the MPLS Doctoral Training Centre (DTC), offered six months of intensive taught modules prior to starting PhD-level research, allowing students to upskill across a diverse range of subjects (coding, mathematics, structural biology, etc.) and to go on to do research in areas significantly distinct from their formal Undergraduate training. All projects also benefited from direct co-supervision from researchers working in the Pharmaceutical industry, ensuring DPhil projects in areas with drug discovery translation potential. Regrettably, having twice successfully applied for renewal of funding, we were unsuccessful in our bid to refund SABS in 2024.

Happily though, we can now formally announce that our bid for a direct successor to SABS, the Transformative Technologies in Pharmaceutical Sciences IDLA, has been backed by the BBSRC, and we will shortly be opening for applications for entry this October [2026]. As someone who benefited from the interdisciplinary training and industry-adjacency of SABS, I’m thrilled to be a co-director of this new Programme and to help deliver this course to a new generation of talented students.

Continue reading →

What Molecular ML Can Learn from the Vision Community’s Representation Revolution

Something remarkable happened in computer vision in 2025: the fields of generative modeling and representation learning, which had developed largely independently, suddenly converged. Diffusion models started leveraging pretrained vision encoders like DINOv2 to dramatically accelerate training. Researchers discovered that aligning generative models to pretrained representations doesn’t just speed things up—it often produces better results.

As someone who works on generative models for (among other things) molecules and proteins, I’ve been watching this unfold with great interest. Could we do the same thing for molecular ML? We now have foundation models like MACE that learn powerful atomic representations. Could aligning molecular generative models to these representations provide similar benefits?

In this post, I’ll summarize what happened in vision (organized into four “phases”), and then discuss what I think are the key lessons for molecular machine learning. The punchline: many of these ideas are already starting to appear in our field, but we’re still in the early stages compared to vision.

For a more detailed treatment of the vision developments with full references and figures, see the extended blog post on my website.

Continue reading →

Chemical Languages in Machine Learning

For more than a century, chemists have been trying to squeeze the beautifully messy, quantum-smeared reality of molecules into tidy digital boxes, “formats” such as line notations, connection tables, coordinate files, or even the vaguely hieroglyphic Wiswesser Line Notation. These formats weren’t designed for machine learning; some weren’t even designed for computers. And yet, they’ve become the wedged into the backbones of modern drug discovery, materials design and computational chemistry.

The emergent use of large language models and natural language processing in chemistry posits the immediate question: What does it mean for a molecule to have a “language,” and how should machines speak it?

if molecules are akin to words and sentences, what alphabet and grammatical rules should they follow?

What follows is a tour through the evolving world of chemical languages, why we use them, why our old representations keep breaking our shiny new models, and what might replace them.

Continue reading →

Controlling the Diffusion Denoising Process: A Molecular Show

This blog post is supporting my poster at Young Modellers Forum and makes things way easier to see and understand. Underneath each GIF, is the explanation of what you should look for as things denoise throughout the diffusion trajectory. Click the GIFs for higher quality viewing!

Continue reading →

Design your very own drug: An introduction to structure-based small molecule drug design

Are you curious about how scientists design small molecules to treat disease using computational tools, but the words RDKit, docking, and QED mean nothing to you? Look no further than these tutorials for learning the fundamentals of computational small molecule drug design through interactive tutorials that introduce the key tools, concepts, and workflows. From generating compounds to evaluating their drug-likeness and binding potential, by the end you’ll be ready to explore how computational methods can result in the discovery of your very own (virtual) drug candidates to cure Zika!

Find the materials here: https://github.com/oxpig/dtc-struc-bio-smolecules/tree/main.

Continue reading →

Is the molecule in the computer?

The Molecular Graphics and Modelling Society began life as the Molecular Graphics Society. It’s hard to imagine a time without computer graphics, but yes, it existed. The MGS was formed by the pioneers who made molecular graphics commonplace.

In 1994, the MGS organized an Art and Video Show (Goodsell et al., 1995), and I submitted some of my own work. One of the other images — inspired by Magritte‘s “Ceci n’est pas une pipe”, depicts a molecule with a remarkable similarity to a pipe — and to a molecule… It was submitted by Mike Hann (of GSK):

“Ceci n’est pas une molecule”, image by Mike Hann, 1994.

Continue reading →

Fragment-to-Lead Successes in 2023

Back in 2021, I highlighted the annual fragment-to-lead (F2L) success stories from 2019 [Blog post] [Paper]. This is one of my favourite annual publications, and I’m delighted to see that it’s still going strong. In this post, I’ll discuss the 2023 edition that was published in at the start of 2025 [Paper].

Continue reading →

Extracting 3D Pharmacophore Points with RDKit

Pharmacophores are simplified representations of the key interactions ligands make with proteins, such as hydrogen bonds, charge interactions, and aromatic contacts. Think of them as the essential “bumps and grooves” on a key that allow it to fit its lock (the protein). These maps can be derived from ligands or protein–ligand complexes and are powerful tools for virtual screening and generative models. Here, we’ll see how to extract 3D pharmacophore points from a ligand using RDKit.
(Code adapted from Dr. Ruben Sanchez.)

Why pharmacophore “points”?

RDKit represents each pharmacophore feature (donor, acceptor, aromatic, etc.) as a point in 3D space, located at the feature center. These points capture the essential interaction motifs of a ligand without requiring the full atomic detail.

Continue reading →

How reliable are affinity datasets in practice?

The Data Bottleneck in AI-Powered Drug Discovery

The pharmaceutical industry is undergoing a profound transformation, driven by the promise of Artificial Intelligence (AI) and Machine Learning (ML). These technologies offer the potential to escape the industry’s persistent challenges of high costs, protracted development timelines, and staggering failure rates. From accelerating the identification of novel biological targets to optimizing the properties of lead compounds, AI is poised to enhance the precision and efficiency of drug discovery at nearly every stage

Yet, this revolutionary potential is constrained by a fundamental dependency. The power of modern AI, particularly the deep learning (DL) models that excel at complex pattern recognition, is directly proportional to the volume, diversity, and quality of the data they are trained on. This creates a critical bottleneck: the high-quality experimental data required to train these models—specifically, the protein-ligand binding affinity values that quantify the strength of an interaction—are notoriously scarce, expensive to generate, and often of inconsistent quality or locked within proprietary databases.

Continue reading →

ChatGPT can now use RDKit!

All chemistry LLM enthusiasts were treated to a pleasant surprise on Friday when Greg Brockman tweeted that ChatGPT now has access to RDKit. I’ve spent a few hours playing with the updated models and I have summarized some of my findings in this blog.

ChatGPT now can analyze, manipulate, and visualize molecules and chemical information via the RDKit library.

Useful for scientific work across health, biology, and chemistry. pic.twitter.com/WfbjuSMFVR
— Greg Brockman (@gdb) May 23, 2025

Continue reading →

Oxford Protein Informatics Group

or "OPIG" to friends

Category Archives: Small Molecules

New DPhil/PhD Programme in Pharmaceutical Science Joint with GSK!

What Molecular ML Can Learn from the Vision Community’s Representation Revolution

Chemical Languages in Machine Learning

Controlling the Diffusion Denoising Process: A Molecular Show

Design your very own drug: An introduction to structure-based small molecule drug design

Is the molecule in the computer?

Fragment-to-Lead Successes in 2023

Extracting 3D Pharmacophore Points with RDKit

Why pharmacophore “points”?

How reliable are affinity datasets in practice?

The Data Bottleneck in AI-Powered Drug Discovery

ChatGPT can now use RDKit!