Category Archives: AI

Understand Large Codebases Faster Using GitIngest

As researchers, we often have to deal with large and ugly codebases – this is not new, I know. But fear not: we now have large language models (LLMs) like ChatGPT and friends, which make things a little faster! In this blog post I will show you how to use GitIngest to do this even faster with your favourite LLM.

No more copy-pasting files individually or writing a paragraph explaining the directory structure or, even worse, relying on an LLM to use web search to find the codebase. As the codebase grows, so does the unreliability of these methods. GitIngest makes any whole codebase prompt-friendly – one prompt will be all you need!
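To make the idea of a "prompt-friendly" codebase concrete, here is a minimal standard-library sketch of flattening a repository into one text digest (a file listing followed by each file's contents). This is not GitIngest's implementation or API; the function name `build_prompt_digest`, its parameters, and the output layout are all assumptions made for illustration.

```python
from pathlib import Path


def build_prompt_digest(root, suffixes=(".py", ".md"), max_chars=20_000):
    """Flatten a directory into a single prompt-ready string.

    Hypothetical helper: the layout (file listing, then file bodies)
    is an illustrative choice, not GitIngest's actual output format.
    """
    root = Path(root)
    # Collect matching files in a stable order so the digest is reproducible.
    files = sorted(
        p for p in root.rglob("*") if p.is_file() and p.suffix in suffixes
    )
    parts = ["Directory structure:"]
    parts.extend(p.relative_to(root).as_posix() for p in files)
    parts.append("")
    for p in files:
        # Truncate very large files so a single file cannot swamp the prompt.
        text = p.read_text(encoding="utf-8", errors="replace")[:max_chars]
        parts.append("=" * 40)
        parts.append(f"File: {p.relative_to(root).as_posix()}")
        parts.append("=" * 40)
        parts.append(text)
    return "\n".join(parts)
```

Calling `build_prompt_digest(".")` in a project directory returns a single string that can be pasted into an LLM prompt; filtering by suffix and truncating long files are the simplest ways to keep the digest within a context window.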


How reliable are affinity datasets in practice?

The Data Bottleneck in AI-Powered Drug Discovery

The pharmaceutical industry is undergoing a profound transformation, driven by the promise of Artificial Intelligence (AI) and Machine Learning (ML). These technologies offer the potential to overcome the industry’s persistent challenges of high costs, protracted development timelines, and staggering failure rates. From accelerating the identification of novel biological targets to optimizing the properties of lead compounds, AI is poised to enhance the precision and efficiency of drug discovery at nearly every stage.

Yet, this revolutionary potential is constrained by a fundamental dependency. The power of modern AI, particularly the deep learning (DL) models that excel at complex pattern recognition, is directly proportional to the volume, diversity, and quality of the data they are trained on. This creates a critical bottleneck: the high-quality experimental data required to train these models—specifically, the protein-ligand binding affinity values that quantify the strength of an interaction—are notoriously scarce, expensive to generate, and often of inconsistent quality or locked within proprietary databases.


GUI Slop

Previously, I wrote about writing GUIs for controlling and monitoring experiments. For ML this might be useful for tracking model learning (e.g. the popular Weights & Biases platform), while in the wet lab it is great for making experiments simpler and more reliable to run, monitor and record.

And as it turns out, AI is quite good at this!

I have been using GitHub Copilot in VS Code, in agent mode with Gemini 2.5 Pro, to create simple GUIs that can control my experiments, which has proved pretty effective. Although there is clearly a concern when interfacing AI-generated code with real hardware (especially if you “vibe code”, that is, just run whatever it generates), in practice it has allowed me to quickly generate tools for testing purposes, cutting the time required to get a project started from hours to minutes.

As an example, I recently needed to hook up a Helmholtz coil to some custom electronics, centred around a Teensy micro-controller and designed to output a precisely controlled current.


Can AI help us design better viruses?

Viruses are the most abundant biological entity on the planet. They infect virtually every kind of life form including (sort of) other viruses. Viruses are intensely efficient – some viruses contain as few as 4 genes. Their strategy is typically simple: infect a cell, use its machinery to produce more viruses, and spread to other cells.

Pathogenic human viruses are terrible, but there are many other viruses which are useful for humans. For instance, many modern vaccines use viral vectors to produce antigens of other pathogenic entities. There is also growing interest in using viruses to fight off bacterial infections.


Pose Prediction: Does Your Model Generalize? The Role of Data Similarity

In our recent work with the PoseBusters benchmark, we made a deliberate choice: to include both receptors seen during training and completely novel ones. Why? To explore an often-overlooked question: how much does receptor similarity to training data influence model performance?


Attention Is All You Need – A Moral Case

It turns out that giving neural networks attention gives you some pretty amazing results. The attention mechanism allowed neural language models to ingest vast amounts of data in a highly parallelised manner, efficiently learning what to pay the most attention to in a contextually aware manner. This computational breakthrough launched the LLM-powered AI revolution we’re living through. But what if attention isn’t just a computational trick? What if the same principle that allows transformers to focus on what matters from a sea of information also lies at the heart of consciousness, perception, and even morality itself? (Ok, maybe this is a bit of a stretch, but hear me out.)

To understand the connection, we need to look at how perception really works. Modern neuroscience reveals that experience is fundamentally subjective and generative. We’re not passive receivers of objective reality through our senses; we’re active constructors of our own experience. According to predictive processing theory, our minds constantly generate models of reality, and our sensory input is then used to provide an ‘error’ signal for these predictions. But the extraordinary point here is that we never ‘see’ these sensory inputs, only our mind’s best guess of how the world should be, updated by sensory feedback. As consciousness researcher Anil Seth puts it, “Reality is a controlled hallucination… an action-oriented construction, rather than passive registration of an objective external reality”, or in the words of Anaïs Nin, half a century earlier, “We do not see things as they are, we see things as we are.”


ChatGPT can now use RDKit!

All chemistry LLM enthusiasts were treated to a pleasant surprise on Friday when Greg Brockman tweeted that ChatGPT now has access to RDKit. I’ve spent a few hours playing with the updated models and have summarized some of my findings in this blog post.


Attending LMRL @ ICLR 2025

I recently attended the Learning Meaningful Representations of Life (LMRL) workshop at ICLR 2025. The goal of LMRL is to highlight machine learning methods which extract meaningful or useful properties from unstructured biological data, with an eye towards building a virtual cell. I presented my paper which demonstrates how standard Transformers can learn to meaningfully represent 3D coordinates when trained on protein structures. Each paper submitted to LMRL had to include a “meaningfulness statement” – a short description of how the work presents a meaningful representation.


Featurisation is Key: One Version Change that Halved DiffDock’s Performance

1. Introduction 

Molecular docking with graph neural networks works by representing the molecules as featurized graphs. In DiffDock, each ligand becomes a graph of atoms (nodes) and bonds (edges), with features assigned to every atom using chemical properties such as atom type, implicit valence and formal charge. 
 
We recently discovered that a change in RDKit versions significantly reduces performance on the PoseBusters benchmark, due to changes in the “implicit valence” feature. This post walks through:

  • How DiffDock featurises ligands 
  • What happened when we upgraded RDKit 2022.03.3 → 2025.03.1 
  • Why training with zero-only features and testing on non-zero features is so bad 

TL;DR: Use the dependencies listed in the environment.yml file, especially in the case of DiffDock, or your performance could halve!
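As a toy illustration of why training on zero-only features and testing on non-zero ones hurts, the sketch below encodes an atom's implicit valence as a categorical index and compares which slots are used at training time versus test time. It is not DiffDock's code: the vocabulary `IMPLICIT_VALENCE_CHOICES`, the helper names, and the example valence values are all assumptions chosen for illustration.

```python
# Hypothetical vocabulary; the real feature lists live in the model's own code.
IMPLICIT_VALENCE_CHOICES = [0, 1, 2, 3, 4, "misc"]


def safe_index(choices, value):
    """Return the index of value in choices, or the last ('misc') slot."""
    try:
        return choices.index(value)
    except ValueError:
        return len(choices) - 1


def one_hot(choices, value):
    """One-hot vector for value over the given vocabulary."""
    vec = [0] * len(choices)
    vec[safe_index(choices, value)] = 1
    return vec


def active_columns(values, choices=IMPLICIT_VALENCE_CHOICES):
    """Set of feature slots that are ever switched on for these values."""
    return {safe_index(choices, v) for v in values}


# Illustrative numbers only, not measured data.
train_valences = [0, 0, 0, 0]  # feature is constant at zero during training
test_valences = [0, 1, 3, 2]   # same feature, now non-zero at inference

seen = active_columns(train_valences)
unseen = active_columns(test_valences) - seen
print("slots trained on:", sorted(seen))
print("slots first met at test time:", sorted(unseen))
```

Any weights or embeddings attached to the slots in `unseen` never received a gradient during training, so at test time the model is fed inputs it has effectively never learned to interpret. Pinning the featurisation library to the version used for training keeps the two distributions aligned.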


AI generated linkers™: a tutorial

In molecular biology, cutting and tweaking a protein construct is an essential but often under-appreciated operation. Some proteins have unwanted extra bits. Some proteins may require a partner to be in the correct state, which would ideally be expressed as a fusion protein. Some proteins need parts replacing. Some proteins disfavour a desired state. Half a decade ago, toolkits existed to attempt to tackle these problems, and now, with the advent of de novo protein generation, new, powerful, precise and far less painful methods are here. Herein I will therefore discuss how to generate de novo inserts and more with RFdiffusion and other tools in order to quickly launch a project into the right orbit.
Furthermore, even when newer methods come out, these design principles will still apply, so ignore the name of the de novo tool used.
