Handling OAS Scale Datasets Without The Drama

Working with the Observed Antibody Space (OAS) dataset sometimes feels a bit like trying to cook dinner with the contents of the whole fridge emptied into the pan. There are countless CSVs, all of different sizes (some might not even fit into your RAM), and you just want a clean, fast pipeline so you can get back to modelling. The trick is to stop treating the data like a giant spreadsheet you fully load into memory and start treating it like a columnar, on-disk database you stream through. That’s exactly what the 🤗 Datasets library gives you.

At the heart of 🤗 Datasets is Apache Arrow, which stores data in a columnar format that can be memory-mapped from disk (if you are curious about what that means, there is a great explanation in another blog post here). In plain terms: the data mostly lives on disk, and you pull in just the slices you need. It feels interactive even when the dataset is huge. Instead of a single monolithic script that does everything (and takes forever), you layer small, composable steps (standardize a few columns, filter out junk, compute a couple of derived fields), and each step is cached automatically. Change one piece, and everything upstream of it is reused from the cache instead of being recomputed. Sounds great, right? But of course, the key question now is how to get OAS data into Datasets to begin with.
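
To make the "small, composable steps" idea concrete, here is a minimal sketch in plain Python using only the standard library. It is not the 🤗 Datasets API and it does no caching; it only illustrates streaming rows one at a time through layered steps instead of loading a whole CSV. The column names (`sequence`, `chain`, `productive`) and the tiny inline CSV are hypothetical stand-ins, not the real OAS schema.

```python
import csv
import io

# Hypothetical miniature stand-in for an OAS-style CSV (real files are far larger,
# and the real column names differ).
RAW = """sequence,chain,productive
QVQLVQSG,heavy,T
,heavy,T
DIQMTQSP,light,F
EVQLVESG,Heavy,T
"""


def read_rows(handle):
    """Yield one row at a time instead of loading the whole file."""
    yield from csv.DictReader(handle)


def standardize(rows):
    """Normalise the chain label to lower case."""
    for row in rows:
        yield {**row, "chain": row["chain"].lower()}


def drop_junk(rows):
    """Keep only productive rows with a non-empty sequence."""
    for row in rows:
        if row["sequence"] and row["productive"] == "T":
            yield row


def add_length(rows):
    """Attach a derived field: the sequence length."""
    for row in rows:
        yield {**row, "length": len(row["sequence"])}


def run_pipeline(handle):
    """Chain the steps; nothing is read until the final list() pulls rows through."""
    return list(add_length(drop_junk(standardize(read_rows(handle)))))


result = run_pipeline(io.StringIO(RAW))
```

Each step only ever holds one row at a time, so swapping `io.StringIO(RAW)` for an open file handle would keep memory use flat regardless of file size.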

Exploding Barbers (Paradoxes — Part I)

Prelude

I came upon a traveller on a dust-swept road at dusk.
Along the cliff’s high edge it ran, where seabirds rode the gust;
Upon a stone he rested still, with gaze toward the deep,
As though the sea held secrets vast that mortals may not keep.
Behind us wound the ancient way through heather wild and wood,
To where a castle, firm and fair, upon the hilltop stood.

A guide to fixing broken AMBER MD trajectory files and visualisations.

You’ve just finished a week-long molecular dynamics simulation. You’re excited to see what happened to your protein complex, so you load up the trajectory in VMD and… your protein looks like it’s been through a blender. Pieces are scattered across the screen, water molecules are everywhere, and half your complex seems to have teleported to the other side of the simulation box. This chaos is caused by periodic boundary conditions (PBC).

PBC

PBC is a computational trick that simulates bulk behaviour by treating your simulation box like a repeating tile. When a molecule exits one side, it immediately reappears on the opposite side. This works well for the physics, as your protein experiences realistic bulk water behaviour.
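
As a rough illustration of the "repeating tile" idea, the sketch below wraps a single coordinate back into a periodic box and computes the shortest (minimum-image) displacement between two particles. It assumes a rectangular box and works on one axis at a time; AMBER simulations often use other box shapes, and this is not how any particular MD package implements it. The function names are made up for the example.

```python
import math


def wrap(x, box):
    """Map a coordinate back into [0, box) for a periodic box of length `box`."""
    return x - box * math.floor(x / box)


def minimum_image(dx, box):
    """Shortest displacement between two particles under periodic boundaries."""
    return dx - box * round(dx / box)


# An atom that drifts past the right-hand wall reappears near the left-hand wall.
wrapped_right = wrap(10.5, 10.0)
# An atom that drifts past the left-hand wall reappears near the right-hand wall.
wrapped_left = wrap(-0.5, 10.0)
# Two atoms 9 units apart in a 10-unit box are really 1 unit apart through the wall.
shortest = minimum_image(9.0, 10.0)
```

Wrapping is applied atom by atom, so two bonded atoms sitting on either side of a wall can end up drawn at opposite ends of the box even though `minimum_image` says they are still neighbours.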

Understand Large Codebases Faster Using GitIngest

Often as researchers we have to deal with large and ugly codebases. This is not new, I know. But fear not: we now have large language models (LLMs) like ChatGPT and friends, which make things a little faster! In this blog post I will show you how to use GitIngest to do this even faster using your favourite LLM.

No more copy-pasting files individually or writing a paragraph explaining the directory structure, or even worse, relying on an LLM to use web search to find the codebase. As the codebase grows, so does the unreliability of these methods. GitIngest makes any whole codebase prompt-friendly: one prompt will be all you need!

ISMB/ECCB conference feedback 

The ISMB/ECCB conference took place in Liverpool this year, so a couple of OPIGlets took the train up north to attend this biennial joint conference. Here we will give some general feedback on the conference and highlight some interesting talks and posters.

General feedback 

ISMB/ECCB is a 4.5-day conference starting on the Sunday evening and running until Thursday evening. The conference is attended by around 2500 people, mostly from academic groups around the world. With more than 20 different tracks, it is a broad conference with many sessions running in parallel. As always, it is therefore worth looking at the schedule beforehand so as not to get overwhelmed. Each day there is one keynote, two poster sessions, and three blocks of talks. These talks are often given by PIs, but postdocs and PhD students also get the opportunity to present. There are also some smaller slots for highlighting posters which are presented that day.

This year there was a very interesting line-up of Distinguished Keynote speakers. The conference was kicked off by John Jumper talking about AlphaFold2, with a focus on how the team tackled the various problems they met in going from the initial AlphaFold model to AlphaFold2. On Monday Prof. Amos Bairoch talked about biocuration and the importance and challenges of public databases. He discussed the FAIR principles (Findable, Accessible, Interoperable, and Reusable) for data management [1]. The next keynote was by Prof. James Zou about computational biology in the age of AI agents (more on this later). On Wednesday we had our own Prof. Charlotte Deane (woo!) talking about structure-based drug discovery with a focus on the importance of baselines and benchmarking. The conference closed with a short interview with Prof. David Baker, followed by a talk from Prof. Fabian Theis on decoding cellular systems. He discussed Cellflow [2], an AI tool that predicts how perturbations like drugs affect the cellular phenotype.

How reliable are affinity datasets in practice?

The Data Bottleneck in AI-Powered Drug Discovery

The pharmaceutical industry is undergoing a profound transformation, driven by the promise of Artificial Intelligence (AI) and Machine Learning (ML). These technologies offer the potential to escape the industry’s persistent challenges of high costs, protracted development timelines, and staggering failure rates. From accelerating the identification of novel biological targets to optimizing the properties of lead compounds, AI is poised to enhance the precision and efficiency of drug discovery at nearly every stage.

Yet, this revolutionary potential is constrained by a fundamental dependency. The power of modern AI, particularly the deep learning (DL) models that excel at complex pattern recognition, is directly proportional to the volume, diversity, and quality of the data they are trained on. This creates a critical bottleneck: the high-quality experimental data required to train these models—specifically, the protein-ligand binding affinity values that quantify the strength of an interaction—are notoriously scarce, expensive to generate, and often of inconsistent quality or locked within proprietary databases.

Conference feedback: Protein Society Annual Symposium

Recently, a couple of OPIG members had the opportunity to attend and present at the 39th Annual Symposium of the Protein Society—a not-for-profit scholarly society founded in 1985 that focuses on protein structure, function, and design—held in San Francisco.

The PS39 schedule was well designed, offering a balance between plenary talks, themed parallel sessions, and networking opportunities. A wide range of topics was covered, including transient protein states, supramolecular assemblies, proteostasis, and circadian clocks. This allowed us to follow areas of personal interest, both related and unrelated to our research, while exploring unfamiliar fields. Although many talks were biology-heavy, they were generally pitched at an accessible level for those from other disciplines (i.e. the small-molecule side of OPIG). Presentations almost always included results from both in silico and experimental approaches, with relatively few focusing exclusively on one or the other; a very nifty thing to see as people who mostly just dream of experimental validation! In contrast to our focus on generalisable models, many of the researchers presenting had dedicated years to studying a single protein or system, uncovering its nuances in a way that made for some neat storytelling.

GPT-5 achieves state-of-the-art chemical intelligence

I have run ChemIQ (our chemical reasoning benchmark) on GPT-5. The model achieves state-of-the-art performance with substantial improvements in the ability to interpret SMILES strings. Read my analysis and initial findings below. Scroll to the end for some cool demos.

Figure 1: Success rates for each model on the ChemIQ reasoning benchmark. Horizontal brackets between adjacent bars indicate the result of a two-tailed McNemar’s test comparing paired outcomes for the same questions. Significance levels are shown as: n.s. (not significant, p ≥ 0.05), * (p < 0.05), ** (p < 0.01), and *** (p < 0.001).
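
For readers unfamiliar with the test named in the caption, here is a minimal sketch of a two-tailed McNemar's test on paired right/wrong outcomes. It assumes the exact (binomial) variant; the caption does not say whether the exact or the chi-squared form was used, and the function name is made up for this example.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-tailed exact McNemar p-value from the discordant pair counts.

    b: questions model A answered correctly and model B answered incorrectly
    c: questions model A answered incorrectly and model B answered correctly
    Questions where both models agree carry no information and are ignored.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Probability of a split at least this lopsided under a fair coin, one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: model B wins all 10 questions on which the models disagree.
p_value = mcnemar_exact(0, 10)
```

Because only the questions where the two models disagree enter the calculation, the test suits benchmarks where every model answers the same question set.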

Taming the Trajectory Beast: A Simpler Way to Sample Your MD Simulations

If you’ve ever run a molecular dynamics (MD) simulation, you know the feeling. You spend days, weeks, or even months of precious compute time watching your favourite molecule wiggle and jiggle. The result? A trajectory file bursting with thousands, or even millions, of frames. It’s a treasure trove of data, but it’s also a monster…

Analyzing every single frame is often impossible and, let’s be honest, usually pointless. Many adjacent frames are nearly identical. What we really want are the key representative structures that capture the important shapes, or conformations, your molecule adopted. So, how do we find them?
