Category Archives: Code

A guide to fixing broken AMBER MD trajectory files and visualisations.

You’ve just finished a week-long molecular dynamics simulation. You’re excited to see what happened to your protein complex, so you load up the trajectory in VMD and… your protein looks like it’s been through a blender. Pieces are scattered across the screen, water molecules are everywhere, and half your complex seems to have teleported to the other side of the simulation box. This chaos is caused by periodic boundary conditions (PBC).

PBC

PBC is a computational trick that simulates bulk behaviour by treating your simulation box like a repeating tile. When a molecule exits one side, it immediately reappears on the opposite side. This works perfectly for physics as your protein experiences realistic bulk water behaviour.

Continue reading

Understand Large Codebases Faster Using GitIngest

Often as researchers we have to deal with large and ugly codebases – this is not new, I know. Alas, fear not, now we have large language models (LLMs) like ChatGPT and friends which make things a little faster! In this blogpost I will show you how to use GitIngest to do this even faster using your favourite LLM.

No more copy pasting files individually or writing a paragraph explaining the directory structure, or even worse, relying on an LLM to use web search to find the codebase. As the codebase grows, the unreliability of these methods does too. GitIngest makes any “whole” codebase, prompt friendly – one prompt will be all you need!

Continue reading

GPT-5 achieves state-of-the-art chemical intelligence

I have run ChemIQ (our chemical reasoning benchmark) on GPT-5. The model achieves state-of-the-art performance with substantial improvements in the ability to interpret SMILES strings. Read my analysis and initial findings below. Scroll to the end for some cool demos.

Figure 1: Success rates for each model on the ChemIQ reasoning benchmark. Horizontal brackets between adjacent bars indicate the result of a two-tailed McNemar’s test comparing paired outcomes for the same questions. Significance levels are shown as: n.s. (not significant, p ≥ 0.05), * (p < 0.05), ** (p < 0.01), and *** (p < 0.001).

Continue reading

GUI Slop

Previously, I wrote about writing GUI’s for controlling and monitoring experiments. For ML this might be useful for tracking model learning (e.g. the popular weights and biases platform), while in the wet-lab it is great for making experiments simpler and more reliable to run, monitor and record.

And as it turns out, AI is quite good at this!

I have been using VSCode CoPilot in agent mode with Gemini 2.5 Pro to create simple GUIs that can control my experiments, which has proved pretty effective. Although there is clearly a concern when interfacing AI generated code with real hardware (especially if you “vibe code”, that is, just run whatever it generates) in practice it has allowed me to quickly generate tools for testing purposes, cutting the time required for getting a project started from hours to minutes.

As an example, I recently needed to hook up a Helmholtz coil to some custom electronics, centred around a Teensy micro-controller and designed to output a precisely controlled current.

Continue reading

Estimating uncertainty in MD observables using block averaging

When running molecular dynamics (MD) simulations, we are usually interested in measuring an ensemble average of some metric (e.g., RMSD, RMSF, radius of gyration, …) and use this to draw conclusions about the investigated system. While calculating the average value of a metric is straightforward (we can simply measure the metric in each frame and average it) calculating a statistical uncertainty is a little more tricky and often forgotten. The main challange when trying to calculate an uncertainty of MD oveservables is that individual frames of the simulation are not samped independently but they are time correlated (i.e., frame N depends on frame N-1). In this blog post, I will breifly introduce block averaging, a statistical technique to estimate uncertainty in correlated data.

Continue reading

Memory Efficient Clustering of Large Protein Trajectory Ensembles

Molecular dynamics simulations have grown increasingly ambitious, with researchers routinely generating trajectories containing hundreds of thousands or even millions of frames. While this wealth of data offers unprecedented insights into protein dynamics, it also presents a formidable computational challenge: how do you extract meaningful conformational clusters from datasets that can easily exceed available system memory?

Traditional approaches to trajectory clustering often stumble when faced with large ensembles. Loading all pairwise distances into memory simultaneously can quickly consume tens or hundreds of gigabytes of RAM, while conventional PCA implementations require the entire dataset to fit in memory before decomposition can begin. For many researchers, this means either downsampling their precious simulation data or investing in expensive high-memory computing resources.

The solution lies in recognizing that we don’t actually need to hold all our data in memory simultaneously. By leveraging incremental algorithms and smart memory management, we can perform sophisticated dimensionality reduction and clustering on arbitrarily large trajectory datasets using modest computational resources. Let’s explore how three key strategies—incremental PCA, mini-batch clustering, and intelligent memory management—can transform your approach to analyzing large protein ensembles.

Continue reading

Demystifying Git and Merge Conflicts 

Git can be an incredibly effective coding tool, but it can also be an incredibly frustrating one. It has a steep learning curve, but you’ll be a lot better off understanding how it works rather than copying and pasting commands from Stack Overflow or ChatGPT. I’ve been there, and things can go very, very wrong.

What is Git?

Git is a version control system which tracks changes in files within a repository. It lets you maintain different versions of that codebase. Not to be confused with GitHub, which is a Git server, or a remote location which serves as the host to codebases in Git. GitHub provides a user-friendly front-end for managing changes and issue tracking. There are plenty of other Git servers, such as Bitbucket and GitLab.

Tips for using Git

Git is massive, and there’s plenty of tutorials and guides out there to help you learn it. This is far from a comprehensive guide, but these are the commands I use on a regular basis. I’ll go over some of the basics, and then some of the niche ones that I’ve found particularly useful. 

Continue reading

Cheating at Spelling Bee using the command line


The New York Times Spelling Bee is a free online word game, where players must construct as many words as possible using the letters provided in the Bee’s grid, always including the middle letter. Bonus points for using all the letters and creating a pangram.


However, this is the kind of thing which computers are very good at. If you’ve become frustrated trying to decipher the abstruse ways of the bee, let the command line help.

tl;dr:

grep -iP "^[adlokec]{6,}$" /usr/share/dict/words |grep a | awk '{ print length, $0 }' |sort -n |cut -f2 -d" "
Continue reading

Slurm and Snakemake: a match made in HPC heaven

Snakemake is an incredibly useful workflow management tool that allows you to run pipelines in an automated way. Simply put, it allows you to define inputs and outputs for different steps that depend on each other, Snakemake will then run jobs only when the required inputs have been generated by previous steps. A previous blog post by Tobias is a good introduction to it – https://www.blopig.com/blog/2021/12/snakemake-better-workflows-with-your-code/

However, often pipelines are computationally intense and we would like to run them on our HPC. Snakemake allows us to do this on slurm using an extension package called snakemake-executor-plugin-slurm.

Continue reading