When running molecular dynamics (MD) simulations, we are usually interested in measuring an ensemble average of some metric (e.g., RMSD, RMSF, radius of gyration, …) and use this to draw conclusions about the investigated system. While calculating the average value of a metric is straightforward (we can simply measure the metric in each frame and average it) calculating a statistical uncertainty is a little more tricky and often forgotten. The main challange when trying to calculate an uncertainty of MD oveservables is that individual frames of the simulation are not samped independently but they are time correlated (i.e., frame N depends on frame N-1). In this blog post, I will breifly introduce block averaging, a statistical technique to estimate uncertainty in correlated data.
Continue readingCategory Archives: Python
Memory Efficient Clustering of Large Protein Trajectory Ensembles
Molecular dynamics simulations have grown increasingly ambitious, with researchers routinely generating trajectories containing hundreds of thousands or even millions of frames. While this wealth of data offers unprecedented insights into protein dynamics, it also presents a formidable computational challenge: how do you extract meaningful conformational clusters from datasets that can easily exceed available system memory?
Traditional approaches to trajectory clustering often stumble when faced with large ensembles. Loading all pairwise distances into memory simultaneously can quickly consume tens or hundreds of gigabytes of RAM, while conventional PCA implementations require the entire dataset to fit in memory before decomposition can begin. For many researchers, this means either downsampling their precious simulation data or investing in expensive high-memory computing resources.
The solution lies in recognizing that we don’t actually need to hold all our data in memory simultaneously. By leveraging incremental algorithms and smart memory management, we can perform sophisticated dimensionality reduction and clustering on arbitrarily large trajectory datasets using modest computational resources. Let’s explore how three key strategies—incremental PCA, mini-batch clustering, and intelligent memory management—can transform your approach to analyzing large protein ensembles.
Continue readingSlurm and Snakemake: a match made in HPC heaven
Snakemake is an incredibly useful workflow management tool that allows you to run pipelines in an automated way. Simply put, it allows you to define inputs and outputs for different steps that depend on each other, Snakemake will then run jobs only when the required inputs have been generated by previous steps. A previous blog post by Tobias is a good introduction to it – https://www.blopig.com/blog/2021/12/snakemake-better-workflows-with-your-code/.
However, often pipelines are computationally intense and we would like to run them on our HPC. Snakemake allows us to do this on slurm using an extension package called snakemake-executor-plugin-slurm.
Continue readingAI generated linkers™: a tutorial
In molecular biology cutting and tweaking a protein construct is an often under-appreciated essential operation. Some protein have unwanted extra bits. Some protein may require a partner to be in the correct state, which would be ideally expressed as a fusion protein. Some protein need parts replacing. Some proteins disfavour a desired state. Half a decade ago, toolkits exists to attempt to tackle these problems, and now with the advent of de novo protein generation new, powerful, precise and way less painful methods are here. Therefore, herein I will discuss how to generate de novo inserts and more with RFdiffusion and other tools in order to quickly launch a project into the right orbit.
Furthermore, even when new methods will have come out, these design principles will still apply —so ignore the name of the de novo tool used.
NVIDIA Reimagines CUDA for Python Developers
According to GitHub’s Open Source Survey, Python has officially become the world’s most popular programming language in 2024 – ultimately surpassing JavaScript. Due to its exceptional popularity, NVIDIA announced Python support for its CUDA toolkit at last year’s GTC conference, marking a major leap in the accessibility of GPU computing. With the latest update (https://nvidia.github.io/cuda-python/latest/) and for the first time, developers can write Python code that runs directly on NVIDIA GPUs without the need for intermediate C or C++ code.
Historically tied to C and C++, CUDA has found its way into Python code through third-party wrappers and libraries. Now, the arrival of native support means a smoother, more intuitive experience.
This paradigm shift opens the door for millions of Python programmers – including our scientific community – to build powerful AI and scientific tools without having to switch languages or learn legacy syntax.
Continue readingGUI Science
There comes a point in every software-inclined lab based grad student’s life, where they think: now is the time to write a GUI for my software, to make it fast, easy to use, generalised so that others will use it too, the new paradigm for how to do research, etc. etc.
Of course such delusions of grandeur are rarely indulged, but when executed they certainly can produce useful outputs, as a well designed (or even, designed) GUI can improve an experimentalist’s life profoundly by simplifying, automating and standardising data acquisition, and by reducing the time to see results, allowing for shorter iteration cycles (this is known in engineering as “Design, Build, Test, Learn” cycle, in software it’s called “Coding”).
Having written a few GUIs in my time, I thought it might be helpful to share some experience I have, though it is by no means broad.
Continue readingCombining Multiple Comparisons Similarity plots for statistical tests
Following on from my previous blopig post, Garrett gave the very helpful suggestion of combining Multiple Comparisons Similarity (MCSim) plots to reduce information redundancy. For example, this an MCSim plot from my previous blog post:

This plot shows effect sizes from a statistical test (specifically Tukey HSD) between mean absolute error (MAE) scores for different molecular featurization methods on a benchmark dataset. Red shows that the method on the y-axis has a greater average MAE score than the method on the x-axis; blue shows the inverse. There is redundancy in this plot, as the same information is displayed in both the upper and lower triangles. Instead, we could plot both the effect size and the p-values from a test on the same MCSim.
Narrowing the gap between machine learning scoring functions and free energy perturbation using augmented data
I’m delighted to report our collaboration (Ísak Valsson, Matthew Warren, Aniket Magarkar, Phil Biggin, & Charlotte Deane), on “Narrowing the gap between machine learning scoring functions and free energy perturbation using augmented data”, has been published in Nature’s Communications Chemistry (https://doi.org/10.1038/s42004-025-01428-y).

During his MSc dissertation project in the Department of Statistics, University of Oxford, OPIG member Ísak Valsson developed an attention-based GNN to predict protein-ligand binding affinity called “AEV-PLIG”. It featurizes a ligand’s atoms using Atomic Environment Vectors to describe the Protein-Ligand Interactions found in a 3D protein-ligand complex. AEV-PLIG is free and open source (BSD 3-Clause), available from GitHub at https://github.com/oxpig/AEV-PLIG, and forked at https://github.com/bigginlab/AEV-PLIG.
Molecule Networks: data visualization using PyVis
Over the past few years I have explored different data visualization strategies with the goal of rapidly communicating information to medicinal chemists. I have recently fallen in love with “molecule networks” as an intuitive and interactive data visualization strategy. This blog gives a brief tutorial on how to start generating your own molecule networks.

Making pretty, interactive graphs the simple way – Use Plotly.
Using an ESP8266 and some DS18B20 one-wire temperature sensors, I have been automatically recording temperature data from various parts of my pond, to see how it fluctuated with air temperature, depth and filter configuration.

Despite the help I was receiving from the feline fish monitor, I was getting a bit irked at the quality of the graphs I was getting using matplotlib.
Matplotlib has been around since 2003, more than 20 years now. It’s arguably the defacto method of producing graphs in python and it’s not going away. However, it’s also a pain to use and by default produces some quite ugly plots unless you put in the mileage. In fact, when attempting to quickly explore data, Michael L. Waskom’s frustrations with matplotlib were directly related to the production of the seaborn library. “By producing complete graphics from a single function call with minimal
arguments, seaborn facilitates rapid prototyping and exploratory data analysis.”
Seaborn makes use of matplotlib and integrates tightly with pandas provide a neat wrapper for matplotlib functions, allowing you to avoid a lot of the data herding needed to view a graph.
You may think “OK, so seaborn finally tames matplotlib, why should I use anything else?” In short, interactivity. Seaborn and Matplotlib may produce graphs, but a graph alone doesn’t really let you explore the data. If you look at a graph you’re limited to the scale the author thought made sense, you can’t zoom in or out and if one line is behind another, you’re kind of stuck.
Where plotly really shines is with just two lines you can generate your figure and then either save it as the image below, or as an interactive HTML graph such as this.


