Monthly Archives: May 2025

Estimating uncertainty in MD observables using block averaging

When running molecular dynamics (MD) simulations, we are usually interested in measuring an ensemble average of some metric (e.g., RMSD, RMSF, radius of gyration, …) and using it to draw conclusions about the system under investigation. While calculating the average value of a metric is straightforward (we can simply measure the metric in each frame and average it), calculating a statistical uncertainty is a little trickier and often forgotten. The main challenge when trying to calculate an uncertainty for MD observables is that individual frames of the simulation are not sampled independently but are time-correlated (i.e., frame N depends on frame N-1). In this blog post, I will briefly introduce block averaging, a statistical technique for estimating uncertainty in correlated data.
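To make the idea concrete, here is a minimal sketch of block averaging in plain Python. It is an illustration under stated assumptions, not the code from the full post: the function names `block_average` and `ar1_series` are hypothetical, and the data are a synthetic correlated series (an AR(1) process) standing in for a per-frame observable such as RMSD.

```python
import math
import random
import statistics


def block_average(values, n_blocks):
    """Return (mean, standard error) estimated from n_blocks contiguous blocks."""
    block_len = len(values) // n_blocks
    if n_blocks < 2 or block_len < 1:
        raise ValueError("need at least 2 blocks with at least 1 frame each")
    # Trailing frames that do not fill a whole block are dropped.
    block_means = [
        statistics.fmean(values[i * block_len:(i + 1) * block_len])
        for i in range(n_blocks)
    ]
    mean = statistics.fmean(block_means)
    # Standard error of the mean, treating the block means as independent samples.
    sem = statistics.stdev(block_means) / math.sqrt(n_blocks)
    return mean, sem


def ar1_series(n_frames, phi=0.95, seed=0):
    """Synthetic time-correlated data (assumption: stands in for an MD observable)."""
    rng = random.Random(seed)
    x, series = 0.0, []
    for _ in range(n_frames):
        x = phi * x + rng.gauss(0.0, 1.0)  # each frame depends on the previous one
        series.append(x)
    return series


series = ar1_series(20000)
naive_sem = statistics.stdev(series) / math.sqrt(len(series))
print(f"naive SEM (frames treated as independent): {naive_sem:.3f}")
for n_blocks in (2000, 200, 20):
    mean, sem = block_average(series, n_blocks)
    print(f"{n_blocks:5d} blocks of {len(series) // n_blocks:5d} frames: SEM = {sem:.3f}")
```

With correlated data, the estimated standard error typically grows as the blocks get longer and levels off once the block length exceeds the correlation time; the naive estimate, which treats every frame as independent, tends to be too small.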

Continue reading

Memory Efficient Clustering of Large Protein Trajectory Ensembles

Molecular dynamics simulations have grown increasingly ambitious, with researchers routinely generating trajectories containing hundreds of thousands or even millions of frames. While this wealth of data offers unprecedented insights into protein dynamics, it also presents a formidable computational challenge: how do you extract meaningful conformational clusters from datasets that can easily exceed available system memory?

Traditional approaches to trajectory clustering often stumble when faced with large ensembles. Loading all pairwise distances into memory simultaneously can quickly consume tens or hundreds of gigabytes of RAM, while conventional PCA implementations require the entire dataset to fit in memory before decomposition can begin. For many researchers, this means either downsampling their precious simulation data or investing in expensive high-memory computing resources.

The solution lies in recognizing that we don’t actually need to hold all our data in memory simultaneously. By leveraging incremental algorithms and smart memory management, we can perform sophisticated dimensionality reduction and clustering on arbitrarily large trajectory datasets using modest computational resources. Let’s explore how three key strategies—incremental PCA, mini-batch clustering, and intelligent memory management—can transform your approach to analyzing large protein ensembles.
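As a rough illustration of the first two strategies, the sketch below streams data in chunks through scikit-learn's `IncrementalPCA` and `MiniBatchKMeans`, both of which provide `partial_fit`. This is not necessarily the workflow from the full post: the chunk generator `iter_chunks`, the array sizes, and the synthetic features are all assumptions standing in for featurised trajectory frames read from disk.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA

# Hypothetical sizes chosen for illustration only.
N_FRAMES, N_FEATURES, CHUNK = 5000, 60, 500
N_COMPONENTS, N_CLUSTERS = 5, 3


def iter_chunks():
    """Yield (CHUNK, N_FEATURES) arrays; a stand-in for reading a trajectory in pieces."""
    rng = np.random.default_rng(0)  # reseeded on every call, so each pass sees the same data
    centres = rng.normal(scale=5.0, size=(N_CLUSTERS, N_FEATURES))
    for _ in range(N_FRAMES // CHUNK):
        states = rng.integers(N_CLUSTERS, size=CHUNK)
        yield centres[states] + rng.normal(size=(CHUNK, N_FEATURES))


# Pass 1: fit the PCA one chunk at a time, so only one chunk is in memory at once.
ipca = IncrementalPCA(n_components=N_COMPONENTS)
for chunk in iter_chunks():
    ipca.partial_fit(chunk)

# Pass 2: update the cluster centres on the projected chunks.
kmeans = MiniBatchKMeans(n_clusters=N_CLUSTERS, random_state=0, n_init=3)
for chunk in iter_chunks():
    kmeans.partial_fit(ipca.transform(chunk))

# Pass 3: assign labels; only one integer label per frame is kept.
labels = np.concatenate([kmeans.predict(ipca.transform(c)) for c in iter_chunks()])
print(np.bincount(labels, minlength=N_CLUSTERS))
```

The trade-off is extra passes over the data in exchange for a memory footprint set by the chunk size rather than the trajectory length.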

Continue reading

Open Source Pharma: From Idealism to Pragmatic Solutions

In an industry dominated by patents, proprietary data, and the race to get a first-in-class drug, the concept of open source drug development once seemed like an impossible dream. Yet as traditional pharma continues to leave many global health needs unaddressed, particularly for diseases affecting low- and middle-income countries [1,2], the open source model has evolved from idealistic theory to pragmatic reality. In this post, I’ll walk through how open source drug development has overcome key obstacles of funding and intellectual property (IP) management to deliver real-world solutions.

Continue reading

An insight into mega-conferences – attending ESCMID Global 2025

I suppose it really hit me when the Viennese border control officer asked, “Ah, you must be here for the conference?” That’s when I realised: this wasn’t just any event. ESCMID Global isn’t your average gathering of lab coat enthusiasts, but rather one of the largest clinical infectious disease conferences in the world. Over 16,000 attendees packed into Vienna for their 35th annual congress.

So, was flying across Europe to attend the Glastonbury of conferences, minus the mud, plus the microbes, worth it?

Well… it depends on what you’re hoping to get out of it. If you’re an academic, you might find that a lot of the sessions lean heavily towards the clinical side of things. On the plus side, it made it easier to narrow down my schedule – with over a dozen sessions happening at any one time, a bit of decisiveness goes a long way. Personally, I found the big-name, high-level keynotes and annual updates from organisations like EUCAST the most accessible and informative.

Continue reading

Attending LMRL @ ICLR 2025

I recently attended the Learning Meaningful Representations of Life (LMRL) workshop at ICLR 2025. The goal of LMRL is to highlight machine learning methods which extract meaningful or useful properties from unstructured biological data, with an eye towards building a virtual cell. I presented my paper which demonstrates how standard Transformers can learn to meaningfully represent 3D coordinates when trained on protein structures. Each paper submitted to LMRL had to include a “meaningfulness statement” – a short description of how the work presents a meaningful representation.

Continue reading

Interested in Research Software Engineering? We’re hiring!

I’ve been working in OPIG as a Research Software Engineer for several years now, and it’s been a fantastic experience. Sadly, my time here is coming to an end, which means we’re looking to hire a new Research Software Engineer to take over! As a computational research group, we write a lot of scientific software and strive to ensure everything we do is open-source and as accessible as possible to both academic and industrial users. Many of these tools are still in use long after the people who wrote them have left the group, and are actively maintained to ensure they remain useful to researchers. Supporting the development and deployment of all of these tools are our Research Software Engineers. This is a great opportunity to work at the intersection of academia and industry, where you will be able to both contribute to world-leading research and maximise the impact of that research by ensuring the tools produced are both accessible and sustainable.

This is a full-time, permanent position in OPIG, based in the Department of Statistics at the University of Oxford. For more details, or to apply, you can find the job details here.

Demystifying Git and Merge Conflicts 

Git can be an incredibly effective coding tool, but it can also be an incredibly frustrating one. It has a steep learning curve, but you’ll be a lot better off understanding how it works rather than copying and pasting commands from Stack Overflow or ChatGPT. I’ve been there, and things can go very, very wrong.

What is Git?

Git is a version control system that tracks changes to files within a repository and lets you maintain different versions of a codebase. It should not be confused with GitHub, which is a Git server: a remote location that hosts Git repositories. GitHub also provides a user-friendly front-end for managing changes and tracking issues. There are plenty of other Git servers, such as Bitbucket and GitLab.

Tips for using Git

Git is massive, and there are plenty of tutorials and guides out there to help you learn it. This is far from a comprehensive guide, but these are the commands I use on a regular basis. I’ll go over some of the basics, and then some of the niche ones that I’ve found particularly useful.

Continue reading