Category Archives: Data Science

Nice TCR processing libraries

As someone who works with T cell antigen receptor (TCR) and peptide-major histocompatibility complex (pMHC) data, I have found several Python packages to be very useful for eliminating tedious steps in data cleaning and feature engineering stages.

Continue reading →

New DPhil/PhD Programme in Pharmaceutical Science Joint with GSK!

Many OPIGlets found their way into a DPhil in Protein Informatics through our Systems Approaches to Biomedical Sciences (SABS) Industrial Doctoral Landscape Award, which was open to applicants from 2009 to 2024. This innovative course, based at the MPLS Doctoral Training Centre (DTC), offered six months of intensive taught modules prior to starting PhD-level research, allowing students to upskill across a diverse range of subjects (coding, mathematics, structural biology, etc.) and to go on to do research in areas significantly distinct from their formal undergraduate training. All projects also benefited from direct co-supervision from researchers working in the pharmaceutical industry, ensuring that DPhil projects were in areas with potential for translation to drug discovery. Regrettably, having twice successfully applied for renewal of funding, we were unsuccessful in our bid to renew funding for SABS in 2024.

Happily though, we can now formally announce that our bid for a direct successor to SABS, the Transformative Technologies in Pharmaceutical Sciences IDLA, has been backed by the BBSRC, and we will shortly be opening for applications for entry this October [2026]. As someone who benefited from the interdisciplinary training and industry-adjacency of SABS, I’m thrilled to be a co-director of this new Programme and to help deliver this course to a new generation of talented students.

Continue reading →

Finding 250GB of Missing Storage On My Mac: A Warning For Large Dataset Users

I recently faced a puzzling issue: my 1TB MacBook Pro showed only 150GB free, but disk analyzers could only account for about 500GB of used space. After hours of troubleshooting, I discovered that Spotlight’s search index had ballooned to 233GB, hundreds of times larger than normal.

The Problem

Standard disk analyzers showed that my Mac had 330GB of “Inaccessible Disk Space” and 66GB of “Purgeable Disk Space”, but gave no clear explanation of where my storage had gone. Removing the purgeable space was easy enough with sudo purge, but none of the fixes recommended by ChatGPT (clearing Time Machine snapshots, clearing cached packages with pip cache purge and conda clean --all, and restarting the computer) had any effect on the inaccessible disk space.

Continue reading →

Exploring the Protein Data Bank programmatically

The Worldwide Protein Data Bank (wwPDB or just the PDB to its friends) is a key resource for structural biology, providing a single central repository of protein and nucleic acid structure data. Most researchers interact with the PDB either by downloading and parsing individual entries as mmCIF files (or as legacy PDB files), or by downloading aggregated data, such as the RCSB’s collection in a single FASTA file of all polymer entity sequences. All too often, researchers end up laboriously writing their own file parsers to digest these files. In recent years though, more sophisticated tools have been made available that make it much easier to access only the data that you need.

Continue reading →

A more robust way to split data for protein-ligand tasks?

While reading the paper on the PLINDER dataset in preparation for my next project, one aspect that caught my attention was how the dataset splits were designed to minimise leakage across the various protein-ligand tasks that PLINDER could be used for. The splits are task-specific because the notion of data leakage differs from task to task. For instance, in rigid body docking, having a similar protein in the train and test sets may not be considered leakage if the binding pocket location, conformation, or pocket interactions with a ligand are significantly different. On the other hand, in the case of co-folding, having similar proteins in the train and test sets would be considered data leakage, as predicted protein structures play a significant role in accuracy scoring. The effort that went into creating task-specific splits resonates strongly with OPIG’s view on ensuring minimal data leakage for validating the generalisability of protein-ligand models. However, it may become tedious to create task-specific dataset splits for every protein-ligand task when dealing with a large suite of such tasks. This had me thinking of potential avenues to streamline the dataset split process across the tasks, and one way to do this is by using protein-ligand interaction fingerprints or PLIFs.
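To make the idea concrete, here is a minimal sketch of a PLIF-based split, and not the method used by PLINDER or the one described in the full post. It assumes each complex is summarised as a set of (residue, interaction type) pairs, compares complexes by Tanimoto similarity, and drops from the training set anything too similar to a test complex. The complex IDs, residues, the 0.5 threshold, and the function names are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# A PLIF is modelled here as a set of (residue, interaction type) pairs.
# This representation is an assumption made for illustration only.
Plif = FrozenSet[Tuple[str, str]]


def tanimoto(a: Plif, b: Plif) -> float:
    """Tanimoto (Jaccard) similarity between two interaction fingerprints."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def split_by_plif(
    plifs: Dict[str, Plif], test_ids: Set[str], threshold: float = 0.5
) -> Tuple[List[str], List[str]]:
    """Keep a complex for training only if its similarity to every
    test complex is strictly below `threshold`."""
    test = sorted(test_ids)
    train = [
        cid
        for cid in sorted(plifs)
        if cid not in test_ids
        and all(tanimoto(plifs[cid], plifs[t]) < threshold for t in test)
    ]
    return train, test


# Hypothetical complexes and interactions.
complexes: Dict[str, Plif] = {
    "A": frozenset({("ASP86", "hbond"), ("PHE80", "pistack"), ("LYS33", "hbond")}),
    "B": frozenset({("ASP86", "hbond"), ("PHE80", "pistack"), ("LEU83", "hydrophobic")}),
    "C": frozenset({("GLU12", "ionic"), ("TYR15", "hbond")}),
    "D": frozenset({("ASP86", "hbond"), ("PHE80", "pistack"),
                    ("LYS33", "hbond"), ("LEU83", "hydrophobic")}),
}

train, test = split_by_plif(complexes, test_ids={"A"}, threshold=0.5)
print(train, test)  # ['C'] ['A']
```

Complexes B and D share most of their interactions with the test complex A, so they are excluded from training even though nothing about protein sequence or ligand identity was consulted. The same fingerprint-level criterion could in principle be reused across tasks, which is the appeal of the approach.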

Continue reading →

Memory Efficient Clustering of Large Protein Trajectory Ensembles

Molecular dynamics simulations have grown increasingly ambitious, with researchers routinely generating trajectories containing hundreds of thousands or even millions of frames. While this wealth of data offers unprecedented insights into protein dynamics, it also presents a formidable computational challenge: how do you extract meaningful conformational clusters from datasets that can easily exceed available system memory?

Traditional approaches to trajectory clustering often stumble when faced with large ensembles. Loading all pairwise distances into memory simultaneously can quickly consume tens or hundreds of gigabytes of RAM, while conventional PCA implementations require the entire dataset to fit in memory before decomposition can begin. For many researchers, this means either downsampling their precious simulation data or investing in expensive high-memory computing resources.

The solution lies in recognizing that we don’t actually need to hold all our data in memory simultaneously. By leveraging incremental algorithms and smart memory management, we can perform sophisticated dimensionality reduction and clustering on arbitrarily large trajectory datasets using modest computational resources. Let’s explore how three key strategies—incremental PCA, mini-batch clustering, and intelligent memory management—can transform your approach to analyzing large protein ensembles.
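As a rough illustration of the mini-batch idea only, the sketch below implements a bare-bones mini-batch k-means in pure Python, where each centre is nudged towards newly assigned frames with a per-centre learning rate of 1/count. It is not the implementation from the full post: the synthetic two-dimensional "frames", the batch size, and the initial centres are all assumptions. In practice one would more likely reach for library implementations such as scikit-learn's IncrementalPCA and MiniBatchKMeans, both of which support partial_fit.

```python
import random
from typing import Iterable, Iterator, List, Sequence

Point = List[float]


def batches(frames: Iterable[Sequence[float]], size: int) -> Iterator[List[Point]]:
    """Yield frames in fixed-size chunks so only one chunk is in memory."""
    batch: List[Point] = []
    for frame in frames:
        batch.append(list(frame))
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def minibatch_kmeans(
    frame_batches: Iterable[List[Point]], init_centres: Sequence[Sequence[float]]
) -> List[Point]:
    """Update centres one mini-batch at a time."""
    centres = [list(c) for c in init_centres]
    counts = [0] * len(centres)
    for batch in frame_batches:
        # Assign every frame using the centres as they were at batch start.
        labels = [
            min(range(len(centres)), key=lambda k: sq_dist(centres[k], p))
            for p in batch
        ]
        # Move each centre towards its frames with learning rate 1/count.
        for p, k in zip(batch, labels):
            counts[k] += 1
            eta = 1.0 / counts[k]
            centres[k] = [(1 - eta) * c + eta * x for c, x in zip(centres[k], p)]
    return centres


def synthetic_frames(n: int, seed: int = 0) -> Iterator[Point]:
    """Stand-in for reduced trajectory frames: two well-separated states."""
    rng = random.Random(seed)
    for _ in range(n):
        cx, cy = rng.choice([(0.0, 0.0), (5.0, 5.0)])
        yield [cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3)]


centres = minibatch_kmeans(
    batches(synthetic_frames(2000), size=100),
    init_centres=[(1.0, 1.0), (4.0, 4.0)],  # hypothetical initial guesses
)
print(centres)
```

Because the frames arrive through a generator, memory use is bounded by the batch size rather than by the length of the trajectory, which is the property that matters for large ensembles.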

Continue reading →

Slurm and Snakemake: a match made in HPC heaven

Snakemake is an incredibly useful workflow management tool that allows you to run pipelines in an automated way. Simply put, it allows you to define inputs and outputs for different steps that depend on each other; Snakemake will then run jobs only when the required inputs have been generated by previous steps. A previous blog post by Tobias is a good introduction to it – https://www.blopig.com/blog/2021/12/snakemake-better-workflows-with-your-code/.

However, pipelines are often computationally intensive and we would like to run them on our HPC. Snakemake allows us to do this on Slurm using an extension package called snakemake-executor-plugin-slurm.
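The input/output dependency idea can be sketched with a toy scheduler in plain Python. This is not Snakemake's implementation or its rule syntax: the rule names, file names, and helper functions are hypothetical, and the real tool does considerably more. The sketch only shows how declaring inputs and outputs is enough to derive an execution order and to skip steps whose outputs already exist.

```python
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Dict, List, Set

# Hypothetical pipeline: each rule declares the files it reads and writes.
RULES: Dict[str, Dict[str, List[str]]] = {
    "align": {"inputs": ["reads.fastq"], "outputs": ["aligned.bam"]},
    "call": {"inputs": ["aligned.bam"], "outputs": ["variants.vcf"]},
    "report": {"inputs": ["variants.vcf", "aligned.bam"], "outputs": ["report.html"]},
}


def execution_order(rules: Dict[str, Dict[str, List[str]]]) -> List[str]:
    """Order rules so that every rule runs after the rules producing its inputs."""
    producer = {out: name for name, r in rules.items() for out in r["outputs"]}
    graph = {
        name: {producer[i] for i in r["inputs"] if i in producer}
        for name, r in rules.items()
    }
    return list(TopologicalSorter(graph).static_order())


def jobs_to_run(rules: Dict[str, Dict[str, List[str]]], existing: Set[str]) -> List[str]:
    """Rules, in dependency order, with at least one output file missing."""
    return [
        name
        for name in execution_order(rules)
        if not all(out in existing for out in rules[name]["outputs"])
    ]


print(execution_order(RULES))                             # ['align', 'call', 'report']
print(jobs_to_run(RULES, {"reads.fastq", "aligned.bam"}))  # ['call', 'report']
```

In the second call the alignment output is already present, so only the downstream steps are scheduled.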

Continue reading →

Confidence in ML models

Recently, I have been interested in adding a confidence metric to the predictions made by a machine learning model I have been working on. In this blog post, I will outline a few strategies I have been exploring to do this. Powerful deep learning models like AlphaFold are great not only for the predictions they make, but also for the confidence measures they generate, which give the user a sense of how much to trust each prediction.

Continue reading →

Narrowing the gap between machine learning scoring functions and free energy perturbation using augmented data

I’m delighted to report that our collaboration (Ísak Valsson, Matthew Warren, Aniket Magarkar, Phil Biggin, & Charlotte Deane) on “Narrowing the gap between machine learning scoring functions and free energy perturbation using augmented data” has been published in Nature’s Communications Chemistry (https://doi.org/10.1038/s42004-025-01428-y).

During his MSc dissertation project in the Department of Statistics, University of Oxford, OPIG member Ísak Valsson developed an attention-based GNN to predict protein-ligand binding affinity called “AEV-PLIG”. It featurizes a ligand’s atoms using Atomic Environment Vectors to describe the Protein-Ligand Interactions found in a 3D protein-ligand complex. AEV-PLIG is free and open source (BSD 3-Clause), available from GitHub at https://github.com/oxpig/AEV-PLIG, and forked at https://github.com/bigginlab/AEV-PLIG.

Continue reading →

Estimating the Generalisability of Machine Learning Models in Drug Discovery

Machine learning (ML) has significantly advanced key computational tasks in drug discovery, including virtual screening, binding affinity prediction, protein-ligand structure prediction (co-folding), and docking. However, the extent to which these models generalise beyond their training data is often overestimated due to shortcomings in benchmarking datasets. Existing benchmarks frequently fail to account for similarities between the training and test sets, leading to inflated performance estimates. This issue is particularly pronounced in tasks where models tend to memorise training examples rather than learning generalisable biophysical principles. The figure below demonstrates two examples of model performance decreasing with increased dissimilarity between training and test data, for co-folding (left) and binding affinity prediction (right).

Continue reading →

Oxford Protein Informatics Group

or "OPIG" to friends
