Category Archives: Code

Cheating at Spelling Bee using the command line

The New York Times Spelling Bee is a free online word game, where players must construct as many words as possible using the letters provided in the Bee’s grid, always including the middle letter. Bonus points for using all the letters and creating a pangram.

However, this is the kind of thing which computers are very good at. If you’ve become frustrated trying to decipher the abstruse ways of the bee, let the command line help.

tl;dr:

grep -iP "^[adlokec]{6,}$" /usr/share/dict/words |grep a | awk '{ print length, $0 }' |sort -n |cut -f2 -d" "

Continue reading →

Slurm and Snakemake: a match made in HPC heaven

Snakemake is an incredibly useful workflow management tool that allows you to run pipelines in an automated way. Simply put, it allows you to define inputs and outputs for different steps that depend on each other, Snakemake will then run jobs only when the required inputs have been generated by previous steps. A previous blog post by Tobias is a good introduction to it – https://www.blopig.com/blog/2021/12/snakemake-better-workflows-with-your-code/.

However, often pipelines are computationally intense and we would like to run them on our HPC. Snakemake allows us to do this on slurm using an extension package called snakemake-executor-plugin-slurm.

Continue reading →

AI generated linkers™: a tutorial

In molecular biology cutting and tweaking a protein construct is an often under-appreciated essential operation. Some protein have unwanted extra bits. Some protein may require a partner to be in the correct state, which would be ideally expressed as a fusion protein. Some protein need parts replacing. Some proteins disfavour a desired state. Half a decade ago, toolkits exists to attempt to tackle these problems, and now with the advent of de novo protein generation new, powerful, precise and way less painful methods are here. Therefore, herein I will discuss how to generate de novo inserts and more with RFdiffusion and other tools in order to quickly launch a project into the right orbit.
Furthermore, even when new methods will have come out, these design principles will still apply —so ignore the name of the de novo tool used.

Continue reading →

NVIDIA Reimagines CUDA for Python Developers

According to GitHub’s Open Source Survey, Python has officially become the world’s most popular programming language in 2024 – ultimately surpassing JavaScript. Due to its exceptional popularity, NVIDIA announced Python support for its CUDA toolkit at last year’s GTC conference, marking a major leap in the accessibility of GPU computing. With the latest update (https://nvidia.github.io/cuda-python/latest/) and for the first time, developers can write Python code that runs directly on NVIDIA GPUs without the need for intermediate C or C++ code.

Historically tied to C and C++, CUDA has found its way into Python code through third-party wrappers and libraries. Now, the arrival of native support means a smoother, more intuitive experience.

This paradigm shift opens the door for millions of Python programmers – including our scientific community – to build powerful AI and scientific tools without having to switch languages or learn legacy syntax.

Continue reading →

Debugging code for science: Fantastic Bugs and Where to Find Them.

The simulation results make no sense … My proteins are moving through walls and this dihedral angle is negative; my neural network won’t learn anything, I’ve tried for days to install this software and I still get an error.

Feel familiar? Welcome to scientific programming. Bugs aren’t just annoying roadblocks – they’re mysterious phenomena that make you question your understanding of reality itself. If you’ve ever found yourself debugging scientific code, you know it’s a different beast compared to traditional software engineering. In the commercial software world, a bug might mean a button doesn’t work or data isn’t saved correctly. In scientific computing, a bug might mean your climate model predicts an ice age next Tuesday, or your protein folding algorithm creates molecular structures that couldn’t possibly exist in our universe (cough).

Continue reading →

GUI Science

There comes a point in every software-inclined lab based grad student’s life, where they think: now is the time to write a GUI for my software, to make it fast, easy to use, generalised so that others will use it too, the new paradigm for how to do research, etc. etc.

Of course such delusions of grandeur are rarely indulged, but when executed they certainly can produce useful outputs, as a well designed (or even, designed) GUI can improve an experimentalist’s life profoundly by simplifying, automating and standardising data acquisition, and by reducing the time to see results, allowing for shorter iteration cycles (this is known in engineering as “Design, Build, Test, Learn” cycle, in software it’s called “Coding”).

Having written a few GUIs in my time, I thought it might be helpful to share some experience I have, though it is by no means broad.

Continue reading →

Combining Multiple Comparisons Similarity plots for statistical tests

Following on from my previous blopig post, Garrett gave the very helpful suggestion of combining Multiple Comparisons Similarity (MCSim) plots to reduce information redundancy. For example, this an MCSim plot from my previous blog post:

This plot shows effect sizes from a statistical test (specifically Tukey HSD) between mean absolute error (MAE) scores for different molecular featurization methods on a benchmark dataset. Red shows that the method on the y-axis has a greater average MAE score than the method on the x-axis; blue shows the inverse. There is redundancy in this plot, as the same information is displayed in both the upper and lower triangles. Instead, we could plot both the effect size and the p-values from a test on the same MCSim.

Continue reading →

Geometric Deep Learning meets Forces & Equilibrium

Introduction

Graphs provide a powerful mathematical framework for modelling complex systems, from molecular structures to social networks. In many physical and geometric problems, nodes represent particles, and edges encode interactions, often acting like springs. This perspective aligns naturally with Geometric Deep Learning, where learning algorithms leverage graph structures to capture spatial and relational patterns.

Understanding energy functions and the forces derived from them is fundamental to modelling such systems. In physics and computational chemistry, harmonic potentials, which penalise deviations from equilibrium positions, are widely used to describe elastic networks, protein structures, and even diffusion processes. The Laplacian matrix plays a key role in these formulations, linking energy minimisation to force computations in a clean and computationally efficient way.

By formalising these interactions using matrix notation, we gain not only a compact representation but also a foundation for more advanced techniques such as Langevin dynamics, normal mode analysis, and graph-based neural networks for physical simulations.

Continue reading →

The Good (and limitations) of using a Local CoPilot with Ollama

Interactive code editors have been around for a while now, and tools like GitHub Copilot have woven their way into most development pipelines, and for good reason. They’re easy to use, exceptionally helpful (at certain tasks), and have undeniably made life as a developer smoother. Recently, I decided to switch away from relying on GitHub Copilot in favour of a local model for a few key reasons. While I don’t use it all the time, it has proven to be a useful option in many situations. In this blog post, I’ll go over why I made the switch, how I set it up, and share a bit about my experience so far.

Continue reading →

Narrowing the gap between machine learning scoring functions and free energy perturbation using augmented data

I’m delighted to report our collaboration (Ísak Valsson, Matthew Warren, Aniket Magarkar, Phil Biggin, & Charlotte Deane), on “Narrowing the gap between machine learning scoring functions and free energy perturbation using augmented data”, has been published in Nature’s Communications Chemistry (https://doi.org/10.1038/s42004-025-01428-y).

During his MSc dissertation project in the Department of Statistics, University of Oxford, OPIG member Ísak Valsson developed an attention-based GNN to predict protein-ligand binding affinity called “AEV-PLIG”. It featurizes a ligand’s atoms using Atomic Environment Vectors to describe the Protein-Ligand Interactions found in a 3D protein-ligand complex. AEV-PLIG is free and open source (BSD 3-Clause), available from GitHub at https://github.com/oxpig/AEV-PLIG, and forked at https://github.com/bigginlab/AEV-PLIG.

Continue reading →

Oxford Protein Informatics Group

or "OPIG" to friends

Category Archives: Code

Cheating at Spelling Bee using the command line

Slurm and Snakemake: a match made in HPC heaven

AI generated linkers™: a tutorial

NVIDIA Reimagines CUDA for Python Developers

Debugging code for science: Fantastic Bugs and Where to Find Them.

GUI Science

Combining Multiple Comparisons Similarity plots for statistical tests

Geometric Deep Learning meets Forces & Equilibrium

Introduction

The Good (and limitations) of using a Local CoPilot with Ollama

Narrowing the gap between machine learning scoring functions and free energy perturbation using augmented data