Category Archives: Uncategorized

On the Joys of vim-like Browsing

Reflections on Pointlessness

One of the great delights in this life is pointless optimisation. Point-ful optimisation has its place of course; it is right and proper and sensible, and, well, useful, and it also does, when first achieved, yield considerable satisfaction. But I have found I soon adjust to the newly more efficient (and equally drab) normality, and so the spell fades quickly.

Not so with pointless optimisation. Pointless optimisation, once attained, is a preternaturally persistent source of joy that keeps on giving indefinitely. Particularly if it involves acquiring a skill of some description; if the task optimised is frequent; and if the time so saved could not possibly compensate for the time and effort sunk into the optimisation process. Words cannot convey the triumph of completing a common task with hard-earned skill and effortless efficiency, knowing full-well it makes no difference whatsoever in the grand scheme of things.

Continue reading

The Sprawl: Slogs in Scribing and Software

“Dead shopping malls rise like mountains beyond mountains. And there’s no end in sight.”

Régine Chassagne

Sometimes I wonder would my PhD have been simpler if I had broken up the findings into three smaller papers. In the end there were 7 main figures, 7 supplementary figures, 5 supplementary tables and one supplementary data section in one solitary publication. The contents of a 3 year 3 month tour through the helper T cell response to the inner proteins of the flu virus. The experimental worked comprised crystal structures, cell assays, tetramer staining and TCR sequencing. During the following years as it was batted back and forth between last authors, different journals and reviewers I continually reworked the figures and added extra bioinformatic analyses. I was fortunate that others in the lab kindly performed some in vivo experiments which helped cement the findings. It all started in January 2014, but the paper wasn’t published until July 2020. There are many terms which could be used to describe how the process of writing and re-writing felt as it dragged on through my 3 year post doc, for the purpose of this very public blog I will refer to it as, “a slog.

Continue reading

De novo protein padlocks

Binding a desired protein tightly is important for biotechnology. Recent advances in deep learning have allowed the de novo design of (mostly α-helical) binding protein, sidestepping the laborious process of raising antibodies or nanobodies or evolving affibodies, darpins or similar. These deep learning designed binders will bind with okay affinity, but what if the affinity required were much stronger?
<Enter autocatalytic isopeptide bonds>

Continue reading

Generating Haikus with Llama 3.2

At the recent OPIG retreat, I was tasked with writing the pub quiz. The quiz included five rounds, and it’s always fun to do a couple “how well do you know your group?” style rounds. Since I work with Transformers, I thought it would be fun to get AI to create Haiku summaries of OPIGlet research descriptions from the website.

AI isn’t as funny as it used to be, but it’s a lot easier to get it to write something coherent. There are also lots of knobs you can turn like temperature, top_p, and the details of the prompt. I decided to use Meta’s new Llama 3.2-3B-Instruct model which is publicly available on Hugging Face. I ran it locally using vllm, and instructed it to write a haiku for each member’s description using a short script which parses the html from the website.

Continue reading

Testing python (or any!) command line applications

Through our work in OPIG, many of our projects come in the form of code bases written in Python. These can be many different things like databases, machine learning models, and other software tools. Often, the user interface for these tools is developed as both a web app and a command line application. Here, I will discuss one of my favourite tools for testing command-line applications: prysk!

Continue reading

The XChem trove of protein–small-molecules structures not in the PDB

The XChem facility at Diamond Light Source is truly impressive feat of automation in fragment-based drug discovery, where visitors comes clutching a styrofoam ice box teeming with apo-form protein crystals, which the shifter soaks with compounds from one or more fragment libraries and a robot at the i04-1 beamline kindly processes each of the thousands of crystal-laden pins, while the visitor enjoys the excellent food in the Diamond canteen (R22). I would especially recommend the jambalaya. Following data collection, the magic of data processing happens: the PanDDA method is used to find partial occupancy in the density, which is processed semi-automatedly and most open targets are uploaded in the Fragalysis web app allowing the ligand binding to be studied and further compounds elaborated. This collection of targets bound to hundreds of small molecules is a true treasure trove of data as many have yet to be deposited in the PDB, making it a perfect test set for algorithm design: fragments are notorious fickle to model and deep learning models cannot cheat by remembering these from the protein database.

Continue reading

Tracking the change in ML performance for popular small molecule benchmarks

The power of machine learning (ML) techniques has captivated the field of small molecule drug discovery. Increasingly, researchers and organisations have employed ML to create more accurate algorithms to improve the efficiency of the discovery process.

To be published, methods have to prove they have improved upon others. Often, methods are tested against the same benchmarks within a field, allowing us to track progress over time. To explore the rate of improvement, I curated the performance on three popular benchmarks. The first benchmark is CASF 2016, used to test the accuracy of methods that predict the binding affinity of experimental determined protein-ligand complexes. Accuracy was measured using the Pearson’s R value between predicted and experimental affinity values.

Continue reading

Working with PDB Structures in Pandas

Pandas is one of my favourite data analysis tools working in Python! The data frames offer a lot of power and organization to any data analysis task. Here at OPIG we work with a lot of protein structure data coming from PDB files. In the following article I will go through an example of how I use pandas data frames to analyze PDB data.

Continue reading

The workings of Fragmenstein’s RDKit neighbour-aware minimisation

Fragmenstein is a Python module that combine hits or position a derivative following given templates by being very strict in obeying them. This is done by creating a “monster”, a compound that has the atomic positions of the templates, which then reanimated by very strict energy minimisation. This is done in two steps, first in RDKit with an extracted frozen neighbourhood and then in PyRosetta within a flexible protein. The mapping for both combinations and placements are complicated, but I will focus here on a particular step the minimisation, primarily in answer to an enquiry, namely how does the RDKit minimisation work.

Continue reading

Le Tour de Farce 2023

16:30 BST 27/06/2023 Oxford, UK. A large number of scientists were spotting riding bicycles across town, to the consternation of onlookers. The event was the Oxford Protein Informatics Group (OPIG) “tour de farce” 2023. A circular bike ride from the Department of Statistics, to The Up in Arms (Marston), The Trout Inn (Godstow), The Perch (Port Meadow) and The Holly Bush (Osney Island). This spurred great bystander-anxiety due to one of a multitude of factors: the impressive size of the jovial horde, the erraticism of the cycling, the deplorable maintenance of certain bikes, and the unchained bizarrerie of the overheard dialogue.

Dissociated Press.
Continue reading