Monthly Archives: July 2025

Antibody developability datasets

Next to binding the antigen with high affinity, antibodies for therapeutic purposes need to be developable. These developability properties includes high expression, high stability, low aggregation, low immunogenicity, and low non-specificity [1]. These properties are often linked and therefore optimising for one property might be at the expense of another. Machine learning methods have been build to guide the optimistation process of one or multiple developability properties.

Performance of these methods is often limited by the amount and type of data available for training. These dataset contain experimental determined scores of biophysical assays related to developability. Some common experimental assays are described in a previous blog post by Matthew Raybould [2]. Here I will discuss some (commonly) used and new dataset related to antibody developability. This list is not exhaustive but might help you start understanding more about antibody developability.

Continue reading →

Publishing 101

Scientists pride themselves on clear, logical and concise communication. So naturally, the process for publishing our research involves an absurd number of formalities, like coming up with 700 slightly different ways to ‘thank the reviewer for their insightful comment’. Nevertheless, I’m told this is all a necessary part of spreading your beautiful researcher butterfly wings—and frankly, I’m enough years into my DPhil to stop questioning every quirk of academia. However, the current protocol for new researchers wanting to learn the moves to this bizarre dance seems to be begging postdocs/ old timers for examples of cover letters, marked-up manuscripts, and reviewer responses. To attempt to save everyone some time, I thought I’d provide some guidance and templates here.

Continue reading →

A more robust way to split data for protein-ligand tasks?

As I was recently reading through the paper on the PLINDER dataset while preparing for my next project, one of the aspects of the dataset that caught my attention was how the dataset splits were done to ensure minimal leakage for various protein-ligand tasks that PLINDER could be used for. They had task-specific splits as the notion of data leakage differed from task to task. For instance, in rigid body docking, having a similar protein in the train and test may not be considered leakage if the binding pocket location, conformation, or pocket interactions with a ligand are significantly different. On the other hand, in the case of co-folding, having similar proteins in the train and test sets would be considered data leakage, as predicted protein structures play a significant role in accuracy scoring. The effort that went into creating task-specific splits resonates strongly with OPIG’s view on ensuring minimal data leakage for validating the generalisability of protein-ligand models. However, it may become tedious to create task-specific dataset splits for every protein-ligand task when dealing with a large suite of such tasks. This had me thinking of potential avenues to streamline the dataset split process across the tasks, and one way to do this is by using protein-ligand interaction fingerprints or PLIFs.

Continue reading →

GUI Slop

Previously, I wrote about writing GUI’s for controlling and monitoring experiments. For ML this might be useful for tracking model learning (e.g. the popular weights and biases platform), while in the wet-lab it is great for making experiments simpler and more reliable to run, monitor and record.

And as it turns out, AI is quite good at this!

I have been using VSCode CoPilot in agent mode with Gemini 2.5 Pro to create simple GUIs that can control my experiments, which has proved pretty effective. Although there is clearly a concern when interfacing AI generated code with real hardware (especially if you “vibe code”, that is, just run whatever it generates) in practice it has allowed me to quickly generate tools for testing purposes, cutting the time required for getting a project started from hours to minutes.

As an example, I recently needed to hook up a Helmholtz coil to some custom electronics, centred around a Teensy micro-controller and designed to output a precisely controlled current.

Continue reading →

Oxford Protein Informatics Group

or "OPIG" to friends

Monthly Archives: July 2025

Antibody developability datasets

Publishing 101

A more robust way to split data for protein-ligand tasks?

GUI Slop