The business of health: research and funding from academia to big pharma

In a world in which the probability of clinical success is just 10%-20% for new medicines, pharmaceutical multinationals increasingly turn to academia and biotech as a source of “de-risked” technology for their pipelines. This exchange of ideas, equity and capital depends on firm relationships between entities having apparently divergent interests: from not-for-profit research to international commerce.

As a former pharma contract negotiator, I spent much of my past life attempting to find common ground with university researchers and biotech leadership teams. In 2021, I had the privilege of returning to academia in the UK after a prolonged hiatus, and – more recently – of working with start-ups. In this blog, I will comment on some of the surprising trends I have observed in how pharma, biotech and academics balance the conduct of meaningful research with the requirements of their respective sectors. The views herein are entirely my own.


Therapeutic antibodies and their function

Last week, during a poster session in the Department of Statistics, I had an interesting discussion with Martin Buttenschoen (working on the other side of the group) regarding the difference between small molecules and antibodies as therapeutics. This discussion made me realise that even though I’m working on antibody engineering and developability, I could use a little refresher on approved therapeutic antibodies and their mechanisms of action.

In case you also need this bigger picture, or want to get excited about therapeutic antibodies yourself, I will summarise the target, the development process, the molecular function, and the administration for three successful therapeutic antibodies.


AI in Academic Writing: Ethical Considerations and Best Practices

I don’t need to tell you how popular AI, and LLMs in particular, has become in recent years. Alongside this rapid growth comes uncharted territory, especially with respect to plagiarism and integrity. As we adapt to a rapidly changing technological climate, we become increasingly reliant on AI. Need some help phrasing an email? Ask ChatGPT. Need to create a packing list for an upcoming camping trip? Get an AI-based task manager. So naturally, when we’re faced with the daunting and admittedly tedious task of writing papers, we question whether we can offload some of that work to a machine. As with many things, the question is not simply whether or not you should use AI in your writing process; it’s how you choose to do so.

When To Use AI

  1. Grammar and readability
    Particularly useful for those who are writing in a second language, or for those who struggle to write in their first language (my high school English teachers would place me firmly in this category), LLMs can be used beneficially to identify awkward phrasing, flag excessively complex sentences, and identify and fix grammatical errors.
  2. Formatting and structure
    LLMs can take care of some of the tedious, repetitive work with respect to formatting and structure. They are particularly useful for generating LaTeX templates for figures, tables, equations, and general structure. You can also use them to check that you’re matching a specific journal’s standards. For example, you can give an LLM a sample of articles from a target publication, and ask it to note the structure of these papers. Then, give it your work and ask it to make general, larger-scale suggestions to ensure that your work aligns structurally with articles typical of that journal.
  3. Reference management
    Although the references should be read and cited by an author, various management tasks like creating properly formatted BibTeX entries can be handled by LLMs. Additionally, you can use LLMs to do a sanity check to ensure that your references are an accurate reflection of the source material they are referring to. However, they should not be used to summarise the source and create references on their own.
  4. Summarising large volumes of literature
    If you’re reviewing large volumes of literature, LLMs can help summarise papers efficiently and point you in the right direction. Although you should always cite and refer back to the original source, LLMs can distill key points from long, dense papers, organise notes, and extract important takeaways from datasets and figures.

Regardless of how you use AI, it is important to keep a record of all instances of such AI use throughout your research, including use during coding. Some journals will make you explicitly declare the use of AI tools, but even if it is not required, this kind of record-keeping is considered good practice.

When Not to Use AI

  1. Big-picture thinking and narrative development
    Academic papers are not solely about presenting information; they are about constructing an argument, building a narrative flow, and presenting a compelling case. LLMs are not particularly good at replicating human creativity, so that work is best left to the authors. Additionally, it is dishonest to claim these important aspects of writing as your own if they are not written directly by you.
  2. Direct copy-paste
    Although AI tools may suggest minor edits, you should never directly copy-and-paste larger selections of AI-generated text. If the ethical concerns described in (1) do not persuade you as they should, there are now plenty of tools being used to detect AI-generated text by various academic institutions and journals. Although some scholars do tend to lean on AI as a more collaborative tool, transparency is key. 
  3. Source of knowledge
    LLMs don’t actually “know” anything; they generate responses based on probability. As a result, they have a tendency to “hallucinate,” or present false information as fact, to misrepresent or oversimplify complex concepts, and to lack precise technical accuracy. They may also be biased based on the sources they were trained on. Peer-reviewed sources should be the go-to for actual information. If you use LLMs to summarise something, always refer back to the original text when using that information in your work.
  4. Full citation generation
    As discussed above, although AI can be used to summarise sources, it is not a reliable source of direct citations. All references should be created by hand and verified manually.
  5. General over-reliance
    From the research design process to the final stages of writing and editing, you should generally refrain from an over-reliance on AI. Although LLMs can be powerful tools for automating various lower-level tasks, they are not a substitute for critical thinking, originality, or domain expertise, and they are not a full-fledged co-author of your work. The intellectual contribution and ownership of ideas remain in the hands of the human authors.

For further and more official guidance, check out the ethical framework for the use of AI in academic research published in Nature Machine Intelligence. This framework outlines three criteria for the responsible use of LLMs in scholarship, summarised as follows:

  1. Human vetting to guarantee the accuracy and integrity of the work
  2. Substantial human contribution across all areas of the work
  3. Acknowledgement and transparency of AI use

Confidence in ML models

Recently, I have been interested in adding a confidence metric to the predictions made by a machine learning model I have been working on. In this blog post, I will outline a few strategies I have been exploring to do this. Powerful deep learning models like AlphaFold are great not only for the predictions they make but also for the confidence measures they generate, which give the user a sense of how much to trust each prediction.


The Sprawl: Slogs in Scribing and Software

“Dead shopping malls rise like mountains beyond mountains. And there’s no end in sight.”

Régine Chassagne

Sometimes I wonder whether my PhD would have been simpler if I had broken up the findings into three smaller papers. In the end there were 7 main figures, 7 supplementary figures, 5 supplementary tables and one supplementary data section in one solitary publication: the contents of a 3 year 3 month tour through the helper T cell response to the inner proteins of the flu virus. The experimental work comprised crystal structures, cell assays, tetramer staining and TCR sequencing. During the following years, as it was batted back and forth between last authors, different journals and reviewers, I continually reworked the figures and added extra bioinformatic analyses. I was fortunate that others in the lab kindly performed some in vivo experiments which helped cement the findings. It all started in January 2014, but the paper wasn’t published until July 2020. There are many terms which could be used to describe how the process of writing and re-writing felt as it dragged on through my 3 year post doc; for the purpose of this very public blog I will refer to it as “a slog.”


Combining Multiple Comparisons Similarity plots for statistical tests

Following on from my previous blopig post, Garrett gave the very helpful suggestion of combining Multiple Comparisons Similarity (MCSim) plots to reduce information redundancy. For example, this is an MCSim plot from my previous blog post:

This plot shows effect sizes from a statistical test (specifically Tukey HSD) between mean absolute error (MAE) scores for different molecular featurization methods on a benchmark dataset. Red shows that the method on the y-axis has a greater average MAE score than the method on the x-axis; blue shows the inverse. There is redundancy in this plot, as the same information is displayed in both the upper and lower triangles. Instead, we could plot both the effect size and the p-values from a test on the same MCSim.
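One way to picture the combined layout is a single square matrix with effect sizes above the diagonal and p-values below it. The following is only a minimal sketch in plain Python, assuming the pairwise results are already available as two square nested lists. The helper name `combine_mcsim`, the variable names, and all the numbers are made up for illustration and are not taken from the original plots.

```python
import math

def combine_mcsim(effect, pvals):
    """Merge two square pairwise matrices into one.

    Cells above the diagonal take their value from `effect`,
    cells below the diagonal take their value from `pvals`,
    and the diagonal is NaN (a method is not compared with itself).
    """
    n = len(effect)
    if len(pvals) != n or any(len(row) != n for row in list(effect) + list(pvals)):
        raise ValueError("expected two square matrices of the same size")
    combined = [[math.nan] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j > i:      # upper triangle: effect size
                combined[i][j] = effect[i][j]
            elif j < i:    # lower triangle: p-value
                combined[i][j] = pvals[i][j]
    return combined

# Hypothetical results for three featurization methods
effect = [[0.0, 0.2, -0.1],
          [-0.2, 0.0, -0.3],
          [0.1, 0.3, 0.0]]
pvals = [[1.0, 0.04, 0.60],
         [0.04, 1.0, 0.01],
         [0.60, 0.01, 1.0]]
combined = combine_mcsim(effect, pvals)
```

When plotting such a matrix, the two triangles would need separate colour scales, since effect sizes and p-values are not on a comparable range.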


Geometric Deep Learning meets Forces & Equilibrium

Introduction

Graphs provide a powerful mathematical framework for modelling complex systems, from molecular structures to social networks. In many physical and geometric problems, nodes represent particles, and edges encode interactions, often acting like springs. This perspective aligns naturally with Geometric Deep Learning, where learning algorithms leverage graph structures to capture spatial and relational patterns.

Understanding energy functions and the forces derived from them is fundamental to modelling such systems. In physics and computational chemistry, harmonic potentials, which penalise deviations from equilibrium positions, are widely used to describe elastic networks, protein structures, and even diffusion processes. The Laplacian matrix plays a key role in these formulations, linking energy minimisation to force computations in a clean and computationally efficient way.

By formalising these interactions using matrix notation, we gain not only a compact representation but also a foundation for more advanced techniques such as Langevin dynamics, normal mode analysis, and graph-based neural networks for physical simulations.
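To make the link between the Laplacian, the energy, and the forces concrete, here is a small sketch in plain Python. It assumes the simplest possible setting: each node has a single scalar displacement u_i from equilibrium, and each edge (i, j) is a spring with stiffness k_ij, so that E = 1/2 Σ k_ij (u_i − u_j)² = 1/2 uᵀLu and F = −Lu. The function names and the three-node chain are illustrative choices, not taken from the post.

```python
def laplacian(n, springs):
    """Weighted graph Laplacian L = D - A for springs given as (i, j, k)."""
    L = [[0.0] * n for _ in range(n)]
    for i, j, k in springs:
        L[i][i] += k
        L[j][j] += k
        L[i][j] -= k
        L[j][i] -= k
    return L

def matvec(M, v):
    """Matrix-vector product for nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def energy(L, u):
    """Harmonic energy E = 1/2 u^T L u."""
    return 0.5 * sum(ui * lui for ui, lui in zip(u, matvec(L, u)))

def forces(L, u):
    """Forces as the negative gradient of the energy: F = -L u."""
    return [-f for f in matvec(L, u)]

# Hypothetical 3-node chain 0 -- 1 -- 2, both springs with stiffness 1.0
springs = [(0, 1, 1.0), (1, 2, 1.0)]
L = laplacian(3, springs)
u = [0.0, 1.0, 0.0]  # only the middle node is displaced

E = energy(L, u)
F = forces(L, u)

# The same energy computed as a direct sum over springs, for comparison
E_pairwise = 0.5 * sum(k * (u[i] - u[j]) ** 2 for i, j, k in springs)
```

In this toy case the displaced middle node is pulled back towards its neighbours, the neighbours are pulled towards it, and the forces sum to zero, as expected for purely internal interactions.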


The Good (and limitations) of using a Local CoPilot with Ollama

Interactive code editors have been around for a while now, and tools like GitHub Copilot have woven their way into most development pipelines, and for good reason. They’re easy to use, exceptionally helpful (at certain tasks), and have undeniably made life as a developer smoother. Recently, I decided to switch away from relying on GitHub Copilot in favour of a local model for a few key reasons. While I don’t use it all the time, it has proven to be a useful option in many situations. In this blog post, I’ll go over why I made the switch, how I set it up, and share a bit about my experience so far.


Narrowing the gap between machine learning scoring functions and free energy perturbation using augmented data

I’m delighted to report our collaboration (Ísak Valsson, Matthew Warren, Aniket Magarkar, Phil Biggin, & Charlotte Deane), on “Narrowing the gap between machine learning scoring functions and free energy perturbation using augmented data”, has been published in Nature’s Communications Chemistry (https://doi.org/10.1038/s42004-025-01428-y).


During his MSc dissertation project in the Department of Statistics, University of Oxford, OPIG member Ísak Valsson developed an attention-based GNN to predict protein-ligand binding affinity called “AEV-PLIG”. It featurizes a ligand’s atoms using Atomic Environment Vectors to describe the Protein-Ligand Interactions found in a 3D protein-ligand complex. AEV-PLIG is free and open source (BSD 3-Clause), available from GitHub at https://github.com/oxpig/AEV-PLIG, and forked at https://github.com/bigginlab/AEV-PLIG.


Estimating the Generalisability of Machine Learning Models in Drug Discovery

Machine learning (ML) has significantly advanced key computational tasks in drug discovery, including virtual screening, binding affinity prediction, protein-ligand structure prediction (co-folding), and docking. However, the extent to which these models generalise beyond their training data is often overestimated due to shortcomings in benchmarking datasets. Existing benchmarks frequently fail to account for similarities between the training and test sets, leading to inflated performance estimates. This issue is particularly pronounced in tasks where models tend to memorise training examples rather than learning generalisable biophysical principles. The figure below demonstrates two examples of model performance decreasing with increased dissimilarity between training and test data, for co-folding (left) and binding affinity prediction (right).
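As a rough illustration of what "similarity between the training and test sets" can mean for ligands, one simple measure is the highest Tanimoto similarity between a test molecule and any training molecule. The sketch below assumes each molecule is represented as a set of "on" fingerprint bits. The fingerprints are invented, the function names are hypothetical, and the similarity measures used in the post itself may well differ.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)

def max_train_similarity(test_fp, train_fps):
    """Similarity of a test item to its nearest training item."""
    return max(tanimoto(test_fp, fp) for fp in train_fps)

# Invented fingerprints: sets of bit indices that are switched on
train = [{1, 2, 3, 4}, {10, 11, 12}]
test = [{1, 2, 3, 5}, {20, 21}]

nearest = [max_train_similarity(fp, train) for fp in test]
```

Grouping test items by this nearest-neighbour similarity and reporting performance per group is one way to check whether accuracy drops as test data move away from the training set.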
