Monthly Archives: May 2022

GitHub actions can be useful

GitHub actions is a (relatively) novel GitHub feature that allows you to run code on GitHub when a predefined event is triggered. The most widespread use case for GitHub actions is for Continuous Integration, because it allows you to automatically test your code on any machine immediately after each push. For a great tutorial on how to use it for this see here.

But you can do so much more with them!! Basically you can set up any workflow to run after any event. An event is basically when a specific activity on GitHub happens, while a workflow is basically the script you want to run after the event has happened. For a full list of the events you can use see here. Workflow scripts are written in a .yml file and should be saved within the .github/worflows directory within your repository. I am incapable of writing a better tutorial for these than what is already on their documentation, but I will show a copy of a workflow script I recently put together and walk you through it.

In one of my previous blog posts I wrote about how to upload your code to PyPI. Hopefully I convinced you that this is quite easy, but it does require a few steps that you may not want to be doing every time you come up with a new feature (find a bug) and have to re-upload it. Luckily, you don’t have to!! Just stick the code into a GitHub actions workflow so it will automatically re-upload it for you. Here is the script I use for this:

Continue reading

CryoEM is now the dominant technique for solving antibody structures

Last year, the Structural Antibody Database (SAbDab) listed a record-breaking 894 new antibody structures, driven in no small part by the continued efforts of the researchers to understand SARS-CoV-2.

Fig. 1: The aggregate growth in antibody structure data (all methods) over time. Taken from http://opig.stats.ox.ac.uk/webapps/newsabdab/sabdab/stats/ on 25th May 2022.

In this blog post I wanted to highlight the major driving force behind this curve – the huge increase in cryo electron microscopy (cryoEM) data – and the implications of this for the field of structure-based antibody informatics.

Continue reading

Why can a man not lift himself by pulling up on his bootstrap hypothesis test?

This blogpost highlights a typical mistake when performing the bootstrap hypothesis test. Bootstrapping is a method of resampling data to estimate measures of variability, such as confidence intervals or variance. 

In the simplest form of the bootstrap, assume you have a set of values 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. You want to estimate the mean and variability of the mean using these data. The recipe is as follows:

Continue reading

Linux Horror stories vol II: Automatic drivers update

As promised, I will tell you about another Linux Horror Story: The Nvidia driver automatic update that breaks your machine. This is a recurrent problem that I have suffered so many times that I tend to disable all Nvidia updates just to avoid it. Unfortunately, I forgot to do so on my new laptop, so it happened once more. 

It all started when I tried to connect my dual monitor to my laptop, as I have been doing for the last 8 months. But the SO did not recognize the monitor. After unplugging and plugging my monitor a few times and rebooting my machine several times, I started thinking that it may be a drivers-related problem, so I just executed the command nvidia-smi to check if the GPU drivers were working. A familiar error message confirmed my fears: 

NVIDIA-SMI has failed because it couldn’t communicate with the Nvidia driver. 

 Make sure that the latest NVIDIA driver is installed and running. 

If you are lucky enough, this is a consequence of the driver update and rebooting the machine will make it work again. Unfortunately, it was not my case, so I started the process of uninstalling and reinstalling the drivers. To do so, in an Ubuntu machine, you only need to use the following two commands.

Continue reading

Make your code do more, with less

When you wrangle data for a living, you start to wonder why everything takes so darn long. Through five years of introspection, I have come to conclude that two simple factors limit every computational project. One is, of course, your personal productivity. Your time of focused work, minus distractions (and yes, meetings figure here), times your energy and mental acuity. All those things you have little control over, unfortunately. But the second is the productivity of your code and tools. And this, in principle, is a variable that you have full control over.

Even quick calculations, when applied to tens of millions of sequences, can take quite some time!

This is a post about how to increase your productivity, by helping you navigate all those instances when the progress bar does not seem to go fast enough. I want to discuss actionable tools to make your code run faster, and generate more results, with less effort, in less time. Instructions to tinker less and think more, so you can do the science that you truly want to be doing. And, above all, I want to give out advice that is so counter-intuitive that you should absolutely consider following it.

Continue reading

From code to molecules: The future of chemical synthesis

In June, after I finish my PhD, I will be joining Chemify, a new startup based in Glasgow that aims to make chemical synthesis universally accessible, reproducible and fully automated using AI and robotics. After previously talking about “Why you should care about startups as a researcher” and a quick guide on “Commercialising your research: Where to start?” on this blog, I have now joined a science-based startup fresh out of university myself.

Chemify is a spinout from the University of Glasgow originating from the group of Prof. Lee Cronin. The core of the technology is the chemical programming language χDL (pronounced “chi DL”) that, in combination with a natural language processing AI that reads and understands chemical synthesis procedures, can be used to plan and autonomously executed chemical reactions on robotic hardware. The Cronin group has also already build the modular robotic hardware needed to carry out almost any chemical reaction, the “Chemputer”. Due to the flexibility of both the Chemputer and the χDL language, Chemify has already shown that the applications go way beyond simple synthesis and can be applied to drug formulation, the discovery of new materials or the optimisation of reaction conditions.

Armed with this transformational software and hardware, Chemify is now fully operational and is hiring exceptional talent into their labs in Glasgow. I am excited to see how smart, AI-driven automation techniques like Chemify will change how small scale chemical synthesis and chemical discovery more broadly is done in the future. I’m super excited to be part of the journey.

Paper review: “EquiBind”

Molecular docking helps us understand how small-molecules interact with proteins. This is especially useful in early drug development stages such as target identification and compound screening. Quick and accurate docking software allows researchers to focus their attention on a smaller set of lead molecules for further testing. Traditionally, docking software has employed first principles from physics and chemistry. Recently, deep learning has become all the rage for molecular docking, maybe motivated by the successful application of deep learning to molecular folding.

Method

EquiBind is a deep learning unconstrained docking method which models a fixed receptor and a ligand with selected rotatable bonds. It predicts the binding pocket and the ligand’s conformation within the pocket in one go. Under the hood, EquiBind employs two great ideas from a recent ICLR 2022 Paper: a SE3-invariant graph neural network based architecture and the idea to generate fixed sets of matching key points to define a rotation and translation between receptor and ligand. In addition, the authors innovate a fast method to project a deformed ligand onto the space spanned by the rotatable bonds of a pre-generated ligand conformation.

Continue reading

Better Models Through Molecular Standardization

“Cheminformatics is hard.”

— Paul Finn

I would add: “Chemistry is nuanced”… Just as there are many different ways of drawing the same molecule, SMILES is flexible enough to allow us to write the same molecule in different ways. While canonical SMILES can resolve this problem, we sometimes have different problem. In some situations, e.g., in machine learning, we need to map all these variants back to the same molecule. We also need to make sure we clean up our input molecules and eliminate invalid or incomplete structures.

Different Versions of the Same Molecule: Salt, Neutral or Charged?

Sometimes, a chemical supplier or compound vendor provides a salt of the compound, e.g., sodium acetate, but all we care about is the organic anion, i.e., the acetate. Very often, our models are built on the assumption we have only one molecule as input—but a salt will appear as two molecules (the sodium ion and the acetate ion). We might also have been given just the negatively-charged acetate instead of the neutral acetic acid.

Tautomers

Another important chemical phenomenon exists where apparently different molecules with identical heavy atoms and a nearby hydrogen can be easily interconverted: tautomers. By moving just one hydrogen atom and exchanging adjacent bond orders, the molecule can convert from one form to another. Usually, one tautomeric form is most stable. Warfarin, a blood-thinning drug, can exist in solution in 40 distinct tautomeric forms. A famous example is keto-enol tautomerism: for example, ethenol (not ethanol) can interconvert with the ketone form. When one form is more stable than the other form(s), we need to make sure we convert the less stable form(s) into the most stable form. Ethenol, a.k.a. vinyl alcohol, (SMILES: ‘C=CO[H]’), will be more stable when it is in the ketone form (SMILES: ‘CC(=O)([H])’):

from IPython.display import SVG # to use Scalar Vector Graphics (SVG) not bitmaps, for cleaner lines

import rdkit
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw # to draw molecules
from rdkit.Chem.Draw import IPythonConsole # to draw inline in iPython
from rdkit.Chem import rdDepictor  # to generate 2D depictions of molecules
from rdkit.Chem.Draw import rdMolDraw2D # to draw 2D molecules using vectors

AllChem.ReactionFromSmarts('[C:1]-[C:2](-[O:3]-[H:4])>>[C:1]-[C:2](=[O:3])(-[H:4])')
Continue reading

Feeding a drove of hungry OPIGlets

In preparing for battle I have always found that plans are useless, but planning is indispensable.

Dwight D. Eisenhower

Following the previous post about OPIG retreat 2022, and having received numerous requests for recipes, I thought I’d document the process of ensuring that 24 people are kept fed and happy. Recipes at the foot of the post.

Disclaimer – these recipes are entirely my own interpretations, adapted where necessary to suit a range of dietary requirements. They are in no way authentic to any national cuisines and are not intended to be.

Disclaimer II: The Disclaiming – all measurements are approximate. I rarely write down recipes or use precise measurements. Taste as you go, and don’t be afraid to add more salt.

Continue reading

MM(PB/GB)SA – a quick start guide

The MMPBSA.py program distributed Open Source in the AmberTools21 package is a powerful tool for end-point free energy calculations on molecular dynamics simulations. In its most simple application, MMPBSA.py is used to calculate the free energy difference between the bound and unbound states of a protein-ligand complex. In order to use it, however, you need to have an Amber-compliant trajectory file, which means you need to setup and run your simulation fairly carefully.

While the Amber Manual and the MMPBSA tutorial provide lots of helpful information, putting everything together into a full pipeline taking you from structure to a free energy is another story. The goal for this guide is to provide a schematic you can follow to get started. This guide assumes you are familiar with molecular dynamics simulations and the theory of MMPBSA.

The easiest way I have found to do this, using only Open Source software, is:

(1) Download your raw PDB file. If you are lucky and it contains a complete set of heavy atoms (excepting perhaps a terminal OXT here and there, which tleap will add for you in step 3) you are good to go.

(2) Use the H++ webserver to determine the protonation states of each residue and add hydrogens as needed. This webserver is particularly convenient because it will allow you to directly download a PQR file that you can use to generate your starting topology and coordinates. Note that you have various options to choose the pH and internal/external dielectric constants for the calculation.

(3) Use tleap to generate your topology (prmtop) and coordinate (mdcor) files for your simulations. Do not forget that you will need not only the prmtop for the solvated complex, but also a dry prmtop for each of the complex, receptor, and ligand. Load the PQR file from H++ and do not forget to set PBRadii *to the same value for all prmtops*. A typical tleap script for setting up your solvated complex would look something like:

Continue reading