From code to molecules: The future of chemical synthesis

In June, after I finish my PhD, I will be joining Chemify, a new startup based in Glasgow that aims to make chemical synthesis universally accessible, reproducible and fully automated using AI and robotics. Having previously written about “Why you should care about startups as a researcher” and a quick guide on “Commercialising your research: Where to start?” on this blog, I am now about to join a science-based startup fresh out of university myself.

Chemify is a spinout from the University of Glasgow originating from the group of Prof. Lee Cronin. The core of the technology is the chemical programming language χDL (pronounced “chi DL”) which, in combination with a natural language processing AI that reads and understands chemical synthesis procedures, can be used to plan and autonomously execute chemical reactions on robotic hardware. The Cronin group has also already built the modular robotic hardware needed to carry out almost any chemical reaction, the “Chemputer”. Thanks to the flexibility of both the Chemputer and the χDL language, Chemify has already shown that the applications go well beyond simple synthesis and extend to drug formulation, the discovery of new materials and the optimisation of reaction conditions.

Armed with this transformational software and hardware, Chemify is now fully operational and is hiring exceptional talent into its labs in Glasgow. I am keen to see how smart, AI-driven automation techniques like Chemify’s will change how small-scale chemical synthesis, and chemical discovery more broadly, are done in the future. I’m super excited to be part of the journey.

Paper review: “EquiBind”

Molecular docking helps us understand how small molecules interact with proteins. This is especially useful in early drug development stages such as target identification and compound screening. Quick and accurate docking software allows researchers to focus their attention on a smaller set of lead molecules for further testing. Traditionally, docking software has employed first principles from physics and chemistry. Recently, deep learning has become all the rage for molecular docking, perhaps motivated by its successful application to protein folding.

Method

EquiBind is a deep learning method for unconstrained docking which models a fixed receptor and a ligand with selected rotatable bonds. It predicts the binding pocket and the ligand’s conformation within the pocket in one go. Under the hood, EquiBind employs two great ideas from a recent ICLR 2022 paper: an SE(3)-equivariant graph neural network architecture and the generation of fixed sets of matching key points that define a rotation and translation between receptor and ligand. In addition, the authors introduce a fast method to project a deformed ligand onto the space spanned by the rotatable bonds of a pre-generated ligand conformation.
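To make the key-point idea concrete: once two sets of matched 3D points are available, a rotation and translation can be recovered in closed form. The sketch below uses the standard Kabsch (SVD-based) alignment with NumPy. It is a generic illustration rather than code from EquiBind, and the function name `kabsch` and the toy coordinates are assumptions made for this example.

```python
import numpy as np


def kabsch(P, Q):
    """Return (R, t) minimising sum_i ||R @ P[i] + t - Q[i]||^2.

    P and Q are (N, 3) arrays of matched key points.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    p0, q0 = P.mean(axis=0), Q.mean(axis=0)

    # Cross-covariance of the centred point sets.
    H = (P - p0).T @ (Q - q0)
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q0 - R @ p0
    return R, t


# Toy example: rotate four points by 30 degrees about z, then shift them.
theta = np.pi / 6
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([1.0, -2.0, 0.5])
P = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
Q = P @ R_true.T + t_true

R, t = kabsch(P, Q)
print(np.allclose(R, R_true), np.allclose(t, t_true))
```

Because the solution is a handful of matrix operations, it can sit inside a network and be differentiated through, which is what makes key-point matching attractive for one-shot docking.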

Continue reading

Better Models Through Molecular Standardization

“Cheminformatics is hard.”

— Paul Finn

I would add: “Chemistry is nuanced”… Just as there are many different ways of drawing the same molecule, SMILES is flexible enough to allow us to write the same molecule in different ways. While canonical SMILES can resolve this problem, we sometimes have a different problem. In some situations, e.g., in machine learning, we need to map all these variants back to the same molecule. We also need to make sure we clean up our input molecules and eliminate invalid or incomplete structures.
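As a small illustration of the first point, here is a minimal sketch, assuming RDKit is installed, that maps several spellings of ethanol onto one canonical SMILES and rejects a string that cannot be parsed. The helper name `canonical_smiles` is made up for this example.

```python
from rdkit import Chem


def canonical_smiles(smiles):
    """Return RDKit's canonical SMILES, or raise if the input is invalid."""
    mol = Chem.MolFromSmiles(smiles)  # returns None when parsing fails
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    return Chem.MolToSmiles(mol)


# Three different ways of writing ethanol.
variants = ["OCC", "C(O)C", "CCO"]
print({canonical_smiles(s) for s in variants})
```

Canonicalisation only unifies different spellings of the same structure; salts, charge states and tautomers are different structures and need the extra steps discussed in the sections that follow.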

Different Versions of the Same Molecule: Salt, Neutral or Charged?

Sometimes, a chemical supplier or compound vendor provides a salt of the compound, e.g., sodium acetate, but all we care about is the organic anion, i.e., the acetate. Very often, our models are built on the assumption that we have only one molecule as input, but a salt will appear as two molecules (the sodium ion and the acetate ion). We might also have been given just the negatively charged acetate instead of the neutral acetic acid.
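One way to collapse these variants, sketched here under the assumption that RDKit's `rdMolStandardize` module is available, is to keep only the largest fragment and then neutralise it. The helper name `parent_neutral_smiles` is hypothetical, and whether the largest fragment is really the one you care about depends on your data.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def parent_neutral_smiles(smiles):
    """Strip counter-ions and neutralise charges where possible."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    # Keep the largest fragment (drops the sodium ion from a sodium salt).
    mol = rdMolStandardize.LargestFragmentChooser().choose(mol)
    # Add or remove protons to neutralise charges where that is possible.
    mol = rdMolStandardize.Uncharger().uncharge(mol)
    return Chem.MolToSmiles(mol)


# Sodium acetate, bare acetate and acetic acid.
for smi in ["CC(=O)[O-].[Na+]", "CC(=O)[O-]", "CC(=O)O"]:
    print(smi, "->", parent_neutral_smiles(smi))
```

All three inputs should end up as the same neutral molecule, which is what a model expecting a single molecule per record needs.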

Tautomers

Another important chemical phenomenon arises when apparently different molecules, with identical heavy atoms and a nearby hydrogen, can be easily interconverted: tautomerism. By moving just one hydrogen atom and exchanging adjacent bond orders, the molecule can convert from one tautomer to another. Usually, one tautomeric form is most stable. Warfarin, a blood-thinning drug, can exist in solution in 40 distinct tautomeric forms. A famous example is keto-enol tautomerism: ethenol (not ethanol) can interconvert with its keto form. When one form is more stable than the other form(s), we need to make sure we convert the less stable form(s) into the most stable form. Ethenol, a.k.a. vinyl alcohol (SMILES: ‘C=CO[H]’), is more stable in its keto form, acetaldehyde (SMILES: ‘CC(=O)([H])’):

from IPython.display import SVG # to use Scalable Vector Graphics (SVG) not bitmaps, for cleaner lines

import rdkit
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw # to draw molecules
from rdkit.Chem.Draw import IPythonConsole # to draw inline in iPython
from rdkit.Chem import rdDepictor  # to generate 2D depictions of molecules
from rdkit.Chem.Draw import rdMolDraw2D # to draw 2D molecules using vectors

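# enol -> keto: the C=C double bond becomes single, C-O becomes C=O, and the hydroxyl hydrogen moves to the far carbon
AllChem.ReactionFromSmarts('[C:1]=[C:2]-[O:3]-[H:4]>>[H:4]-[C:1]-[C:2]=[O:3]')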
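Hand-written reaction SMARTS are useful for showing what is going on, but for bulk standardisation a tautomer canonicaliser is usually more practical. The sketch below assumes RDKit's `rdMolStandardize.TautomerEnumerator`; the helper name `canonical_tautomer_smiles` is made up here. Note that RDKit picks its canonical tautomer with a scoring scheme, so it is a consistent representative rather than a guaranteed most-stable form.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def canonical_tautomer_smiles(smiles):
    """Map any tautomer of a molecule onto RDKit's canonical tautomer."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    enumerator = rdMolStandardize.TautomerEnumerator()
    return Chem.MolToSmiles(enumerator.Canonicalize(mol))


# Ethenol (enol form) and acetaldehyde (keto form) should map to one string.
print(canonical_tautomer_smiles("C=CO"))
print(canonical_tautomer_smiles("CC=O"))
```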
Continue reading

Feeding a drove of hungry OPIGlets

In preparing for battle I have always found that plans are useless, but planning is indispensable.

Dwight D. Eisenhower

Following the previous post about OPIG retreat 2022, and having received numerous requests for recipes, I thought I’d document the process of ensuring that 24 people are kept fed and happy. Recipes at the foot of the post.

Disclaimer – these recipes are entirely my own interpretations, adapted where necessary to suit a range of dietary requirements. They are in no way authentic to any national cuisines and are not intended to be.

Disclaimer II: The Disclaiming – all measurements are approximate. I rarely write down recipes or use precise measurements. Taste as you go, and don’t be afraid to add more salt.

Continue reading

MM(PB/GB)SA – a quick start guide

The MMPBSA.py program, distributed as Open Source in the AmberTools21 package, is a powerful tool for end-point free energy calculations on molecular dynamics simulations. In its simplest application, MMPBSA.py is used to calculate the free energy difference between the bound and unbound states of a protein-ligand complex. In order to use it, however, you need to have an Amber-compliant trajectory file, which means you need to set up and run your simulation fairly carefully.
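For orientation, the end-point estimate boils down to averaging G(complex) - G(receptor) - G(ligand) over the frames of the trajectory. The sketch below shows only that arithmetic in plain Python. It is not how MMPBSA.py is implemented, the function name is hypothetical, the numbers are invented, and any entropy correction is left out.

```python
from math import sqrt
from statistics import mean, stdev


def binding_free_energy(g_complex, g_receptor, g_ligand):
    """Average per-frame G_complex - G_receptor - G_ligand (kcal/mol).

    Returns (mean, standard error of the mean) over the frames.
    """
    if not (len(g_complex) == len(g_receptor) == len(g_ligand)):
        raise ValueError("All three series must cover the same frames")
    per_frame = [c - r - l for c, r, l in zip(g_complex, g_receptor, g_ligand)]
    avg = mean(per_frame)
    # Standard error needs at least two frames.
    sem = stdev(per_frame) / sqrt(len(per_frame)) if len(per_frame) > 1 else 0.0
    return avg, sem


# Invented per-frame energies, purely for illustration.
dg, err = binding_free_energy(
    [-100.0, -102.0, -98.0],  # complex
    [-60.0, -61.0, -59.0],    # receptor
    [-30.0, -30.0, -30.0],    # ligand
)
print(f"dG_bind = {dg:.2f} +/- {err:.2f} kcal/mol")
```

Frames from an MD trajectory are correlated, so a naive standard error like this one will tend to be optimistic.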

While the Amber Manual and the MMPBSA tutorial provide lots of helpful information, putting everything together into a full pipeline taking you from structure to a free energy is another story. The goal for this guide is to provide a schematic you can follow to get started. This guide assumes you are familiar with molecular dynamics simulations and the theory of MMPBSA.

The easiest way I have found to do this, using only Open Source software, is:

(1) Download your raw PDB file. If you are lucky and it contains a complete set of heavy atoms (excepting perhaps a terminal OXT here and there, which tleap will add for you in step 3) you are good to go.

(2) Use the H++ webserver to determine the protonation states of each residue and add hydrogens as needed. This webserver is particularly convenient because it will allow you to directly download a PQR file that you can use to generate your starting topology and coordinates. Note that you have various options to choose the pH and internal/external dielectric constants for the calculation.

(3) Use tleap to generate your topology (prmtop) and coordinate (inpcrd) files for your simulations. Do not forget that you will need not only the prmtop for the solvated complex, but also a dry prmtop for each of the complex, receptor, and ligand. Load the PQR file from H++ and do not forget to set PBRadii *to the same value for all prmtops*. A typical tleap script for setting up your solvated complex would look something like:

Continue reading

Einops: Powerful library for tensor operations in deep learning

Tobias and I recently gave a talk at the OPIG retreat on tips for using PyTorch. For this we created a tutorial as a Google Colab notebook (link can be found here). I remember rambling about the advantages of implementing your own models over using other people’s code. Well, if I convinced you, einops is for you!

Basically, einops lets you perform operations on tensors using Einstein notation. This package comes with a number of advantages, a few of which I will try to summarise here:

Continue reading

OPIG Retreat 2022

Finally, after two years of social distancing, we were able to continue the tradition of OPIGtreat – a 2-3 day escape to the countryside for a packed schedule of talks and fun.

This year, the lovely YHA Wilderhope Manor in Shropshire was chosen by Lewis, our trip organizer. With a hostel in the middle of nowhere, with no phone signal, this trip promised to be an exciting get-away from our plugged-in lives at the university.

Continue reading

Women in Computing: past, present and what we can do to improve the future.

Computing is one of the only scientific fields which was once female-dominated. In the 30s and 40s, women made up the bulk of the workforce doing complex, tedious calculations in fields including ballistics, astrophysics, aeronautics (think Hidden Figures) and code-breaking. Engineers found that the female computers were far more reliable at such calculations than they were themselves [9]. As computing machines became available, there was no precedent set for the gender of a computer operator, and so the women previously doing the computing became the computer operators [10].

However, this was not to last. As computing became commercialised in the 50s, the skill required for computing work was starting to be recognised. As written in [1]:

“Software company System Development Corp. (SDC) contracted psychologists William Cannon and Dallis Perry to create an aptitude assessment for optimal programmers. Cannon and Perry interviewed 1,400 engineers — 1,200 of them men — and developed a “vocational interest scale,” a personality profile to predict the best potential programmers. Unsurprisingly given their male-dominated test group, Cannon and Perry’s assessment disproportionately identified men as the ideal candidates for engineering jobs. In particular, the test tended to eliminate extroverts and people who have empathy for others. Cannon and Perry’s paper concluded that typical programmers “don’t like people,” forming today’s now pervasive stereotype of a nerdy, anti-social coder.”

Continue reading

OpenMM Setup: Start Simulating Proteins in 5 Minutes

Molecular dynamics (MD) simulations are a good way to explore the dynamical behaviour of a protein you might be interested in. One common problem is that they often have a relatively steep learning curve when using most MD engines.

What if you just want to run a simple, one-off simulation with no fancy enhanced sampling methods? OpenMM Setup is a useful tool for exactly this. It is built on the open-source OpenMM engine and provides an easy-to-install (via conda) GUI that can have you running a simulation in less than 5 minutes. Of course, running a simulation requires careful setting of parameters and familiarity with best practices. That is beyond the scope of this post, but there are many guides out there that can easily be found. Now on to the good stuff: using OpenMM Setup!

When you first run OpenMM Setup, you’ll be greeted by a browser window asking you to choose a structure to use. This can be a crystal structure or a model. Remember, sometimes these will have problems that need fixing like missing density or charged, non-physiological termini that would lead to artefacts, so visual inspection of the input is key! You can then choose the force field and water model you want to use, and tell OpenMM to do some cleaning up of the structure. Here I am running the simulation on hen egg-white lysozyme:

Continue reading

How to prepare a molecule for RDKit

RDKit is very fussy when it comes to inputs in SDF format. Using the SDMolSupplier, we get a significant rate of failure even on curated datasets such as the PDBBind refined set. PyMOL has no such scruples, and with that, I present a function which has proved invaluable to me over the course of my DPhil. For reasons I have never bothered to explore, using PyMOL to convert from sdf to mol2 and back to sdf again (adding in missing hydrogens along the way) will almost always make a molecule safe to import using RDKit:

from pathlib import Path
from pymol import cmd

def py_mollify(sdf, overwrite=False):
    """Use pymol to sanitise an SDF file for use in RDKit.

    Arguments:
        sdf: location of faulty sdf file
        overwrite: whether or not to overwrite the original sdf. If False,
            a new file will be written in the form <sdf_fname>_pymol.sdf

    Returns:
        Path of the sanitised sdf: the original path if overwrite is True,
        else <sdf_fname>_pymol.sdf.
    """
    sdf = Path(sdf).expanduser().resolve()
    # Build output names from the file stem so that any '.sdf' appearing
    # elsewhere in the path is left untouched
    mol2_fname = sdf.with_name(sdf.stem + '_pymol.mol2')
    new_sdf_fname = sdf if overwrite else sdf.with_name(sdf.stem + '_pymol.sdf')
    cmd.reinitialize()  # start from an empty pymol session
    cmd.load(str(sdf))
    cmd.h_add('all')  # add any missing hydrogens
    cmd.save(str(mol2_fname))
    cmd.reinitialize()  # clear the session before reading the mol2 back in
    cmd.load(str(mol2_fname))
    cmd.save(str(new_sdf_fname))
    return new_sdf_fname