Tag Archives: machine learning

CryoEM is now the dominant technique for solving antibody structures

Last year, the Structural Antibody Database (SAbDab) listed a record-breaking 894 new antibody structures, driven in no small part by the continued efforts of the researchers to understand SARS-CoV-2.

Fig. 1: The aggregate growth in antibody structure data (all methods) over time. Taken from http://opig.stats.ox.ac.uk/webapps/newsabdab/sabdab/stats/ on 25th May 2022.

In this blog post I wanted to highlight the major driving force behind this curve – the huge increase in cryo electron microscopy (cryoEM) data – and the implications of this for the field of structure-based antibody informatics.

Continue reading

From code to molecules: The future of chemical synthesis

In June, after I finish my PhD, I will be joining Chemify, a new startup based in Glasgow that aims to make chemical synthesis universally accessible, reproducible and fully automated using AI and robotics. After previously talking about “Why you should care about startups as a researcher” and a quick guide on “Commercialising your research: Where to start?” on this blog, I have now joined a science-based startup fresh out of university myself.

Chemify is a spinout from the University of Glasgow originating from the group of Prof. Lee Cronin. The core of the technology is the chemical programming language χDL (pronounced “chi DL”) that, in combination with a natural language processing AI that reads and understands chemical synthesis procedures, can be used to plan and autonomously executed chemical reactions on robotic hardware. The Cronin group has also already build the modular robotic hardware needed to carry out almost any chemical reaction, the “Chemputer”. Due to the flexibility of both the Chemputer and the χDL language, Chemify has already shown that the applications go way beyond simple synthesis and can be applied to drug formulation, the discovery of new materials or the optimisation of reaction conditions.

Armed with this transformational software and hardware, Chemify is now fully operational and is hiring exceptional talent into their labs in Glasgow. I am excited to see how smart, AI-driven automation techniques like Chemify will change how small scale chemical synthesis and chemical discovery more broadly is done in the future. I’m super excited to be part of the journey.

Paper review: “EquiBind”

Molecular docking helps us understand how small-molecules interact with proteins. This is especially useful in early drug development stages such as target identification and compound screening. Quick and accurate docking software allows researchers to focus their attention on a smaller set of lead molecules for further testing. Traditionally, docking software has employed first principles from physics and chemistry. Recently, deep learning has become all the rage for molecular docking, maybe motivated by the successful application of deep learning to molecular folding.

Method

EquiBind is a deep learning unconstrained docking method which models a fixed receptor and a ligand with selected rotatable bonds. It predicts the binding pocket and the ligand’s conformation within the pocket in one go. Under the hood, EquiBind employs two great ideas from a recent ICLR 2022 Paper: a SE3-invariant graph neural network based architecture and the idea to generate fixed sets of matching key points to define a rotation and translation between receptor and ligand. In addition, the authors innovate a fast method to project a deformed ligand onto the space spanned by the rotatable bonds of a pre-generated ligand conformation.

Continue reading

3 Key Questions to Think About When Designing Proteins Computationally

We have reached the era of design, not just ‘hunting’. Particularly exciting to me is the de novo design of proteins, which have a wide and ever increasing range of applications from therapeutics to consumer products, biomanufacturing to biomaterials. Protein design has been a) enabled by decades of research that contributed to our understanding of protein sequence, structure & function and b) accelerated by computational advances – capturing the information we have learned from proteins and representing it for computers and machine learning algorithms.

In this blog post, I will discuss three key methodological considerations for computational protein design:

  1. Sequence- vs structure-based design
  2. ML- vs physics-based design
  3. Target-agnostic vs target-aware design
Continue reading

How to turn a SMILES string into a molecular graph for Pytorch Geometric

Despite some of their technical issues, graph neural networks (GNNs) are quickly being adopted as one of the state-of-the-art methods for molecular property prediction. The differentiable extraction of molecular features from low-level molecular graphs has become a viable (although not always superior) alternative to classical molecular representation techniques such as Morgan fingerprints and molecular descriptor vectors.

But molecular data usually comes in the sequential form of labeled SMILES strings. It is not obvious for beginners how to optimally transform a SMILES string into a structured molecular graph object that can be used as an input for a GNN. In this post, we show how to convert a SMILES string into a molecular graph object which can subsequently be used for graph-based machine learning. We do so within the framework of Pytorch Geometric which currently is one of the best and most commonly used Python-based GNN-libraries.

We divide our task into three high-level steps:

  1. We define a function that maps an RDKit atom object to a suitable atom feature vector.
  2. We define a function that maps an RDKit bond object to a suitable bond feature vector.
  3. We define a function that takes as its input a list of SMILES strings and associated labels and then uses the functions from 1.) and 2.) to create a list of labeled Pytorch Geometric graph objects as its output.
Continue reading

Issues with graph neural networks: the cracks are where the light shines through

Deep convolutional neural networks have lead to astonishing breakthroughs in the area of computer vision in recent years. The reason for the extraordinary performance of convolutional architectures in the image domain is their strong ability to extract informative high-level features from visual data. For prediction tasks on images, this has lead to superhuman performance in a variety of applications and to an almost universal shift from classical feature engineering to differentiable feature learning.

Unfortunately, the picture is not quite as rosy yet in the area of molecular machine learning. Feature learning techniques which operate directly on raw molecular graphs without intermediate feature-engineering steps have only emerged in the last few years in the form of graph neural networks (GNNs). GNNs, however, still have not managed to definitively outcompete and replace more classical non-differentiable molecular representation methods such as extended-connectivity fingerprints (ECFPs). There is an increasing awareness in the computational chemistry community that GNNs have not quite lived up to the initial hype and still suffer from a number of technical limitations.

Continue reading

AlphaFold 2 is here: what’s behind the structure prediction miracle

Nature has now released that AlphaFold 2 paper, after eight long months of waiting. The main text reports more or less what we have known for nearly a year, with some added tidbits, although it is accompanied by a painstaking description of the architecture in the supplementary information. Perhaps more importantly, the authors have released the entirety of the code, including all details to run the pipeline, on Github. And there is no small print this time: you can run inference on any protein (I’ve checked!).

Have you not heard the news? Let me refresh your memory. In November 2020, a team of AI scientists from Google DeepMind  indisputably won the 14th Critical Assessment of Structural Prediction competition, a biennial blind test where computational biologists try to predict the structure of several proteins whose structure has been determined experimentally but not publicly released. Their results were so astounding, and the problem so central to biology, that it took the entire world by surprise and left an entire discipline, computational biology, wondering what had just happened.

Continue reading

Out-of-distribution generalisation and scaffold splitting in molecular property prediction

The ability to successfully apply previously acquired knowledge to novel and unfamiliar situations is one of the main hallmarks of successful learning and general intelligence. This capability to effectively generalise is amongst the most desirable properties a prediction model (or a mind, for that matter) can have.

In supervised machine learning, the standard way to evaluate the generalisation power of a prediction model for a given task is to randomly split the whole available data set X into two sets – a training set X_{\text{train}} and a test set X_{\text{test}}. The model is then subsequently trained on the examples in the training set X_{\text{train}} and afterwards its prediction abilities are measured on the untouched examples in the test set X_{\text{test}} via a suitable performance metric.

Since in this scenario the model has never seen any of the examples in X_{\text{test}} during training, its performance on X_{\text{test}} must be indicative of its performance on novel data X_{\text{new}} which it will encounter in the future. Right?

Continue reading

NeurIPS 2020: Chemistry / Biology papers

Another blog post, another look at accepted papers for a major ML conference. NeurIPS joins the other major machine learning conferences (and others) in moving virtual this year, running from 6th – 12th December 2020. In a continuation of past posts (ICML 2020, NeurIPS 2019), I will highlight several of potential interest to the chem-/bio-informatics communities

The list of accepted papers can be found here, with 1,903 papers accepted out of 9,467 submissions (20% acceptance rate).

In addition to the main conference, there are several workshops highly related to the type of research undertaken in OPIG: Machine Learning in Structural Biology and Machine Learning for Molecules.

The usual caveat: given the large number of papers, these were selected either by “accident” (i.e. I stumbled across them in one way or another) or through a basic search (e.g. Ctrl+f “molecule”). If you find any I have missed, please reach out and I will update accordingly.

Continue reading

ICML 2020: Chemistry / Biology papers

ICML is one of the largest machine learning conferences and, like many other conferences this year, is running virtually from 12th – 18th July.

The list of accepted papers can be found here, with 1,088 papers accepted out of 4,990 submissions (22% acceptance rate). Similar to my post on NeurIPS 2019 papers, I will highlight several of potential interest to the chem-/bio-informatics communities. As before, given the large number of papers, these were selected either by “accident” (i.e. I stumbled across them in one way or another) or through a basic search (e.g. Ctrl+f “molecule”).

Continue reading