Category Archives: Cheminformatics

Post-processing for molecular docking: Assigning the correct bond order using RDKit.

AutoDock4 and AutoDock Vina are the most commonly used open-source software for protein-ligand docking. However, they both rely on a derivative of the “PDB” (Protein Data Base) file format: the “PDBQT” file (Protein Data Bank, Partial Charge (Q), & Atom Type (T)). In addition to the information contained in normal PDB files, PDBQT files have an additional column that lists the partial charge (Q) and the assigned AutoDock atom type (T) for each atom in the molecule. AutoDock atom types offer a more granular differentiation between atoms such as listing aliphatic carbons and aromatic carbons as separate AutoDock atom types.

The biggest drawback about the PDBQT format is that it does not encode for the bond order in molecules explicitly. Instead, the bond order is inferred based on the atom type, distance and angle to nearby atoms in the molecule. For normal sp3 carbons and molecules with mostly single bonds this system works fine, however, for more complex structures containing for example aromatic rings, conjugated systems and hypervalent atoms such as sulphur, the bond order is often not displayed correctly. This leads to issues downstream in the screening pipeline when molecules suddenly change their bond order or have to be discarded after docking because of impossible bond orders.

The solution to this problem is included in RDKit: The AssignBondOrdersFromTemplate function. All you need to do is load the original molecule used for docking as a template molecule and the docked pose PDBQT file into RDKIT as a PDB, without the bond order information. Then assign the original bond order from your template molecule. The following code snippet covers the necessary functions and should help you build a more accurate and reproducible protein-ligand docking pipeline:

#import RDKit AllChem
from rdkit import Chem
from rdkit.Chem import AllChem


#load original molecule from smiles
SMILES_STRING = "CCCCCCCCC" #the smiles string of your ligand
template = Chem.MolFromSmiles(SMILES_STRING)

#load the docked pose as a PDB file
loc_of_docked_pose = "docked_pose_mol.pdb" #file location of the docked pose converted to PDB file
docked_pose = AllChem.MolFromPDBFile(loc_of_docked_pose)

#Assign the bond order to force correct valence
newMol = AllChem.AssignBondOrdersFromTemplate(template, docked_pose)

#Add Hydrogens if desired. "addCoords = True" makes sure the hydrogens are added in 3D. This does not take pH/pKa into account. 
newMol_H = Chem.AddHs(newMol, addCoords=True)

#save your new correct molecule as a sdf file that encodes for bond orders correctly
output_loc = "docked_pose_assigned_bond_order.sdf" #output file name
Chem.MolToMolFile(newMol_H, output_loc)

How to turn a SMILES string into a molecular graph for Pytorch Geometric

Despite some of their technical issues, graph neural networks (GNNs) are quickly being adopted as one of the state-of-the-art methods for molecular property prediction. The differentiable extraction of molecular features from low-level molecular graphs has become a viable (although not always superior) alternative to classical molecular representation techniques such as Morgan fingerprints and molecular descriptor vectors.

But molecular data usually comes in the sequential form of labeled SMILES strings. It is not obvious for beginners how to optimally transform a SMILES string into a structured molecular graph object that can be used as an input for a GNN. In this post, we show how to convert a SMILES string into a molecular graph object which can subsequently be used for graph-based machine learning. We do so within the framework of Pytorch Geometric which currently is one of the best and most commonly used Python-based GNN-libraries.

We divide our task into three high-level steps:

  1. We define a function that maps an RDKit atom object to a suitable atom feature vector.
  2. We define a function that maps an RDKit bond object to a suitable bond feature vector.
  3. We define a function that takes as its input a list of SMILES strings and associated labels and then uses the functions from 1.) and 2.) to create a list of labeled Pytorch Geometric graph objects as its output.
Continue reading

Benchmarks in De Novo Drug Design

I recently came across a review of “De novo molecular drug design benchmarking” by Lauren L. Grant and Clarissa S. Sit where they highlighted the recently proposed benchmarking methods including Fréchet ChemNet Distance [1], GuacaMol [2], and Molecular Sets (MOSES) [3] together with its current and future potential applications as well as the steps moving forward in terms of validation of benchmarking methods [4].

From this review, I particularly wanted to note about the issues with current benchmarking methods and the points we should be aware of when using these methods to benchmark our own de novo molecular design methods. Goal-directed models are referring to de novo molecular design methods optimizing for a particular scoring function [2].

Continue reading

Using normalized SuCOS scores.

If you are working in cheminformatics or utilise protein-ligand docking, then you should be aware of the SuCOS score, an open-source shape and chemical feature overlap metric designed by a former member of OPIG: Susan Leung.

The metric compares the 3D conformers of two ligands based on their shape overlap as well as their chemical feature overlap using the RDKit toolkit. Leung et al. show that SuCOS is able to select fewer false positives and false negatives when doing re-docking studies than other scoring metrics such as RMSD or Protein Ligand Interaction Fingerprints (PLIF) similarity scores and performs better at differentiating actives from decoys when tested on the DUD-E dataset.

Most importantly, SuCOS was designed with fragment based drug discovery in focus, where a smaller fragment ligand is elaborated or combined with other fragments to create a larger molecule, with hopefully stronger binding affinity. Unlike for example RMSD, SuCOS is able to quickly calculate an overlap score between a small fragment and a larger molecule, giving chemists an idea on how the fragment elaboration might interact with the protein. However, the original SuCOS algorithm was not normalized and could create scores of > 1 for some cases.

I’ve uploaded a normalised version of the original SuCOS algorithm as a GitHub fork of Susan’s original repository. You can find the normalised SuCOS algorithm here.

Hopefully this is helpful for anyone using the SuCOS algorithm and for all docking enthusiasts who are interested in an alternative way to evaluate their docked poses.

Getting the PDB structures of compounds in ChEMBL

Recently I was dealing with a set of compounds with known target activities from the ChEMBL database, and I wanted to find out which of them also had PDB  crystal structures in complex with that target.

Referencing this manually is very easy for cases where we are interested in 2-3 compounds, but for any larger number, using the ChEMBL and PDB web services greatly reduces the number of clicks.

Continue reading

Issues with graph neural networks: the cracks are where the light shines through

Deep convolutional neural networks have lead to astonishing breakthroughs in the area of computer vision in recent years. The reason for the extraordinary performance of convolutional architectures in the image domain is their strong ability to extract informative high-level features from visual data. For prediction tasks on images, this has lead to superhuman performance in a variety of applications and to an almost universal shift from classical feature engineering to differentiable feature learning.

Unfortunately, the picture is not quite as rosy yet in the area of molecular machine learning. Feature learning techniques which operate directly on raw molecular graphs without intermediate feature-engineering steps have only emerged in the last few years in the form of graph neural networks (GNNs). GNNs, however, still have not managed to definitively outcompete and replace more classical non-differentiable molecular representation methods such as extended-connectivity fingerprints (ECFPs). There is an increasing awareness in the computational chemistry community that GNNs have not quite lived up to the initial hype and still suffer from a number of technical limitations.

Continue reading

How to interact with small molecules in Jupyter Notebooks

The combination of Python and the cheminformatics toolkit RDKit has opened up so many ways to explore chemistry on a computer. Jupyter — named for the three languages, Julia, Python, and R — ties interactivity and visualization together, creating wonderful environments (Notebooks and JupyterLab) to carry out, share and reproduce research, including:

“data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.”

—https://jupyter.org

At this year’s annual RDKit UGM (User Group Meeting), Cédric Bouysset shared a tutorial explaining how to create a grid of molecules that you can interact with, using his “mols2grid“:

Continue reading

Watch out when using PDBbind!

Now that PDBbind 2020 has been released, I want to draw some attention to an issue with using the SDF files that are supplied in the PDBbind refined set 2020.

Normally, SDF files save the chirality information of compounds in the atom block of the file which is shown belowas a snipped of the full sdf file for the ligand of PDB entry 4qsv. The column that defines chirality is marked in red.

As you can see, all columns shown here are 0. The SDF files supplied by PDBbind for some reason do NOT encode chirality information explicitly. This will be a problem when using RDKit to read the molecule and transform it into a smiles string. By using the following commands to read the ligand for 4qsv from PDBBind 2020 and write a SMILES string, we get:

Continue reading

Out-of-distribution generalisation and scaffold splitting in molecular property prediction

The ability to successfully apply previously acquired knowledge to novel and unfamiliar situations is one of the main hallmarks of successful learning and general intelligence. This capability to effectively generalise is amongst the most desirable properties a prediction model (or a mind, for that matter) can have.

In supervised machine learning, the standard way to evaluate the generalisation power of a prediction model for a given task is to randomly split the whole available data set X into two sets – a training set X_{\text{train}} and a test set X_{\text{test}}. The model is then subsequently trained on the examples in the training set X_{\text{train}} and afterwards its prediction abilities are measured on the untouched examples in the test set X_{\text{test}} via a suitable performance metric.

Since in this scenario the model has never seen any of the examples in X_{\text{test}} during training, its performance on X_{\text{test}} must be indicative of its performance on novel data X_{\text{new}} which it will encounter in the future. Right?

Continue reading

Automated intermolecular interaction detection using the ODDT Python Module

Detecting intermolecular interactions is often one of the first steps when assessing the binding mode of a ligand. This usually involves the human researcher opening up a molecular viewer and checking the orientations of the ligand and protein functional groups, sometimes aided by the viewer’s own interaction detecting functionality. For looking at single digit numbers of structures, this approach works fairly well, especially as more experienced researchers can spot cases where the automated interaction detection has failed. When analysing tens or hundreds of binding sites, however, an automated way of detecting and recording interaction information for downstream processing is needed. When I had to do this recently, I used an open-source Python module called ODDT (Open Drug Discovery Toolkit, its full documentation can be found here).

My use case was fairly standard: starting with a list of holo protein structures as pdb files and their corresponding ligands in .sdf format, I wanted to detect any hydrogen bonds between a ligand and its native protein crystal structure. Specifically, I needed the number and name of the the interacting residue, its chain ID, and the name of the protein atom involved in the interaction. A general example on how to do this can be found in the ODDT documentation. Below, I show how I have used the code on PDB structure 1a9u.

Continue reading