Tag Archives: RDKit

Interesting Jupyter and IPython Notebooks

Here’s a treasure trove of interesting Jupyter and iPython notebooks, with lots of diverse examples relevant to OPIG, including an RDKit notebook, but also:

Entire books or other large collections of notebooks on a topic (covering Introductory Tutorials; Programming and Computer Science; Statistics, Machine Learning and Data Science; Mathematics, Physics, Chemistry, Biology; Linguistics and Text Mining; Signal Processing; Scientific computing and data analysis with the SciPy Stack; General topics in scientific computing; Machine Learning, Statistics and Probability; Physics, Chemistry and Biology; Data visualization and plotting; Mathematics; Signal, Sound and Image Processing; Natural Language Processing; Pandas for data analysis); General Python Programming; Notebooks in languages other than Python (Julia; Haskell; Ruby; Perl; F#; C#); Miscellaneous topics about doing various things with the Notebook itself; Reproducible academic publications; and lots more!  


Using RDKit to load ligand SDFs into Pandas DataFrames

If you have downloaded lots of ligand SDF files from the PDB, then a good way of viewing/comparing all their properties would be to load it into a Pandas DataFrame.

RDKit has a very handy function just for this – it’s found under the PandasTool module.

I show an example below within Jupypter-notebook, in which I load in the SDF file, view the table of molecules and perform other RDKit functions to the molecules.

First import the PandasTools module:

from rdkit.Chem import PandasTools

Read in the SDF file:

SDFFile = "./Ligands_noHydrogens_noMissing_59_Instances.sdf"
BRDLigs = PandasTools.LoadSDF(SDFFile)

You can see the whole table by calling the dataframe:


The ligand properties in the SDF file are stored as columns. You can view what these properties are, and in my case I have loaded 59 ligands each having up to 26 properties:


It is also very easy to perform other RDKit functions on the dataframe. For instance, I noticed there is no heavy atom column, so I added my own called ‘NumHeavyAtoms’:

BRDLigs['NumHeavyAtoms']=BRDLigs.apply(lambda x: x['ROMol'].GetNumHeavyAtoms(), axis=1)

Here is the column added to the table, alongside columns containing the molecules’ SMILES and RDKit molecule:


CCP4 Study Weekend 2017: From Data to Structure

This year’s CCP4 study weekend focused on providing an overview of the process and pipelines available, to take crystallographic diffraction data from spot intensities right through to structure. Therefore sessions included; processing diffraction data, phasing through molecular replacement and experimental techniques, automated model building and refinement. As well as updates to CCP4 and where is crystallography going to take us in the future?

Surrounding the meeting there was also a session for Macromolecular (MX) crystallography users of Diamond Light Source (DLS), which gave an update on the beamlines, and scientific software, as well as examples of how fragment screening at DLS has been used. The VMXi (Versatile Macromolecular X-tallography in-situ) beamline is being developed to image crystals that are forming in situ crystallisation plates. This should allow for crystallography to be optimized, as crystallization conditions can be screened, and data collected on experiments as they crystallise, especially helpful in cases where crystallisation has routinely led to non-diffracting crystals. VXMm is a micro/nanofocus MX beamline, which is in development, with a focus to get crystallographic from very small crystals (~300nm to 10 micron diameters, with a bias to the smaller size), thereby allowing crystallography of targets that have previously been hard to get sufficient crystals. Other updates included how technology developed for fast solid state data collection on x-ray free electron lasers (XFEL) can be used on synchrotron beamlines.

A slightly more in-depth discussion of two tools presented that were developed for use alongside and within CCP4, which might be of interest more broadly:

ConKit: A python interface for contact prediction tools

Contact prediction for proteins, at its simplest, involves estimating which residues within a certain certain spatial proximity of each other, given the sequence of the protein, or proteins (for complexes and interfaces). Two major types of contact prediction exist:

  • Evolutionary Coupling
  • Supervised machine learning
    • Using ab initio structure prediction tools, without sequence homologues, to predict which contacts exist, but with a much lower accuracy than evolutionary coupling.


ConKit is a python interface (API) for contact prediction tools, consisting of three major modules:

  • Core: A module for constructing hierarchies, thereby storing necessary data such as sequences in a parsable format.
    • Providing common functionality through functions that for example declare a contact as a false positive.
  • Application: Python wrappers for common contact prediction and sequence alignment applications
  • I/O: I/O interface for file reading, writing and conversions.

Contact prediction can be used in the crystallographic structure determination field, during unconventional molecular replacement, using a tool such as AMPLE. Molecular replacement is a computational strategy to solve the phase problem. In the typical case, by using homologous structures to determine an estimate a model of the protein, which best fits the experimental diffraction intensities, and thus estimate the phase. AMPLE utilises ab initio modeling (using Rosetta) to generate a model for the protein, contact prediction can provide input to this ab initio modeling, thereby making it more feasible to generate an appropriate structure, from which to solve the phase problem. Contact prediction can also be used to analyse known and unknown structures, to identify potential functional sites.

For more information: Talk given at CCP4 study weekend (Felix Simkovic), ConKit documentation

ACEDRG: Generating Crystallographic Restraints for Ligands

Small molecule ligands are present in many crystallographic structures, especially in drug development campaigns. Proteins are formed (almost exclusively) from a sequence containing a selection of 20 amino acids, this means there are well known restraints (for example: bond lengths, bond angles, torsion angles and rotamer position) for model building or refinement of amino acids. As ligands can be built from a much wider selection of chemical moieties, they have not previously been restrained as well during MX refinement. Ligands found in PDB depositions can be used as models for the model building/ refinement of ligands in new structures, however there are a limited number of ligands available (~23,000). Furthermore, the resolution of the ligands is limited to the resolution of the macro-molecular structure from which they are extracted.

ACEDRG utilises the crystallorgraphy open database (COD), a library of (>300,000) small molecules usually with atomic resolution data (often at least 0.84 Angstrom), to generate a dictionary of restraints to be used in refining the ligand. To create these restraints ACEDRG utilises the RDkit chemoinformatics package, generating a detailed descriptor of each atom of the ligands in COD. The descriptor utilises properties of each atom including the element name, number of bonds, environment of nearest neighbours, third degree neighbours that are aromatic ring systems. The descriptor, is stored alongside the electron density values from the COD.  When a ACEDRG query is generated, for each atom in the ligand, the atom type is compared to those for which a COD structure is available, the nearest match is then used to generate a series of restraints for the atom.

ACEDRG can take a molecular description (SMILES, SDF MOL, SYBYL MOL2) of your ligand, and generate appropriate restraints for refinement, (atom types, bond lengths and angles, torsion angles, planes and chirality centers) as a mmCIF file. These restraints can be generated for a number of different probable conformations for the ligand, such that it can be refined in these alternate conformations, then the refinement program  can use local scoring criteria to select the ligand conformation that best fits the observed electron density. ACEDRG can accessed through the CCP4i2 interface, and as a command line interface.

Hopefully a useful insight to some of the tools presented at the CCP4 Study weekend. For anyone looking for further information on the CCP4 Study weekend: Agenda, Recording of Sessions, Proceedings from previous years.

Viewing 3D molecules interactively in Jupyter iPython notebooks

Greg Landrum, curator of the invaluable open source cheminformatics API, RDKit, recently blogged about viewing molecules in a 3D window within a Jupyter-hosted iPython notebook (as long as your browser supports WebGL, that is).

The trick is to use py3Dmol. It’s easy to install:

pip install py3Dmol

This is built on the object-oriented, webGL based JavaScript library for online molecular visualization 3Dmol.js (Rego & Koes, 2015); here's a nice summary of the capabilities of 3Dmol.js. It's features include:

  • support for pdb, sdf, mol2, xyz, and cube formats
  • parallelized molecular surface computation
  • sphere, stick, line, cross, cartoon, and surface styles
  • atom property based selection and styling
  • labels
  • clickable interactivity with molecular data
  • geometric shapes including spheres and arrows

I tried a simple example and it worked beautifully:

import py3Dmol
view = py3Dmol.view(query='pdb:1hvr')


The 3Dmol.js website summarizes how to view molecules, along with how to choose representations, how to embed it, and even how to develop with it.


Nicholas Rego & David Koes (2015). “3Dmol.js: molecular visualization with WebGL”.
Bioinformatics, 31 (8): 1322-1324. doi:10.1093/bioinformatics/btu829

Advances in Conformer Generation: ETKDG and ETDG

Predicting the possible shapes a small molecule can adopt is essential to understanding its chemistry and the possible biological roles of candidate drugs. Thus, conformer generation, the process of converting a topological description of a molecule into a set of 3D positions of its constituent atoms, is an essential component of computational drug discovery.

A former member of OPIG, Dr Jean-Paul Ebejer, myself, and Prof. Charlotte Deane, compared the ability of freely-available conformer generation methods:

“(i) to identify which tools most accurately reproduce experimentally determined structures;
(ii) to examine the diversity of the generated conformational set; and
(iii) to benchmark the computational time expended.”

in Ebejer et al., 2012.  JP assembled a set of 708 crystal structures of drug-like molecules from the OMEGA validation set and the Astex Diverse Set with which to test the various methods. We found that RDKit, combining its Distance Geometry (DG) algorithm with energy minimization  using the MMFF94 force field proved to be a “valid free alternative to commercial, proprietary software”, and was able to generate “a diverse and representative set of conformers which also contains a close conformer to the known structure”.

Following on from our work at InhibOx, and building on the same benchmark set JP assembled, Greg Landrum and Sereina Riniker recently described (Riniker & Landrum, 2015) two new conformer generation methods, ETDG and ETKDG, that improve upon the classical distance geometry (DG) algorithm. They do this by combining DG with knowledge of preferred torsional angles derived from experimentally determined crystal structures (ETDG), and also by further adding constraints from chemical knowledge, such as ‘aromatic rings are be flat’, or ‘bonds connected to triple bonds are colinear’ (ETKDG). They compared DG, ETDG, ETKDG, and a knowledge-based method, CONFECT, and found:

“ETKDG was found to outperform standard DG and the knowledge-based conformer generator CONFECT in reproducing crystal conformations from both small-molecule crystals (CSD data set) and protein−ligand complexes (PDB data set). With ETKDG, 84% of a set of 1290 small-molecule crystal structures from the CSD could be reproduced within an RMSD of 1.0 Å and 38% within an RMSD of 0.5 Å. The experimental torsional-angle preferences or the K terms alone each performed better than standard DG but were not sufficient to obtain the full performance of ETKDG.

Comparison of ETKDG with the DG conformers optimized using either the Universal Force Field (UFF) or the Merck Molecular Force Field (MMFF) showed different results for the two data sets. While FF-optimized DG performed better on the CSD data set, the two approaches were comparable for the PDB data set.”

They also showed (Fig. 13) that their ETKDG method was faster than DG followed by energy minimization, but not quite as accurate in reproducing the crystal structure.

ETKDG takes 3 times as long as DG. The addition of the K terms, i.e., generating ETKDG embeddings instead of ETDG embeddings, increases runtime by only 10% over ETDG (results not shown). Despite the longer runtime per conformer, ETKDG requires on average one-quarter of the number of conformers to achieve performance similar to DG (Figure S12 and Table S4 in the Supporting Information). This results in a net performance improvement, at least when it comes to reproducing crystal conformers.

As measured by performance in reproducing experimental crystal structures, ETKDG is a viable alternative to plain DG followed by a UFF-optimization, so it is of interest how their runtimes compare. Figure 13 (right) plots the runtime for ETKDG versus the runtime for DG + UFF-optimization. The median ratio of the DG + UFF optimization and ETKDG runtimes is 1.97, i.e., DG + UFF optimization takes almost twice as long as ETKDG. Thus, although ETKDG is significantly slower than DG on a per-conformer basis, when higher-quality conformations are required it can provide structures that are the equivalent of those obtained using DG + UFF-optimization in about half the time.

ETKDG looks like a great addition to the RDKit toolbox for conformer generation (and it was great to see JP thanked in the Acknowledgments!).



Ebejer, J. P., G. M. Morris and C. M. Deane (2012). “Freely available conformer generation methods: how good are they?” J Chem Inf Model, 52(5): 1146-1158. 10.1021/ci2004658.

Riniker, S. and G. A. Landrum (2015). “Better Informed Distance Geometry: Using What We Know To Improve Conformation Generation.” J Chem Inf Model, 55(12): 2562-2574. 10.1021/acs.jcim.5b00654.