Seventh Joint Sheffield Conference on Cheminformatics Part 1 (#ShefChem16)

In early July I attended the Seventh Joint Sheffield Conference on Cheminformatics. There was a variety of talks, with speakers at all stages of their careers. I was lucky enough to be invited to speak at the conference, and gave my first conference talk! I have written two blog posts about the conference: part 1 briefly describes a talk that I found interesting, and part 2 describes the work I spoke about at the conference.

One of the most interesting parts of the conference was the active Twitter presence under the hashtag #ShefChem16. All of the talks were live-tweeted, which provided a summary of each talk along with links to software and references. It also allowed speakers to gain instant insight and feedback on their talks.

One of the talks I found most interesting presented the Protein-Ligand Interaction Profiler (PLIP), a method for detecting protein-ligand interactions. PLIP is open source and is available both as a web-based tool and as a command-line tool. Unlike PyMOL, which only calculates polar contacts and does not classify the type of interaction, PLIP detects eight different types of interaction: hydrogen bonds, hydrophobic contacts, π-π stacking, π-cation interactions, salt bridges, water bridges, halogen bonds and metal complexes. For a given PDB file the interactions are calculated and presented in a publication-quality figure.


The display can also be downloaded as a PyMOL session so that it can be modified further.

This is an extremely useful tool for quickly identifying the types of interaction formed in a protein-ligand complex.

PLIP can be found here: https://projects.biotec.tu-dresden.de/plip-web/plip/

Tracked changes in LaTeX

Maybe people keep telling you Word is great, but you are just too emotionally attached to LaTeX to consider using anything else. It just looks so beautiful. Besides, you would have to leave your beloved Linux environment (maybe that’s just me), so you stick with what you know. You work long and hard for many weeks, finally producing a draft of a paper that gets the all-clear from your supervisor to submit to journal X. Eventually you hear back, and the reviewers have responded with some good ideas and a few pedantic points. Apparently this time the journal wants a tracked-changes version to go with your revised manuscript.

Highlighting every change sounds like a lot of bother, and besides, you’d have to process the highlighted version to generate the clean version they want you to submit alongside it. There must be a better way, and one that doesn’t involve converting your document to Word.

Thankfully, the internet has an answer! Check out the little package changes, which will do just what you need. As long as you annotate your edits with \deleted{}, \replaced{} and \added{} along the way, you only have to change a single word of your tex source file to produce either the highlighted version or the final one. It even comes with a handy bash script to strip out the markup once you’re happy with the result, leaving you with a clean final tex source file.
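To give a flavour of the markup, here is a minimal sketch with made-up document text; compiled as-is it shows the highlighted changes, and adding the final package option (\usepackage[final]{changes}) suppresses all the markup and produces the clean version for submission.

\documentclass{article}
% Compile as-is for the highlighted (tracked-changes) version; add the
% 'final' option to this line to produce the clean version instead.
\usepackage{changes}

\begin{document}
We ran the simulations \replaced{five}{three} times
\added{with the updated parameters} and observed
\deleted{surprisingly} consistent results.
\end{document}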


The die-hard Word fans won’t be impressed, but you will be very satisfied that you have found a nice little solution that does just the job you want it to. It’s actually capable of much more, including comments by multiple authors, customisation of colours and styles, and an automatically generated summary of changes. I have heard good things about ShareLaTeX for collaboration, but this simple package will get you a long way if you are not keen to start paying money yet.

 

Rational Design of Antibody Protease Inhibitors

Continuing the theme of my research area and my last presentation, I talked at group meeting about another success story in structurally designing an antibody by replicating a general protein binding site using grafted fragments involved in the original complex. The paper by Liu et al. is important to me for two major reasons. Firstly, they used an unconventional antibody for protein design, namely a bovine antibody, which is known to have an extended CDR H3. Secondly, the fragment was not grafted at the anchor points of the CDR loop.


SFTI-1 is a cyclic peptide and a known trypsin inhibitor, whose structure is stabilised by a disulphide bridge. The bovine antibody is known to have an extended H3 loop, which is essentially a long beta-strand stalk with a knob domain at the end. Liu et al. removed the knob domain and a portion of the beta strand and grafted the acyclic version of SFTI-1 onto it. As I said above, this result is very important because it shows we can graft a fragment/loop at places other than the anchor points of the CDR. This opens up the possibility of grafting more diverse fragments, both because of the new anchor points and because the fragment sits further away from the other CDRs and the framework, allowing it more conformational space. To see how the designed antibody compares to the original peptide they measured the Kd and found a 4-fold increase (9.57 vs 13.3). They hypothesise that this is probably because the extended beta strand on the antibody keeps the acyclic SFTI-1 peptide in a more stable conformation.

The problem with the bovine antibody is that, if introduced into a human subject, it would probably elicit an immune response. To humanise this antibody they found the human framework sharing the greatest sequence identity with the bovine antibody and grafted the fragment onto it. The human antibody does not have an extended CDR H3, and to decide on the best grafting position they tried various combinations, showing again that fragments do not need to be grafted exactly at the anchor points. Some of the resulting human antibodies showed even more impressive Kds.


The best designed human antibody had a Kd of 0.79 nM, another 10-fold gain. Liu et al. hypothesised that this might be because the cognate protein forms contacts with residues on the other CDRs, even though there is no crystal structure to show this. To test this hypothesis they mutated surface residues on the H2 and L1 loops to alanine, which resulted in a 6.7-fold decrease in affinity. The only comment I would add is that these mutations might have destabilised the other CDRs, which could itself be the reason for the decrease in affinity.

Quantifying dispersion under varying instrument precision

Experimental errors are commonly introduced when generating new data. Often these errors are simply due to the inability of the instrument to make precise measurements. In addition, different instruments can have different levels of precision, even though they are used to perform the same measurement. Take, for example, two balances and an object with a mass of 1 kg. When measuring this object several times, the first balance might record values of 1.0083 and 1.0091, while the second might give values of 1.1074 and 0.9828. In this case the first balance has the higher precision, as the spread between its measurements is smaller than that between the measurements of the second balance.

In order to account for the error introduced by the varying precision of different instruments, each instrument is labelled with a measure of its precision, 1/\sigma_i^2, or equivalently with its dispersion, \sigma_i^2.

Let’s assume that the measurements these instruments record are of the form X_i = C + \sigma_i Z, where Z \sim N(0,1) is an error term, X_i is the value recorded by instrument i, and C is the fixed true quantity of interest that the instrument is trying to measure. But what if C is not a fixed quantity, or what if the underlying phenomenon being measured is itself stochastic, like the measurement X_i? For example, if we are measuring the weight of cattle at different times, the length of a bacterial cell, or the concentration of a given drug in an organism, then in addition to the error that arises from the instruments there is also some noise introduced by dynamical changes in the object being measured. In this scenario the phenomenon of interest can be described by a random variable Y \sim N(\mu, S^2), and the instruments record quantities of the form X_i = Y + \sigma_i Z.

In this case, estimating \mu, the expected state of the phenomenon of interest, is not a big challenge. Assume that there are values x_1, x_2, ..., x_n observed as realisations of the variables X_i \sim N(\mu, \sigma_i^2 + S^2), coming from n different instruments. Here \sum x_i / n is still a good estimator of \mu, as E(\sum X_i / n) = \mu. A more challenging problem is to infer the underlying variability of the phenomenon of interest Y. Under our previous setup, this reduces to estimating S^2, since we are assuming Y \sim N(\mu, S^2) and that the instruments record values of the form X_i = Y + \sigma_i Z.

To estimate S^2 a standard maximum likelihood approach could be used, by considering the likelihood function:

f(x_1, x_2, ..., x_n) = \prod_i \frac{1}{\sqrt{2 \pi (\sigma_i^2 + S^2)}} \, e^{-\frac{(x_i - \mu)^2}{2 (\sigma_i^2 + S^2)}},

from which the maximum likelihood estimator of S^2 is given by the solution to

\sum_i \frac{(x_i - \mu)^2 - (\sigma_i^2 + S^2)}{(\sigma_i^2 + S^2)^2} = 0.
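To make the step explicit, take the logarithm of the likelihood above and differentiate with respect to S^2 (dropping additive constants):

\ell(S^2) = -\frac{1}{2} \sum_i \left[ \log(\sigma_i^2 + S^2) + \frac{(x_i - \mu)^2}{\sigma_i^2 + S^2} \right],

\frac{\partial \ell}{\partial S^2} = \frac{1}{2} \sum_i \frac{(x_i - \mu)^2 - (\sigma_i^2 + S^2)}{(\sigma_i^2 + S^2)^2},

and setting this derivative to zero gives the equation above.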

Another more naive approach could use the following result

E\left[\sum_i \left(X_i - \tfrac{1}{n}\sum_j X_j\right)^2\right] = (1 - 1/n) \sum_i \sigma_i^2 + (n - 1) S^2,

from which \hat{S}^2 = \frac{\sum_i \left(X_i - \tfrac{1}{n}\sum_j X_j\right)^2 - (1 - 1/n) \sum_i \sigma_i^2}{n - 1}.
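As an illustrative sketch, both estimators can be computed as follows (assuming \mu is known for simplicity; in practice one would plug in the sample mean as above, and the variable names and the scipy-based maximisation are my own choices):

import numpy as np
from scipy import optimize

rng = np.random.default_rng(0)

# Toy set-up mirroring the text: n instruments with known dispersions
# sigma_i^2, each recording X_i = Y_i + sigma_i * Z_i with Y_i ~ N(mu, S^2).
n = 200
S2_true = 1500.0
mu = 0.0
sigma2 = rng.uniform(10, 1500**2, size=n)        # instrument dispersions
x = rng.normal(mu, np.sqrt(sigma2 + S2_true))    # one reading per instrument

def mle_S2(x, sigma2, mu):
    """Maximise the normal log-likelihood over S^2 >= 0 (mu known)."""
    def neg_log_lik(S2):
        var = sigma2 + S2
        return 0.5 * np.sum(np.log(var) + (x - mu) ** 2 / var)
    res = optimize.minimize_scalar(neg_log_lik, bounds=(0.0, 10 * np.var(x)),
                                   method="bounded")
    return res.x

def naive_S2(x, sigma2):
    """Moment estimator derived from the expectation identity above."""
    n = len(x)
    return (np.sum((x - x.mean()) ** 2) - (1 - 1 / n) * np.sum(sigma2)) / (n - 1)

print(mle_S2(x, sigma2, mu), naive_S2(x, sigma2))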

Here are three simulation scenarios in which 200 values X_i are taken from instruments of varying precision, or variance, \sigma_i^2, i=1,2,...,200, and where the variance of the phenomenon of interest is S^2 = 1500. In the first scenario the \sigma_i^2 are drawn from [10, 1500^2], in the second from [10, 1500^2 \times 3] and in the third from [10, 1500^2 \times 5]. In each scenario the value of S^2 is estimated 1000 times, each time taking another 200 realisations of X_i. The values estimated via the maximum likelihood approach are plotted in blue, and the values obtained by the alternative method are plotted in red. The true value of S^2 is marked by the red dashed line across all plots.

First simulation scenario, where \sigma_i^2, i=1,2,...,200, are drawn from [10, 1500^2]. The values of \sigma_i^2 are plotted in the histogram on the right. The 1000 estimates of S^2 are shown by the blue (maximum likelihood) and red (alternative) histograms.

Second simulation scenario, where \sigma_i^2, i=1,2,...,200, are drawn from [10, 1500^2 \times 3]. The values of \sigma_i^2 are plotted in the histogram on the right. The 1000 estimates of S^2 are shown by the blue (maximum likelihood) and red (alternative) histograms.

Third simulation scenario, where \sigma_i^2, i=1,2,...,200, are drawn from [10, 1500^2 \times 5]. The values of \sigma_i^2 are plotted in the histogram on the right. The 1000 estimates of S^2 are shown by the blue (maximum likelihood) and red (alternative) histograms.

For recent advances in methods that deal with this kind of problem, see:

Delaigle, A. and Hall, P. (2016), Methodology for non-parametric deconvolution when the error distribution is unknown. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78: 231–252. doi: 10.1111/rssb.12109

Improving the precision of local community detection

Communities are everywhere. You can find groups of connected people on Facebook who went to university together, communities of people on YouTube that like the same band, sets of products on Amazon that tend to be purchased together, or modules of proteins that act in the same biological process in the human body. With the amount of data we can collect on these networks (Facebook friendships, protein interactions, etc.), you would think we could easily recover these groups. You would think wrong…

There are a lot of methods out there to find communities in networks. As a paper by Lancichinetti and Fortunato comparing community detection methods put it: “A thorough analysis of all existing methods would require years of work (during which many new methods would appear), and we do not believe it is feasible”. Thus, clearly, we have not found the answer.

So what do people generally do to find these communities? Well… very different things. To explain a few of the approaches, it helps to break them down into broad categories such as global and local methods.

 

Global Methods

Global community detection methods generally aim to optimize an objective function over the entire network. For example, I could decide to score a network partition with a function I will call “greatness”, G. I will define the greatness of a community as the sum, over its member nodes, of the number of neighbours each node has within that community (the sum of the nodes’ within-community degree). Or simply:

G = \sum_{i,j} A_{ij} \, \delta(\sigma_i, \sigma_j)

Here, A_{ij} is the adjacency matrix, which contains a 1 where nodes i and j interact and a 0 otherwise. The sigmas in the equation are the community identifiers of nodes i and j respectively, so the delta function is 1 when \sigma_i and \sigma_j are equal, meaning nodes i and j are in the same community, and 0 otherwise.
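To make this concrete, here is a toy sketch of computing G for a given assignment of nodes to communities (the function name, adjacency matrix and community labels are all invented for illustration):

import numpy as np

def greatness(A, communities):
    # G = sum over all pairs (i, j) of A_ij * delta(sigma_i, sigma_j):
    # only pairs assigned to the same community contribute.
    communities = np.asarray(communities)
    same = communities[:, None] == communities[None, :]   # delta(sigma_i, sigma_j)
    return int((A * same).sum())

# Toy network: nodes 0-1-2-3 in a path, split into communities {0, 1} and {2, 3}.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
print(greatness(A, [0, 0, 1, 1]))   # 4: the two within-community edges, each counted twice
print(greatness(A, [0, 0, 0, 0]))   # 6: lumping everything together maximises G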

If you now come to the conclusion that this is a pretty stupid thing to define, then you’ve understood the concept of a global method (alternatively… get rid of that negativity!). A global community detection method would now search for the partition of the network that maximizes G. For the greatness defined here, everything will be put into the same community, as there is no penalty for grouping two nodes that shouldn’t belong together: adding a further node to a community can never decrease the greatness score!

If you want a real overview of these and other methods, check out this 100-page review article.

 

Local Methods

In contrast to global methods, which attempt to gather information on the whole network, local community detection methods dive into the network and try to optimize the community partitioning from there. On the one hand this means they can be very fast at assigning nodes to communities, as they take less information into account. On the other hand, using less information in this assignment means they can be less precise.

An example of a local method is link clustering, which I described in an earlier post. Link clustering assesses the similarity of two edges that share a node, rather than comparing the nodes themselves. The similarity of these edges is calculated from how similar the neighbourhoods of their non-shared endpoints are. Using this score, all edges that are more similar than a cutoff value, S, can be clustered together to define the edge-communities at resolution S. This partitioning translates back to nodes by taking the nodes that the edges in each edge-community connect.
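As a rough sketch, the unweighted edge similarity used in Ahn et al.'s link clustering is the Jaccard index of the 'inclusive' neighbourhoods of the two non-shared endpoints (the toy graph below is invented, and networkx is used purely for convenience):

import networkx as nx

def edge_similarity(G, i, j, k):
    # Edges (i, k) and (j, k) share node k; compare the inclusive
    # neighbourhoods (node plus its neighbours) of the outer nodes i and j.
    ni = set(G[i]) | {i}
    nj = set(G[j]) | {j}
    return len(ni & nj) / len(ni | nj)

# Toy graph: edges (1, 3) and (2, 3) share node 3.
G = nx.Graph([(1, 3), (2, 3), (1, 4), (2, 5)])
print(edge_similarity(G, 1, 2, 3))   # 0.2: one shared member (node 3) out of five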

 

Improving local community detection

Link clustering is a pretty nice and straightforward example of how you can successfully find communities. However, the authors admit it works better on dense than on sparse networks. This is an expected side-effect of taking into account less information to be faster as mentioned above. So you can improve it with a pretty simple idea: how much more information can I take into account without basically looking at the whole network?

Local community detection methods such as link clustering are generally extended by including more information. One approach is to include information other than network topology in the local clustering process (like this). This approach requires information to be available beyond the network itself. In my work, I prefer using orthogonal data sets to validate community detection, rather than integrating them into the process and then being left without a good way of telling whether it actually worked. So instead I’m trying to include more of the local surroundings of a node to increase the sensitivity of the similarity measure between edges. At some point this will blow up the complexity of the problem and turn it into a global method, so I’m now walking a line between precision and speed. A line that everyone always seems to be walking. On this line you can’t really win… but maybe it is possible to optimize the trade-off to suit my purposes. And besides… longer algorithm run times give me more time for lunch breaks ;).

Conformational Variation of Protein Loops

Something many structural biologists (including us here in OPIG!) are guilty of is treating proteins as static, rigid structures. In reality, proteins are dynamic molecules, transitioning constantly from conformation to conformation. Proteins are therefore more accurately represented by an ensemble of structures instead of just one.

In my research, I focus on loop structures, the regions of a protein that connect elements of secondary structure (α-helices and β-sheets). There are many examples in the PDB of proteins with identical sequences, but whose loops have different structures. In many cases, a protein’s function depends on the ability of its loops to adopt different conformations. For example, triosephosphate isomerase, which is an important enzyme in the glycolysis pathway, changes conformation upon ligand binding, shielding the active site from solvent and stabilising the intermediate compound so that catalysis can occur efficiently. Conformational variability helps triosephosphate isomerase to be what is known as a ‘perfect enzyme’; catalysis is limited only by the diffusion rate of the substrate.

Structure of the triosephosphate isomerase enzyme. When the substrate binds, a loop changes from an ‘open’ conformation (pink, PDB entry 1TPD) to a ‘closed’ one (green, 1TRD), which prevents solvent access to the active site and stabilises the intermediate compound of the reaction.

An interesting example, especially for some of us here at OPIG, is the antibody SPE7. SPE7 is multispecific, meaning it is able to bind to multiple unrelated antigens. It achieves this through conformational diversity. Four binding site conformations have been found, two of which can be observed in its unbound state in equilibrium – one with a flat binding site, and another with a deep, narrow binding site [1].

SPE7: an antibody that exists as two different structures in equilibrium, one with a shallow binding site (left, blue, PDB code 1OAQ) and one with a deep, narrow cleft (right, green, PDB code 1OCW). Complementarity-determining regions are coloured in each case.

So when you’re dealing with crystal structures, beware! X-ray structures are averages – each atom position is an average of its position across all unit cells. In addition, thanks to factors such as crystal packing, the conformation that we see may not be representative of the protein in solution. The examples above demonstrate that the sequence/structure relationship is not as clear-cut as we are often led to believe. It is important to consider dynamics and conformational diversity, which may be required for protein function. Always bear in mind that the static appearance of an X-ray structure is not the reality!

[1] James, L C, Roversi, P and Tawfik, D S. Antibody Multispecificity Mediated by Conformational Diversity. Science (2003), 299, 1362-1367.

“Identifying Allosteric Hotspots with Dynamics”: Using molecular simulation to find critical sites for the functional motions in proteins

Allosteric (allo-(“other”) + steric (repulsion of atoms due to closeness or arrangement)) sites regulate protein function from a position other than the active site or binding site. Consider the latch on a pair of gardening scissors (Figure 1): depending on the position of the latch (allosteric site) the blades are prevented from cutting things at the other end (active site).


Figure 1 Allostery explained: A safety latch in gardening scissors.

Due to the non-trivial positions of allosteric sites in proteins, their identification has been challenging. Selected well-characterised systems such as GPCRs have known allosteric sites that are being used as targets in drug development. However, large-scale identification of allosteric sites across the Protein Data Bank (PDB) has not yet been feasible, partly because of a lack of suitable tools.

To tackle this problem, the Gerstein Lab developed a computational protocol based on various molecular simulations and network methods to find allosteric hotspots in proteins across the PDB. They introduce two different pipelines: one for identifying allosteric residues on the surface (surface-critical) and one for buried residues (interior-critical).

To find surface-critical residues, they use Monte Carlo (MC) simulations to repeatedly probe the surface of the protein with a short peptide. Based on hard spheres and simple energy calculations, this method seems to be an efficient way of detecting possible binding pockets. Once the binding pockets have been found, the collective motions of the structure are simulated using an elastic mass-and-spring network (an anisotropic network model, ANM). Binding pockets that undergo significant deformation during these simulations are considered surface-critical.

For interior-critical residues, they start by weighting residue-residue contacts on the basis of collective movement. Communities within the weighted network are then identified, and the residues with the highest betweenness between communities are chosen as interior-critical residues. Interior-critical residues are thus the ones carrying the highest information flow between two densely interconnected groups of residues.
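As a very rough illustration of that idea, here is my own toy mock-up using generic graph tools and an invented contact network; it is not the authors' implementation, just the flavour of the pipeline:

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical weighted residue-contact network: nodes are residues and edge
# weights reflect how strongly two residues move together in the simulated
# collective motions (all values are made up).
G = nx.Graph()
G.add_weighted_edges_from([
    ("A10", "A14", 0.9), ("A14", "A18", 0.8), ("A18", "A10", 0.7),
    ("B5", "B9", 0.9), ("B9", "B12", 0.8), ("B12", "B5", 0.7),
    ("A18", "B5", 0.4),   # the only link between the two dense groups
])

# Densely connected communities of residues...
communities = greedy_modularity_communities(G, weight="weight")
print([sorted(c) for c in communities])   # the A residues and the B residues

# ...and residues ranked by betweenness centrality, with coupling strengths
# converted to distances (1/weight) so strongly coupled residues count as
# close. The residues carrying the most shortest-path traffic between the
# two communities come out on top: the interior-critical candidates.
for u, v, d in G.edges(data=True):
    d["dist"] = 1.0 / d["weight"]
betweenness = nx.betweenness_centrality(G, weight="dist")
print(sorted(betweenness, key=betweenness.get, reverse=True)[:2])   # the bridging residues A18 and B5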

The protocol has been implemented in STRESS (STRucturally identified ESSential residues) and is freely available at stress.molmovdb.org.

Publication: http://www.ncbi.nlm.nih.gov/pubmed/27066750

Comp Chem Kitchen

I recently started “Comp Chem Kitchen” with Richard Cooper and Rob Paton in the Department of Chemistry here in Oxford. It’s a regular forum and seminar series for molecular geeks and hackers, in the original, untarnished sense of the word: people using and developing computational methods to tackle problems in chemistry, biochemistry and drug discovery. Our hope is that we will share best practices, even code snippets and software tools, and avoid re-inventing wheels.

In addition to local researchers, we invite speakers from industry and non-profits from time to time, and occasionally organize software demos and tutorials.

We also provide refreshments including free beer. (We are grateful to Prof. Phil Biggin and the MRC Proximity to Discovery Fund for offering to support CCK.)

CCK-1

Our first meeting, CCK-1, was held in the Abbot’s Kitchen on May 24, 2016, at 5 pm, and was a great success—standing room only, in fact! The Abbot’s Kitchen—originally a laboratory—is a beautiful stone building built in 1860 in the Victorian Gothic style, alongside the Natural History Museum, at a time when Chemistry was first recognized as a discipline.

We heard a fascinating talk from Jerome Wicker from the Department of Chemistry, who spoke about “Machine learning for classification of solid form data extracted from CSD and ZINC” and described a method that could successfully discriminate (~80%) whether a small molecule would crystallize or not. The software tools he discussed included RDKit, the CSD, and scikit-learn. There were also two lightning talks, each five minutes long. The first was from OPIG member Hannah Patel, from the Department of Statistics, on “Novelty Score: Prioritising compounds that potentially form novel protein-ligand interactions and novel scaffolds using an interaction centric approach”; she briefly described her Django-based web interface to her RDKit-based tool, which analyses structures of ligands bound to proteins and helps guide future medicinal chemistry towards novel compounds. The second was from Dr Michael Charlton of InhibOx, who spoke about “Antibacterial Drug Discovery and Machine Learning”.

CCK-2

Our next Comp Chem Kitchen, CCK-2, will be held next Tuesday (June 14th, 2016), and you can register for free here.

We will have talks from:

 

Hope to see you there!  (Did I say free beer?)

RDKit User Group Meeting 2016

If you’re interested in small molecules and cheminformatics, you might like to know that registration for the 2016 RDKit User Group Meeting is now open. The meeting will be held from the 26th-28th of October in Basel, Switzerland. From their announcement:

This year’s RDKit User Group Meeting will take place from 26-28 October at the Novartis Campus in Basel, Switzerland and is co-organized by people from Roche and Novartis.

Registration for the RDKit UGM is free:

https://www.eventbrite.com/e/5th-rdkit-user-group-meeting-tickets-22539677783

The previous years’ format seemed to work pretty well and the feedback was positive, so we will stick to the same format this year:

Days 1 and 2: Talks, lightning talks, roundtable(s), discussion, and talktorials. For those who haven’t attended before, talktorials are somewhere between a talk and a tutorial, they cover something interesting done with the RDKit and include the code used to do the work. During the presentation you’ll give an overview of what you did and also show the pieces of the code that are central to the work. The idea is to mix the science up with the tutorial aspects.

Day 3 will be a sprint: those who choose to stay will spend an intense day working in small groups to produce useful artifacts: new bits of code, KNIME nodes, KNIME workflows, tutorials, documentation, IPython notebooks, etc. We will once again try to structure this a bit by collecting a bunch of ideas for things to work on in advance.

There will also be, of course, social activities.

Looks like fun!

Advances in Conformer Generation: ETKDG and ETDG

Predicting the possible shapes a small molecule can adopt is essential to understanding its chemistry and the possible biological roles of candidate drugs. Thus, conformer generation, the process of converting a topological description of a molecule into a set of 3D positions of its constituent atoms, is an essential component of computational drug discovery.

A former member of OPIG, Dr Jean-Paul Ebejer, myself, and Prof. Charlotte Deane, compared the ability of freely-available conformer generation methods:

“(i) to identify which tools most accurately reproduce experimentally determined structures;
(ii) to examine the diversity of the generated conformational set; and
(iii) to benchmark the computational time expended.”

in Ebejer et al., 2012. JP assembled a set of 708 crystal structures of drug-like molecules from the OMEGA validation set and the Astex Diverse Set with which to test the various methods. We found that RDKit, combining its distance geometry (DG) algorithm with energy minimization using the MMFF94 force field, proved to be a “valid free alternative to commercial, proprietary software”, and was able to generate “a diverse and representative set of conformers which also contains a close conformer to the known structure”.

Following on from our work at InhibOx, and building on the same benchmark set JP assembled, Greg Landrum and Sereina Riniker recently described (Riniker & Landrum, 2015) two new conformer generation methods, ETDG and ETKDG, that improve upon the classical distance geometry (DG) algorithm. They do this by combining DG with knowledge of preferred torsional angles derived from experimentally determined crystal structures (ETDG), and by further adding constraints from chemical knowledge, such as ‘aromatic rings are flat’ or ‘bonds connected to triple bonds are collinear’ (ETKDG). They compared DG, ETDG, ETKDG, and a knowledge-based method, CONFECT, and found:

“ETKDG was found to outperform standard DG and the knowledge-based conformer generator CONFECT in reproducing crystal conformations from both small-molecule crystals (CSD data set) and protein−ligand complexes (PDB data set). With ETKDG, 84% of a set of 1290 small-molecule crystal structures from the CSD could be reproduced within an RMSD of 1.0 Å and 38% within an RMSD of 0.5 Å. The experimental torsional-angle preferences or the K terms alone each performed better than standard DG but were not sufficient to obtain the full performance of ETKDG.

Comparison of ETKDG with the DG conformers optimized using either the Universal Force Field (UFF) or the Merck Molecular Force Field (MMFF) showed different results for the two data sets. While FF-optimized DG performed better on the CSD data set, the two approaches were comparable for the PDB data set.”

They also showed (Fig. 13) that their ETKDG method was faster than DG followed by energy minimization, but not quite as accurate in reproducing the crystal structure.

“ETKDG takes 3 times as long as DG. The addition of the K terms, i.e., generating ETKDG embeddings instead of ETDG embeddings, increases runtime by only 10% over ETDG (results not shown). Despite the longer runtime per conformer, ETKDG requires on average one-quarter of the number of conformers to achieve performance similar to DG (Figure S12 and Table S4 in the Supporting Information). This results in a net performance improvement, at least when it comes to reproducing crystal conformers.

As measured by performance in reproducing experimental crystal structures, ETKDG is a viable alternative to plain DG followed by a UFF-optimization, so it is of interest how their runtimes compare. Figure 13 (right) plots the runtime for ETKDG versus the runtime for DG + UFF-optimization. The median ratio of the DG + UFF optimization and ETKDG runtimes is 1.97, i.e., DG + UFF optimization takes almost twice as long as ETKDG. Thus, although ETKDG is significantly slower than DG on a per-conformer basis, when higher-quality conformations are required it can provide structures that are the equivalent of those obtained using DG + UFF-optimization in about half the time.”

ETKDG looks like a great addition to the RDKit toolbox for conformer generation (and it was great to see JP thanked in the Acknowledgments!).
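If you want to try it yourself, here is a minimal sketch of generating conformers in RDKit with classical DG plus MMFF94 minimisation and with ETKDG (assuming a release recent enough to ship the ETKDG parameter object, i.e. 2016.03 or later; the molecule and the conformer count are arbitrary choices):

from rdkit import Chem
from rdkit.Chem import AllChem

# Illustrative molecule (aspirin); any drug-like SMILES would do.
mol = Chem.AddHs(Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O"))

# Classical distance geometry followed by MMFF94 minimisation.
dg_mol = Chem.Mol(mol)
conf_ids = AllChem.EmbedMultipleConfs(dg_mol, numConfs=50, randomSeed=42)
for conf_id in conf_ids:
    AllChem.MMFFOptimizeMolecule(dg_mol, confId=conf_id)

# ETKDG: distance geometry plus experimental torsion preferences and the
# additional 'K' chemical-knowledge terms; no force-field step required.
etkdg_mol = Chem.Mol(mol)
params = AllChem.ETKDG()
params.randomSeed = 42
AllChem.EmbedMultipleConfs(etkdg_mol, 50, params)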

 

References

Ebejer, J. P., G. M. Morris and C. M. Deane (2012). “Freely available conformer generation methods: how good are they?” J Chem Inf Model, 52(5): 1146-1158. 10.1021/ci2004658.

Riniker, S. and G. A. Landrum (2015). “Better Informed Distance Geometry: Using What We Know To Improve Conformation Generation.” J Chem Inf Model, 55(12): 2562-2574. 10.1021/acs.jcim.5b00654.