Author Archives: Garrett

Mol2vec: Finding Chemical Meaning in 300 Dimensions

2D projections (t-SNE) of Mol2vec vectors of amino acids (bold arrows). These vectors were obtained by summing the vectors of the Morgan substructures (small arrows) present in the respective molecules (amino acids in the present example). The directions of the vectors provide a visual representation of similarities. Magnitudes reflect importance, i.e. more meaningful words. [Figure from Ref. 1]

Natural Language Processing (NLP) algorithms are usually used for analyzing human communication, often in the form of textual information such as scientific papers and Tweets. One aspect, coming up with a representation that clusters words with similar meanings, has been achieved very successfully with the word2vec approach. This involves training a shallow, two-layer artificial neural network on a very large body of words and sentences — the so-called corpus — to generate “embeddings” of the constituent words into a high-dimensional space. By computing the vector from “woman” to “queen”, and adding it to the position of “man” in this high-dimensional space, the answer, “king”, can be found.
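To make the vector arithmetic concrete, here is a minimal sketch using invented three-dimensional vectors. Real word2vec embeddings are learned from a corpus and have hundreds of dimensions; the words, the numbers and the `analogy` helper below are all made up for illustration and are not part of word2vec itself.

```python
import math

# Toy embeddings, invented for illustration only (not learned from a corpus).
EMBEDDINGS = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.2, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def analogy(a, b, c, embeddings=EMBEDDINGS):
    """Return the word closest to (b - a) + c, excluding the three input words."""
    target = [vb - va + vc
              for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


# The vector from "woman" to "queen", added to the position of "man".
print(analogy("woman", "queen", "man"))
```

With these toy numbers the nearest remaining word is “king”; with real embeddings the match is only approximate, which is why the nearest neighbor is searched for rather than looked up exactly.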

A recent publication by one of my former InhibOx colleagues, Simone Fulle, and her co-workers, Sabrina Jaeger and Samo Turk, shows how we can embed molecular substructures and chemical compounds into a similarly high-dimensional, continuous vectorial representation, which they dubbed “mol2vec”.¹ They also released a Python implementation, available on Samo Turk’s GitHub repository.
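As the figure caption above notes, the vector for a compound is obtained by summing the vectors of the Morgan substructures it contains. A rough sketch of that summation step follows. The substructure labels, the three-dimensional vectors and the `molecule_vector` helper are hypothetical stand-ins of my own; this is not the API of the mol2vec package, which works with learned 300-dimensional vectors.

```python
# Hypothetical substructure labels with made-up 3-dimensional vectors.
# In Mol2vec the vectors are learned for Morgan substructures; these are not.
SUBSTRUCTURE_VECTORS = {
    "amine":    [0.5, 0.1, 0.0],
    "carboxyl": [0.0, 0.4, 0.2],
    "methyl":   [0.1, 0.0, 0.3],
}


def molecule_vector(substructures, vectors=SUBSTRUCTURE_VECTORS):
    """Sum the vectors of all substructures present in a molecule."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for name in substructures:
        for i, value in enumerate(vectors[name]):
            total[i] += value
    return total


# A toy "molecule" described as a list of its substructures.
alanine_like = ["amine", "carboxyl", "methyl"]
print(molecule_vector(alanine_like))
```

A substructure that occurs twice in the list contributes twice to the sum, so the result depends on how often each substructure appears as well as on which ones are present.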

Continue reading →

Rasmus Fonseca and GetContacts

We welcomed Rasmus Fonseca to last week’s OPIG Group Meeting. Rasmus is currently a Visiting Scholar at Stanford. He gave a fascinating talk about the interaction analysis of molecular structures and ensembles using the GetContacts package, one of many projects that he has contributed to that you can find on his GitHub repo.

Rasmus was kind enough to share his slides with us:

https://docs.google.com/presentation/d/1HmN9AuU4gL-jMlJdR6cMleueQ-nRWOE_hiWtO8OQEoo/edit?usp=sharing

He is looking for new users (and developers), so if you have questions, he would be very happy to help get you started.

Seeing the Mesoscale

There’s a range of scales that is really hard for us to see. Techniques like X-ray crystallography and, increasingly, cryo-electron microscopy let us see molecules at an atomic level of detail. Microscopes reveal organelles in cells, but seeing the molecular ‘trees’ in the cellular ‘forest’ requires a synthesis of knowledge. David Goodsell was one of the first to show us the emergent beauty of the cell at the molecular level, and work carried out in the Molecular Graphics Laboratory at The Scripps Research Institute under the direction of Art Olson has led to 3D molecular modeling tools like ePMV, autoPACK and cellPACK.

One of the fruits of this labor is the Visual Guide to the Cell, part of the Allen Cell Explorer. It’s well worth a look at how you can explore 3D representations of the cell in a web browser.

Protein Engineering and Structure Determination

Sometimes it can be advantageous to combine two proteins into one. One such technique was described by Jennifer Padilla, Christos Colovos, and Todd Yeates back in 2001 (Padilla et al., 2001). By connecting two proteins, one that dimerized and another that trimerized, they were able to design synthetic ‘nanohedra’. They achieved this by extending a C-terminal α-helix at the end of one protein with another α-helix ‘linker’, directly into the N-terminal α-helix of another protein:

Continue reading →

Experimental Binding Modes of Small Molecules in Protein-Ligand Docking

Protein-ligand docking tends to be very good at generating binding modes that resemble experimental binding modes from X-ray crystallography and other methods (assuming we have a high quality structure…); but it is also very good at generating plausible models for ligands that don’t bind. These so-called “false positives” lead to reduced accuracy in structure-based virtual screening campaigns.

Structure-based methods are not the only way of approaching virtual screening: when all we know is the chemical structure of an active molecule, but nothing about its target (or targets), we can use ligand-based virtual screening methods, which operate on the principle of molecular similarity (Maggiora et al., 2014).
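As a small illustration of the similarity principle, here is a sketch that ranks a toy library against a known active. It uses the Tanimoto coefficient, a common choice for comparing molecular fingerprints, although the text above does not prescribe any particular measure. The molecule names and feature sets are invented for the example.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two feature sets: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty feature sets
    return len(a & b) / len(union)


def rank_by_similarity(query, library):
    """Return library names sorted from most to least similar to the query."""
    return sorted(library,
                  key=lambda name: tanimoto(query, library[name]),
                  reverse=True)


# Invented feature sets standing in for fingerprint bits.
known_active = {"aromatic_ring", "amide", "hydroxyl"}
library = {
    "mol_A": {"aromatic_ring", "amide", "hydroxyl", "chloro"},
    "mol_B": {"aromatic_ring", "nitro"},
    "mol_C": {"alkyl_chain"},
}
print(rank_by_similarity(known_active, library))
```

In a ligand-based screen, the molecules at the top of such a ranking are the ones you would test first.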

But what if we combine both methods?

Continue reading →

Interesting Jupyter and IPython Notebooks

Here’s a treasure trove of interesting Jupyter and IPython notebooks, with lots of diverse examples relevant to OPIG, including an RDKit notebook, but also:

Entire books or other large collections of notebooks on a topic (covering Introductory Tutorials; Programming and Computer Science; Statistics, Machine Learning and Data Science; Mathematics, Physics, Chemistry, Biology; Linguistics and Text Mining; Signal Processing; Scientific computing and data analysis with the SciPy Stack; General topics in scientific computing; Machine Learning, Statistics and Probability; Physics, Chemistry and Biology; Data visualization and plotting; Mathematics; Signal, Sound and Image Processing; Natural Language Processing; Pandas for data analysis)
General Python Programming
Notebooks in languages other than Python (Julia; Haskell; Ruby; Perl; F#; C#)
Miscellaneous topics about doing various things with the Notebook itself
Reproducible academic publications
and lots more!

The Emerging Disorder-Function Paradigm

It’s rare to find a paper that connects all of the diverse areas of research of OPIG, but “The rules of disorder or why disorder rules” by Gsponer and Babu (2009) is one such paper. Protein folding, protein-protein interaction networks, protein loops (Schlessinger et al., 2007), and drug discovery all play a part in this story. What’s great about this paper is that it gives numerous examples of proteins, along with the evidence that they are partially or completely unstructured. These are the so-called intrinsically unstructured proteins, or IUPs, although more recently they are also being referred to as intrinsically disordered proteins, or IDPs. Intrinsically disordered regions (IDRs) “are polypeptide segments that do not contain sufficient hydrophobic amino acids to mediate co-operative folding” (Babu, 2016).

Such proteins contradict the classic “lock and key” hypothesis of Fischer, and challenge Continue reading →

Viewing 3D molecules interactively in Jupyter iPython notebooks

Greg Landrum, curator of the invaluable open source cheminformatics API, RDKit, recently blogged about viewing molecules in a 3D window within a Jupyter-hosted IPython notebook (as long as your browser supports WebGL, that is).

The trick is to use py3Dmol. It’s easy to install:

pip install py3Dmol

This is built on 3Dmol.js, an object-oriented, WebGL-based JavaScript library for online molecular visualization (Rego & Koes, 2015); here's a nice summary of the capabilities of 3Dmol.js. Its features include:

support for pdb, sdf, mol2, xyz, and cube formats
parallelized molecular surface computation
sphere, stick, line, cross, cartoon, and surface styles
atom property based selection and styling
labels
clickable interactivity with molecular data
geometric shapes including spheres and arrows

I tried a simple example and it worked beautifully:

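import py3Dmol

# Fetch PDB entry 1HVR and create a viewer for it
view = py3Dmol.view(query='pdb:1hvr')
# Draw the structure as a cartoon, using the 'spectrum' color scheme
view.setStyle({'cartoon':{'color':'spectrum'}})
# As the last expression in a notebook cell, this displays the viewer
view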

The 3Dmol.js website summarizes how to view molecules, along with how to choose representations, how to embed it, and even how to develop with it.
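The “atom property based selection and styling” feature listed above can be sketched by passing a selection as the first argument to `setStyle`. The helper names `style_rules` and `build_view` are my own, and the `hetflag` selection key is how I understand 3Dmol.js selects hetero atoms such as bound ligands, so treat this as an assumption and check it against the 3Dmol.js documentation.

```python
def style_rules():
    """Return (selection, style) pairs: cartoon for everything, sticks for hetero atoms."""
    return [
        ({}, {"cartoon": {"color": "spectrum"}}),   # empty selection = all atoms
        ({"hetflag": True}, {"stick": {}}),         # assumed key for hetero atoms
    ]


def build_view(pdb_id="1hvr"):
    """Create a py3Dmol viewer for a PDB entry and apply the style rules."""
    import py3Dmol  # imported here so style_rules() can be used without py3Dmol installed

    view = py3Dmol.view(query="pdb:" + pdb_id)
    for selection, style in style_rules():
        view.setStyle(selection, style)
    view.zoomTo()
    return view
```

In a notebook cell, ending with `build_view()` as the last expression should display the viewer, just as `view` does in the example above.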

References

Nicholas Rego & David Koes (2015). “3Dmol.js: molecular visualization with WebGL”. Bioinformatics, 31 (8): 1322-1324. doi:10.1093/bioinformatics/btu829

The Protein World

This week’s issue of Nature has a wonderful “Insight” supplement titled, “The Protein World” (Vol. 537 No. 7620, pp 319-355). It begins with an editorial from Joshua Finkelstein, Alex Eccleston & Sadaf Shadan (Nature, 537: 319, doi:10.1038/537319a), and introduces four reviews, covering:

the computational de novo design of proteins that spontaneously fold and assemble into desired shapes (“The coming of age of de novo protein design”, by Po-Ssu Huang, Scott E. Boyken & David Baker, Nature, 537: 320–327, doi:10.1038/nature19946). Baker et al. point out that much of protein engineering until now has involved modifying naturally-occurring proteins, but assert, “it should now be possible to design new functional proteins from the ground up to tackle current challenges in biomedicine and nanotechnology”;
how the cellular proteome, a dynamic structural and regulatory network that constantly adapts to the needs of the cell, can become imbalanced through genetic alterations, ranging from chromosome imbalance to oncogene activation, that change the speed, fidelity and capacity of protein biogenesis and degradation systems. Understanding these complex systems can help us to develop better ways to treat diseases such as cancer (“Proteome complexity and the forces that drive proteome imbalance”, by J. Wade Harper & Eric J. Bennett, Nature, 537: 328–338, doi:10.1038/nature19947);
the new challenger to X-ray crystallography, the workhorse of structural biology: cryo-EM. Cryo-electron microscopy has undergone a renaissance in the last 5 years thanks to new detector technologies, and is starting to give us high-resolution structures and new insights about processes in the cell that are just not possible using other techniques (“Unravelling biological macromolecules with cryo-electron microscopy”, by Rafael Fernandez-Leiro & Sjors H. W. Scheres, Nature, 537: 339–346, doi:10.1038/nature19948); and
the growing role of mass spectrometry in unveiling the higher-order structures and composition, function, and control of the networks of proteins collectively known as the proteome. High resolution mass spectrometry is helping to illuminate and elucidate complex biological processes and phenotypes, to “catalogue the components of proteomes and their sites of post-translational modification, to identify networks of interacting proteins and to uncover alterations in the proteome that are associated with diseases” (“Mass-spectrometric exploration of proteome structure and function”, by Ruedi Aebersold & Matthias Mann, Nature, 537: 347–355, doi:10.1038/nature19949).

Baker points out that the majority of de novo designed proteins have a single, deep minimum-energy state, and that we have a long way to go to mimic the subtleties of naturally-occurring proteins: things like allostery, signalling, and even recessed binding pockets for small molecules, functional sites, and hydrophobic binding interfaces present their own challenges. Only by increasing our understanding and developing better models and computational tools will we be able to accomplish this.

Comp Chem Kitchen

I recently started “Comp Chem Kitchen” with Richard Cooper and Rob Paton in the Department of Chemistry here in Oxford. It’s a regular forum and seminar series for molecular geeks and hackers, in the original, untarnished sense of the word: people using and developing computational methods to tackle problems in chemistry, biochemistry and drug discovery. Our hope is that we will share best practices, even code snippets and software tools, and avoid re-inventing wheels.

In addition to local researchers, we invite speakers from industry and non-profits from time to time, and occasionally organize software demos and tutorials.

We also provide refreshments including free beer. (We are grateful to Prof. Phil Biggin and the MRC Proximity to Discovery Fund for offering to support CCK.)

CCK-1

Our first meeting, CCK-1, was held in the Abbot’s Kitchen on May 24, 2016, at 5 pm, and was a great success—standing room only, in fact! The Abbot’s Kitchen—originally a laboratory—is a beautiful stone building built in 1860 in the Victorian Gothic style, alongside the Natural History Museum, at a time when Chemistry was first recognized as a discipline.

We heard a fascinating talk from Jerome Wicker from the Department of Chemistry, who spoke about “Machine learning for classification of solid form data extracted from CSD and ZINC”, and described a method that could successfully discriminate (~80%) whether a small molecule would crystallize or not. The software tools he discussed included RDKit, CSD, and scikit-learn. There were also two lightning talks, each 5 minutes long. One was from OPIG member Hannah Patel, from the Department of Statistics, on “Novelty Score: Prioritising compounds that potentially form novel protein-ligand interactions and novel scaffolds using an interaction centric approach”; she briefly described her Django-based web interface to her RDKit-based tool, which analyses structures of ligands bound to proteins and helps guide future medicinal chemistry to find novel compounds. We also had a talk from Dr Michael Charlton from InhibOx, who spoke about “Antibacterial Drug Discovery and Machine Learning”.

CCK-2

Our next Comp Chem Kitchen, CCK-2, will be held next Tuesday (June 14th, 2016), and you can register free for CCK-2 here.

We will have talks from:

Mike Bodkin (Vice President, Research Informatics, Evotec): “Chemical space and how to warp drive discovery”.
Jonathan Yates (Department of Materials, University of Oxford), lightning talk: “A brief introduction to the Collaborative Computational Project for NMR Crystallography (CCP-NC)”.
Jonny Brooks-Bartlett (Elspeth Garman Group, Department of Biochemistry), lightning talk: “The Julia Programming Language”.
Fernanda Duarte (Rob Paton Group, Department of Chemistry, University of Oxford), lightning talk: “Exploring biochemical systems using the Empirical Valence Bond (EVB) approach”.
Matteo Degiacomi (Justin Benesch Group, Department of Chemistry, University of Oxford), lightning talk: “The Python package BiobOx: a collection of data structures, tools and methods for biomolecular modelling”. BiobOx is used for manipulation, measurement, analysis and assembly of atomistic and super coarse-grain structures as well as EM maps.

Hope to see you there! (Did I say free beer?)

Oxford Protein Informatics Group

or "OPIG" to friends
