Monthly Archives: August 2018

Mol2vec: Finding Chemical Meaning in 300 Dimensions

Embeddings of Amino Acids

2D projections (t-SNE) of Mol2vec vectors of amino acids (bold arrows). These vectors were obtained by summing the vectors of the Morgan substructures (small arrows) present in the respective molecules (amino acids in the present example). The directions of the vectors provide a visual representation of similarities. Magnitudes reflect importance, i.e. more meaningful words. [Figure from Ref. 1]

Natural Language Processing (NLP) algorithms are typically used to analyse human communication, often in the form of textual information such as scientific papers and Tweets. One aspect, coming up with a representation that clusters words with similar meanings, has been achieved very successfully with the word2vec approach. This involves training a shallow, two-layer artificial neural network on a very large body of words and sentences — the so-called corpus — to generate “embeddings” of the constituent words into a high-dimensional space. By computing the vector from “woman” to “queen”, and adding it to the position of “man” in this high-dimensional space, the answer, “king”, can be found.
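As a small illustration, here is a minimal sketch of that analogy test using the gensim library; the path to the pretrained vectors is a placeholder, and any publicly available word2vec model could be substituted.

```python
# Minimal sketch of the word2vec analogy test with gensim.
# 'word_vectors.bin' is a hypothetical path to any pretrained word2vec model.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format('word_vectors.bin', binary=True)

# vector("queen") - vector("woman") + vector("man") should land near "king"
print(wv.most_similar(positive=['queen', 'man'], negative=['woman'], topn=3))
```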

A recent publication by one of my former InhibOx colleagues, Simone Fulle, and her co-workers Sabrina Jaeger and Samo Turk, shows how we can embed molecular substructures and chemical compounds into a similarly high-dimensional, continuous vectorial representation, which they dubbed “mol2vec”.1 They also released a Python implementation, available on Samo Turk’s GitHub repository.
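For a flavour of how this looks in practice, below is a minimal sketch along the lines of the examples in the mol2vec repository. The pretrained model filename is a placeholder, and the function names (mol2alt_sentence, MolSentence, sentences2vec) are my reading of the package, so please check against the repository’s own documentation.

```python
# Sketch of embedding a molecule with mol2vec (based on the repository's examples;
# filenames are placeholders and the API should be checked against the package docs).
from rdkit import Chem
from gensim.models import word2vec
from mol2vec.features import mol2alt_sentence, MolSentence, sentences2vec

model = word2vec.Word2Vec.load('model_300dim.pkl')    # pretrained mol2vec model (placeholder path)

mol = Chem.MolFromSmiles('N[C@@H](C)C(=O)O')          # alanine
sentence = MolSentence(mol2alt_sentence(mol, 1))      # Morgan substructures (radius 1) as "words"
vec = sentences2vec([sentence], model, unseen='UNK')  # sum of substructure vectors -> 300-dim vector
print(vec.shape)                                      # expected: (1, 300)
```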

 


Cinder: Crystallographic Tinder

Protein structure determination is still dominated by X-ray diffraction. For diffraction studies, structural biologists need to grow and optimise protein crystals until they diffract to a usable resolution. A purified protein sample is exposed to a number of crystallisation screens, each comprising a selection of chemical conditions designed to explore a reasonably wide area of potential crystallisation space.

Many crystallography labs routinely image these experiments in large plate-storage systems, reducing the human interaction to viewing a set of typically 100–1,000 images at various time points. This is a slow and laborious process, and one highly amenable to machine learning approaches tailored to image analysis. TexRank, a texton-analysis ranking software package, was developed by Jia Tsing in OPIG and is used at the Structural Genomics Consortium (SGC). This ranking reduces the number of images that a human needs to search through, providing a quicker review process.

Rasmus Fonseca and GetContacts

We welcomed Rasmus Fonseca to last week’s OPIG Group Meeting. Rasmus is currently a Visiting Scholar at Stanford. He gave a fascinating talk about the interaction analysis of molecular structures and ensembles using the GetContacts package, one of many projects that he has contributed to that you can find on his GitHub repo.

Rasmus was kind enough to share his slides with us:

https://docs.google.com/presentation/d/1HmN9AuU4gL-jMlJdR6cMleueQ-nRWOE_hiWtO8OQEoo/edit?usp=sharing.

He is looking for new users (and developers), so if you have questions, he would be very happy to help get you started.

ISMB 2018: Collaborative Structural Biology using Machine Learning and Jupyter Notebook

This post is a summary of the talk, Collaborative Structural Biology using Machine Learning and Jupyter Notebook, given by Fergus Imrie and Fergus Boyles at ISMB 2018. Materials for the experiments can be found here and here.

Four other members of the Oxford Protein Informatics Group (a.k.a. OPIGlets) and I recently had the pleasure of attending the Intelligent Systems for Molecular Biology (ISMB) conference in Chicago. Organised by the International Society for Computational Biology (ISCB), ISMB is the largest computational biology conference in the world, with several thousand attendees.

Spread over four action-packed days in July (not including workshops/tutorial sessions), it was an eye-opening experience, showcasing the depth and breadth of computational biology research; particularly striking was the range of problems tackled, techniques applied, and data sources used.

I was fortunate enough to have the opportunity to present alongside my colleague, Fergus Boyles, as part of the 3DSIG Community of Special Interest (COSI). We led the first hands-on practical demonstration at 3DSIG, entitled “Collaborative Structural Biology using Machine Learning and Jupyter Notebook”. This was a new format at the conference, and our presentation was somewhat of an experiment, but I understand the organising committee is keen to repeat it next year.

In what follows, I’ll briefly outline the key themes and outcomes from our presentation. Full materials to reproduce all of the results presented can be found here and here.

Reproducibility crisis?

In a survey of 1,500 scientists by Nature in 2016 (link), more than 70% of participants had tried and failed to reproduce another scientist’s experiments, while 90% said there was a reproducibility crisis to some extent. Most striking, perhaps, was the revelation that “more than half have failed to reproduce their own experiments”!

M. Baker, “1,500 scientists lift the lid on reproducibility”, Nature, 2016.

While the focus of the survey was, admittedly, on traditional, lab-based, experimental research, this is certainly also an issue in computational approaches, with the machine learning community under the heaviest scrutiny.

This is clearly unsustainable, and many efforts are being made to address it across the scientific world. As one example, Nature has introduced a code-submission checklist that requires authors to submit custom algorithms or software central to the paper for peer review and editorial assessment. While this directly affects only a small portion of research, it is a big step in the right direction, and I think we’re only going to see more of this in the future.

Software to the rescue?

With the rise of cloud computing, the open-source community, and much more, there is a plethora of software available that can be used to improve the accessibility of methods and the reproducibility of computational experiments. Below, I touch on a couple of general areas that are increasingly used in computational pipelines and setups.

  • Cloud computing (such as Amazon Web Services, Google Cloud, and Microsoft Azure) provides widely accessible, standardised compute environments, and allows the use of anything from a single core to near-HPC-level resources for a short period of time at relatively low cost.
  • Container solutions (such as Docker and Kubernetes) allow developers to package an application, together with all required libraries and dependencies, into a single image that the end user can run with no further dependencies.

Our approach

We didn’t use any of the above tools for the purposes of our talk, but instead constructed our pipeline from three other widely-used solutions: Conda, Project Jupyter, and Git/GitHub. For those unfamiliar, here is a brief overview of each.

  • Conda is an open-source package and environment management system. It works by creating distinct virtual environments, installing a standalone interpreter or compiler within each one. You can then install additional packages within that virtual environment, completely isolated from your system-default packages and from other virtual environments.

  • For those of you who are familiar with the IPython notebook, Jupyter is an extension of this format to multiple languages. Jupyter provides an interactive, browser-based coding environment in the form of a notebook, which can be thought of as similar to a lightweight IDE. The power of Jupyter notebooks comes from a combination of (1) the ability to intersperse code with markdown, which is much more human-readable and friendly on the eye than traditional comments; (2) the cell-based format, where small pieces of code are contained in cells that can be run, and re-run, individually without re-running the remainder of your code; and (3) the ability to display figures, tables, and other output inline, rendered in HTML (a small notebook-cell sketch follows after this list).

  • Git is an open-source version control system. Version control is an essential bedrock of good programming that we don’t have space to cover in more detail here, but long story short, Git takes the headache out of version control.

  • GitHub is a code-hosting platform built for collaboration, with Git at its core. Beyond a simple code repository, GitHub enables collaboration and development through two key features. “Forking” allows you to clone other projects, and either develop them yourself or keep a record of a fixed version for integration within another project. “Pull requests” make large-scale community collaboration possible, with users proposing specific modifications to the original project, which the owners/admins of the original project can choose to merge or reject.
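As a small illustration of the notebook format (not taken from the talk itself), the following cell mixes code with an inline figure; running it in a Jupyter notebook renders the plot directly beneath the cell.

```python
# Example Jupyter notebook cell: the figure is rendered inline beneath the cell.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x))
plt.title('An inline figure rendered directly in the notebook')
plt.show()
```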

Experiments

As a toy problem to showcase this approach to building a reproducible pipeline, we address the problem of protein classification according to the SCOP classification scheme. While the dataset we have shared contains examples of protein pairs that are in the same fold, superfamily, and family (as well as none of these), we focussed on the most straightforward task of determining whether a pair of proteins belong to the same family or not.

Our dataset is based on the Astral data set (06.02.2016 build), and consists of 8 pairwise features computed from the sequences of the two proteins. We won’t go into the details of the exact features here.

Using a simple random forest on these 8 pairwise features between the target and template protein, we achieved an accuracy of 88.0% and an area under the receiver operating characteristic (ROC) curve of 0.95. A confusion matrix and ROC curve summarising our results can be found below.
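For readers who want a feel for this kind of pipeline, here is a minimal scikit-learn sketch; it is not our exact code, and the file and column names are hypothetical.

```python
# Minimal sketch of a random forest on pairwise features (not the exact code from the talk;
# 'pairwise_features.csv' and the 'same_family' column are hypothetical names).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv('pairwise_features.csv')
X, y = df.drop(columns=['same_family']), df['same_family']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print('Accuracy:', accuracy_score(y_test, clf.predict(X_test)))
print('ROC AUC :', roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
print(confusion_matrix(y_test, clf.predict(X_test)))
```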

Instructions to reproduce these results, together with all materials needed, can be found here and here.

Conclusions

Reproducibility in science is facing a challenging time. All stakeholders, from researchers to funders and publishers, are placing more emphasis on work being reproducible, and are taking measures to ensure this. In computational research, and in particular for stochastic algorithms such as those prevalent throughout machine learning, the problem is no less serious, even though, on the face of it, such problems should be readily solvable.

In our demonstration, we have illustrated one approach to tackling this in a simple, efficient way. In addition, we only looked to tackle one possible problem or question, and only used a subset of the overall dataset. Please feel free to explore the dataset and pose your own questions. We’d love to hear from you if you do!

Acknowledgements

I’d like to thank all of OPIG for providing feedback on an early version of the talk. Crucially, I’d like to thank Dr Saulo de Oliveira, who provided us with the dataset used in our exploratory analysis. Finally, I’d like to thank my co-presenter, Fergus Boyles, without whom I couldn’t have done this.

ISMB 2018 (Chicago): Summary of Interesting Talks/Posters

Catherine’s Selection

Network approach integrates 3D structural and sequence data to improve protein structural comparison

Why: Current graph-based mapping in protein structural comparison ignores the sequence order of residues; residues that are distant in sequence but close in 3D space are more important.
How: Introduce the sequence order of residues, set a sequence-distance cutoff to select structurally important residues, count graphlet frequencies, and embed the resulting vectors into PCA space (a generic sketch of this last step follows below).
Results: the new method is predictive of SCOP and CATH ‘groups’. Certain graphlets are enriched in alpha and beta folds.
Link: https://www.nature.com/articles/s41598-017-14411-y
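The snippet below is a generic sketch of that final embedding step (PCA on graphlet-frequency vectors), not the authors’ code; the feature matrix here is random placeholder data.

```python
# Generic sketch of embedding graphlet-frequency vectors with PCA
# (placeholder data; not the authors' code or features).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
graphlet_counts = rng.poisson(5, size=(100, 30)).astype(float)  # 100 proteins x 30 graphlet types

# normalise each protein's counts to frequencies, then project to 2D
freqs = graphlet_counts / graphlet_counts.sum(axis=1, keepdims=True)
coords = PCA(n_components=2).fit_transform(freqs)
print(coords.shape)  # (100, 2)
```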

Investigating the molecular determinants of Ebola virus pathogenicity

Why: Reston virus is the only Ebola virus that is not pathogenic to humans.
What they did: performed a multiple sequence alignment to look for specificity-determining positions (SDPs) using s3det, then predicted the effect of each individual SDP on the stability of the protein with mCSM.
Results: VP40 SDPs alter octamer formation and the structure of the hydrophobic core. VP24 SDPs impair binding to human KPNA5, an interaction that inhibits interferon signalling.
Impact: Only a few SDPs distinguish Reston VP24 from the VP24 of the other Ebola viruses, so human-pathogenic Reston viruses may emerge.
Link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5558184/#__ffn_sectitle

Computational Analysis Highlights Key Molecular Interactions and Conformational Flexibility of a New Epitope on the Malaria Circumsporozoite Protein and Paves the Way for Vaccine Design

Why: An antibody with a strong binding affinity was found in a group of subjects. This antibody prevents cleavage of the surface protein.
What they did: They found the linear epitope, crystallised the strong and medium binders, and ran molecular dynamics simulations to assess the flexibility of the structures.
Results: The strong binder is less flexible. Moreover, the strong binder is similar to the germline sequence, which may mean that this antibody could be readily formed.
Link: https://www.nature.com/articles/nm.4512



Matt’s Selection

“Analysis of sequence and structure data to understand nanobody architectures and antigen interactions”
Laura S. Mitchell (Colwell Group)
University of Cambridge, UK

This poster detailed the work from Laura’s two most recent publications, which can be found here: https://doi.org/10.1002/prot.25497, https://doi.org/10.1093/protein/gzy017

They describe a comprehensive analysis of the binding properties of the 156 non-redundant nanobody-antigen (Nb-Ag) complexes in the PDB/SAbDab (October 2017). Their analyses include Nb sequence variability (both global and across the binding regions), contact maps of nanobody-antigen interactions by region, and the typical chemical properties of each paratope. Nb-Ag complexes are compared to a reference set of monoclonal antibody-antigen (mAb-Ag) complexes. This work is a key first step in advancing our understanding of Nb paratopes, and will aid the development of new diagnostics and therapeutics.

“OSPREY 3.0: Open-Source Protein Redesign for You, with Powerful New Features”
Jeffrey W. Martin (Donald Group)
Duke University, USA

OSPREY 3.0 (https://www.biorxiv.org/content/early/2018/04/23/306324) represents a large advance towards time-efficient continuous flexibility modelling of protein-protein interfaces.

Its new algorithms, LUTE and BBK*, allow continuous rotamer flexibility searching and entropy-aware binding-constant approximation in a much more efficient manner. The CATS algorithm also introduces local backbone flexibility as a long-awaited feature. The software now has an easy-to-use Python interface and is fully open source, making it an extremely attractive alternative to proprietary protein design tools.

“Functional annotation of chemical libraries across diverse biological processes”
Scott Simpkins
University of Minnesota-Twin Cities, USA

This interesting talk detailed the work published in Nature Chemical Biology in September 2017 (https://doi.org/10.1038/nchembio.2436).

A collection of 310 yeast gene-deletion mutants was used to perform chemical-genetic profiling across six diverse small-molecule high-throughput screening libraries. By studying which gene-deletion mutants were hypersensitive or resistant to each compound, the researchers could assign most members of each chemical library a probable functional annotation. Mapping back to gene-interaction profile data also allowed them to infer likely targets for some compounds. The GO annotations associated with these genes could then be used to assess whether a given starting library is likely to contain promising starting points that affect a given biological function. For example, the authors highlighted a deficiency across all libraries against the cellular processes of cytokinesis and ribosome biogenesis. Conversely, they found a large enrichment across all libraries for compounds likely to affect glycosylation or cell wall biogenesis. Compounds that target transcription and chromatin organisation were found to be enriched in certain datasets and depleted in others. This genre of profiling provides researchers with a way of judging a priori whether a given screening library is likely to contain promising lead compounds, given the functional role of the target of interest.