Tag Archives: Bioinformatics

Mol2vec: Finding Chemical Meaning in 300 Dimensions

Embeddings of Amino Acids

2D projections (t-SNE) of Mol2vec vectors of amino acids (bold arrows). These vectors were obtained by summing the vectors of the Morgan substructures (small arrows) present in the respective molecules (amino acids in the present example). The directions of the vectors provide a visual representation of similarities. Magnitudes reflect importance, i.e. more meaningful words. [Figure from Ref. 1]

Natural Language Processing (NLP) algorithms are usually used for analyzing human communication, often in the form of textual information such as scientific papers and Tweets. One aspect, coming up with a representation that clusters words with similar meanings, has been achieved very successfully with the word2vec approach. This involves training a shallow, two-layer artificial neural network on a very large body of words and sentences — the so-called corpus — to generate “embeddings” of the constituent words into a high-dimensional space. By computing the vector from “woman” to “queen”, and adding it to the position of “man” in this high-dimensional space, the answer, “king”, can be found.
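To make the vector arithmetic concrete, here is a minimal sketch in plain Python. The three-dimensional vectors below are invented for illustration; real word2vec embeddings are learned from a corpus and have hundreds of dimensions. The function name `analogy` is my own, not part of any library.

```python
import math

# Toy 3-dimensional "embeddings" with invented values (assumption: real
# word2vec vectors are learned from text, not written by hand like this).
EMBEDDINGS = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.2, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, embeddings):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding the inputs."""
    target = [vb - va + vc
              for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


# The vector from "woman" to "queen", added to "man".
print(analogy("woman", "queen", "man", EMBEDDINGS))
```

With these toy values the nearest remaining word is "king"; with real embeddings the result depends on the corpus and the training settings.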

A recent publication by one of my former InhibOx colleagues, Simone Fulle, and her co-workers, Sabrina Jaeger and Samo Turk, shows how we can embed molecular substructures and chemical compounds into a similarly high-dimensional, continuous vectorial representation, which they dubbed “mol2vec” [1]. They also released a Python implementation, available on Samo Turk’s GitHub repository.
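As the figure caption notes, a molecule's vector is obtained by summing the vectors of the Morgan substructures it contains. The sketch below shows only that summation step. It does not use the mol2vec package: the substructure identifiers and two-dimensional vectors are invented, and silently skipping unknown substructures is a simplification of my own.

```python
def molecule_vector(substructure_ids, substructure_vectors):
    """Sum the vectors of the substructures found in one molecule."""
    dim = len(next(iter(substructure_vectors.values())))
    total = [0.0] * dim
    for sub_id in substructure_ids:
        vec = substructure_vectors.get(sub_id)
        if vec is None:
            continue  # assumption: unseen substructures are skipped in this sketch
        total = [t + v for t, v in zip(total, vec)]
    return total


# Hypothetical substructure identifiers and toy 2-dimensional vectors.
toy_vectors = {"sub_a": [1.0, 0.0], "sub_b": [0.5, 2.0]}

# A molecule in which "sub_a" occurs twice and "sub_b" once.
print(molecule_vector(["sub_a", "sub_b", "sub_a"], toy_vectors))
```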

 


Journal Club: Large-scale structure prediction by improved contact predictions and model quality assessment.

With the advent of statistical techniques to infer protein contacts from multiple sequence alignments (which you can read more about here), accurate protein structure prediction in the absence of a template has become possible. Taking advantage of this fact, there have been efforts to brave the sea of protein families for which no structure is known (about 8,500 – over 50% of known protein families) in an attempt to predict their topology. This is particularly exciting given that protein structure prediction has been an open problem in biology for over 50 years and, for the first time, the community is able to perform large-scale predictions and have confidence that at least some of those predictions are correct.

Based on these trends, last group meeting I presented a paper entitled “Large-scale structure prediction by improved contact predictions and model quality assessment”. This paper is the culmination of years of work, making use of a large number of computational tools developed by the Elofsson Lab at Stockholm University. With this blog post, I hope to offer some insights as to the innovative findings reported in their paper.

Let me begin by describing their structure prediction pipeline, PconsFold2. Their method for large-scale structure prediction can be broken down into three components: contact prediction, model generation and model quality assessment. As the very name of their article suggests, most of the innovation of the paper stems from improvements in contact prediction and the quality assessment protocols used, whereas for their model generation routine, they opted to sacrifice some quality in favour of speed. I will try and dissect each of these components over the next paragraphs.

Contact prediction relates to the process in which residues that share spatial proximity in a protein’s structure are inferred from multiple sequence alignments by co-evolution. I will not go into the details of how these protocols work, as they have been previously discussed in more detail here and here. The contact predictor used in PconsFold2 is PconsC3, which is another product of the Elofsson Lab. There was some weirdness with the referencing of PconsC3 in the PconsFold2 article, but after a quick Google search, I was able to retrieve the article describing PconsC3 and it was worth a read. Other than showcasing PconsC3’s state-of-the-art contact prediction capabilities, the original PconsC3 paper also provides figures for the number of protein families for which accurate contact prediction is possible (over 5,000 of the ~8,500 protein families in Pfam without a member of known structure). I found that the PconsC3 article feels like a prequel to the paper I presented. The bottom line here is that PconsC3 is a reliable tool for predicting contacts from multiple sequence alignments and is a sensible choice for the PconsFold2 pipeline.

Another aspect of contact prediction that the authors explore is the idea that the precision of contact prediction is dependent on the quality of the underlying multiple sequence alignment (MSA). They provide a comparison of the Positive Predictive Value (PPV) of PconsC3 using different MSAs on a test set of 626 protein domains from Pfam. This is the first time I have encountered such a comparison, and it serves to highlight the importance the MSA has on the quality of the resulting contact predictions. In the PconsFold2 pipeline, the authors use a consensus approach; they identify the consensus of four predicted contact maps, each using a different alignment. Alignments were generated using Jackhmmer and HHBlits at E-value cutoffs of 1 and 10^-4.
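For readers unfamiliar with PPV, it is simply the fraction of predicted contacts that are real contacts in the known structure. A minimal sketch follows; the residue pairs are invented and the function name is my own.

```python
def positive_predictive_value(predicted, true_contacts):
    """Fraction of predicted residue pairs that are true contacts (TP / (TP + FP))."""
    # Store each pair as (low, high) so that (30, 7) and (7, 30) compare equal.
    predicted = {tuple(sorted(pair)) for pair in predicted}
    true_contacts = {tuple(sorted(pair)) for pair in true_contacts}
    if not predicted:
        return 0.0
    return len(predicted & true_contacts) / len(predicted)


# Invented example: four predicted pairs, two of which are real contacts.
predicted_pairs = [(1, 10), (5, 20), (30, 7), (2, 40)]
observed_pairs = [(1, 10), (7, 30), (8, 50)]
print(positive_predictive_value(predicted_pairs, observed_pairs))
```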

Now, moving on to the model generation routine. PconsFold2 makes use of CONFOLD to perform model generation. CONFOLD, in turn, uses the simulated annealing routine of the Crystallographic and NMR System (CNS) to produce models based on spatial and geometric constraints. To derive those constraints, predicted secondary structure and the top 2.5 L predicted contacts are given as input. The authors do note that the refinement stage of CONFOLD is omitted, which is a convenience I assume was adopted to save computational time. The article also acknowledges that models generated by CONFOLD are likely to be less accurate than the ones produced by Rosetta, yet a compromise was made in order to make the large-scale comparison feasible in terms of resources.
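Selecting the "top 2.5 L" contacts just means ranking the predicted pairs by score and keeping the first 2.5 × L of them, where L is the sequence length. This is a sketch of that selection only, with a made-up `(i, j, score)` tuple layout; it is not how CONFOLD reads its input.

```python
def top_contacts(scored_contacts, sequence_length, factor=2.5):
    """Keep the highest-scoring factor * L contacts from (i, j, score) tuples."""
    n_keep = int(factor * sequence_length)
    ranked = sorted(scored_contacts, key=lambda contact: contact[2], reverse=True)
    return ranked[:n_keep]


# Invented scores; a small factor is used so the toy list is actually truncated.
toy_contacts = [(1, 5, 0.9), (2, 8, 0.1), (3, 9, 0.7), (1, 9, 0.4)]
print(top_contacts(toy_contacts, sequence_length=4, factor=0.5))
```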

One particular issue that we often discuss when performing structure prediction is the number of models that should be produced for a particular target. The authors performed a test to assess how many decoys should be produced and, albeit simplistic in their formulation, their results suggest that 50 models per target should be sufficient. Increasing this number further did not lead to improvements in the average quality of the best models produced for their test set of 626 proteins.

After producing 50 models using CONFOLD, the final step in the PconsFold2 protocol is to select the best possible model from this ensemble. Here, they present a novel method, PcombC, for ranking models. PcombC combines the clustering-based method Pcons, the single-model deep learning method ProQ3D, and the proportion of predicted contacts that are present in the model. These three scores are combined linearly, and are given weights that were optimised via a parameter sweep. One of my reservations relating to this paper is that little detail is given regarding the data set that was used to perform this training. It is unclear from their methods section if the parameter sweep was trained on the test set with 626 proteins used throughout the manuscript. Given that no other data set (with known structures) is ever introduced, this scenario seems likely. Therefore, all the classification results obtained by PcombC, and all of the reported TM-score Top results should be interpreted with care, since performance on a validation set tends to be poorer than on a training set.
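A linear combination of three scores is easy to sketch. The weights below are placeholders I picked for the example and are not taken from the paper; the model names and score values are also invented.

```python
# Placeholder weights for (Pcons, ProQ3D, fraction of satisfied contacts).
# Assumption: illustrative values only, not the optimised weights of PcombC.
PLACEHOLDER_WEIGHTS = (0.5, 0.25, 0.25)


def combined_score(pcons, proq3d, contact_fraction, weights=PLACEHOLDER_WEIGHTS):
    """Weighted sum of the three quality scores for one model."""
    w_pcons, w_proq3d, w_contacts = weights
    return w_pcons * pcons + w_proq3d * proq3d + w_contacts * contact_fraction


def rank_models(model_scores, weights=PLACEHOLDER_WEIGHTS):
    """Return model names sorted from best to worst combined score."""
    return sorted(model_scores,
                  key=lambda name: combined_score(*model_scores[name], weights=weights),
                  reverse=True)


# Invented (Pcons, ProQ3D, contact fraction) triples for two decoys.
toy_models = {"model_a": (0.8, 0.4, 0.6), "model_b": (0.4, 0.8, 0.8)}
print(rank_models(toy_models))
```

Note how the ranking depends on the weights: with a heavier ProQ3D weight, `model_b` would come out on top, which is why the data set used to tune the weights matters.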

Recapitulating the PconsFold2 pipeline:

  • Step 1: Generate four multiple sequence alignments using HHBlits and Jackhmmer.
  • Step 2: Generate four predicted contact maps using PconsC3.
  • Step 3: Use CONFOLD to produce 50 models using a consensus of the contact maps from step 2.
  • Step 4: Use PcombC to rank the models based on a linear combination of the Pcons and ProQ3D scores and the proportion of predicted contacts that are present in the model.

So, how well does PconsFold2 perform? The conclusion is that it depends on the quality of the contact predictions. For the protein families where abundant sequence information is available, PconsFold2 produces a correct model (TM-Score > 0.5) for 51% of the cases. This is great news. First, because we know which cases have abundant sequence information beforehand. Second, because this comprises a large number of protein families of unknown structure. As the number of effective sequences (a common way to assess the amount of information available in an MSA) decreases, the proportion of families for which a correct model has been generated also decreases, which restricts the applicability of their method to protein families with abundant sequence information. Nonetheless, given that protein sequence databases are growing exponentially, the number of cases where protein structure prediction achieves success is likely to increase over the next few years.
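One common way to compute the number of effective sequences is to down-weight each sequence by the number of near-identical sequences in the alignment. The sketch below assumes an 80% identity threshold; thresholds and normalisations vary between tools, and I am not claiming this is the exact definition used in the paper. The toy alignment is invented.

```python
def sequence_identity(seq_a, seq_b):
    """Fraction of aligned positions at which two equal-length sequences match."""
    matches = sum(1 for x, y in zip(seq_a, seq_b) if x == y)
    return matches / len(seq_a)


def effective_sequences(alignment, identity_threshold=0.8):
    """Sum of 1 / (number of sequences at or above the identity threshold)."""
    neff = 0.0
    for seq in alignment:
        # Each sequence counts itself, so n_similar is always at least 1.
        n_similar = sum(1 for other in alignment
                        if sequence_identity(seq, other) >= identity_threshold)
        neff += 1.0 / n_similar
    return neff


# Three near-identical sequences plus one unrelated sequence.
toy_alignment = ["ACDEFGHIKL", "ACDEFGHIKL", "ACDEFGHIKV", "WWWWWWWWWW"]
print(effective_sequences(toy_alignment))
```

Four sequences go in, but the three near-duplicates together count as one, so the alignment carries about two sequences' worth of information.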

One interesting detail that I was curious about was the length distribution of the cases where modelling was successful. Can we detect the cases for which good models were produced simply by looking at a combination of length and number of effective sequences? The authors never address this question, and I think it would provide some nice insights as to which protein features are correlated to modelling success.

We are still left with one final problem to solve: how do we separate the cases for which we have a correct model from the ones where modelling has failed? This is what the authors address with the last two subsections of their Results. In the first of these sections, the authors compare four ways of ranking decoys: PcombC, Pcons, ProQ3D, and the CNS contact score. They report that, for the test set of 626 proteins, PcombC obtains the highest Pearson’s Correlation Coefficient (PCC) between the predicted and observed TM-Score of the highest ranking models. As mentioned before, this measure could be overestimated if PcombC was, indeed, trained on this test set. Reported PCCs are as follows: PcombC = 0.79, Pcons = 0.73, ProQ3D = 0.67, and CNS-contact = -0.56.
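For reference, Pearson's correlation coefficient between predicted and observed scores can be computed as below. This is the textbook formula in plain Python, with invented numbers. A negative value, as reported for the CNS contact score, means that score falls as the TM-Score rises.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Invented predicted vs. observed quality scores for four top-ranked models.
predicted_quality = [0.2, 0.4, 0.6, 0.8]
observed_tm_score = [0.25, 0.35, 0.65, 0.75]
print(round(pearson(predicted_quality, observed_tm_score), 3))
```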

In their final analysis, the authors compare the ability of each of the different Quality Assessment (QA) scores to discern between correct and incorrect models. To do this, they only consider the top-ranked model for each target according to different QA scores. They vary the false positive rate and note the number of true positives they are able to recall. At a 10% false positive rate, PcombC is able to recall about 50% of the correct models produced for the test set. This is another piece of good news. The bottom line is: if we have sufficient sequence information available, PconsFold2 can generate a correct model 51% of the time. Furthermore, it can detect 50% of these cases, meaning that for ~25% of the cases it produced something good and it knows the model is good. This opens the door for looking at these protein families with no known structure and trying to accurately predict their topology.
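The "recall at a fixed false positive rate" idea can be sketched as follows: walk down the ranked list of models and record the best recall reached while the false positive rate stays under the limit. The scores and labels are invented, tied scores are not handled specially, and the function name is mine. The ~25% figure above is just 0.51 × 0.50 ≈ 0.255.

```python
def recall_at_fpr(scores, is_correct, max_fpr=0.10):
    """Best recall of correct models reachable while keeping FPR <= max_fpr."""
    positives = sum(1 for flag in is_correct if flag)
    negatives = len(is_correct) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one correct and one incorrect model")
    ranked = sorted(zip(scores, is_correct), key=lambda pair: pair[0], reverse=True)
    best_recall = 0.0
    true_pos = false_pos = 0
    for _score, correct in ranked:
        if correct:
            true_pos += 1
        else:
            false_pos += 1
        # Only accept this cut-off if the false positive rate is still allowed.
        if false_pos / negatives <= max_fpr:
            best_recall = true_pos / positives
    return best_recall


# Invented QA scores and correctness labels for six top-ranked models.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_labels = [True, True, False, True, False, False]
print(recall_at_fpr(toy_scores, toy_labels, max_fpr=0.0))
```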

That is exactly what the authors did! In the most interesting section of the paper (in my opinion), the authors predict the topology of 114 protein families (at FPR of 1%) and 558 protein families (at FPR of 10%). Furthermore, the authors compare the overlap of their results with the ones reported by a similar study from the Baker group (previously presented at group meeting here) and find that, at least for some cases, the predictions agree. These large-scale efforts force us to revisit the way we see template-free structure prediction, which can no longer be dismissed as an unviable way of obtaining structural models when sufficient sequences are available. This is a remarkable achievement for the protein structure prediction community, with the potential to change the way we conduct structural biology research.

Is contacts-based protein-protein affinity prediction the way forward?

The binding affinity of protein interactions is useful information for a range of protein engineering and protein-protein interaction (PPI) network challenges. Obvious applications include the development of therapeutic antibodies to given drug targets or the engineering of novel interfaces for synthetic protein complexes. An accurate model would furthermore allow us to predict a large proportion of affinities in existing PPI networks, and enable the identification of new PPIs, which is critical for our ability to model protein network dynamics effectively.


“The design of an ideal scoring function for protein−protein docking that would also predict the binding affinity of a complex is one of the challenges in structural proteomics.” Adapted from Kastritis, Panagiotis L., and Alexandre MJJ Bonvin. Journal of proteome research 9.5 (2010): 2216-2225.

In last week’s paper, a new binding-affinity prediction method based on interfacial contact information was described. Contacts have long been used in docking methods but, surprisingly, this was the first time that binding affinity was predicted with them. Largely, this was due to the lack of a suitable benchmark data set that contained structural as well as affinity data. In 2011, however, Kastritis et al. presented a curated database of 144 non-redundant protein–protein complexes with experimentally determined Kd (ΔG) as well as X-ray structures.
Using this data set they trained and validated their method, compared it against others and concluded that interfacial contacts “can be considered the best structural property to describe binding strength”. This claim may be true but, as we discussed in the meeting, there is still some work to do before we take this model and run with it. A number of flags were raised:

  • Classification of experimental methods into reliable and non-reliable is based on what gives the best results with their method. Given that different types of protein complexes are often measured with different methods, some protein classes for which contact-based predictions are less effective may be excluded.
  • The number of parameters for model 6 is problematic without exact AIC information. As Lyuba rightly pointed out, the intercept in model 6 “explodes”. It is no surprise that the correlation improves with more parameters. Despite their AIC analysis, overfitting is still a worry due to the lack of details presented in the paper.


  • Comparison against other methods is biased in their favour; their method was trained on the same data set, the others were not. In order to ensure a fair comparison, all methods should be trained on the same data set. Of course this is hard to do in practice, but the fact remains that a comparison of methods that have been trained on different data sets will be flawed.

Paper: Vangone, A., & Bonvin, A. M. J. J. (2015). Contacts-based prediction of binding affinity in protein-protein complexes. eLife, 4, e07454. http://doi.org/10.7554/eLife.07454

Is “fragment-based” still the way forward in template-free protein structure prediction?

Out of the many questions surrounding the notion that you can predict a protein’s structure from its sequence, there is one in particular that I decided to tackle during last group meeting.

Protein structure prediction is a hard problem (do I sound repetitive?). One of the many cop-outs employed by the structure prediction community is the idea that you can break down known structures into fragments and use these protein pieces to perform predictions. This is known as fragment-assembly or fragment-based template-free protein structure prediction.

As absurd as the idea may seem, there is robust evidence that suggests that this is actually a viable strategy. There is a notion that the fragment space is complete; you can reconstruct the backbone of any known structure based on the torsion angles of fragments from other structures. In less technical jargon, you can effectively use fragments and combine them to re-create any of the protein structures that we know and to a fairly acceptable level of precision.

So, technically, it is possible to predict a protein structure using fragments from other structures. In practice, you are still left with the problem of choosing the right fragments to model your sequence of interest. How easy do you think that is?

We can look at this question in light of observations that were made back in the early 80s. Kabsch and Sander reported that two protein fragments having exactly the same sequence can present completely different structures [1]. This complies with the notion that global properties can affect and even define local structure, which in turn suggests that selecting the right fragments to assemble a structure is not necessarily a straightforward process.
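To illustrate the kind of observation Kabsch and Sander made, the sketch below looks for identical pentapeptides shared by two sequences and reports those whose secondary-structure strings differ. The sequences and the H/E/C annotations are invented, and comparing secondary-structure strings is my own simplification of "different conformations".

```python
def conflicting_fragments(seq_a, ss_a, seq_b, ss_b, k=5):
    """Find identical k-mers in two sequences whose secondary structure differs.

    Returns a list of (fragment, start_in_a, start_in_b, ss_in_a, ss_in_b).
    """
    # Index every k-mer of the first sequence by its start positions.
    index = {}
    for i in range(len(seq_a) - k + 1):
        index.setdefault(seq_a[i:i + k], []).append(i)

    hits = []
    for j in range(len(seq_b) - k + 1):
        fragment = seq_b[j:j + k]
        for i in index.get(fragment, []):
            if ss_a[i:i + k] != ss_b[j:j + k]:
                hits.append((fragment, i, j, ss_a[i:i + k], ss_b[j:j + k]))
    return hits


# Invented sequences with per-residue annotations (H = helix, E = strand, C = coil).
toy_seq_a, toy_ss_a = "MKVLAAGIVE", "CHHHHHHHHC"
toy_seq_b, toy_ss_b = "TVLAAGST", "CEEEEECC"
print(conflicting_fragments(toy_seq_a, toy_ss_a, toy_seq_b, toy_ss_b))
```

Here the shared pentapeptide is helical in one toy protein and a strand in the other, so local sequence alone would not tell a fragment picker which conformation to choose.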

The starting point for protein structure prediction is a sequence. Since we are talking about template-free protein structure prediction, it is safe to assume that there is no good global sequence match to your target with a known structure (otherwise you would use that match/structure as a template). Hence, fragment selection is restricted to local sequence similarity, which, as suggested in the previous paragraph, is not necessarily ideal.

On the other hand, we are becoming increasingly accurate at inferring one-dimensional properties from a protein’s sequence. These properties can be, and often are, used to enhance our fragment-selection capabilities. Yet, even using the state of the art in secondary structure and torsion angle prediction, fragment selection is still fairly imprecise.

During group meeting I highlighted a possible contrast between practical fragment space and general (or possible) fragment space. My premise is simple. I define practical fragment space as the fragments that we can accurately select from the possible fragment space to model protein structures. In my opinion, it would be extremely interesting to quantify the difference between the two. This would answer the fundamental question of how useful fragment-assembly actually is. More importantly, it would help the community make an educated decision with regard to whether template-free structure prediction strategies should shift from fragment-based ones to ones based on distance constraints, an approach that is gaining popularity due to the success of contact predictions.

I am very keen to investigate this further. Maybe for my next blog post, we will have an answer! Stay tuned.

[1] Kabsch, Wolfgang, and Christian Sander. “On the use of sequence homologies to predict protein structure: identical pentapeptides can have completely different conformations.” Proceedings of the National Academy of Sciences 81.4 (1984): 1075-1078.

Network Pharmacology

The dominant paradigm in drug discovery has been one of finding small molecules (or more recently, biologics) that bind selectively to one target of therapeutic interest. This reductionist approach conveniently ignores the fact that many drugs do, in fact, bind to multiple targets. Indeed, systems biology is uncovering an unsettling picture for comfortable reductionists: the so-called ‘magic bullet’ of Paul Ehrlich, a single compound that binds to a single target, may be less effective than a compound with multiple targets. This new approach—network pharmacology—offers new ways to improve drug efficacy, to rescue orphan drugs, re-purpose existing drugs, predict targets, and predict side-effects.

Building on work Stuart Armstrong and I did at InhibOx, a spinout from the University of Oxford’s Chemistry Department, and inspired by the work of Shoichet et al. (2007), Álvaro Cortes-Cabrera and I took our ElectroShape method, designed for ultra-fast ligand-based virtual screening (Armstrong et al., 2010 & 2011), and built a new way of exploring the relationships between drug targets (Cortes-Cabrera et al., 2013). Ligand-based virtual screening is predicated on the molecular similarity principle: similar chemical compounds have similar properties (see, e.g., Johnson & Maggiora, 1990). ElectroShape built on the earlier pioneering USR (Ultra-fast Shape Recognition) work of Pedro Ballester and Prof. W. Graham Richards at Oxford (Ballester & Richards, 2007).

Our new approach addressed two inherent limitations of the network pharmacology approaches available at the time:

  • Chemical similarity is calculated on the basis of the chemical topology of the small molecule; and
  • Structural information about the macromolecular target is neglected.

Our method addressed these issues by taking into account 3D information from both the ligand and the target.

The approach involved comparing the similarity of each set of ligands known to bind to a protein to the equivalent sets of ligands of all other known drug targets in DrugBank. DrugBank is a tremendous “bioinformatics and cheminformatics resource that combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information.” This analysis generated a network of related proteins, connected by the similarity of the sets of ligands known to bind to them.

We looked at two different kinds of ligand similarity metrics: we took the inverse Manhattan distance of our ElectroShape descriptor and compared it to 2D Morgan fingerprints, calculated using the wonderful open source cheminformatics toolkit, RDKit, from Greg Landrum. Morgan fingerprints use connectivity information similar to that used for the well-known ECFP family of fingerprints, which had been used in the SEA method of Keiser et al. We also looked at the problem from the receptor side, comparing the active sites of the proteins. These complementary approaches produced networks that shared a minimal fraction (0.36% to 6.80%) of nodes: while the direct comparison of target ligand-binding sites could give valuable information in order to achieve some kind of target specificity, ligand-based networks may contribute information about unexpected interactions for side-effect prediction and polypharmacological profile optimization.
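An inverse-Manhattan-distance similarity between two descriptor vectors can be sketched as below. The normalisation (one over one plus the mean absolute difference) follows the USR-style scoring as I remember it; treat it as an assumption and check the ElectroShape papers for the exact form. The descriptor values are invented.

```python
def inverse_manhattan_similarity(descriptor_a, descriptor_b):
    """Similarity in (0, 1]: 1 / (1 + mean absolute difference of the descriptors)."""
    if len(descriptor_a) != len(descriptor_b):
        raise ValueError("descriptors must have the same length")
    mean_abs_diff = sum(abs(a - b) for a, b in zip(descriptor_a, descriptor_b)) / len(descriptor_a)
    return 1.0 / (1.0 + mean_abs_diff)


# Invented 4-element descriptors; real shape descriptors are longer.
ligand_1 = [2.0, 1.5, 0.3, 4.1]
ligand_2 = [2.0, 1.5, 0.3, 4.1]
ligand_3 = [4.0, 3.5, 2.3, 6.1]
print(inverse_manhattan_similarity(ligand_1, ligand_2))
print(inverse_manhattan_similarity(ligand_1, ligand_3))
```

Identical descriptors give a similarity of exactly 1, and the score decays towards 0 as the descriptors move apart, which makes it convenient as an edge weight in a similarity network.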

Our new target-fishing approach was able to predict drug adverse effects, build polypharmacology profiles, and relate targets from two complementary viewpoints: ligand-based and target-based networks. We used the DUD and WOMBAT benchmark sets for on-target validation, and the results were directly comparable to those obtained using other state-of-the-art target-fishing approaches. Off-target validation was performed using a limited set of non-annotated secondary targets for already known drugs. Comparison of the predicted adverse effects with data contained in the SIDER 2 database showed good specificity and reasonable selectivity. All of these features were implemented in a user-friendly web interface that: (i) can be queried for both polypharmacology profiles and adverse effects, (ii) links to related targets in ChEMBLdb in the three networks (2D, 4D ligand and 3D receptor), and (iii) displays the 2D structure of already annotated drugs.


References

Armstrong, M. S., G. M. Morris, P. W. Finn, R. Sharma, L. Moretti, R. I. Cooper and W. G. Richards (2010). “ElectroShape: fast molecular similarity calculations incorporating shape, chirality and electrostatics.” J Comput Aided Mol Des, 24(9): 789-801. 10.1007/s10822-010-9374-0.

Armstrong, M. S., P. W. Finn, G. M. Morris and W. G. Richards (2011). “Improving the accuracy of ultrafast ligand-based screening: incorporating lipophilicity into ElectroShape as an extra dimension.” J Comput Aided Mol Des, 25(8): 785-790. 10.1007/s10822-011-9463-8.

Ballester, P. J. and W. G. Richards (2007). “Ultrafast shape recognition to search compound databases for similar molecular shapes.” J Comput Chem, 28(10): 1711-1723. 10.1002/jcc.20681.

Cortes-Cabrera, A., G. M. Morris, P. W. Finn, A. Morreale and F. Gago (2013). “Comparison of ultra-fast 2D and 3D ligand and target descriptors for side effect prediction and network analysis in polypharmacology.” Br J Pharmacol, 170(3): 557-567. 10.1111/bph.12294.

Johnson, M. A. and G. M. Maggiora (1990). “Concepts and Applications of Molecular Similarity.” New York: John Wiley & Sons.

Landrum, G. (2011). “RDKit: Open-source cheminformatics.” from http://www.rdkit.org.

Keiser, M. J., B. L. Roth, B. N. Armbruster, P. Ernsberger, J. J. Irwin and B. K. Shoichet (2007). “Relating protein pharmacology by ligand chemistry.” Nat Biotechnol, 25(2): 197-206. 10.1038/nbt1284.

Wishart, D. S., C. Knox, A. C. Guo, S. Shrivastava, M. Hassanali, P. Stothard, Z. Chang and J. Woolsey (2006). “DrugBank: a comprehensive resource for in silico drug discovery and exploration.” Nucleic Acids Res, 34(Database issue): D668-672. 10.1093/nar/gkj067.

Slow and steady improvements in the prediction of one-dimensional protein features

What do you do when you have a big, complex problem whose solution is not necessarily trivial? You break the problem into smaller, easier-to-solve parts, solve each of these sub-problems and merge the results to find the solution of the original, bigger problem. This is an algorithm design paradigm known as the divide and conquer approach.

In protein informatics, we use divide and conquer strategies to deal with a plethora of large and complicated problems. From protein structure prediction to protein-protein interaction networks, we have a wide range of sub and sub-sub problems whose solutions are supposed to help us with the bigger picture.

In particular, the prediction of so-called one-dimensional protein features is a fundamental sub-problem with a wide range of applications, such as protein structure modelling, homology detection, functional characterization and others. Here, one-dimensional protein features refer to secondary structure, backbone dihedral and C-alpha angles, and solvent accessible surface area.

In this week’s group meeting, I discussed the latest advancements in prediction of one-dimensional features as described in an article published by Heffernan R. and colleagues in Scientific Reports (2015):

“Improving prediction of secondary structure, local backbone angles, and solvent accessible surface area of proteins by iterative deep learning.”

In this article, the authors describe the implementation of SPIDER2, a deep learning approach to predict secondary structure, solvent accessible surface area, and four backbone angles (the traditional dihedrals phi and psi, and the recently explored theta and tau).

“Deep learning” is the buzzword (buzz-two-words or buzzsentence, maybe?) of the moment. For those of you who have no idea what I am talking about, deep learning is an umbrella term for a series of convoluted machine learning methods. The term deep comes from the multiple hidden layers of neurons used during learning.

Deep learning is a very fashionable term for a reason. These methods have been shown to produce state-of-the-art results for a wide range of applications in several fields, including bioinformatics. As a matter of fact, one of the leading methods for contact prediction (previously introduced in this blog post), uses a deep learning approach to improve the precision of predicted protein contacts.

Machine learning has already been explored to predict one-dimensional protein features, showing promising (and more importantly, useful) results. With the emergence of new, more powerful machine learning techniques such as deep learning, previous software is now becoming obsolete.

Based on this premise, Heffernan R. and colleagues implemented and applied their deep learning approach to improve the prediction of one-dimensional protein features. Their training process was rigorous: they performed a 10-fold cross validation using their training set of ~4500 proteins and, on top of that, they also had two independent test sets (a ~1200 protein test set and a set based on the targets of CASP11). No protein in any set shared more than 25% sequence identity (30% for the CASP set) with any other protein in any of the sets.
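For anyone who has not met it before, 10-fold cross validation splits the training set into ten parts and holds out each part once while training on the other nine. The sketch below only generates the index splits; it does no shuffling and has nothing to do with how the authors actually partitioned their proteins.

```python
def k_fold_splits(n_items, k=10):
    """Yield (train_indices, test_indices) for each of k folds, without shuffling."""
    indices = list(range(n_items))
    # Deal the indices into k folds like cards: 0, k, 2k, ... go to fold 0, etc.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for f, fold in enumerate(folds) if f != held_out for idx in fold]
        yield train, test


# Toy run with 20 items, so that each fold holds out exactly two of them.
toy_splits = list(k_fold_splits(20, k=10))
print(len(toy_splits), toy_splits[0][1])
```

Every item is held out exactly once, so each performance estimate comes from data the model did not see during that round of training.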

The method described in the paper, SPIDER2, was thoroughly compared with state-of-the-art prediction software for each of the one-dimensional protein features that it is capable of predicting. Results show that SPIDER2 achieves a small, yet significant improvement compared to other methods.

It is just like they say, slow and steady wins the race, right? In this case, I am not so sure. It would be interesting to see how much the small increments in precision obtained by SPIDER2 can improve the bigger picture, whichever your bigger picture is. The thing about divide and conquer is that if you become marginally better at solving one of the parts, that doesn’t necessarily imply that you will improve the solution of the bigger, main problem.

If we think about it, during the “conquer” stage (that is, when you are merging the solution of the smaller parts to get to the bigger picture), you may make compromises that completely disregard any minor improvements for the sub-problems. For instance, in my bigger picture, de novo protein structure prediction, predicted local properties can be sacrificed to ensure a more globally consistent model. More than that, most methods that perform de novo structure prediction already account for a certain degree of error or uncertainty for, say, secondary structure prediction. This is particularly important for the border regions between secondary structure elements (i.e. where an alpha-helix ends and a loop begins). Therefore, even if you improve the precision of your predictions for those border regions, the best approach for structure prediction may still consider those slightly more precise border predictions as unreliable.

The other moral of this story is far more pessimistic. If you think about it, there were significant advancements in machine learning, which led to the creation of ever more complicated neural network architectures. However, when we look back at how much improvement we observed when these highly elaborate techniques were applied to an old problem (prediction of one-dimensional protein features), it seems that the pay-off wasn’t as significant (at least not as much as I would expect). Maybe I am a glass-half-empty kind of guy, but given the buzz surrounding deep learning, I think minor improvements are a bit of a letdown. Not to take any credit away from the authors. Their work was rigorous and scientifically very sound. It is just that maybe we are reaching our limits when it comes to applying machine learning to predict secondary structure. Maybe when the next generation of buzzword-worthy machine learning techniques appears, we will observe an even smaller improvement to secondary structure prediction. Which leaves a very bitter unanswered question in all our minds: if machine learning is not the answer, what is?

Predicted protein contacts: is it the solution to (de novo) protein structure prediction?

So what is this buzz I hear about predicted protein contacts? Is it really the long awaited solution for one of the biggest open problems in biology today? Has protein structure prediction been solved?

Well, first things first. Let me give you a quick introduction to this predicted protein contact business (probably not quick enough for an elevator pitch, but hopefully you are not reading this in an elevator).

Nowadays, the scientific community has become very good at sequencing things (and by things I mean genetic things, like whole genomes of a bunch of different people and organisms). We are so good at it that mountains of sequence data are now available: genes, mRNAs, protein sequences. The question is what do we do with all this data?

Good scientists are coming up with new and creative ideas to extract knowledge from these mountains of data. For instance, one can build multiple sequence alignments using protein sequences for a given protein family. One of the ways in which information can be extracted from these multiple sequence alignments is by identifying extremely conserved columns (think of the alignment as a big matrix). Residues in these conserved positions are good candidates for being functionally important for the proteins in that particular family.

Another interesting thing that can be done is to look for pairs of residues that are mutating in a correlated fashion. In more practical terms, you are ascertaining how correlated the information in two columns of a multiple sequence alignment is: how often a change in one of them is countered by a change in the other. Why would anyone care about that? Simple. There is an assumption that residues that mutate in a correlated fashion are co-evolving. In other words, they share some sort of functional dependence (e.g. spatial proximity) that is under selective pressure.
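To make "how correlated two columns are" concrete, here is a minimal sketch that scores a pair of alignment columns with mutual information. Mutual information is only one classic choice of score and is an assumption here; the function name `column_mutual_information` and the toy alignment are made up for illustration and are not taken from any particular contact prediction method.

```python
from collections import Counter
from math import log2


def column_mutual_information(msa, i, j):
    """Mutual information (in bits) between columns i and j of an alignment.

    `msa` is a list of equal-length aligned sequences (hypothetical toy input).
    """
    n = len(msa)
    col_i = [seq[i] for seq in msa]
    col_j = [seq[j] for seq in msa]
    freq_i = Counter(col_i)               # single-column residue counts
    freq_j = Counter(col_j)
    freq_ij = Counter(zip(col_i, col_j))  # counts of residue pairs
    mi = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((freq_i[a] / n) * (freq_j[b] / n)))
    return mi


# Toy alignment: columns 1 and 2 always change together (K with L, R with I),
# whereas columns 1 and 3 vary independently of each other.
toy_msa = ["AKLE", "AKLD", "ARIE", "ARID"]

print(column_mutual_information(toy_msa, 1, 2))  # coupled columns
print(column_mutual_information(toy_msa, 1, 3))  # independent columns
```

A score like this is computed for every pair of columns, and the highest-scoring pairs are taken as candidate co-evolving residues.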

Ok, that was a lot of hypotheticals. Does it work? For many years, it didn’t. There were lots of issues with the way these correlations were computed, and one of the biggest problems was to identify (and correct for) transitivity. Transitivity is the idea that you observe a false correlation between residues A and C because residues A and B, and residues B and C, are mutating in a correlated fashion. As more powerful statistical methods were developed (borrowing some ideas from statistical mechanics), the transitivity issue has seemingly been solved.

The newest methods that detect co-evolving residues in a multiple sequence alignment are capable of detecting protein contacts with high precision. In this context, a contact is defined as two residues that are close together in a protein structure. How close?  Their C-betas must be 8 Angstroms or less apart. When sufficient sequence information is available (at least 500 sequences in the MSA), the average precision of the predicted contacts can reach 80%.
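As a small illustration of the two definitions above, the sketch below derives contacts from C-beta coordinates using the 8 Angstrom cutoff and then scores a list of predicted contacts by precision. The coordinates and predictions are invented toy values, and the function names are hypothetical.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def contact_set(cb_coords, cutoff=8.0):
    """Residue pairs (i, j), i < j, whose C-beta atoms are at most `cutoff` Angstroms apart."""
    contacts = set()
    for i in range(len(cb_coords)):
        for j in range(i + 1, len(cb_coords)):
            if dist(cb_coords[i], cb_coords[j]) <= cutoff:
                contacts.add((i, j))
    return contacts


def precision(predicted, true_contacts):
    """Fraction of predicted contacts that are present in the true contact set."""
    predicted = set(predicted)
    if not predicted:
        return 0.0
    return len(predicted & true_contacts) / len(predicted)


# Toy C-beta coordinates (Angstroms) for four residues placed along a line.
toy_coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
true_contacts = contact_set(toy_coords)

# Hypothetical predictions: two of the four pairs are real contacts.
predicted = [(0, 1), (0, 2), (1, 2), (2, 3)]
print(true_contacts, precision(predicted, true_contacts))
```

Real evaluations usually also ignore pairs that are close in sequence, a refinement left out of this sketch.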

This is a powerful way of converting sequence information into distance constraints, which can be used for protein structure modelling. If a sufficient number of correct distance constraints is used, we can accurately predict the topology of a protein [1]. Recently, we have also observed great advances in the way that models are refined (that is, refining a model that contains the correct topology to atomic, near-experimental resolution). If you put those two things together, we start to look at a very nice picture.

So what’s the catch? The catch was there. Very subtle. “When sufficient sequence information is available”. Currently, there is an estimate that only 15% of the de novo protein structure prediction cases present sufficient sequence information for the prediction of protein contacts. One potential solution would be to sit and wait for more and more sequences to be obtained. Yet a potential pitfall of sitting and waiting is that there is no guarantee that we will ever have sufficient sequence information for a large number of protein families, as they may well have fewer than 500 members.

Furthermore, scientists are not very good at sitting around and waiting. They need to keep themselves busy. There are many things that the community as a whole can invest time in while we wait for more sequences to be generated. For instance, we want to be sure that, for the cases where there is a sufficient number of sequences, we get the modelling step right (and predict the correct protein topology). Predicted contacts also show potential as a tool for quality assessment and may prove to be a nice way of ascertaining whether you can be confident that a model with correct topology was created. More than that, model refinement still needs to improve if we want to make sure that we get from the correct topology to near-experimental resolution.

Protein structure prediction is a hard problem and, with so much room for improvement, we still have a long way to go. Yet, this predicted contact business is a huge step in the right direction. Maybe it won’t be long before models generated ab initio are considered as reliable as the ones generated using a template. Who knows what promises the future holds.

References:

[1] Kim DE, Dimaio F, Yu-Ruei Wang R, Song Y, Baker D. One contact for every twelve residues allows robust and accurate topology-level protein structure modeling. Proteins. 2014 Feb;82 Suppl 2:208-18. doi: 10.1002/prot.24374. Epub 2013 Sep 10.

Hypotheses and Perspectives onto de novo protein structure prediction

Before I start with my musings about my work and the topic of my D.Phil thesis, I would like to direct you to a couple of previous entries here on BLOPIG. If you are completely new to the field of protein structure prediction or if you just need to refresh your brain a bit, here are two interesting pieces that may give you a bit of context:

A very long introductory post about protein structure prediction

and

de novo Protein Structure Prediction software: an elegant “monkey with a typewriter”

Brilliant! Now, we are ready to start.

In this OPIG group meeting, I presented some results that were obtained during my long quest to predict protein structures.

Of course, no good science can happen without the postulation of question-driving hypotheses. This is where I will start my scientific rant: the underlying hypotheses that inspired me to inquire, investigate, explore, analyse, and repeat. A process all too familiar to many.

As previously discussed (you did read the previous posts as suggested, didn’t you?), de novo protein structure prediction is a very hard problem. Computational approaches often struggle to search the humongous conformational space efficiently. Who can blame them? The number of possible protein conformations is so astronomically large that it would take MUCH longer than the age of the universe to look at every single possible protein conformation.

If we go back to biology, protein molecules are constantly undergoing folding. Moreover, they manage to do so efficiently and accurately. How is that possible? And can we use that information to improve our computational methods?

The initial hypothesis we formulated in the course of my degree was the following:

“We [the scientific community] can benefit from better understanding the context under which protein molecules are folding in vivo. We can use biology as a source of inspiration to improve existing methods that perform structure prediction.”

Hence came the idea to look at biology and search for inspiration. [Side note: It is my personal belief that there should be a back and forth process, a communication, between computational methods and biology. Biology can inspire computational methods, which in turn can shed light on biological hypotheses that are hard to validate experimentally]

To direct the search for biological inspiration, it was paramount to understand the limitations of current prediction methods. I have narrowed down the limitations of de novo protein structure prediction approaches to three major issues:

1- The heuristics that rely on sampling the conformational space using fragments extracted from known structures will fail when those fragments do not encompass or correctly describe the right answer.

2- Even when the conformational space is reduced, say, to fragment space, the combinatorial problem persists. The energy landscape is rugged and unrepresentative of the actual in vivo landscape. Heuristics are not sampling the conformational space efficiently.

3- Following from the previous point, the energy landscape is unrepresentative of the in vivo landscape because of the inaccuracy of the knowledge-based potentials used in de novo structure prediction.

Obviously, there are other relevant issues with de novo structure prediction. Nonetheless, I only have a limited amount of time for my D.Phil and those are the limitations I decided to focus on.

To counter each of these limitations, we have looked for inspiration in biology.

Our understanding from looking at different protein structures is that several conformational constraints are imposed by alpha-helices and beta-strands. That is a consequence of hydrogen bond formation within secondary structure elements. Unsurprisingly, when looking for fragments that represent the correct structure of a protein, it is much easier to identify good fragments for alpha-helical or beta-strand regions. Loop regions, on the other hand, are much harder to describe correctly using fragments extracted from known structures. We have incorporated this important information into a fragment library generation software in an attempt to address limitation number 1.

We have investigated the applicability of a biological hypothesis, cotranslational protein folding, to a structure prediction context. Cotranslational protein folding is the notion that some proteins begin their folding process as they are being synthesised. We further hypothesise that cotranslational protein folding restricts the conformational space, promoting the formation of energetically-favourable intermediates, thus steering the folding path towards the right conformation. This hypothesis has been tested in order to improve the efficiency of the heuristics used to search the conformational space (limitation number 2).

Finally, following the current trend in protein structure prediction, we used evolutionary information to improve our knowledge-based potentials. Many methods now consider correlated mutations to improve their predictions, namely the idea that residues that mutate in a correlated fashion present spatial proximity in a protein structure. Multiple sequence alignments and elegant statistical techniques can be used to identify these correlated mutations. There is a substantial amount of evidence that this correlated evolution can significantly improve the output of structure prediction, leading us one step closer to solving the protein structure prediction problem. Incorporating this evolution-based information into our routine assisted us in addressing the lack of precision of existing energy potentials.

Well, does it work? Surprisingly or not, in some cases it does! We have participated in a blind competition: the Critical Assessment of protein Structure Prediction (CASP). This event is rather unique and it brings together the whole structure prediction community. It also enables the community to gauge how good we are at predicting protein structures. Working with completely blind predictions, we were able to produce one correct answer, which is a good thing (I guess).

All of this comes together nicely in our biologically inspired pipeline to predict protein structures. I like to think of our computational pipeline as a microscope. We can use it to prod and look at biology. We can tinker with hypotheses, implement potentials and test them, see what is useful for us and what isn’t. It may not be exactly what gets the papers published, but the investigative character of our structure prediction pipeline is definitely the favourite aspect of my work. It is the aspect that makes me feel like a scientist.

Protein Structure Prediction, my own metaphorical microscope…

 

Journal Club: Native contacts in protein folding

Like your good old headphone cables, strings of amino acids have the potential to fold into a vast number of different conformations given the appropriate conditions. Even by a conservative estimate, the time it would take a 100-residue protein to explore all theoretically possible conformations would exceed the age of the Universe several times over. This is obviously not feasible, as was pointed out by Levinthal when he published his “How To Fold Graciously” in 1969.

The so-called Protein-Folding Problem has since been under intense study, which inevitably has led to a few theories and models about its nature. Due to the lack of appropriate wet-lab methods to study this phenomenon, theoretical and computational approaches have been key to devising impactful frameworks for formally describing protein folding. One of these goes under the name of the principle of minimal frustration, introduced by Bryngelson and Wolynes in the late 80s (1). It states that evolution has enriched proteins for sequences with the propensity to fold into low-energy structures, while actively selecting against traps. By avoiding mis-folding and non-native contacts, the theory says, a smooth funnel-like energy landscape with native-state minima is created that ensures robust and fast folding.

This implies that native contacts, i.e. residues that interact in the fully folded protein, play a major role in the folding process. Gō models (2), named after Nobuhiro Gō who first proposed this method, are based around this assumption, with the energetic contributions of native interactions acting as the sole driving forces in the folding process. While this approach has yielded promising results, many of which were in concordance with experiments, its underlying principles have never been validated in a statistically meaningful way.


A schematic for native-contact-driven protein folding

In 2013, a study by Best, Hummer and Eaton (3) formally addressed this question. By devising a set of statistical quantities aimed at weighting the importance of native and non-native interactions for folding, and applying these to the analysis of several long MD folding simulations, they were able to show a “native-centric mechanism” for small fast-folding proteins.

In a first step, it was assessed whether the fraction of native contacts provided a suitable reaction coordinate for the simulated folding events. From their equilibrium simulations, two thresholds of native-contact-fractions were chosen that defined folded and unfolded states (a two-state model is assumed). Overlaying the values for the most visited native-contact-fractions during simulation against these thresholds revealed a strong correlation between the two equilibrium probability density maxima and the protein’s fold state. In addition, they showed that the range of native-contact-fractions between the unfolded and folded thresholds was indicative of being on a transition path (defined as the “.. regions of the trajectories that cross directly from the unfolded well to the folded well ..”).
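The idea of using the fraction of native contacts as a reaction coordinate can be sketched in a few lines. This is an illustrative toy, not the analysis code from the paper: the threshold values, the example trajectory and the function names are all assumptions, and each simulation frame is represented simply as a set of residue-pair contacts.

```python
def fraction_native(frame_contacts, native_contacts):
    """Fraction of the native contacts that are formed in one simulation frame."""
    return len(frame_contacts & native_contacts) / len(native_contacts)


def transition_paths(q_series, q_unfolded, q_folded):
    """Find segments that cross directly from the unfolded well to the folded well.

    Returns (start, end) frame indices, where `start` is the last frame with
    Q <= q_unfolded before the chain first reaches Q >= q_folded at `end`.
    """
    paths = []
    last_unfolded = None
    for t, q in enumerate(q_series):
        if q <= q_unfolded:
            last_unfolded = t  # still in (or back in) the unfolded well
        elif q >= q_folded and last_unfolded is not None:
            paths.append((last_unfolded, t))
            last_unfolded = None  # wait for the next visit to the unfolded well
    return paths


# Toy native contact map and one toy frame.
native = {(0, 1), (1, 2), (2, 3), (3, 4)}
frame = {(0, 1), (1, 2), (5, 9)}  # (5, 9) is a non-native contact
print(fraction_native(frame, native))

# Toy Q trajectory with two folding events; thresholds are made-up values.
q_traj = [0.1, 0.2, 0.4, 0.15, 0.3, 0.5, 0.7, 0.9, 0.95, 0.6, 0.1, 0.5, 0.92]
print(transition_paths(q_traj, q_unfolded=0.2, q_folded=0.9))
```

Note that the excursion to Q = 0.4 at the start of the toy trajectory is not counted, because the chain falls back into the unfolded well before reaching the folded threshold.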

A further measure was introduced with the contact lifetime test. The log-ratio of the time a contact spent on a transition path vs the time it existed in the unfolded state was calculated and compared in a heat-map to the native contact map coloured by the number of contacts between residues.
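A count-based version of such a log-ratio might look like the sketch below. It assumes that every frame has already been labelled as unfolded ("U"), on a transition path ("TP") or folded ("F"), and it uses frame counts as a stand-in for time; the normalisation used in the paper is not reproduced here, and all names and data are hypothetical.

```python
from collections import Counter
from math import log


def contact_lifetime_log_ratio(frames, labels):
    """Log-ratio of frames a contact exists on transition paths vs in the unfolded state.

    `frames` is a list of contact sets, `labels` the matching per-frame state labels.
    Contacts never seen in one of the two states are skipped (ratio undefined).
    """
    tp_time = Counter()
    unfolded_time = Counter()
    for contacts, label in zip(frames, labels):
        if label == "TP":
            tp_time.update(contacts)
        elif label == "U":
            unfolded_time.update(contacts)
    return {
        contact: log(tp_time[contact] / unfolded_time[contact])
        for contact in tp_time.keys() & unfolded_time.keys()
    }


# Toy trajectory: contact (2, 3) is mostly present on the transition path,
# contact (0, 1) mostly in the unfolded state.
toy_frames = [{(0, 1)}, {(0, 1), (2, 3)}, {(0, 1), (2, 3)}, {(2, 3)}, {(2, 3)}]
toy_labels = ["U", "U", "TP", "TP", "TP"]
print(contact_lifetime_log_ratio(toy_frames, toy_labels))
```

Positive values flag contacts that persist longer on transition paths than in the unfolded state.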


Contact lifetime test for a selected protein.
Adapted from (3).

Among other things, this result revealed a clear connection between contacts with longer transition path lifetimes and the number of contacts they made in the native structure.

So what about non-native interactions?


One of the measures addressing this question was the Bayesian measure for non-native contacts on transition paths. In the examples used in this paper, non-native contacts showed no obvious link to being on a transition path unless they were close to native contacts. Further criteria, such as the complementary quantity (the probability of being on a transition path when a contact is not made), led to a similar conclusion.
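The two conditional probabilities mentioned above can be estimated from frame counts using Bayes' rule, as in the rough sketch below. It assumes each frame carries a flag saying whether it lies on a transition path; the function names and the toy data are made up, and the paper's exact estimator may differ.

```python
def p_tp_given_contact(frames, on_tp, contact):
    """P(TP | contact made), via Bayes: P(contact | TP) * P(TP) / P(contact)."""
    n = len(frames)
    made = [contact in f for f in frames]
    n_contact = sum(made)
    n_tp = sum(on_tp)
    if n_contact == 0 or n_tp == 0:
        return None  # undefined for this trajectory
    n_both = sum(m and tp for m, tp in zip(made, on_tp))
    return (n_both / n_tp) * (n_tp / n) / (n_contact / n)


def p_tp_given_no_contact(frames, on_tp, contact):
    """Complementary quantity: P(TP | contact not made)."""
    absent_flags = [tp for f, tp in zip(frames, on_tp) if contact not in f]
    if not absent_flags:
        return None
    return sum(absent_flags) / len(absent_flags)


# Toy data: six frames, three of them on a transition path.
c = (4, 9)  # a hypothetical non-native contact
toy_frames = [{c}, set(), {c}, {c}, set(), set()]
toy_on_tp = [False, False, True, True, True, False]
print(p_tp_given_contact(toy_frames, toy_on_tp, c))
print(p_tp_given_no_contact(toy_frames, toy_on_tp, c))
```

A contact that matters for the mechanism would push the first value well above the overall fraction of frames on transition paths, and the second value below it.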

Interestingly, it was found that the one protein that was influenced by non-native contacts was the designed α3D. Best et al. reasoned that additional frustration introduced when building a protein with artificially introduced stability has led to a shifting of helix register giving rise to this outlier.

When taken together, these results lay a robust foundation for further studies along the same lines. It is too early to accept or reject the presented findings as universal truth, but strong arguments for the native-centric mechanism being a reasonable model in small fast-folding proteins have been made. It would not be far-fetched to think that larger proteins would adhere to similar principles with non-native contacts modulating the landscape, especially when considering individual downhill folding modules.

References:

(1) Bryngelson, J.D. et al., 1995. Funnels, pathways, and the energy landscape of protein folding: a synthesis. Proteins, 21(3), pp.167–95.

(2) Taketomi, H., Ueda, Y. & Gō, N., 1975. Studies on protein folding, unfolding and fluctuations by computer simulation. I. The effect of specific amino acid sequence represented by specific inter-unit interactions. International journal of peptide and protein research, 7(6), pp.445–59.

(3) Best, R.B., Hummer, G. & Eaton, W.A., 2013. Native contacts determine protein folding mechanisms in atomistic simulations. Proceedings of the National Academy of Sciences of the United States of America, 110(44), pp.17874–9.

Kinetic Modelling of Co-translational Protein Folding (Journal Club)

Following up on last week’s entry, this post will explore the same topic: polypeptide chains assuming native-like conformations as they are extruded from the ribosome or, for those less familiar with the concept, co-translational protein folding.

Before addressing some important questions concerning co-translational protein folding, I would like to open a parenthesis: I want to dedicate a paragraph or two to talking about time.

Biological processes are dynamic. They are events that occur over a period of time. For instance, one can quantify the effect of mutations propagated and accumulated over millions of years of evolution. One can also quantify the femtoseconds in which subtle conformational changes occur in photoreceptor proteins like rhodopsin, when they respond to light. Time is fundamental to understand and model any sort of biological event.

Although it might seem obvious to the reader that time is crucial to amassing biological knowledge, those of us more theoretically inclined (bioinformaticians, computational biologists, biostatisticians, mathematical biologists and so on and so forth) are usually presented with models that tend to over-simplify reality. Surprisingly enough, there are many over-simplistic models that neglect the effect of time in order to “better” represent whatever they claim to model. Take Protein Docking for instance. The biological process at hand presents a complicated dynamic. There is a kinetic equilibrium, in which a vast number of protein and ligand molecules interact, associating into complexes and dissociating. Nonetheless, Protein Docking is traditionally reduced to the binding affinity between a pair of molecules. As one might say, this is only a problem if I can present a solution… Luckily, Protein Docking is not my subject of expertise, so I will leave this question open to more tenacious minds than my own.

One of the areas in which I am truly interested is the co-translational aspect of protein folding. If one performs a quick Google Images search, using the terms “Protein Synthesis” or “Protein Translation”, the results tell a very interesting story. The vast majority of nascent protein chains are represented as fully elongated peptide chains. In a majority of pictures, the growing peptides do not even present secondary structure. They are mostly represented by long, unfolded, almost linear polymers.

Now, any first year Biochemistry student learns about something called Hydrophobicity (or hydrophilicity depending on whether you are a glass half empty or half full type of person). It is biochemistry-introductory-text-book stuff that some residues are polar and some residues are apolar, and that the apolar ones will hide from water, forming a hydrophobic core. That (hydrophobicity) is one of the main driving forces of protein folding.

Hence, most of the images that appear in our Google Images search are not very representative. They are plain wrong. It is simple physics that the growing peptide chains will form secondary and tertiary structures during the process of protein synthesis. One has to remember that this process is dynamic; it is happening over time. Under these circumstances, time should not be neglected. The time scale at which extrusion occurs is slow enough to allow the nascent chain to probe conformations and simply abide by the laws of physics. A fully elongated, completely unfolded and denatured peptide chain would not exist during protein synthesis. These nascent chains would adopt intermediate conformations simply as a result of apolar residues trying to hide from water.

Ok. Now, the BIG question that can be raised is whether those intermediate conformations actually resemble the native state of the fully elongated protein. I do not want to engage in Baby Kicking, but one thing that evolution has taught us is that cells have evolved to be highly efficient systems. There is no room for wasted energy. It makes sense to hypothesize that over millions of years, the cellular machinery has adapted to exploit these intermediate conformations in order to make the process of protein folding more efficient.

Over the past couple of years, substantial evidence has been amassed that codon usage and the degeneracy of the genetic code could be exploited by cells to ensure that protein folding occurs accurately and efficiently. There are many theoretical ways that such exploitation could occur: the codon translation speed could facilitate the formation of certain intermediates that are beneficial for protein folding, that increase stability or that prevent protein aggregation. There is even a biomedical impact given that some observed pathologies have been associated with synonymous codon mutations that may lead to misfolded proteins.

In the paper I presented during this journal club [1], O’Brien and colleagues have devised and described a very interesting kinetic model for protein translation. Their model was used to describe possible scenarios in which both fast and slow translation speed codons are coordinators of co-translational protein folding. Please note that, in this context, co-translational protein folding is perceived as an enrichment of intermediate conformations of  the nascent chains, which resemble the native structure of the fully elongated protein.

In the model described in the paper, they opted for a probabilistic approach instead of an analytical (differential equations) approach. Time is modelled through the use of probabilities. The authors derived a formula to quantify the expected proportion of nascent chains of a given length that would be in a Folded intermediate state (one that resembles the native structure). They have managed to express this in terms of the rate of codon translation. Therefore, they establish a direct relationship between co-translational protein folding and codon translation speed.
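To give a feel for how a folded fraction can be tied to codon translation rates, here is a deliberately simplified two-state sketch. It is not the formula derived in the paper. It assumes that the nascent chain interconverts between an unfolded and a folded state with constant rates `k_fold` and `k_unfold`, and that the ribosome dwells at each codon for an exponentially distributed time with that codon's translation rate. Under those assumptions, averaging the two-state relaxation over the dwell time gives a simple closed-form update per codon.

```python
def folded_fraction_along_chain(translation_rates, k_fold, k_unfold, p_start=0.0):
    """Expected folded probability right after each codon has been translated.

    While the ribosome dwells at a codon, the folded probability relaxes towards
    the equilibrium value k_fold / (k_fold + k_unfold). Averaged over an
    exponentially distributed dwell time with rate k_codon, the memory of the
    previous value decays by the factor k_codon / (k_codon + k_fold + k_unfold).
    """
    p_eq = k_fold / (k_fold + k_unfold)
    p_folded = p_start
    profile = []
    for k_codon in translation_rates:
        memory = k_codon / (k_codon + k_fold + k_unfold)
        p_folded = p_eq + (p_folded - p_eq) * memory
        profile.append(p_folded)
    return profile


# Hypothetical rates in arbitrary units: five fast codons, with or without
# one slow codon in the middle of the stretch.
all_fast = folded_fraction_along_chain([20.0] * 5, k_fold=1.0, k_unfold=1.0)
one_slow = folded_fraction_along_chain([20.0, 20.0, 0.5, 20.0, 20.0], k_fold=1.0, k_unfold=1.0)
print(all_fast[-1], one_slow[-1])
```

In this toy setting a slow codon gives the chain more time to approach its folding equilibrium, whereas fast codons largely carry the previous state forward. The paper's model is richer, and among other things it considers scenarios with misfolded intermediates.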

Their analysis is robust as none of the constants and kinetic rates need to be experimentally derived in order to provide insights about the protein folding process. Overall, I think the way the model was built was quite ingenious and very interesting. I would encourage any interested reader to read the article if they want to understand how the whole modelling was carried out.

Overall, I think the authors present a compelling argument for how cells could exploit codon degeneracy and co-translational aspects of protein folding to improve folding efficiency. One of their results presents a scenario in which fast translation speed codons can be used to assist in the folding of unstable protein regions, preventing the formation of misfolded intermediates.

One of the many functions of mathematical models is to provide insights into the underlying biology of the phenomena they attempt to model. The lack of any experimental evidence to support this paper’s results does not make it any less interesting. The article presents to the readers a sound and solid mathematical argument as to how co-translational aspects of protein folding could be beneficial for cell efficiency. If anything, they provide interesting hypotheses that might drive experimentalists in the future.

[1] Kinetic modelling indicates that fast-translating codons can coordinate cotranslational protein folding by avoiding misfolded intermediates.