Tag Archives: protein folding

Unraveling the role of entanglement in protein misfolding

Proteins that fail to fold correctly may populate misfolded conformations with disparate structure and function. Misfolding is the focus of intense research interest due to its putative and confirmed role in various diseases, including neurodegenerative diseases such as Parkinson’s and Alzheimer’s Diseases as well as cystic fibrosis (PMID: 16689923).

Many open questions about protein misfolding remain to be answered. For example, how do misfolded proteins evade cellular quality control mechanisms like chaperones to remain soluble but non-functional for long timescales? How long do misfolded states persist on average? How widespread is misfolding? Experiments indicate that misfolding can even be caused by synonymous mutations that alter the speed of protein translation but not the sequence of the protein produced (PMID: 23417067), introducing the additional puzzle of how the protein maintains a “memory” of its translation kinetics after synthesis is complete.

A series of four recent preprints (Preprints 1, 2, 3, and 4, see below) suggests that these questions can be answered by the partitioning of proteins into long-lived self-entangled conformations that are structurally similar to the native state but with perturbed function. Simulation of the synthesis, termination, and post-translational dynamics of a large dataset of E. coli proteins suggests that misfolding and entanglement are widespread, with two thirds of proteins misfolding some of the time (Preprint 1). Many misfolded conformations may bypass proteostasis machinery to remain soluble but non-functional due to their structural similarity to the native state. Critically, entanglement is associated with particularly long-lived misfolded states based on simulated folding kinetics.

Coarse-grain and all-atom simulation results indicate that these misfolded conformations interact with chaperones like GroEL and HtpG to a similar extent as does the native state (Preprint 2). These results suggest an explanation for why some protein always fails to refold while remaining soluble, even in the presence of multiple folding chaperones – it remains trapped in entangled conformations that resemble the native state and thus fail to recruit chaperones.

Finally, simulations indicate that changes to the translation kinetics of oligoribonuclease introduced by synonymous mutations cause a large change in its probability of entanglement at the dimerization interface (Preprint 3). These entanglements localized at the interface alter its ability to dimerize even after synthesis is complete. These simulations provide a structural explanation for how translation kinetics can have a long-timescale influence on protein behavior.

Together, these preprints suggest that misfolding into entangled conformations is a widespread phenomenon that may provide a consistent explanation for many unanswered question in molecular biology. It should be noted that entanglement is not exclusive to other types of misfolding, such as domain swapping, that may contribute to misfolding in cells. Experimental validation of the existence of entangled conformations is a critical aspect of testing this hypothesis; for comparisons between simulation and experiment, see Preprint 4.

Preprint 1: https://www.biorxiv.org/content/10.1101/2021.08.18.456613v1

Preprint 2: https://www.biorxiv.org/content/10.1101/2021.08.18.456736v1

Preprint 3: https://www.biorxiv.org/content/10.1101/2021.10.26.465867v1

Preprint 4: https://www.biorxiv.org/content/10.1101/2021.08.18.456802v1

How fast can a protein fold?

A protein’s folding time is the time required for it to reach its unique folded state starting from its unfolded ensemble. Globular, cytosolic proteins can only attain their intended biological function once they have folded. This means that protein folding times, which typically exceed the timescales of enzymatic reactions that proteins carry out by several orders of magnitude, are critical to determining when proteins become functional. Many scientists have worked tirelessly over the years to measure protein folding times, determine their theoretical bounds, and understand how they fit into biology. Here, I focus on one of the more interesting questions to fall out of this field over the years: how fast can a protein fold? Note that this is a very different question than asking “how fast do proteins fold?”

Continue reading

Electrostatic interactions govern extreme nascent protein ejection times from ribosomes and can delay ribosome recycling

Finishing up a lingering project from your PhD almost a year into your postdoc is a great feeling, especially when it has actually been about 3 years in the making.

Though somewhat outside of the usual scope of activities in OPIG, I encourage you to take a look if the below summary grabs your interest. The full paper and supporting materials (including some movies which took entirely too long to make) can be found at https://pubs.acs.org/doi/abs/10.1021/jacs.9b12264.

Continue reading

What is the hydrophobic-polar (HP) model?

Proteins are fascinating. They are ubiquitous in living organisms, carrying out all kinds of functions: from structural support to unbelievably powerful catalysis. And yet, despite their ubiquity, we are still bemused by their functioning, not to mention by how they came to be. As computational scientists, our research at OPIG is mostly about modelling proteins in different forms. We are a very heterogeneous group that leverages approaches of diverse scale: from modelling proteins as nodes in a complex interaction network, to full atomistic models that help us understand how they behave.

Continue reading

Journal Club: Large-scale structure prediction by improved contact predictions and model quality assessment.

With the advent of statistical techniques to infer protein contacts from multiple sequence alignments (which you can read more about here), accurate protein structure prediction in the absence of a template has become possible. Taking advantage of this fact, there have been efforts to brave the sea of protein families for which no structure is known (about 8,500 – over 50% of known protein families) in an attempt to predict their topology. This is particularly exciting given that protein structure prediction has been an open problem in biology for over 50 years and, for the first time, the community is able to perform large-scale predictions and have confidence that at least some of those predictions are correct.

Based on these trends, last group meeting I presented a paper entitled “Large-scale structure prediction by improved contact predictions and model quality assessment”. This paper is the culmination of years of work, making use of a large number of computational tools developed by the Elofsson Lab at Stockholm University. With this blog post, I hope to offer some insights as to the innovative findings reported in their paper.

Let me begin by describing their structure prediction pipeline, PconsFold2. Their method for large-scale structure prediction can be broken down into three components: contact prediction, model generation and model quality assessment. As the very name of their article suggests, most of the innovation of the paper stems from improvements in contact prediction and the quality assessment protocols used, whereas for their model generation routine, they opted to sacrifice some quality in favour of speed. I will try and dissect each of these components over the next paragraphs.

Contact prediction relates to the process in which residues that share spatial proximity in a protein’s structure are inferred from multiple sequence alignments by co-evolution. I will not go into the details of how these protocols work, as they have been previously discussed in more detail here and here. The contact predictor used in PconsFold2 is PconsC3, which is another product of the Elofsson Lab. There was some weirdness with the referencing of PconsC3 on the PconsFold2 article, but after a quick google search, I was able to retrieve the article describing PconsC3 and it was worth a read. Other than showcasing PconsC3’s state-of-the-art contact prediction capabilities, the original PconsC3 paper also provides figures for the number of protein families for which accurate contact prediction is possible (over 5,000 of the ~8,500 protein families in Pfam without a member of known structure). I found the PconsC3 article feels like a prequel to the paper I presented. The bottom line here is that PconsC3 is a reliable tool for predicting contacts from multiple sequence alignments and is a sensible choice for the PconsFold2 pipeline.

Another aspect of contact prediction that the authors explore is the idea that the precision of contact prediction is dependent on the quality of the underlying multiple sequence alignment (MSA). They provide a comparison of the Positive Predicted Value (PPV) of PconsC3 using different MSAs on a test set of 626 protein domains from Pfam. To my knowledge, this is the first time I have encountered such a comparison and it serves to highlight the importance the MSA has on the quality of resulting contact predictions. In the PconsFold2 pipeline, the authors use consensus approach; they identify the consensus of four predicted contact maps each using a different alignment. Alignments were generated using Jackhmmer and HHBlits at E-Value cutoffs of 1 and 10^-4.

Now, moving on to the model generation routine. PconsFold2 makes use of CONFOLD to perform model generation. CONFOLD, in turn, uses the simulated annealing routine of the Crystallographic and NMR System (CNS) to produce models based on spatial and geometric constraints. To derive those constraints, predicted secondary structure and the top 2.5 L predicted contacts are given as input. The authors do note that the refinement stage of CONFOLD is omitted, which is a convenience I assume was adopted to save computational time. The article also acknowledges that models generated by CONFOLD are likely to be less accurate than the ones produced by Rosetta, yet a compromise was made in order to make the large-scale comparison feasible in terms of resources.

One particular issue that we often discuss when performing structure prediction is the number of models that should be produced for a particular target. The authors performed a test to assess how many decoys should be produced and, albeit simplistic in their formulation, their results suggest that 50 models per target should be sufficient. Increasing this number further did not lead to improvements in the average quality of the best models produced for their test set of 626 proteins.

After producing 50 models using CONFOLD, the final step in the PconsFold2 protocol is to select the best possible model from this ensemble. Here, they present a novel method, PcombC, for ranking models. PcombC combines the clustering-based method Pcons, the single-model deep learning method ProQ3D, and the proportion of predicted contacts that are present in the model. These three scores are combined linearly, and are given weights that were optimised via a parameter sweep. One of my reservations relating to this paper is that little detail is given regarding the data set that was used to perform this training. It is unclear from their methods section if the parameter sweep was trained on the test set with 626 proteins used throughout the manuscript. Given that no other data set (with known structures) is ever introduced, this scenario seems likely. Therefore, all the classification results obtained by PcombC, and all of the reported TM-score Top results should be interpreted with care since performance on validation set tends to be poorer than on a training set.

Recapitulating the PconsFold2 pipeline:

  • Step 1: generate four multiple sequence alignments using HHBlits and Jackhmmer.
  • Step 2: generate four predicted contact maps using PconsC3.
  • Step 3: Use CONFOLD to produce 50 models using a consensus of the contact maps from step 2.
  • Step 4: Use PCombC to rank the models based on a linear combination of the Pcons and ProQ3D scores and the proportion of predicted contacts that are present in the model.

So, how well does PconsFold2 perform? The conclusion is that it depends on the quality of the contact predictions. For the protein families where abundant sequence information is available, PconsFold2 produces a correct model (TM-Score > 0.5) for 51% of the cases. This is great news. First, because we know which cases have abundant sequence information beforehand. Second, because this comprises a large number of protein families of unknown structure. As the number of effective sequence (a common way to assess the amount of information available on an MSA) decreases, the proportion of families for which a correct model has been generated also decreases, which restricts the applicability of their method to protein families with abundant sequence information. Nonetheless, given that protein sequence databases are growing exponentially, it is possible that over the next years, the number of cases where protein structure prediction achieves success is likely to increase.

One interesting detail that I was curious about was the length distribution of the cases where modelling was successful. Can we detect the cases for which good models were produced simply by looking at a combination of length and number of effective sequences? The authors never address this question, and I think it would provide some nice insights as to which protein features are correlated to modelling success.

We are still left with one final problem to solve: how do we separate the cases for which we have a correct model from the ones where modelling has failed? This is what the authors address with the last two subsections of their Results. In the first of these sections, the authors compare four ways of ranking decoys: PcombC, Pcons, ProQ3D, and the CNS contact score. They report that, for the test set of 626 proteins, PcombC obtains the highest Pearson’s Correlation Coefficient (PCC) between the predicted and observed TM-Score of the highest ranking models. As mentioned before, this measure could be overestimated if PcombC was, indeed, trained on this test set. Reported PCCs are as follows: PcombC = 0.79, Pcons = 0.73, ProQ3D = 0.67, and CNS-contact = -0.56.

In their final analysis, the authors compare the ability of each of the different Quality Assessment (QA) scores to discern between correct and incorrect models. To do this, they only consider the top-ranked model for each target according to different QA scores. They vary the false positive rate and note the number of true positives they are able to recall. At a 10% false positive rate, PcombC is able to recall about 50% of the correct models produced for the test set. This is another piece of good news. Bottomline is: if we have sufficient sequence information available, PconsFold2 can generate a correct model 51% of the time. Furthermore, it can detect 50% of these cases, meaning that for ~25% of the cases it produced something good and it knows the model is good. This opens the door for looking at these protein families with no known structure and trying to accurately predict their topology.

That is exactly what the authors did! On the most interesting section of the paper (in my opinion), the authors predict the topology of 114 protein families (at FPR of 1%) and 558 protein families (at FPR of 10%). Furthermore, the authors compare the overlap of their results with the ones reported by a similar study from the Baker group (previously presented at group meeting here) and find that, at least for some cases, the predictions agree. These large-scale efforts  force us to revisit the way we see template-free structure prediction, which can no longer be dismissed as a viable way of obtaining structural models when sufficient sequences are available. This is a remarkable achievement for the protein structure prediction community, with the potential to change the way we conduct structural biology research.

Is “fragment-based” still the way forward in template-free protein structure prediction?

Out of the many questions surrounding the notion that you can predict a protein’s structure from its sequence, there is one in particular that I decided to tackle during last group meeting.

Protein structure prediction is a hard problem (do I sound repetitive?). One of the many cop outs employed by the structure prediction community is the idea that you can break down known structures into fragments and use these protein pieces to perform predictions. This is known as fragment-assembly or fragment-based template-free protein structure prediction.

As absurd as the idea may seem, there is robust evidence that suggests that this is actually a viable strategy. There is a notion that the fragment space is complete; you can reconstruct the backbone of any known structure based on the torsion angles of fragments from other structures. In less technical jargon, you can effectively use fragments and combine them to re-create any of the protein structures that we know and to a fairly acceptable level of precision.

So, technically, it is possible to predict a protein structure using fragments from other structures. In practice, you are still left with the problem of choosing the right fragments to model your sequence of interest. How easy do you think that is?

We can look at this question in light of observations that were made back in the early 80s. Kabsch and Sander reported that two protein fragments having exactly the same sequence can present completely different structures [1]. This complies with the notion that global properties can affect and even define local structure, which in turn suggests that selecting the right fragments to assemble a structure is not necessarily a straightforward process.

The starting point for protein structure prediction is a sequence. Since we are talking about template-free protein structure prediction, it is safe to assume that there is no good global sequence match to your target with a known structure (otherwise you would use that match/structure as a template). Hence, fragment selection is restricted to local sequence similarity, which, as suggested in the previous paragraph, is not necessarily ideal.

On the other hand, we are becoming increasingly more accurate in inferring one-dimensional properties from a protein’s sequence. These properties can and often are used to enhance our fragment-selection capabilities. Yet, even using the state-of-the-art in secondary structure and torsion angle prediction, fragment selection is still fairly imprecise.

During group meeting I highlighted a possible contrast between practical fragment space and general (or possible) fragment space. My premise is simple.  I define practical fragment space as the fragments that we can accurately select from the possible fragment space to model protein structures. In my opinion, it would be extremely interesting to quantify the difference between the two. This would answer the fundamental question of how useful fragment-assembly actually is. More importantly, it would help the community make an educated decision in regards to whether template-free structure prediction strategies should shift from fragment-based to ones based on distance constraints, an approach that is gaining popularity due to the success of contact predictions.

I am very keen to investigate this further. Maybe for my next blog post, we will have an answer! Stay tuned.

[1] Kabsch, Wolfgang, and Christian Sander. “On the use of sequence homologies to predict protein structure: identical pentapeptides can have completely different conformations.” Proceedings of the National Academy of Sciences  81.4 (1984): 1075­1078.

Slow and steady improvements in the prediction of one-dimensional protein features

What do you do when you have a big, complex problem whose solution is not necessarily trivial? You break the problem into smaller, easier to solve parts,  solve each of these sub-problems and merge the results to find the solution of the original, bigger problem. This is an algorithm design paradigm known as the divide and conquer approach.

In protein informatics, we use divide and conquer strategies to deal with a plethora of large and complicated problems. From protein structure prediction to protein-protein interaction networks, we have a wide range of sub and sub-sub problems whose solutions are supposed to help us with the bigger picture.

In particular, prediction of the so called one-dimensional protein features are fundamental sub-problems with a wide range of applications such as protein structure modelling,  homology detection, functional characterization and others. Here, one-dimensional protein features refer to secondary structure, backbone dihedral and C-alpha angles, and solvent accessible surface area.

In this week’s group meeting, I discussed the latest advancements in prediction of one-dimensional features as described in an article published by Heffernan R. and colleagues in Scientific Reports (2015):

“Improving prediction of secondary structure, local backbone angles, and solvent accessible surface area of proteins by iterative deep learning.”

In this article, the authors describe the implementation of SPIDER2, a deep learning approach to predict secondary structure, solvent accessible surface area, and four backbone angles (the traditional dihedrals phi and psi, and the recently explored theta and tau).

“Deep learning” is the buzzword (buzz-two-words or buzzsentence, maybe?) of the moment. For those of you who have no idea what I am talking about, deep learning is an umbrella term for a series of convoluted machine learning methods. The term deep comes from the multiple hidden layers of neurons used during learning.

Deep learning is a very fashionable term for a reason. These methods have been shown to produce state-of-the-art results for a wide range of applications in several fields, including bioinformatics. As a matter of fact, one of the leading methods for contact prediction (previously introduced in this blog post), uses a deep learning approach to improve the precision of predicted protein contacts.

Machine learning has already been explored to predict one-dimensional protein features, showing promising (and more importantly, useful) results. With the emergence of new, more powerful machine learning techniques such as deep learning, previous software are now becoming obsolete.

Based on this premise, Heffernan R. and colleagues implemented and applied their deep learning approach to improve the prediction of one-dimensional protein features. Their training process was rigorous: they performed a 10-fold cross validation using their training set of ~4500 proteins and, on top of that, they also had two independent test sets (a ~1200 protein test set and a set based on the targets of CASP11).  Proteins in all sets did not share more than 25% (30% sequence identity for the CASP set) to any other protein in any of the sets.

The method described in the paper, SPIDER2, was thoroughly compared with state-of-the art prediction software for each of the one-dimensional protein features that it  is capable of predicting. Results show that SPIDER2 achieves a small, yet significant improvement compared to other methods.

It is just like they say, slow and steady wins the race, right? In this case, I am not so sure. It would be interesting to see how much the small increments in precision obtained by SPIDER2 can improve the bigger picture, whichever your bigger picture is. The thing about divide and conquer is that if you become marginally better at solving one of the parts, that doesn’t necessarily imply that you will improve the solution of the bigger, main problem.

If we think about it, during the “conquer” stage (that is, when you are merging the solution of the smaller parts to get to the bigger picture),  you may make compromises that completely disregard any minor improvements for the sub-problems. For instance, in my bigger picture, de novo protein structure prediction, predicted local properties can be sacrificed to ensure a more globally consistent model. More than that, most methods that perform de novo structure prediction already account for a certain degree of error or uncertainty for, say, secondary structure prediction. This is particularly important for the border regions between secondary structure elements (i.e. where an alpha-helix ends and a loop begins). Therefore, even if you improve the precision of your predictions for those border regions, the best approach for structure prediction may still consider those slightly more precise border predictions as unreliable.

The other moral of this story is far more pessimistic. If you think about it, there were significant advancements in machine learning, which led to the creation of ever-more-so complicated neural network architectures. However, when we look back to how much improvement we observed when these highly elaborate techniques were applied to an old problem (prediction of one-dimensional protein features), it seems that the pay-off wasn’t as significant (at least as I would expect). Maybe, I am a glass half-empty kind of guy, but given the buzz surrounding deep learning, I think minor improvements is a bit of a let down. Not to take any credit away from the authors. Their work was rigorous and scientifically very sound. It is just that maybe we are reaching our limits when it comes to applying machine learning to predict secondary structure. Maybe when the next generation of buzzword-worthy machine learning techniques appear, we will observe an even smaller improvement to secondary structure prediction. Which leaves a very bitter unanswered question in all our minds: if machine learning is not the answer, what is?

Hypotheses and Perspectives onto de novo protein structure prediction

Before I start with my musings about my work and the topic of my D. Phil thesis, I would like to direct you to a couple of previous entries here on BLOPIG. If you are completely new to the field of protein structure prediction or if you just need to refresh your brain a bit, here are two interesting pieces that may give you a bit of context:

A very long introductory post about protein structure prediction

and

de novo Protein Structure Prediction software: an elegant “monkey with a typewriter”

Brilliant! Now, we are ready to start.

In this OPIG group meeting, I presented some results that were obtained during my long quest to predict protein structures.

Of course, no good science can happen without the postulation of question-driving hypotheses. This is where I will start my scientific rant: the underlying hypotheses that inspired me to inquire, investigate, explore, analyse, and repeat. A process all so familiar to many.

As previously discussed (you did read the previous posts as suggested, didn’t you?), de novo protein structure prediction is a very hard problem. Computational approaches often struggle to search the humongous conformational space efficiently. Who can blame them? The number of possible protein conformations is so astronomically large that it would take MUCH longer than the age of the universe to look at every single possible protein conformation.

If we go back to biology, protein molecules are constantly undergoing folding. More so, they manage to do so efficiently and accurately. How is that possible? And can we use that information to improve our computational methods?

The initial hypothesis we formulated in the course of my degree was the following:

“We [the scientific community] can benefit from better understanding the context under which protein molecules are folding in vivo. We can use biology as a source of inspiration to improve existing methods that perform structure prediction.”

Hence came the idea to look at biology and search for inspiration. [Side note: It is my personal belief that there should be a back and forth process, a communication, between computational methods and biology. Biology can inspire computational methods, which in turn can shed light on biological hypotheses that are hard to validate experimentally]

To direct the search for biological inspiration, it was paramount to understand the limitations of current prediction methods. I have narrowed down the limitations of de novo protein structure prediction approaches to three major issues:

1- The heuristics that rely on sampling the conformational space using fragments extracted from know structures will fail when those fragments do not encompass or correctly describe the right answer.

2- Even when the conformational space is reduced, say, to fragment space, the combinatorial problem persists. The energy landscape is rugged and unrepresentative of the actual in vivo landscape. Heuristics are not sampling the conformational space efficiently.

3- Following from the previous point, the reason why the energy landscape is unrepresentative of the in vivo landscape is due to the inaccuracy of the knowledge-based potentials used in de novo structure prediction.

Obviously, there are other relevant issues with de novo structure prediction. Nonetheless, I only have a limited amount of time for my D.Phil and those are the limitations I decided to focus on.

To counter each of these offsets, we have looked for inspiration in biology.

Our understanding from looking at different protein structures is that several conformational constraints are imposed by alpha-helices and beta-strands. That is a consequence of hydrogen bond formation within secondary structure elements. Unsurprisingly, when looking for fragments that represent the correct structure of a protein, it is much easier to identify good fragments for alpha-helical or beta-strand regions. Loop regions, on the other hand, are much harder to be described correctly by fragments extracted from known structures. We have incorporated this important information into a fragment library generation software in an attempt to address limitation number 1.

We have investigated the applicability of a biological hypothesis, cotranslational protein folding, into a structure prediction context. Cotranslational protein folding is the notion that some proteins begin their folding process as they are being synthesised. We further hypothesise that cotranslational protein folding restricts the conformational space, promoting the formation of energetically-favourable intermediates, thus steering the folding path towards the right conformation. This hypothesis has been tested in order to improve the efficiency of the heuristics used to search the conformational space.

Finally, following the current trend in protein structure prediction, we used evolutionary information to improve our knowledge-based potentials. Many methods now consider correlated mutations to improve their predictions, namely the idea that residues that mutate in a correlated fashion present spatial proximity in a protein structure. Multiple sequence alignments and elegant statistical techniques can be used to identify these correlated mutations. There is a substantial amount of evidence that this correlated evolution can significantly improve the output of structure prediction, leading us one step closer to solving the protein structure prediction problem. Incorporating this evolution-based information into our routine assisted us in addressing the lack of precision of existing energy potentials.

Well, does it work? Surprisingly or not, in some cases it does! We have participated in a blind competition: the Critical Assessment for protein Structure Prediction (CASP). This event is rather unique and it brings together the whole structure prediction community. It also enables the community to gauge at how good we are at predicting protein structures. Working with completely blind predictions, we were able to produce one correct answer, which is a good thing (I guess).

All of this comes together nicely in our biologically inspired pipeline to predict protein structures. I like to think of our computational pipeline as a microscope. We can use it to prod and look at biology. We can tinker with hypotheses, implement potentials and test them, see what is useful for us and what isn’t. It may not be exactly what get the papers published, but the investigative character of our structure prediction pipeline is definitely the favourite aspect of my work. It is the aspect that makes me feel like a scientist.

Protein Structure Prediction, my own metaphorical microscope…

 

Augmented Modelling with Natural Move Monte Carlo Simulations

In the last group meeting I reported on the progress that I have made regarding the development of a protocol for the systematic use of Natural Move Monte Carlo simulations.

Natural Move Monte Carlo simulations
Natural Moves are degrees of freedom that describe the collective motion of groups of residues. In DNA this might be the concerted motion of a double helix; in proteins this could be the movement of a stable secondary structure element such as a beta-sheet. These segments are joined by so called melting areas. At each simulation step the segments are propagated independently in an MC fashion. The resulting chain breaks are resolved by a chain closure algorithm that acts on the melting areas. This results in a reduction of degrees of freedom of several orders of magnitude. Therefore, large complexes and conformational changes can be sampled more effectively.

In order to get sensible results, however, the initial decomposition of the system is important. The challenge is to accurately represent the plasticity of the system, while keeping the number of degrees of freedom as small as possible. Detailed insight into the flexibility of the system might be gained from experimental sources such as NMR or computational methods such as MD simulations and Normal Mode Analysis. This can help with defining segments and melting areas. However, there are many systems for which this data is not available. Even if it is, there is no guarantee that the segmentation is correct.

Therefore, I am developing a protocol that allows for the evaluation of a range of different test cases that each reflect a unique set of segments and melting areas.

Augmented Modelling Protocol
This protocol is aimed at the systematic evaluation of NMMC segmentations. It allows researchers to feed experimental information, biological knowledge and educated guesses into molecular simulations and so provides a framework for testing competing hypotheses. The protocol has four steps.

Step 1: Segmentation of the system into low-level segments
The initial segmentation contains all possible areas of flexibility that may play a role in conformational changes in the system of interest. This decision may be influenced by many sources. For now, however, we only consider secondary structure information. Helices and beta strands are treated as potential segments. Unstructured regions such as kinks, loops and random coils are treated as melting areas. For a small fold with four helices we get the segmentation shown in figure 1a.

Step 2: Formulate test cases
Generate multiple test cases that reflect hypotheses about the mechanism of interest. In this step we try to narrow down the degrees of freedom as much as possible in order to retain sampling efficiency. This is done by selectively deactivating some melting areas that were defined in step 1. For a system with three melting areas that can either be on or off, 2^3 = 8 different test cases may be generated (example shown in figure 1b).

Segmentation of a small α-fold.

Figure 1 a) Segmentation of a small α-fold. The blue rectangles represent α-helices. The dashed lines indicate the presence of melting areas I, II and III. Each melting area can be switched on or off (1/0) b) Example of a test case in which the first of three melting area is switched off. c) The six degrees of freedom along which a segment is propagated.

Step 3: Perform simulations
Sample the conformational space of all test cases that were generated in step 2. We generally use Parallel Tempering or Simulated Tempering algorithm to accelerate the sampling process. These methods rely on the modulation of temperature to overcome energy barriers.

Step 4: Evaluate results
Score the results against a given control and rank the test cases accordingly. The scoring might be done by comparing experimental distributions of observables with those generated by simulations (e.g. Kullback-Leibler divergence). A test case that reproduces desired expectation values of observables might then be considered as a candidate hypothesis for a certain structural mechanism.

What’s next?
I am currently working on example uses for this protocol. These include questions regarding aspects of protein folding and the stability of the empty MHC II binding groove.

Journal Club: Native contacts in protein folding

Like your good old headphone cables, strings of amino acids have the potential to fold into a vast number of different conformations given the appropriate conditions. A conservative estimation for the time it would take a 100 residue protein to explore all theoretically possible conformations would exceed the age of the Universe several times. This is obviously not feasible and was pointed out by Levinthal when he published his “How To Fold Graciously” in 1969.

The so called Protein-Folding Problem has since been under intense study, which inevitably has led to a few theories and models about its nature. Due to the lack of appropriate wet-lab methods to study this phenomenon theoretical, computational approaches have been key to devising impactful frameworks for formally describing protein folding. One of these goes under the name of principle of minimum frustration introduced by Bryngelson and Wolynes in the late 80s (1). It states that proteins by evolution were enriched for sequences with the propensity to fold into low-energy structures, while actively selecting against traps. By avoiding mis-folding and non-native contacts, the theory says, a smooth funnel-like energy landscape with native-state minima is created that ensures robust and fast folding.

This implies that native contacts, i.e. residues that interact in the fully folded protein play a major role in the folding process. Gō models (2), named after Nobuhiro Gō who first proposed this method, are based around this assumption with the energetic contributions of native interactions acting as the sole driving forces in the folding process. While this approach has yielded promising results, many of which were in concordance with experiments, its underlying principles have never been validated in a statistically meaningful way.

native contact schematic

A schematic for native-contact-driven protein folding

In 2013 a study by Best, Hummer and Eaton (3) formally addressed this question. By devising a set of statistical quantities aimed at weighting the importance of native and non-native interactions for folding and applying these to the analysis of several long MD folding simulations they were able to show a “native-centric mechanism” for small fast-folding proteins.

In a first step it was assessed whether the fraction of native contacts  provided a suitable reaction coordinate for the simulated folding events. From their equilibrium simulations two thresholds of native-contact-fractions  were chosen that defined folded and unfolded states (a two-state model is assumed). Overlaying the values for the most visited native-contact-fractions during simulation against these thresholds revealed a strong correlation between the two equilibrium probability density maxima and the protein’s fold state. In addition they showed that the range of native-contact-fractions between those found to represent unfolded and folded thresholds were indicative of being on a transition path (defined as the  “.. regions of the trajectories that cross directly from the unfolded well to the folded well ..”).

A further measure was introduced with the contact lifetime test. The log-ratio of the time a contact spent on a transition path vs the time it existed in the unfolded state was calculated and compared in a heat-map to the native contact map coloured by the number of contacts between residues.

figure2

Contact life time test for a selected protein.
Adapted from (3).

Among others this result revealed a clear connection between contacts with longer transition path life times and the number of contacts they made in the native structure.

So what about non-native interactions?

Screenshot from 2014-03-27 12:47:04

One of the measures addressing this question was the Bayesian measure for non-native contacts on transition paths. In the examples used in this paper, no obvious link between being on a transition path given a non-native contact was found unless they were close to native contacts. Further criteria such as the complementary quantity, which is the probability of being on a transition path when a contact is not made, concluded in a similar fashion.

Interestingly, it was found that the one protein that was influenced by non-native contacts was the designed α3D. Best et al. reasoned that additional frustration introduced when building a protein with artificially introduced stability has led to a shifting of helix register giving rise to this outlier.

When taken together, these results lay a robust foundation for further studies along the same lines. It is too early to accept or reject the presented findings as universal truth, but strong arguments for the native-centric mechanism being a reasonable model in small fast-folding proteins have been made. It would not be far-fetched to think that larger proteins would adhere to similar principles with non-native contacts modulating the landscape, especially when considering individual downhill folding modules.

References:

(1) Bryngelson, J.D. et al., 1995. Funnels, pathways, and the energy landscape of protein folding: a synthesis. Proteins, 21(3), pp.167–95.

(2) Taketomi, H., Ueda, Y. & Gō, N., 1975. Studies on protein folding, unfolding and fluctuations by computer simulation. I. The effect of specific amino acid sequence represented by specific inter-unit interactions. International journal of peptide and protein research, 7(6), pp.445–59.

(3) Best, R.B., Hummer, G. & Eaton, W.A., 2013. Native contacts determine protein folding mechanisms in atomistic simulations. Proceedings of the National Academy of Sciences of the United States of America, 110(44), pp.17874–9.