Journal Club: Native contacts in protein folding

Like your good old headphone cables, strings of amino acids have the potential to fold into a vast number of different conformations given the appropriate conditions. A conservative estimate of the time it would take a 100-residue protein to explore all theoretically possible conformations exceeds the age of the Universe several times over. This is obviously not what happens in nature, as Levinthal pointed out when he published his “How To Fold Graciously” in 1969.

The so-called Protein-Folding Problem has since been under intense study, which inevitably has led to a few theories and models about its nature. Due to the lack of appropriate wet-lab methods to study this phenomenon, theoretical and computational approaches have been key to devising impactful frameworks for formally describing protein folding. One of these goes under the name of the principle of minimum frustration, introduced by Bryngelson and Wolynes in the late 80s (1). It states that evolution has enriched proteins for sequences with the propensity to fold into low-energy structures, while actively selecting against kinetic traps. By avoiding mis-folding and non-native contacts, the theory says, a smooth funnel-like energy landscape with a native-state minimum is created that ensures robust and fast folding.

This implies that native contacts, i.e. residues that interact in the fully folded protein, play a major role in the folding process. Gō models (2), named after Nobuhiro Gō who first proposed this method, are built around this assumption, with the energetic contributions of native interactions acting as the sole driving forces in the folding process. While this approach has yielded promising results, many of which were in concordance with experiments, its underlying principles had never been validated in a statistically meaningful way.

A schematic for native-contact-driven protein folding.

In 2013 a study by Best, Hummer and Eaton (3) formally addressed this question. By devising a set of statistical quantities aimed at weighting the importance of native and non-native interactions for folding and applying these to the analysis of several long MD folding simulations they were able to show a “native-centric mechanism” for small fast-folding proteins.

In a first step it was assessed whether the fraction of native contacts, Q, provided a suitable reaction coordinate for the simulated folding events. From their equilibrium simulations, two thresholds of Q were chosen that defined the folded and unfolded states (a two-state model is assumed). Overlaying the most visited values of Q during simulation against these thresholds revealed a strong correlation between the two equilibrium probability density maxima and the protein’s fold state. In addition they showed that the range of Q values between the unfolded and folded thresholds was indicative of being on a transition path (defined as the “.. regions of the trajectories that cross directly from the unfolded well to the folded well ..”).
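For a feel of how Q is computed in practice, here is a minimal numpy sketch of a soft (smoothed) fraction of native contacts, in the spirit of the coordinate used by Best et al.; the β and λ values below are common choices for this functional form rather than settings taken from the paper, and the distance arrays are assumed to be pre-extracted from the trajectory:

```python
import numpy as np

def fraction_native_contacts(dists, native_dists, beta=5.0, lam=1.8):
    """Soft fraction of native contacts Q for each frame.

    dists:        (n_frames, n_contacts) distances of the native pairs per frame
    native_dists: (n_contacts,) distances of the same pairs in the native state
    beta:         switching steepness (1/Angstrom); lam: tolerance factor
    """
    # each native pair contributes ~1 when near its native distance, ~0 when broken
    q = 1.0 / (1.0 + np.exp(beta * (dists - lam * native_dists)))
    return q.mean(axis=1)
```

Frames can then be labelled folded or unfolded by the two Q thresholds, and stretches of trajectory that cross directly between the two wells flagged as transition paths.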

A further measure was introduced with the contact lifetime test: the log-ratio of the time a contact spent on a transition path versus the time it existed in the unfolded state was calculated and compared, in a heat map, against the native contact map coloured by the number of contacts between residues.

Contact lifetime test for a selected protein. Adapted from (3).

Among other things, this result revealed a clear connection between contacts with longer transition-path lifetimes and the number of contacts they make in the native structure.
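In code, the lifetime test reduces to bookkeeping over boolean arrays; a small sketch (the array names and the pseudocount are mine):

```python
import numpy as np

def contact_lifetime_log_ratio(formed, on_tp, unfolded, dt=1.0):
    """Log-ratio of the time each contact exists on transition paths
    vs in the unfolded state.

    formed:   (n_frames, n_contacts) boolean, contact formed per frame
    on_tp:    (n_frames,) boolean mask of transition-path frames
    unfolded: (n_frames,) boolean mask of unfolded-state frames
    """
    t_tp = formed[on_tp].sum(axis=0) * dt       # lifetime on transition paths
    t_unf = formed[unfolded].sum(axis=0) * dt   # lifetime in the unfolded state
    eps = 1e-9                                  # pseudocount, avoids log(0)
    return np.log(t_tp + eps) - np.log(t_unf + eps)
```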

So what about non-native interactions?


One of the measures addressing this question was a Bayesian measure for non-native contacts on transition paths. For the proteins analysed in this paper, no clear increase in the probability of being on a transition path given that a non-native contact was formed was found, unless the non-native contacts were close to native ones. Further criteria, such as the complementary quantity (the probability of being on a transition path given that a contact is not made), led to similar conclusions.
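Conceptually, the Bayesian measure is just Bayes’ rule applied to frame counts, P(TP|contact) = P(contact|TP)·P(TP)/P(contact); a minimal sketch (my own notation, not the authors’ code):

```python
import numpy as np

def p_tp_given_contact(formed, on_tp):
    """P(TP | contact formed) for each contact, via Bayes' rule.

    formed: (n_frames, n_contacts) boolean, contact formed per frame
    on_tp:  (n_frames,) boolean, frame lies on a transition path
    """
    p_tp = on_tp.mean()                         # P(TP)
    p_contact = formed.mean(axis=0)             # P(contact)
    p_contact_tp = formed[on_tp].mean(axis=0)   # P(contact | TP)
    return p_contact_tp * p_tp / np.maximum(p_contact, 1e-12)
```

Contacts for which this probability rises well above the baseline P(TP) are diagnostic of being on a transition path; in the paper this was the case for native contacts but, with the exception noted below, not for non-native ones.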

Interestingly, the one protein that was influenced by non-native contacts was the designed α3D. Best et al. reasoned that the additional frustration introduced when building a protein with artificially introduced stability led to a shift in helix register, giving rise to this outlier.

When taken together, these results lay a robust foundation for further studies along the same lines. It is too early to accept or reject the presented findings as universal truth, but strong arguments for the native-centric mechanism being a reasonable model in small fast-folding proteins have been made. It would not be far-fetched to think that larger proteins would adhere to similar principles with non-native contacts modulating the landscape, especially when considering individual downhill folding modules.

References:

(1) Bryngelson, J.D. et al., 1995. Funnels, pathways, and the energy landscape of protein folding: a synthesis. Proteins, 21(3), pp.167–95.

(2) Taketomi, H., Ueda, Y. & Gō, N., 1975. Studies on protein folding, unfolding and fluctuations by computer simulation. I. The effect of specific amino acid sequence represented by specific inter-unit interactions. International journal of peptide and protein research, 7(6), pp.445–59.

(3) Best, R.B., Hummer, G. & Eaton, W.A., 2013. Native contacts determine protein folding mechanisms in atomistic simulations. Proceedings of the National Academy of Sciences of the United States of America, 110(44), pp.17874–9.

Journal Club: Random Coordinate Descent

The paper I chose to present at last week’s group meeting was “Random Coordinate Descent with Spinor-Matrices and Geometric Filters for Efficient Loop Closure”, by Pieter Chys and Pablo Chacón.

Loop closure is an important step in the ab initio modelling of protein loops. After a loop is initially built, normally by randomly choosing φ/ψ (phi/psi) dihedral angles from a distribution (Step 1 in the figure below), it is probably not ‘closed’ – i.e. the end of the loop does not meet the rest of the protein structure on the other side of the gap. Waiting for the algorithm to produce closed initial conformations would be horribly inefficient, so it’s much better to have some method of closing the initial loop structures computationally.

The main steps in the ab initio prediction of protein loops.

Loop closure methods can be classified into three different types:

  1. Analytical methods: the exact solution to the loop closure problem is calculated. The difficulty with this approach is that it becomes increasingly complicated the more degrees of freedom (i.e. dihedral angles) you have.
  2. Build-up methods: the loop is built residue-by-residue to construct an approximately closed loop which can then be refined. Basically, the loop is guided to the closed position as it is being built.
  3. Iterative methods: do just what they say on the tin – the loop is closed gradually through a series of iterations.

Of course, science is never simple, and loop closure algorithms often cannot be classified into just one of the above categories. Cyclic coordinate descent (CCD), the method on which the random coordinate descent algorithm introduced in this paper is based, is a mix of the analytical and iterative approaches. The loop is initialised starting from one of the anchor residues (the residues either side of the loop). To the end of this ‘open’ loop structure, the anchor residue from the other side is added. This residue is therefore present twice: the ‘fixed’ anchor residue (the true structure) and the ‘mobile’ anchor residue (the one added to the loop structure). Then, starting from the end of the loop that is attached to the rest of the protein, the dihedral angles are changed sequentially to try to minimise the distance between the fixed and mobile anchor residues. The angle change that minimises this distance is calculated analytically. Once the distance is within a particular cut-off value, the loop is considered to be closed and this is then the final structure.
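The analytic step at the heart of CCD is compact enough to sketch: for a rotation about a given bond axis, the summed squared distance between the mobile and fixed anchor atoms takes the form const + a·cosθ + b·sinθ, so the optimal angle follows directly from a and b. A minimal numpy version of that angle update (simplified from the scheme of Canutescu and Dunbrack, not the paper’s implementation):

```python
import numpy as np

def ccd_angle(mobile, fixed, origin, axis):
    """Rotation angle about the axis through `origin` (direction `axis`)
    that minimises the summed squared distance between the `mobile`
    anchor atoms (Nx3) and their `fixed` targets (Nx3)."""
    axis = axis / np.linalg.norm(axis)
    a = b = 0.0
    for m, f in zip(mobile, fixed):
        v = m - origin
        d = np.dot(v, axis) * axis      # component along the rotation axis
        r = v - d                       # radial component that actually rotates
        r_len = np.linalg.norm(r)
        if r_len < 1e-8:                # atom lies on the axis: unaffected
            continue
        r_hat = r / r_len
        s_hat = np.cross(axis, r_hat)   # tangential direction of rotation
        c = (origin + d) - f            # offset of the rotation circle's centre
        a += 2.0 * r_len * np.dot(c, r_hat)
        b += 2.0 * r_len * np.dot(c, s_hat)
    # distance^2(theta) = const + a*cos(theta) + b*sin(theta), minimised at:
    return np.arctan2(-b, -a)
```

Each CCD iteration applies this update to one dihedral at a time; RCD’s key change, as described below, is that the dihedral to update is picked at random.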

Random coordinate descent (RCD) is based upon CCD, but with a number of alterations and additions:

  1. Instead of iterating through each dihedral angle sequentially along the loop backbone, angles are chosen randomly
  2. A spinor-matrix approach is used – this reduces loop closure times
  3. Geometric filters are applied at various points in the algorithm – before, during or after loop closure.
  4. ‘Switching’ – if loop building fails, the direction of loop building is reversed – for example, if the structure is being grown from the N-anchor but doesn’t pass through the filters, then the loop is discarded and the next loop will be grown from the C-anchor. This should mean that the favoured loop closure direction naturally dominates.

The different geometric filters are as follows:

  1. A grid clash filter, which checks for clashes between the loop residues and the rest of the protein structure
  2. A loop clash filter, which checks for internal clashes between loop residues
  3. An adaptive Ramachandran filter, which restrains the dihedral angles to the allowed regions of the Ramachandran plot.

The Ramachandran filter is a good idea, since loop closure can change the dihedral angles of a structure significantly, moving them into disallowed regions. φ (phi) angles are restricted to the range between -175˚ and -40˚, and ψ (psi) angles are restricted to between -60˚ and 175˚ – this is basically the top left part of the Ramachandran plot. There are two exceptions: the φ angle of proline is fixed, and the dihedral angles of glycine residues are not restricted at all. When placed inside the loop closure routine, the filter is ‘adaptive’ – if the calculated optimum angle is outside the allowed region, the filter calculates the maximum possible rotation that would still be allowed. When these angle changes become too small, however, the restriction is removed entirely and the angle is allowed to change freely.
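The adaptive behaviour is easy to express in code: clamp the proposed angle to the allowed window, and lift the restriction once the permitted move becomes negligible. A sketch working in degrees; the 1˚ tolerance is my own placeholder, as the post does not quote the paper’s cut-off:

```python
PHI_ALLOWED = (-175.0, -40.0)   # phi window from the paper (degrees)
PSI_ALLOWED = (-60.0, 175.0)    # psi window from the paper (degrees)

def adaptive_rama_filter(current, proposed, allowed, tol=1.0):
    """Return the dihedral to apply: the proposed angle if allowed, otherwise
    the largest rotation that stays inside `allowed`; once that permitted
    move shrinks below `tol` degrees, the restriction is lifted entirely.
    (Glycine is left unrestricted and the proline phi is held fixed upstream.)"""
    lo, hi = allowed
    if lo <= proposed <= hi:
        return proposed                   # optimum already in the allowed region
    clamped = min(max(proposed, lo), hi)  # maximum rotation still allowed
    if abs(clamped - current) < tol:
        return proposed                   # restriction removed: rotate freely
    return clamped
```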

By testing different combinations of filters in different places, the authors decided upon a final RCD algorithm. This version includes the grid clash filter during loop closure, and the Ramachandran filter applied both before and during loop closure. They then compare their method to some other loop closure algorithms – their method produces good results, outperforming all except a method called ‘direct tweak’ – the only other method tested that includes clash detection during loop closure. From this, the authors conclude that this is a key factor in generating accurate loop conformations. They also report that RCD is 6 to 17 times faster than direct tweak.

Overall, then, the authors of this paper have introduced an accurate and fast loop closure algorithm which outperforms most other methods. Currently, my research is focussed upon developing a new antibody-specific ab initio loop modelling method, and some of the concepts used in this paper would definitely be worth investigating further. Watch this space!

Kinetic Modelling of Co-translational Protein Folding (Journal Club)

Following up on last week’s entry, this post will explore the same topic: polypeptide chains assuming native-like conformations as they are extruded from the ribosome or, for those less intimate with the concept, co-translational protein folding.

Before addressing some important questions concerning co-translational protein folding, I would like to make a parenthesis: I want to dedicate a paragraph or two to talk about time.

Biological processes are dynamic. They are events that occur over a period of time. For instance, one can quantify the effect of mutations propagated and accumulated over millions of years of evolution. One can also quantify the femtoseconds in which subtle conformational changes occur in photoreceptor proteins like rhodopsin, when they respond to light. Time is fundamental to understand and model any sort of biological event.

Although it might seem obvious to the reader that time is crucial to amass biological knowledge, those of us who are more theoretically inclined (bioinformaticians, computational biologists, biostatisticians, mathematical biologists and so on and so forth) are usually presented with models that tend to over-simplify reality. Surprisingly enough, there are many over-simplistic models that neglect the effect of time in order to “better” represent whatever they claim to model. Take Protein Docking, for instance. The biological process at hand presents complicated dynamics. There is a kinetic equilibrium, in which a vast number of protein and ligand molecules interact, associating into complexes and dissociating. Nonetheless, Protein Docking is traditionally reduced to the binding affinity between a pair of molecules. As one might say, one should only raise such a problem if one can present a solution… Luckily, Protein Docking is not my subject of expertise, so I will leave this question open to more tenacious minds than my own.

One of the areas in which I am truly interested is the co-translational aspect of protein folding. If one performs a quick Google Images search using the terms “Protein Synthesis” or “Protein Translation”, the results tell a very interesting story. The vast majority of nascent protein chains are represented as fully elongated peptide chains. In most pictures, the growing peptides do not even present secondary structure; they are mostly represented as long, unfolded, almost linear polymers.

Now, any first-year Biochemistry student learns about something called hydrophobicity (or hydrophilicity, depending on whether you are a glass half empty or half full type of person). It is biochemistry-introductory-text-book stuff that some residues are polar and some are apolar, and that the apolar ones will hide from water, forming a hydrophobic core. Hydrophobicity is one of the main driving forces of protein folding.

Hence, most of the images that appear in our Google Images search are not very representative. They are plain wrong. It is simple physics that the growing peptide chains will form secondary and tertiary structures during the process of protein synthesis. One has to remember that this process is dynamic: it is happening over time. Under these circumstances, time should not be neglected. The time scale at which extrusion occurs is slow enough to allow the nascent chain to probe conformations and simply abide by the laws of physics. A fully elongated, completely unfolded and denatured peptide chain would not exist during protein synthesis. These nascent chains adopt intermediate conformations simply as a result of apolar residues trying to hide from water.

Ok. Now, the BIG question that can be raised is whether those intermediate conformations actually resemble the native state of the fully elongated protein. I do not want to engage in baby kicking, but one thing that evolution has taught us is that cells have evolved to be highly efficient systems; there is no room for wasted energy. It makes sense to hypothesize that over millions of years the cellular machinery has adapted to exploit these intermediate conformations in order to make the process of protein folding more efficient.

Over the past couple of years, substantial evidence has been amassed that codon usage and the degeneracy of the genetic code could be exploited by cells to ensure that protein folding occurs accurately and efficiently. There are many theoretical ways that such exploitation could occur: the codon translation speed could facilitate the formation of certain intermediates that are beneficial for protein folding, that increase stability or that prevent protein aggregation. There is even a biomedical impact given that some observed pathologies have been associated with synonymous codon mutations that may lead to misfolded proteins.

In the paper I presented during this journal club [1], O’Brien and colleagues have devised and described a very interesting kinetic model of protein translation. Their model was used to describe possible scenarios in which both fast- and slow-translating codons act as coordinators of co-translational protein folding. Please note that, in this context, co-translational protein folding is perceived as an enrichment of intermediate conformations of the nascent chains which resemble the native structure of the fully elongated protein.

In the model described in the paper, they opted for a probabilistic approach instead of an analytical (differential equation) one: time is modelled through probabilities. The authors derived a formula to quantify the expected proportion of nascent chains of a given length that are in a folded intermediate state (one that resembles the native structure), and managed to express it in terms of the rates of codon translation. They thereby establish a direct relationship between co-translational protein folding and codon translation speed.
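To make the flavour of such a model concrete, here is a toy two-state version, which is my own simplification rather than the authors’ formula: at each codon the nascent chain interconverts between unfolded and folded-intermediate states while the ribosome dwells there for an exponentially distributed time set by that codon’s translation rate:

```python
def folded_fraction_profile(k_trans, k_f, k_u):
    """Expected fraction of nascent chains in the folded intermediate after
    each codon, for two-state U<->F interconversion (rates k_f, k_u) during
    exponentially distributed ribosome dwell times (one rate per codon)."""
    p_eq = k_f / (k_f + k_u)        # equilibrium folded fraction
    lam = k_f + k_u                 # relaxation rate of the two-state system
    p, profile = 0.0, []            # chains start fully unfolded
    for kt in k_trans:
        decay = kt / (kt + lam)     # E[exp(-lam*tau)] for dwell tau ~ Exp(kt)
        p = p_eq + (p - p_eq) * decay
        profile.append(p)
    return profile

# a single slow codon (small rate) gives the chain time to approach equilibrium
print(folded_fraction_profile([10.0, 10.0, 0.5, 10.0], k_f=2.0, k_u=0.5))
```

Even this caricature reproduces the qualitative point: the folded fraction after each codon depends directly on the translation rates, so codon choice can tune which intermediates are populated.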

Their analysis is robust as none of the constants and kinetic rates need to be experimentally derived in order to provide insights about the protein folding process. Overall, I think the way the model was built was quite ingenious and very interesting. I would suggest any interested reader to read the article if they want to understand how the whole modelling was carried out.

Overall, I think the authors present a compelling argument for how cells could exploit codon degeneracy and co-translational aspects of protein folding to improve folding efficiency. One of their results presents a scenario in which fast-translating codons can be used to assist the folding of unstable protein regions, preventing the formation of misfolded intermediates.

One of the many functions of mathematical models is to provide insights into the underlying biology of the phenomena they attempt to model. The lack of any experimental evidence to support this paper’s results does not make it any less interesting. The article presents to the readers a sound and solid mathematical argument as to how co-translational aspects of protein folding could be beneficial for cell efficiency. If anything, they provide interesting hypotheses that might drive experimentalists in the future.

[1] Kinetic modelling indicates that fast-translating codons can coordinate cotranslational protein folding by avoiding misfolded intermediates.

Expanding Anfinsen’s Principle (Journal Club)

Paper: Expanding Anfinsen’s Principle: Contributions of Synonymous Codon Selection to Rational Protein Design.

In 1961, Anfinsen performed his now (in)famous denaturing experiment on ribonuclease A, a small, one-hundred-residue globular protein. He showed that it could both unfold and refold via the addition and subsequent removal of chemical denaturants. From this he concluded that a protein’s fold is that of its global free energy minimum and, consequently, that all the information required to know the folded structure of a given protein is encoded solely within its sequence of amino acids. In 1972, Anfinsen was awarded the Nobel prize for this work, from which stemmed the vast field of protein folding prediction: a global arms race to see who could best predict/find the elusive global minimum for any given protein.

Unfortunately, protein fold prediction is still in its infancy with regards to its predictive power. As a scientific community, we have made huge progress using homology models, whereby we use the structure of a protein with a similar sequence to the one under investigation to provide a reasonable starting point for refinement. However, when there is no similar structure in existence, we struggle abysmally, being forced to resort to de novo models. This lack of ability when we are given solely a sequence to work with shows that our fundamental understanding of the protein folding process must be incomplete.

An increasingly common viewpoint, one that is at odds with Anfinsen’s conclusions, is that there is additional information required for a protein to fold. One suggested source of information is in the production of the protein itself at the ribosome. Coined as cotranslational folding, it has been shown that a protein undergoing synthesis will fold as it emerges from the ribosome, not waiting until the entire sequence is synthesised. As such, the energy landscape that the protein must explore to fold is under constant change and development as more and more of the protein emerges from the ribosome. It is an iterative process of smaller searches as the energy landscape is modulated in steps to that of the complete amino acid sequence.

Another suggested source of information is within the degeneracy observed within the genetic code. Each amino acid is encoded by up to 6 different codons, and as such, one can never determine exactly the coding DNA that created a given protein. While this degeneracy has been suggested as merely an effect to reduce the deleterious nature of point mutations, it has also been found that each of these codons is translated at a different rate. It is evident that information is consumed when RNA is converted into protein at the ribosome, since reverse translation is impossible, and it is hypothesised that these variations in speed can alter the final protein structure.

Figure 1. Experimental design for kinetically controlled folding. (a) Schematic of YKB, which consists of three half-domains connected by flexible (AGQ)5 linkers (black lines). The Y (yellow) and B (blue) half-domains compete to form a mutually exclusive kinetically trapped folded domain with the central K (black) half-domain. The red wedge indicates the location of synonymous codon substitutions (see text). (b) Energy landscapes for proteins that fold under kinetic control have multiple deep minima, representing alternative folded structures, separated by large barriers. The conformations of the unfolded protein and early folding intermediates (coloured arrows) determine the final folded state of the protein. Forces that constrict the unfolded ensemble (grey cone) can bias folding toward one structure. (c) During translation of the nascent chain by the ribosome (orange), folding cannot be initiated from the untranslated C-terminus, which restricts the ensemble of unfolded states and leads to the preferential formation of one folded structure. Image sourced from J. Am. Chem. Soc., 2014, 136(3).

The journal club paper by Sander et al. looked experimentally at whether both cotranslational folding and codon choice can have an effect on the resultant protein structure. This was achieved through the construction of a toy model protein consisting of three half-domains, as shown in Figure 1. Each of these half-domains was sourced from bifluorescent proteins, a group of protein half-domains that fluoresce when combined. The second half-domain (K) could combine with either the first (Y) or the third (B) half-domain to create a fluorophore, crucially in a non-reversible fashion, such that once one full domain was formed the other could not form. By choosing fluorophores that differed in wavelength, it was simple to measure the ratio in which the two species, YK-B and Y-KB, were formed.

They found that the ratio of these two species differed between in vitro and in vivo formation. When denatured Y-K-B species were allowed to refold, an equal mixture was produced, with both species equally likely to form. In contrast, when synthesised at the ribosome, the protein showed an extreme bias towards the YK-B species, as shown in Figure 2. They concluded that this is caused by cotranslational folding: the half-domains Y and K have time to form the YK species before B has finished being produced. As pointed out by some members of the OPIG group, it would have been nice to see whether similar results were produced with the species reversed, such that B was synthesised first and Y last, but this point does not invalidate what was reported.

Figure 2. Translation alters YKB folded structure. (a) Fluorescence emission spectra of intact E. coli expressing the control fluorescent protein constructs YK (yellow) or KB (cyan). (b) Fluorescence emission spectra of intact E. coli expressing YKB constructs with common or rare codon usage (green versus red solid lines) versus the same YKB constructs folded in vitro upon dilution from a chemical denaturant (dashed lines). Numbers in parentheses correspond to synonymous codon usage; larger positive numbers correspond to more common codons. (c) E. coli MG1655 relative codon usage(3) for codons encoding three representative YKB synonymous mutants: (+65) (light green), (−54) (red), and (−100) (pink line). Image sourced from J. Am. Chem. Soc., 2014, 136(3).

Following the above, they also probed the role of codon choice using this toy model system. They varied the codon choice over a small segment of residues between the K and B half-domains, such that they had a multitude of species which would be encoded either “faster” or “slower” across this region. Codon usage was used as the measure of speed, though its appropriateness has yet to be established within the literature. They found that the slower species increased the bias towards YK-B over Y-KB, while faster species reduced it. This experiment shows clearly that codon choice has a role in a protein’s final structure, though they only show a large global effect. My work is primarily on whether codon choice has a role at the secondary structure level, so I will be avidly hoping for more experiments showing the role of codons at finer levels.

In conclusion, Sander et al. performed one of the cleanest experimental proofs of cotranslational folding to date. Older evidence is more anecdotal in nature, with reports of protein X or Y changing in response to a single synonymous mutation. In contrast, the experiment reported here is systematic in its approach and leaves little room for doubt over the results. Secondly, and more groundbreaking, is the (again) systematic nature in which codon choice is investigated and shown to affect the global protein structure. This is one of those rare pieces of science in which the conclusions are clear and forthcoming to all readers.

Activity cliffs

Mini-perspective on activity cliffs as a medicinal chemistry tool

Recently in group meeting we discussed activity cliffs and their application to medicinal chemistry. The talk was based largely on the excellent mini-perspective on the subject written by Stumpfe et al.

What is an activity cliff?

An activity cliff is a pair of compounds separated by a small structural change but a large activity change. Activity cliffs are commonly used in the design of new compounds targeting a particular protein (drugs). They work on the principle that if a given structural change has previously had a large effect on activity, it is likely to have a similar effect on a different compound series. In this way they can be used as predictive tools to suggest chemical transformations that are likely to improve activity for a given compound.

To define an activity cliff, one must consider what a small structural change and a large activity change mean.

Small structural change

Structural changes can be measured using a seemingly endless array of methods. Many methods condense the chemical information of the molecule into a bit-vector, where each bit indicates whether the molecule contains a particular type of chemical functionality, e.g. a methyl group. Molecular similarity is then assessed by comparing the bit-vectors, most commonly by computing the Tanimoto similarity between them. This returns a single value between 0 and 1 indicating how similar the two molecules are (the greater, the more similar). To define a small structural change, one must decide upon a threshold value above which two molecules are considered sufficiently similar.
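As a concrete example, the whole fingerprint-and-Tanimoto pipeline is a few lines with RDKit (my tooling choice, not the paper’s); the 0.85 threshold is a commonly quoted rule of thumb rather than a value from the paper:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto(smiles_a, smiles_b, radius=2, n_bits=2048):
    """Tanimoto similarity between Morgan (ECFP-like) bit-vector fingerprints."""
    fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s),
                                                 radius, nBits=n_bits)
           for s in (smiles_a, smiles_b)]
    return DataStructs.TanimotoSimilarity(fps[0], fps[1])

sim = tanimoto("c1ccccc1CCO", "c1ccccc1CCN")   # phenethyl alcohol vs amine
print(sim, "similar" if sim >= 0.85 else "not similar")
```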
An alternative method is to find matched molecular pairs – compounds which are identical apart from one structural change; an example is shown below. For matched molecular pairs the only parameter required is the size of the non-matching part of the pair, usually measured in non-hydrogen atoms. The threshold to use for this parameter is chosen equally arbitrarily; however, it has a much more intuitive effect.

An example of a matched molecular pair.
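A crude way to screen for matched-molecular-pair candidates is to take the maximum common substructure and count the heavy atoms left over in each molecule; a sketch using RDKit’s MCS facility (a proper MMP workflow uses systematic single-bond fragmentation instead, so treat this purely as an illustration of the size criterion):

```python
from rdkit import Chem
from rdkit.Chem import rdFMCS

def mmp_candidate(smiles_a, smiles_b, max_diff_atoms=3):
    """Flag a pair as a matched-molecular-pair candidate if, outside the
    maximum common substructure, each molecule has at most `max_diff_atoms`
    non-hydrogen atoms (the threshold is arbitrary, as discussed above)."""
    mols = [Chem.MolFromSmiles(s) for s in (smiles_a, smiles_b)]
    mcs = rdFMCS.FindMCS(mols)            # maximum common substructure
    diffs = [m.GetNumHeavyAtoms() - mcs.numAtoms for m in mols]
    return all(d <= max_diff_atoms for d in diffs)

print(mmp_candidate("c1ccccc1CCO", "c1ccccc1CCN"))   # True: one-atom change
```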

Which method to use?

Similarity methods are less rigid and are capable of finding molecules that are very similar yet differ in two or more subtle ways. However, they are also liable to deem molecules similar when they would not be perceived as such. In this work Stumpfe et al. show that different similarity methods do not agree greatly on which molecules are “similar”. They compare six different fingerprint methods used to find similar molecules. Each method finds around 30% similar molecules in the datasets used, yet the consensus between the methods is only 15%. This indicates that there is no clear definition of “similar” using bit-string similarity. Interestingly, a third of the molecules found to be similar by all six fingerprint methods are not considered matched molecular pairs. This demonstrates a downside of the matched molecular pair approach: it is liable to miss highly similar molecules that differ in a couple of small ways.

Matched molecular pairs are, however, least liable to find false-positives, i.e. compounds that are seen as similar but in fact are not actually similar. The transformations they represent are easily understood and this can be easily applied to novel compounds. For these reasons matched molecular pairs were chosen by Stumpfe et al. for this work to indicate small structural changes.

Large activity change

A large activity change is an equally arbitrary decision to make. The exact value that indicates an activity cliff will depend on the assay used and the protein being tested against. Stumpfe et al. reasonably suggest that approximate measures should not be used and that activity scores found between different assays should not be compared.

Rationales for activity cliffs

If structural data is available for an activity cliff, rationales for their corresponding activity change can be suggested. These can then be used to suggest other alterations that might have a similar impact. Stumpfe et al. consider the five most common rationales for activity cliffs.

  • H-bond and/or ionic interactions: these interactions increase the binding energy by forming specific interactions with the protein
  • Lipophilic and aromatic groups: these groups can form specific protein-ligand interactions, e.g. pi-pi stacking and also form favourable interactions with hydrophobic residues in the protein
  • Water molecules: One molecule in the pair displaces water molecules from the active site, altering the binding energy
  • Stereochemistry changes: for example altering an enantiomeric form of a compound alters the projection of a group, forming or losing favourable/disfavourable protein-ligand interactions
  • Multiple effects: a combination of the above, and thus difficult to establish the dominant feature.

Are they generally useful?

Stumpfe et al. consider whether activity cliffs are more useful for some proteins or protein classes than others. They investigate how many compounds form activity cliffs for the many protein targets for which activity data is openly available. For proteins with more than 200 compounds with activity data, the proportion of activity-cliff-forming compounds is roughly equivalent (around 10%). This is an interesting and unexpected result: the proteins used in this study have different binding sites offering different opportunities for protein-ligand interactions, so one would not naturally expect them to offer similar opportunities for generating activity cliffs. This result shows that the activity cliff concept is generically useful, irrespective of the protein being targeted.

Are they predictive?

Although activity cliffs make intuitive sense, Stumpfe et al. consider whether they have been quantitatively successful in previous drug discovery efforts. They investigate all of the times that activity cliff information was available from openly available data. They then find all the times this information was used in a different compound series and, if it was used, whether it had a positive or negative effect on activity.

Interestingly, available activity cliff information had not been used in 75% of cases; they suggest this indicates the information is an as yet underused resource. Secondly, in the cases where it was used, 60% of the time it was successful in improving activity and 40% of the time it was unsuccessful. They suggest this indicates the activity cliff method is useful for suggesting novel additions to compounds. Indeed, a method that gives a 60% success rate in predicting more potent compounds would be considered useful by most if not all medicinal chemists. It would be interesting to investigate whether there are patterns in the protein environment or the nature of the structural changes in the cases where the activity cliff method is not successful.

Have they been successful?

Finally, Stumpfe et al. investigate whether using activity cliff information gives a higher probability of synthesising a compound among the 10% most active against the target protein. They show that in 54% of cases where activity cliff information was used, a compound in the 10% most active was produced. Conversely, when this information was not used, only 28% of pathways produced a compound in the 10% most active. They argue this indicates that using activity cliff information improves the chances of producing active compounds.

Conclusion

The paper discussed here offers an excellent overview of the activity cliff concept and its application. They demonstrate, in this work and others, that activity cliffs are generally useful, predictive and currently underused. The method can therefore be used in new tools to improve the efficiency of drug discovery.

Stepwise Assembly For Protein Loop Prediction

INTRODUCTION:

Loop modeling is used frequently in designing the structure of new proteins or refining protein structures with limited X-ray and NMR data. In 2011, Sripakdeevong et al. introduced a hypothesis called the “stepwise ansatz” for modeling RNA loops with atomic accuracy. They believed that current knowledge-based RNA loop predictors aiming at atomic accuracy failed to sample models within 1.5 Å RMSD of the native structures, the bottleneck in these methods being inefficient sampling of the conformational space. To overcome this limitation of sampling, Sripakdeevong et al. introduced an ab initio (de novo) build-up strategy to allow high-resolution sampling of loops instead of restricting the search space to available fragments. But with current computational power, exhaustive enumeration of N-length (N>1) loops at atomic resolution is impossible. Even for N=1, considering all the degrees of freedom of a nucleotide results in a million conformations; performing Rosetta energy minimization on these models needs 1 hour of CPU time, which is computationally reasonable. But every time a new nucleotide is added, the conformational space multiplies, growing exponentially with the RNA loop length (for N=5, the computational time is ~10^23 CPU years).

Since enumeration of a one-nucleotide loop is possible, the entire loop can be modeled by stepwise enumerative building of one nucleotide at a time on low-energy conformations which are well packed and hydrogen bonded. Therefore, their implementation of the stepwise assembly (SWA) protocol in a dynamic-programming-like recursion style enables sampling of 12-length loops in achievable CPU time. SWA, having been successful in the prediction of RNA loops, was first used by Rhiju Das to predict protein loops with atomic accuracy. Loop regions in protein structures have particular characteristics compared to regions of regular secondary structure: loops have a similar number of hydrogen bonds (on average 1.1 per residue), mainly contain polar/charged side chains, and have less contact with the non-polar core of the protein. Current loop modeling methods with atomic resolution start off with a reduced representation of the protein, with simplified or no side chains. Although coarse-graining of proteins assists in reducing the large number of local minima, it fails to capture non-polar and hydrogen-bond interactions involving side chains. Therefore, SWA builds up a loop at atomic resolution by sampling the region of conformational space which is energetically favorable, in a computationally tractable way.

SWA PIPELINE:

SWA is implemented in C++ in the Rosetta framework. SWA uses a dynamic programming matrix (an example is shown below in Figure 1D for a 6-residue loop) to allow de novo build-up of loops from residue k to l. To achieve this, at each step SWA adds a loop residue, building forward from the N-terminus (from residue k-1 to i) and backward from the C-terminus (from l+1 to j). Each circle in Figure 1D therefore represents an (i,j) stage. SWA contains 5 main steps:

  1. Pre-packing the starting point: To start, all atoms of the loop region are removed from the model and side chains are added to the remaining structure, with their torsions minimized. This stage (k-1, l+1) is shown as a green circle in Figure 1D. Note that the non-loop backbone is kept fixed in all stages of SWA.
  2. Adding one loop residue at the N-terminal side: This stage is shown by orange arrows (moving downward) in Figure 1D. To generate possible conformations after adding the loop residue, the backbone torsion angles (Φ,Ψ) of the added residue and of the backbone residue before it are sampled (Figure 1A). Φ,Ψ combinations which do not fall in the allowed regions of the Ramachandran plot are discarded. This procedure can generate tens to thousands of conformations. For all the generated models, side chains are added to the sampled residues (i and i-1), and these side chains, along with potential neighbouring side chains, are optimized. Afterwards clustering is performed: models are ranked in order of energy, and a model is removed if its backbone RMSD over residues i and i-1 is <0.10 Å to a lower-energy model (otherwise it is kept as the seed of a new cluster). After clustering, the top 400 models are selected for all-atom energy minimization of the sampled backbone torsions and their neighbouring side chains. Then a final clustering is performed on these models as described above.
  3. Adding one loop residue at the C-terminal side: This stage is shown by pink arrows (moving left) in Figure 1D. It is similar to step 2, with residues j and j+1 considered for backbone sampling (Figure 1B), followed by side-chain packing, model clustering, torsional minimization and final clustering.
  4. Closing loop chains: All models in which the gap between the C-terminal and N-terminal builds is 1, 2 or 3 residues are subjected to chain closure. To generate closed loops, residue i+1 is added at the N-terminal side, and residues i+2 and j-1 at the C-terminal side. For i+1, the Φ and Ψ torsions are sampled by grid search as described above, while the backbones of i+2 and j-1 undergo cyclic coordinate descent (CCD), which changes the Φ and Ψ torsions of i+2 and j-1 until the chain gap is closed. Models with a chain-closure deviation < 0.01 Å are then subjected to side-chain optimization, clustering and torsional minimization. This procedure differs from the above in that all loop side chains and all loop backbone atoms are affected by the minimization. This stage is shown by blue arrows in Figure 1D, for gap lengths of one only. In addition to this procedure, models were also closed by adding the last gap residue by grid search and closing the loop by CCD on the preceding residue; models created by sampling only the C-terminal or only the N-terminal side were likewise completed using CCD loop closure to create full-length models.
  5. Clustering: At each stage 400 models are generated, and the next stage uses these models to generate new conformations, resulting in thousands of models. Several paths can also be taken to reach a specific stage, further adding to the number of generated models. Therefore, since SWA works only on low-energy models, only the 4000 lowest-energy models are kept for further clustering. Clustering is similar to the procedure above, but with an RMSD cut-off of 0.25 Å applied over the entire loop segment built up to that stage. The 400 lowest-energy models are then carried over to the next stage. At the loop closure stage, once whole loops have been modeled, clustering is applied again with an RMSD cut-off of 1.0 Å, and the five lowest-energy models are taken as the SWA prediction.
Figure 1: Stepwise Assembly schematic representation.

For short loops (6 to 9 residues long), it was shown that solutions can be found just by creating models from the N-terminus forward and, separately, from the C-terminus backward, and joining them by loop closure (or, simply, by moving just along the first column and first row of the dynamic programming matrix). Figure 1E shows a directed acyclic graph (DAG) of this procedure. The advantage is that in these cases the computational time reduces to O(N) instead of O(N^2). Therefore, this procedure is tested first for such cases. If the difference between the lowest-energy model and the second lowest is less than 1 kBT (a Rosetta energy unit is approximately 1 kBT), we can argue that the modeling has not converged toward one model, and the whole O(N^2) calculation should take place (except for loops of length 24).
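The bookkeeping of the recursion is easy to caricature: each (i, j) stage keeps only its lowest-energy models, and child stages are produced by growing one residue at either end. The toy skeleton below uses a random stand-in for SWA’s sampling, packing and minimisation machinery, so it illustrates the dynamic programme and nothing more:

```python
import heapq
import random

def extend(model, residue):
    """Stand-in for SWA's sampling + packing + minimisation of one residue:
    returns a few child models as (energy, built_residues) tuples."""
    energy, built = model
    return [(energy + random.random(), built + [residue]) for _ in range(5)]

def swa_buildup(k, l, top_k=400):
    """Skeleton of the SWA dynamic programme for a loop spanning residues
    k..l: stage (i, j) holds models built forward to i and backward to j."""
    stages = {(k - 1, l + 1): [(0.0, [])]}      # pre-packed start (green circle)
    for _ in range(l - k):                      # grow until the gap is one residue
        children = {}
        for (i, j), models in stages.items():
            for m in models:
                for key, res in (((i + 1, j), i + 1),   # grow the N-terminal side
                                 ((i, j - 1), j - 1)):  # grow the C-terminal side
                    children.setdefault(key, []).extend(extend(m, res))
        # keep only the lowest-energy models per stage (clustering omitted here)
        stages = {key: heapq.nsmallest(top_k, ms) for key, ms in children.items()}
    return stages    # stages with a one-residue gap, ready for chain closure

print(sorted(swa_buildup(203, 214, top_k=4)))   # a 1oyc-sized loop, toy energies
```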

RESULTS:

A difficult case study:

Predicting the 1oyc loop (residues 203-214) has always been a challenge for loop predictors, since most of its side chains are polar/charged and hydrogen bonds play an important role in stabilising the loop. None of these factors are considered in ab initio predictors with coarse-grained representations. Figure 2 of the paper displays the SWA build-up procedure for the 1oyc loop. The final best model (Figure 2I), with the lowest energy, has a Cα RMSD of 0.39 Å to the native structure. Following the build-up path of 1oyc shows that the intermediate steps which lead to this final best model have not always had the lowest energy; it is therefore important to keep all the possible intermediate conformations. It is also important to consider that different search paths allow sampling of totally diverse configurations. For example, in Figure 2 (below), five different configurations of the 1oyc loop with comparably low energies, generated by different build-up paths, are shown. Two totally different paths (blue and brown) may result in similar configurations, while reasonably similar paths (pink, green and orange) can result in substantially different loop models.

Figure 2: Prediction of SWA for the 1oyc loop. Five different configurations with comparably low energy are shown.

SWA on 35 loop test set:

SWA was used on a data set of 35 protein loops, 20 of which allowed comparison with PLOP and Rosetta KIC, while 15 were difficult cases with loops ranging from 8 to 24 residues. Comparing the medians of the RMSDs of the lowest-energy models (Table S1 of the paper) shows that SWA achieves better-quality models (0.63 Å) in the same computational time as PLOP and Rosetta KIC. For the other 15 difficult cases, SWA performance dropped to a median RMSD of 1.4 Å for the lowest-energy models. The highlight of SWA, however, is the prediction of 24-residue-long loops, where it achieves sub-angstrom accuracy. Since SWA can use the O(N) strategy to solve the problem, it requires less computational time than Rosetta KIC.

In total, considering the best of 5 models, SWA produces sub-angstrom accuracy for 27 of 35 cases. But looking at the five best models for these 27 cases shows that the best RMSD does not correspond to the lowest-energy model. Also, in some cases Rosetta KIC produces models with better RMSD that are nevertheless energetically worse than SWA’s. This shows that the Rosetta energy function requires improvement, especially in its solvent model (where it fails the most).

SWA and blind test:

  • SWA was used to predict 4 loops of a protein whose native structure had not been released. SWA started with a model in which the loop regions were removed and achieved sub-angstrom accuracy (Rosetta KIC achieved this for 3 of the 4 cases).
  • An SWA loop prediction accuracy of 0.53 Å for an RNA/protein target on a comparative model (instead of an X-ray model) shows its ability to predict complex structures.

DISCUSSION:

The SWA method has been successful in predicting protein loops with sub-angstrom accuracy. Of particular significance are the prediction of the RNA/protein model target and of loops 24 residues long. Although it provides atomic-accuracy predictions, SWA requires 5000 CPU hours for 12-length loops (achievable with current processing power), while Monte Carlo and refinement-based methods can predict loops in hundreds of CPU hours. SWA’s computational time could be improved by considering minimization of several factors in the build-up pathway and by the use of experimental constraints.

The SWA method can be used to guide and assist ab initio prediction of protein structures and protein folding. It may also have applications in other ab initio modeling problems, such as fitting high-resolution protein structures into low-resolution electron density maps, or predicting NMR structures using sparse chemical shift data. In addition, the stepwise ansatz offers solutions to the design of protein interfaces, which requires simultaneous optimization of side-chain conformations, side-chain identities and backbones.

CAPRI, pyDock and the state of protein-protein docking.

Intro

We have discussed a paper highlighting the latest progress in the field of protein-protein docking. It covers the results of the participation of the pyDock group in CAPRI. The latest round of the experiment is particularly interesting, as it challenged the participants with affinity prediction, which goes one step beyond modeling protein-protein complexes.

Role of the group in this CAPRI edition

For those that are not very familiar with CAPRI, one can participate in any of the following roles:

  • Predictors: Participants are given the unbound structures or sequences of one or both interacting partners. Predictors are then supposed to submit a ranked list of modeled complexes which can be achieved by any docking/prediction tools and some human intervention. This tests the current ad hoc capacity to model protein-protein complexes.
  • Servers: Participants are given the unbound structures or sequences of one or both interacting partners, which are supposed to be submitted to an automatic server that needs to return the results (ideally) within 24 hours. This tests the automated capacity to model protein-protein complexes.
  • Scorers: The community contributes sets of poses (unranked) for each target. The scorers are then supposed to re-rank the poses. This is a form of cross-validation of methods: one could expect that a group that samples good poses also scores them well, and this exercise is supposed to quantify that.

The pyDock group participates in the capacity of Predictors and Scorers. Below we first outline the general docking methodology, followed by the particular implementation of the pyDock group.

General Docking Methodology

Protein-protein docking starts with two interacting structures as input: a ligand (L) and a receptor (R) (if a structure is unavailable, a model is created). Docking techniques can be broadly divided into two types: template-based and template-free. In template-based docking, one needs either a complex of homologs or one that is sufficiently structurally similar. However, since it is estimated that the complexes in the PDB only cover ~4% of the presumed total, this approach is not applicable in a great number of cases.

Template-free docking does not rely on pre-solved complexes and it is the main focus of this post. Standard modus-operandi of a template-free protein-protein docker is the following:

  1. Sample N poses.
  2. Re-score top D sampled poses.
  3. Refine the top K poses.

In the first step, possible poses of the ligand with respect to the receptor are sampled. This is very often achieved by FFT (ZDOCK) or more elaborate methods such as geometric hashing/pose clustering (PatchDock). The possible orientations are ranked by a simplified energy function or a statistical potential. In most cases this step is performed in rigid-body mode (intra-molecular distances are not allowed to change) for computational efficiency. The number of sampled poses is usually in the thousands (N ~ 2k-10k).

The re-scoring step uses a more elaborate energy function/statistical potential to identify more near-native poses. Notable examples include pyDock and ZRANK. Since such functions are usually computationally more expensive (for instance, pyDock calculates the Lennard-Jones potential, Coulombic electrostatics and desolvation terms), it is more tractable to apply them to a smaller set (D<<N) of top poses as returned by the rigid-body sampler. Additionally, the final poses might be pruned for redundancy by removing structures which are very similar to each other. One method in particular (ClusPro) owes its success to ranking poses based on the number of similar decoys per cluster among a number of top predictions.

The final step constitutes the refinement of the selected top K poses (K<<D). Here, flexibility might be introduced so as to account for conformational changes upon binding. This is achieved with computationally expensive force fields such as AMBER or CHARMM. The coordinates of side chains or even backbones are perturbed, for instance using normal mode calculations, until the energy function reaches a minimum via a flavour of gradient descent.
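To give a feel for what a re-scoring function computes, here is an illustrative rigid-body score combining a 12-6 Lennard-Jones term with distance-dependent-dielectric Coulomb electrostatics. pyDock additionally includes a desolvation term and has its own parametrisation, so the constants below are generic placeholders:

```python
import numpy as np

def rescore_pose(rec_xyz, rec_q, lig_xyz, lig_q, eps=0.1, sigma=3.5, diel=4.0):
    """Toy intermolecular score for one rigid-body pose.

    rec_xyz, lig_xyz: (N,3) and (M,3) atom coordinates (Angstroms)
    rec_q, lig_q:     per-atom partial charges (e)
    """
    # all receptor-ligand pairwise distances
    d = np.linalg.norm(rec_xyz[:, None, :] - lig_xyz[None, :, :], axis=-1)
    d = np.clip(d, 1.0, None)            # soften singularities at clashes
    lj = 4.0 * eps * ((sigma / d) ** 12 - (sigma / d) ** 6)
    # 332 converts e^2/Angstrom to kcal/mol; dielectric grows with distance
    coulomb = 332.0 * np.outer(rec_q, lig_q) / (diel * d ** 2)
    return lj.sum() + coulomb.sum()
```

Applying such a function to the D top poses and keeping the best K is exactly the funnel described above.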

Fig. 1: Breakdown of results of the pyDock group at the latest CAPRI round. The results are presented for each target, in the predictor as well as the scorer capacity.

The pyDock group Methodology and Results

The pyDock group follows the pattern outlined above. As samplers they employ ZDOCK and FTDOCK, both fast rigid-body Fast-Fourier-Transform-based methods. They use their pyDock function to score the best models. Additionally, they remove redundancy from the final lists of decoys by removing those entries which are too similar to higher-scoring ones according to ligand RMSD. The final refinement step is carried out using TINKER.

Receptor is shown in white. Best ligand as predictors in red, as scorers in blue.

The pipeline outlined above is carried out for most targets; however, there are some exceptions. For some targets, additional docking poses were generated using SwarmDock (T53, T54, T57 and T58) and RotBUS (T46 and T47). In some cases, CHARMM or AMBER were used at the refinement stage rather than TINKER.

Such ad hoc approaches were employed for almost every target, and the available information was used to the fullest in order to achieve the best models. For instance, in cases where good homology structures were available (T47), or where it was known which residues are of importance to the complex, appropriate distance constraints were imposed. The pyDock group achieves better accuracy as predictors than as scorers when it comes to docking (67% correct models submitted against 57%). They follow a trend wherein predictors usually do better than scorers (see Fig. 1). This trend is, however, broken at the stage of predicting the water molecules at the interface of T47: using DOWSER, they correctly identify 20% of contact-mediating water molecules in the predictor capacity and 43% as scorers.

Overall, the performance of docking indicates that the field is steadily moving towards the goal of modeling complexes using sequence alone. There were some cases in this round where predictor groups started with sequence only and still produced reasonable models of complexes (T47, T50 and T53). In each of these cases one of the partners was an unbound structure and the other was a sequence. The only case where both partners were sequences did not produce any reasonable models; there, only two groups out of 40 managed to present meaningful solutions.

Furthermore, this CAPRI round was the first to introduce affinity prediction – in targets T55-T56 the aim was to predict the binding energy changes upon mutation. The predictor designed by the pyDock group achieved good results on this test case, with more in-depth details on the approach to be found in a related community-wide experiment.

Journal Club: A Novel Insight into Gene Ontology Semantic Similarity

In a past journal club, I presented a paper by Xu et al on their method of quantifying semantic similarity of proteins based on the Gene Ontology. The Gene Ontology (GO) is a directed acyclic graph (DAG, c.f. Figure 1) of terms with which a protein or a gene can be annotated. These terms are designed to describe the process the protein is involved in, what role it fulfils in this process, and where in the cell it is localized. Thus, when comparing the function of two proteins in a quantifiable way, it has become standard to refer back to the GO terms these proteins are annotated with and compare these based on their relationship in the DAG.

Figure 1: Schematic diagram of a Directed Acyclic Graph (DAG).

As opposed to many methods, which measure the importance of a node (GO-term) in the DAG as its information content given an external database, the method proposed by Xu et al measures semantic similarity independently of external resources, which gives it the appearance of an independent measure. Furthermore, it claims to be good at dealing with badly annotated proteins, which is often a big problem in functional similarity calculations.

The similarity measure is a hybrid between node-based and edge-based methods, seemingly inspired by Wang et al’s 2007 paper and Shen et al’s 2010 paper. It is based on what they call the “Shortest Semantic Differentiation Distance” (SSDD), which is calculated over the shortest distance between two GO terms on the DAG. When comparing the GO terms A and B, the shortest path is found by traversing the DAG upwards from node A to the lowest common ancestor of both nodes and then down to node B.
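
To make this path construction concrete, here is a minimal Python sketch on a toy DAG (the term names and the dictionary-of-parents representation are mine for illustration, not from the paper):

from collections import deque

# Toy GO-like DAG: each term maps to the set of its parent terms.
PARENTS = {
    "root": set(),
    "A1": {"root"}, "A2": {"root"},
    "B1": {"A1"}, "B2": {"A1", "A2"},
    "C1": {"B1"}, "C2": {"B2"},
}

def upward_distances(term):
    # Breadth-first search upwards; returns {ancestor: edge distance},
    # including the term itself at distance 0.
    dist = {term: 0}
    queue = deque([term])
    while queue:
        node = queue.popleft()
        for parent in PARENTS[node]:
            if parent not in dist:
                dist[parent] = dist[node] + 1
                queue.append(parent)
    return dist

def shortest_path_length(a, b):
    # Shortest up-and-down path: up from a to a common ancestor, down to b.
    da, db = upward_distances(a), upward_distances(b)
    return min(da[c] + db[c] for c in set(da) & set(db))

print(shortest_path_length("C1", "C2"))  # 4, via the lowest common ancestor A1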

The SSDD calculation over the shortest path is based on their so-called semantic Totipotency values assigned to the terms in the DAG that are part of the shortest path. The semantic Totipotency, T, of a term is calculated by:

[Equation: semantic Totipotency, T]

where the weight, ω, is given by:

[Equation: weight, ω]

Here, Dst(t) denotes the number of descendants of the term t, and tp denotes the parent term of t. Thus, the T-value of every node expresses both the depth of the DAG in that region and its coverage.
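
The equation images did not survive here. Judging from the description above (and the method's kinship to Wang et al's S-values), the recursion plausibly has a multiplicative shape of the following kind; this is my inference, not the paper's verbatim formula:

\[
T(t) =
\begin{cases}
1 & \text{if } t \text{ is the root,}\\
\omega \cdot T(t_p) & \text{otherwise,}
\end{cases}
\qquad
\omega = f\big(\mathrm{Dst}(t), \mathrm{Dst}(t_p)\big)
\]

with f some function that discounts a term relative to its parent according to their descendant counts.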

Finally, the SSDD is calculated by:

[Equation: Shortest Semantic Differentiation Distance (SSDD)]

And subsequently the similarity of two GO terms is measured by:

[Equation: similarity of two GO terms]
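
With T-values in hand, the distance and similarity can be assembled as in the sketch below, continuing the toy DAG from above (with B2's second parent collapsed for simplicity). The weight (Dst(t)+1)/(Dst(tp)+1) and the arctan normalisation (in the spirit of Shen et al's 2010 measure) are stand-ins of mine, not the paper's exact definitions:

import math

# Terms on the shortest path between C1 and C2 (from the traversal above),
# with each term's parent and descendant count Dst (toy values).
PATH = ["C1", "B1", "A1", "B2", "C2"]
PARENT = {"A1": "root", "B1": "A1", "B2": "A1", "C1": "B1", "C2": "B2"}
DST = {"root": 6, "A1": 4, "B1": 1, "B2": 1, "C1": 0, "C2": 0}

def totipotency(term):
    # T is 1 at the root and is discounted towards the leaves; the exact
    # weight is the paper's omega -- the ratio below is an assumed stand-in.
    if term == "root":
        return 1.0
    parent = PARENT[term]
    omega = (DST[term] + 1) / (DST[parent] + 1)
    return omega * totipotency(parent)

def ssdd(path):
    # Accumulate the T-values of the terms on the shortest path.
    return sum(totipotency(t) for t in path)

def similarity(path):
    # Map the unbounded distance onto (0, 1]; smaller SSDD = more similar.
    return 1.0 - math.atan(ssdd(path)) / (math.pi / 2)

print(round(similarity(PATH), 3))  # ~0.361 for the toy path above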

Results

In their paper Xu et al showed the method to be competitive with other methods that compute protein functional similarity through pairwise GO-term comparisons, while also outperforming simUI, a common graph-based method. While these results look promising, the biological interpretability of such a semantic similarity measure remains difficult.

The strongest advantage of the proposed SSDD method was, however, its alleged invariance to the annotation richness of proteins, presented in Figure 2 below (Figure 5 in the paper).

Figure 2: The performance of different methods dealing with sets of proteins with different annotation richness.

The results in this figure show that SSDD exhibits only a slight decrease in Pearson correlation coefficient against a set of reference similarity values for proteins that are less well annotated. This ability to deal with sparsely annotated proteins is the true value of the SSDD method proposed by Xu et al. However, this investigation was performed on sets of proteins selected by the authors, and should thus be validated independently to confirm these surprising results.

Journal Club: Ligand placement based on prior structures: the guided ligand-replacement method

Last week I presented a paper by Klei et al. on a new module in the Phenix software suite. This module, entitled Guided Ligand-Replacement (GLR), aims to make it easier to place ligands during the crystallographic model-building process by using homologous models of the ligand-protein complex for the initial placement of the ligand.

In the situation where ligands are being added to a crystallographic protein model, a crystallographer must first build the protein model, identify the difference electron density, and then build the ligand into this density.

The GLR approach is particularly helpful in several cases:

  • In the case of large complex ligands, which have many degrees of freedom, it can take a long time to fit the ligand into the electron density. There may be many different conformations of the ligand that fit the difference electron density reasonably well, and it is the job of the crystallographer to explore these and then identify the true model, or perhaps an ensemble of models where the ligand is mobile or present in several distinct binding modes. GLR makes this process easier by using a template from a similar, previously solved structure: the ligand position and orientation are transplanted to the new structure as a starting point for the crystallographer, reducing the tedium of initially placing the ligand.
  • In the case of a series of related crystal structures, where the same protein is solved a number of times bound to different (but similar) ligands. This is common in structure-based drug design (SBDD), where a compound is iteratively elaborated to improve binding affinity and specificity for a particular protein, generating a series of crystal structures of the protein bound to a series of ligands with similar binding modes. The position and orientation of the ligand from one structure is therefore a good starting point for placing elaborations of that ligand in subsequent structures.
  • In the case of several copies of the protein in the asymmetric unit of the crystal. After one copy of the ligand has been built, it can quickly be populated throughout the other copies, removing the need for the crystallographer to undertake this menial and tedious task.

Program Description:

The inputs required by GLR are standard for any ligand-fitting algorithm, namely:

  • The apo structure of the protein (the structure of the protein without the ligand)
  • A description of the ligand (as a SMILES string, a cif file, etc.)
  • An mtz file containing the experimental diffraction data

Overview of the program:

Fig 1. Program Overview.

> Identification of the reference structure

Firstly, the program must determine the reference structure to be used as a template. This can be specified by the user, or GLR can search a variety of sources to find the best template. The template selection process is outlined below. Reference structures are filtered by protein sequence identity, by the similarity of the ligands’ molecular weights, and finally by the similarity of the ligands’ binary chemical fingerprints (compared using the Tanimoto coefficient).

Fig 2. Reference Structure selection flow diagram.

Little justification is given for these cutoffs, although it is generally accepted that proteins with above 70% sequence identity are highly structurally similar. The Tanimoto coefficient cutoff of 0.7 presumably serves only to remove the possibility of very low-scoring matches, since, when multiple potential reference structures are available, the ligand match with the highest Tanimoto score is used. They do not, however, say how they balance the choice in the final stage, where they take the ligand with the highest Tanimoto score and the best resolution.

The method for assigning the binary chemical fingerprints can be found here (small error in link in paper).
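
Since the Tanimoto cutoff does the heavy lifting in this filter, it is worth recalling how little machinery it needs: for binary fingerprints it is simply the number of on-bits shared by the two ligands over the number set in either. A quick sketch (the bit positions are invented):

def tanimoto(fp1, fp2):
    # Binary fingerprints represented as sets of on-bit indices.
    shared = len(fp1 & fp2)
    return shared / (len(fp1) + len(fp2) - shared)

ref_fp = {3, 17, 42, 108, 256}        # hypothetical reference ligand
tgt_fp = {3, 17, 42, 99, 256, 301}    # hypothetical target ligand
print(tanimoto(ref_fp, tgt_fp))       # 4 shared / 7 in the union ≈ 0.571, below the 0.7 cutoff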

> Superposition of Reference and Target structures

Once a reference structure has been selected, GLR uses graph-matching techniques from eLBOW to find the correspondences between atoms in the reference and target ligands. These atomic mappings are used to orient and map the target ligand onto the reference ligand.

Once the reference protein-ligand structure is superposed onto the target protein, these atomic mappings are used to place the target ligand.
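
GLR itself relies on eLBOW for this graph matching; purely to illustrate the match-then-transplant idea, here is how an analogous atom correspondence and superposition could be obtained with RDKit’s maximum-common-substructure search (RDKit is my stand-in here, not what GLR uses):

from rdkit import Chem
from rdkit.Chem import AllChem, rdFMCS

# Two related ligands: toluene as the "reference", ethylbenzene as the "target".
ref = Chem.MolFromSmiles("Cc1ccccc1")
tgt = Chem.MolFromSmiles("CCc1ccccc1")
for mol in (ref, tgt):
    AllChem.EmbedMolecule(mol, randomSeed=7)   # generate 3D coordinates

# The maximum common substructure gives the atom-to-atom correspondences.
mcs = rdFMCS.FindMCS([ref, tgt])
core = Chem.MolFromSmarts(mcs.smartsString)
ref_match = ref.GetSubstructMatch(core)
tgt_match = tgt.GetSubstructMatch(core)

# Superpose the target onto the reference over the matched pairs; atoms
# outside the core land arbitrarily, which is why the placed ligand is
# then refined against the electron density.
rmsd = AllChem.AlignMol(tgt, ref, atomMap=list(zip(tgt_match, ref_match)))
print(f"{mcs.numAtoms}-atom core superposed with RMSD {rmsd:.2f} Å")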

The target complex then undergoes a real-space refinement to adjust the newly-placed ligand to the electron density. This allows the parts of the target ligand that differ from the reference ligand to adopt the correct orientation (as they will have been orientated arbitrarily by the graph-matching and superposition algorithms).

> Summary, Problems & Limitations

GLR allows the rapid placement of ligands when a homologous complex is available. This reduces the need for computationally intensive ligand-fitting programs, or for tedious manual building.

For complexes where a homologous complex is available, GLR will be able to quickly provide the crystallographer with a potential placement of the ligand. However, at the moment, GLR does not perform any checks on the validity of the placement. There is no culling of the placed ligands based on their agreement with the electron density, and the decision as to whether to accept the placement is left to the crystallographer.

As the authors recognise in the paper, there is the problem that GLR currently removes any overlapping ligands that it places. This means that GLR is unable to generate multiple conformations of the target ligand: all but one (that which agrees best with the electron density) will be removed. As such, the crystallographer will still need to check whether the proposed orientation of the ligand is the only conformation present, or whether additional models of the ligand must be built.

As it is, GLR seems to be a useful time-saving tool for crystallographic structure solution. Although it is possible to incorporate the tool into automated pipelines, I feel that it will be mainly used in manual model-building, due to the problems above that require regular checking by the crystallographer.

There are several additions that could be made to overcome the current limits of the program, as identified in the paper. These mainly centre around generating multiple conformations and validating the placed ligands. If implemented, GLR will become a highly useful module for the solution of protein-ligand complexes, especially as the number of structures with ligands in the PDB continues to grow.