Category Archives: Group Meetings

What we discuss during cake at our Tuesday afternoon group meetings

Journal Club: Statistical Quality Indicators for Electron Density Maps

This week I presented Ian Tickle’s 2012 paper “Statistical quality indicators for electron-density maps”. The paper presents new, statistically robust metrics for describing the agreement between an atomic model and the experimentally derived electron density.

Previous metrics such as the Real-Space R-factor (RSR) and the Real-Space Correlation Coefficient (RSCC) (Brändén & Jones, 1991, and others) are popular electron density metrics, and can inform on the quality of an atomic model. However, as Tickle argues, they cannot tell us why a model is good or bad, as they give no indication of either the accuracy or the precision of the electron density.

Accuracy:

Ian Tickle describes accuracy as – “How close are the results on average to the truth (regardless of precision)?” This is more often referred to as ‘error’. The most accurate model is the one that best agrees with the electron density.

Precision:

Precision is described as – “If you were to repeat the experiment, how much would you expect the results to vary (regardless of accuracy)?” This is more often described as ‘uncertainty’. Precision is a property of the crystal and the experiment. It is independent of the model.

A pictographic representation is shown below –

Pictographic representation of accuracy and precision. Taken from Tickle, 2012.

Before the discussion of the new metrics proposed, there are several assumptions that must be made and several influencing factors to be considered.

Assumptions:

  • The electron density, and the phases used to generate it, are accurate. This assumption is reasonable because density-based validation is generally done near to the end of refinement when the model is mostly correct.

Metric usefulness depends critically on:

  • Accurate calculation and representation of the electron density from our atomic model.
  • Accurate scaling of the observed and model density (neither the calculated nor the observed density is on an absolute scale).
  • Accurate determination of the area to be considered for the calculation of the metric. If too large an area is considered, noise and density from other features will influence the metric. Too small an area will not encompass the whole model and its environment.

Calculating the Model Density:

Accurate calculation of the model’s electron density is essential, as the profile of the atoms will of course affect the comparison of the model to the experimental density. Often (as in Jones 1991, and others) a fixed profile is assumed for all atoms. Of course, in reality the profile will depend on atom type, B-factors, data completeness, and resolution limits.

Due to the resolution limits, the electron density from an atom is the convolution of a 3d gaussian and a sphere of constant scattering power (Blundell & Johnson, 1976). The truncated density function for an atom then becomes:

[Equation image missing: the truncated atomic density function.]

Scaling the calculated density:

Fortunately, the required scale factor is already calculated by refinement programs (when computing 2mFo – DFc maps): it is the resolution-dependent factor D.

Visualising the quality of a model:

To demonstrate how the (global) quality of a model can easily be seen, Tickle calculates and normalises difference density maps for a good, and a bad, model. If the model is ‘correct’, then the difference density should be gaussian noise, but if the model is ‘incorrect’, it will be skewed. This can easily be seen in Figure 8 from the paper.

[Figure 8 from the paper: normalised difference density distributions for a good model (a) and a bad model (b).]

A difference density map is calculated, sorted by value and normalised to give a density distribution. For a good model, this should look like (a), where the density function is a gaussian ~ N(0,1). For a bad model, (b), the distribution is skewed.

The main feature of a ‘bad’ model is increased weight in the tails of the distribution. Extra weight on the left-hand side indicates modelled atoms that are not supported by the evidence, and extra weight on the right-hand side indicates under-modelled density.
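
As a concrete illustration (not from the paper), one might normalise a difference density map and inspect the weight in its tails like this; `diff_map` is a hypothetical array of difference density values:

```python
import numpy as np
from scipy import stats

def diff_density_tails(diff_map):
    """Normalise a difference density map and summarise the weight in its tails.

    diff_map : array of difference density (delta-rho) values on the map grid.
    For a 'correct' model the normalised values should look like N(0, 1);
    extra weight beyond -3 sigma flags unjustified model, and beyond +3 sigma
    flags under-modelled density.
    """
    z = np.asarray(diff_map, dtype=float).ravel()
    z = (z - z.mean()) / z.std()                # normalise to zero mean, unit variance
    neg_tail = float(np.mean(z < -3.0))         # fraction of strongly negative points
    pos_tail = float(np.mean(z > 3.0))          # fraction of strongly positive points
    expected = 2.0 * stats.norm.sf(3.0)         # ~0.27% expected for pure Gaussian noise
    return neg_tail, pos_tail, expected
```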

The New Accuracy Metric

Using the ideas from above (that the difference density between a model and the experimental density should be distributed as a gaussian) Tickle goes on to develop metrics for determining the likelihood that a model density and an experimental density differ only by the addition of random noise.

The metric he describes tests a hypothesis – Does the distribution of the difference density reflect that obtained from the propagation of random errors in the experimental data (and phases)?

To do this, statistical tests are developed. First we define the difference density Z-score (ZD)

ZD = Δρ / σ(Δρ)

This quantity is the difference between the calculated electron density and the experimental density (delta rho), divided by the error in the difference density, giving the normal definition of a normalised Z-score.

The difference density (the numerator) has been discussed above, so we now discuss the error in the difference density. Assuming that the experimental data and the phases are ‘correct’, any non-random errors arise only from the model.

That is, errors arising in the experimental data will appear only as random noise, whereas errors in the model will manifest as the signal that we are trying to detect.

To estimate the strength of the noise (from the experimental data and the phases), we look at the bulk-solvent region. Here the atoms are disordered and so should give uniform density; any deviation from uniformity approximates the random noise from the experimental data and the phases.
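
Putting the two previous paragraphs together, a minimal sketch of the ZD calculation might look like this (hypothetical inputs; not Tickle’s implementation):

```python
import numpy as np

def zd_scores(diff_map, solvent_mask):
    """Difference density Z-scores, ZD = delta-rho / sigma(delta-rho).

    diff_map     : difference density values on the map grid.
    solvent_mask : boolean array marking bulk-solvent grid points.
    The noise level sigma is estimated from the bulk-solvent region, where
    the density should be featureless, so its spread approximates the random
    error propagated from the data and phases.
    """
    diff_map = np.asarray(diff_map, dtype=float)
    sigma = diff_map[np.asarray(solvent_mask, dtype=bool)].std()
    return diff_map / sigma
```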

Maximum Z-score analysis

Tickle considers using the maximum ZD of a sample as a test for model accuracy, and discusses merits and failings. In brief, if we were to sample from a difference density distribution, and take only the most significant ZD score, “focusing only on the maximum value inevitably overstates the significance of the results”.

A Chi-Squared test for ZD scores

The solution that Tickle proposes is to allow that all sample values may be significant (rather than just the largest values). He creates a joint probability density function of the absolute sample values (assumed half-normal and iid). This probability density function then becomes a chi-squared distribution.

By calculating the CDF of this chi-squared distribution (a lower regularised gamma function), Tickle is able to obtain p-values for a set of observations.

These p-values can then be converted back to Z-scores, which crystallographers are more comfortable using. As Tickle notes, expressing the metric in terms of Z-scores does not mean that the underlying distribution is normal (here it is clearly a chi-squared).
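
A minimal sketch of this chain of conversions using SciPy (illustrative only; it implements the plain chi-squared test described here, not the tail-selection refinement discussed below):

```python
import numpy as np
from scipy import stats

def equivalent_z(z_values):
    """Combine a sample of ZD values into a single equivalent Z-score.

    Under the null hypothesis the ZD values are i.i.d. standard normal, so
    the sum of their squares follows a chi-squared distribution with n
    degrees of freedom; the upper-tail p-value of that chi-squared is then
    mapped back onto the standard normal scale.
    """
    z = np.asarray(z_values, dtype=float)
    p_value = stats.chi2.sf(np.sum(z ** 2), df=z.size)  # P(chi-squared >= observed)
    return stats.norm.isf(p_value), p_value             # equivalent one-sided Z-score
```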

Problem: Large Samples

The problem with the above is that for large samples, a small number of significant values will be drowned out by noise and the signal may be missed. The failure of the above test in this situation is put down to the choice of null hypothesis. Multiple null hypotheses are needed in order to cover all possibilities.

When distinguishing between multiple hypotheses, we must aim to avoid type II errors wherever possible, whilst attempting to minimise type I errors. We must select the hypothesis “that maximises the probability of obtaining a result less extreme than the one actually observed (…) or equivalently the one that minimises the probability of obtaining a result more extreme than that observed”.

Solution: A new JPDF and CDF

To solve this, Tickle takes a subset of the highest values of the original sample of n, say from i=k to n (the n-k highest scores), and calculates the chi-squared and its associated cumulative probability. We then choose the value of k that gives the highest probability.

However, the cumulative probability of the chi-squared is no longer the regularised gamma function, due to the bias introduced by selecting the largest values. Recalculating the JPDF and integrating analytically and numerically to obtain the CDF, we could arrive at a result. This, however, has the problem of high dimensionality, requiring very accurate Monte Carlo integration (accuracy much better than 0.27% is required, since we are interested in p-values between 0.9973 and 1 – greater than 3 sigma).

Fortunately, an approximation can be made to bring about a practical solution.

Examples of Significant Distributions:

Tickle generates a table which gives several scenarios that will give a significant result, for different extremes of Z-value, and different sample sizes. One particularly key point is

…small samples are statistically less reliable so require a higher proportion of significant data points to achieve the same overall level of significance. Large samples require relatively fewer data points but they must have higher values to overcome the ‘multiple comparisons’ effect, where large values are more likely to occur purely as a result of random error.

Summary

B-factors:

Tickle shows early in the paper that the RSR and RSCC are correlated with the B-factors of the model, whereas RSZD (the real-space difference density Z-score) shows no correlation with the B-factor, as is desired.

RSZD+ & RSZD−:

More useful scores can be generated by scoring the negative and positive values of delta-rho separately. This gives two scores, RSZD+ and RSZD-. RSZD+ gives the significance/prevalence of unexplained density (missing atoms) and RSZD- gives the significance/prevalence of unjustified model/misplaced atoms.

Precision & Reliability:

Although not discussed in as much depth in the paper, Tickle also proposes a metric, RSZO, to account for the precision of the electron density:

[Equation image missing: RSZO, the ratio of the average observed density in a region to its noise level.]

This is clearly independent of the model, and is the signal-to-noise ratio of the average observed density in a specified region. Weak density (or large noise) will lead to a small RSZO, implying that any model placed here should be considered unreliable.
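
A correspondingly small sketch of the metric as described here (hypothetical inputs; illustrative only, not Tickle’s exact definition):

```python
import numpy as np

def rszo(obs_map, region_mask, sigma_noise):
    """Signal-to-noise of the average observed density over a region.

    obs_map     : observed (2mFo-DFc) density values on the map grid.
    region_mask : boolean array selecting the region of interest (e.g. one residue).
    sigma_noise : noise level estimated from the bulk solvent, as above.
    A small value flags weak or noisy density where any model would be unreliable.
    """
    region = np.asarray(obs_map, dtype=float)[np.asarray(region_mask, dtype=bool)]
    return float(region.mean() / sigma_noise)
```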

Journal Club: Native contacts in protein folding

Like your good old headphone cables, strings of amino acids have the potential to fold into a vast number of different conformations given the appropriate conditions. A conservative estimation for the time it would take a 100 residue protein to explore all theoretically possible conformations would exceed the age of the Universe several times. This is obviously not feasible and was pointed out by Levinthal when he published his “How To Fold Graciously” in 1969.

The so-called protein-folding problem has since been under intense study, which has inevitably led to a few theories and models about its nature. Due to the lack of appropriate wet-lab methods to study this phenomenon, theoretical and computational approaches have been key to devising frameworks for formally describing protein folding. One of these goes under the name of the principle of minimum frustration, introduced by Bryngelson and Wolynes in the late 80s (1). It states that evolution has enriched proteins for sequences with the propensity to fold into low-energy structures, while actively selecting against kinetic traps. By avoiding misfolding and non-native contacts, the theory says, a smooth funnel-like energy landscape with a native-state minimum is created that ensures robust and fast folding.

This implies that native contacts, i.e. residues that interact in the fully folded protein, play a major role in the folding process. Gō models (2), named after Nobuhiro Gō, who first proposed this method, are built around this assumption, with the energetic contributions of native interactions acting as the sole driving force in the folding process. While this approach has yielded promising results, many of which were in concordance with experiments, its underlying principles had never been validated in a statistically meaningful way.

A schematic for native-contact-driven protein folding

In 2013 a study by Best, Hummer and Eaton (3) formally addressed this question. By devising a set of statistical quantities aimed at weighing the importance of native and non-native interactions for folding, and applying these to the analysis of several long MD folding simulations, they were able to show a “native-centric mechanism” for small fast-folding proteins.

In a first step it was assessed whether the fraction of native contacts provides a suitable reaction coordinate for the simulated folding events. From their equilibrium simulations, two thresholds of the native-contact fraction were chosen that define the folded and unfolded states (a two-state model is assumed). Overlaying the most visited native-contact fractions during simulation against these thresholds revealed a strong correlation between the two equilibrium probability density maxima and the protein’s fold state. In addition, they showed that native-contact fractions between the unfolded and folded thresholds were indicative of being on a transition path (defined as the “.. regions of the trajectories that cross directly from the unfolded well to the folded well ..”).
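
As a concrete (simplified) illustration, the fraction of native contacts for a single trajectory frame can be computed from a pre-defined native contact list; the paper uses a smooth switching function rather than the hard cutoff used here, and all inputs are hypothetical arrays:

```python
import numpy as np

def fraction_native_contacts(coords, native_pairs, native_dists, tol=1.2):
    """Fraction of native contacts Q for a single trajectory frame.

    coords       : (n_atoms, 3) coordinates for the frame.
    native_pairs : (n_contacts, 2) atom-index pairs that are in contact in
                   the native structure (hypothetical pre-computed list).
    native_dists : the corresponding native-state distances.
    A contact counts as formed if its distance is within `tol` times the
    native distance; the paper uses a smooth switching function instead of
    this hard cutoff.
    """
    coords = np.asarray(coords, dtype=float)
    pairs = np.asarray(native_pairs, dtype=int)
    ref = np.asarray(native_dists, dtype=float)
    d = np.linalg.norm(coords[pairs[:, 0]] - coords[pairs[:, 1]], axis=1)
    return np.mean(d < tol * ref)

# Frames with Q below the 'unfolded' threshold or above the 'folded'
# threshold are assigned to the two wells; frames in between are the
# candidate transition-path segments described above.
```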

A further measure was introduced with the contact lifetime test. The log-ratio of the time a contact spent on a transition path vs the time it existed in the unfolded state was calculated and compared in a heat-map to the native contact map coloured by the number of contacts between residues.

Contact life time test for a selected protein.
Adapted from (3).

Among other things, this result revealed a clear connection between contacts with longer transition-path lifetimes and the number of contacts they make in the native structure.

So what about non-native interactions?

[Figure: Bayesian measure of being on a transition path given a non-native contact.]

One of the measures addressing this question was a Bayesian measure of the probability of being on a transition path given that a non-native contact is formed. In the examples used in this paper, no obvious link was found unless the non-native contacts were close to native ones. Further criteria, such as the complementary quantity (the probability of being on a transition path when a contact is not made), led to similar conclusions.
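
The Bayesian measure itself is just Bayes’ rule applied to per-frame indicators; a minimal sketch (hypothetical boolean arrays, not the authors’ code):

```python
import numpy as np

def p_tp_given_contact(on_tp, contact_formed):
    """P(TP | contact formed), via Bayes' rule over trajectory frames.

    on_tp          : boolean array, True for frames lying on a transition path.
    contact_formed : boolean array, True for frames in which the contact exists.
    Both are hypothetical per-frame indicators of equal length; a value well
    above the overall fraction of transition-path frames would suggest the
    contact is informative about being on a transition path.
    """
    on_tp = np.asarray(on_tp, dtype=bool)
    contact_formed = np.asarray(contact_formed, dtype=bool)
    p_contact_given_tp = contact_formed[on_tp].mean()
    p_tp = on_tp.mean()
    p_contact = contact_formed.mean()
    return p_contact_given_tp * p_tp / p_contact
```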

Interestingly, it was found that the one protein that was influenced by non-native contacts was the designed α3D. Best et al. reasoned that the additional frustration introduced when building a protein with artificially introduced stability led to a shift in helix register, giving rise to this outlier.

When taken together, these results lay a robust foundation for further studies along the same lines. It is too early to accept or reject the presented findings as universal truth, but strong arguments for the native-centric mechanism being a reasonable model in small fast-folding proteins have been made. It would not be far-fetched to think that larger proteins would adhere to similar principles with non-native contacts modulating the landscape, especially when considering individual downhill folding modules.

References:

(1) Bryngelson, J.D. et al., 1995. Funnels, pathways, and the energy landscape of protein folding: a synthesis. Proteins, 21(3), pp.167–95.

(2) Taketomi, H., Ueda, Y. & Gō, N., 1975. Studies on protein folding, unfolding and fluctuations by computer simulation. I. The effect of specific amino acid sequence represented by specific inter-unit interactions. International journal of peptide and protein research, 7(6), pp.445–59.

(3) Best, R.B., Hummer, G. & Eaton, W.A., 2013. Native contacts determine protein folding mechanisms in atomistic simulations. Proceedings of the National Academy of Sciences of the United States of America, 110(44), pp.17874–9.

Journal Club: Random Coordinate Descent

The paper I chose to present at last week’s group meeting was “Random Coordinate Descent with Spinor-Matrices and Geometric Filters for Efficient Loop Closure”, by Pieter Chys and Pablo Chacón.

Loop closure is an important step in the ab initio modelling of protein loops. After a loop is initially built, normally by randomly choosing φ/ψ (phi/psi) dihedral angles from a distribution (Step 1 in the figure below), it is probably not ‘closed’ – i.e. the end of the loop does not meet the rest of the protein structure on the other side of the gap. Waiting for the algorithm to produce closed initial conformations would be horribly inefficient, so it’s much better to have some method of closing the initial loop structures computationally.

The main steps in the ab initio prediction of protein loops.

Loop closure methods can be classified into three different types:

  1. Analytical methods: the exact solution to the loop closure problem is calculated. The difficulty with this approach is that it becomes increasingly complicated the more degrees of freedom (i.e. dihedral angles) you have.
  2. Build-up methods: the loop is built residue-by-residue to construct an approximately closed loop which can then be refined. Basically, the loop is guided to the closed position as it is being built.
  3. Iterative methods: do just what they say on the tin – the loop is closed gradually through a series of iterations.

Of course, science is never simple, and loop closure algorithms often cannot be classified into just one of the above categories. Cyclic coordinate descent (CCD), the method on which the random coordinate descent algorithm introduced in this paper is based, is a mix of the analytical and iterative approaches. The loop is initialised starting from one anchor residue (the anchor residues are the residues either side of the loop). The anchor residue from the other side is then added to the end of the ‘open’ loop structure. This residue is therefore present twice: the ‘fixed’ anchor residue (the true structure) and the ‘mobile’ anchor residue (the one added to the loop structure). Then, starting from the end of the loop that is attached to the rest of the protein, the dihedral angles are changed sequentially to try to minimise the distance between the fixed and mobile anchor residues. The angle change that minimises this distance is calculated analytically. Once the distance is within a particular cut-off value, the loop is considered closed and this becomes the final structure.
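
To make the CCD idea concrete, here is a toy sketch in which a chain of backbone points is closed by repeatedly rotating about one virtual bond at a time, using the closed-form optimal rotation angle (the Canutescu–Dunbrack formulation). It illustrates the principle only, not the spinor-matrix implementation described in the paper; RCD differs mainly in choosing the bond at random rather than sequentially.

```python
import numpy as np

def rotate_about_axis(points, origin, axis, theta):
    """Rotate an (n, 3) array of points about a unit axis through origin
    (Rodrigues' rotation formula)."""
    v = points - origin
    return (origin
            + v * np.cos(theta)
            + np.cross(axis, v) * np.sin(theta)
            + axis * (v @ axis)[:, None] * (1.0 - np.cos(theta)))

def ccd_close(chain, fixed_anchor, n_mobile=3, tol=0.08, max_iter=1000):
    """Toy cyclic coordinate descent loop closure.

    chain        : (n, 3) backbone points; the last `n_mobile` points are the
                   'mobile' copy of the anchor residue.
    fixed_anchor : (n_mobile, 3) coordinates of the 'fixed' anchor residue.
    At each iteration one virtual bond is chosen (sequentially here; RCD
    would pick it at random) and the rotation about that bond which best
    superposes the mobile anchor onto the fixed anchor is found in closed
    form, then applied to all downstream points.
    """
    chain = np.asarray(chain, dtype=float).copy()
    fixed_anchor = np.asarray(fixed_anchor, dtype=float)
    n_bonds = len(chain) - n_mobile - 1
    rmsd = np.inf
    for it in range(max_iter):
        k = it % n_bonds                      # sequential sweep over bonds
        origin = chain[k]
        axis = chain[k + 1] - origin
        axis /= np.linalg.norm(axis)
        a = b = 0.0
        for m, f in zip(chain[-n_mobile:], fixed_anchor):
            r_perp = (m - origin) - ((m - origin) @ axis) * axis
            r_len = np.linalg.norm(r_perp)
            if r_len < 1e-8:
                continue
            r_hat = r_perp / r_len
            s_hat = np.cross(axis, r_hat)
            f_perp = (f - origin) - ((f - origin) @ axis) * axis
            a += r_len * (f_perp @ r_hat)     # coefficient of cos(theta)
            b += r_len * (f_perp @ s_hat)     # coefficient of sin(theta)
        theta = np.arctan2(b, a)              # analytically optimal angle
        chain[k + 1:] = rotate_about_axis(chain[k + 1:], origin, axis, theta)
        rmsd = np.sqrt(np.mean(np.sum((chain[-n_mobile:] - fixed_anchor) ** 2, axis=1)))
        if rmsd < tol:                        # anchor distance below cutoff: closed
            break
    return chain, rmsd
```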

Random coordinate descent (RCD) is based upon CCD, but with a number of alterations and additions:

  1. Instead of iterating through each dihedral angle sequentially along the loop backbone, angles are chosen randomly
  2. A spinor-matrix approach is used – this reduces loop closure times
  3. Geometric filters are applied at various points in the algorithm – before, during or after loop closure.
  4. ‘Switching’ – if loop building fails, the direction of loop building is reversed – for example, if the structure is being grown from the N-anchor but does not pass through the filters, the loop is discarded and the next loop is grown from the C-anchor. This should mean that the favoured loop closure direction naturally dominates.

The different geometric filters are as follows:

  1. A grid clash filter, which checks for clashes between the loop residues and the rest of the protein structure
  2. A loop clash filter, which checks for internal clashes between loop residues
  3. An adaptive Ramachandran filter, which restrains the dihedral angles to the allowed regions of the Ramachandran plot.

The Ramachandran filter is a good idea, since loop closure can change the dihedral angles of a structure significantly, moving them into disallowed regions. φ (phi) angles are restricted to the range between -175˚ and -40˚, and ψ (psi) angles are restricted to between -60˚ and 175˚ – this is basically the top-left part of the Ramachandran plot. There are two exceptions: the φ angle of proline is fixed, and the dihedral angles of glycine residues are not restricted at all. When placed inside the loop closure routine, the filter is ‘adaptive’ – if the calculated optimum angle is outside the allowed region, the filter calculates the maximum possible rotation that would still be allowed. When these angle changes become too small, however, the restriction is removed entirely and the angle is allowed to change freely.
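
A minimal sketch of how such an adaptive filter could behave, using the ranges quoted above (illustrative assumptions only, not the published implementation; angle wrap-around is ignored):

```python
import numpy as np

# Allowed ranges quoted above (degrees); glycine is unrestricted and the
# phi angle of proline is kept fixed.
PHI_RANGE = (-175.0, -40.0)
PSI_RANGE = (-60.0, 175.0)

def adaptive_rama_filter(residue, angle_name, current, proposed, min_step=1.0):
    """Return the dihedral angle to apply after the adaptive filter.

    If the proposed angle falls outside the allowed region it is clamped to
    the nearest boundary (the 'maximum possible rotation'); if the resulting
    move would be smaller than `min_step` degrees, the restriction is lifted
    and the proposed angle is used unchanged.
    """
    if residue == "GLY":
        return proposed                      # glycine: no restriction
    if residue == "PRO" and angle_name == "phi":
        return current                       # proline phi stays fixed
    low, high = PHI_RANGE if angle_name == "phi" else PSI_RANGE
    clamped = float(np.clip(proposed, low, high))
    if abs(clamped - current) < min_step:    # allowed move has become tiny...
        return proposed                      # ...so remove the restriction
    return clamped
```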

By testing different combinations of filters in different places, the authors decided upon a final RCD algorithm. This version includes the grid clash filter during loop closure, and the Ramachandran filter applied both before and during loop closure. They then compare their method to some other loop closure algorithms – their method produces good results, outperforming all except a method called ‘direct tweak’ – the only other method tested that includes clash detection during loop closure. From this, the authors conclude that this is a key factor in generating accurate loop conformations. They also report that RCD is 6 to 17 times faster than direct tweak.

Overall, then, the authors of this paper have introduced an accurate and fast loop closure algorithm which outperforms most other methods. Currently, my research is focussed upon developing a new antibody-specific ab initio loop modelling method, and some of the concepts used in this paper would definitely be worth investigating further. Watch this space!

Kinetic Modelling of Co-translational Protein Folding (Journal Club)

Following up on last week’s entry, this post will explore the same topic: polypeptide chains assuming native-like conformations as they are extruded from the ribosome, or for the less intimate with the concept, co-translational protein folding.

Before addressing some important questions concerning co-translational protein folding, I would like to make a parenthesis: I want to dedicate a paragraph or two to talk about time.

Biological processes are dynamic. They are events that occur over a period of time. For instance, one can quantify the effect of mutations propagated and accumulated over millions of years of evolution. One can also quantify the femtoseconds in which subtle conformational changes occur in photoreceptor proteins like rhodopsin, when they respond to light. Time is fundamental to understand and model any sort of biological event.

Although it might seem obvious to the reader that time is crucial to amassing biological knowledge, those of us who are more theoretically inclined (bioinformaticians, computational biologists, biostatisticians, mathematical biologists and so on) are usually presented with models that tend to over-simplify reality. Surprisingly enough, there are many over-simplistic models that neglect the effect of time in order to “better” represent whatever they claim to model. Take Protein Docking, for instance. The biological process at hand presents complicated dynamics. There is a kinetic equilibrium, in which a vast number of protein and ligand molecules interact, associating into complexes and dissociating. Nonetheless, Protein Docking is traditionally reduced to the binding affinity between a pair of molecules. As one might say, this is only a problem if I can present a solution… Luckily, Protein Docking is not my subject of expertise, so I will leave this question open to more tenacious minds than my own.

One of the areas in which I am truly interested is the co-translational aspect of protein folding. If one performs a quick Google Images search using the terms “Protein Synthesis” or “Protein Translation”, the results tell a very interesting story. The vast majority of nascent protein chains are represented as fully elongated peptide chains. In a majority of pictures, the growing peptides do not even present secondary structure. They are mostly represented as long, unfolded, almost linear polymers.

Now, any first year Biochemistry student learns about something called Hydrophobicity (or hydrophilicity depending on whether you are a glass half empty or half full type of person). It is biochemistry-introductory-text-book stuff that some residues are polar and some residues are apolar, and hence will hide from water, forming a hydrophobic core. That (hydrophobicity) is one of the main driving forces of  protein folding.

Hence, most of the images that appear in our Google Images search are not very representative. They are plain wrong. It is simple physics that the growing peptide chains will form secondary and tertiary structures during the process of protein synthesis. One has to remember that this process is dynamic: it is happening over time. Under these circumstances, time should not be neglected. The time scale at which extrusion occurs is slow enough to allow the nascent chain to probe conformations and simply abide by the laws of physics. A fully elongated, completely unfolded and denatured peptide chain would not exist during protein synthesis. These nascent chains adopt intermediate conformations simply as a result of apolar residues trying to hide from water.

Ok. Now, the BIG question that can be raised is whether those intermediate conformations actually resemble the native state of the fully elongated protein. I do not want to incur in Baby Kicking, but one thing that evolution has taught us is that cells have evolved to be highly efficient systems. There is no room for wasted energy. It makes sense to hypothesize that over millions of years, the cellular machinery has adapted to explore these intermediate conformations in order to make the process of protein folding more efficient.

Over the past couple of years, substantial evidence has been amassed that codon usage and the degeneracy of the genetic code could be exploited by cells to ensure that protein folding occurs accurately and efficiently. There are many theoretical ways that such exploitation could occur: the codon translation speed could facilitate the formation of certain intermediates that are beneficial for protein folding, that increase stability or that prevent protein aggregation. There is even a biomedical impact given that some observed pathologies have been associated with synonymous codon mutations that may lead to misfolded proteins.

In the paper I presented during this journal club [1], O’Brien and colleagues have devised and described a very interesting kinetic model for protein translation. Their model was used to describe possible scenarios in which both fast and slow translation speed codons are coordinators of co-translational protein folding. Please note that, in this context, co-translational protein folding is perceived as an enrichment of intermediate conformations of  the nascent chains, which resemble the native structure of the fully elongated protein.

In the model described in the paper, they opted for a probabilistic approach instead of an analytical (differential equation) approach, with time modelled through probabilities. The authors derive a formula to quantify the expected proportion of nascent chains of a given length that are in a folded intermediate state (one that resembles the native structure), and manage to express this in terms of the codon translation rates. They therefore establish a direct relationship between co-translational protein folding and codon translation speed.
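
To give a feel for how translation speed can couple to folding, here is a toy two-state kinetic sketch (emphatically not the formula derived by O’Brien and colleagues): between successive codon additions the nascent chain relaxes towards its equilibrium folded fraction for a dwell time set by the codon’s translation rate, so slower codons leave more time for a folding-competent intermediate to form. All rates are hypothetical inputs.

```python
import numpy as np

def folded_fraction_profile(k_trans, k_fold, k_unfold):
    """Toy two-state model of co-translational folding.

    k_trans  : translation rate (1/s) of each codon, in order of translation.
    k_fold   : folding rate of the nascent-chain intermediate at each length.
    k_unfold : corresponding unfolding rate.
    Between the addition of codon i and codon i+1 the nascent chain relaxes
    towards its equilibrium folded fraction with rate (k_fold + k_unfold)
    for a dwell time of 1/k_trans[i]; slower codons therefore give the
    intermediate more time to populate its folded state.
    """
    p_folded = np.zeros(len(k_trans) + 1)            # folded fraction after each codon
    for i, kt in enumerate(k_trans):
        k_eq = k_fold[i] / (k_fold[i] + k_unfold[i]) # equilibrium folded fraction
        relax = k_fold[i] + k_unfold[i]              # relaxation rate at this length
        dwell = 1.0 / kt                             # time before the next codon arrives
        p_folded[i + 1] = k_eq + (p_folded[i] - k_eq) * np.exp(-relax * dwell)
    return p_folded
```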

Their analysis is robust as none of the constants and kinetic rates need to be experimentally derived in order to provide insights about the protein folding process. Overall, I think the way the model was built was quite ingenious and very interesting. I would suggest any interested reader to read the article if they want to understand how the whole modelling was carried out.

Overall, I think the authors present a compelling argument for how cells could exploit codon degeneracy and co-translational aspects of protein folding to improve folding efficiency. One of their results presents a scenario in which fast-translating codons can be used to assist the folding of unstable protein regions, preventing the formation of misfolded intermediates.

One of the many functions of mathematical models is to provide insights into the underlying biology of the phenomena they attempt to model. The lack of any experimental evidence to support this paper’s results does not make it any less interesting. The article presents to the readers a sound and solid mathematical argument as to how co-translational aspects of protein folding could be beneficial for cell efficiency. If anything, they provide interesting hypotheses that might drive experimentalists in the future.

[1] Kinetic modelling indicates that fast-translating codons can coordinate cotranslational protein folding by avoiding misfolded intermediates.

Stepwise Assembly For Protein Loop Prediction

INTRODUCTION:

Loop modeling is used frequently in designing the structure of new proteins or refining protein structures with limited X-ray and NMR data. In 2011, Sripakdeevong et al. introduced a hypothesis called the “stepwise ansatz” for modeling RNA loops with atomic accuracy. They argued that current knowledge-based RNA loop predictors, which aim at predicting loops with atomic accuracy, fail to sample models within 1.5 Å RMSD of the native structures. The bottleneck in these methods is inefficient sampling of the conformational space. To overcome this limitation, Sripakdeevong et al. introduced an ab initio (de novo) build-up strategy to allow high-resolution sampling of loops, instead of restricting the search space to available fragments. But with current computational power, exhaustive enumeration of loops of length N>1 at atomic resolution is impossible. For N=1, considering all the degrees of freedom of a nucleotide results in about one million conformations; performing Rosetta energy minimization on these models needs about one hour of CPU time, which is computationally reasonable. Each time a new nucleotide is added the size of the conformational space is multiplied, growing exponentially with loop length (for N=5 the computational time would be ~10^23 CPU years).

Since enumeration of a one-nucleotide loop is possible, an entire loop can be modeled by stepwise enumerative building of one nucleotide at a time onto low-energy conformations which are well packed and hydrogen bonded. Their implementation of the stepwise assembly (SWA) protocol in a dynamic-programming-like recursion therefore enables sampling of 12-residue loops in achievable CPU time. SWA, having been successful in the prediction of RNA loops, was first used by Rhiju Das to predict protein loops with atomic accuracy. Loop regions in protein structures have particular characteristics compared to regions of regular secondary structure: loops have a similar number of hydrogen bonds (on average 1.1 per residue), mainly contain polar/charged side chains, and have less contact with the non-polar core of the protein. Current loop modeling methods with atomic resolution start off with a reduced representation of the protein, with simplified or no side chains. Although coarse-graining of proteins helps reduce the large number of local minima, it fails to capture non-polar and hydrogen-bond interactions involving side chains. Therefore, SWA is used to build up a loop at atomic resolution by sampling the part of conformational space that is energetically favourable and also computationally accessible.

SWA PIPELINE:

SWA is implemented in C++ within the Rosetta framework. SWA uses a dynamic programming matrix (an example is shown below in Figure 1D for a 6-residue loop) to allow de novo build-up of loops from residue k to l. To achieve this, at each step SWA adds a loop residue, building forward from the N-terminus (from residue k-1 to i) and backward from the C-terminus (from l+1 to j). Each circle in Figure 1D therefore represents an (i,j) stage. SWA consists of 5 main steps:

  1. Pre-packing the starting point: To start, all atoms of the loop region are removed from the model and side chains are added to the rest of the model. This stage (k-1, l+1) is shown as a green circle in Figure 1D. Side chains are added and their torsions are minimized. Note that the non-loop backbone is kept fixed at all stages of SWA.
  2. Adding one loop residue at the N-terminal side: This stage is shown by orange arrows (moving downward) in Figure 1D. To generate possible conformations after adding the loop residue, the backbone torsion angles (Φ,Ψ) of the added residue and of the backbone residue before it are sampled (Figure 1A). Φ,Ψ combinations which do not correspond to allowed regions of the Ramachandran plot are discarded. This procedure can generate tens to thousands of conformations. For all the generated models, side chains are added to the sampled residues (i and i-1) and these side chains, along with the potentially neighbouring side chains, are optimized. Afterwards, clustering is performed: models are ranked in order of energy, and if a higher-energy model has a backbone RMSD over residues i and i-1 of <0.10 Å to a lower-energy model, the higher-energy model is removed (otherwise it is kept as a seed for a new cluster). After clustering, the top 400 models are selected for all-atom energy minimization of the sampled residues’ backbone torsions and their neighbouring side chains. A final round of clustering is then performed on these models as described above.
  3. Adding one loop residue at the C-terminal side: This stage is shown by pink arrows (moving left) in Figure 1D. It is similar to step 2, except that residues j and j+1 are considered for backbone sampling (Figure 1B), side-chain packing, clustering, torsional minimization and final clustering.
  4. Closing loop chains: All models in which the gap between the C-terminal and N-terminal segments is 1, 2 or 3 residues are subjected to chain closure. To generate closed loops, residue i+1 is added to the N-terminal segment and residues i+2 and j-1 to the C-terminal segment. For i+1, the Φ and Ψ torsions are sampled by grid search as described above, while the backbones of i+2 and j-1 undergo cyclic coordinate descent (CCD), which changes the Φ and Ψ torsions of i+2 and j-1 until the gap is closed. Models with a chain-closure error of < 0.01 Å are then subjected to side-chain optimization, clustering and torsional minimization; this differs from the procedure above in that all loop side chains and all loop backbone atoms are affected by the minimization. This stage is shown by blue arrows in Figure 1D, just for gap lengths of one. In addition to this procedure for loop closure, all models were also closed by adding the last gap residue by grid search and closing the loop by CCD on the preceding residue. Models created by sampling only the C-terminal or only the N-terminal side are also combined with CCD loop closure to create full-length models.
  5. Clustering: At each stage 400 models are generated, and the next stage uses these models to generate new conformations, resulting in thousands of models. Several paths can also be used to reach a specific stage, adding to the number of generated models. Therefore, since SWA works only on low-energy models, only the 4000 lowest-energy models are kept for further clustering. Clustering is similar to the procedure above, but with an RMSD cutoff of 0.25 Å applied over the entire loop segment built up to that stage (a minimal sketch of this energy-ranked clustering is given after this list). The 400 lowest-energy seeds are then carried forward to the next stage. At the loop-closure stage, when whole loops have been modeled, clustering is used again with an RMSD cutoff of 1.0 Å, and the five lowest-energy models are taken as the SWA prediction.
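
A minimal sketch of the energy-ranked RMSD clustering referred to above (illustrative only; `rmsd_fn` is a hypothetical helper that computes the RMSD over the built-up loop segment):

```python
import numpy as np

def energy_rmsd_cluster(models, energies, rmsd_fn, rmsd_cutoff=0.25, keep=400):
    """Energy-ranked RMSD clustering between SWA build-up stages.

    models   : list of model objects (e.g. coordinate arrays).
    energies : matching list of Rosetta energies.
    rmsd_fn  : callable(model_a, model_b) -> RMSD over the built-up segment
               (a hypothetical helper supplied by the caller).
    Models are visited from lowest to highest energy; a model is discarded if
    it lies within `rmsd_cutoff` of an already-kept, lower-energy seed, and
    otherwise starts a new cluster. The `keep` lowest-energy seeds are carried
    forward to the next stage.
    """
    seeds = []
    for idx in np.argsort(energies):
        if all(rmsd_fn(models[idx], models[s]) >= rmsd_cutoff for s in seeds):
            seeds.append(idx)
        if len(seeds) == keep:
            break
    return seeds                                     # indices of the cluster seeds
```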

Figure 1: Stepwise Assembly Schematic Representation

For short loops (6 to 9 residues long), it was shown that solutions can be found just by creating models from the N-terminus onward and, separately, from the C-terminus backward, and joining them by loop closure (or, simply, by moving just along the first column and first row of the dynamic programming matrix). Figure 1E shows a directed acyclic graph (DAG) of this procedure. The advantage is that in these cases the computational time reduces to O(N) instead of O(N^2), so this procedure is tested first. If the difference between the lowest-energy model and the second lowest is less than 1 kBT (a Rosetta energy unit is approximately 1 kBT), we can argue that modeling has not converged toward one model and the whole O(N^2) calculation should take place (except for loops of length 24).

RESULTS:

A difficult case study:

Predicting the 1oyc loop (residues 203-214) has always been a challenge for loop predictors, since most of its side chains are polar/charged and hydrogen bonds play an important role in stabilising the loop. None of these factors are captured by ab initio predictors with a coarse-grained representation. Figure 2 of the paper displays the SWA build-up procedure for the 1oyc loop. The final best model (Figure 2, I), with the lowest energy, has a Cα RMSD of 0.39 Å to the native structure. Following the build-up path of 1oyc shows that the intermediate steps which lead to this final best model were not always the lowest in energy; it is therefore important to keep all the possible intermediate conformations. Importantly, different search paths allow sampling of totally diverse configurations. For example, in Figure 2 (below), five different configurations of the 1oyc loop with comparably low energy, generated by different build-up paths, are shown. Two totally different paths (blue and brown) may result in similar configurations, while reasonably similar paths (pink, green and orange) can result in substantially different loop models.

Figure 2: prediction of SWA for 1oyc loop. Five different configurations with comparable low energy are shown.

SWA on 35 loop test set:

SWA was tested on a data set of 35 protein loops: 20 of them allowed comparison with PLOP and Rosetta KIC, and 15 were difficult cases with loops ranging from 8 to 24 residues. Comparing the medians of the RMSDs of the lowest-energy models (Table S1 of the paper) shows that SWA achieves better-quality models (0.63 Å) in the same computational time as PLOP and Rosetta KIC. For the 15 difficult cases, SWA’s performance drops to a median RMSD of 1.4 Å for the lowest-energy models. The highlight of SWA, however, is the prediction of 24-residue-long loops, where it achieves sub-angstrom accuracy. Since SWA uses the O(N) strategy to solve this problem, it requires less computational time than Rosetta KIC.

In total, considering the best of five models, SWA produces sub-angstrom accuracy for 27 of the 35 cases. But looking at the five best models of these 27 cases shows that the best RMSD does not correspond to the lowest-energy model. Also, in some cases Rosetta KIC produces models with better RMSD while being energetically worse than SWA. This suggests that the Rosetta energy function requires improvement, especially in its solvent model (where it fails the most).

SWA and blind test:

  • SWA was used to predict 4 loops of a protein whose native structure had not yet been released. SWA started with a model in which the loop regions were removed and achieved sub-angstrom accuracy (Rosetta KIC achieved this for 3 of the 4 cases).
  • SWA’s loop-prediction accuracy of 0.53 Å for an RNA/protein target built on a comparative model (instead of an X-ray structure) demonstrates its ability to handle complex prediction problems.

DISCUSSION:

The SWA method has been successful in predicting protein loops with sub-angstrom accuracy. Of particular significance are the prediction of the RNA/protein model target and of loops 24 residues long. Although it provides atomic-accuracy predictions, SWA requires 5000 CPU hours for 12-residue loops (which is achievable with current processing power), while Monte Carlo and refinement-based methods can predict loops in hundreds of CPU hours. SWA’s computational time could be improved by minimising several factors in the build-up pathway and by the use of experimental constraints.

The SWA method can be used to guide and assist ab initio prediction of protein structures and protein folding. It may also have applications in other ab initio modeling problems, such as fitting high-resolution protein structures into low-resolution electron density maps, or predicting NMR structures using sparse chemical shift data. In addition, the stepwise ansatz offers solutions to the design of protein interfaces, which requires simultaneous optimization of side-chain conformation, side-chain identity and backbone.

CAPRI, pyDock and the state of protein-protein docking.

Intro

We have discussed a paper highlighting the latest progress in the field of protein-protein docking. It covers the results of the pyDock group’s participation in CAPRI. The latest round of the experiment is particularly interesting, as it challenged the participants with affinity prediction, which goes one step beyond modeling protein-protein complexes.

Role of the group in this CAPRI edition

For those that are not very familiar with CAPRI, one can participate in any of the following roles:

  • Predictors: Participants are given the unbound structures or sequences of one or both interacting partners. Predictors are then supposed to submit a ranked list of modeled complexes which can be achieved by any docking/prediction tools and some human intervention. This tests the current ad hoc capacity to model protein-protein complexes.
  • Servers: Participants are given the unbound structures or sequences of one or both interacting partners, which are submitted to an automatic server that must return its results (ideally) within 24 hours. This tests the automated capacity to model protein-protein complexes.
  • Scorers: The community contributes sets of poses (unranked) for each target. The scorers are then supposed to re-rank the poses. This is a form of cross-validation of methods since one could expect that the group that samples the poses also performs better scoring. This exercise is supposed to quantify this.

The pyDock group participates in the capacity of Predictors and Scorers. Below we first outline the general docking methodology, followed by the particular implementation of the pyDock group.

General Docking Methodology

Protein-protein docking starts with two interacting structures as input: a ligand (L) and a receptor (R) (if a structure is unavailable, a model is created). Docking techniques can be broadly divided into two types: template-based and template-free. In template-based docking, one needs either a complex of homologs or one that is sufficiently structurally similar. However, since it is estimated that the complexes in the PDB cover only ~4% of the presumed total, this approach is not applicable in a great number of cases.

Template-free docking does not rely on pre-solved complexes and it is the main focus of this post. Standard modus-operandi of a template-free protein-protein docker is the following:

  1. Sample N poses.
  2. Re-score top D sampled poses.
  3. Refine the top K poses.

In the first step, possible poses of the ligand with respect to the receptor are sampled. This is very often achieved by FFT-based search (ZDOCK) or more elaborate methods such as geometric hashing/pose clustering (PatchDock). The possible orientations are ranked by a simplified energy function or a statistical potential. In most cases this step is performed in rigid-body mode (intra-molecular distances are not allowed to change) for computational efficiency. The number of sampled poses is usually in the thousands (N ~ 2k-10k).

The re-scoring step uses a more elaborate energy function or statistical potential to identify more close-to-native poses. Notable examples include pyDock and ZRANK. Since such functions are usually more computationally expensive (for instance, pyDock calculates a Lennard-Jones potential, Coulombic electrostatics and desolvation terms), it is more tractable to apply them to a smaller set (D << N) of top poses returned by the rigid-body sampler. Additionally, the final poses may be pruned for redundancy by removing structures which are very similar to each other. One method in particular (ClusPro) owes its success to ranking predictions by the number of similar decoys per cluster among the top predictions.
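
For illustration, here is a heavily simplified rigid-body re-scoring sketch in the spirit of this step: it sums Coulombic electrostatics (with a distance-dependent dielectric) and a single-parameter Lennard-Jones term over all receptor-ligand atom pairs. The desolvation term used by pyDock is omitted, and all parameters and array layouts are illustrative assumptions, not the published pyDock function.

```python
import numpy as np

def rescore_pose(rec_xyz, rec_q, lig_xyz, lig_q, eps=4.0, sigma=3.5, depth=0.1):
    """Score one rigid-body pose with electrostatics plus a Lennard-Jones term.

    rec_xyz, lig_xyz : (n, 3) and (m, 3) atom coordinates in angstroms.
    rec_q, lig_q     : per-atom partial charges in units of e.
    Uses a distance-dependent dielectric (eps * r) for the Coulomb term and a
    single sigma/depth pair for the Lennard-Jones term; the desolvation term
    used by pyDock is omitted and all parameter values are illustrative.
    """
    d = np.linalg.norm(rec_xyz[:, None, :] - lig_xyz[None, :, :], axis=-1)
    d = np.clip(d, 1.0, None)                             # soften singularities at clashes
    elec = 332.0 * np.outer(rec_q, lig_q) / (eps * d * d) # kcal/mol, eps(r) = eps * r
    lj = 4.0 * depth * ((sigma / d) ** 12 - (sigma / d) ** 6)
    return float(elec.sum() + lj.sum())                   # lower is better
```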

The final step is the refinement of the selected top K poses (K << D). Here, flexibility may be introduced to account for conformational changes upon binding. This is typically done with computationally expensive force fields such as AMBER or CHARMM. The coordinates of side chains or even backbones are perturbed, for instance using normal mode calculations, until the energy function reaches a minimum via some flavour of gradient descent.

Fig. 1: Breakdown of results of the pyDock group at the latest CAPRI round. The results are presented for each target, in the predictor as well as scorer capacity.

The pyDock group Methodology and Results

The pyDock group follows the pattern outlined above. As a sampler they employ ZDOCK and FTDOCK, both fast rigid-body Fast Fourier Transform-based methods. They use their pyDock function to score the best models. Additionally, they remove redundancy from the final lists of decoys by removing those entries which are too similar to higher-scoring ones according to ligand RMSD. The final refinement step is carried out using TINKER.

Receptor is shown in white. Best ligand as predictors in red, as scorers in blue.

The pipeline outlined above is carried out for most targets, with some exceptions. For some targets, additional docking poses were generated using SwarmDock (T53, T54, T57 and T58) and RotBUS (T46 and T47). In some cases, CHARMM or AMBER were used instead of TINKER at the refinement stage.

Receptor is shown in white. Best ligand as predictors in red, as scorers in blue.

Such ad hoc approaches were employed for almost every target, and the available information was used to the fullest in order to achieve the best models. For instance, in cases where good homology structures were available (T47), or where it was known which residues are important for the complex, appropriate distance constraints were imposed. The pyDock group achieves better accuracy as predictors than as scorers when it comes to docking (67% correct models submitted against 57%). They follow a trend wherein predictors usually do better than scorers (see Fig. 1). This trend is, however, broken at the stage of predicting the water molecules at the interface of T47: using DOWSER, in the predictor capacity they correctly identify 20% of the contact-mediating water molecules, and 43% as scorers.

Receptor is shown in white. Best ligand as predictors in red, as scorers in blue.

Overall, the docking performance indicates that the field is steadily moving towards the goal of modeling complexes using sequence alone. There were some cases in this round where predictor groups started with sequence only and still produced reasonable models of complexes (T47, T50 and T53). In each of these cases one of the partners was an unbound structure and the other was a sequence. The only case where both partners were sequences did not produce any reasonable models; there, only two groups out of 40 managed to present meaningful solutions.

Receptor is shown in white. Best ligand as predictors in red, as scorers in blue.

Furthermore, this CAPRI round was the first to introduce affinity prediction, in targets T55–T56. The aim was to predict the binding energy changes upon mutation. The predictor designed by the pyDock group achieved good results on this test case, with more in-depth details on the approach found in a related community-wide experiment.

Receptor is shown in white. Best ligand as predictors in red, as scorers in blue.

 

Journal Club: A Novel Insight into Gene Ontology Semantic Similarity

In a past journal club, I presented a paper by Xu et al on their method of quantifying semantic similarity of proteins based on the Gene Ontology. The Gene Ontology (GO) is a directed acyclic graph (DAG, c.f. Figure 1) of terms with which a protein or a gene can be annotated. These terms are designed to describe the process the protein is involved in, what role it fulfils in this process, and where in the cell it is localized. Thus, when comparing the function of two proteins in a quantifiable way, it has become standard to refer back to the GO terms these proteins are annotated with and compare these based on their relationship in the DAG.

Figure 1: Schematic Diagram of a Directed Acyclic Graph (DAG).

As opposed to many methods, which measure the importance of a node (GO-term) in the DAG as its information content given an external database, the method proposed by Xu et al measures semantic similarity independently of external resources, which gives it the appearance of an independent measure. Furthermore, it claims to be good at dealing with badly annotated proteins, which is often a big problem in functional similarity calculations.

The similarity measure is a hybrid between node-based and edge-based methods, and is seemingly inspired by Wang et al’s 2007 paper and Shen et al’s 2010 paper. It is based on what they call “Shortest Semantic Differentiation Distance” or (SSDD), which is calculated over the shortest distance between two GO-terms on the DAG. When comparing the GO-terms A and B, the shortest path is measured by traversing the DAG upwards from node A to the lowest common ancestor of both nodes and down to node B.
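
As a concrete illustration of the path construction, here is a small sketch that finds the length (in edges) of the shortest up-and-down path between two terms in a GO-style DAG; `parents` is a hypothetical mapping from each term to its parent terms, and the real SSDD sums totipotency-derived weights along this path (defined below) rather than simply counting edges.

```python
from collections import deque

def ancestors_with_depth(term, parents):
    """Breadth-first walk up the DAG: term -> minimum number of edges to each ancestor."""
    depths = {term: 0}
    queue = deque([term])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in depths:
                depths[parent] = depths[node] + 1
                queue.append(parent)
    return depths

def shortest_path_length(a, b, parents):
    """Number of edges on the shortest a -> common ancestor -> b path.

    `parents` maps each GO term to the terms it is a child of (a hypothetical
    pre-parsed ontology). Returns None if the terms share no ancestor.
    """
    da, db = ancestors_with_depth(a, parents), ancestors_with_depth(b, parents)
    common = set(da) & set(db)
    return min(da[t] + db[t] for t in common) if common else None
```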

The SSDD calculation over the shortest path is based on their so-called semantic Totipotency values assigned to the terms in the DAG that are part of the shortest path. The semantic Totipotency, T, of a term is calculated by:

Semantic Totipotency Measure

where the weight, ω, is given by:

Weight, ω

Here, Dst(t) denotes the number of descendants of the term t, and tp denotes the parent term of term t. Thus, the T-value of every node is an expression of both the depth of the DAG in this area and its coverage.

Finally, the SSDD is calculated by:

Shortest Semantic Differentiation Distance

And subsequently the similarity of two GO terms is measured by:

[Equation image missing: the similarity of two GO terms derived from their SSDD.]

 

Results

In their paper, Xu et al showed the method to be competitive with other methods which compute protein functional similarity by pairwise GO-term comparisons, while also outperforming a common graph-based method, simUI. While these results look promising, the biological interpretability of such a semantic similarity measure remains difficult.

The strongest advantage of the proposed SSDD method was, however, its alleged invariance to the annotation richness of proteins, which was presented as shown in Figure 2 below (Figure 5 in the paper).

Figure 2: The performance of different methods dealing with sets of proteins with different annotation richness.

The results in this figure show that SSDD exhibits only a slight decrease in Pearson correlation coefficient against a set of reference similarity values for proteins which are less well annotated. This ability to deal with poorly annotated proteins is the true value of the SSDD method proposed by Xu et al. However, this investigation was performed on sets of proteins selected by the authors, and should thus be validated independently to confirm these surprising results.

Journal Club: Ligand placement based on prior structures: the guided ligand-replacement method

Last week I presented a paper by Klei et al. on a new module in the Phenix software suite. This module, entitled Guided Ligand-Replacement (GLR), aims to make it easier to place ligands during the crystallographic model-building process by using homologous models of the ligand-protein complex for the initial placement of the ligand.

In the situation where ligands are being added to a crystallographic protein model, a crystallographer must first build the protein model, identify the difference electron density, and then build the ligand into this density.

The GLR approach is particularly helpful in several cases:

  • In the case of large complex ligands, which have many degrees of freedom, it can take a long time to fit the ligand into the electron density. There may be many different conformations of the ligand that fit the difference electron density to a reasonable degree, and it is the job of the crystallographer to explore these different conformations. They must then identify the true model, or perhaps an ensemble of models in the case where the ligand is mobile or present in different, distinct, binding modes. GLR makes this process easier by using a template from a similar, previously-solved structure. The ligand position and orientation is then transplanted to the new structure to give a starting point for the crystallographer, reducing the tedium of the initial placement of the ligand.
  • In the case of a series of related crystal structures, where the same protein structure is determined a number of times, bound to different (but similar) ligands. This is common in the case of structure based drug-design (SBDD), where a compound is developed and elaborated upon to improve binding affinity and specificity to a particular protein. This process generates a series of crystal structures of the protein, bound to a series of ligands, where the binding modes of the ligands are similar in all of the structures. Therefore, using the position and orientation of the ligand from a structure is a good starting point for the placement of further elaborations of that ligand in subsequent structures.
  • In the case of several copies of the protein in the asymmetric unit cell of the crystal. After one copy of the ligand has been built, it can be quickly populated throughout the unit cell, removing the need for the crystallographer to undertake this menial and tedious task.

Program Description:

The required inputs for GLR are standard, as required by any ligand-fitting algorithm, namely:

  • The APO structure of the protein (the structure of the protein without the ligand)
  • A description of the ligand (whether as a SMILES string, or as a cif file etc)
  • An mtz file containing the experimental diffraction data

Overview of the program:

Fig 1. Program Overview.

> Identification of the reference structure

Firstly, the program must determine the reference structure to be used as a template. This can be specified by the user, or GLR can search a variety of sources to find the best template. The template selection process is outlined below. Reference structures are filtered by the protein sequence identity, similarity of the molecular weights of the ligands, and finally by the similarity of the binary chemical fingerprints of the ligands (as calculated by the Tanimoto coefficient).

Fig 2. Reference Structure selection flow diagram.

Little justification is given for these cutoffs, although it is generally accepted that proteins with above 70% sequence identity are highly structurally similar. The Tanimoto coefficient cutoff of 0.7 presumably serves only to remove the possibility of very low-scoring matches, since if multiple potential reference structures are available, the ligand match with the highest Tanimoto score is used. They do not, however, say how they balance the choice in the final stage, where they take the ligand with the highest Tanimoto score and the best resolution.
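
To make the selection cascade concrete, here is a minimal sketch of the filtering logic as described above. It is not the GLR implementation: the candidate fields are hypothetical, the molecular-weight filter is omitted, and RDKit path fingerprints stand in for the binary chemical fingerprints used in the paper.

```python
from rdkit import Chem, DataStructs

def select_reference(target_smiles, candidates, min_seq_id=0.70, min_tanimoto=0.70):
    """Pick a reference protein-ligand complex for GLR-style placement.

    candidates : list of dicts with keys 'pdb_id', 'seq_id' (sequence identity
                 of the protein to the target, 0-1), 'smiles' (bound ligand)
                 and 'resolution' (in angstroms) - all hypothetical fields.
    Filters by protein sequence identity, then by ligand Tanimoto similarity,
    and finally picks the highest-Tanimoto match, breaking ties by resolution.
    """
    target_fp = Chem.RDKFingerprint(Chem.MolFromSmiles(target_smiles))
    scored = []
    for cand in candidates:
        if cand["seq_id"] < min_seq_id:
            continue                                  # protein not similar enough
        fp = Chem.RDKFingerprint(Chem.MolFromSmiles(cand["smiles"]))
        tanimoto = DataStructs.TanimotoSimilarity(target_fp, fp)
        if tanimoto >= min_tanimoto:
            scored.append((tanimoto, -cand["resolution"], cand["pdb_id"]))
    return max(scored)[2] if scored else None         # best Tanimoto, then best resolution
```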

The method for assigning the binary chemical fingerprints can be found here (small error in link in paper).

> Superposition of Reference and Target structures

Once a reference structure has been selected, GLR uses graph-matching techniques from eLBOW to find the correspondences between atoms in the reference and target ligands. These atomic mappings are used to orient and map the target ligand onto the reference ligand.

Once the reference protein-ligand structure is superposed onto the target protein, these atomic mappings are used to place the target ligand.

The target complex then undergoes a real-space refinement to adjust the newly-placed ligand to the electron density. This allows the parts of the target ligand that differ from the reference ligand to adopt the correct orientation (as they will have been orientated arbitrarily by the graph-matching and superposition algorithms).

> Summary, Problems & Limitations

GLR allows the rapid placement of ligands when a homologous complex is available. This reduces the need for computationally intensive ligand-fitting programs, or for tedious manual building.

For complexes where a homologous complex is available, GLR will be able to quickly provide the crystallographer with a potential placement of the ligand. However, at the moment, GLR does not perform any checks on the validity of the placement. There is no culling of the placed ligands based on their agreement with the electron density, and the decision as to whether to accept the placement is left to the crystallographer.
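
GLR does not do this, but a simple validity check one could imagine is a real-space correlation coefficient (RSCC) between the observed and model density sampled on grid points around the placed ligand. The sketch below uses synthetic density values and an arbitrary cutoff, purely to illustrate the idea.

```python
# Illustrative only: a crude density-agreement check of the kind GLR could add.
import numpy as np

def rscc(obs, calc):
    """Pearson correlation between observed and calculated density samples."""
    obs = np.asarray(obs, dtype=float) - np.mean(obs)
    calc = np.asarray(calc, dtype=float) - np.mean(calc)
    return float(np.sum(obs * calc) / np.sqrt(np.sum(obs**2) * np.sum(calc**2)))

# Hypothetical density values sampled on grid points around the placed ligand
rng = np.random.default_rng(0)
rho_obs = rng.random(500)
rho_calc = 0.8 * rho_obs + 0.2 * rng.random(500)   # a reasonably good placement
if rscc(rho_obs, rho_calc) < 0.7:                  # cutoff chosen for illustration
    print("Flag this placement for manual inspection")
```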

As the authors recognise in the paper, GLR currently removes any overlapping copies of the ligand that it places, keeping only the copy that agrees best with the electron density. This means that GLR is unable to generate multiple conformations of the target ligand. As such, the crystallographer will still need to check whether the proposed orientation of the ligand is the only conformation present, or whether additional models of the ligand must be built.

As it stands, GLR seems to be a useful time-saving tool for crystallographic structure solution. Although it is possible to incorporate the tool into automated pipelines, I feel that it will mainly be used in manual model building, because of the problems above, which require regular checking by the crystallographer.

There are several additions that could be made to overcome the current limits of the program, as identified in the paper. These mainly centre on generating multiple conformations and validating the placed ligands. If implemented, these would make GLR a highly useful module for the solution of protein-ligand complexes, especially as the number of ligand-bound structures in the PDB continues to grow.

Journal Club: Human Germline Antibody Gene Segments Encode Polyspecific Antibodies

This week’s paper by Willis et al. sought to investigate how our limited repertoire of antibody-encoding genes has the ability to recognise an essentially unlimited array of antigens: there is a finite number of V, D and J genes encoding our antibodies, yet the repertoire can recognise a seemingly infinite number of antigens. Simply put, the authors’ notion is that an antibody from the germline (via V(D)J recombination; see the entry by James) is able to adopt multiple conformations, thus allowing it to bind multiple antigens.

Three antibodies derived from the germline gene 5-51*01 bind to very different antigens.

To test this hypothesis, the authors performed a multiple sequence alignment of the amino acid sequences of the mature antibodies against the germline antibody sequence from which they are derived. If a position in even ONE mature antibody differed from the germline sequence, it was identified as a ‘variable’ position and was allowed to change during Rosetta’s multi-state design (MSD) and single-state design (SSD) protocols.
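
As a concrete illustration of this bookkeeping (not the authors' actual code), a few lines of Python are enough to flag the variable positions, assuming the mature and germline sequences are already aligned and gap-free; the sequence fragments below are invented.

```python
# Toy example: flag 'variable' positions where ANY mature antibody differs
# from the germline sequence (sequences assumed aligned and gap-free).
germline = "QVQLVQSGAEVKKPGE"                      # invented aligned fragment
matures = ["QVQLVESGAEVRKPGE",
           "QVQLVQSGAEVKKPGA",
           "QVHLVQSGAEVKKPGE"]

variable = [i for i, g in enumerate(germline)
            if any(m[i] != g for m in matures)]
print(variable)   # these positions would be allowed to change in MSD/SSD
```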

Figure 1 from Willis et al., showing the pipeline: align the mature antibodies (2XWT, 2B1A, 3HMX) to the germline sequence (5-51), identify ‘variable’ positions from the alignment, then allow Rosetta to change those residues during design.

Surprisingly, without any prior knowledge of the germline sequence, MSD yielded a sequence that was closer to the germline sequence, while SSD for each mature antibody retained the mature sequence. In short, this indicates that the germline sequence is a harmonising sequence that can accommodate the conformations of all of the mature antibodies (as shown by MSD), whereas each mature sequence is the lowest-energy amino acid sequence for that particular antibody’s conformation (as shown by SSD).

To further demonstrate that the germline sequence is indeed the more ‘flexible’ sequence, the authors then aligned the mature antibodies and determined the deviation in the ψ-ϕ angles at each of the variable positions used in the Rosetta study. They found that the ψ-ϕ deviation at the positions that reverted to the germline residue was much larger than at the other variable positions along the antibody. In other words, the positions that tend to return to the germline amino acid in MSD show much greater variation in their ψ-ϕ angles, suggesting that these positions are prone to substantial movement.
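
For anyone wanting to reproduce this kind of backbone comparison, here is a minimal sketch (not the authors' pipeline) of how ψ-ϕ angles can be extracted with Biopython; the PDB file name and the chain identifier 'H' are assumptions.

```python
# Illustrative only: extract backbone phi/psi angles from a locally
# downloaded structure (chain ID 'H' for the heavy chain is assumed).
import math
from Bio.PDB import PDBParser, PPBuilder

parser = PDBParser(QUIET=True)
structure = parser.get_structure("2XWT", "2xwt.pdb")

for peptide in PPBuilder().build_peptides(structure[0]["H"]):
    for residue, (phi, psi) in zip(peptide, peptide.get_phi_psi_list()):
        if phi is not None and psi is not None:
            print(residue.get_id()[1],               # residue number
                  round(math.degrees(phi), 1),       # phi in degrees
                  round(math.degrees(psi), 1))       # psi in degrees
```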

In addition to the many results that corroborate the findings mentioned in this entry, it’s neat that the authors took a ‘backwards’ spin on conventional antibody design. Most antibody design regimes aim to find amino acid(s) that give the antibody more ‘rigidity’, and hence mature its affinity, but this paper went against the norm to find the most FLEXIBLE antibody (the most likely germline predecessor*). Effectively, they argue that this type of protocol could be used to extract new antibodies that bind multiple antigens, thus increasing the versatility of antibodies as potential therapeutic agents.

Journal Club: Large-scale analysis of somatic hypermutations

This week I presented a paper by Burkovitz et al. from Bar-Ilan University in Israel. The study investigates the mutations that occur during B-cell maturation and how the propensity for a change to be selected is affected by where in the antibody structure it is located. It nicely combines analysis of both DNA and amino-acid sequence with structural considerations to inform conclusions about how in vivo affinity maturation occurs.

Before being exposed to an antigen, an antibody has a sequence determined by a combination of genes (V and J for the light chain; V, D and J for the heavy chain). Once exposed, B-cells (the cells that produce antibodies) undergo somatic hypermutation (SHM) to optimise the antibody-antigen (ab-ag) interaction. These mutations are commonly thought to be promoted at activation-induced deaminase (AID) hotspots.

The authors’ first finding is that the locations of SHMs do not correlate well with the positions of AID hotspots: the distribution of their distances to the nearest hotspot is not very different from the background distribution. They conclude that, although perhaps a mechanism for promoting mutation, AID hotspots are not a strong factor in determining whether a mutation becomes fixed.
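
A back-of-the-envelope version of this comparison is easy to sketch: measure the distance from each mutated position to the nearest AID hotspot, here taken to be the commonly cited WRC motif (W = A/T, R = A/G). The sequence and mutation positions below are toy values, not data from the paper.

```python
# Toy example: distance from each SHM position to the nearest WRC hotspot.
import re

germline_dna = "ATGCAGCTGGTGCAGTCTGGAGCAGAGGTGAAAAAGCCCGGG"   # invented sequence
mutated_positions = [5, 17, 30]                                # invented SHM sites

# Lookahead regex catches overlapping motifs; +2 gives the position of the C
hotspots = [m.start() + 2 for m in re.finditer(r"(?=[AT][AG]C)", germline_dna)]

for pos in mutated_positions:
    nearest = min(abs(pos - h) for h in hotspots)
    print(f"SHM at {pos}: {nearest} bp from the nearest WRC hotspot")
```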

Motivated to find other determinants of SHM preferences, the study turns to examining structural features and energetics of the molecules. SHMs are found to be more prevalent on the VH domain of an Fv than on the VL. However, when present, the energetic importance of an SHM is not related to the domain it is on. In contrast, the contribution an SHM makes to the binding energy is related to its structural location. As one might expect, SHMs at positions that can make contact with the antigen have more effect than those that do not. Considering their propensity instead of their raw frequency also shows that SHMs are more prevalent in antibody-antigen interfaces than in the rest of the molecule. However, they are also likely to occur in the VH-VL interface, suggesting an importance for this region in fine-tuning the geometry and flexibility of the binding site. The propensity calculation is sketched below.
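
The frequency-versus-propensity distinction is worth making explicit: propensity normalises the number of SHMs observed in a region by the size of that region. A toy calculation (with invented counts, not the paper's numbers) is shown below.

```python
# Toy numbers only: propensity = (fraction of SHMs in a region) divided by
# (fraction of the molecule's residues that the region accounts for).
shm_counts   = {"ab-ag interface": 40, "VH-VL interface": 25, "core": 15, "other surface": 20}
region_sizes = {"ab-ag interface": 30, "VH-VL interface": 25, "core": 60, "other surface": 115}

total_shms = sum(shm_counts.values())
total_residues = sum(region_sizes.values())

for region in shm_counts:
    frequency = shm_counts[region] / total_shms
    propensity = frequency / (region_sizes[region] / total_residues)
    print(f"{region:<16} frequency={frequency:.2f}  propensity={propensity:.2f}")
```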

Figure taken from Burkovitz et al., showing: a) the location of the different structural regions on the Fv; b) the energetic contribution of the SHMs in each region; c) the fraction of SHMs in each region and the regions’ relative sizes; d) the propensity for an SHM to occur in each of the five structural regions.

Perhaps the most interesting result of this study concerns the propensity of SHMs to mutate germline residues to particular amino acids. It is found that, whilst germline amino-acid usage in binding sites is distinct from that in other protein-protein interfaces, the residue profiles of SHMs are less diverged. SHMs therefore act to bring the properties of the ab-ag interaction towards those seen in ordinary protein-protein interactions. This may suggest, as proposed by other studies, that the somatic hypermutation process is similar to the mutation processes observed in evolution. In addition, it is found that five amino acids (asparagine, arginine, serine, threonine and aspartic acid) are the most common substitutions made in SHM. Finally, the positions where SHMs most often have an important effect on binding energy are presented. These positions, together with the amino-acid preferences, provide promising targets for rational antibody design procedures.