Category Archives: Group Meetings

What we discuss during cake at our Tuesday afternoon group meetings

Journal Club: Statistical Quality Indicators for Electron Density Maps

This week I presented Ian Tickle’s 2012 paper “Statistical quality indicators for electron-density maps”. The paper presents new, statistically robust metrics for describing the agreement between an atomic model and the experimentally derived electron density.

Previous metrics such as the Real-Space R-factor (RSR) and the Real-Space Correlation Coefficient (RSCC) (Brändén & Jones, 1991, and others) are popular electron density metrics, and can inform on the quality of an atomic model. However, as Tickle argues, they cannot tell us why a model is good or bad, as they give no indication of either the accuracy or the precision of the electron density.

Accuracy:

Ian Tickle describes accuracy as – “How close are the results on average to the truth (regardless of precision)?” This is more often referred to as ‘error’. The most accurate model is the one that best agrees with the electron density.

Precision:

Precision is described as – “If you were to repeat the experiment, how much would you expect the results to vary (regardless of accuracy)?” This is more often described as ‘uncertainty’. Precision is a property of the crystal and the experiment. It is independent of the model.

A pictographic representation is shown below –

Pictographic representation of accuracy and precision. Taken from Tickle, 2012.

Before the discussion of the new metrics proposed, there are several assumptions that must be made and several influencing factors to be considered.

Assumptions:

  • The electron density, and the phases used to generate it, are accurate. This assumption is reasonable because density-based validation is generally done near to the end of refinement when the model is mostly correct.

Metric usefulness depends critically on:

  • Accurate calculation and representation of the electron density from our atomic model.
  • Accurate scaling of the observed and model density (neither the calculated nor the observed density is on an absolute scale).
  • Accurate determination of the area to be considered for the calculation of the metric. If too large an area is considered, noise and density from other features will influence the metric. Too small an area will not encompass the whole model and its environment.

Calculating the Model Density:

Accurate calculation of the model’s electron density is essential, as the profile of the atoms will of course affect the comparison of the model to the experimental density. Often (as in Jones 1991, and others) a fixed profile is assumed for all atoms. Of course, in reality the profile will depend on atom type, B-factors, data completeness, and resolution limits.

Due to the resolution limits, the electron density from an atom is the convolution of a 3d gaussian and a sphere of constant scattering power (Blundell & Johnson, 1976). The truncated density function for an atom then becomes:

[Equation image missing: the truncated atomic density function.]

Scaling the calculated density:

Fortunately, the required scale factor is already calculated by refinement programs (when computing 2mFo – DFc maps): it is the resolution-dependent factor D.

Visualising the quality of a model:

To demonstrate how the (global) quality of a model can easily be seen, Tickle calculates and normalises difference density maps for a good, and a bad, model. If the model is ‘correct’, then the difference density should be gaussian noise, but if the model is ‘incorrect’, it will be skewed. This can easily be seen in Figure 8 from the paper.

[Figure 8 from the paper: normalised difference density distributions for a good model (a) and a bad model (b).]

A difference density map is calculated, sorted by value and normalised to give a density distribution. For a good model, this should look like (a), where the density function is a gaussian ~ N(0,1). For a bad model, (b), the distribution is skewed.

The main feature of a ‘bad’ model is increased weight in the tails of the distribution. Extra weight on the left-hand side indicates modelled atoms that are not supported by the evidence, and extra weight on the right-hand side indicates under-modelled density.
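
As a concrete illustration (not from the paper), one might normalise a difference density map and inspect the weight in its tails like this; `diff_map` is a hypothetical array of difference density values:

```python
import numpy as np
from scipy import stats

def diff_density_tails(diff_map):
    """Normalise a difference density map and summarise the weight in its tails.

    diff_map : array of difference density (delta-rho) values on the map grid.
    For a 'correct' model the normalised values should look like N(0, 1);
    extra weight beyond -3 sigma flags unjustified model, and beyond +3 sigma
    flags under-modelled density.
    """
    z = np.asarray(diff_map, dtype=float).ravel()
    z = (z - z.mean()) / z.std()                # normalise to zero mean, unit variance
    neg_tail = float(np.mean(z < -3.0))         # fraction of strongly negative points
    pos_tail = float(np.mean(z > 3.0))          # fraction of strongly positive points
    expected = 2.0 * stats.norm.sf(3.0)         # ~0.27% expected for pure Gaussian noise
    return neg_tail, pos_tail, expected
```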

The New Accuracy Metric

Using the ideas from above (that the difference density between a model and the experimental density should be distributed as a gaussian) Tickle goes on to develop metrics for determining the likelihood that a model density and an experimental density differ only by the addition of random noise.

The metric he describes tests a hypothesis – Does the distribution of the difference density reflect that obtained from the propagation of random errors in the experimental data (and phases)?

To do this, statistical tests are developed. First we define the difference density Z-score (ZD)

ZD = Δρ / σ(Δρ)

This quantity is the difference between the calculated electron density and the experimental density (delta rho), divided by the error in the difference density, giving the normal definition of a normalised Z-score.

The difference density (the numerator) has been discussed above, so we now discuss the error in the difference density. Assuming that the experimental data and the phases are ‘correct’, any non-random errors arise only from the model.

That is, errors arising in the experimental data will appear only as random noise, whereas errors in the model will manifest as the signal that we are trying to detect.

To estimate the strength of the noise (from the experimental data and the phases), we look at the bulk-solvent region. Here the atoms are disordered and so should give uniform density; any deviation from uniformity approximates the random noise from the experimental data and the phases.
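
Putting the two previous paragraphs together, a minimal sketch of the ZD calculation might look like this (hypothetical inputs; not Tickle’s implementation):

```python
import numpy as np

def zd_scores(diff_map, solvent_mask):
    """Difference density Z-scores, ZD = delta-rho / sigma(delta-rho).

    diff_map     : difference density values on the map grid.
    solvent_mask : boolean array marking bulk-solvent grid points.
    The noise level sigma is estimated from the bulk-solvent region, where
    the density should be featureless, so its spread approximates the random
    error propagated from the data and phases.
    """
    diff_map = np.asarray(diff_map, dtype=float)
    sigma = diff_map[np.asarray(solvent_mask, dtype=bool)].std()
    return diff_map / sigma
```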

Maximum Z-score analysis

Tickle considers using the maximum ZD of a sample as a test for model accuracy, and discusses merits and failings. In brief, if we were to sample from a difference density distribution, and take only the most significant ZD score, “focusing only on the maximum value inevitably overstates the significance of the results”.

A Chi-Squared test for ZD scores

The solution that Tickle proposes is to allow that all sample values may be significant (rather than just the largest values). He creates a joint probability density function of the absolute sample values (assumed half-normal and iid). This probability density function then becomes a chi-squared distribution.

By calculating the CDF of this chi-squared distribution (a lower regularised gamma function), Tickle is able to obtain p-values for a set of observations.

These p-values can then be converted back to Z-scores, which crystallographers are more comfortable using. As Tickle notes, expressing the metric in terms of Z-scores does not mean that the underlying distribution is normal (here it is clearly a chi-squared).
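
A minimal sketch of this chain of conversions using SciPy (illustrative only; it implements the plain chi-squared test described here, not the tail-selection refinement discussed below):

```python
import numpy as np
from scipy import stats

def equivalent_z(z_values):
    """Combine a sample of ZD values into a single equivalent Z-score.

    Under the null hypothesis the ZD values are i.i.d. standard normal, so
    the sum of their squares follows a chi-squared distribution with n
    degrees of freedom; the upper-tail p-value of that chi-squared is then
    mapped back onto the standard normal scale.
    """
    z = np.asarray(z_values, dtype=float)
    p_value = stats.chi2.sf(np.sum(z ** 2), df=z.size)  # P(chi-squared >= observed)
    return stats.norm.isf(p_value), p_value             # equivalent one-sided Z-score
```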

Problem: Large Samples

The problem with the above is that for large samples, a small number of significant values will be drowned out by noise and the signal may be missed. The failure of the above test in this situation is put down to the choice of null hypothesis. Multiple null hypotheses are needed in order to cover all possibilities.

When distinguishing between multiple hypotheses, we must aim to avoid type II errors wherever possible, whilst attempting to minimise type I errors. We must select the hypothesis “that maximises the probability of obtaining a result less extreme than the one actually observed (…) or equivalently the one that minimises the probability of obtaining a result more extreme than that observed”.

Solution: A new JPDF and CDF

To solve this, Tickle takes a subset of the highest values of the original sample of n, say from i=k to n (the n-k highest scores), and calculates the chi-squared and its associated cumulative probability. We then choose the value of k that gives the highest probability.

However, the cumulative probability of the chi-squared is no longer the regularised gamma function, due to the bias introduced by selecting the largest values. Recalculating the JPDF and integrating analytically and numerically to obtain the CDF, we could arrive at a result. This, however, has the problem of high dimensionality, requiring very accurate Monte Carlo integration (accuracy much better than 0.27% is required, since we are interested in p-values between 0.9973 and 1 – greater than 3 sigma).

Fortunately, an approximation can be made to bring about a practical solution.

Examples of Significant Distributions:

Tickle generates a table which gives several scenarios that will give a significant result, for different extremes of Z-value, and different sample sizes. One particularly key point is

…small samples are statistically less reliable so require a higher proportion of significant data points to achieve the same overall level of significance. Large samples require relatively fewer data points but they must have higher values to overcome the ‘multiple comparisons’ effect, where large values are more likely to occur purely as a result of random error.

Summary

B-factors:

Tickle shows early in the paper that the RSR and RSCC are correlated with the B-factors of the model, whereas RSZD (the real-space difference density Z-score) shows no correlation with the B-factor, as is desired.

RSZD+ & RSZD−:

More useful scores can be generated by scoring the negative and positive values of delta-rho separately. This gives two scores, RSZD+ and RSZD-. RSZD+ gives the significance/prevalence of unexplained density (missing atoms) and RSZD- gives the significance/prevalence of unjustified model/misplaced atoms.

Precision & Reliability:

Although not discussed in as much depth in the paper, Tickle also proposes a metric, RSZO, to account for the precision of the electron density:

[Equation image missing: RSZO, the ratio of the average observed density in a region to its noise level.]

This is clearly independent of the model, and is the signal-to-noise ratio of the average observed density in a specified region. Weak density (or large noise) will lead to a small RSZO, implying that any model placed here should be considered unreliable.
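
A correspondingly small sketch of the metric as described here (hypothetical inputs; illustrative only, not Tickle’s exact definition):

```python
import numpy as np

def rszo(obs_map, region_mask, sigma_noise):
    """Signal-to-noise of the average observed density over a region.

    obs_map     : observed (2mFo-DFc) density values on the map grid.
    region_mask : boolean array selecting the region of interest (e.g. one residue).
    sigma_noise : noise level estimated from the bulk solvent, as above.
    A small value flags weak or noisy density where any model would be unreliable.
    """
    region = np.asarray(obs_map, dtype=float)[np.asarray(region_mask, dtype=bool)]
    return float(region.mean() / sigma_noise)
```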

Journal Club: Native contacts in protein folding

Like your good old headphone cables, strings of amino acids have the potential to fold into a vast number of different conformations given the appropriate conditions. A conservative estimation for the time it would take a 100 residue protein to explore all theoretically possible conformations would exceed the age of the Universe several times. This is obviously not feasible and was pointed out by Levinthal when he published his “How To Fold Graciously” in 1969.

The so-called protein-folding problem has since been under intense study, which has inevitably led to a few theories and models about its nature. Due to the lack of appropriate wet-lab methods to study this phenomenon, theoretical and computational approaches have been key to devising frameworks for formally describing protein folding. One of these goes under the name of the principle of minimum frustration, introduced by Bryngelson and Wolynes in the late 80s (1). It states that evolution has enriched proteins for sequences with the propensity to fold into low-energy structures, while actively selecting against kinetic traps. By avoiding misfolding and non-native contacts, the theory says, a smooth funnel-like energy landscape with a native-state minimum is created that ensures robust and fast folding.

This implies that native contacts, i.e. residues that interact in the fully folded protein, play a major role in the folding process. Gō models (2), named after Nobuhiro Gō, who first proposed this method, are built around this assumption, with the energetic contributions of native interactions acting as the sole driving force in the folding process. While this approach has yielded promising results, many of which were in concordance with experiments, its underlying principles had never been validated in a statistically meaningful way.

A schematic for native-contact-driven protein folding

In 2013 a study by Best, Hummer and Eaton (3) formally addressed this question. By devising a set of statistical quantities aimed at weighing the importance of native and non-native interactions for folding, and applying these to the analysis of several long MD folding simulations, they were able to show a “native-centric mechanism” for small fast-folding proteins.

In a first step it was assessed whether the fraction of native contacts provides a suitable reaction coordinate for the simulated folding events. From their equilibrium simulations, two thresholds of the native-contact fraction were chosen that define the folded and unfolded states (a two-state model is assumed). Overlaying the most visited native-contact fractions during simulation against these thresholds revealed a strong correlation between the two equilibrium probability density maxima and the protein’s fold state. In addition, they showed that native-contact fractions between the unfolded and folded thresholds were indicative of being on a transition path (defined as the “.. regions of the trajectories that cross directly from the unfolded well to the folded well ..”).
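
As a concrete (simplified) illustration, the fraction of native contacts for a single trajectory frame can be computed from a pre-defined native contact list; the paper uses a smooth switching function rather than the hard cutoff used here, and all inputs are hypothetical arrays:

```python
import numpy as np

def fraction_native_contacts(coords, native_pairs, native_dists, tol=1.2):
    """Fraction of native contacts Q for a single trajectory frame.

    coords       : (n_atoms, 3) coordinates for the frame.
    native_pairs : (n_contacts, 2) atom-index pairs that are in contact in
                   the native structure (hypothetical pre-computed list).
    native_dists : the corresponding native-state distances.
    A contact counts as formed if its distance is within `tol` times the
    native distance; the paper uses a smooth switching function instead of
    this hard cutoff.
    """
    coords = np.asarray(coords, dtype=float)
    pairs = np.asarray(native_pairs, dtype=int)
    ref = np.asarray(native_dists, dtype=float)
    d = np.linalg.norm(coords[pairs[:, 0]] - coords[pairs[:, 1]], axis=1)
    return np.mean(d < tol * ref)

# Frames with Q below the 'unfolded' threshold or above the 'folded'
# threshold are assigned to the two wells; frames in between are the
# candidate transition-path segments described above.
```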

A further measure was introduced with the contact lifetime test. The log-ratio of the time a contact spent on a transition path vs the time it existed in the unfolded state was calculated and compared in a heat-map to the native contact map coloured by the number of contacts between residues.

Contact life time test for a selected protein.
Adapted from (3).

Among other things, this result revealed a clear connection between contacts with longer transition-path lifetimes and the number of contacts they make in the native structure.

So what about non-native interactions?

[Figure: Bayesian measure of being on a transition path given a non-native contact.]

One of the measures addressing this question was a Bayesian measure of the probability of being on a transition path given that a non-native contact is formed. In the examples used in this paper, no obvious link was found unless the non-native contacts were close to native ones. Further criteria, such as the complementary quantity (the probability of being on a transition path when a contact is not made), led to similar conclusions.
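
The Bayesian measure itself is just Bayes’ rule applied to per-frame indicators; a minimal sketch (hypothetical boolean arrays, not the authors’ code):

```python
import numpy as np

def p_tp_given_contact(on_tp, contact_formed):
    """P(TP | contact formed), via Bayes' rule over trajectory frames.

    on_tp          : boolean array, True for frames lying on a transition path.
    contact_formed : boolean array, True for frames in which the contact exists.
    Both are hypothetical per-frame indicators of equal length; a value well
    above the overall fraction of transition-path frames would suggest the
    contact is informative about being on a transition path.
    """
    on_tp = np.asarray(on_tp, dtype=bool)
    contact_formed = np.asarray(contact_formed, dtype=bool)
    p_contact_given_tp = contact_formed[on_tp].mean()
    p_tp = on_tp.mean()
    p_contact = contact_formed.mean()
    return p_contact_given_tp * p_tp / p_contact
```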

Interestingly, it was found that the one protein that was influenced by non-native contacts was the designed α3D. Best et al. reasoned that the additional frustration introduced when building a protein with artificially introduced stability led to a shift in helix register, giving rise to this outlier.

When taken together, these results lay a robust foundation for further studies along the same lines. It is too early to accept or reject the presented findings as universal truth, but strong arguments for the native-centric mechanism being a reasonable model in small fast-folding proteins have been made. It would not be far-fetched to think that larger proteins would adhere to similar principles with non-native contacts modulating the landscape, especially when considering individual downhill folding modules.

References:

(1) Bryngelson, J.D. et al., 1995. Funnels, pathways, and the energy landscape of protein folding: a synthesis. Proteins, 21(3), pp.167–95.

(2) Taketomi, H., Ueda, Y. & Gō, N., 1975. Studies on protein folding, unfolding and fluctuations by computer simulation. I. The effect of specific amino acid sequence represented by specific inter-unit interactions. International journal of peptide and protein research, 7(6), pp.445–59.

(3) Best, R.B., Hummer, G. & Eaton, W.A., 2013. Native contacts determine protein folding mechanisms in atomistic simulations. Proceedings of the National Academy of Sciences of the United States of America, 110(44), pp.17874–9.

Journal Club: Random Coordinate Descent

The paper I chose to present at last week’s group meeting was “Random Coordinate Descent with Spinor-Matrices and Geometric Filters for Efficient Loop Closure”, by Pieter Chys and Pablo Chacón.

Loop closure is an important step in the ab initio modelling of protein loops. After a loop is initially built, normally by randomly choosing φ/ψ (phi/psi) dihedral angles from a distribution (Step 1 in the figure below), it is probably not ‘closed’ – i.e. the end of the loop does not meet the rest of the protein structure on the other side of the gap. Waiting for the algorithm to produce closed initial conformations would be horribly inefficient, so it’s much better to have some method of closing the initial loop structures computationally.

The main steps in the ab initio prediction of protein loops.

Loop closure methods can be classified into three different types:

  1. Analytical methods: the exact solution to the loop closure problem is calculated. The difficulty with this approach is that it becomes increasingly complicated the more degrees of freedom (i.e. dihedral angles) you have.
  2. Build-up methods: the loop is built residue-by-residue to construct an approximately closed loop which can then be refined. Basically, the loop is guided to the closed position as it is being built.
  3. Iterative methods: do just what they say on the tin – the loop is closed gradually through a series of iterations.

Of course, science is never simple, and loop closure algorithms often cannot be classified into just one of the above categories. Cyclic coordinate descent (CCD), the method on which the random coordinate descent algorithm introduced in this paper is based, is a mix of the analytical and iterative approaches. The loop is initialised starting from one anchor residue (the anchor residues are the residues either side of the loop). The anchor residue from the other side is then added to the end of the ‘open’ loop structure. This residue is therefore present twice: the ‘fixed’ anchor residue (the true structure) and the ‘mobile’ anchor residue (the one added to the loop structure). Then, starting from the end of the loop that is attached to the rest of the protein, the dihedral angles are changed sequentially to try to minimise the distance between the fixed and mobile anchor residues. The angle change that minimises this distance is calculated analytically. Once the distance is within a particular cut-off value, the loop is considered closed and this becomes the final structure.
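
To make the CCD idea concrete, here is a toy sketch in which a chain of backbone points is closed by repeatedly rotating about one virtual bond at a time, using the closed-form optimal rotation angle (the Canutescu–Dunbrack formulation). It illustrates the principle only, not the spinor-matrix implementation described in the paper; RCD differs mainly in choosing the bond at random rather than sequentially.

```python
import numpy as np

def rotate_about_axis(points, origin, axis, theta):
    """Rotate an (n, 3) array of points about a unit axis through origin
    (Rodrigues' rotation formula)."""
    v = points - origin
    return (origin
            + v * np.cos(theta)
            + np.cross(axis, v) * np.sin(theta)
            + axis * (v @ axis)[:, None] * (1.0 - np.cos(theta)))

def ccd_close(chain, fixed_anchor, n_mobile=3, tol=0.08, max_iter=1000):
    """Toy cyclic coordinate descent loop closure.

    chain        : (n, 3) backbone points; the last `n_mobile` points are the
                   'mobile' copy of the anchor residue.
    fixed_anchor : (n_mobile, 3) coordinates of the 'fixed' anchor residue.
    At each iteration one virtual bond is chosen (sequentially here; RCD
    would pick it at random) and the rotation about that bond which best
    superposes the mobile anchor onto the fixed anchor is found in closed
    form, then applied to all downstream points.
    """
    chain = np.asarray(chain, dtype=float).copy()
    fixed_anchor = np.asarray(fixed_anchor, dtype=float)
    n_bonds = len(chain) - n_mobile - 1
    rmsd = np.inf
    for it in range(max_iter):
        k = it % n_bonds                      # sequential sweep over bonds
        origin = chain[k]
        axis = chain[k + 1] - origin
        axis /= np.linalg.norm(axis)
        a = b = 0.0
        for m, f in zip(chain[-n_mobile:], fixed_anchor):
            r_perp = (m - origin) - ((m - origin) @ axis) * axis
            r_len = np.linalg.norm(r_perp)
            if r_len < 1e-8:
                continue
            r_hat = r_perp / r_len
            s_hat = np.cross(axis, r_hat)
            f_perp = (f - origin) - ((f - origin) @ axis) * axis
            a += r_len * (f_perp @ r_hat)     # coefficient of cos(theta)
            b += r_len * (f_perp @ s_hat)     # coefficient of sin(theta)
        theta = np.arctan2(b, a)              # analytically optimal angle
        chain[k + 1:] = rotate_about_axis(chain[k + 1:], origin, axis, theta)
        rmsd = np.sqrt(np.mean(np.sum((chain[-n_mobile:] - fixed_anchor) ** 2, axis=1)))
        if rmsd < tol:                        # anchor distance below cutoff: closed
            break
    return chain, rmsd
```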

Random coordinate descent (RCD) is based upon CCD, but with a number of alterations and additions:

  1. Instead of iterating through each dihedral angle sequentially along the loop backbone, angles are chosen randomly
  2. A spinor-matrix approach is used – this reduces loop closure times
  3. Geometric filters are applied at various points in the algorithm – before, during or after loop closure.
  4. ‘Switching’ – if loop building fails, the direction of loop building is reversed – for example, if the structure is being grown from the N-anchor but does not pass through the filters, the loop is discarded and the next loop is grown from the C-anchor. This should mean that the favoured loop closure direction naturally dominates.

The different geometric filters are as follows:

  1. A grid clash filter, which checks for clashes between the loop residues and the rest of the protein structure
  2. A loop clash filter, which checks for internal clashes between loop residues
  3. An adaptive Ramachandran filter, which restrains the dihedral angles to the allowed regions of the Ramachandran plot.

The Ramachandran filter is a good idea, since loop closure can change the dihedral angles of a structure significantly, moving them into disallowed regions. φ (phi) angles are restricted to the range between -175˚ and -40˚, and ψ (psi) angles are restricted to between -60˚ and 175˚ – this is basically the top-left part of the Ramachandran plot. There are two exceptions: the φ angle of proline is fixed, and the dihedral angles of glycine residues are not restricted at all. When placed inside the loop closure routine, the filter is ‘adaptive’ – if the calculated optimum angle is outside the allowed region, the filter calculates the maximum possible rotation that would still be allowed. When these angle changes become too small, however, the restriction is removed entirely and the angle is allowed to change freely.
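
A minimal sketch of how such an adaptive filter could behave, using the ranges quoted above (illustrative assumptions only, not the published implementation; angle wrap-around is ignored):

```python
import numpy as np

# Allowed ranges quoted above (degrees); glycine is unrestricted and the
# phi angle of proline is kept fixed.
PHI_RANGE = (-175.0, -40.0)
PSI_RANGE = (-60.0, 175.0)

def adaptive_rama_filter(residue, angle_name, current, proposed, min_step=1.0):
    """Return the dihedral angle to apply after the adaptive filter.

    If the proposed angle falls outside the allowed region it is clamped to
    the nearest boundary (the 'maximum possible rotation'); if the resulting
    move would be smaller than `min_step` degrees, the restriction is lifted
    and the proposed angle is used unchanged.
    """
    if residue == "GLY":
        return proposed                      # glycine: no restriction
    if residue == "PRO" and angle_name == "phi":
        return current                       # proline phi stays fixed
    low, high = PHI_RANGE if angle_name == "phi" else PSI_RANGE
    clamped = float(np.clip(proposed, low, high))
    if abs(clamped - current) < min_step:    # allowed move has become tiny...
        return proposed                      # ...so remove the restriction
    return clamped
```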

By testing different combinations of filters in different places, the authors decided upon a final RCD algorithm. This version includes the grid clash filter during loop closure, and the Ramachandran filter applied both before and during loop closure. They then compare their method to some other loop closure algorithms – their method produces good results, outperforming all except a method called ‘direct tweak’ – the only other method tested that includes clash detection during loop closure. From this, the authors conclude that this is a key factor in generating accurate loop conformations. They also report that RCD is 6 to 17 times faster than direct tweak.

Overall, then, the authors of this paper have introduced an accurate and fast loop closure algorithm which outperforms most other methods. Currently, my research is focussed upon developing a new antibody-specific ab initio loop modelling method, and some of the concepts used in this paper would definitely be worth investigating further. Watch this space!

Kinetic Modelling of Co-translational Protein Folding (Journal Club)

Following up on last week’s entry, this post will explore the same topic: polypeptide chains assuming native-like conformations as they are extruded from the ribosome, or for the less intimate with the concept, co-translational protein folding.

Before addressing some important questions concerning co-translational protein folding, I would like to make a parenthesis: I want to dedicate a paragraph or two to talk about time.

Biological processes are dynamic. They are events that occur over a period of time. For instance, one can quantify the effect of mutations propagated and accumulated over millions of years of evolution. One can also quantify the femtoseconds in which subtle conformational changes occur in photoreceptor proteins like rhodopsin, when they respond to light. Time is fundamental to understand and model any sort of biological event.

Although it might seem obvious to the reader that time is crucial to amassing biological knowledge, those of us who are more theoretically inclined (bioinformaticians, computational biologists, biostatisticians, mathematical biologists and so on) are usually presented with models that tend to over-simplify reality. Surprisingly enough, there are many over-simplistic models that neglect the effect of time in order to “better” represent whatever they claim to model. Take Protein Docking, for instance. The biological process at hand presents complicated dynamics. There is a kinetic equilibrium, in which a vast number of protein and ligand molecules interact, associating into complexes and dissociating. Nonetheless, Protein Docking is traditionally reduced to the binding affinity between a pair of molecules. As one might say, this is only a problem if I can present a solution… Luckily, Protein Docking is not my subject of expertise, so I will leave this question open to more tenacious minds than my own.

One of the areas in which I am truly interested is the co-translational aspect of protein folding. If one performs a quick Google Images search using the terms “Protein Synthesis” or “Protein Translation”, the results tell a very interesting story. The vast majority of nascent protein chains are represented as fully elongated peptide chains. In a majority of pictures, the growing peptides do not even present secondary structure. They are mostly represented as long, unfolded, almost linear polymers.

Now, any first year Biochemistry student learns about something called Hydrophobicity (or hydrophilicity depending on whether you are a glass half empty or half full type of person). It is biochemistry-introductory-text-book stuff that some residues are polar and some residues are apolar, and hence will hide from water, forming a hydrophobic core. That (hydrophobicity) is one of the main driving forces of  protein folding.

Hence, most of the images that appear in our Google Images search are not very representative. They are plain wrong. It is simple physics that the growing peptide chains will form secondary and tertiary structures during the process of protein synthesis. One has to remember that this process is dynamic: it is happening over time. Under these circumstances, time should not be neglected. The time scale at which extrusion occurs is slow enough to allow the nascent chain to probe conformations and simply abide by the laws of physics. A fully elongated, completely unfolded and denatured peptide chain would not exist during protein synthesis. These nascent chains adopt intermediate conformations simply as a result of apolar residues trying to hide from water.

Ok. Now, the BIG question that can be raised is whether those intermediate conformations actually resemble the native state of the fully elongated protein. I do not want to incur in Baby Kicking, but one thing that evolution has taught us is that cells have evolved to be highly efficient systems. There is no room for wasted energy. It makes sense to hypothesize that over millions of years, the cellular machinery has adapted to explore these intermediate conformations in order to make the process of protein folding more efficient.

Over the past couple of years, substantial evidence has been amassed that codon usage and the degeneracy of the genetic code could be exploited by cells to ensure that protein folding occurs accurately and efficiently. There are many theoretical ways that such exploitation could occur: the codon translation speed could facilitate the formation of certain intermediates that are beneficial for protein folding, that increase stability or that prevent protein aggregation. There is even a biomedical impact given that some observed pathologies have been associated with synonymous codon mutations that may lead to misfolded proteins.

In the paper I presented during this journal club [1], O’Brien and colleagues have devised and described a very interesting kinetic model for protein translation. Their model was used to describe possible scenarios in which both fast and slow translation speed codons are coordinators of co-translational protein folding. Please note that, in this context, co-translational protein folding is perceived as an enrichment of intermediate conformations of  the nascent chains, which resemble the native structure of the fully elongated protein.

In the model described in the paper, they opted for a probabilistic approach instead of an analytical (differential equation) approach, with time modelled through probabilities. The authors derive a formula to quantify the expected proportion of nascent chains of a given length that are in a folded intermediate state (one that resembles the native structure), and manage to express this in terms of the codon translation rates. They therefore establish a direct relationship between co-translational protein folding and codon translation speed.
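
To give a feel for how translation speed can couple to folding, here is a toy two-state kinetic sketch (emphatically not the formula derived by O’Brien and colleagues): between successive codon additions the nascent chain relaxes towards its equilibrium folded fraction for a dwell time set by the codon’s translation rate, so slower codons leave more time for a folding-competent intermediate to form. All rates are hypothetical inputs.

```python
import numpy as np

def folded_fraction_profile(k_trans, k_fold, k_unfold):
    """Toy two-state model of co-translational folding.

    k_trans  : translation rate (1/s) of each codon, in order of translation.
    k_fold   : folding rate of the nascent-chain intermediate at each length.
    k_unfold : corresponding unfolding rate.
    Between the addition of codon i and codon i+1 the nascent chain relaxes
    towards its equilibrium folded fraction with rate (k_fold + k_unfold)
    for a dwell time of 1/k_trans[i]; slower codons therefore give the
    intermediate more time to populate its folded state.
    """
    p_folded = np.zeros(len(k_trans) + 1)            # folded fraction after each codon
    for i, kt in enumerate(k_trans):
        k_eq = k_fold[i] / (k_fold[i] + k_unfold[i]) # equilibrium folded fraction
        relax = k_fold[i] + k_unfold[i]              # relaxation rate at this length
        dwell = 1.0 / kt                             # time before the next codon arrives
        p_folded[i + 1] = k_eq + (p_folded[i] - k_eq) * np.exp(-relax * dwell)
    return p_folded
```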

Their analysis is robust as none of the constants and kinetic rates need to be experimentally derived in order to provide insights about the protein folding process. Overall, I think the way the model was built was quite ingenious and very interesting. I would suggest any interested reader to read the article if they want to understand how the whole modelling was carried out.

Overall, I think the authors present a compelling argument for how cells could exploit codon degeneracy and co-translational aspects of protein folding to improve folding efficiency. One of their results presents a scenario in which fast-translating codons can be used to assist the folding of unstable protein regions, preventing the formation of misfolded intermediates.

One of the many functions of mathematical models is to provide insights into the underlying biology of the phenomena they attempt to model. The lack of any experimental evidence to support this paper’s results does not make it any less interesting. The article presents to the readers a sound and solid mathematical argument as to how co-translational aspects of protein folding could be beneficial for cell efficiency. If anything, they provide interesting hypotheses that might drive experimentalists in the future.

[1] Kinetic modelling indicates that fast-translating codons can coordinate cotranslational protein folding by avoiding misfolded intermediates.

Stepwise Assembly For Protein Loop Prediction

INTRODUCTION:

Loop modeling is used frequently in designing the structure of new proteins or refining protein structures with limited X-ray and NMR data. In 2011, Sripakdeevong et al. introduced a hypothesis called the “stepwise ansatz” for modeling RNA loops with atomic accuracy. They argued that current knowledge-based RNA loop predictors, which aim at predicting loops with atomic accuracy, fail to sample models within 1.5 Å RMSD of the native structures. The bottleneck in these methods is inefficient sampling of the conformational space. To overcome this limitation, Sripakdeevong et al. introduced an ab initio (de novo) build-up strategy to allow high-resolution sampling of loops, instead of restricting the search space to available fragments. But with current computational power, exhaustive enumeration of loops of length N>1 at atomic resolution is impossible. For N=1, considering all the degrees of freedom of a nucleotide results in about one million conformations; performing Rosetta energy minimization on these models needs about one hour of CPU time, which is computationally reasonable. Each time a new nucleotide is added the size of the conformational space is multiplied, growing exponentially with loop length (for N=5 the computational time would be ~10^23 CPU years).

Since enumeration of a one-nucleotide loop is possible, an entire loop can be modeled by stepwise enumerative building of one nucleotide at a time onto low-energy conformations which are well packed and hydrogen bonded. Their implementation of the stepwise assembly (SWA) protocol in a dynamic-programming-like recursion therefore enables sampling of 12-residue loops in achievable CPU time. SWA, having been successful in the prediction of RNA loops, was first used by Rhiju Das to predict protein loops with atomic accuracy. Loop regions in protein structures have particular characteristics compared to regions of regular secondary structure: loops have a similar number of hydrogen bonds (on average 1.1 per residue), mainly contain polar/charged side chains, and have less contact with the non-polar core of the protein. Current loop modeling methods with atomic resolution start off with a reduced representation of the protein, with simplified or no side chains. Although coarse-graining of proteins helps reduce the large number of local minima, it fails to capture non-polar and hydrogen-bond interactions involving side chains. Therefore, SWA is used to build up a loop at atomic resolution by sampling the part of conformational space that is energetically favourable and also computationally accessible.

SWA PIPELINE:

SWA is implemented in C++ within the Rosetta framework. SWA uses a dynamic programming matrix (an example is shown below in Figure 1D for a 6-residue loop) to allow de novo build-up of loops from residue k to l. To achieve this, at each step SWA adds a loop residue, building forward from the N-terminus (from residue k-1 to i) and backward from the C-terminus (from l+1 to j). Each circle in Figure 1D therefore represents an (i,j) stage. SWA consists of 5 main steps:

  1. Pre-packing the starting point: To start, all atoms of the loop region are removed from the model and side chains are added to the rest of the model. This stage (k-1, l+1) is shown as a green circle in Figure 1D. Side chains are added and their torsions are minimized. Note that the non-loop backbone is kept fixed at all stages of SWA.
  2. Adding one loop residue at the N-terminal side: This stage is shown by orange arrows (moving downward) in Figure 1D. To generate possible conformations after adding the loop residue, the backbone torsion angles (Φ,Ψ) of the added residue and of the backbone residue before it are sampled (Figure 1A). Φ,Ψ combinations which do not correspond to allowed regions of the Ramachandran plot are discarded. This procedure can generate tens to thousands of conformations. For all the generated models, side chains are added to the sampled residues (i and i-1) and these side chains, along with the potentially neighbouring side chains, are optimized. Afterwards, clustering is performed: models are ranked in order of energy, and if a higher-energy model has a backbone RMSD over residues i and i-1 of <0.10 Å to a lower-energy model, the higher-energy model is removed (otherwise it is kept as a seed for a new cluster). After clustering, the top 400 models are selected for all-atom energy minimization of the sampled residues’ backbone torsions and their neighbouring side chains. A final round of clustering is then performed on these models as described above.
  3. Adding one loop residue at the C-terminal side: This stage is shown by pink arrows (moving left) in Figure 1D. It is similar to step 2, except that residues j and j+1 are considered for backbone sampling (Figure 1B), side-chain packing, clustering, torsional minimization and final clustering.
  4. Closing loop chains: All models in which the gap between the C-terminal and N-terminal segments is 1, 2 or 3 residues are subjected to chain closure. To generate closed loops, residue i+1 is added to the N-terminal segment and residues i+2 and j-1 to the C-terminal segment. For i+1, the Φ and Ψ torsions are sampled by grid search as described above, while the backbones of i+2 and j-1 undergo cyclic coordinate descent (CCD), which changes the Φ and Ψ torsions of i+2 and j-1 until the gap is closed. Models with a chain-closure error of < 0.01 Å are then subjected to side-chain optimization, clustering and torsional minimization; this differs from the procedure above in that all loop side chains and all loop backbone atoms are affected by the minimization. This stage is shown by blue arrows in Figure 1D, just for gap lengths of one. In addition to this procedure for loop closure, all models were also closed by adding the last gap residue by grid search and closing the loop by CCD on the preceding residue. Models created by sampling only the C-terminal or only the N-terminal side are also combined with CCD loop closure to create full-length models.
  5. Clustering: At each stage 400 models are generated, and the next stage uses these models to generate new conformations, resulting in thousands of models. Several paths can also be used to reach a specific stage, adding to the number of generated models. Therefore, since SWA works only on low-energy models, only the 4000 lowest-energy models are kept for further clustering. Clustering is similar to the procedure above, but with an RMSD cutoff of 0.25 Å applied over the entire loop segment built up to that stage (a minimal sketch of this energy-ranked clustering is given after this list). The 400 lowest-energy seeds are then carried forward to the next stage. At the loop-closure stage, when whole loops have been modeled, clustering is used again with an RMSD cutoff of 1.0 Å, and the five lowest-energy models are taken as the SWA prediction.
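
A minimal sketch of the energy-ranked RMSD clustering referred to above (illustrative only; `rmsd_fn` is a hypothetical helper that computes the RMSD over the built-up loop segment):

```python
import numpy as np

def energy_rmsd_cluster(models, energies, rmsd_fn, rmsd_cutoff=0.25, keep=400):
    """Energy-ranked RMSD clustering between SWA build-up stages.

    models   : list of model objects (e.g. coordinate arrays).
    energies : matching list of Rosetta energies.
    rmsd_fn  : callable(model_a, model_b) -> RMSD over the built-up segment
               (a hypothetical helper supplied by the caller).
    Models are visited from lowest to highest energy; a model is discarded if
    it lies within `rmsd_cutoff` of an already-kept, lower-energy seed, and
    otherwise starts a new cluster. The `keep` lowest-energy seeds are carried
    forward to the next stage.
    """
    seeds = []
    for idx in np.argsort(energies):
        if all(rmsd_fn(models[idx], models[s]) >= rmsd_cutoff for s in seeds):
            seeds.append(idx)
        if len(seeds) == keep:
            break
    return seeds                                     # indices of the cluster seeds
```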

Figure 1: Stepwise Assembly Schematic Representation

For short loops (6 to 9 residues long), it was shown that solutions can be found just by creating models from the N-terminus onward and, separately, from the C-terminus backward, and joining them by loop closure (or, simply, by moving just along the first column and first row of the dynamic programming matrix). Figure 1E shows a directed acyclic graph (DAG) of this procedure. The advantage is that in these cases the computational time reduces to O(N) instead of O(N^2), so this procedure is tested first. If the difference between the lowest-energy model and the second lowest is less than 1 kBT (a Rosetta energy unit is approximately 1 kBT), we can argue that modeling has not converged toward one model and the whole O(N^2) calculation should take place (except for loops of length 24).

RESULTS:

A difficult case study:

Predicting the 1oyc loop (residues 203-214) has always been a challenge for loop predictors, since most of its side chains are polar/charged and hydrogen bonds play an important role in stabilising the loop. None of these factors are captured by ab initio predictors with a coarse-grained representation. Figure 2 of the paper displays the SWA build-up procedure for the 1oyc loop. The final best model (Figure 2, I), with the lowest energy, has a Cα RMSD of 0.39 Å to the native structure. Following the build-up path of 1oyc shows that the intermediate steps which lead to this final best model were not always the lowest in energy; it is therefore important to keep all the possible intermediate conformations. Importantly, different search paths allow sampling of totally diverse configurations. For example, in Figure 2 (below), five different configurations of the 1oyc loop with comparably low energy, generated by different build-up paths, are shown. Two totally different paths (blue and brown) may result in similar configurations, while reasonably similar paths (pink, green and orange) can result in substantially different loop models.

Figure 2: prediction of SWA for 1oyc loop. Five different configurations with comparable low energy are shown.

SWA on 35 loop test set:

SWA was tested on a data set of 35 protein loops: 20 of them allowed comparison with PLOP and Rosetta KIC, and 15 were difficult cases with loops ranging from 8 to 24 residues. Comparing the medians of the RMSDs of the lowest-energy models (Table S1 of the paper) shows that SWA achieves better-quality models (0.63 Å) in the same computational time as PLOP and Rosetta KIC. For the 15 difficult cases, SWA’s performance drops to a median RMSD of 1.4 Å for the lowest-energy models. The highlight of SWA, however, is the prediction of 24-residue-long loops, where it achieves sub-angstrom accuracy. Since SWA uses the O(N) strategy to solve this problem, it requires less computational time than Rosetta KIC.

In total, considering the best of five models, SWA produces sub-angstrom accuracy for 27 of the 35 cases. But looking at the five best models of these 27 cases shows that the best RMSD does not correspond to the lowest-energy model. Also, in some cases Rosetta KIC produces models with better RMSD while being energetically worse than SWA. This suggests that the Rosetta energy function requires improvement, especially in its solvent model (where it fails the most).

SWA and blind test:

  • SWA was used to predict 4 loops of a protein whose native structure had not yet been released. SWA started with a model in which the loop regions were removed and achieved sub-angstrom accuracy (Rosetta KIC achieved this for 3 of the 4 cases).
  • SWA’s loop-prediction accuracy of 0.53 Å for an RNA/protein target built on a comparative model (instead of an X-ray structure) demonstrates its ability to handle complex prediction problems.

DISCUSSION:

The SWA method has been successful in predicting protein loops with sub-angstrom accuracy. Of particular significance are the prediction of the RNA/protein model target and of loops 24 residues long. Although it provides atomic-accuracy predictions, SWA requires 5000 CPU hours for 12-residue loops (which is achievable with current processing power), while Monte Carlo and refinement-based methods can predict loops in hundreds of CPU hours. SWA’s computational time could be improved by minimising several factors in the build-up pathway and by the use of experimental constraints.

The SWA method can be used to guide and assist ab initio prediction of protein structures and protein folding. It may also have applications in other ab initio modeling problems, such as fitting high-resolution protein structures into low-resolution electron density maps, or predicting NMR structures using sparse chemical shift data. In addition, the stepwise ansatz offers solutions to the design of protein interfaces, which requires simultaneous optimization of side-chain conformation, side-chain identity and backbone.

CAPRI, pyDock and the state of protein-protein docking.

Intro

We have discussed a paper highlighting the latest progress in the field of protein-protein docking. It covers the results of the pyDock group’s participation in CAPRI. The latest round of the experiment is particularly interesting, as it challenged the participants with affinity prediction, which goes one step beyond modeling protein-protein complexes.

Role of the group in this CAPRI edition

For those that are not very familiar with CAPRI, one can participate in any of the following roles:

  • Predictors: Participants are given the unbound structures or sequences of one or both interacting partners. Predictors are then supposed to submit a ranked list of modeled complexes which can be achieved by any docking/prediction tools and some human intervention. This tests the current ad hoc capacity to model protein-protein complexes.
  • Servers: Participants are given the unbound structures or sequences of one or both interacting partners, which are submitted to an automatic server that must return its results (ideally) within 24 hours. This tests the automated capacity to model protein-protein complexes.
  • Scorers: The community contributes sets of poses (unranked) for each target. The scorers are then supposed to re-rank the poses. This is a form of cross-validation of methods since one could expect that the group that samples the poses also performs better scoring. This exercise is supposed to quantify this.

The pyDock group participates in the capacity of Predictors and Scorers. Below we first outline the general docking methodology, followed by the particular implementation of the pyDock group.

General Docking Methodology

Protein-protein docking starts with two interacting structures as input: a ligand (L) and a receptor (R) (if a structure is unavailable, a model is created). Docking techniques can be broadly divided into two types: template-based and template-free. In template-based docking, one needs either a complex of homologs or one that is sufficiently structurally similar. However, since it is estimated that the complexes in the PDB cover only ~4% of the presumed total, this approach is not applicable in a great number of cases.

Template-free docking does not rely on pre-solved complexes and it is the main focus of this post. Standard modus-operandi of a template-free protein-protein docker is the following:

  1. Sample N poses.
  2. Re-score top D sampled poses.
  3. Refine the top K poses.

In the first step, possible poses of the ligand with respect to the receptor are sampled. This is very often achieved by FFT-based search (ZDOCK) or more elaborate methods such as geometric hashing/pose clustering (PatchDock). The possible orientations are ranked by a simplified energy function or a statistical potential. In most cases this step is performed in rigid-body mode (intra-molecular distances are not allowed to change) for computational efficiency. The number of sampled poses is usually in the thousands (N ~ 2k-10k).

The re-scoring step uses a more elaborate energy function or statistical potential to identify more close-to-native poses. Notable examples include pyDock and ZRANK. Since such functions are usually more computationally expensive (for instance, pyDock calculates a Lennard-Jones potential, Coulombic electrostatics and desolvation terms), it is more tractable to apply them to a smaller set (D << N) of top poses returned by the rigid-body sampler. Additionally, the final poses may be pruned for redundancy by removing structures which are very similar to each other. One method in particular (ClusPro) owes its success to ranking predictions by the number of similar decoys per cluster among the top predictions.
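
For illustration, here is a heavily simplified rigid-body re-scoring sketch in the spirit of this step: it sums Coulombic electrostatics (with a distance-dependent dielectric) and a single-parameter Lennard-Jones term over all receptor-ligand atom pairs. The desolvation term used by pyDock is omitted, and all parameters and array layouts are illustrative assumptions, not the published pyDock function.

```python
import numpy as np

def rescore_pose(rec_xyz, rec_q, lig_xyz, lig_q, eps=4.0, sigma=3.5, depth=0.1):
    """Score one rigid-body pose with electrostatics plus a Lennard-Jones term.

    rec_xyz, lig_xyz : (n, 3) and (m, 3) atom coordinates in angstroms.
    rec_q, lig_q     : per-atom partial charges in units of e.
    Uses a distance-dependent dielectric (eps * r) for the Coulomb term and a
    single sigma/depth pair for the Lennard-Jones term; the desolvation term
    used by pyDock is omitted and all parameter values are illustrative.
    """
    d = np.linalg.norm(rec_xyz[:, None, :] - lig_xyz[None, :, :], axis=-1)
    d = np.clip(d, 1.0, None)                             # soften singularities at clashes
    elec = 332.0 * np.outer(rec_q, lig_q) / (eps * d * d) # kcal/mol, eps(r) = eps * r
    lj = 4.0 * depth * ((sigma / d) ** 12 - (sigma / d) ** 6)
    return float(elec.sum() + lj.sum())                   # lower is better
```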

The final step is the refinement of the selected top K poses (K << D). Here, flexibility may be introduced to account for conformational changes upon binding. This is typically done with computationally expensive force fields such as AMBER or CHARMM. The coordinates of side chains or even backbones are perturbed, for instance using normal mode calculations, until the energy function reaches a minimum via some flavour of gradient descent.

Fig. 1: Breakdown of results of the pyDock group at the latest CAPRI round. The results are presented for each target, in the predictor as well as scorer capacity.

The pyDock group Methodology and Results

The pyDock group follows the pattern outlined above. As a sampler they employ ZDOCK and FTDOCK, both fast rigid-body Fast Fourier Transform-based methods. They use their pyDock function to score the best models. Additionally, they remove redundancy from the final lists of decoys by removing those entries which are too similar to higher-scoring ones according to ligand RMSD. The final refinement step is carried out using TINKER.

Receptor is shown in white. Best ligand as predictors in red, as scorers in blue.

The pipeline outlined above is carried out for most targets, with some exceptions. For some targets, additional docking poses were generated using SwarmDock (T53, T54, T57 and T58) and RotBUS (T46 and T47). In some cases, CHARMM or AMBER were used instead of TINKER at the refinement stage.

Receptor is shown in white. Best ligand as predictors in red, as scorers in blue.

Such ad hoc approaches were employed for almost every target, and the available information was used to the fullest in order to achieve the best models. For instance, in cases where good homology structures were available (T47), or where it was known which residues are important for the complex, appropriate distance constraints were imposed. The pyDock group achieves better accuracy as predictors than as scorers when it comes to docking (67% correct models submitted against 57%). They follow a trend wherein predictors usually do better than scorers (see Fig. 1). This trend is, however, broken at the stage of predicting the water molecules at the interface of T47: using DOWSER, in the predictor capacity they correctly identify 20% of the contact-mediating water molecules, and 43% as scorers.

Receptor is shown in white. Best ligand as predictors in red, as scorers in blue.

Overall, the docking performance indicates that the field is steadily moving towards the goal of modeling complexes using sequence alone. There were some cases in this round where predictor groups started with sequence only and still produced reasonable models of complexes (T47, T50 and T53). In each of these cases one of the partners was an unbound structure and the other was a sequence. The only case where both partners were sequences did not produce any reasonable models; there, only two groups out of 40 managed to present meaningful solutions.

Receptor is shown in white. Best ligand as predictors in red, as scorers in blue.

Furthermore, this CAPRI round was the first to introduce affinity prediction, in targets T55–T56. The aim was to predict the binding energy changes upon mutation. The predictor designed by the pyDock group achieved good results on this test case, with more in-depth details on the approach found in a related community-wide experiment.

Receptor is shown in white. Best ligand as predictors in red, as scorers in blue.

 

Journal Club: A Novel Insight into Gene Ontology Semantic Similarity

In a past journal club, I presented a paper by Xu et al on their method of quantifying semantic similarity of proteins based on the Gene Ontology. The Gene Ontology (GO) is a directed acyclic graph (DAG, c.f. Figure 1) of terms with which a protein or a gene can be annotated. These terms are designed to describe the process the protein is involved in, what role it fulfils in this process, and where in the cell it is localized. Thus, when comparing the function of two proteins in a quantifiable way, it has become standard to refer back to the GO terms these proteins are annotated with and compare these based on their relationship in the DAG.

Figure 1: Schematic Diagram of a Directed Acyclic Graph (DAG).

As opposed to many methods, which measure the importance of a node (GO-term) in the DAG as its information content given an external database, the method proposed by Xu et al measures semantic similarity independently of external resources, which gives it the appearance of an independent measure. Furthermore, it claims to be good at dealing with badly annotated proteins, which is often a big problem in functional similarity calculations.

The similarity measure is a hybrid between node-based and edge-based methods, and is seemingly inspired by Wang et al’s 2007 paper and Shen et al’s 2010 paper. It is based on what they call “Shortest Semantic Differentiation Distance” or (SSDD), which is calculated over the shortest distance between two GO-terms on the DAG. When comparing the GO-terms A and B, the shortest path is measured by traversing the DAG upwards from node A to the lowest common ancestor of both nodes and down to node B.
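
As a concrete illustration of the path construction, here is a small sketch that finds the length (in edges) of the shortest up-and-down path between two terms in a GO-style DAG; `parents` is a hypothetical mapping from each term to its parent terms, and the real SSDD sums totipotency-derived weights along this path (defined below) rather than simply counting edges.

```python
from collections import deque

def ancestors_with_depth(term, parents):
    """Breadth-first walk up the DAG: term -> minimum number of edges to each ancestor."""
    depths = {term: 0}
    queue = deque([term])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in depths:
                depths[parent] = depths[node] + 1
                queue.append(parent)
    return depths

def shortest_path_length(a, b, parents):
    """Number of edges on the shortest a -> common ancestor -> b path.

    `parents` maps each GO term to the terms it is a child of (a hypothetical
    pre-parsed ontology). Returns None if the terms share no ancestor.
    """
    da, db = ancestors_with_depth(a, parents), ancestors_with_depth(b, parents)
    common = set(da) & set(db)
    return min(da[t] + db[t] for t in common) if common else None
```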

The SSDD calculation over the shortest path is based on their so-called semantic Totipotency values assigned to the terms in the DAG that are part of the shortest path. The semantic Totipotency, T, of a term is calculated by:

Semantic Totipotency Measure

where the weight, ω, is given by:

Weight, ω

Here, Dst(t) denotes the number of descendants of the term t, and tp denotes the parent term of term t. Thus, the T-value of every node is an expression of both the depth of the DAG in this area and its coverage.

Finally, the SSDD is calculated by:

Shortest Semantic Differentiation Distance

And subsequently the similarity of two GO terms is measured by:

[Equation image missing: the similarity of two GO terms derived from their SSDD.]

 

Results

In their paper, Xu et al showed the method to be competitive with other methods which compute protein functional similarity by pairwise GO-term comparisons, while also outperforming a common graph-based method, simUI. While these results look promising, the biological interpretability of such a semantic similarity measure remains difficult.

The strongest advantage of the proposed SSDD method was, however, its alleged invariance to the annotation richness of proteins, which was presented as shown in Figure 2 below (Figure 5 in the paper).

Figure 2: The performance of different methods dealing with sets of proteins with different annotation richness.

The results in this figure show that SSDD exhibits only a slight decrease in Pearson correlation coefficient against a set of reference similarity values for proteins which are less well annotated. This ability to deal with poorly annotated proteins is the true value of the SSDD method proposed by Xu et al. However, this investigation was performed on sets of proteins selected by the authors, and should thus be validated independently to confirm these surprising results.

Journal Club: Ligand placement based on prior structures: the guided ligand-replacement method

Last week I presented a paper by Klei et al. on a new module in the Phenix software suite. This module, entitled Guided Ligand-Replacement (GLR), aims to make it easier to place ligands during the crystallographic model-building process by using homologous models of the ligand-protein complex for the initial placement of the ligand.

In the situation where ligands are being added to a crystallographic protein model, a crystallographer must first build the protein model, identify the difference electron density, and then build the ligand into this density.

The GLR approach is particularly helpful in several cases:

  • In the case of large complex ligands, which have many degrees of freedom, it can take a long time to fit the ligand into the electron density. There may be many different conformations of the ligand that fit the difference electron density to a reasonable degree, and it is the job of the crystallographer to explore these different conformations. They must then identify the true model, or perhaps an ensemble of models in the case where the ligand is mobile or present in different, distinct, binding modes. GLR makes this process easier by using a template from a similar, previously-solved structure. The ligand position and orientation is then transplanted to the new structure to give a starting point for the crystallographer, reducing the tedium of the initial placement of the ligand.
  • In the case of a series of related crystal structures, where the same protein structure is determined a number of times, bound to different (but similar) ligands. This is common in the case of structure based drug-design (SBDD), where a compound is developed and elaborated upon to improve binding affinity and specificity to a particular protein. This process generates a series of crystal structures of the protein, bound to a series of ligands, where the binding modes of the ligands are similar in all of the structures. Therefore, using the position and orientation of the ligand from a structure is a good starting point for the placement of further elaborations of that ligand in subsequent structures.
  • In the case of several copies of the protein in the asymmetric unit cell of the crystal. After one copy of the ligand has been built, it can be quickly populated throughout the unit cell, removing the need for the crystallographer to undertake this menial and tedious task.

Program Description:

The required inputs for GLR are standard, as required by any ligand-fitting algorithm, namely:

  • The APO structure of the protein (the structure of the protein without the ligand)
  • A description of the ligand (whether as a SMILES string, or as a cif file etc)
  • An mtz file containing the experimental diffraction data

Overview of the program:

Fig 1. Program Overview.

> Identification of the reference structure

Firstly, the program must determine the reference structure to be used as a template. This can be specified by the user, or GLR can search a variety of sources to find the best template. The template selection process is outlined below. Reference structures are filtered by the protein sequence identity, similarity of the molecular weights of the ligands, and finally by the similarity of the binary chemical fingerprints of the ligands (as calculated by the Tanimoto coefficient).

Fig 2. Reference Structure selection flow diagram.

Little justification is given for these cutoffs, although it is generally accepted that proteins with above 70% sequence identity are highly structurally similar. The Tanimoto coefficient cutoff of 0.7 presumably serves only to remove the possibility of very low-scoring matches, since if multiple potential reference structures are available, the ligand match with the highest Tanimoto score is used. They do not, however, say how they balance the choice in the final stage, where they take the ligand with the highest Tanimoto score and the best resolution.
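
To make the selection cascade concrete, here is a minimal sketch of the filtering logic as described above. It is not the GLR implementation: the candidate fields are hypothetical, the molecular-weight filter is omitted, and RDKit path fingerprints stand in for the binary chemical fingerprints used in the paper.

```python
from rdkit import Chem, DataStructs

def select_reference(target_smiles, candidates, min_seq_id=0.70, min_tanimoto=0.70):
    """Pick a reference protein-ligand complex for GLR-style placement.

    candidates : list of dicts with keys 'pdb_id', 'seq_id' (sequence identity
                 of the protein to the target, 0-1), 'smiles' (bound ligand)
                 and 'resolution' (in angstroms) - all hypothetical fields.
    Filters by protein sequence identity, then by ligand Tanimoto similarity,
    and finally picks the highest-Tanimoto match, breaking ties by resolution.
    """
    target_fp = Chem.RDKFingerprint(Chem.MolFromSmiles(target_smiles))
    scored = []
    for cand in candidates:
        if cand["seq_id"] < min_seq_id:
            continue                                  # protein not similar enough
        fp = Chem.RDKFingerprint(Chem.MolFromSmiles(cand["smiles"]))
        tanimoto = DataStructs.TanimotoSimilarity(target_fp, fp)
        if tanimoto >= min_tanimoto:
            scored.append((tanimoto, -cand["resolution"], cand["pdb_id"]))
    return max(scored)[2] if scored else None         # best Tanimoto, then best resolution
```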

The method for assigning the binary chemical fingerprints can be found here (small error in link in paper).

> Superposition of Reference and Target structures

Once a reference structure has been selected, GLR uses graph-matching techniques from eLBOW to find the correspondences between atoms in the reference and target ligands. These atomic mappings are used to orient and map the target ligand onto the reference ligand.

Once the reference protein-ligand structure is superposed onto the target protein, these atomic mappings are used to place the target ligand.

The target complex then undergoes a real-space refinement to adjust the newly-placed ligand to the electron density. This allows the parts of the target ligand that differ from the reference ligand to adopt the correct orientation (as they will have been orientated arbitrarily by the graph-matching and superposition algorithms).

> Summary, Problems & Limitations

GLR allows the rapid placement of ligands when a homologous complex is available. This reduces the need for computationally intensive ligand-fitting programs, or for tedious manual building.

For complexes where a homologous complex is available, GLR will be able to quickly provide the crystallographer with a potential placement of the ligand. However, at the moment, GLR does not perform any checks on the validity of the placement. There is no culling of the placed ligands based on their agreement with the electron density, and the decision as to whether to accept the placement is left to the crystallographer.
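
GLR does not do this, but a simple validity check one could imagine is a real-space correlation coefficient (RSCC) between the observed and model density sampled on grid points around the placed ligand. The sketch below uses synthetic density values and an arbitrary cutoff, purely to illustrate the idea.

```python
# Illustrative only: a crude density-agreement check of the kind GLR could add.
import numpy as np

def rscc(obs, calc):
    """Pearson correlation between observed and calculated density samples."""
    obs = np.asarray(obs, dtype=float) - np.mean(obs)
    calc = np.asarray(calc, dtype=float) - np.mean(calc)
    return float(np.sum(obs * calc) / np.sqrt(np.sum(obs**2) * np.sum(calc**2)))

# Hypothetical density values sampled on grid points around the placed ligand
rng = np.random.default_rng(0)
rho_obs = rng.random(500)
rho_calc = 0.8 * rho_obs + 0.2 * rng.random(500)   # a reasonably good placement
if rscc(rho_obs, rho_calc) < 0.7:                  # cutoff chosen for illustration
    print("Flag this placement for manual inspection")
```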

As the authors recognise in the paper, GLR currently removes any overlapping copies of the ligand that it places, keeping only the copy that agrees best with the electron density. This means that GLR is unable to generate multiple conformations of the target ligand. As such, the crystallographer will still need to check whether the proposed orientation of the ligand is the only conformation present, or whether additional models of the ligand must be built.

As it stands, GLR seems to be a useful time-saving tool for crystallographic structure solution. Although it is possible to incorporate the tool into automated pipelines, I feel that it will mainly be used in manual model building, because of the problems above, which require regular checking by the crystallographer.

There are several additions that could be made to overcome the current limits of the program, as identified in the paper. These mainly centre on generating multiple conformations and validating the placed ligands. If implemented, these would make GLR a highly useful module for the solution of protein-ligand complexes, especially as the number of ligand-bound structures in the PDB continues to grow.

Journal Club: Human Germline Antibody Gene Segments Encode Polyspecific Antibodies

This week’s paper by Willis et al. sought to investigate how our limited repertoire of antibody-encoding genes has the ability to recognise an essentially unlimited array of antigens: there is a finite number of V, D and J genes encoding our antibodies, yet the repertoire can recognise a seemingly infinite number of antigens. Simply put, the authors’ notion is that an antibody from the germline (via V(D)J recombination; see the entry by James) is able to adopt multiple conformations, thus allowing it to bind multiple antigens.

Three antibodies derived from the germline gene 5-51*01 bind to very different antigens.

To test this hypothesis, the authors performed a multiple sequence alignment of the amino acid sequences of the mature antibodies against the germline antibody sequence from which they are derived. If a position in even ONE mature antibody differed from the germline sequence, it was identified as a ‘variable’ position and was allowed to change during Rosetta’s multi-state design (MSD) and single-state design (SSD) protocols.
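
As a concrete illustration of this bookkeeping (not the authors' actual code), a few lines of Python are enough to flag the variable positions, assuming the mature and germline sequences are already aligned and gap-free; the sequence fragments below are invented.

```python
# Toy example: flag 'variable' positions where ANY mature antibody differs
# from the germline sequence (sequences assumed aligned and gap-free).
germline = "QVQLVQSGAEVKKPGE"                      # invented aligned fragment
matures = ["QVQLVESGAEVRKPGE",
           "QVQLVQSGAEVKKPGA",
           "QVHLVQSGAEVKKPGE"]

variable = [i for i, g in enumerate(germline)
            if any(m[i] != g for m in matures)]
print(variable)   # these positions would be allowed to change in MSD/SSD
```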

Figure 1 from Willis et al., showing the pipeline: align the mature antibodies (2XWT, 2B1A, 3HMX) to the germline sequence (5-51), identify ‘variable’ positions from the alignment, then allow Rosetta to change those residues during design.

Surprisingly, without any prior knowledge of the germline sequence, MSD yielded a sequence that was closer to the germline sequence, while SSD for each mature antibody retained the mature sequence. In short, this indicates that the germline sequence is a harmonising sequence that can accommodate the conformations of all of the mature antibodies (as shown by MSD), whereas each mature sequence is the lowest-energy amino acid sequence for that particular antibody’s conformation (as shown by SSD).

To further demonstrate that the germline sequence is indeed the more ‘flexible’ sequence, the authors then aligned the mature antibodies and determined the deviation in the ψ-ϕ angles at each of the variable positions used in the Rosetta study. They found that the ψ-ϕ deviation at the positions that reverted to the germline residue was much larger than at the other variable positions along the antibody. In other words, the positions that tend to return to the germline amino acid in MSD show much greater variation in their ψ-ϕ angles, suggesting that these positions are prone to substantial movement.
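
For anyone wanting to reproduce this kind of backbone comparison, here is a minimal sketch (not the authors' pipeline) of how ψ-ϕ angles can be extracted with Biopython; the PDB file name and the chain identifier 'H' are assumptions.

```python
# Illustrative only: extract backbone phi/psi angles from a locally
# downloaded structure (chain ID 'H' for the heavy chain is assumed).
import math
from Bio.PDB import PDBParser, PPBuilder

parser = PDBParser(QUIET=True)
structure = parser.get_structure("2XWT", "2xwt.pdb")

for peptide in PPBuilder().build_peptides(structure[0]["H"]):
    for residue, (phi, psi) in zip(peptide, peptide.get_phi_psi_list()):
        if phi is not None and psi is not None:
            print(residue.get_id()[1],               # residue number
                  round(math.degrees(phi), 1),       # phi in degrees
                  round(math.degrees(psi), 1))       # psi in degrees
```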

In addition to the many results that corroborate the findings mentioned in this entry, it’s neat that the authors took a ‘backwards’ spin on conventional antibody design. Most antibody design regimes aim to find amino acid(s) that give the antibody more ‘rigidity’, and hence mature its affinity, but this paper went against the norm to find the most FLEXIBLE antibody (the most likely germline predecessor*). Effectively, they argue that this type of protocol could be used to extract new antibodies that bind multiple antigens, thus increasing the versatility of antibodies as potential therapeutic agents.

Journal Club: Large-scale analysis of somatic hypermutations

This week I presented a paper by Burkovitz et al. from Bar-Ilan University in Israel. The study investigates the mutations that occur during B-cell maturation and how the propensity for a change to be selected is affected by where in the antibody structure it is located. It nicely combines analysis of both DNA and amino-acid sequence with structural considerations to inform conclusions about how in vivo affinity maturation occurs.

Before being exposed to an antigen, an antibody has a sequence determined by a combination of genes (V and J for the light chain; V, D and J for the heavy chain). Once exposed, B-cells (the cells that produce antibodies) undergo somatic hypermutation (SHM) to optimise the antibody-antigen (ab-ag) interaction. These mutations are commonly thought to be promoted at activation-induced deaminase (AID) hotspots.

The authors’ first finding is that the locations of SHMs do not correlate well with the positions of AID hotspots: the distribution of their distances to the nearest hotspot is not very different from the background distribution. They conclude that, although perhaps a mechanism for promoting mutation, AID hotspots are not a strong factor in determining whether a mutation becomes fixed.
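
A back-of-the-envelope version of this comparison is easy to sketch: measure the distance from each mutated position to the nearest AID hotspot, here taken to be the commonly cited WRC motif (W = A/T, R = A/G). The sequence and mutation positions below are toy values, not data from the paper.

```python
# Toy example: distance from each SHM position to the nearest WRC hotspot.
import re

germline_dna = "ATGCAGCTGGTGCAGTCTGGAGCAGAGGTGAAAAAGCCCGGG"   # invented sequence
mutated_positions = [5, 17, 30]                                # invented SHM sites

# Lookahead regex catches overlapping motifs; +2 gives the position of the C
hotspots = [m.start() + 2 for m in re.finditer(r"(?=[AT][AG]C)", germline_dna)]

for pos in mutated_positions:
    nearest = min(abs(pos - h) for h in hotspots)
    print(f"SHM at {pos}: {nearest} bp from the nearest WRC hotspot")
```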

Motivated to find other determinants of SHM preferences, the study turns to examining structural features and energetics of the molecules. SHMs are found to be more prevalent on the VH domain of an Fv than on the VL. However, when present, the energetic importance of an SHM is not related to the domain it is on. In contrast, the contribution an SHM makes to the binding energy is related to its structural location. As one might expect, SHMs at positions that can make contact with the antigen have more effect than those that do not. Considering their propensity instead of their raw frequency also shows that SHMs are more prevalent in antibody-antigen interfaces than in the rest of the molecule. However, they are also likely to occur in the VH-VL interface, suggesting an importance for this region in fine-tuning the geometry and flexibility of the binding site. The propensity calculation is sketched below.
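
The frequency-versus-propensity distinction is worth making explicit: propensity normalises the number of SHMs observed in a region by the size of that region. A toy calculation (with invented counts, not the paper's numbers) is shown below.

```python
# Toy numbers only: propensity = (fraction of SHMs in a region) divided by
# (fraction of the molecule's residues that the region accounts for).
shm_counts   = {"ab-ag interface": 40, "VH-VL interface": 25, "core": 15, "other surface": 20}
region_sizes = {"ab-ag interface": 30, "VH-VL interface": 25, "core": 60, "other surface": 115}

total_shms = sum(shm_counts.values())
total_residues = sum(region_sizes.values())

for region in shm_counts:
    frequency = shm_counts[region] / total_shms
    propensity = frequency / (region_sizes[region] / total_residues)
    print(f"{region:<16} frequency={frequency:.2f}  propensity={propensity:.2f}")
```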

Figure taken from Burkovitz et al., showing: a) the location of the different structural regions on the Fv; b) the energetic contribution of the SHMs in each region; c) the fraction of SHMs in each region and the regions’ relative sizes; d) the propensity for an SHM to occur in each of the five structural regions.

Perhaps the most interesting result of this study concerns the propensity of SHMs to mutate germline residues to particular amino acids. It is found that, whilst germline amino-acid usage in binding sites is distinct from that in other protein-protein interfaces, the residue profiles of SHMs are less diverged. SHMs therefore act to bring the properties of the ab-ag interaction towards those seen in ordinary protein-protein interactions. This may suggest, as proposed by other studies, that the somatic hypermutation process is similar to the mutation processes observed in evolution. In addition, it is found that five amino acids (asparagine, arginine, serine, threonine and aspartic acid) are the most common substitutions made in SHM. Finally, the positions where SHMs most often have an important effect on binding energy are presented. These positions, together with the amino-acid preferences, provide promising targets for rational antibody design procedures.