Tag Archives: journal club

Journal club: Human enterovirus 71 protein interaction network prompts antiviral drug repositioning

Viruses are small infectious agents, which possess genetic code but have no independent metabolism. They propagate by infecting host cells and hijacking their machinery, often killing the cells in the process. One of the key challenges in developing effective antiviral therapies is the high mutation rate observed in viral genomes. A way to circumvent this issue is to target host proteins involved in virion assembly (also known as essential host factors, or EHFs), rather than the virion itself.

In their recent paper, Lu Han et al. [1] consider human virus protein-protein interactions in order to explore possible host drug targets, as well as drugs which could potentially be re-purposed as antivirals. Their study focuses on enterovirus 71 (EV71), one of the leading causes of hand, foot, and mouth disease.

Human virus protein-protein interactions and target identification

EHFs are typically detected by knocking out genes in the host organism and determining which of the knockouts result in virus control. Low repeat rates and high costs make this technique unsuitable for large scale studies. Instead, the authors use an extensive yeast two-hybrid screen to identify 37 unique protein-protein interactions between 7 of the 11 virus proteins and 29 human proteins. Pathway enrichment suggests that the human proteins interacting with EV71 are involved in a wide range of functions in the cell. Despite this range in functionality, as many as 17 are also associated with other viruses, either through known physical interactions, or as EHFs (Fig 1).

Fig. 1. Interactions between viral and human proteins (denoted as EIPs), and their connection to different viruses.

One of these is ATP6V0C, a subunit of vacuole ATP-ase. It interacts with the EV71 3A protein, and is a known essential host factor for five other viruses. The authors analyse the interaction further, and show that downregulating ATP6V0C gene expression inhibits EV71 propagation, while overexpressing it enhances virus propagation. Moreover, treating cells with bafilomycin A1, a selective inhibitor for vacuole ATP-ase, inhibits EV71 infection in a dose-dependent manner. The paper suggests that therefore ATP6V0C may be a suitable drug target, not only against EV71, but also perhaps even for a broad-spectrum antiviral. While this is encouraging, bafilomycin A1 is a toxic antibiotic used in research, but not suitable for human or drug use. Rather than exploring other compounds targeting ATP6V0C, the paper shifts focus to re-purposing known drugs as antivirals.

Drug prediction using CMap

A potential antiviral will ideally disturb most or all interactions between host cell and virion. One way to do this would be to inhibit the proteins known to interact with EV71. In order to check whether any known compounds already do so, the authors apply gene set enrichment analysis (GSEA) to data from the connectivity map (CMap). CMap is a database of gene expression profiles representing cellular response to a set of 1309 different compounds.  Enrichment analysis of the database reveals 27 potential EV71 drugs, of which the authors focus on the top ranking result, tanespimycin.

Tanespimycin is an orphan cancer drug, originally designed to target tumor cells by inhibiting HSP90. Its complex effects on the cell, however, may make it an effective antiviral. Following their CMap analysis, the authors show that tanespimycin reduces viral count and virus-induced cytopathic effects in a dose-dependent manner, without evidence of cytotoxicity.

Overall, the paper presents two different ways to think about target investigation and drug choice in antiviral therapeutics — by integrating different types of known host virus protein-protein interactions, and by analysing cell response to known compounds. While extensive further study is needed to determine whether the results are directly clinically relevant to the treatment of EV71, the paper shows how  interaction data analysis can be employed in drug discovery.


[1] Han, Lu, et al. “Human enterovirus 71 protein interaction network prompts antiviral drug repositioning.” Scientific Reports 7 (2017).


Proteins evolve on the edge of supramolecular self-assembly

Inspired by Eoin’s interesting talks on prions and prion diseases, and Nick’s discussion of how Cyro-Electron microscopy is going to be the end of an era for Crystallography. I thought I’d look at a paper that discusses aggregation of protein complexes, with some cryo-electron microscopy thrown in for good measure.

Supramolecular assembbly

a, A molecule gaining a single self-interacting patch forms a finite dimer. A self-interacting patch repeated on opposite sides of a symmetric molecule can result in infinite assembly. b, A point mutation in a dihedral octamer creates a new self-interacting patch (red), triggering assembly into a fibre.

Supramolecular assemblies are folded protein complexes forming into much larger units. This formation can be triggered by a mutation on a copy of the constituent homomers of the complex, acting as a self-interacting patch. If this patch were to form in a non-symmetric complex, it would likely form a finite assemble with a limited number of copies of the complex. However, if the complex has dihedral symmetry such that a patch is accessible at multiple separated locations, then complex can potentially form near infinite supramolecular assemblies. Continue reading

Is contacts-based protein-protein affinity prediction the way forward?

The binding affinity of protein interactions is useful information for a range of protein engineering and protein-protein interaction (PPI) network challenges. Obvious applications include the development of therapeutic antibodies to given drug targets or the engineering of novel interfaces for synthetic protein complexes. An accurate model would furthermore allow us to predict a large proportion of affinities in existing PPI networks, and enable the identification of new PPIs, which is critical for our ability to model protein network dynamics effectively.


“The design of an ideal scoring function for protein−protein docking that would also predict the binding affinity of a complex is one of the challenges in structural proteomics.” Adapted from Kastritis, Panagiotis L., and Alexandre MJJ Bonvin. Journal of proteome research 9.5 (2010): 2216-2225.

In last week’s paper a new binding-affinity prediction method based on interfacial contact information was described. Contacts have long been used to in docking methods but surprisingly this was the first time that binding affinity was predicted with them. Largely, this was due to the lack of a suitable benchmark data set that contained structural as well as affinity data . In 2011, however, Kastritis et al. presented a curated database of 144 non-redundant protein–protein complexes with experimentally determined Kd (ΔG) as well as x-ray structures.
Using this data set they trained and validated their method, compared it against others and concluded that interfacial contacts `can be considered the best structural property to describe binding strength`. This claim may be true but as we discussed in the meeting there is still some work to do before we take this model an run with it. A number of flags were raised:

  • Classification of experimental methods into reliable and non-reliable is based on what gives the best results with their method. Given that different types of protein complexes are often measured with different methods, some protein classes for which contact-based predictions are less effective may be excluded.
  • Number of parameters for model 6 is problematic without exact AIC information. As Lyuba righlty pointed out, the intercept in model 6 `explodes`. It is no surprise that the correlation improves with more parameters. Despite their AIC analysis, overfitting is still a worry due to the lack of details presented in the paper.


  • Comparison against other methods is biased in their favour; their method was trained on the same data set, the others were not. In order to ensure a fair comparison all methods should be trained on the same data set. Of course this is hard to do in practice, but the fact remains that a comparison of methods that has been trained on different data sets will be flawed.

Paper: Vangone, A., Bonvin, A. M. J. J., Alberts, B., Aloy, P., Russell, R., Andrusier, N., … Zhou, Y. (2015). Contacts-based prediction of binding affinity in protein-protein complexes. eLife, 4, e07454. http://doi.org/10.7554/eLife.07454

Slow and steady improvements in the prediction of one-dimensional protein features

What do you do when you have a big, complex problem whose solution is not necessarily trivial? You break the problem into smaller, easier to solve parts,  solve each of these sub-problems and merge the results to find the solution of the original, bigger problem. This is an algorithm design paradigm known as the divide and conquer approach.

In protein informatics, we use divide and conquer strategies to deal with a plethora of large and complicated problems. From protein structure prediction to protein-protein interaction networks, we have a wide range of sub and sub-sub problems whose solutions are supposed to help us with the bigger picture.

In particular, prediction of the so called one-dimensional protein features are fundamental sub-problems with a wide range of applications such as protein structure modelling,  homology detection, functional characterization and others. Here, one-dimensional protein features refer to secondary structure, backbone dihedral and C-alpha angles, and solvent accessible surface area.

In this week’s group meeting, I discussed the latest advancements in prediction of one-dimensional features as described in an article published by Heffernan R. and colleagues in Scientific Reports (2015):

“Improving prediction of secondary structure, local backbone angles, and solvent accessible surface area of proteins by iterative deep learning.”

In this article, the authors describe the implementation of SPIDER2, a deep learning approach to predict secondary structure, solvent accessible surface area, and four backbone angles (the traditional dihedrals phi and psi, and the recently explored theta and tau).

“Deep learning” is the buzzword (buzz-two-words or buzzsentence, maybe?) of the moment. For those of you who have no idea what I am talking about, deep learning is an umbrella term for a series of convoluted machine learning methods. The term deep comes from the multiple hidden layers of neurons used during learning.

Deep learning is a very fashionable term for a reason. These methods have been shown to produce state-of-the-art results for a wide range of applications in several fields, including bioinformatics. As a matter of fact, one of the leading methods for contact prediction (previously introduced in this blog post), uses a deep learning approach to improve the precision of predicted protein contacts.

Machine learning has already been explored to predict one-dimensional protein features, showing promising (and more importantly, useful) results. With the emergence of new, more powerful machine learning techniques such as deep learning, previous software are now becoming obsolete.

Based on this premise, Heffernan R. and colleagues implemented and applied their deep learning approach to improve the prediction of one-dimensional protein features. Their training process was rigorous: they performed a 10-fold cross validation using their training set of ~4500 proteins and, on top of that, they also had two independent test sets (a ~1200 protein test set and a set based on the targets of CASP11).  Proteins in all sets did not share more than 25% (30% sequence identity for the CASP set) to any other protein in any of the sets.

The method described in the paper, SPIDER2, was thoroughly compared with state-of-the art prediction software for each of the one-dimensional protein features that it  is capable of predicting. Results show that SPIDER2 achieves a small, yet significant improvement compared to other methods.

It is just like they say, slow and steady wins the race, right? In this case, I am not so sure. It would be interesting to see how much the small increments in precision obtained by SPIDER2 can improve the bigger picture, whichever your bigger picture is. The thing about divide and conquer is that if you become marginally better at solving one of the parts, that doesn’t necessarily imply that you will improve the solution of the bigger, main problem.

If we think about it, during the “conquer” stage (that is, when you are merging the solution of the smaller parts to get to the bigger picture),  you may make compromises that completely disregard any minor improvements for the sub-problems. For instance, in my bigger picture, de novo protein structure prediction, predicted local properties can be sacrificed to ensure a more globally consistent model. More than that, most methods that perform de novo structure prediction already account for a certain degree of error or uncertainty for, say, secondary structure prediction. This is particularly important for the border regions between secondary structure elements (i.e. where an alpha-helix ends and a loop begins). Therefore, even if you improve the precision of your predictions for those border regions, the best approach for structure prediction may still consider those slightly more precise border predictions as unreliable.

The other moral of this story is far more pessimistic. If you think about it, there were significant advancements in machine learning, which led to the creation of ever-more-so complicated neural network architectures. However, when we look back to how much improvement we observed when these highly elaborate techniques were applied to an old problem (prediction of one-dimensional protein features), it seems that the pay-off wasn’t as significant (at least as I would expect). Maybe, I am a glass half-empty kind of guy, but given the buzz surrounding deep learning, I think minor improvements is a bit of a let down. Not to take any credit away from the authors. Their work was rigorous and scientifically very sound. It is just that maybe we are reaching our limits when it comes to applying machine learning to predict secondary structure. Maybe when the next generation of buzzword-worthy machine learning techniques appear, we will observe an even smaller improvement to secondary structure prediction. Which leaves a very bitter unanswered question in all our minds: if machine learning is not the answer, what is?

MAMMOTH: a case study in protein structure alignment

I’ve talked about protein structure alignment before in the context of a rather novel, mathematical approach. This time I wanted to revisit the topic in a general sense, using a more established algorithm as a case study. MAMMOTH stands for Matching Molecular Models Obtained from Theory and was first published in 2002. Since then it has been cited nearly 400 times and the underlying algorithm has been extended to a multiple alignment program: MAMMOTH-mult.

Establishing biologically relevant and consistent alignments between protein structures is one of the major unsolved problems in computational bioinformatics. However, it’s an important part of many challenges that we face: such as establishing homology between distantly related proteins, functional inference for unannotated proteins, and evaluating the accuracy of models of predicted structure for competitions such as CASP.

Problem Outline

In essence the problem of protein structure alignment can be outlined by considering two ordered sets of coordinates, A = {a1,a2,…,an} and B = {b1,b2,…,bm}, representing points in 3D space. In most cases these points will be the location of the Cα atoms along each structure’s backbone. The sets A and B might be completely different lengths and, if an alignment exists, are almost certainly orientated differently compared to each other.


Establishing an alignment between these sets is equivalent to two steps:

  1. Establish a match M = {(ai,bj) | ai ∈ A, bj ∈ B}
  2. Rotate and translate A onto B so that equivalent atoms are as close as possible.

Of course, it is not immediately clear how to best fulfill these requirements. In particular, we don’t really know what features to prioritise for a biologically relevant match. Should we try to match secondary structure elements and what penalty should we attach to mismatched elements? How about maintaining the correct hydrogen bonding patterns between matched residues? And how much weight should we put on the matched atoms being consecutive in each set (i.e. how should we penalise gaps)?

The second step is equally ambiguous. Especially as there is no consensus on what the correct interpretation of close is. Minimising the RMSD between equivalent atoms is a popular choice of distance measure. However, as the MAMMOTH paper points out, RMSD is often dominated by the mismatched portions of remotely related structures and is thus largely inappropriate in these cases. Furthermore, even if we have a well-defined distance metric, should the superposition prioritise minimising the distances between nearly identical parts of the different structures, at the expense of less similar substructures? Or should the emphasis be on maintaining as lengthy a match as possible at the possible cost of a lower closeness of fit? How about the relative importance of a close fit for atoms in the core of the structure vs. those on the surface?

The majority of these questions remain unanswered and as a result it is often hard to validate alignments as we simply do not know what the right answer is. In fact, in many cases, manual analysis is preferred over any of the available computational techniques.

In this post I’ll go through how the MAMMOTH algorithm approaches each of these steps. For many of the above questions MAMMOTH does not postulate a solution, primarily because, as its name suggests, it was designed to assess prediction models which are often at low resolutions and lacking secondary structure or hydrogen bonding information. I think it’s important to keep these questions in mind, but there’s certainly no necessity to design a programme which deals with them all.

Step 1: Pairing up residues (similarity matrix)

In order to establish a match between equivalent atoms in A and B, MAMMOTH, like several other structural alignment algorithms, uses a well-established alignment technique: a similarity matrix (often inferred from and referenced as a distance matrix). A similarity matrix for alignment is an n x m array where each entry S(ai,bj) represents the pairwise similarity score between residues ai and bj. An alignment between the residues of A and B is any non-decreasing path (that is, a pair (ai,bj) in the path must appear later in the ordering of coordinates of both A and B than the preceding pair of residues in the path) from the top left corner of the array (a1,b1) to the bottom right corner (an,bm). For example the following path can be interpreted as an alignment between A = {a1, …, a11} and B = {b1, …, b8}


Any alignment can be scored by summing up the similarity scores along this path, while penalising any gaps in an appropriate way (normally, these algorithms use trial and error to decide on sensible penalties). For example, the above alignment would have the score S = S(a1,b1) + S(a2,b2) + S(a3,b3) + S(a7,b4) + S(a8,b5) + S(a9,b6) + S(a10,b7) + S(a11,b8) + α + 2β, where α and β are gap opening and gap extension penalties respectively. The optimal alignment is simply the alignment which maximises this score.

For sequence alignments similarity scores can be assigned to residues from substitution tables like BLOSUM. However, it is not immediately clear of an appropriate equivalent for structures. MAMMOTH, like several other algorithms, defines the similarity between different residues by examining their local structural landscape. Specifically, this means comparing fragments of each backbone, centred on the residue of interest. MAMMOTH uses the URMS distance between heptapeptide fragments. This distance is illustrated below using 2D chains and tripeptide fragments.


Comparing residues a2 and b3 involves looking at the directions between each successive residue of the fragment. Each vector is mapped to the unit sphere, beginning at the origin and ending at the surface of the sphere (in this case 2 vectors are considered, and in MAMMOTH’s case 6 different 3D vectors are mapped). The optimal rotation is found, superposing equivalent vectors as best as possible, and then the RMSD of the endpoints on the surface of the sphere is calculated as URMS(ai,bj).

Aside: The optimal superposition of a set of vectors is actually a non-trivial problem. It is essentially equivalent to step 2 in our alignment protocol outlined above, but is significantly easier for the 6 vectors characterising a fragment in MAMMOTH’s algorithm.

Finally, S(ai,bj) is calculated by converting the distance into a similarity measure:


where URMSR is the expected URMS of a random set of vectors and:


The optimal alignment through this MAMMOTH matrix is the path which maximises the sum of similarities between matched residues (each residue being at the centre of the heptapeptide fragment) using gap opening and extension penalties of 7.00 and 0.45 respectively.

Step 2: Global superposition (MaxSub)

The above alignment results in a match M’ optimising the local structural similarity of residues in each structure, however, their is no guarantee that this will result in a set of coordinates close in global space. In order to finalise the match set M ⊆ M’ as well as calculating the optimal superposition of the paired residues of A onto their equivalent points in B, MAMMOTH use the MaxSub algorithm. This is a very efficient algorithm (worth a read if you’re that way inclined) for calculating the maximal subset from a set of paired up atoms which are close in global space. MAMMOTH decide that close means < 4A away after superposition. They do not try to optimise a closer superposition than that but attempt to find the largest possible set of matched residues.

The MaxSub algorithm relies on the assumption (made for computational tractability) that the final subset M ⊆ M’ will have, somewhere, a set of at least four residues consecutive in M’. The algorithm then starts with every possible seed of four consecutive residues (just to illustrate the power of the assumption in reducing computational time: for a 150 residue protein there are just 147 such seeds, but over 2 million sets of four non-consecutive residues!! And it’s a pretty reasonable assumption to make as well). The MaxSub algorithm then calculates the superposition for those four matched pairs, extending the set of residues that are <4A away from their partners, recalculating the superposition using these new pairs as well, then removing any pairs which are no longer within the threshold of each other. It repeats these steps, gradually extending the set M, until the algorithm converges.

Scoring the alignment

Using the two approaches outlined above, MAMMOTH generates an alignment between the two input structures. In order to summarise the significance of this alignment, the algorithm generates the PSI score: the percentage structural identity (which is simply the size of the maximum subset divided by the length of the shortest protein). As a global measure of the strength of similarity the PSI score is poorly constructed and scales with protein length. In order to adjust for this bias, MAMMOTH fits a Gumbel distribution to PSI scores obtained from random structure comparisons between unrelated proteins at bins of different lengths. This results in a z-score measuring, instead of the PSI of an alignment, the likelihood of obtaining a PSI score as good as that by chance between any two proteins of the same lengths.

Journal Club: Statistical Quality Indicators for Electron Density Maps

This Week I presented Ian Tickle’s 2012 Paper “Statistical quality indicators for electron-density maps”. This paper presented new, statistically robust metrics for describing the agreement between an atomic model and the experimentally derived electron density.

Previous metrics such as the Real-space R (RSR) and Real-Space Correlation Coefficient (RSCC) (Brandon & Jones, 1991, and others) are popular electron density metrics, and can inform on the quality of an atomic model. However, as Tickle claims, they cannot inform on how the model is good, or bad, as they give no indication of the accuracy, or the precision, of the electron density.


Ian Tickle describes accuracy as – “How close are the results on average to the truth (regardless of precision)?” This is more often referred to as ‘error’. The most accurate model is the one that best agrees with the electron density.


Precision is described as – “If you were to repeat the experiment, how much would you expect the results to vary (regardless of accuracy)?” This is more often described as ‘uncertainty’. Precision is a property of the crystal and the experiment. It is independent of the model.

A pictographic representation is shown below –

Pictographic representation of accuracy and precision. Taken from Tickle, 2012.

Before the discussion of the new metrics proposed, there are several assumptions that must be made and several influencing factors to be considered.


  • The electron density, and the phases used to generate it, are accurate. This assumption is reasonable because density-based validation is generally done near to the end of refinement when the model is mostly correct.

Metric usefulness depends critically on:

  • Accurate calculation and representation of the electron density from our atomic model.
  • Accurate scaling of the observed and model density (neither calculated nor observed density is not on an absolute scale).
  • Accurate determination of the area to be considered for the calculation of the metric. If too large an area is considered, noise and density from other features will influence the metric. Too small an area will not encompass the whole model and its environment.

Calculating the Model Density:

Accurate calculation of the model’s electron density is essential, as the profile of the atoms will of course affect the comparison of the model to the experimental density. Often (as in Jones 1991, and others) a fixed profile is assumed for all atoms. Of course, in reality the profile will depend on atom type, B-factors, data completeness, and resolution limits.

Due to the resolution limits, the electron density from an atom is the convolution of a 3d gaussian and a sphere of constant scattering power (Blundell & Johnson, 1976). The truncated density function for an atom then becomes:

Screen Shot 2014-04-08 at 19.50.14

Scaling the calculated density:

This, fortunately, is already available and calculated by refinement programs (when calculating 2mFo – DFc maps), and the correct scaling factor is the resolution-dependent D.

Visualising the quality of a model:

To demonstrate how the (global) quality of a model can easily be seen, Tickle calculates and normalises difference density maps for a good, and a bad, model. If the model is ‘correct’, then the difference density should be gaussian noise, but if the model is ‘incorrect’, it will be skewed. This can easily be seen in Figure 8 from the paper.

Screen Shot 2014-04-08 at 20.26.06

A difference density map is calculated, sorted by value and normalised to give a density distribution. For a good model, this should look like (a), where the density function is a gaussian ~ N(0,1). For a bad model, (b), the distribution is skewed.

The main feature that appears in a ‘bad’ model is the increased weight in the tails of the distribution. Extra weight on the left-hand side indicates model that is not supported by the evidence, and extra weight on the right-hand side indicates under-modelled density.

The New Accuracy Metric

Using the ideas from above (that the difference density between a model and the experimental density should be distributed as a gaussian) Tickle goes on to develop metrics for determining the likelihood that a model density and an experimental density differ only by the addition of random noise.

The metric he describes tests a hypothesis – Does the distribution of the difference density reflect that obtained from the propagation of random errors in the experimental data (and phases)?

To do this, statistical tests are developed. First we define the difference density Z-score (ZD)

Screen Shot 2014-04-08 at 21.05.19

This quantity is the difference between the calculated electron density and the experimental density (delta rho), divided by the error in the difference density, giving the normal definition of a normalised Z-score.

The difference density (the numerator) has been discussed above, so we now discuss the error in the difference density. Assuming that the experimental data and the phases are ‘correct’, any non-random errors arise only from the model.

That is, errors arising in the experimental data will appear only as random noise, whereas errors in the model will manifest as the signal that we are trying to detect.

To calculate the strength of the noise (that of the experimental data and the phases), we look at the bulk-solvent regions. Here, the atoms are unordered, and so should give uniform density. Any deviations from uniform should be approximately the random noise from the experimental data and the phases.

Maximum Z-score analysis

Tickle considers using the maximum ZD of a sample as a test for model accuracy, and discusses merits and failings. In brief, if we were to sample from a difference density distribution, and take only the most significant ZD score, “focusing only on the maximum value inevitably overstates the significance of the results”.

A Chi-Squared test for ZD scores

The solution that Tickle proposes is to allow that all sample values may be significant (rather than just the largest values). He creates a joint probability density function of the absolute sample values (assumed half-normal and iid). This probability density function then becomes a chi-squared distribution.

Screen Shot 2014-04-08 at 22.34.30By calculating the CDF of the chi-squared (a lower regularised gamma function), Tickle is able to attain p-values for a set of observations.

Screen Shot 2014-04-08 at 22.38.11These can then be converted back to Z-scores, which crystallographers are more comfortable using. As Tickle states, just because the metric is in terms of Z-scores does not mean that the distribution is normal (here it is clearly a chi-squared).

Problem: Large Samples

The problem with the above is that for large samples, a small number of significant values will be drowned out by noise and the signal may be missed. The failure of the above test in this situation is put down to the choice of null hypothesis. Multiple null hypotheses are needed in order to cover all possibilities.

When distinguishing between multiple hypotheses, we must aim to avoid type II errors wherever possible, whilst attempting to minimise type I errors. We must select the hypothesis “that maximises the probability of obtaining a result less extreme that the one actually observed (…) or equivalently the one that minimises the probability of obtaining a result more extreme than that observed”.

Solution: A new JPDF and CDF

To solve this, Tickle takes a subset of the highest values of the original sample of n, say from i=k to n (the n-k highest scores), and calculates the chi-squared and its associated cumulative probability. We will then choose the value of k such that it gives us the highest probability,

Screen Shot 2014-04-08 at 22.56.34However, the cumulative probability of chi-squared is no longer the regularised gamma function due to the bias introduced by selected the largest values. Recalculating the JPDF and integrating analytically and numerically to obtain the CDF, we could arrive at a result. This, however, has the problem of a large dimensionality, which requires the use of very accurate Monte Carlo integration (accuracy of much better 0.27% is required, since we are interested in the p-values between 0.9973 and 1 – greater than 3 sigma).

Fortunately, an approximation can be made to bring about a practical solution.

Examples of Significant Distributions:

Tickle generates a table which gives several scenarios that will give a significant result, for different extremes of Z-value, and different sample sizes. One particularly key point is

…small samples are statistically less reliable so require a higher proportion of significant data points to achieve the same overall level of significance. Large samples require relatively fewer data points but they must have higher values to overcome the ‘multiple comparisons’ effect, where large values are more likely to occur occur purely as a result of random error.



Tickle shows early in the paper that the RSR and RSCC are correlated with the B-factor of the model. RSZD shows no correlation with the B-factor, as is desired.


More useful scores can be generated by scoring the negative and positive values of delta-rho separately. This gives two scores, RSZD+ and RSZD-. RSZD+ gives the significance/prevalence of unexplained density (missing atoms) and RSZD- gives the significance/prevalence of unjustified model/misplaced atoms.

Precision & Reliability:

Although not discussed in as much depth in the paper, Tickle also proposes a metric to account for the precision of the electron density

Screen Shot 2014-04-08 at 23.24.34

This is clearly independent of the model, and is the signal-to-noise ratio of the average observed density in a specified region. Weak density (or large noise) will lead to a small RSZO, implying that any model placed here should be considered unreliable.

Journal Club: Native contacts in protein folding

Like your good old headphone cables, strings of amino acids have the potential to fold into a vast number of different conformations given the appropriate conditions. A conservative estimation for the time it would take a 100 residue protein to explore all theoretically possible conformations would exceed the age of the Universe several times. This is obviously not feasible and was pointed out by Levinthal when he published his “How To Fold Graciously” in 1969.

The so called Protein-Folding Problem has since been under intense study, which inevitably has led to a few theories and models about its nature. Due to the lack of appropriate wet-lab methods to study this phenomenon theoretical, computational approaches have been key to devising impactful frameworks for formally describing protein folding. One of these goes under the name of principle of minimum frustration introduced by Bryngelson and Wolynes in the late 80s (1). It states that proteins by evolution were enriched for sequences with the propensity to fold into low-energy structures, while actively selecting against traps. By avoiding mis-folding and non-native contacts, the theory says, a smooth funnel-like energy landscape with native-state minima is created that ensures robust and fast folding.

This implies that native contacts, i.e. residues that interact in the fully folded protein play a major role in the folding process. Gō models (2), named after Nobuhiro Gō who first proposed this method, are based around this assumption with the energetic contributions of native interactions acting as the sole driving forces in the folding process. While this approach has yielded promising results, many of which were in concordance with experiments, its underlying principles have never been validated in a statistically meaningful way.

native contact schematic

A schematic for native-contact-driven protein folding

In 2013 a study by Best, Hummer and Eaton (3) formally addressed this question. By devising a set of statistical quantities aimed at weighting the importance of native and non-native interactions for folding and applying these to the analysis of several long MD folding simulations they were able to show a “native-centric mechanism” for small fast-folding proteins.

In a first step it was assessed whether the fraction of native contacts  provided a suitable reaction coordinate for the simulated folding events. From their equilibrium simulations two thresholds of native-contact-fractions  were chosen that defined folded and unfolded states (a two-state model is assumed). Overlaying the values for the most visited native-contact-fractions during simulation against these thresholds revealed a strong correlation between the two equilibrium probability density maxima and the protein’s fold state. In addition they showed that the range of native-contact-fractions between those found to represent unfolded and folded thresholds were indicative of being on a transition path (defined as the  “.. regions of the trajectories that cross directly from the unfolded well to the folded well ..”).

A further measure was introduced with the contact lifetime test. The log-ratio of the time a contact spent on a transition path vs the time it existed in the unfolded state was calculated and compared in a heat-map to the native contact map coloured by the number of contacts between residues.


Contact life time test for a selected protein.
Adapted from (3).

Among others this result revealed a clear connection between contacts with longer transition path life times and the number of contacts they made in the native structure.

So what about non-native interactions?

Screenshot from 2014-03-27 12:47:04

One of the measures addressing this question was the Bayesian measure for non-native contacts on transition paths. In the examples used in this paper, no obvious link between being on a transition path given a non-native contact was found unless they were close to native contacts. Further criteria such as the complementary quantity, which is the probability of being on a transition path when a contact is not made, concluded in a similar fashion.

Interestingly, it was found that the one protein that was influenced by non-native contacts was the designed α3D. Best et al. reasoned that additional frustration introduced when building a protein with artificially introduced stability has led to a shifting of helix register giving rise to this outlier.

When taken together, these results lay a robust foundation for further studies along the same lines. It is too early to accept or reject the presented findings as universal truth, but strong arguments for the native-centric mechanism being a reasonable model in small fast-folding proteins have been made. It would not be far-fetched to think that larger proteins would adhere to similar principles with non-native contacts modulating the landscape, especially when considering individual downhill folding modules.


(1) Bryngelson, J.D. et al., 1995. Funnels, pathways, and the energy landscape of protein folding: a synthesis. Proteins, 21(3), pp.167–95.

(2) Taketomi, H., Ueda, Y. & Gō, N., 1975. Studies on protein folding, unfolding and fluctuations by computer simulation. I. The effect of specific amino acid sequence represented by specific inter-unit interactions. International journal of peptide and protein research, 7(6), pp.445–59.

(3) Best, R.B., Hummer, G. & Eaton, W.A., 2013. Native contacts determine protein folding mechanisms in atomistic simulations. Proceedings of the National Academy of Sciences of the United States of America, 110(44), pp.17874–9.

Kinetic Modelling of Co-translational Protein Folding (Journal Club)

Following up on last week’s entry, this post will explore the same topic: polypeptide chains assuming native-like conformations as they are extruded from the ribosome, or for the less intimate with the concept, co-translational protein folding.

Before addressing some important questions concerning co-translational protein folding, I would like to make a parenthesis: I want to dedicate a paragraph or two to talk about time.

Biological processes are dynamic. They are events that occur over a period of time. For instance, one can quantify the effect of mutations propagated and accumulated over millions of years of evolution. One can also quantify the femtoseconds in which subtle conformational changes occur in photoreceptor proteins like rhodopsin, when they respond to light. Time is fundamental to understand and model any sort of biological event.

Albeit it might seem obvious to the reader that time is so crucial to amass biological knowledge, those of us more theoretically inclined (bioinformaticians, computational biologists, biostatisticians,  mathematical biologists and so on and so forth) are usually  presented with models that tend to over-simplify reality. Surprisingly enough, there are many over-simplistic models that neglect the effect of time in order to “better” represent whatever they claim to model. Take Protein Docking for instance. The biological process at hand presents a complicated dynamic. There is a kinetic equilibrium, in which a vast amount of protein and ligand molecules interact, associating into complexes and dissociating. Nonetheless, Protein Docking is traditionally reduced to the binding affinity between a pair of molecules. As one might say, this is only a problem if I can present a solution… Luckily, Protein Docking is not my subject of expertise, so I will leave this question open to more tenacious minds than my own.

One of the areas in which I am truly interested in is the co-translational aspect of protein folding. If one performs a quick Google Images search, using the terms “Protein Synthesis” or “Protein Translation”, the results tell a very interesting story.  The vast majority of nascent protein chains are represented as fully elongates peptide chains. In a majority of pictures, the growing peptides do not even present secondary structure. They are mostly represented by long, unfolded, almost linear polymers.

Now, any first year Biochemistry student learns about something called Hydrophobicity (or hydrophilicity depending on whether you are a glass half empty or half full type of person). It is biochemistry-introductory-text-book stuff that some residues are polar and some residues are apolar, and hence will hide from water, forming a hydrophobic core. That (hydrophobicity) is one of the main driving forces of  protein folding.

Hence, most of the images that appear in our Google Images search are not very representative. They are plain wrong. It is simple physics that the growing peptide chains will form secondary and tertiary structures during the process of protein synthesis. One has to remember that this process is dynamic, it is happening over time. Under these circumstances, time should not be neglected. The time scale at which extrusion occurs is slow enough to allow the nascent chain to probe conformations and simply abide to the laws of physics. A fully elongated, completely unfolded and denatured peptide chain would not exist during protein synthesis. These nascent chains would adopt intermediate conformations simply as a result of apolar residues trying to hide from water.

Ok. Now, the BIG question that can be raised is whether those intermediate conformations actually resemble the native state of the fully elongated protein. I do not want to incur in Baby Kicking, but one thing that evolution has taught us is that cells have evolved to be highly efficient systems. There is no room for wasted energy. It makes sense to hypothesize that over millions of years, the cellular machinery has adapted to explore these intermediate conformations in order to make the process of protein folding more efficient.

Over the past couple of years, substantial evidence has been amassed that codon usage and the degeneracy of the genetic code could be exploited by cells to ensure that protein folding occurs accurately and efficiently. There are many theoretical ways that such exploitation could occur: the codon translation speed could facilitate the formation of certain intermediates that are beneficial for protein folding, that increase stability or that prevent protein aggregation. There is even a biomedical impact given that some observed pathologies have been associated with synonymous codon mutations that may lead to misfolded proteins.

In the paper I presented during this journal club [1], O’Brien and colleagues have devised and described a very interesting kinetic model for protein translation. Their model was used to describe possible scenarios in which both fast and slow translation speed codons are coordinators of co-translational protein folding. Please note that, in this context, co-translational protein folding is perceived as an enrichment of intermediate conformations of  the nascent chains, which resemble the native structure of the fully elongated protein.

In the model described in the paper, they opted for a probabilistic approach instead of an analytical (differential equations) approach. The time is modelled by the use of probabilities. The authors derived a formula to quantify the expected proportion of nascent chains of a given length that would be in a Folded intermediate state (one that resembles the native structure). They have managed to express this in terms of a rate of codon translation. Therefore, they stablish a direct relationship between Co-Translational protein folding and codon translation speed.

Their analysis is robust as none of the constants and kinetic rates need to be experimentally derived in order to provide insights about the protein folding process. Overall, I think the way the model was built was quite ingenious and very interesting. I would suggest any interested reader to read the article if they want to understand how the whole modelling was carried out.

Overall, I think the authors present a compelling argument for how cells could explore codon degeneracy and co-translational aspects of protein folding to improve folding efficiency. One of their results present a scenario in which fast translation speed codons can be used to assist in the fold of unstable protein regions, preventing the formation of misfolded intermediates.

One of the many functions of mathematical models is to provide insights into the underlying biology of the phenomena they attempt to model. The lack of any experimental evidence to support this paper’s results does not make it any less interesting. The article presents to the readers a sound and solid mathematical argument as to how co-translational aspects of protein folding could be beneficial for cell efficiency. If anything, they provide interesting hypotheses that might drive experimentalists in the future.

[1] Kinetic modelling indicates that fast-translating codons can coordinate cotranslational protein folding by avoiding misfolded intermediates.

Stepwise Assembly For Protein Loop Prediction


Loop modeling is used frequently in designing the structure of new proteins or refining protein structures with limited X-ray and NMR data. In 2011, Sripakdeevong et al. introduced a hypothesis called “Stepwise Ansatz” for modeling RNA loops with atomic accuracy. They believed that current knowledge-based RNA loop predictors which aimed at predicting loops with atomic accuracy, failed to sample models within 1.5 Å RMSD of the native structures. The bottleneck in these methods is related to inefficient sampling of the conformational space. To overcome the limitation of sampling, Sripakdeevong et al. introduced an ‘ab initio’ (de novo) buildup strategy to allow for high resolution sampling of loops instead of restricting the search space to available fragments. But with current computational power, exhaustive enumeration of N-length (N>1) loops with atomic resolution is impossible. If N=1, considering all the degrees of freedom for nucleotide will result in 1 million conformations. Performing Rosetta energy minimization on these models will need 1 hour CPU time which is computationally reasonable. Every time a new nucleotide is added the conformational size will be multiplied exponentially by the RNA loop length (for a N=5 computational time ~ 10^23 CPU year).

Since enumeration of one nucleotide long loop is possible, the entire loop can be modeled by stepwise enumerative building of one nucleotide at a time on low energy conformations which are well-packed and hydrogen bonded. Therefore, their implementation of stepwise assembly (SWA) protocol in a dynamic programming-like recursion style enables sampling of 12 length loops with achievable CPU time. SWA being successful in prediction of RNA loops, was first used to predict protein loops with atomic accuracy by Rhiju Das . Loop regions in protein structures have particular characteristics compared to the regions of regular secondary structure. Loops have similar number of hydrogen bonds (on average 1.1 per residue), mainly contain polar/charged side chains and have less contact with the non-polar core of the protein. Current Loop modeling methods with atomic resolution start off with a reduced representation of the protein with simplified or no side-chains. Although coarse graining of proteins will assist in reducing large number of local minima but will fail in capturing non-polar and hydrogen bond interactions involving side chains.Therefore, SWA is used to build up a loop at its atomic resolution by sampling the possible conformation space which is energetically favorable and also computationally possible.


SWA is implemented in c++ in Rosetta framework. SWA uses a dynamic programming matrix (example is shown below in Figure 1D for a 6 length loop) to allow de novo buildup of loops from residue k to l. To achieve this, at each step SWA adds loop residue to build up forward from the N-terminus (from residue k-1 to i) and backward from the C-terminus (l+1 to j). Therefore, each circle point in figure 1D represents a (i,j) stage. SWA contains 5 main steps:

  1. Pre-packing the starting point : To start, all atoms of the loop region is removed from the model and side-chains are added to the model. This stage (k-1,l+1) is shown as green circle in figure 1D. Side chains are added and their torsion are minimized. Note that the non-loop backbones are kept fix in all stages of SWA.
  2.  Adding one loop residue to n-terminal: This stage is shown by orange arrows (moving downward) in Figure 1D. To generating possible conformations after adding the loop residue, backbone torsion angles (Φ,Ψ) of the added residue  and the backbone residue before that are sampled (Figure 1A). Φ,Ψ combinations which do not correspond to the Ramachandram are discarded. This procedure, can result in generating tens to thousands of conformations. For all the generated models, side chains are added to the sampled residues (i and i-1) and these side-chain along with the potential neighboring side chains are optimized. Afterward, clustering is performed, in which models are ranked in order of the energy and if a lower energy model has backbone RMSD of residue (i and i-1) <0.10Å compared to a higher energy model then the low energy model is removed (otherwise kept as a seed for a new cluster). After clustering the top 400 models are selected for all atom energy minimization on sampled residue backbone torsion and its neighbouring side-chain. Then, a final clustering is performed on the these models as described above.
  3. Adding one loop residue to c-terminal: This stage is shown by pink arrows (moving left) in Figure 1D. This is similar to step2, in which residue j and j+1 are considered for backbone sampling (Figure 1B), side-chain packing, model clustering and torsional minimization and final clustering.
  4. Closing loop chains :All models where the gap difference between C-terminal and N-terminal are 1,2 or 3 are subjected to chain closure. To generate closed loops, residue i+1 is added to N-terminal and i+2 and j-1 are added to C-terminal. For i+1, Φ and Ψ torsion are sampled by performing grid search as described above while backbone of i+2 and j-1 undergo Cyclic Coordinate Descent (CCD) which changes the Φ and Ψ torsion of i+2 and j-1 till it closes the gap to i+2. Models with chain closure < 0.01Å are then subject to side chain optimization, clustering, and torsional minimization. This procedure differs to above since all loop side chains and all loop backbones are affected by minimization. This stage is shown by blue arrows in Figure 1D just for gap lengths of one. In addition, to this procedure for loop closure, all models were closed by adding the last gap residue by grid search and trying to close the loop by CCD on the preceding residue. Also, models created by only sampling C-terminal or N-terminal are also used along with CCD loop closure to create full length models.
  5. Clustering: For each stage 400 models are generated, where the next stage uses these models to generate new conformations resulting in thousands models. Also several path can be used to get reach a specific stage, adding up to the numbers of generated models. Therefore, since SWA works on only low-energy models, only the 4000 lowest energy models are kept for further clustering. Clustering is similar to procedure above but with RMSD of 0.25Å and is applied on the entire loop segment which is build up to that stage. Then, the 400 lowest energy is used to move on to the next stage. At the loop closure stage also when the whole loops are modeled clustering is also used with RMSD of 1.0Å and the five lowest energy models are considered as SWA prediction.

Figure 1: Stepwise Assembly Schematic Representation

For short loops of (6 to 9 residue long), it was shown that solutions can be found just by creating models from N-terminal onward and separately by C-terminal backward and joining them by loop closure (or simply be moving just along the first column and first row of the dynamic matrix). Figure 1E shows a directed acyclic graph (DAG) of this procedure. The positive point is that in these cases computational time reduces to O(N) instead of O(N^2). Therefore, for such cases this procedure is tested first. If the difference between the lowest energy model and the second lowest is less than 1 kBT (a Rosetta energy unit is approximately 1 kBT) we can argue that modeling has not converged toward one model and the whole O(N^2) calculation should take place (Except for loops of length 24)


A difficult case study:

Predicting 1oyc loop (residue 203-214) has always been a challenge by loop predictors since most of its side-chains are polar/charged where hydrogen bonds play an important for stabilising the loop. All these factors are not considered in ab initio predictors with coarse-grained representation. Figure 2 of paper, displays the SWA build up procedure for 1oyc loop.The final best model (Figure 2:I) with the lowest energy has a c-alpha RMSD of 0.39 Å to the native structure. Following the build up path of 1oyc shows that the intermediate steps which lead to this final best model have not always been the lowest energy, therefore it is important to keep all the possible intermediate conformations. It is important to consider that different search paths allows sampling of totally diverse configurations. For example in Figure 2 (below), for 1oyc, 5 different configurations with comparable low energy generated by different build up paths are shown. Two totally different paths (blue and brown) may result in similar configurations while reasonably similar paths (pink, green and orange) have resulted in substantially different loop models.

SWA for 1oyc. 5 different configurations with comparable low energy

Figure 2: prediction of SWA for 1oyc loop. Five different configurations with comparable low energy are shown.

SWA on 35 loop test set:

SWA was used on a data set of 35 protein loops, where 20 of them allowed comparison with PLOP and Rosetta KIC and 15 where difficult cases with loop ranging between 8 to 24 residue. Comparing the median of RMSDs of lowest energy models (Table S1 of paper) shows SWA achieves better quality models (0.63 Å) with the same computational time as PLOP and Rosetta KIC. For the other 15 difficult cases SWA performance reduced by median RMSD of 1.4 Å for the lowest energy models.But, the highlight of SWA is prediction of 24 residue long loops,where it achieves sub-angstrom accuracy. Since SWA uses the O(n) strategy to solve the problem, in comparison to Rosetta KIC it requires less computational time.

In total, considering the best of 5 models, for 27 of 35 cases SWA produces sub-angstrom accuracy. But looking at the five best models of these 27 models show that best RMSD does not corresponds to the best lowest energy model. Also, in some cases Rosetta KIC produces better RMSD models while energy wise it is worse than SWA. This shows Rosetta energy function requires improvement specially in its solvent model (where it fails the most).

SWA and blind test:

  • SWA was used to predict 4 loops of a protein which its native structure was not released. SWA started with a model where the loop regions were removed and achieved sub-angstrom accuracy (Rosetta KIC achieved this for 3 out of the 4 cases).
  • SWA loop prediction accuracy of 0.53 Å for a RNA/protein target on a comparative model (instead of X-ray model) shows its ability in prediction complex structures.


SWA method has been successful in predicting protein loops with sub-angstrom accuracy. Of significance are prediction of RNA-Protein model target and loop lengths of 24 residue. Although it provides atomic-accuracy predictions, SWA requires 5000 CPU hours (which is achievable with current processing powers) for 12 length loops. While Monte Carlo and refinement-based methods can predict loops in hundreds of CPU hours. SWA computational time can be improved by considering minimization of several factors in the build up pathway and the use of experimental constraints.

SWA method can be used to guide and assist ab-initio prediction of protein structures and in protein folding. Also it may have application in ab inito modeling problems such as fitting high-res protein structures in low-res electron density maps or prediction of NMR structures using sparse chemical shift data. In addition, stepwise ansatz offers solutions to design of protein interfaces which require simultaneous optimizing of side-chain conformation, side-chain identity and back bones.

Journal Club: A Novel Insight into Gene Ontology Semantic Similarity

In a past journal club, I presented a paper by Xu et al on their method of quantifying semantic similarity of proteins based on the Gene Ontology. The Gene Ontology (GO) is a directed acyclic graph (DAG, c.f. Figure 1) of terms with which a protein or a gene can be annotated. These terms are designed to describe the process the protein is involved in, what role it fulfils in this process, and where in the cell it is localized. Thus, when comparing the function of two proteins in a quantifiable way, it has become standard to refer back to the GO terms these proteins are annotated with and compare these based on their relationship in the DAG.

Schematic Diagram of a Directed Acyclic Graph (DAG).

Figure 1: Schematic Diagram of a Directed Acyclic Graph (DAG).

As opposed to many methods, which measure the importance of a node (GO-term) in the DAG as its information content given an external database, the method proposed by Xu et al measures semantic similarity independently of external resources, which gives it the appearance of an independent measure. Furthermore, it claims to be good at dealing with badly annotated proteins, which is often a big problem in functional similarity calculations.

The similarity measure is a hybrid between node-based and edge-based methods, and is seemingly inspired by Wang et al’s 2007 paper and Shen et al’s 2010 paper. It is based on what they call “Shortest Semantic Differentiation Distance” or (SSDD), which is calculated over the shortest distance between two GO-terms on the DAG. When comparing the GO-terms A and B, the shortest path is measured by traversing the DAG upwards from node A to the lowest common ancestor of both nodes and down to node B.

The SSDD calculation over the shortest path is based on their so-called semantic Totipotency values assigned to the terms in the DAG that are part of the shortest path. The semantic Totipotency, T, of a term is calculated by:

Semantic Totipotency Measure

Semantic Totipotency Measure

where the weight, ω, is given by:

Weight, ω

Weight, ω

Here, Dst(t) denotest the number of descendents of the term t, and tp denotes the parent term of term t. Thus, the T-value of every node is both an expression of the depth of the DAG in this area and the coverage.

Finally, the SSDD is calculated by:

Semantic Similarity Differentiation Distance

Shortest Semantic Differentiation Distance

And subsequently the similarity of two GO terms is measured by:

Screenshot from 2014-01-27 12:47:46



In their paper Xu et al showed the method to be competitive compared to other methods which compute protein functional similarity by pairwise GO-term comparisons, while also outperforming a common graph-based method in simUI. While these results look promising, the biological interpretability of such a semantic similarity measure remains difficult.

The strongest advantage of the SSDD method proposed was however its alleged invariance to annotation richness of proteins, which was presented as shown in Figure 2 below (Figure 5 in the paper).

Figure 2: The performance of difference methods dealing with sets of proteins with difference annotation richness.

Figure 2: The performance of difference methods dealing with sets of proteins with difference annotation richness.

The results in this figure show that SSDD exhibits only a slight decrease in Pearson Correlation Coefficient to a set of reference similarity values for proteins which are less well annotated. This ability to deal with badly annotated proteins is the true value of the SSDD method proposed by Xu et al. However, this investigation was performed by sets of proteins selected by the authors, and should thus be validated independently to confirm these surprising results.