Author Archives: Samuel Demharter

Is contacts-based protein-protein affinity prediction the way forward?

The binding affinity of protein interactions is useful information for a range of protein engineering and protein-protein interaction (PPI) network challenges. Obvious applications include the development of therapeutic antibodies to given drug targets or the engineering of novel interfaces for synthetic protein complexes. An accurate model would furthermore allow us to predict a large proportion of affinities in existing PPI networks, and enable the identification of new PPIs, which is critical for our ability to model protein network dynamics effectively.

“The design of an ideal scoring function for protein−protein docking that would also predict the binding affinity of a complex is one of the challenges in structural proteomics.” Adapted from Kastritis, Panagiotis L., and Alexandre MJJ Bonvin. Journal of proteome research 9.5 (2010): 2216-2225.

In last week’s paper a new binding-affinity prediction method based on interfacial contact information was described. Contacts have long been used to in docking methods but surprisingly this was the first time that binding affinity was predicted with them. Largely, this was due to the lack of a suitable benchmark data set that contained structural as well as affinity data . In 2011, however, Kastritis et al. presented a curated database of 144 non-redundant protein–protein complexes with experimentally determined Kd (ΔG) as well as x-ray structures.
Using this data set they trained and validated their method, compared it against others and concluded that interfacial contacts `can be considered the best structural property to describe binding strength`. This claim may be true but as we discussed in the meeting there is still some work to do before we take this model an run with it. A number of flags were raised:

Classification of experimental methods into reliable and non-reliable is based on what gives the best results with their method. Given that different types of protein complexes are often measured with different methods, some protein classes for which contact-based predictions are less effective may be excluded.
Number of parameters for model 6 is problematic without exact AIC information. As Lyuba righlty pointed out, the intercept in model 6 `explodes`. It is no surprise that the correlation improves with more parameters. Despite their AIC analysis, overfitting is still a worry due to the lack of details presented in the paper.

Comparison against other methods is biased in their favour; their method was trained on the same data set, the others were not. In order to ensure a fair comparison all methods should be trained on the same data set. Of course this is hard to do in practice, but the fact remains that a comparison of methods that has been trained on different data sets will be flawed.

Paper: Vangone, A., Bonvin, A. M. J. J., Alberts, B., Aloy, P., Russell, R., Andrusier, N., … Zhou, Y. (2015). Contacts-based prediction of binding affinity in protein-protein complexes. eLife, 4, e07454. http://doi.org/10.7554/eLife.07454

“Identifying Allosteric Hotspots with Dynamics”: Using molecular simulation to find critical sites for the functional motions in proteins

Allosteric (allo-(“other”) + steric (repulsion of atoms due to closeness or arrangement)) sites regulate protein function from a position other than the active site or binding site. Consider the latch on a pair of gardening scissors (Figure 1): depending on the position of the latch (allosteric site) the blades are prevented from cutting things at the other end (active site).

Figure 1 Allostery explained: A safety latch in gardening scissors.

Due to the non-trivial positions of allosteric sites in proteins their identification has been challenging. Selected well characterised systems such as GPCRs have known allosteric sites that are being used as targets in drug development. However, large scale identification of allosteric sites across the Protein Data Base (PDB) has not yet been feasible, partly because of the lack of tools.

To tackle this problem the Gerstein Lab developed a computational protocol based on various molecular simulations and network methods to find allosteric hotspots in proteins across the PDB. They introduce two different pipelines; one for identifying allosteric residues on the surface (surface-critical) and one for buried residues (interior-critical).

For the search of exterior-critical residues they us MC simulations to repeatedly probe the surface of the protein with short Monte Carlo (MC) with a short peptide. Based on hard spheres and simple energy calculations this method seems to be an efficient way of detecting possible binding pockets. Once the binding pockets have been found, the collective motions of the structure are simulated using an elastic mass-and-spring network (an anisotropic network model [ANM]). Binding pockets that undergo significant deformation during these simulations are considered to be surface-critical.

For interior-critical residues they start by weighting residue-residue contacts on the basis of collective movement. Communities within the weighted network are then identified and the residues with the highest betweenness interactions between communities are chosen as interior-critical residues. Thus, interior-critical residues have the highest information flow between two densely inter-connected groups of residues.

The protocol as been implemented in STRESS (STRucturally identified ESSential residues) and is freely available at stress.molmovdb.org.

Publication: http://www.ncbi.nlm.nih.gov/pubmed/27066750

A designed conformational shift to control protein binding specificity

Proteins and their binding partners with complementary shapes form complexes. Fisher was onto something when he introduced the “key and lock mechanism” in 1896. For the first time the shape of molecules was considered crucial for molecular recognition. Since then there have been various attempts to improve upon this theory in order incorporate the structural diversity of proteins and their binding partners.

Koshland proposed the “induced fit” mechanism in 1956, which states that interacting partners undergo local conformational changes upon complex formation to strengthen binding. An additional mechanism “conformational selection” was introduced by Monod, Wyman and Changeux who argued that the conformational change occurred before binding driven by the inherent conformational heterogeneity of proteins. Given a protein that fluctuates between two states A and B and a substrate C that only interacts with one of these states, the chance of complex formation depends on the probability of our protein being in state A or B. Furthermore, one could imagine a scenario where a protein has multiple binding partners, each binding to a different conformational state. This means that some proteins exists in an equilibrium of different structural states, which determines the prevalence of interactions with different binding partners.

Figure 1. The “pincer mode”.

Based on this observation Michielssens et al. used various in-silico methods to manipulate the populations of binding-competent states of ubiquitin in order to change its protein binding behaviour. Ubiquitin is known to take on two equally visited states along the “pincer mode” (the collective motion describing the first PCA-eigenvector); closed and open.

Figure 2. A schematic of the conformational equilibrium of ubiquitin that can take on a closed or open state. Depending on its conformation i can bind different substrates.

Different binding partners prefer either the closed, open or both states. By introducing point mutations in the core of ubiquitin, away from the binding interface, Michielssens et al. managed to shift the conformational equilibrium between open and closed states, thereby changing binding specificity.

Point mutations were selected according to the following criteria:

⁃ residues must be located in the hydrophobic core
⁃ binding interface must be unchanged by the mutation
⁃ only hydrophobic residues may be introduced (as well as serine/threonine)
⁃ glycine and tryptophan were excluded because of their size

Figure 3. Conformational preference of ubiquitin mutants. ddG_mut = dG_open – dG_closed.

Fast growth thermal integration (FGTI), a method based on molecular dynamics, was used to calculate the relative de-/stabilisation of the open/closed state caused by each mutation. Interestingly, most mutations that caused stabilisation of the open states were concentrated on one residues, Isoleucine 36 (Slide 7).
For the 15 most significant mutations a complete free energy profile was calculated using Umbrella sampling.

Figure 4. Free energy profiles for six different ubiquitin mutants, calculated using umbrella sampling simulations. Mutants preferring the closed substate are shown in red, open substate stabilizing mutants are depicted in blue, those without a preference are shown in gray.

Figure 5. Prediction of binding free energy differences between wild-type ubiquitin and different point mutations (ddG_binding = dG_binding,mutant􏰵 – dG_binding,wild-type).

To further validate that they correctly categorised their mutants into stabilising the open or closed state, six X-ray structure of ubiquitin in complex with a binding partner that prefers either the open or closed state were simulated with each of their mutations. Figure 5 shows the change in binding free energy that is caused by the mutation in compatible, neutral and incompatible complexes (compatible may refer to an “open favouring mutation” (blue) in an open complex (blue) and vice versa).

Figure 6. Comparison of change in binding free energy predicted from the calculated results for ubiquitin and the experimental result.

In their last step a selection of open and closed mutations was introduced into an open complex and the change in binding free energy was compared between experiment (NMR) and their simulations. For this example, their mutants behaved as expected and an increase in binding free energy was observed when the closed mutations were introduced into the open complex while only subtle changes were seen when the “compatible” closed mutations were introduced.

The authors suggest that in the future this computational protocol may be a corner stone to designing allosteric switches. However, given that this approach requires pre-existing knowledge and is tailored to proteins with well defined conformational states it may take some time until we discover its full potential.

Investigating structural mechanisms with customised Natural Moves

This is a brief overview of my current work on a protocol for studying molecular mechanisms in biomolecules. It is based on Natural Move Monte Carlo (NMMC), a technique pioneered by one of my supervisors, Peter Minary.

NMMC allows the user to decompose a protein or nucleic acid structure into segments of atoms/residues. Each segment is moved collectively and treated as a single body. This gives rise to a sampling strategy that considers the structure as a fluid of segments and probes the different arrangements in a Monte Carlo fashion.

Traditionally the initial decomposition would be the only one that is sampled. However, this decomposition might not always be optimal. Critical degrees of freedom might have been missed out or chosen sub-optimally. Additionally, if we want to test the causality of a structural mechanism it can be informative to perform NMMC simulations for a variety of decompositions. Here I show an example of how customised Natural Moves may be applied on a DNA system.

Investigating the effect of epigenetic marks on the structure of a DNA toy model

Figure 1. Three levels at which epigenetic activity takes place. Figure adopted from Charles C. Matouk, and Philip A. Marsden Circulation Research. 2008;102:873-887

Epigenetic marks on DNA nucleotides are involved in regulating gene expression (Point 1 in figure 1). We have a limited understanding of the underlying molecular mechanism of this process. There are two mechanisms that are thought to be involved: 1) Direct recognition of the epigenetic mark by DNA binding proteins and 2) indirect recognition of changes in the local DNA structure caused by the epigenetic mark. Using customised Natural Moves we are currently trying to to gain insight into these mechanisms.

One type of epigenetic mark is the 5-hydroxymethylation (5hm) on cytosines. Lercher et al. have recently solved a crystal structure of the Drew-Dickerson Dodecamer with two of these epigenetic marks attached. Given the right sequence context (e.g. CpG) they have found that this epigenetic mark can form a hydrogen bond between two neighbouring bases on the same strand. This raises the question: Can an intra-strand CpG hydrogen bond alter the DNA helical parameters? We used this as a toy model to test our technology.

Figure 2b

Figure 2a

Figure 2a shows the Drew-Dickerson Dodecamer schematically. Figure 2b shows the three sets of degrees of freedom that we used as test cases.

Top (11): Default sampling – Serves as a reference simulation.
Middle (01): Fixed 5hm torsions – By forcing the hydroxyl group of 5hmC towards the guanine we significantly increase the chance of the hydrogen bond forming.
Bottom (00): Collective movement of the neighbouring bases 5hmC and G – By grouping the two bases into a segment we aim to emulate the dampening effect that a hydrogen bond may have on their movements relative to each other.

By modifying these degrees of freedom (1=active/ 0=inactive) we attempted to amplify any effects that the CpG hydrogen bond may have on the DNA structure.

Below you can see animations of the three test cases 11, 01, 00 during simulation, zoomed in on the 5hm modification and the neighbouring base pair:

It appeared that the default degrees of freedom were not sufficient to detect a change in the DNA structure when comparing simulations of unmodified and modified structures. The other two test cases with customised Natural Moves, however, showed alterations in some of the helical parameters.

Lercher et al. saw no differences in their modified and unmodified crystal structures. It seems that single 5hm epigenetic marks are not sufficient to significantly alter DNA structure. Rather, clusters of these modifications with accumulated structural effects may be required to cause significant changes in DNA helical parameters. CpG islands may be a promising candidate.

@samdemharter

Augmented Modelling with Natural Move Monte Carlo Simulations

In the last group meeting I reported on the progress that I have made regarding the development of a protocol for the systematic use of Natural Move Monte Carlo simulations.

Natural Move Monte Carlo simulations
Natural Moves are degrees of freedom that describe the collective motion of groups of residues. In DNA this might be the concerted motion of a double helix; in proteins this could be the movement of a stable secondary structure element such as a beta-sheet. These segments are joined by so called melting areas. At each simulation step the segments are propagated independently in an MC fashion. The resulting chain breaks are resolved by a chain closure algorithm that acts on the melting areas. This results in a reduction of degrees of freedom of several orders of magnitude. Therefore, large complexes and conformational changes can be sampled more effectively.

In order to get sensible results, however, the initial decomposition of the system is important. The challenge is to accurately represent the plasticity of the system, while keeping the number of degrees of freedom as small as possible. Detailed insight into the flexibility of the system might be gained from experimental sources such as NMR or computational methods such as MD simulations and Normal Mode Analysis. This can help with defining segments and melting areas. However, there are many systems for which this data is not available. Even if it is, there is no guarantee that the segmentation is correct.

Therefore, I am developing a protocol that allows for the evaluation of a range of different test cases that each reflect a unique set of segments and melting areas.

Augmented Modelling Protocol
This protocol is aimed at the systematic evaluation of NMMC segmentations. It allows researchers to feed experimental information, biological knowledge and educated guesses into molecular simulations and so provides a framework for testing competing hypotheses. The protocol has four steps.

Step 1: Segmentation of the system into low-level segments
The initial segmentation contains all possible areas of flexibility that may play a role in conformational changes in the system of interest. This decision may be influenced by many sources. For now, however, we only consider secondary structure information. Helices and beta strands are treated as potential segments. Unstructured regions such as kinks, loops and random coils are treated as melting areas. For a small fold with four helices we get the segmentation shown in figure 1a.

Step 2: Formulate test cases
Generate multiple test cases that reflect hypotheses about the mechanism of interest. In this step we try to narrow down the degrees of freedom as much as possible in order to retain sampling efficiency. This is done by selectively deactivating some melting areas that were defined in step 1. For a system with three melting areas that can either be on or off, 2^3 = 8 different test cases may be generated (example shown in figure 1b).

Figure 1 a) Segmentation of a small α-fold. The blue rectangles represent α-helices. The dashed lines indicate the presence of melting areas I, II and III. Each melting area can be switched on or off (1/0) b) Example of a test case in which the first of three melting area is switched off. c) The six degrees of freedom along which a segment is propagated.

Step 3: Perform simulations
Sample the conformational space of all test cases that were generated in step 2. We generally use Parallel Tempering or Simulated Tempering algorithm to accelerate the sampling process. These methods rely on the modulation of temperature to overcome energy barriers.

Step 4: Evaluate results
Score the results against a given control and rank the test cases accordingly. The scoring might be done by comparing experimental distributions of observables with those generated by simulations (e.g. Kullback-Leibler divergence). A test case that reproduces desired expectation values of observables might then be considered as a candidate hypothesis for a certain structural mechanism.

What’s next?
I am currently working on example uses for this protocol. These include questions regarding aspects of protein folding and the stability of the empty MHC II binding groove.

Natural Move Monte Carlo: Sampling Collective Motions in Proteins

Protein and RNA structures are built up in a hierarchical fashion: from linear chains and random coils (primary) to local substructures (secondary) that make up a subunit’s 3D geometry (tertiary) which in turn can interact with additional subunits to form homomeric or heteromeric multimers (quaternary). The metastable nature of the folded polymer enables it to carry out its function repeatedly while avoiding aggregation and degradation. These functions often rely on structural motions that involve multiple scales of conformational changes by moving residues, secondary structure elements, protein domains or even whole subunits collectively around a small set of degrees of freedom.

The modular architecture of antibodies, makes them amenable to act as an example for this phenomenon. Using MD simulations and fluorescence anisotropy experiments Kortkhonjia et al. observed that Ig domain motions in their antibody of interest were shown to correlate on two levels: 1) with laterally neighbouring Ig domains (i.e. VH with VL and CH1 with CL) and 2) with their respective Fab and Fc regions.

Correlated motion between all residue pairs of an antibody during an MD simulation. The axes identify the residues whereas the colours light up as the correlation in motion increases. The individual Ig domains as well as the two Fabs and the Fc can be easily identified. ref: Kortkhonjia, et al., MAbs. Vol. 5. No. 2. Landes Bioscience, 2013.

This begs the question: Can we exploit these molecular properties to reduce dimensionality and overcome energy barriers when sampling the functional motions of metastable proteins?

In 2012 Sim et al. have published an approach that allows for the incorporation of these collective motions (they call them “Natural Moves”) into simulation. Using simple RNA model structures they have shown that explicitly sampling large structural moves can significantly accelerate the sampling process in their Monte Carlo simulation. By gradually introducing DOFs that propagate increasingly large substructures of the molecule they managed to reduce the convergence time by several orders of magnitude. This can be ascribed to the resulting reduction of the search space that narrows down the sampling window. Instead of sampling all possible conformations that a given polynucleotide chain may take, structural states that differ from the native state predominantly in tertiary structure are explored.

Reducing the conformational search space by introducing Natural Moves. A) Ω1 (residue-level flexibility) represents the cube, Ω2 (collective motions of helices) spans the plane and Ω3 (collective motions of Ω2 bodies) is shown as a line. B) By integrating multiple layers of Natural Moves the dimensionality is reduced. ref: Sim et al. (2012). PNAS 109(8), 2890–5. doi:10.1073/pnas.1119918109

It is important to stress, however, that in addition to these rigid body moves local flexibility is maintained by preserving residue level flexibility. Consequently, the authors argue, high energy barriers resulting from large structural rearrangements are reduced and the resulting energy landscape is smoothened. Therefore, entrapment in local energy minima becomes less likely and the acceptance rate of the Monte Carlo simulation is improved.

Although benchmarking of this method has mostly relied on case studies involving model RNA structures with near perfect symmetry, this method has a natural link to near-native protein structure sampling. Similarly to RNA, proteins can be decomposed into local substructures that may be responsible for the main functional motions in a given protein. However, due to the complexity of protein motion and limited experimental data we have a limited understanding of protein dynamics. This makes it a challenging task to identify suitable decompositions. As more dynamic data emerges from biophysical methods such as NMR spectroscopy and databases such as www.dynameomics.org are extended we will be able to better approximate protein motions with Natural Moves.

In conclusion, when applied to suitable systems and when used with care, there is an opportunity to breathe life into the static macromolecules of the pdb, which may help to improve our understanding of the heterogeneous structural landscape and the functional motions of metastable proteins and nanomachines.

Journal Club: Native contacts in protein folding

Like your good old headphone cables, strings of amino acids have the potential to fold into a vast number of different conformations given the appropriate conditions. A conservative estimation for the time it would take a 100 residue protein to explore all theoretically possible conformations would exceed the age of the Universe several times. This is obviously not feasible and was pointed out by Levinthal when he published his “How To Fold Graciously” in 1969.

The so called Protein-Folding Problem has since been under intense study, which inevitably has led to a few theories and models about its nature. Due to the lack of appropriate wet-lab methods to study this phenomenon theoretical, computational approaches have been key to devising impactful frameworks for formally describing protein folding. One of these goes under the name of principle of minimum frustration introduced by Bryngelson and Wolynes in the late 80s (1). It states that proteins by evolution were enriched for sequences with the propensity to fold into low-energy structures, while actively selecting against traps. By avoiding mis-folding and non-native contacts, the theory says, a smooth funnel-like energy landscape with native-state minima is created that ensures robust and fast folding.

This implies that native contacts, i.e. residues that interact in the fully folded protein play a major role in the folding process. Gō models (2), named after Nobuhiro Gō who first proposed this method, are based around this assumption with the energetic contributions of native interactions acting as the sole driving forces in the folding process. While this approach has yielded promising results, many of which were in concordance with experiments, its underlying principles have never been validated in a statistically meaningful way.

A schematic for native-contact-driven protein folding

In 2013 a study by Best, Hummer and Eaton (3) formally addressed this question. By devising a set of statistical quantities aimed at weighting the importance of native and non-native interactions for folding and applying these to the analysis of several long MD folding simulations they were able to show a “native-centric mechanism” for small fast-folding proteins.

In a first step it was assessed whether the fraction of native contacts provided a suitable reaction coordinate for the simulated folding events. From their equilibrium simulations two thresholds of native-contact-fractions were chosen that defined folded and unfolded states (a two-state model is assumed). Overlaying the values for the most visited native-contact-fractions during simulation against these thresholds revealed a strong correlation between the two equilibrium probability density maxima and the protein’s fold state. In addition they showed that the range of native-contact-fractions between those found to represent unfolded and folded thresholds were indicative of being on a transition path (defined as the “.. regions of the trajectories that cross directly from the unfolded well to the folded well ..”).

A further measure was introduced with the contact lifetime test. The log-ratio of the time a contact spent on a transition path vs the time it existed in the unfolded state was calculated and compared in a heat-map to the native contact map coloured by the number of contacts between residues.

Contact life time test for a selected protein.
Adapted from (3).

Among others this result revealed a clear connection between contacts with longer transition path life times and the number of contacts they made in the native structure.

So what about non-native interactions?

Screenshot from 2014-03-27 12:47:04

One of the measures addressing this question was the Bayesian measure for non-native contacts on transition paths. In the examples used in this paper, no obvious link between being on a transition path given a non-native contact was found unless they were close to native contacts. Further criteria such as the complementary quantity, which is the probability of being on a transition path when a contact is not made, concluded in a similar fashion.

Interestingly, it was found that the one protein that was influenced by non-native contacts was the designed α3D. Best et al. reasoned that additional frustration introduced when building a protein with artificially introduced stability has led to a shifting of helix register giving rise to this outlier.

When taken together, these results lay a robust foundation for further studies along the same lines. It is too early to accept or reject the presented findings as universal truth, but strong arguments for the native-centric mechanism being a reasonable model in small fast-folding proteins have been made. It would not be far-fetched to think that larger proteins would adhere to similar principles with non-native contacts modulating the landscape, especially when considering individual downhill folding modules.

References:

(1) Bryngelson, J.D. et al., 1995. Funnels, pathways, and the energy landscape of protein folding: a synthesis. Proteins, 21(3), pp.167–95.

(2) Taketomi, H., Ueda, Y. & Gō, N., 1975. Studies on protein folding, unfolding and fluctuations by computer simulation. I. The effect of specific amino acid sequence represented by specific inter-unit interactions. International journal of peptide and protein research, 7(6), pp.445–59.

(3) Best, R.B., Hummer, G. & Eaton, W.A., 2013. Native contacts determine protein folding mechanisms in atomistic simulations. Proceedings of the National Academy of Sciences of the United States of America, 110(44), pp.17874–9.

Oxford Protein Informatics Group

or "OPIG" to friends