OPIG goes Punting!

Last Wednesday, it was the oh-so-very traditional OPIG Punting day (OPunting for those more acronym-prone).


To my surprise, the weather was spectacular! It was warm and sunny, a truly perfect day for punting. We set off from our alcove offices with determination in our hearts and, more importantly, with delicious snacks and a significant amount of Pimms and G&T. Everything was set for a truly amazing day.

Our group took over 5 punts from the Cherwell Boathouse, constituting what I like to think of as a small fleet of avid punters and merriment-seekers. We punted all the way up the Cherwell, past the lovely Victoria’s Arms into lands unknown (or into some grassy meadows in the vicinity of Oxford). Fortunately no keys were thrown overboard and no one fell off the punts (well, at least not accidentally). Yet, as usual, OPunting was very eventful! Following the customs of our group, we had the traditional punting race. I may have been too busy gorging on Pimms during the race, but if memory does not fail me, the race was won by Hannah (who can be seen in the photo below doing her swimming victory lap).

Hannah, in her victory lap swim…

During the punting, we also discovered that Bernhard had some previously unknown Viking ancestry (Austrian Vikings?), which manifested itself in an impetus to ram his punt into others. Suffice it to say that he paved the way for “Dodgems Punts”, a ride that will become popular in fun fairs and amusement parks in 2027.

Other than that, the weather was so great that many of us decided to go for a lovely swim in the Cherwell.

Swimmers

After a refreshing pint at Victoria’s, we made our way back to conclude yet another successful OPunting day!

3Dsig and ISMB 2014 (Boston)

This year we headed to Boston for 3Dsig and ISMB 2014. The outcome was excellent, with James Dunbar giving a talk at 3Dsig on “Examining variable domain orientations in antigen receptors gives insight into TCR-like antibody design” and Alistair Martin giving an oral poster presentation at ISMB on “Unearthing the structural information contained within mRNA”. The Deane group received the most votes from the judges in the 3Dsig poster competition, with James Dunbar’s poster winning the best poster prize and Jinwoo Leem’s and Reyhaneh Esmaielbeiki’s posters receiving honourable mentions (all presented posters are here).

Soooo Excited!

This blog post contains a brief description of what we found most interesting at the conference.

N-terminal domains in two-domain proteins are biased to be shorter and predicted to fold faster than their C-terminal counterparts.
Authors: Jacob, Etai; Unger, Ron; Horovitz, Amnon

Chosen by: Alistair Martin

It is not surprising that protein misfolding is selected against in the genome, with aggregation of misfolded proteins being associated with an array of diseases. It is suggested that multi-domain proteins are more prone to misfolding and aggregation due to an effectively higher local protein concentration. Jacob et al. investigated what mechanisms have developed to avoid this, focussing on ~3000 two-domain proteins contained within Swiss-Prot.

They found that there are notable differences between the N- and C-terminal domains. Firstly, there exists a large propensity for the C-terminal domain to be longer than the N-terminal domain (1.6 times more likely). Secondly, the absolute contact order (ACO) is greater in the C-terminal domain than in the N-terminal domain. Both length and ACO correlate inversely with folding speed, and thus they draw the conclusion that the first domain produced by the ribosome is under pressure to fold faster than the latter domain. These observations are enhanced in prokaryotes and lessened in eukaryotes, thereby suggesting a relationship to translational speed and cotranslational folding.
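To make the ACO metric concrete, here is a minimal Python sketch of computing the absolute contact order of a domain from its Cα coordinates. The contact definition used here (Cα atoms within 8 Å and at least three residues apart) is an illustrative assumption, not necessarily the exact definition used by Jacob et al.

```python
# A minimal sketch of absolute contact order (ACO): the average sequence
# separation of residue pairs in contact. Contact definition is an assumption.
import numpy as np

def absolute_contact_order(ca_coords, cutoff=8.0, min_separation=3):
    """ca_coords: (N, 3) array of C-alpha coordinates, ordered by sequence."""
    coords = np.asarray(ca_coords)
    n = len(coords)
    separations = []
    for i in range(n):
        for j in range(i + min_separation, n):
            if np.linalg.norm(coords[i] - coords[j]) <= cutoff:
                separations.append(j - i)   # sequence separation of this contact
    return sum(separations) / len(separations) if separations else 0.0
```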

A novel tool to identify molecular interaction field similarities
Authors: Matthieu Chartier and Rafael Najmanovich

Chosen by: Claire Marks

One of the prize-winning talks at 3Dsig this year was “A novel tool to identify molecular interaction field similarities”, presented by Matthieu Chartier from the Université de Sherbrooke. The talk introduced IsoMIF – a new method for finding proteins that have similar binding pockets to one another. Unlike previous methods, IsoMIF does not require the superposition of the two proteins, and therefore the binding sites of proteins with very different folds can be compared.
IsoMIF works by placing probes at various positions on a grid in the binding pocket and calculating the interaction potential. Six probe types are used: hydrogen bond donor and acceptor, cation, anion, aromatic group, and hydrophobic group. A graph-matching algorithm then finds similar pockets to a query using the interaction potentials for each probe type at each grid point. On a dataset of nearly 3000 proteins, IsoMIF achieved good AUC values of 0.74-0.92.
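As a rough illustration of the grid-and-probe idea (and emphatically not IsoMIF's actual implementation), the sketch below labels each grid point in a pocket with the probe types it could plausibly satisfy, using crude distance rules. The real interaction potentials and the graph-matching step that compares two pockets without superposition are omitted; all thresholds and atom typings here are assumptions.

```python
# Toy molecular-interaction-field labelling: for each grid point, record which
# of the six probe types could interact with nearby protein atoms.
import numpy as np

PROBES = ("hbond_donor", "hbond_acceptor", "cation", "anion", "aromatic", "hydrophobic")

def probe_labels(grid_points, atoms):
    """atoms: list of (xyz, chemical_type) tuples; returns {point_index: set of probes}."""
    labels = {}
    for idx, p in enumerate(grid_points):
        hits = set()
        for xyz, atom_type in atoms:
            d = np.linalg.norm(np.asarray(p) - np.asarray(xyz))
            if 2.5 <= d <= 4.0:                 # crude "interaction shell" (assumed)
                if atom_type == "acceptor":
                    hits.add("hbond_donor")     # a donor probe pairs with an acceptor atom
                elif atom_type == "donor":
                    hits.add("hbond_acceptor")
                elif atom_type == "anionic":
                    hits.add("cation")
                elif atom_type == "cationic":
                    hits.add("anion")
                elif atom_type == "aromatic":
                    hits.add("aromatic")
                elif atom_type == "apolar":
                    hits.add("hydrophobic")
        labels[idx] = hits
    return labels
```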


The promise of evolutionary coupling for 3D biology
Presenter: Debora Marks


Chosen by: Reyhaneh Esmaielbeiki

I was impressed by the keynote talk given by Debora Marks at 3Dsig. She gave an overview of how her group detects evolutionary couplings (EC) between residues in proteins and how they use this information to predict folds. In general, looking at interacting residues in a 3D structure and comparing those positions in an MSA reveals a co-evolving relationship. The challenge is to solve the inverse problem, going from sequence to structure, since not all co-evolving residues are necessarily close in 3D space (this relationship is shown in the figure below).

Evolutionary coupling in sequence and structure

Debora showed that previous studies for detecting co-evolving residues used Mutual Information (MI). However, comparing MI predictions with contact maps shows that these methods perform poorly. This is because MI looks at the statistics of one pair of residues at a time, while residues in proteins are highly coupled and pairs are not independent of other pairs. Therefore, MI works well for RNA but not for proteins. Debora’s group used mean-field and pseudo-likelihood maximisation approaches to overcome the limitations of MI and introduced the EVcoupling tool for predicting EC (Marks et al. PLoS One, 6(12), 2011). They have used the predicted ECs as distance restraints to predict the 3D structures of proteins using EVfold. Using EVfold they have managed to build structures with 2-5 Å accuracy.
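To make the MI baseline concrete, here is a minimal sketch of the mutual information between two alignment columns. The mean-field and pseudo-likelihood corrections that EVcoupling applies to disentangle direct from indirect couplings are deliberately not shown; this is only the simple pairwise statistic the talk argued against.

```python
# Mutual information between two columns of a multiple sequence alignment.
from collections import Counter
from math import log

def mutual_information(msa, i, j):
    """msa: list of equal-length aligned sequences; i, j: column indices."""
    n = len(msa)
    col_i = Counter(seq[i] for seq in msa)
    col_j = Counter(seq[j] for seq in msa)
    pairs = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        p_a = col_i[a] / n
        p_b = col_j[b] / n
        mi += p_ab * log(p_ab / (p_a * p_b))   # sum over observed residue pairs
    return mi
```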

In more recent work, they built EVfold-membrane, which is specific for membrane proteins (Hopf et al. Cell, 149(7), 1607-1621, 2012), and tried modelling membrane proteins with unknown experimental structures. Close homologues of these structures have recently been released, and comparisons show that the EVfold-membrane models have an accuracy of 3 to 4 Å.

She also discussed the use of detected ECs in identifying functional residues involved in ligand binding and conformational changes, with an interesting example of two GPCRs, the adrenergic beta-2 receptor and an opioid receptor (paper link).

She concluded her talk with her recent work on EVcomplex (paper link). The aim is to detect ECs between two different protein chains and use this information as distance restraints in docking software. Although this method has provided models of the E. coli ATP synthase, there are currently several limitations (all mentioned in the Discussion of the paper) to using this approach on a large scale.

In general, EC was a popular topic at the conference, with interesting posters from the UCL Bioinformatics group.

Characterizing changes in the rate of protein-protein dissociation upon interface mutation using hotspot energy and organization
Authors: Rudi Agius, Mieczyslaw Torchala, Iain H. Moal, Juan Fernández-Recio, Paul A. Bates

Chosen by: Jinwoo Leem

This was a paper by Agius et al., published in 2013 in PLOS Computational Biology. Essentially, the work centred on finding a set of novel descriptors to characterise mutant proteins and ultimately predict the koff (dissociation rate constant) of mutants. Their approach is quite unique: they perform two rounds of alanine-scanning mutagenesis, one on the wild-type (WT) and one on the mutant protein. They identify ‘hotspot’ residues as those that show a change in ΔΔG of 2 kcal/mol (or more) upon alanine scanning, and the descriptor is formed from the summation of the energy changes of the hotspots.
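A small sketch of the hotspot-style descriptor described above. The residue identifiers in the usage line are purely illustrative, and the alanine-scanning energies themselves would come from the authors' force-field calculations; only the thresholding and summation are shown here.

```python
# Hotspot descriptor: sum the alanine-scanning energy changes of residues whose
# ddG is at or above a threshold (2 kcal/mol in the paper).
def hotspot_descriptor(ala_scan_ddg, threshold=2.0):
    """ala_scan_ddg: {residue_id: ddG in kcal/mol from alanine scanning}."""
    hotspots = {res: ddg for res, ddg in ala_scan_ddg.items() if ddg >= threshold}
    return sum(hotspots.values()), sorted(hotspots)

# Illustrative usage (hypothetical residue IDs and energies): the descriptor is
# computed for the wild type and for the mutant, and the two are compared.
wt_sum, wt_hotspots = hotspot_descriptor({"H:TYR52": 3.1, "L:ASP30": 1.2, "H:ARG98": 2.4})
```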

The results were very encouraging: from their random forest-trained model with hotspot descriptors, they see correlations with koff of up to 0.79. However, the authors show that traditional ‘molecular’ descriptors (e.g. statistical potentials) perform just as well, with a correlation of 0.77. The exact contribution of their ‘hotspot’ descriptors to the prediction of koff seems unclear, especially considering how well molecular descriptors perform. Having said this, the paper shows a very unique way to approach the issue of predicting the effects of mutations on proteins, and on a larger dataset with a more diverse range of proteins (not necessarily mutants, but different proteins altogether!) these ‘hotspot’-specific methods may prove to be much more predictive.

On the Origin and Completeness of Ligand Binding Pockets with applications to drug discovery
Authors: Mu Gao & Jeffrey Skolnick.
Presented by Jeffrey Skolnick

Chosen by: Nicholas Pearce

The prevalence of ligand-binding pockets in proteins enables a wide range of biochemical reactions to be catalysed in the cell. Jeffrey Skolnick presented research proposing that ligand-binding pockets are inherent in proteins. One mechanism that he hypothesised for the creation of these pockets is the mis-stacking of secondary structure elements, leading to imperfections in their surfaces – pockets. Using their method for calculating pocket similarity – APoc – Gao & Skolnick characterised the space of ligand-binding pockets, using artificial structures and structures from the PDB, and found it to be ~400-1000 pockets in size.

They suggest that the relatively small size of pocket-space could be one of the reasons that such a large amount of off-target promiscuity is seen in drug design attempts.

From this result, Skolnick went on to discuss several interesting possibilities for the evolutionary history of ligand-binding pockets. One of the interesting hypotheses is that many, if not all, proteins could have inherent low-level catalytic ability across a wide range of biochemical reactions. Motivated by the small size of pocket-space, it is conceivable that one protein would be able to catalyse many different reactions – this could give an insight into the evolutionary history of protein catalysis.

If primordial proteins could catalyse many different reactions, albeit inefficiently, this would give a possibility for how the first lifeforms developed. Nature need only then work on increasing specificity and efficiency from a background of weak catalytic ability. However, even through the course of evolution to produce more specific proteins, this background of activity could remain, and explain the background of ‘biochemical noise’ that is seen in biological systems.

Drug promiscuity and inherent reactivity may not only be present due to the small size of pocket-space – they may be a direct consequence of evolution and the fundamental properties of proteins.

Alistair Martin & codon!

James Dunbar presenting good stuff about antibodies

Antibody CDR-H3 Modelling with Prime

In a blog post from last month, Konrad discussed the most recent Antibody Modelling Assessment (AMA-II), a CASP-like blind prediction study designed to test the current state of the art in antibody modelling. In the second round of this assessment, participants were given the crystal structures of ten antibodies with their H3 loops missing – the loop usually found in the centre of the binding site that is largely responsible for the binding properties of the antibody. The groups of researchers were asked to model this loop in its native environment. Modelling this loop is challenging, since it is much more variable in sequence and structure than the other five loops in the binding site.

For eight out of the ten loops, the Prime software from Schrodinger (the non-commercial version of which is called PLOP) produced the most accurate predictions. Prime is an ab initio method, meaning that loop conformations are generated from scratch (unlike knowledge-based methods, which use databases of known loop structures). In this algorithm, described here, a ‘full’ prediction job is made up of consecutive ‘standard’ prediction jobs. A standard prediction job involves building loops from dihedral angle libraries – for each residue in the sequence, random phi/psi angles are chosen from the libraries. Loops are built in halves – lots of conformations of the first half are generated, along with many of the second half, and then all the first halves are cross-checked against the second halves to see whether any of them meet in the middle. If so, then the two halves are melded and a full loop structure is made. All loop structures are then clash-checked using an overlap factor (a cutoff on how close two atoms can get to each other). Finally, the loops are clustered, and a representative structure has its side chain conformations predicted and its energy minimised.
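As an illustration of the overlap-factor clash check mentioned above, the sketch below rejects a loop conformation if any pair of atoms comes closer than a given fraction of the sum of their van der Waals radii. The radii and cutoff are typical textbook values, not Prime's actual parameters, and a real implementation would exclude covalently bonded pairs.

```python
# Overlap-factor clash check: reject a conformation if distance / (sum of vdW
# radii) falls below a cutoff for any atom pair. Values below are illustrative.
import numpy as np

VDW_RADII = {"C": 1.7, "N": 1.55, "O": 1.52, "S": 1.8}   # angstroms (typical values)

def passes_clash_check(atoms, overlap_cutoff=0.7):
    """atoms: list of (element, xyz) tuples for the loop plus its surroundings."""
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            el_i, xyz_i = atoms[i]
            el_j, xyz_j = atoms[j]
            dist = np.linalg.norm(np.asarray(xyz_i) - np.asarray(xyz_j))
            overlap_factor = dist / (VDW_RADII[el_i] + VDW_RADII[el_j])
            if overlap_factor < overlap_cutoff:
                return False   # atoms closer than allowed: clash, discard the loop
    return True
```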

A full loop prediction job is made up of a series of standard jobs, with the goal of guiding the conformational search to focus on structures with low energy. The steps are as follows:

  • Initial – five standard jobs are run, with slightly different overlap factors.
  • Ref1 – the first refinement stage. The conformational space around the top 10 loops from each standard job of the Initial stage is explored further by constraining the distance between Ca atoms.
  • Fixed – the top 10 loops of all those generated so far are passed to this series of stages. To begin with, the first and last residues of the loop are excluded from the prediction and the rest of the loop is re-modelled. The top 10 loops after this are then taken to the second Fixed stage, where two residues at each end of the loop are kept fixed. This is repeated five times, with the number of fixed residues at each end of the loop being increased by one each time.
  • Ref2 – a second refinement stage, which is the same as the first, except tighter distance constraints are used.
  • Final  – all the loop structures generated are ranked according to their energy, and the lowest energy conformation is chosen as the final prediction.

In a recent paper, Prime was used to predict the structures of 53 antibody H3 loops (using the dataset of a previous RosettaAntibody paper). 91% of the targets were predicted with sub-2-angstrom accuracy, and 81% of predictions were sub-angstrom. Compared to RosettaAntibody, which achieved 53% and 17% for predictions below 2 Å and 1 Å respectively, this is very impressive. For AMA-II, however, where each group was required to give five predictions and some poor models were included in each group’s top five, it is apparent that ranking loop conformations is still a major challenge in loop modelling.

Sampling Conformations of Antibodies using MOSAICS

Much work has been done to study the conformational changes taking place in antibodies, particularly during the event of binding to an antigen. This has been done through comparison of crystal structures, circular dichroism, and recently with high-resolution single-particle electron microscopy. The ability to resolve domains within an antibody from single particles without any averaging made it possible to show distributions of properties such as the shape of a Fab domain, measured by the ratio of width to length. Some of the variation in structure seen involves very large-scale motions, but it is not known how conformational changes may be transmitted from the antigen-binding region to the Fc, and therefore influence effector function. Molecular dynamics simulations have been performed on some large antibody systems; however, none has been possible on a timescale that would provide information on the converged distributions of large-scale properties such as the angle between the Fab and Fc fragments.

In my short project with Peter Minary, I used MOSAICS to investigate the dynamics of an antibody Fab fragment, using the coarse-grained natural move Monte Carlo approach described by Sam a few weeks ago. This makes it possible to split a structure into units which are believed to move in a correlated way, and to propose moves for the components of each region together. The rate of sampling is accelerated in degrees of freedom which may have functional significance, for example the movement of the domains in a Fab fragment relative to one another (separate regions shown in the diagram below). I used ABangle to analyse the output of each sampling trajectory and observe any changes in the relative orientations of the VH and VL domains.

Fab region definitions for MOSAICS

Of particular interest would be any correlations between conformational changes in the variable and constant parts of the Fab fragment, as these could be involved in transmitting conformational changes between remote parts of the antibody. We also hoped to see in our model some effect of including the antigen in the simulation, bound to the antibody fragment as seen in the crystal structure. In the time available for the project, we were able to set up a model representing the Fab fragment and run some relatively short simulations to explore favoured conformational states and see how the set-up of regions affects the distributions seen. In order to draw conclusions about the meaning of the results, a much greater number of simulations will need to be run to ensure sampling of the whole conformational space.

Computational Antibody Affinity Maturation

In this week’s journal club, we reviewed a paper by Lippow et al. in Nature Biotechnology, which features a computational pipeline that is capable of maturing antibodies (Abs) by up to 140-fold. The paper itself discusses 4 test case Abs (D44.1, cetuximab, 4-4-20, bevacizumab) and uses changes in electrostatic energy to identify favourable mutations. Up to the point when this paper was published back in 2007, computational antibody design was an (almost) unexplored field of research – except for a study by Clark et al. in 2006, no one else had done anything like the work presented in this paper.

The idea behind the paper is to identify certain positions within the Ab structure for mutation and hopefully find an Ab with a higher binding affinity.

Pipeline

Briefly speaking, the group generated mutant Ab-antigen (Ag) complexes using a series of algorithms (dead-end elimination and A*), which were then scored by the group’s energy function to identify favourable mutations. Lippow et al. used the electrostatics term of their binding affinity prediction to estimate the effects of mutations on an Ab’s binding affinity. In other words, instead of examining their entire scoring function, which includes terms such as the van der Waals energy, the group only used changes in the electrostatic energy term as an indicator for proposing mutations. Overall, in 2 of the 4 mentioned test cases (D44.1 & cetuximab), the proposed mutations were experimentally tested to confirm their computational design pipeline – a brief overview of these two case studies is given below.
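As a rough sketch of the idea of ranking candidate mutations by an electrostatics-only change (and not Lippow et al.'s actual energy function, which uses continuum electrostatics with solvent screening), one could compare a crude Coulomb interaction energy across the interface for the wild type and each mutant model:

```python
# Toy electrostatics-only ranking of mutations: lower change in interface
# Coulomb energy relative to wild type ranks higher. Charges and coordinates
# would come from modelled structures; the dielectric of 4 is an assumption.
import numpy as np

def coulomb_interface_energy(ab_atoms, ag_atoms, dielectric=4.0):
    """ab_atoms, ag_atoms: lists of (charge_in_e, xyz); returns energy in kcal/mol."""
    energy = 0.0
    for q1, x1 in ab_atoms:
        for q2, x2 in ag_atoms:
            r = np.linalg.norm(np.asarray(x1) - np.asarray(x2))
            energy += 332.0 * q1 * q2 / (dielectric * r)   # 332 converts e^2/A to kcal/mol
    return energy

def rank_mutations(wt_ab_atoms, ag_atoms, mutant_models):
    """mutant_models: {mutation_name: mutant antibody atom list}; lower delta is better."""
    e_wt = coulomb_interface_energy(wt_ab_atoms, ag_atoms)
    deltas = {name: coulomb_interface_energy(atoms, ag_atoms) - e_wt
              for name, atoms in mutant_models.items()}
    return sorted(deltas.items(), key=lambda kv: kv[1])
```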

Results

In the case of the D44.1 anti-lysozyme Ab, the group proposed 9 single mutations by their electrostatics-based calculation method; 6/9 single mutants were confirmed to be beneficial (i.e., the mutant had an increased binding affinity). The beneficial single mutants were combined, ultimately leading to a quadruple mutant structure with a 100-fold improvement in affinity. The quadruple mutant was then subjected to a second round of computer-guided affinity maturation, leading to a new variant with six mutations (effectively a 140-fold improvement over the wild-type Ab). This case study was a solid testimony to the validity of their method; since anti-lysozyme Abs are often used as model systems, these results demonstrated that their design pipeline had taken, in principle, a suitable approach to maturing Abs in silico.

The second case study with cetuximab was arguably the more interesting result. Like the D44.1 case above, mutations were proposed to increase the Ab’s binding affinity on the basis of the changes in electrostatics. Although the newly-designed triple mutant only showed a 10-fold improvement over its wild-type counterpart, the group showed that their protocols can work for therapeutically-relevant Abs. The cetuximab example was a perfect complement to the previous case study — it demonstrated the practical implications of the method, and how this pipeline could potentially be used to mature existing Abs within the clinic today.

Effectively, the group suggested that mutations that either introduce hydrophobicity or a net charge at the binding interface tend to increase an Ab’s binding affinity. These conclusions shouldn’t come as a huge surprise, but it was remarkable that the group reached them with just one term from their energy function.

Conclusions

Effectively, the paper set off a whole new series of possibilities and helped us to widen our horizons. The paper was by no means perfect, especially with respect to predicting the precise binding affinities of mutants – much of this error can be traced back to the modelling stage of their pipeline. However, the paper showed that computational affinity maturation is not just a dream – in fact, it showed that it’s perfectly doable, and immediately applicable. Interestingly, Lippow et al.’s manipulation of an Ab’s electrostatics seems to be a valid approach, with recent publications on Ab maturation showing that introducing charged residues can enhance binding affinity (e.g. Kiyoshi et al., 2014).

More importantly, the paper was a beautiful showcase of how computational analyses could inform the decision making process in an in vitro framework, and I believe it exemplified how we should approach our problems in bioinformatics. We should not think of proteins as mere text files and numbers, but realise that they are living systems, and we’re not yet at a point where we fully understand how proteins behave. This shouldn’t discourage us from research; instead, it should give us the incentive to take things more slowly, and develop a method/product that could be used to solve greater, pragmatic problems.

Le Tour de Farce v2.0

In what is becoming the highlight of the year and a regular occurrence for the OPIGlets, Le Tour de Farce – the annual OPIG bike ride – took place on the 4th of June. Now in its 2.0 revision but maintaining a route similar to last year’s, 9.5 miles and several pints later, approximately 20 of us took in some distinctly pretty Oxfordshire scenery, not to mention The White Hart, The Trout, Jacobs Inn and, for some, The One and The Punter too.


Antibody modeling via AMA II and RosettaAntibody

Intro

Protein modeling is one of the most challenging problems in bioinformatics. We still lack a clear theoretical framework that would allow us to link a linear protein sequence to its native 3D coordinates. Given that we only have structures for roughly one in a thousand of the known sequences, homology modeling is still one of the most successful methods for obtaining a structure from a sequence. Currently, using homology modeling and the 1393 known folds, we can produce models for more than half of the known domains. In many cases this is good enough to get an overall idea of the fold, but for actual therapeutic applications there is still a need for high-resolution modeling.

There is one group of molecules whose properties can be readily exploited via computational approaches for therapeutic applications: antibodies. With blockbuster drugs such as Humira, Avastin or Remicade, they are the leading class of biopharmaceuticals. Antibodies share a great degree of similarity with one another (>50-60% sequence identity) and there are at least 1865 antibody structures in the PDB. Therefore, homology modeling of these structures at high resolution becomes tractable, as exemplified by WAM and PIGS. Here, we will review the antibody modeling paradigm using one of the most successful antibody modeling tools, RosettaAntibody, concluding with the most recent progress from AMA II (antibody CASP).

General Antibody-antigen modeling

Modeling of antibody structures can be divided into the following steps:

  1. Identification of the Framework template
  2. Optimizing Vh/Vl orientation of the template
  3. Modeling of the non-H3 CDRs
  4. Modeling of H3

Most of the diversity of antibodies can be found in the CDRs. Therefore, the bulk of the protein can be readily copied from the framework region. This however needs to undergo an optimization of the Vh/Vl orientation. Prediction of the CDRs is more complicated since they are much more variable than the rest of the protein. Non-H3 CDRs can be modeled using canonical structure paradigms. Prediction of H3 is much more difficult since it does not appear to follow the canonical rules.

When the entire structure is assembled, it is recommended to perform refinement using some sort of relaxation of the structure, coupled with an energy function which should guide it.

RosettaAntibody

The RosettaAntibody protocol roughly follows the one described above. In the first instance, an appropriate template is identified by the highest BLAST bit scores; the best heavy and light chains are aligned to the best-BLAST-scoring Fv region. The knowledge base here is a set of 569 antibody structures from SACS with resolutions of 3.5 Å and better. The Vh/Vl orientation is subsequently refined using local relaxation, guided by CHARMM.

Non-H3 CDRs are modeled using the highest-scoring BLAST hit of the same length. Canonical information is not taken into account. Loops are grafted on the framework using the residues overlapping with the anchors.

H3 loops are modeled using a fragment-based approach. The fragment library is Rosetta+H3 from the knowledge base of antibody structures created for the purpose of this study. The low-resolution search consists of Monte Carlo attempts to fit 3-residue fragments, followed by Cyclic Coordinate Descent loop closure. This is followed by a high-resolution search in which the H3 loop and the Vh/Vl orientation are repacked using a variety of moves.

Each decoy coming from the repacking is scored using the Rosetta energy function. The lower the Rosetta score, the better the decoy (according to Rosetta).

Results

RosettaAntibody can produce high-quality models (1.4 Å) on its 54-structure benchmark test. The major limitation of the method (just like any other antibody modeling method) is H3 loop modeling. H3 is believed to be the most important loop, and therefore getting this loop right is a major challenge.

The right framework and the correct orientation of Vh/Vl have a great effect on the quality of H3 predictions. When H3 was modeled on the correct framework, the predictions were an order of magnitude better than when using the homology model. This was demonstrated using native recovery in the RosettaAntibody study, as well as during ‘Step II’ of the Antibody Modelling Assessment, where participants were asked to model H3 using the correct framework.

Journal club (Bernhard Knapp): MMPBSA Binding Free Energy Calculations

This week’s topic of the journal club was Molecular Mechanics Poisson−Boltzmann Surface Area (MMPBSA) binding free energy calculations between ligand and receptor using molecular dynamics (MD) simulations. As an example I selected:

David W. Wright, Benjamin A. Hall, Owain A. Kenway, Shantenu Jha, and Peter V. Coveney. Computing Clinically Relevant Binding Free Energies of HIV-1 Protease Inhibitors. J Chem Theory Comput. Mar 11, 2014; 10(3): 1228–1241

The first question is: why do we need such rather complex and computationally expensive approaches if other (e.g. empirical) scoring functions can do similar things? The main challenge is that simple scoring functions often do not work very well for systems they were not calibrated on (e.g. Knapp et al. 2009 (http://www.ncbi.nlm.nih.gov/pubmed/19194661)). The reasons for this are manifold. MD-based approaches can improve on two major limitations of classical docking/scoring functions:

1) Proteins are not static. Both the ligand and the receptor can adopt various slightly different configurations, even for one binding site. Therefore, scoring one ligand configuration against one receptor configuration does not give the whole picture. The first improvement is to consider many different configurations when scoring one ligand position:


2) A more physics based scoring function can be more reliable than a simple and run-time efficient scoring function. On the basis of the MD simulations a variety of different terms can be deduced. These include:

ΔG_bind ≈ ΔE_MM + ΔG_PB + ΔG_SA − TΔS

– MM stands for Molecular Mechanics. Its internal energy includes bond stretching, bending, and torsion terms. The electrostatic part is calculated using a Coulomb potential, while the van der Waals term is calculated using a Lennard-Jones potential.
– PB stands for Poisson−Boltzmann. It covers the polar solvation part i.e. the electrostatic free energy of solvation.
– SA stands for Surface Area. It covers the non-polar solvation part via a surface tension weighted solvent accessible surface area calculation.
– TS stands for the entropy loss of the system. This term is necessary because the non-polar solvation incorporates an estimate of the entropy changes implicitly but does not account for an entropy change upon receptor/ligand formation in vacuo. This term is calculated on the basis of a normal mode analysis.

If all these terms are calculated for every single frame of the MD simulations and those single values are averaged, an estimate of the binding free energy of the complex can be obtained. However, this estimate might not represent the actual mean of the spatial distribution. Therefore, at least 50 replica MD simulations are needed per investigated complex. In this context, a replica means an identically parameterised simulation of the same complex in which only the initial forces are assigned randomly.
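A minimal sketch of the bookkeeping described above, combining the per-frame terms into a binding free energy estimate and averaging over replicas. The individual MM, PB, SA and −TΔS values would come from the MD, Poisson−Boltzmann and normal-mode analysis tools themselves; only the aggregation is shown here.

```python
# Combine per-frame MMPBSA-TS terms into a binding free energy estimate and
# average over independent replica simulations.
import numpy as np

def frame_dg(e_mm, g_pb, g_sa, minus_t_ds):
    """All terms are already differences: complex minus (receptor + ligand)."""
    return e_mm + g_pb + g_sa + minus_t_ds

def mmpbsa_estimate(replicas):
    """replicas: list of trajectories, each a list of per-frame (MM, PB, SA, -TdS) tuples."""
    replica_means = [np.mean([frame_dg(*frame) for frame in traj]) for traj in replicas]
    return np.mean(replica_means), np.std(replica_means)   # overall mean and spread over replicas
```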

On the basis of the described MMPBSA-TS approach in combination with 50 replicas, the authors achieve a reasonable correlation (0.63) for the 9 FDA-approved HIV-1 protease inhibitors with known experimental binding affinities. If the two largest complexes are excluded, the correlation improves to an excellent value (0.93).

In a current study we are using the same methodology for peptide/MHC interactions. This system is completely different from the protease inhibitor study of Wright et al.: the ligands are peptides and the binding site is a groove consisting of two alpha-helices. The method was applied as is (without calibration or any kind of training). Preliminary data still show a high correlation with experimental values for the peptide/MHC system. This indicates that this MMPBSA approach can yield reliable predictions for very different systems without further modification.

Natural Move Monte Carlo: Sampling Collective Motions in Proteins

Protein and RNA structures are built up in a hierarchical fashion: from linear chains and random coils (primary) to local substructures (secondary) that make up a subunit’s 3D geometry (tertiary) which in turn can interact with additional subunits to form homomeric or heteromeric multimers (quaternary). The metastable nature of the folded polymer enables it to carry out its function repeatedly while avoiding aggregation and degradation. These functions often rely on structural motions that involve multiple scales of conformational changes by moving residues, secondary structure elements, protein domains or even whole subunits collectively around a small set of degrees of freedom.

The modular architecture of antibodies makes them a good example of this phenomenon. Using MD simulations and fluorescence anisotropy experiments, Kortkhonjia et al. observed that Ig domain motions in their antibody of interest correlated on two levels: 1) with laterally neighbouring Ig domains (i.e. VH with VL and CH1 with CL) and 2) with their respective Fab and Fc regions.

Correlated Motion

Correlated motion between all residue pairs of an antibody during an MD simulation. The axes identify the residues whereas the colours light up as the correlation in motion increases. The individual Ig domains as well as the two Fabs and the Fc can be easily identified. ref: Kortkhonjia, et al., MAbs. Vol. 5. No. 2. Landes Bioscience, 2013.

This begs the question: Can we exploit these molecular properties to reduce dimensionality and overcome energy barriers when sampling the functional motions of metastable proteins?

In 2012, Sim et al. published an approach that allows these collective motions (they call them “Natural Moves”) to be incorporated into simulation. Using simple RNA model structures, they showed that explicitly sampling large structural moves can significantly accelerate the sampling process in their Monte Carlo simulation. By gradually introducing DOFs that propagate increasingly large substructures of the molecule, they managed to reduce the convergence time by several orders of magnitude. This can be ascribed to the resulting reduction of the search space, which narrows down the sampling window. Instead of sampling all possible conformations that a given polynucleotide chain may take, structural states that differ from the native state predominantly in tertiary structure are explored.
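As a toy illustration of the core idea (and not the actual MOSAICS or Sim et al. move set, which also handles chain closure and residue-level flexibility), the sketch below proposes moves on a whole user-defined region at once and accepts or rejects them with a standard Metropolis test:

```python
# Toy "natural move" Monte Carlo step: a proposal acts on a collective degree of
# freedom (here, a rigid translation of one region) rather than a single residue.
import numpy as np

def propose_region_shift(coords, region_indices, max_shift=0.5):
    """Rigid-body translate one region; coords is an (N, 3) array of atom positions."""
    new = coords.copy()
    new[region_indices] += np.random.uniform(-max_shift, max_shift, size=3)
    return new

def metropolis_step(coords, region_indices, energy_fn, kT=0.6):
    """energy_fn maps coordinates to an energy; kT in the same units (assumed)."""
    trial = propose_region_shift(coords, region_indices)
    delta = energy_fn(trial) - energy_fn(coords)
    if delta <= 0 or np.random.rand() < np.exp(-delta / kT):
        return trial    # accept the collective move
    return coords       # reject and keep the old conformation
```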

Reduced Dimensionality

Reducing the conformational search space by introducing Natural Moves. A) Ω1 (residue-level flexibility) represents the cube, Ω2 (collective motions of helices) spans the plane and Ω3 (collective motions of Ω2 bodies) is shown as a line. B) By integrating multiple layers of Natural Moves the dimensionality is reduced. ref: Sim et al. (2012). PNAS 109(8), 2890–5. doi:10.1073/pnas.1119918109

It is important to stress, however, that in addition to these rigid body moves local flexibility is maintained by preserving residue level flexibility. Consequently, the authors argue, high energy barriers resulting from large structural rearrangements are reduced and the resulting energy landscape is smoothened. Therefore, entrapment in local energy minima becomes less likely and the acceptance rate of the Monte Carlo simulation is improved.

Although benchmarking of this method has mostly relied on case studies involving model RNA structures with near-perfect symmetry, the method has a natural link to near-native protein structure sampling. Similarly to RNA, proteins can be decomposed into local substructures that may be responsible for the main functional motions in a given protein. However, due to the complexity of protein motion and the scarcity of experimental data, our understanding of protein dynamics is limited. This makes it a challenging task to identify suitable decompositions. As more dynamic data emerge from biophysical methods such as NMR spectroscopy, and as databases such as www.dynameomics.org are extended, we will be able to better approximate protein motions with Natural Moves.

In conclusion, when applied to suitable systems and when used with care, there is an opportunity to breathe life into the static macromolecules of the PDB, which may help to improve our understanding of the heterogeneous structural landscape and the functional motions of metastable proteins and nanomachines.

Protein Folding: Man vs Machine

In 1996 Garry Kasparov, the reigning world chess champion, played IBM’s Deep Blue, a computer whose sole purpose was to play chess better than any human. Losing the first game, Garry sprang back swiftly, defeating Deep Blue 4-2 over the remaining games. However, his success was short-lived. In a rematch with an updated Deep Blue the following year, the score was 3.5-2.5 to the computer. The media (and IBM) declared this a pivotal moment in history, where a machine had proven itself better than humanity’s champion at a game deemed a highly intellectual pursuit. The outcry was that the age of machines had arrived. Was it true? Should humanity have surrendered to its machine overlords at that moment? Obviously the answer is a large and resounding no. However, this competition allows for an insightful comparison between the manner in which humans and computers play chess and think. By comparing the two, we learn the strengths and weaknesses of both parties, from which we can make combined approaches that may exceed either.

Firstly, let’s discuss the manner in which a computer “plays” chess. It simply searches all possible configurations of moves that are available and picks the most optimal. However, things are not that simple. Consider only the opening sequence: there are 20 possible moves a player can make, so after only a single move by each player there are 400 possible chess positions. This count grows exponentially fast; after 5 moves by each player there are approximately 5 million combinations. For comparison, it was estimated that Deep Blue could analyse 2 million positions per second. However, since this is not nearly fast enough to examine all possible games from start to end on a reasonable timescale, computers cannot foresee lines of play which are far in the distance. To overcome this, in the early game the computer will use a reference table developed by grandmasters that lists both common openings and the assumed best manner to respond to them. Obviously, these are only assumed to be optimal and have never been completely tested. In short, machines participate through brute force, utilising their intricate ability to perform calculations at high speed to find the best move. However, the search is too large in the initial and end stages of a game to be completely thorough, so a reference table is instead used to “inform” the correct move at these times.


While a human can quite easily see that the following board leads to a draw, computers cannot draw the same conclusion without huge effort.

In contrast, human players use far more visual and spatial recognition alongside both memory and calculation to pick their moves. Like a computer, a player will analyse a portion of the moves available at any given moment. Though since a human cannot compete on computation speed with a computer, they cannot analyse nearly the same magnitude of moves. Hence, this subset of moves chosen for analysis must contain the most optimal move(s) to compete against the computer’s raw power. This is where the visual and spatial recognition abilities of humans come to bear. Firstly, a human can easily dissect the board into pieces worth considering and those to be ignored. For example, consider a possible move that would result in your queen being exposed and then taken. A human would conclude this as bad (normally) and discard further moves leading from such a play. A computer, however, would explore the resultant board state. One can see how this immediately and drastically reduces the required search. Another human ability is that a player will often be able to see sub-structures within a full set-up that are common in the game and hence can be processed in a known manner. In other words, the game is broken down into fragments which can be processed far more easily and with less computation. Obviously, both of the above techniques rely on prior knowledge of chess to be useful, but they are based upon our human ability to perceive both the substructure of the game and the overall picture with relative ease.

So how does all this chess talk relate to protein folding? In 2010, the Baker group, creators of the ROSETTA protein fold prediction program, produced the protein folding game “Foldit”. In Foldit the general public could attempt to fold proteins for themselves and try to get closer to the native structure than the computer algorithms. Although simplified in presentation compared to academic structural biology, it was hoped that the visual and spatial reasoning abilities of humans, the same ones that differentiated them from machines at chess, would prove useful in protein structure prediction. A key issue within ROSETTA drove this train of thought: the fact that it is relatively bad at fully exploring the conformational space. Often, it will get stuck in one general configuration and not explore the fold space fully. Furthermore, due to the size of the configuration space, this is not easily overcome with simulated annealing, given the sheer scale of the problem. The ability of humans to view the overall picture meant that it should be easier for them to see other possible configurations. As end goals for Foldit, it was hoped that structures that proved unsolvable by current algorithms would be solved by humans, and also that new techniques would emerge, as the “moves” employed by players to achieve high scores could be studied.

To make a comparison of the structures produced by Foldit players and ROSETTA viable, the underlying energy “scores” that judge a structure are the same between the programs. It is assumed, though this is not always true, that the better the score, the closer you are to the native fold. In addition, Foldit players were also able to use a set of optimisation tools that were deterministic and would alter the backbone and side chains to the most optimal local configuration for the arrangement the player had made. This meant that players could focus predominantly on altering the overall structure of the protein rather than the fine detail, such as the position of side chains. To make the game as approachable as possible, technical terms were replaced by common analogues and visual cues were displayed to highlight poor-scoring areas of the protein. For example, clashes between atoms are shown via large spiked red orbs, while the backbone is coloured from green to red depending on how well buried the hydrophobic residues on that segment are. To drive players, gamification elements were also included, such as leaderboards and rewarding “fireworks” as graphical effects.

To objectively compare the ability of the player base to that of the ROSETTA algorithm, they performed blind predictions on a set of 10 proteins whose structures were not in the public domain. This was run in a similar manner to CASP, for those familiar with that set-up. The results exemplified the innate human ability of visual and spatial recognition. In 5 of the cases the player base performed significantly better than the ROSETTA program. In 3 of the cases they performed similarly. And in the remaining 2 cases the ROSETTA algorithm performed better, though in both of these the model produced was still extremely far from the native structure. Looking through the cases individually, it was identified that the most crucial element was that players were able to deal with large rearrangements that ROSETTA struggled with, including register shifts and strand swapping. This highlights the ability of humans to view the overall picture and to persevere through “bad-scoring patches” to reach a more optimal configuration.


Comparison of Foldit players’ solutions (green) to ROSETTA’s solutions (red) and the native 2KPO protein structure (blue). The players correctly identified a strand swap needed to reach the native form, while this large reconfiguration was not seen by ROSETTA.

Since the release of the game and the accompanying paper in 2010, Foldit has received much praise for conveying the field of protein folding in an approachable manner to so many people. In addition, the player base has contributed to science as a whole. In 2011 the player base successfully solved the structure of a protein from M-PMV, a retrovirus, which had been unobtainable via normal means. Then in 2012, by analysing the common set of moves employed by the player base, they collectively produced an algorithm that outperforms previously published fold prediction methods. Personally, I think of Foldit as a fun and relatively intuitive game that introduces the core elements of the protein folding problem. As to its scientific merit, I’m unsure how much impact it will continue to have. As Saulo discussed last week, if infinite monkeys have infinite time then Shakespeare will be reproduced. Likewise, if enough people manipulate a protein structure, eventually the best structure will be found. Though who am I to judge; if people find the game fun, then there are far worse pastimes one can have than trying to solve structures. As a finishing note, I would be extremely interested in using Foldit to teach structural biology in the future, though I feel it is overall too simple for a university setting.