Category Archives: Group Meetings

What we discuss during cake at our Tuesday afternoon group meetings

Research Talk: Ligand Fitting in X-ray Crystallography

In the last group meeting, I reported on the performance of ligand-fitting programs for the automated building of ligand structures.

In Fragment Screens by X-ray Crystallography, a library of small compounds (fragments) is soaked into protein crystals, and the resulting structures are determined by diffraction experiments. Some of the fragments will bind to the protein (~5% of the library), and these are detected by their appearance in the derived electron density.

The models of binding fragments can be used to guide structure-based drug-design efforts, but first they must be built. Due to the large number of datasets (200-1000), automated identification of the fragments that bind and automated building of atomic models are required for efficient processing of the data.

Density Blobs

Anecdotally, available ligand-fitting programs are unreliable when modelling fragments. We tested three ligand-fitting programs by refitting a series of ligand structures, and found that they fail more frequently when the electron density for the ligand is weak. Many fragments that are seen to bind in screens do so only weakly, due to their size. So the weaker the fragment binds, the harder it will be for the automated programs to model it.

Success Rates Identifying the Correct Model

Models are usually ranked by the Real-Space Correlation Coefficient (RSCC) between the model and the experimental electron density. This metric is good at identifying ‘correct’ models, and an RSCC > 0.7 normally indicates a correct, or at least mostly correct, model.
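To illustrate the metric (this is a generic sketch, not the scoring code of any particular fitting program), an RSCC can be computed as a Pearson correlation between observed and model-calculated density values sampled on the same grid points around the ligand:

import numpy as np

def rscc(observed_density, model_density):
    """Real-space correlation coefficient between observed and model-calculated
    density values sampled on the same grid points around the ligand."""
    obs = np.array(observed_density, dtype=float).ravel()
    calc = np.array(model_density, dtype=float).ravel()
    obs -= obs.mean()
    calc -= calc.mean()
    return float(np.dot(obs, calc) / np.sqrt(np.dot(obs, obs) * np.dot(calc, calc)))

# A model would be flagged as correct (or mostly correct) if rscc(...) > 0.7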

Typically, the binding locations of ligands are found by searching for un-modelled peaks in the electron density map. Models are then generated in these locations, and are then scored and ranked. Good models can be identified and presented to the user. However, if a ‘good’ model is never generated to be scored and ranked, the RSCCs of the ‘bad’ models will not tell you that there is something to be modelled at a particular place, and binding may be missed…

This is especially true for weak-binding ligands, which do not give a large electron density peak as evidence that there is something there to be modelled.

Currently, all of the datasets must be inspected manually, to check that a weak-binding fragment has not been missed…

Augmented Modelling with Natural Move Monte Carlo Simulations

In the last group meeting I reported on the progress that I have made regarding the development of a protocol for the systematic use of Natural Move Monte Carlo simulations.

Natural Move Monte Carlo simulations
Natural Moves are degrees of freedom that describe the collective motion of groups of residues. In DNA this might be the concerted motion of a double helix; in proteins this could be the movement of a stable secondary structure element such as a beta-sheet. These segments are joined by so-called melting areas. At each simulation step the segments are propagated independently in a Monte Carlo fashion. The resulting chain breaks are resolved by a chain closure algorithm that acts on the melting areas. This reduces the number of degrees of freedom by several orders of magnitude, so large complexes and conformational changes can be sampled more effectively.

In order to get sensible results, however, the initial decomposition of the system is important. The challenge is to accurately represent the plasticity of the system, while keeping the number of degrees of freedom as small as possible. Detailed insight into the flexibility of the system might be gained from experimental sources such as NMR or computational methods such as MD simulations and Normal Mode Analysis. This can help with defining segments and melting areas. However, there are many systems for which this data is not available. Even if it is, there is no guarantee that the segmentation is correct.

Therefore, I am developing a protocol that allows for the evaluation of a range of different test cases that each reflect a unique set of segments and melting areas.

Augmented Modelling Protocol
This protocol is aimed at the systematic evaluation of NMMC segmentations. It allows researchers to feed experimental information, biological knowledge and educated guesses into molecular simulations and so provides a framework for testing competing hypotheses. The protocol has four steps.

Step 1: Segmentation of the system into low-level segments
The initial segmentation contains all possible areas of flexibility that may play a role in conformational changes in the system of interest. This decision may be influenced by many sources. For now, however, we only consider secondary structure information. Helices and beta strands are treated as potential segments. Unstructured regions such as kinks, loops and random coils are treated as melting areas. For a small fold with four helices we get the segmentation shown in figure 1a.

Step 2: Formulate test cases
Generate multiple test cases that reflect hypotheses about the mechanism of interest. In this step we try to narrow down the degrees of freedom as much as possible in order to retain sampling efficiency. This is done by selectively deactivating some melting areas that were defined in step 1. For a system with three melting areas that can either be on or off, 2^3 = 8 different test cases may be generated (example shown in figure 1b).


Figure 1 a) Segmentation of a small α-fold. The blue rectangles represent α-helices. The dashed lines indicate the presence of melting areas I, II and III. Each melting area can be switched on or off (1/0). b) Example of a test case in which the first of the three melting areas is switched off. c) The six degrees of freedom along which a segment is propagated.
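As a toy illustration of the enumeration in step 2 (not part of the actual protocol code), the test cases for a set of melting areas that can each be switched on or off can be listed directly; the melting-area labels below are placeholders matching figure 1:

from itertools import product

melting_areas = ["I", "II", "III"]  # hypothetical labels, as in figure 1

# Each test case is a mapping from melting area to on (1) / off (0).
test_cases = [dict(zip(melting_areas, states))
              for states in product([1, 0], repeat=len(melting_areas))]

print(len(test_cases))  # 2**3 = 8 test cases
print(test_cases[1])    # e.g. {'I': 1, 'II': 1, 'III': 0}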

Step 3: Perform simulations
Sample the conformational space of all test cases that were generated in step 2. We generally use the Parallel Tempering or Simulated Tempering algorithms to accelerate the sampling process. These methods rely on the modulation of temperature to overcome energy barriers.

Step 4: Evaluate results
Score the results against a given control and rank the test cases accordingly. The scoring might be done by comparing experimental distributions of observables with those generated by simulations (e.g. Kullback-Leibler divergence). A test case that reproduces desired expectation values of observables might then be considered as a candidate hypothesis for a certain structural mechanism.
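As a minimal sketch of this kind of scoring (the observable, binning and distributions below are made-up placeholders), the Kullback-Leibler divergence between an experimental and a simulated histogram over shared bins could be computed as:

import numpy as np

def kl_divergence(p_counts, q_counts, eps=1e-12):
    """KL divergence D(P || Q) between two histograms defined on the same bins."""
    p = np.asarray(p_counts, dtype=float) + eps
    q = np.asarray(q_counts, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

# Lower divergence from the experimental histogram = better-ranked test case.
bins = np.linspace(0.0, 180.0, 37)                       # e.g. an inter-domain angle
experimental, _ = np.histogram(np.random.normal(90, 10, 5000), bins=bins)
simulated, _ = np.histogram(np.random.normal(95, 12, 5000), bins=bins)
print(kl_divergence(experimental, simulated))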

What’s next?
I am currently working on example uses for this protocol. These include questions regarding aspects of protein folding and the stability of the empty MHC II binding groove.

The origins of exponential random graph models

The article An Exponential Family of Probability Distributions for Directed Graphs, published by Holland and Leinhardt (1981), set the foundation for what are now known as exponential random graph models (ERGM) or p* models, which jointly model the whole adjacency matrix (or graph) X. In this article they proposed an exponential family of probability distributions to model P(X=x), where x is a possible realisation of the random matrix X.

The article is mainly focused on directed graphs (although the theory can be extended to undirected graphs). Two main effects or patterns are considered in the article: reciprocity, which relates to the appearance of symmetric interactions (X_{ij}=1 \iff X_{ji}=1) (see nodes 3-5 of the figure below),

[Figure: directed graph; nodes 3-5 show reciprocated (symmetric) interactions]

and the differential attractiveness of each node in the graph, which relates to the number of interactions each node “receives” (in-degree) and the number of interactions each node “produces” (out-degree) (the figure below illustrates the differential attractiveness of two groups of nodes).

[Figure: directed graph illustrating the differential attractiveness of two groups of nodes]

The model of Holland and Leinhardt (1981), called the p1 model, which jointly considers the reciprocity of the graph and the differential attractiveness of each node, is:

p_1(x)=P(X=x) \propto e^{\rho m + \theta x_{**} + \sum_i \alpha_i x_{i*} + \sum_j \beta_j x_{*j}},

where \rho,\theta,\alpha_i,\beta_j are parameters, and \alpha_*=\beta_*=0 (identifying constraints). \rho can be interpreted as the mean tendency of reciprocation, \theta as the density (size) of the network, \alpha_i as the productivity (out-degree) of a node, and \beta_j as the attractiveness (in-degree) of a node.

The values m, x_{**}, x_{i*} and x_{*j} are: the number of reciprocated edges in the observed graph, the number of edges, the out-degree of node i and the in-degree of node j; respectively.

Taking D_{ij}=(X_{ij},X_{ji}), the model assumes that all D_{ij} with i<j are independent.
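To make the notation concrete, here is a small sketch (illustrative only) computing the sufficient statistics m, x_{**}, x_{i*} and x_{*j} from a binary adjacency matrix with no self-loops:

import numpy as np

def p1_statistics(x):
    """Sufficient statistics of the p1 model for a binary adjacency matrix x."""
    x = np.asarray(x)
    m = int(np.sum(x * x.T) // 2)      # m: number of reciprocated dyads
    n_edges = int(x.sum())             # x_{**}: total number of edges
    out_degree = x.sum(axis=1)         # x_{i*}: out-degree of each node
    in_degree = x.sum(axis=0)          # x_{*j}: in-degree of each node
    return m, n_edges, out_degree, in_degree

x = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 0, 0]])
print(p1_statistics(x))  # m=1 reciprocated dyad, 3 edges, out-degrees [1 2 0], in-degrees [1 1 1]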

To better understand the model, let’s review its derivation:

Consider the pairs D_{ij}=(X_{i,j},X_{j,i}),\,i<j and describe the joint distribution of \{D_{ij}\}_{ij}, assuming all D_{ij} are statistically independent. This can be done by parameterizing the probabilities

P(D_{ij}=(1,1))=m_{ij} \text{ if } i<j,

P(D_{ij}=(1,0))=a_{ij} \text{ if } i\neq j,

P(D_{ij}=(0,0))=n_{ij} \text{ if } i<j,

where m_{ij}+a_{ij}+a_{ji}+n_{ij}=1,\, \forall \, i<j .

This leads to

P(X=x)=\prod_{i<j} m_{ij}^{x_{ij}x_{ji}} \prod_{i\neq j}a_{ij}^{x_{ij}(1-x_{ji})} \prod_{i<j}n_{ij}^{(1-x_{ij})(1-x_{ji})}    =e^{\sum_{i<j} {x_{ij}x_{ji}} \rho_{ij} + \sum_{i\neq j}{x_{ij}} \theta_{ij}} \prod_{i<j}n_{ij},

where \rho_{ij}=log(m_{ij}n_{ij} / a_{ij}a_{ji}) for i<j, and \theta_{ij}=log(a_{ij}/n_{ij}) for i\neq j.

It can be seen that the parameters \rho_{ij} and \theta_{ij} can be interpreted as the reciprocity and differential attractiveness, respectively. With a bit of algebra we get:

exp(\rho_{ij})=[ P(X_{ij}=1|X_{ji}=1)/P(X_{ij}=0|X_{ji}=1) ]/[ P(X_{ij}=1|X_{ji}=0) / P(X_{ij}=0|X_{ji}=0) ],
and
exp(\theta_{ij})=P(X_{ij}=1|X_{ji}=0)/P(X_{ij}=0|X_{ji}=0).

Now, if we consider the following restrictions:

\rho_{ij}=\rho,\, \forall i<j, and \theta_{ij}=\theta+\alpha_i + \beta_j,\, \forall i\neq j where \alpha_*=\beta_*=0 .

then, with some algebra, we recover the proposed form of the model

p_1(x)=P(X=x) \propto e^{\rho m + \theta x_{**} + \sum_i \alpha_i x_{i*} + \sum_j \beta_j x_{*j}}.


HHSearch

Introduction and Methods:

Today I presented the paper which introduces HHsearch. HHsearch is mainly used for template detection and is still one of the best-known methods. Early methods for template detection, such as BLAST or FASTA, simply performed sequence-sequence comparison between a query protein and a database of targets. More advanced methods such as PSI-BLAST were then introduced, which perform profile-sequence comparisons. Profiles are built from multiple sequence alignments (MSAs) of protein families and describe the probability of each amino acid occurring in a column of the MSA. In other words, profiles describe the conservation of amino acids within a family. A remarkable advance in template detection came from methods performing profile-profile comparisons, such as COMPASS, PROF-SIM and LAMA.

Hidden Markov Models (HMMs) introduced a new way of building profiles, which resulted in methods performing sequence-HMM comparisons to detect templates. An HMM is similar to a state machine and is built from an MSA, where each column is given an ‘M’ (match) state. A match state emits amino acids with probabilities calculated from the MSA. In addition to the match state, each column of the MSA has an ‘I’ (insert) state and a ‘D’ (delete/gap) state (see the figure below for an example). There are transitions between states (shown by arrows), and all transitions also have probabilities. Given a sequence and an HMM, the sequence can be aligned to the HMM: there is a path through the HMM that corresponds to emitting the sequence, and a log-odds score associated with this path. Dynamic programming (the Viterbi algorithm) is used to find this path (similar to the Needleman-Wunsch algorithm for sequence-sequence alignments). More detail can be found here.


Example of an HMM. Taken from Bioinformatics Sequence and Genome Analysis, David W. Mount
(http://compbio.pbworks.com/w/page/16252909/Multiple%20Sequence%20Alignment)

The novel idea of HHsearch was that instead of performing sequence-HMM alignments, HMM-HMM alignments can be used for template detection. The authors therefore first derive the formula used to calculate the log-odds scores required by the Viterbi algorithm to find the best-scoring alignment path. The score for aligning two columns of two HMMs (query profile q and template profile t) is calculated as:

S_{aa}(q_i, t_j) = \log \sum_{a=1}^{20} q_i(a) t_j(a) / f(a),

where q_i(a) and t_j(a) are the emission probabilities of amino acid a in the two columns and f(a) is the background frequency of amino acid a.
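A direct transcription of this column score into code (the background amino-acid frequencies f are an input; this is an illustration, not HHsearch's implementation):

import numpy as np

def column_score(q_col, t_col, background):
    """Log-sum-of-odds score for aligning one query-HMM column with one
    template-HMM column, given background amino-acid frequencies."""
    q = np.asarray(q_col, dtype=float)      # emission probabilities, length 20
    t = np.asarray(t_col, dtype=float)
    f = np.asarray(background, dtype=float)
    return float(np.log(np.sum(q * t / f)))

# Two similar, strongly conserved columns score well above zero;
# two unrelated columns score near (or below) zero.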

Using this score, the best alignment between two profile HMMs is generated by maximizing the log-sum-of-odds score, as in the formula below:

S_{LSO} = \sum_k S_{aa}(q_{i_k}, t_{j_k}) + (the sum of the log transition probabilities along the alignment path),

where the first sum runs over all aligned column pairs (i_k, j_k).

Now that the scoring function is defined, the Viterbi algorithm is used to maximise it. For simplicity, HHsearch defines five pair states and the allowed transitions between them (see figure 1C of the paper). Five dynamic programming matrices are therefore needed to align the two HMMs.

There are two other components that can contribute to the final HHsearch score: 1) A correlation score, based on the idea that if two proteins are homologous then, once aligned, high-scoring (conserved) columns should cluster together; the higher this score, the more likely the proteins are to be homologous. This score can simply be added to the final best-alignment score from the Viterbi algorithm. 2) While aligning two columns of the two HMMs, secondary structure (SS) elements are scored using statistical scores derived by the authors, which take into account the confidence values of the SS predictions. HHsearch provides two sets of SS scores: predicted vs predicted, and predicted vs known; the latter is mainly used when performing 3D structure prediction. These scores are added to the Viterbi scoring function in formulae 5, 6 and 7 of the paper.

Results:

Dataset:

HHsearch performance was compared to BLAST (sequence-sequence), PSI-BLAST (profile-sequence), HMMER (HMM-sequence), and COMPASS and PROF-SIM (profile-profile). Five versions of HHsearch were benchmarked:

HHsearch 0 -> basic profile-profile comparison with fixed gap penalties
HHsearch 1 -> basic HMM-HMM comparison
HHsearch 2 -> HMM-HMM comparison including the correlation score
HHsearch 3 -> HMM-HMM comparison including the correlation score + predicted vs predicted SS
HHsearch 4 -> HMM-HMM comparison including the correlation score + predicted vs known SS

The dataset used in the comparison study consists of 3691 sequences taken from the SCOP database filtered at 20% sequence identity (SCOP-20). For each sequence an alignment is prepared using PSI-BLAST. These alignments are used to compare methods.

Detection of homologs:

The methods are compared on their ability to detect homologs. Homologs and non-homologs are defined as follows: if two domains of SCOP-20 are in the same SCOP superfamily, they are considered homologs; if the two domains are from different classes, they are considered non-homologs; all other pairs are classified as unknown.
Performance is compared by drawing TP (number of homolog pairs) vs FP (number of non-homolog pairs) curves for an all-against-all comparison of the data in SCOP-20. The highest curve represents the best-performing method; this curve is shown in Figure 2 of the paper. In this dataset the total number of TPs is 41505 and the total number of FPs is 1.08 x 10^7. The worst method is BLAST, which at an error rate of 10% finds only 22% of the homologs. Starting from BLAST, detection ability increases in the order:

BLAST < PSI-BLAST < HMMER < PROF-SIM < COMPASS < HHsearch 0 < HHsearch 1 < HHsearch 2 < HHsearch 3 < HHsearch 4

Studying the HHsearch results in detail, the authors realised that in some cases HHsearch groups (with high confidence) two domains from different superfamilies, or even different folds, as homologs. Looking at the structures of these proteins, the authors noticed that they are either locally or globally similar; proteins defined by SCOP to be in different superfamilies or folds might therefore actually be homologs. They consequently repeated the same test with new definitions of TP and FP: homologs (TPs) are domains from the same superfamily OR whose sequence-based alignment results in a structural alignment with a MaxSub score of at least 0.1 (MaxSub ranges from 0 to 1, where 0 is insignificant and 1 very significant); non-homologs (FPs) are domains from different classes AND with a MaxSub score of zero. Domains in neither category are classified as unknown.

Figure 3 of the paper displays the new results. Although figures 2 and 3 look similar, a few main points can be drawn from them: 1) at the same error rate, all methods except BLAST detect more homologs than in figure 2; 2) with the new definitions, the more sensitive tools, such as HHsearch 3 and 4, improve more than the others, since the new set of homologs is harder to detect and only sensitive tools can find them; 3) COMPASS and PROF-SIM perform better than HHsearch 0, which suggests they are better at remote homolog detection. To check this hypothesis, the authors plotted TP vs FP curves (figure 4 of the paper) using only TPs from different families. Not only did this confirm the hypothesis, it also showed that using SS information (HHsearch 3 and 4) detects more TPs; interestingly, the more evolutionarily divergent the pairs under study, the greater the contribution of SS to detection.

Alignment quality:

The quality of a homology model depends strongly on how well the query protein is aligned to the homolog. Therefore, the paper also compares all of the methods on their ability to build accurate alignments. To do so, for each aligned pair of sequences the 3D structures of the two proteins are superposed and the spatial distances are evaluated (similar to the MaxSub method). The authors also introduce the Sbalance score, which is similar to MaxSub but accounts for the over- and under-predictions that MaxSub fails to consider.

Comparing alignment qualities using the MaxSub score (figure 5 of the paper), the performance can roughly be summarised as:

BLAST < PSI-BLAST < PROF-SIM < COMPASS < HHsearch 0 < HMMER < HHsearch 1 < HHsearch 2 < HHsearch 3 < HHsearch 4

Again, the more sensitive methods are able to build better alignments for distant homologs. Comparing alignment qualities using the Sbalance score (figure 6 of the paper), the performance can roughly be summarised as:

BLAST < PSI-BLAST < HMMER < PROF-SIM < COMPASS < HHsearch 0 < HHsearch 1 < HHsearch 2 < HHsearch 4 < HHsearch 3

HMMER is now inferior to the profile-profile alignment methods, and HHsearch 3 is the winner. In general, HMM-HMM methods are superior to all other methods both for homolog detection and for building alignments. HHsearch 4 was shown to detect more related structures within the same superfamily and fold than the other methods. Building on the same idea, more accurate tools have since been developed by the same group, such as HHblits. More recently, a method using Markov Random Fields to detect homologs has been introduced, with better performance than HHsearch and HHblits.

Antibody CDR-H3 Modelling with Prime

In a blog post from last month, Konrad discussed the most recent Antibody Modelling Assessment (AMA-II), a CASP-like blind prediction study designed to test the current state-of-the-art in antibody modelling. In the second round of this assessment, participants were given the crystal structure of ten antibodies with their H3 loops missing – the loop usually found in the centre of the binding site that is largely responsible for the binding properties of the antibody. The groups of researchers were asked to model this loop in its native environment. Modelling this loop is challenging, since it is much more variable in sequence and structure than the other five loops in the binding site.

For eight out of the ten loops, the Prime software from Schrodinger (the non-commercial version of which is called PLOP) produced the most accurate predictions. Prime is an ab initio method, meaning that loop conformations are generated from scratch (unlike knowledge-based methods, which  use databases of known loop structures). In this algorithm, described here,  a  ‘full’ prediction job is made up of consecutive ‘standard’ prediction jobs. A standard prediction job involves building loops from dihedral angle libraries – for each residue in the sequence, random phi/psi angles are chosen from the libraries. Loops are built in halves – lots of conformations of the first half are generated, along with many of the second half, and then all the first halves are cross-checked against the second halves to see whether any of them meet in the middle. If so, then the two halves are melded and a full loop structure is made. All loop structures are then clash-checked using an overlap factor (a cutoff on how close two atoms can get to each other). Finally, the loops are clustered, and a representative structure has its side chain conformations predicted and its energy minimised.
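The ‘build two halves and check whether they meet’ idea can be caricatured as follows (the coordinate arrays, closure cutoff and function name are illustrative placeholders, not Prime's actual implementation):

import numpy as np

def loops_that_meet(first_halves, second_halves, closure_cutoff=0.75):
    """Cross-check independently built half-loops: keep pairs whose free ends
    come close enough to be melded into a full loop.

    first_halves / second_halves: lists of (n_atoms, 3) coordinate arrays;
    the last atom of a first half should meet the first atom of a second half.
    """
    full_loops = []
    for a in first_halves:
        for b in second_halves:
            if np.linalg.norm(a[-1] - b[0]) < closure_cutoff:
                full_loops.append(np.vstack([a, b]))   # meld the two halves
    return full_loops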

A full loop prediction job is made up of a series of standard jobs, with the goal of guiding the conformational search to focus on structures with low energy. The steps are as follows:

  • Initial – five standard jobs are run, with slightly different overlap factors.
  • Ref1 – the first refinement stage. The conformational space around the top 10 loops from each standard job of the Initial stage is explored further by constraining the distance between Ca atoms.
  • Fixed – the top 10 loops of all those generated so far are passed to this series of stages. To begin with, the first and last residues of the loop are excluded from the prediction and the rest of the loop is re-modelled. The top 10 loops after this are then taken to the second Fixed stage, where two residues at each end of the loop are kept fixed. This is repeated five times, with the number of fixed residues at each end of the loop being increased by one each time.
  • Ref2 – a second refinement stage, which is the same as the first, except tighter distance constraints are used.
  • Final  – all the loop structures generated are ranked according to their energy, and the lowest energy conformation is chosen as the final prediction.

In a recent paper, Prime was used to predict the structures of 53 antibody H3 loops (using the dataset of a previous RosettaAntibody paper). 91% of the targets were predicted with sub-2 Å accuracy, and 81% of predictions were sub-angstrom. Compared to RosettaAntibody, which achieved 53% and 17% for predictions below 2 Å and 1 Å respectively, this is very impressive. For AMA-II, however, where each group was required to submit five predictions and some poor models were included in each group’s top five, it is apparent that ranking loop conformations is still a major challenge in loop modelling.

Sampling Conformations of Antibodies using MOSAICS

Much work has been done to study the conformational changes taking place in antibodies, particularly upon binding to an antigen. This has been done through comparison of crystal structures, circular dichroism, and recently with high-resolution single-particle electron microscopy. The ability to resolve domains within an antibody from single particles without any averaging made it possible to show distributions of properties such as the shape of a Fab domain, measured by the ratio of width to length. Some of the variation in structure seen involves very large-scale motions, but it is not known how conformational changes may be transmitted from the antigen-binding region to the Fc, and therefore influence effector function. Molecular dynamics simulations have been performed on some large antibody systems; however, none has been possible on a time scale that would provide information on the converged distributions of large-scale properties such as the angle between the Fab and Fc fragments.

In my short project with Peter Minary, I used MOSAICS to investigate the dynamics of an antibody Fab fragment, using the coarse-grained natural move Monte Carlo approach described by Sam a few weeks ago. This makes it possible to split a structure into units which are believed to move in a correlated way, and to propose moves for the components of each region together. The rate of sampling is accelerated along degrees of freedom which may have functional significance, for example the movement of the domains in a Fab fragment relative to one another (separate regions shown in the diagram below). I used ABangle to analyse the output of each sampling trajectory and observe any changes in the relative orientations of the VH and VL domains.


Fab region definitions for MOSAICS

Of particular interest would be any correlations between conformational changes in the variable and constant parts of the Fab fragment, as these could be involved in transmitting conformational changes between remote parts of the antibody. We also hoped to see in our model some effect of including the antigen in the simulation, bound to the antibody fragment as seen in the crystal structure. In the time available for the project, we were able to set up a model representing the Fab fragment and run some relatively short simulations to explore favoured conformational states and to see how the definition of regions affects the distributions seen. In order to draw conclusions about the meaning of the results, a much greater number of simulations will need to be run to ensure sampling of the whole conformational space.

Antibody modeling via AMA II and RosettaAntibody

Intro

Protein modeling is one of the most challenging problems in bioinformatics. We still lack a clear theoretical framework that would allow us to link a linear protein sequence to its native 3D coordinates. Given that we only have structures for roughly one in a thousand known sequences, homology modeling is still one of the most successful methods for obtaining a structure from a sequence. Currently, using homology modeling and the 1393 known folds, we can produce models for more than half of the known domains. In many cases this is good enough to get an overall idea of the fold, but for actual therapeutic applications there is still a need for high-resolution modeling.

There is one group of molecules whose properties can be readily exploited via computational approaches for therapeutic applications: antibodies.  With blockbuster drugs such as Humira, Avastin or Remicade, they are the leading class of biopharmaceuticals. Antibodies share a great degree of similarity with one another (<50-60% sequence identity) and there are at least 1865 antibody structures in the PDB. Therefore, homology modeling of these structures at high resolution becomes tractable, as exemplified by WAM and PIGS. Here, we will review the antibody modeling paradigm using one of the most successful antibody modeling tools, RosettaAntibody, concluding with the most recent progress from AMA II (antibody CASP).

General Antibody-antigen modeling

Modeling of antibody structures can be divided into the following steps:

  1. Identification of the Framework template
  2. Optimizing Vh/Vl orientation of the template
  3. Modeling of the non-H3 CDRs
  4. Modeling of H3

Most of the diversity of antibodies can be found in the CDRs. Therefore, the bulk of the protein can be readily copied from the framework region. This however needs to undergo an optimization of the Vh/Vl orientation. Prediction of the CDRs is more complicated since they are much more variable than the rest of the protein. Non-H3 CDRs can be modeled using canonical structure paradigms. Prediction of H3 is much more difficult since it does not appear to follow the canonical rules.

When the entire structure is assembled, it is recommended to perform refinement using some sort of relaxation of the structure, coupled with an energy function which should guide it.

RosettaAntibody

The RosettaAntibody protocol roughly follows the steps described above. In the first instance, an appropriate template is identified by the highest BLAST bit scores, and the best heavy and light chains are aligned to the best-BLAST-scoring Fv region. The knowledge base here is a set of 569 antibody structures from SACS with resolutions of 3.5 Å or better. The Vh/Vl orientation is subsequently refined using local relaxation, guided by CHARMM.

Non-H3 CDRs are modeled using the highest-scoring BLAST hit of the same length. Canonical information is not taken into account. Loops are grafted onto the framework using the residues overlapping with the anchors.

H3 loops are modeled using a fragment-based approach. The fragment library is Rosetta+H3, built from the knowledge base of antibody structures created for the purpose of this study. The low-resolution search consists of Monte Carlo attempts to fit 3-residue fragments, followed by Cyclic Coordinate Descent loop closure. This is followed by a high-resolution search in which the H3 loop and the Vh/Vl orientation are repacked using a variety of moves.
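As a rough sketch of what a fragment-based low-resolution search looks like in general (the score function, fragment library and temperature below are placeholders; this is not the Rosetta code):

import math
import random

def fragment_insertion_mc(torsions, fragment_library, score, n_steps=1000, kT=1.0):
    """Toy Metropolis Monte Carlo over backbone torsions: copy the angles of a
    random 3-residue fragment into a random window of the loop and accept or
    reject the move based on the change in score (lower score = better)."""
    current = list(torsions)                       # one (phi, psi, omega) tuple per residue
    current_score = score(current)
    best, best_score = list(current), current_score
    for _ in range(n_steps):
        start = random.randrange(len(current) - 2)
        trial = list(current)
        trial[start:start + 3] = random.choice(fragment_library)  # fragment = 3 residues of angles
        delta = score(trial) - current_score
        if delta < 0 or random.random() < math.exp(-delta / kT):
            current, current_score = trial, current_score + delta
            if current_score < best_score:
                best, best_score = list(current), current_score
    return best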

Each decoy coming from the repacking is scored using the Rosetta score function. The lower the Rosetta score, the better the decoy (according to Rosetta).

Results

RosettaAntibody can produce high-quality models (1.4 Å) on its 54-structure benchmark test. The major limitation of the method (as of any other antibody modeling method) is H3 loop modeling. H3 is believed to be the most important loop, and getting it right is therefore a major challenge.

The right framework and the correct orientation of Vh/Vl have a great effect on the quality of H3 predictions. When H3 was modeled on the correct framework, the predictions were an order of magnitude better than when using the homology model. This was demonstrated using native-structure recovery in the RosettaAntibody study, as well as during ‘Step II’ of the Antibody Modeling Assessment, where participants were asked to model H3 using the correct framework.

Natural Move Monte Carlo: Sampling Collective Motions in Proteins

Protein and RNA structures are built up in a hierarchical fashion: from linear chains and random coils (primary) to local substructures (secondary) that make up a subunit’s 3D geometry (tertiary) which in turn can interact with additional subunits to form homomeric or heteromeric multimers (quaternary). The metastable nature of the folded polymer enables it to carry out its function repeatedly while avoiding aggregation and degradation. These functions often rely on structural motions that involve multiple scales of conformational changes by moving residues, secondary structure elements, protein domains or even whole subunits collectively around a small set of degrees of freedom.

The modular architecture of antibodies makes them a good example of this phenomenon. Using MD simulations and fluorescence anisotropy experiments, Kortkhonjia et al. observed that the Ig domain motions in their antibody of interest correlate on two levels: 1) with laterally neighbouring Ig domains (i.e. VH with VL and CH1 with CL) and 2) with their respective Fab and Fc regions.

Correlated Motion

Correlated motion between all residue pairs of an antibody during an MD simulation. The axes identify the residues whereas the colours light up as the correlation in motion increases. The individual Ig domains as well as the two Fabs and the Fc can be easily identified. ref: Kortkhonjia, et al., MAbs. Vol. 5. No. 2. Landes Bioscience, 2013.

This begs the question: Can we exploit these molecular properties to reduce dimensionality and overcome energy barriers when sampling the functional motions of metastable proteins?

In 2012, Sim et al. published an approach that allows the incorporation of these collective motions (they call them “Natural Moves”) into simulation. Using simple RNA model structures, they showed that explicitly sampling large structural moves can significantly accelerate the sampling process in their Monte Carlo simulations. By gradually introducing DOFs that propagate increasingly large substructures of the molecule, they managed to reduce the convergence time by several orders of magnitude. This can be ascribed to the resulting reduction of the search space, which narrows down the sampling window: instead of sampling all possible conformations that a given polynucleotide chain may take, structural states that differ from the native state predominantly in tertiary structure are explored.

Reduced Dimensionality

Reducing the conformational search space by introducing Natural Moves. A) Ω1 (residue-level flexibility) represents the cube, Ω2 (collective motions of helices) spans the plane and Ω3 (collective motions of Ω2 bodies) is shown as a line. B) By integrating multiple layers of Natural Moves the dimensionality is reduced. ref: Sim et al. (2012). PNAS 109(8), 2890–5. doi:10.1073/pnas.1119918109

It is important to stress, however, that in addition to these rigid-body moves, local flexibility is maintained by preserving residue-level flexibility. Consequently, the authors argue, the high energy barriers resulting from large structural rearrangements are reduced and the resulting energy landscape is smoothed. Entrapment in local energy minima therefore becomes less likely and the acceptance rate of the Monte Carlo simulation is improved.

Although benchmarking of this method has mostly relied on case studies involving model RNA structures with near-perfect symmetry, it has a natural link to near-native protein structure sampling. Similarly to RNA, proteins can be decomposed into local substructures that may be responsible for the main functional motions in a given protein. However, due to the complexity of protein motion and the scarcity of experimental data, our understanding of protein dynamics is limited, which makes it challenging to identify suitable decompositions. As more dynamic data emerge from biophysical methods such as NMR spectroscopy, and as databases such as www.dynameomics.org are extended, we will be able to better approximate protein motions with Natural Moves.

In conclusion, when applied to suitable systems and used with care, this approach offers an opportunity to breathe life into the static macromolecules of the PDB, which may help to improve our understanding of the heterogeneous structural landscape and the functional motions of metastable proteins and nanomachines.

de novo Protein Structure Prediction software: an elegant “monkey with a typewriter”

In this week’s OPIG group meeting, I discussed the inner workings of the algorithm behind ROSETTA, one of the best-known programs for de novo protein structure prediction.

Before we even attempt to understand how ROSETTA works, let us start with a theorem.

Theorem: given an infinite number of monkeys with typewriters and an infinite amount of time, they are very likely to recreate the works of William Shakespeare.

Monkey with a typewriter… Time to write that Shakespeare!

Well, let us be a little more modest and attempt to recreate just a phrase of old Bill, instead of his whole works:

“The fool doth think he is wise, but the wise man knows himself to be a fool.”

Well, if we exclude spaces and punctuation marks, that leaves 58 positions in our phrase (the length of the quote). With 26 possible letters for each position, we would expect to generate this phrase at random only once in every 26^58 attempts. Wow!

That means that we need to evolve from monkeys (pun intended) and appeal to our over-developed encephalon!

In order to steer our Monkey typewriter, we can reduce this problem to a Global Optimisation problem. In a Global Optimisation problem, we define a function f (named an objective function) which we want to minimise for a given set of parameters x. Bear in mind that if we want to maximise a given function f, we can simply define g = -f and minimise g.

In a global optimisation problem, we are interested in finding the values of X that minimise the function f(X).

Now, all we need is to define an objective function in order to guide our Monkey typewriter towards the right answer.

Let us define the following objective function: given our Shakespearean phrase and a sequence of 58 letters, the value of the objective function equals the number of letters that are different between the phrase and the sequence of letters.

We can now proceed to define a slightly more refined Monkey Typewriter:

1- Start with a random sequence of letters.
2- WHILE sequence != shakespearean_phrase:
3-     Select a random position in the sequence.
4-     Assign a new letter to that position.
5-     IF score of new sequence < score of old sequence:
6-         Accept the change.
7-     ELSE:
8-         Discard the change.
This way we can steer our Monkeys and reduce the time it would take to generate our Shakespearean phrase to a more feasible time.
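For anyone who wants to see it run, here is a direct (and deliberately naive) Python transcription of the pseudo-code above:

import random
import string

TARGET = "thefooldoththinkheiswisebutthewisemanknowshimselftobeafool"

def score(sequence):
    """Objective function: number of positions differing from the target phrase."""
    return sum(a != b for a, b in zip(sequence, TARGET))

sequence = [random.choice(string.ascii_lowercase) for _ in range(len(TARGET))]  # line 1
while score(sequence) > 0:                                                      # line 2
    position = random.randrange(len(TARGET))                                    # line 3
    proposal = list(sequence)
    proposal[position] = random.choice(string.ascii_lowercase)                  # line 4
    if score(proposal) < score(sequence):                                       # line 5
        sequence = proposal                                                     # line 6: accept; otherwise discard
print("".join(sequence))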

Now, let’s talk about protein structure prediction (PSP). More specifically, let us talk about de novo protein structure prediction (different flavours of protein structure prediction have been discussed previously here).

One of the great ideas behind the creators of ROSETTA, was to use a combination of two different techniques to address the big problems of protein structure prediction:

1- Problem #1 of PSP is the size of the conformational space. A protein can be represented by its backbone atoms, which, in turn, can be reconstructed from a sequence of torsion angles. A set of 3 torsion angles can be used to represent each protein residue. Therefore, for a protein with 100 residues, we have a total of 300 angles. If we approximate each angle to one of 360 values (degrees), that gives us 360^300 possible conformations (not huge at all, huh?).

One of the main ideas behind ROSETTA was to reduce the search space by using fragments extracted from known structures. The use of fragments restricts the possible angles to a set of values that are known to occur in nature. Therefore, instead of looking at 360^300 possible angles, we deal with a much more feasible search space.

The name ROSETTA is based on the Rosetta Stone, an archaeological artefact that allowed modern civilisation to interpret and convert between different alphabets. In reality, ROSETTA can be seen as a very elegant monkey typewriter. ROSETTA uses sequence and structure similarity to define a structural alphabet: for every single position in our protein sequence, we have a set of fragments extracted from known protein structures to represent that position. Originally, each position would be represented by 25 fragments (letters?). If you combine the different pieces of known structures in the right order, you will get your Shakespearean phrase in the end (the correct protein structure!).

2- Well, we still have a pretty big conformational space considering we have 25 fragments per position (approximately 25^100 possible conformations, for a protein with 100 residues). The second technique employed by ROSETTA is Simulated Annealing.

Simulated Annealing is a Global Optimisation heuristic. It attempts to find a good enough solution to the problem of minimising a given function f. It is very similar to our Monkey Typewriter algorithm. The main difference is that Simulated Annealing implements some tricks to avoid local minima entrapment. In simpler terms, if we ONLY accept favourable changes (Line 5 of Monkey Typewriter pseudo-code), once we reach a local minimum, we get trapped. No possible change would lead to an improvement, yet we are still far from finding the global minimum.

In order to mitigate that entrapment effect, Simulated Annealing defines a probability of accepting an unfavourable change. This probability is higher at the beginning of the simulation and it becomes lower and lower as the simulation progresses. This process is usually referred to as “cooling down”.
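In the Monkey Typewriter above, this amounts to replacing the strict ‘accept only improvements’ test at line 5 with a Metropolis-style rule (a generic sketch of Simulated Annealing, not ROSETTA's actual criterion):

import math
import random

def accept(delta_score, temperature):
    """Metropolis-style rule: always accept improvements, and sometimes accept
    unfavourable changes, more often while the temperature is still high."""
    return delta_score < 0 or random.random() < math.exp(-delta_score / temperature)

As the temperature is lowered during the run, exp(-delta/T) shrinks towards zero and the rule gradually reverts to accepting improvements only.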

Ok! So we reduced our PSP problem to an elegant Monkey Typewriter. We have our Monkeys working to create the best possible Shakespeare, in a pretty clever and sophisticated manner. Well, we should be able to create some fine piece of literature, correct?

Not quite!

There are still several problems with this whole pipeline. I will mention a few:

  • When you define your structural alphabet, you may not have the right fragment to represent a certain position. This would be the same as trying to get to a Shakespearean phrase without using vowels for the first 10 letters or only using consonants in the middle of the sentence. It would never happen…
  • Despite the many efforts to define a very good objective function, no current software presents a function that truly mimics the behaviour of an energy function. This implies that we have a vague idea of what the Shakespearean phrase should look like, but we cannot precisely pinpoint where each letter goes.
  • No matter how elegant our Monkey typewriter becomes, the combinatorial problem still persists. We are still dealing with 25^100 possible conformations and it is impossible to try every single conformation.
  • The objective function, if plotted in a graph, would look completely hideous (unlike the picture above). We are talking about a gigantic multi-dimensional surface, filled with local minima that confuse and entrap our simulations. Combine that with the fact that our objective function is not accurate and you waste most of your computing power on generating solutions that are completely useless.
  • Another common technique to address the previous limitations is to increase the number of Monkeys in order to speed up the search process. If you use thousands and thousands of Monkeys (multiple runs of ROSETTA), each individual Monkey will get to a local minimum (decoy = something that looks like a phrase). In recent years, tens of thousands of decoys are generated in order to predict a single structure. A new problem arises, because out of these tens of thousands of phrases, we cannot tell apart Hamlet from Twilight. We don’t know which Monkeys got close to the right answer. All we know is that for some cases some of them did.

In conclusion, de novo Protein Structure Prediction still has a long way to go.

MAMMOTH: a case study in protein structure alignment

I’ve talked about protein structure alignment before in the context of a rather novel, mathematical approach. This time I wanted to revisit the topic in a general sense, using a more established algorithm as a case study. MAMMOTH stands for Matching Molecular Models Obtained from Theory and was first published in 2002. Since then it has been cited nearly 400 times and the underlying algorithm has been extended to a multiple alignment program: MAMMOTH-mult.

Establishing biologically relevant and consistent alignments between protein structures is one of the major unsolved problems in computational bioinformatics. However, it’s an important part of many challenges that we face, such as establishing homology between distantly related proteins, functional inference for unannotated proteins, and evaluating the accuracy of models of predicted structure for competitions such as CASP.

Problem Outline

In essence, the problem of protein structure alignment can be outlined by considering two ordered sets of coordinates, A = {a1,a2,…,an} and B = {b1,b2,…,bm}, representing points in 3D space. In most cases these points will be the locations of the Cα atoms along each structure’s backbone. The sets A and B might have completely different lengths and, if an alignment exists, are almost certainly orientated differently compared to each other.


Establishing an alignment between these sets is equivalent to two steps:

  1. Establish a match M = {(ai,bj) | ai ∈ A, bj ∈ B}
  2. Rotate and translate A onto B so that equivalent atoms are as close as possible.

Of course, it is not immediately clear how to best fulfill these requirements. In particular, we don’t really know what features to prioritise for a biologically relevant match. Should we try to match secondary structure elements and what penalty should we attach to mismatched elements? How about maintaining the correct hydrogen bonding patterns between matched residues? And how much weight should we put on the matched atoms being consecutive in each set (i.e. how should we penalise gaps)?

The second step is equally ambiguous. Especially as there is no consensus on what the correct interpretation of close is. Minimising the RMSD between equivalent atoms is a popular choice of distance measure. However, as the MAMMOTH paper points out, RMSD is often dominated by the mismatched portions of remotely related structures and is thus largely inappropriate in these cases. Furthermore, even if we have a well-defined distance metric, should the superposition prioritise minimising the distances between nearly identical parts of the different structures, at the expense of less similar substructures? Or should the emphasis be on maintaining as lengthy a match as possible at the possible cost of a lower closeness of fit? How about the relative importance of a close fit for atoms in the core of the structure vs. those on the surface?

The majority of these questions remain unanswered and as a result it is often hard to validate alignments as we simply do not know what the right answer is. In fact, in many cases, manual analysis is preferred over any of the available computational techniques.

In this post I’ll go through how the MAMMOTH algorithm approaches each of these steps. For many of the above questions MAMMOTH does not postulate a solution, primarily because, as its name suggests, it was designed to assess prediction models which are often at low resolutions and lacking secondary structure or hydrogen bonding information. I think it’s important to keep these questions in mind, but there’s certainly no necessity to design a programme which deals with them all.

Step 1: Pairing up residues (similarity matrix)

In order to establish a match between equivalent atoms in A and B, MAMMOTH, like several other structural alignment algorithms, uses a well-established alignment technique: a similarity matrix (often inferred from and referenced as a distance matrix). A similarity matrix for alignment is an n x m array where each entry S(ai,bj) represents the pairwise similarity score between residues ai and bj. An alignment between the residues of A and B is any non-decreasing path (that is, a pair (ai,bj) in the path must appear later in the ordering of coordinates of both A and B than the preceding pair of residues in the path) from the top left corner of the array (a1,b1) to the bottom right corner (an,bm). For example the following path can be interpreted as an alignment between A = {a1, …, a11} and B = {b1, …, b8}

[Figure: example alignment path through an 11 x 8 similarity matrix]

Any alignment can be scored by summing up the similarity scores along this path, while penalising any gaps in an appropriate way (normally, these algorithms use trial and error to decide on sensible penalties). For example, the above alignment would have the score S = S(a1,b1) + S(a2,b2) + S(a3,b3) + S(a7,b4) + S(a8,b5) + S(a9,b6) + S(a10,b7) + S(a11,b8) + α + 2β, where α and β are gap opening and gap extension penalties respectively. The optimal alignment is simply the alignment which maximises this score.
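A literal transcription of this scoring scheme (a sketch: S is any n x m similarity matrix, the path is a list of matched index pairs, and the penalties are given as negative numbers so they can simply be added):

def alignment_score(path, S, gap_open=-7.00, gap_extend=-0.45):
    """Score an alignment path [(i, j), ...] through similarity matrix S,
    adding a gap-opening penalty for each gap and an extension penalty for
    each additional skipped residue."""
    total = sum(S[i][j] for i, j in path)
    for (i1, j1), (i2, j2) in zip(path, path[1:]):
        gap = max(i2 - i1, j2 - j1) - 1          # residues skipped in either chain
        if gap > 0:
            total += gap_open + gap_extend * (gap - 1)
    return total

For the example path above, the single three-residue gap contributes gap_open + 2 * gap_extend, matching the α + 2β term in the score.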

For sequence alignments, similarity scores can be assigned to residues from substitution tables like BLOSUM. However, it is not immediately clear what an appropriate equivalent for structures would be. MAMMOTH, like several other algorithms, defines the similarity between different residues by examining their local structural landscape. Specifically, this means comparing fragments of each backbone, centred on the residue of interest. MAMMOTH uses the URMS distance between heptapeptide fragments. This distance is illustrated below using 2D chains and tripeptide fragments.

[Figure: illustration of the URMS distance for tripeptide fragments of two 2D chains]

Comparing residues a2 and b3 involves looking at the directions between each successive residue of the fragment. Each vector is mapped to the unit sphere, beginning at the origin and ending at the surface of the sphere (in this case 2 vectors are considered, and in MAMMOTH’s case 6 different 3D vectors are mapped). The optimal rotation is found, superposing equivalent vectors as best as possible, and then the RMSD of the endpoints on the surface of the sphere is calculated as URMS(ai,bj).

Aside: The optimal superposition of a set of vectors is actually a non-trivial problem. It is essentially equivalent to step 2 in our alignment protocol outlined above, but is significantly easier for the 6 vectors characterising a fragment in MAMMOTH’s algorithm.
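A numpy sketch of the URMS calculation for two equal-length fragments (seven Cα positions give the six direction vectors mentioned above); here the optimal rotation is found with a standard Kabsch/SVD superposition, which is an assumption about the implementation rather than a transcription of MAMMOTH's code:

import numpy as np

def urms(fragment_a, fragment_b):
    """Unit-vector RMS distance between two backbone fragments (equal-length
    arrays of Calpha coordinates). Successive Calpha-Calpha direction vectors
    are mapped onto the unit sphere, optimally rotated onto each other
    (Kabsch/SVD), and the RMSD of their endpoints is returned."""
    a = np.diff(np.asarray(fragment_a, dtype=float), axis=0)
    b = np.diff(np.asarray(fragment_b, dtype=float), axis=0)
    a /= np.linalg.norm(a, axis=1, keepdims=True)      # endpoints on the unit sphere
    b /= np.linalg.norm(b, axis=1, keepdims=True)
    u, s, vt = np.linalg.svd(a.T @ b)
    d = np.sign(np.linalg.det(u @ vt))                  # guard against reflections
    rotation = u @ np.diag([1.0, 1.0, d]) @ vt
    diff = a @ rotation - b                              # rotated a vs b
    return float(np.sqrt((diff ** 2).sum() / len(a)))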

Finally, S(ai,bj) is calculated by converting this distance into a similarity measure (via the formula given in the paper), where URMSR is the expected URMS of a random set of vectors.

The optimal alignment through this MAMMOTH matrix is the path which maximises the sum of similarities between matched residues (each residue being at the centre of the heptapeptide fragment) using gap opening and extension penalties of 7.00 and 0.45 respectively.

Step 2: Global superposition (MaxSub)

The above alignment results in a match M’ optimising the local structural similarity of residues in each structure; however, there is no guarantee that this will result in a set of coordinates that are close in global space. In order to finalise the match set M ⊆ M’, as well as to calculate the optimal superposition of the paired residues of A onto their equivalent points in B, MAMMOTH uses the MaxSub algorithm. This is a very efficient algorithm (worth a read if you’re that way inclined) for finding the maximal subset of a set of paired atoms which are close in global space. MAMMOTH decides that close means < 4 Å away after superposition. It does not try to optimise a closer superposition than that, but attempts to find the largest possible set of matched residues.

The MaxSub algorithm relies on the assumption (made for computational tractability) that the final subset M ⊆ M’ will have, somewhere, a set of at least four residues consecutive in M’. The algorithm then starts with every possible seed of four consecutive residues (just to illustrate the power of the assumption in reducing computational time: for a 150 residue protein there are just 147 such seeds, but over 2 million sets of four non-consecutive residues!! And it’s a pretty reasonable assumption to make as well). The MaxSub algorithm then calculates the superposition for those four matched pairs, extending the set of residues that are <4A away from their partners, recalculating the superposition using these new pairs as well, then removing any pairs which are no longer within the threshold of each other. It repeats these steps, gradually extending the set M, until the algorithm converges.
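A rough sketch of that seed-and-extend loop (superpose() stands for any least-squares superposition routine returning a rotation and translation mapping the first coordinate set onto the second; this illustrates the idea rather than reproducing the published MaxSub code):

import numpy as np

def grow_subset(pairs, coords_a, coords_b, superpose, cutoff=4.0, seed_length=4):
    """MaxSub-style search: for every seed of consecutive matched pairs,
    superpose on the current subset, add all pairs closer than the cutoff,
    re-superpose, and keep the largest subset found over all seeds.

    pairs: list of (i, j) matched residue indices; coords_a, coords_b: (N, 3) arrays."""
    best = []
    for start in range(len(pairs) - seed_length + 1):
        subset = pairs[start:start + seed_length]
        for _ in range(20):                               # iterate until (approximate) convergence
            idx_a = [i for i, _ in subset]
            idx_b = [j for _, j in subset]
            rotation, translation = superpose(coords_a[idx_a], coords_b[idx_b])
            moved = coords_a @ rotation.T + translation
            extended = [(i, j) for i, j in pairs
                        if np.linalg.norm(moved[i] - coords_b[j]) < cutoff]
            if extended == subset:
                break
            subset = extended
        if len(subset) > len(best):
            best = subset
    return best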

Scoring the alignment

Using the two approaches outlined above, MAMMOTH generates an alignment between the two input structures. In order to summarise the significance of this alignment, the algorithm generates the PSI score: the percentage structural identity (which is simply the size of the maximum subset divided by the length of the shortest protein). As a global measure of the strength of similarity the PSI score is poorly constructed and scales with protein length. In order to adjust for this bias, MAMMOTH fits a Gumbel distribution to PSI scores obtained from random structure comparisons between unrelated proteins at bins of different lengths. This results in a z-score measuring, instead of the PSI of an alignment, the likelihood of obtaining a PSI score as good as that by chance between any two proteins of the same lengths.