Monthly Archives: February 2015

Building accurate models of membrane protein structures

Today I gave a talk on the research project I took up when I joined the group. My research focuses on the modeling of membrane proteins (MPs). Membrane proteins are the main class of drug targets, and their mechanism of function is determined by their 3D structure. Almost 30% of the proteins in sequenced genomes are membrane proteins, but only ~2% of the experimentally determined structures in the PDB are of membrane proteins. Computational methods have therefore been introduced to address this gap.

Homology modeling is one of the best-performing computational methods and can give “accurate” models of proteins. Many homology modeling methods have been developed, with Modeller being one of the best known. But these methods have been tested and customised primarily on soluble proteins. As we know, there are major physical differences between MPs and water-soluble proteins. Therefore, to build a homology modeling pipeline for membrane proteins, we need every step of the pipeline to take the unique environment of the membrane protein into account.

Memoir is a tool for homology-based prediction of membrane protein structure (figure below). As input, Memoir takes a target sequence and a template. First, iMembrane is used to detect the lipid bilayer boundaries on the template. Using this information, MP-T, with its membrane-specific substitution matrices, aligns the target and template. Then Medeller is used to build the core model, and finally FREAD, a fragment-based loop modeling method, is used to fill in the missing loops.

Memoir Pipeline


The Memoir methodology builds models that are accurate but potentially incomplete. Homology modeling often entails a trade-off between the level of accuracy and the level of coverage that can be achieved in predicted models. We therefore aim to build Memoir 2.0, in which we increase coverage by modelling the missing structural information only where such a prediction is sensible. To complete the models in the best way, we aim to:

  • 1. Examine the best ways to maximise FREAD coverage while maintaining accuracy
  • 2. Examine the best ab initio loop predictor for membrane proteins

    FREAD has two main parameters that contribute to its accuracy and coverage: the nature of the database searched for a loop, i.e. membrane or soluble (mem/sol), and the choice of cut-off on the environment-specific substitution score (ESSS), which measures sequence similarity:

  • ESSS >= 25: more accurate loop models are built (Hiacc)
  • ESSS > 0: more coverage is achieved, but the models are not necessarily accurate (Hicov)

    To test the effect of these parameters on prediction accuracy and coverage, we chose two test sets:

  • Mem_DS: 280 loops taken from MP X-ray structures.
  • Model_DS: 156 loops from homology models of MPs.

    The loop length in both test sets ranges from 4 to 17 residues. The comparisons on both datasets confirm that, to achieve the highest accuracy and coverage, the FREAD pipeline should be run in the following order:

  • 1. Hiacc-mem
  • 2. Hicov-mem
  • 3. Hiacc-sol
  • 4. Hicov-sol
    Memoir with the new FREAD pipeline, called Memoir 2.0, achieves higher coverage than the original Memoir 1.0.
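
The four-step search order listed above can be sketched as a simple fall-through cascade. This is only an illustration: `search` is a hypothetical callable standing in for a FREAD database search (it is not FREAD's real interface), and the step labels simply mirror the names used in this post.

```python
from typing import Callable, Optional, Sequence, Tuple

# Search order from the list above: (database, mode).
# "hiacc" corresponds to ESSS >= 25 and "hicov" to ESSS > 0.
SEARCH_ORDER: Sequence[Tuple[str, str]] = (
    ("mem", "hiacc"),
    ("mem", "hicov"),
    ("sol", "hiacc"),
    ("sol", "hicov"),
)


def model_loop(
    loop_id: str,
    search: Callable[[str, str, str], Optional[str]],
) -> Optional[Tuple[str, str]]:
    """Try each search in order and return the first hit.

    `search(loop_id, database, mode)` is a hypothetical stand-in that
    returns a fragment identifier, or None when nothing suitable is found.
    """
    for database, mode in SEARCH_ORDER:
        hit = search(loop_id, database, mode)
        if hit is not None:
            return f"{mode}-{database}", hit
    # No database search succeeded: the loop is left for ab initio modelling.
    return None
```

A loop that returns `None` here is one of the loops FREAD cannot cover.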

    But there are still loops that are not modeled by the FREAD pipeline. These loops should be modeled using an ab initio method. To test the performance of ab initio loop predictors developed for soluble proteins on membrane proteins, we predicted the loops of the above test sets using six ab initio methods available for download: Loopy, LoopBuilder, Mechano, Rapper, Modeller and Plop.

    Comparison between ab initio methods on membrane proteins


    The comparisons in the figure above show that:

  • FREAD is more accurate, but doesn’t achieve complete coverage.
  • Greater coverage is achieved using ab initio predictors.
  • Mechano, LoopBuilder and Loopy are the best ab initio predictors.

    We have selected Mechano for Memoir 2.0 because it:

  • provides higher coverage than Loopy whilst retaining a similar accuracy.
  • is faster than LoopBuilder (Mechano is ~30 min faster for loops of length 12).
  • is able to model termini.

    In Memoir 2.0, N- and C-termini of up to 8 residues are built using Mechano. The Mechano decoys are then ranked by their DFIRE score, and accepted only if they have exited the membrane. This check improves the average RMSD by up to 4 Å on the DS_280 termini.
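
One possible reading of this rank-then-check step is sketched below. Everything in it is an assumption made for illustration: that a lower DFIRE score is better, that the membrane can be treated as a slab centred at z = 0 with a hypothetical half-thickness, and that "exited" means the last residue of the terminus lies outside that slab. The `Decoy` class and its fields are invented for the sketch.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Decoy:
    name: str
    dfire_score: float  # assumed: lower score means a better decoy
    end_z: float        # z of the terminal residue; membrane normal along z


def has_exited_membrane(decoy: Decoy, half_thickness: float) -> bool:
    """True if the terminal residue lies outside the assumed membrane slab."""
    return abs(decoy.end_z) > half_thickness


def pick_terminus(
    decoys: Iterable[Decoy], half_thickness: float = 15.0
) -> Optional[Decoy]:
    """Return the best-scoring decoy that leaves the membrane, if any."""
    for decoy in sorted(decoys, key=lambda d: d.dfire_score):
        if has_exited_membrane(decoy, half_thickness):
            return decoy
    return None  # no acceptable decoy: the terminus is left unmodelled
```

Under this reading, a well-scoring decoy that stays buried in the bilayer is skipped in favour of the next-best one that does not.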

    In conclusion, Memoir 2.0 provides higher-coverage models while maintaining a reasonable level of accuracy. Our comparisons showed that FREAD is significantly more accurate than the ab initio methods, but that greater coverage is achieved using ab initio predictors. The comparison shows that the top ab initio predictors in terms of accuracy are Mechano, LoopBuilder and Loopy. Similar patterns were also present in the model dataset. We have selected Mechano as it provides higher coverage than Loopy whilst retaining a similar accuracy, and is also much faster than LoopBuilder. Mechano also has the advantage that it is able to model termini. Only loops shorter than 17 residues were considered for modelling, since above this threshold the accuracy of predicted loops drops significantly.

    Hypotheses and Perspectives onto de novo protein structure prediction

    Before I start with my musings about my work and the topic of my D.Phil thesis, I would like to direct you to a couple of previous entries here on BLOPIG. If you are completely new to the field of protein structure prediction or if you just need to refresh your brain a bit, here are two interesting pieces that may give you a bit of context:

    A very long introductory post about protein structure prediction

    and

    de novo Protein Structure Prediction software: an elegant “monkey with a typewriter”

    Brilliant! Now, we are ready to start.

    In this OPIG group meeting, I presented some results that were obtained during my long quest to predict protein structures.

    Of course, no good science can happen without the postulation of question-driving hypotheses. This is where I will start my scientific rant: the underlying hypotheses that inspired me to inquire, investigate, explore, analyse, and repeat. A process all so familiar to many.

    As previously discussed (you did read the previous posts as suggested, didn’t you?), de novo protein structure prediction is a very hard problem. Computational approaches often struggle to search the humongous conformational space efficiently. Who can blame them? The number of possible protein conformations is so astronomically large that it would take MUCH longer than the age of the universe to look at every single possible protein conformation.
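
To put a rough number on that claim, here is a back-of-the-envelope calculation in the spirit of Levinthal's paradox. All the inputs are illustrative assumptions, not measurements: a 100-residue chain, three conformations per residue, and 10^13 conformations examined per second.

```python
# Back-of-the-envelope estimate; every number below is an illustrative assumption.
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_S = 13.8e9 * SECONDS_PER_YEAR  # roughly 13.8 billion years


def exhaustive_search_seconds(n_residues, states_per_residue=3, rate_per_second=1e13):
    """Time needed to enumerate every conformation of a chain, one by one."""
    n_conformations = states_per_residue ** n_residues
    return n_conformations / rate_per_second


def universe_ages(n_residues, **kwargs):
    """The same time, expressed in multiples of the age of the universe."""
    return exhaustive_search_seconds(n_residues, **kwargs) / AGE_OF_UNIVERSE_S


if __name__ == "__main__":
    print(f"{universe_ages(100):.1e} ages of the universe")
```

With these assumptions the answer is on the order of 10^17 ages of the universe, and it grows by a factor of three for every extra residue.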

    If we go back to biology, protein molecules are constantly undergoing folding. More so, they manage to do so efficiently and accurately. How is that possible? And can we use that information to improve our computational methods?

    The initial hypothesis we formulated in the course of my degree was the following:

    “We [the scientific community] can benefit from better understanding the context under which protein molecules are folding in vivo. We can use biology as a source of inspiration to improve existing methods that perform structure prediction.”

    Hence came the idea to look at biology and search for inspiration. [Side note: It is my personal belief that there should be a back and forth process, a communication, between computational methods and biology. Biology can inspire computational methods, which in turn can shed light on biological hypotheses that are hard to validate experimentally]

    To direct the search for biological inspiration, it was paramount to understand the limitations of current prediction methods. I have narrowed down the limitations of de novo protein structure prediction approaches to three major issues:

    1- The heuristics that rely on sampling the conformational space using fragments extracted from known structures will fail when those fragments do not encompass or correctly describe the right answer.

    2- Even when the conformational space is reduced, say, to fragment space, the combinatorial problem persists. The energy landscape is rugged and unrepresentative of the actual in vivo landscape. Heuristics are not sampling the conformational space efficiently.

    3- Following from the previous point, the energy landscape is unrepresentative of the in vivo landscape because of the inaccuracy of the knowledge-based potentials used in de novo structure prediction.

    Obviously, there are other relevant issues with de novo structure prediction. Nonetheless, I only have a limited amount of time for my D.Phil and those are the limitations I decided to focus on.

    To counter each of these limitations, we have looked for inspiration in biology.

    Our understanding from looking at different protein structures is that several conformational constraints are imposed by alpha-helices and beta-strands. That is a consequence of hydrogen bond formation within secondary structure elements. Unsurprisingly, when looking for fragments that represent the correct structure of a protein, it is much easier to identify good fragments for alpha-helical or beta-strand regions. Loop regions, on the other hand, are much harder to describe correctly using fragments extracted from known structures. We have incorporated this important information into fragment library generation software in an attempt to address limitation number 1.

    We have investigated the applicability of a biological hypothesis, cotranslational protein folding, to a structure prediction context. Cotranslational protein folding is the notion that some proteins begin their folding process as they are being synthesised. We further hypothesise that cotranslational protein folding restricts the conformational space, promoting the formation of energetically-favourable intermediates, thus steering the folding path towards the right conformation. This hypothesis has been tested in order to improve the efficiency of the heuristics used to search the conformational space.

    Finally, following the current trend in protein structure prediction, we used evolutionary information to improve our knowledge-based potentials. Many methods now consider correlated mutations to improve their predictions, namely the idea that residues that mutate in a correlated fashion present spatial proximity in a protein structure. Multiple sequence alignments and elegant statistical techniques can be used to identify these correlated mutations. There is a substantial amount of evidence that this correlated evolution can significantly improve the output of structure prediction, leading us one step closer to solving the protein structure prediction problem. Incorporating this evolution-based information into our routine assisted us in addressing the lack of precision of existing energy potentials.
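
As a toy illustration of what "correlated mutations" means in practice, the sketch below scores pairs of alignment columns with plain mutual information. This is one of the simplest covariation scores and is used here only as a stand-in; it is not necessarily the statistic used in our pipeline, and the four-sequence alignment is made up.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(msa, i, j):
    """Mutual information (in bits) between columns i and j of an alignment."""
    n = len(msa)
    count_i = Counter(seq[i] for seq in msa)
    count_j = Counter(seq[j] for seq in msa)
    count_ij = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), count in count_ij.items():
        p_ab = count / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


def top_pairs(msa, k=1):
    """Return the k column pairs with the highest mutual information."""
    length = len(msa[0])
    scores = {
        (i, j): mutual_information(msa, i, j)
        for i, j in combinations(range(length), 2)
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Made-up alignment: columns 1 and 3 change together (K with E, R with D),
# while columns 0 and 2 are fully conserved.
toy_msa = ["AKLE", "AKLE", "ARLD", "ARLD"]
best_pair = top_pairs(toy_msa)[0]
```

In the toy alignment the covarying pair of columns scores highest, which is the signal that would be read as a hint of spatial proximity. Real alignments need many more sequences, plus corrections for phylogeny and indirect couplings, which this sketch ignores.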

    Well, does it work? Surprisingly or not, in some cases it does! We have participated in a blind competition: the Critical Assessment of protein Structure Prediction (CASP). This event is rather unique and it brings together the whole structure prediction community. It also enables the community to gauge how good we are at predicting protein structures. Working with completely blind predictions, we were able to produce one correct answer, which is a good thing (I guess).

    All of this comes together nicely in our biologically inspired pipeline to predict protein structures. I like to think of our computational pipeline as a microscope. We can use it to prod and look at biology. We can tinker with hypotheses, implement potentials and test them, see what is useful for us and what isn’t. It may not be exactly what gets the papers published, but the investigative character of our structure prediction pipeline is definitely the favourite aspect of my work. It is the aspect that makes me feel like a scientist.

    Protein Structure Prediction, my own metaphorical microscope…

     

    Improving the accuracy of CDR-H3 structure prediction

    When designing an antibody for therapeutic use, knowledge of the structure (in particular the binding site) is a huge advantage. Unfortunately, obtaining even one of these structures experimentally, for example by X-ray crystallography, is very difficult and time-consuming – researchers have therefore been turning to models.

    The ‘framework’ regions of antibodies are well conserved between structures, and therefore homology modelling can be used successfully. However, problems arise when modelling the six loops that make up the antigen binding site – called the complementarity determining regions, or CDRs. For five of these loops, only a small number of conformations have actually been observed, forming a set of structural classes – these are known as canonical structures. The class that a CDR loop belongs to can be predicted from its sequence, making the prediction of their structures quite accurate. However, this is not the case for the H3 loop (the third CDR of the heavy chain) – there is a much larger structural diversity, making H3 structure prediction a challenging problem.

    Antibody structure, showing the six CDR loops that make up the antigen binding site. The H3 loop is found in the centre of the binding site, shown in pink. PDB entry 1IGT.


    H3 structure modelling can be considered as a specific case of general protein loop modelling. Starting with the sequence of the loop, and the structure of the remaining parts of the protein, there are three stages in a loop modelling algorithm: conformational sampling, the filtering out of physically unlikely structures, and ranking. There are two types of loop modelling algorithm, which differ in the way they perform the conformational sampling step: knowledge-based methods, and ab initio methods. Knowledge-based methods use databases of known structures to produce loop conformations, while ab initio methods do this computationally, without knowledge of existing structures. My research involves the testing and development of these loop modelling algorithms, with the aim of improving the standard of H3 structure prediction.

    A knowledge-based method that I have tested is FREAD. FREAD uses a database of protein fragments that could possibly be used as loop structures. This database is searched, and possible structures are returned depending on the similarity of their sequence to the target sequence, and the similarity of the anchor structures (the two residues on either side of the loop). On a set of 55 unbound H3 loop targets, ranging between 8 and 18 residues long, FREAD (using a database of known H3 structures) produced an average best prediction RMSD of 2.7 Å (the ‘best’ prediction is the loop structure closest to the native of all those returned by FREAD). FREAD is obviously very sensitive to the availability of H3 structures: if no similar structure has been observed before, FREAD will either return a poor answer or fail to find any suitable fragments at all. For this reason there is huge variation in the FREAD results – for example, the best prediction for one target had an RMSD of 0.18 Å, while for another, the best RMSD was 10.69 Å. Fourteen of the targets were predicted with an RMSD of below 1 Å. The coverage for this particular set of targets was 80%, which means that FREAD failed to find an answer for one in five targets.
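
For readers unfamiliar with the two summary numbers used here, the sketch below shows how a "best prediction RMSD" and a coverage figure could be computed. It is a minimal illustration, assuming the predicted loops are already in the same coordinate frame as the native (no superposition is performed) and that each structure is just a list of (x, y, z) atom positions; the function names are made up.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long coordinate lists."""
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must contain the same number of atoms")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def best_rmsd(native, predictions):
    """RMSD of the prediction closest to the native, or None if there are none."""
    if not predictions:
        return None
    return min(rmsd(native, prediction) for prediction in predictions)


def summarise(best_by_target):
    """Return (coverage, mean best RMSD) from a target -> best RMSD mapping.

    Targets with no prediction at all are stored as None; they lower the
    coverage but are left out of the average.
    """
    modelled = [value for value in best_by_target.values() if value is not None]
    coverage = len(modelled) / len(best_by_target)
    mean_best = sum(modelled) / len(modelled) if modelled else None
    return coverage, mean_best
```

Note that the average is taken only over the targets for which something was returned, which is why it has to be read alongside the coverage.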

    MECHANO is an ab initio algorithm that we have developed specifically for H3 loop prediction. Loops are built computationally, by adding residues sequentially onto one of the anchors. For each residue, φ/ψ dihedral angles are chosen from a distribution at random – the distributions used by MECHANO are residue-specific, and are a combination of general loop data and H3 loop data. Loop conformations are closed using a modified cyclic coordinate descent algorithm (CCD), where the dihedrals of each residue are changed, one at a time, to minimise the distance between the free end of the loop and its anchor point, whilst keeping the dihedral angles in the allowed regions of the Ramachandran plot. I have tested MECHANO on the same set of targets as FREAD, generating 5000 loop conformations per target: the average best prediction RMSD was 2.1 Å, and the results showed a clear length dependence – this is expected, since the conformational space to explore becomes larger as the number of residues increases. Even though the average best prediction RMSD is better than that of FREAD, only one of the best RMSDs produced by MECHANO was sub-angstrom, compared to 14 for FREAD. Since the MECHANO algorithm does not depend on previously observed structures, predictions were made for all targets (i.e. coverage = 100%).
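
The core idea of cyclic coordinate descent is easiest to see in a planar toy version, sketched below. This is a deliberate simplification and not MECHANO's implementation: joints rotate freely in 2D, whereas the real algorithm changes backbone dihedrals in 3D and keeps them within allowed Ramachandran regions. The function names and the example chain are invented for the illustration.

```python
import math


def rotate_about(point, pivot, angle):
    """Rotate a 2D point about a pivot by the given angle (radians)."""
    dx, dy = point[0] - pivot[0], point[1] - pivot[1]
    c, s = math.cos(angle), math.sin(angle)
    return (pivot[0] + c * dx - s * dy, pivot[1] + s * dx + c * dy)


def ccd_close(chain, target, tol=1e-3, max_sweeps=500):
    """Planar CCD: move the free end of `chain` onto `target`.

    The first point is the fixed anchor. Each joint in turn is rotated so
    that the free end points at the target, which never increases the
    remaining gap. Bond lengths are preserved because only rotations are used.
    """
    pts = [tuple(p) for p in chain]
    for _ in range(max_sweeps):
        if math.dist(pts[-1], target) < tol:
            break
        for i in range(len(pts) - 2, -1, -1):
            pivot = pts[i]
            ex, ey = pts[-1][0] - pivot[0], pts[-1][1] - pivot[1]
            tx, ty = target[0] - pivot[0], target[1] - pivot[1]
            # Signed angle that turns (pivot -> free end) onto (pivot -> target).
            angle = math.atan2(ex * ty - ey * tx, ex * tx + ey * ty)
            pts[i + 1:] = [rotate_about(p, pivot, angle) for p in pts[i + 1:]]
    return pts


# A straight four-bond chain whose free end should reach the point (2, 2).
closed = ccd_close([(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)], (2.0, 2.0))
```

Changing one degree of freedom at a time is what makes the method cheap: each update has a closed-form solution, at the cost of needing several sweeps over the chain.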

    My current work is focused upon developing a ‘hybrid’ method, which combines elements of the FREAD and MECHANO algorithms. In this way, we hope to make predictions with the accuracy that can be achieved by FREAD, whilst maintaining 100% coverage. In its current form, the hybrid method, when tested on the 55-loop dataset from before, produces an average best prediction RMSD of 1.68 Å, with 16 targets having a best RMSD of below 1 Å – a very promising result! However, possibly the most difficult part of loop prediction is the ranking of the generated loop structures; i.e. choosing the conformation that is closest to the native. This is therefore my next challenge!