Author Archives: Reyhaneh Esmaielbeiki

Building accurate models of membrane protein structures

Today I gave a talk on my research project when I joined the group. My research focuses on modeling of membrane proteins (MPs). Membrane proteins are the main class of drug targets and their mechanism of function is determined by their 3D structure. Almost 30% of the proteins in the sequenced genomes are membrane proteins. But only ~2% of the experimentally determined structures in the PDB are membranes. Therefore, computational methods have been introduced to deal with this limitation.

Homology modeling is one of the best performing computational methods which gives “accurate” models of proteins. Many homology modeling methods have been developed, with Modeller being one of the best known ones. But these methods have been tested and customised primarily on the soluble proteins. As we know there are main physical difference between the MPS and water soluble proteins. Therefor, to build a homology modeling pipeline for membrane proteins, we need a pipeline which in all its steps the unique environment of the membrane protein is taken into account.

Memoir is a tool for homology-based prediction of membrane protein structure (Figure below). As input memoir takes a target sequence and a template. First, using imembrane the lipid bilayer boundaries are detected on the template. Using this information MP-T, with its membrane specific substitution matrices, aligns the target and template. Then, Medeller is used to build the core model and finally FREAD, a fragment-based loop modeling, is used to fill in the missing loops.

Memoir Pipeline

Memoir Pipeline

Memoir methodology builds accurate models but potentially incomplete. Homology modeling often entails a trade-off between the level of accuracy and the level of coverage that can be achieved in predicted models. Therefore we aim to build Memoir 2.0, in which we increase coverage by modelling the missing structural information only if such prediction is sensible. Therefore, to complete the models in the best way we aim at:

  • 1-Examine the best ways to maximise FREAD coverage, maintaining accuracy
  • 2-Examine the best ab initio loop predictor for membrane proteins
  • Fread has two main parameters which contribute to its accuracy and coverage. The nature of the chosen database to look for a loop (i.e. membrane or soluble (mem/sol)) and the choice of the sequence identity (ESSS) cut-off:

  • ESSS >= 25: more accurate loop models are built (Hiacc)
  • ESSS > 0: more coverage is met but not necessary accurate models (Hicov)
  • To test the effect of these parameters on the prediction accuracy and coverage we chose to test set:

  • Mem_DS: 280 loops taken from MP X-ray structures.
  • Model_DS: 156 loops from homology models of MPs. The loop length in both test ranges from 4 to 17 residues
  • The comparisons on both dataset confirm that to achieve the highest accuracy and coverage the FREAD Pipeline should be performed as:

  • 1. Hiacc-mem
  • 2. Hicov-mem
  • 3. Hiacc-sol
  • 4. Hicov-sol
  • Memoir with the new FREAD Pipeline, called Memoir 2.0, achieves higher coverage in comparison to the original Memoir 1.0.

    But there are still loops which are not modeled by FREAD Pipeline. These loops should be modeled using an ab initio method. To test the performance of soulable ab initio loop predictors on the membrane proteins we predicted the loops of the above testset sing six ab initio methods available for download: Loopy, LoopBuilder, Mechano, Rapper, Modeller and Plop.

    Comparison between ab initio methods on membrane proteins

    Comparison between ab initio methods on membrane proteins

    Comparisons in the image above shows that:

  • FREAD is more accurate but, doesn’t achieve complete coverage.
  • Greater coverage is achieved using ab initio predictors.
  • Mechano, LoopBuilder and Loopy are the best ab initio predictors.
  • We have selected Mechano for Memoir 2.0 because it:

  • provides higher coverage than Loopy whilst retaining a similar accuracy.
  • is faster than LoopBuilder (Mechano is ~30 min faster on loop length of 12)
  • is able to model terminals.
  • In memoir 2.0 the C and N terminals of up to 8 residues are built using Mechano. Then, Mechano decoy’s are ranked by their Dfire score , and accepted only if they have exited the membrane. This check improves the average RMSD up to 4Å on DS_280 terminals.

    In conclusion, Memoir 2.0 provides higher coverage models while maintaining a reasonable accuracy level. Our comparison results showed that FREAD is significantly more accurate than the ab initio methods. But, greater coverage is achieved using ab initio predictors.Comparison oshows that the top ab initio predictors in terms of accuracy are Mechano, LoopBuilder and Loopy. Similar patterns were also present in the model dataset. We have selected Mechano as it provides higher coverage than Loopy whilst retaining a similar accuracy and is also much faster than LoopBuilder. Mechano also has the advantage that it is able to model terminals. Only loops smaller than 17 residues were considered for modelling since above this threshold the accuracy of predicted loops drops significantly.

    PLOS Comp Bio’s front page!

    The work of our molecular dynamic expert, Bernhard Knapp, has made the front page of the PLOS Computational Biology . The image shows how CDR3 loops scan a peptide/MHC complex (Figure below: Snapshot from the website). Immune cells in the human body screen other cells for possible infections. The binding of T-cell receptors (TCR) and parts of pathogens bound by major histocompatibility complexes (MHC) is one of the activation mechanisms of the immune system. There have been many hypotheses as to when such binding will activate the immune system. In this study we performed the, to our knowledge, largest set of Molecular Dynamics simulations of TCR-MHC complexes. We performed 172 simulations each of 100 ns in length. By performing a large number of simulations we obtain insight about which structural features are frequently present in immune system activating and non-activating TCR-MHC complexes. We show that many previously suggested structural features are unlikely to be causal for the activation of the human immune system.

    How CDR3 loops scan a peptide/MHC complex. Rendering is based on the 100 ns simulation of the wild-type TCRpMHC complex. Blue: peptide; Solid white cartoon and transparent surface: first frame of the MHC helices and β-sheet floor; Orange: equally distributed frames of the CDR3s over simulation time. Transparent white cartoon: first frame of all other TCR regions.

    How CDR3 loops scan a peptide/MHC complex. Rendering is based on the 100 ns simulation of the wild-type TCRpMHC complex. Blue: peptide; Solid white cartoon and transparent surface: first frame of the MHC helices and β-sheet floor; Orange: equally distributed frames of the CDR3s over simulation time. Transparent white cartoon: first frame of all other TCR regions.

    HHSearch

    Introduction and Methods:

    Today I prsented the paper which introduces HHsearch. HHSearch is mainly used for template detection and still is one of the best known methods. The early methods for template detection simply performed sequence-sequence comparison between a query protein and a database of targets: such as BLAST or FASTA. But then more advanced methods such as PSI-BLAST were introduced which perform sequence-profile comparisons. Profiles are build using multiple sequence alignments (MSA) of protein families which describes the probability of the occurrence of an amino acid in a column of a MSA. In other words, profiles describe the conservation of amino acids among families. A remarkable advance in template detection was introduced by methods performing profile-profile comparisons such as COMPASS, PROF-SIM and LAMA.

    Hidden Markov Model (HMM) introduced a new way of building profiles which resulted in methods performing sequence-HMM comparisons to detect templates. A HMM is similar to a state machine and is build using a MSA where each column is given a ‘M’ (match) state. A match state emits amino acids based on probability calculated from the MSA. In addition of a match state, all columns of the MSA will have an ‘I’ (insert) state and ‘D’ (delete/gap) state (See below figure for an example). There is a transition between states (shown by arrows) where all transitions also have probabilities. Having a sequence and a HMM, the sequence can be alignment on the HMM. In other words there is path in the HMM which associates to emitting the sequence and a log-odd score associated with this path. Dynamic programming (Viterbi algorithm) is used to detect this path (similar to needleman and wunsch algorithm for sequence-sequence alignments). More detail can be found here.

    Example of an HMM. Taken from Bioinformatics Sequence and Genome Analysis, David W. Mount (http://compbio.pbworks.com/w/page/16252909/Multiple%20Sequence%20Alignment)

    Example of an HMM. Taken from Bioinformatics Sequence and Genome Analysis, David W. Mount
    (http://compbio.pbworks.com/w/page/16252909/Multiple%20Sequence%20Alignment)

    The novel idea of HHsearch was that instead of performing sequence-HMM alignment, HMM-HMM alignments can be used for template detection. Therefore, they first discuss the formula used to calculate the log-odd scores which are required by the Viterbi algorithm to find the best aligned path. The score of aligning two columns in two HMMs (query profile q and template profile t) are calculated as:

    1

    Using this score, the best alignment between two profile HMMs is generated by maximizing the log-sum-odds score as in the formula below:

    2

    Now that the scoring function is defined, Viterbi algorithm is used to maximize this function. For simplicity HHsearch has described five pair states and the allowed transition between them (see figure 1.C of the paper). Therefore five dynamic program matrices are needed to align the two HMMs.

    There are two other main parameters that can contribute to the final score of the HHsearch functions: 1) Correlation score: this score is based on the idea that if two proteins are homologs then once they are aligned high score columns (conserved columns) should cluster together. Which means the higher this score is the more homolog the sequences are. This score can be simply added to the ‘final best alignment score’ from the Viterbi algorithm. 2) While aligning two columns of the two HMMs, Secondary Structure (SS) elemnts are scored using statistical scores generated by the authors, which take into account the confidence values of the SS predictions. HHsearch provides two set of scores for SS comparison: 1- predicted vs predicted 2- predicted vs known. The second one is mainly used when performing 3D structure prediction. These acores are added to the Viterbi algorithm scoring functions in the formula 5,6 and 7 of the paper.

    Results:

    Dataset:

    HHsearch performance was compared to BLAST(sequence-sequence), PSI-BLAST(profile-sequence) , HMMER(HMM-sequence) and COMPASS and PROF-SIM(profile-profile) methods. five version of HHsearch was benchmarked:

    HHserach 0 -> basic profile-profile comparison with fix gap penalties
    HHserach 1 -> basic HMM-HMM comparison
    HHsearch 2 -> HMM-HMM comparison including correlation score
    HHsearch 3 -> HMM-HMM comparison including correlation score + predicted vs predicted SS
    HHsearch 4 -> HMM-HMM comparison including correlation score + predicted vs known SS

    The dataset used in the comparison study consists of 3691 sequences taken from the SCOP database filtered at 20% sequence identity (SCOP-20). For each sequence an alignment is prepared using PSI-BLAST. These alignments are used to compare methods.

    Detection of homolog:

    The ability of methods in detecting homologs are compared against each other. Homolog and non-homolog definitions are: If two domains of SCOP-20 are in the same SCOP superfamily then they are considered as homologs. If the two domains are from different classes they are considered as non-homologs. All other pairs are classified as unknown.
    The performance are compared by drawing TP (number ofhomolog pairs) vs FP (number of non-homolog pairs) curves, for all-against-all comparison of data in SCOP-20. The highest curves represent the best performing method. This curve is shown in Figure 2 of the paper. In this dataset the total number of TP are 41505 and the total number of FP are 1.08 x 10^7. The worse method is BLAST which at a error rate of 10% will only find 22% of the homologs. carrying on from BLAST the detection ability grows as:

    BLAST < PSI_BLAST < HMMER < PROF-SIM < COMPASS < HHsearch0 < HHsearch1 < HHsearch2 < HHsearch3 < HHsearch4 Studying the results from HHsearch in detail the authors realised that in some cases HHSearch (with high confidence) groups two domains from different superfamily or even different folds as homologs. Looking at the structures of these proteins the authors noticed that the structure of these proteins are either locally or globally similar. Therefore, proteins defined by SCOP to be in different superfamilies or fold might actually be homologs. Therefore, they repeated the same test but changing the definition of TP and F: -homologs (TPs) are domains from the same superfamily OR if their sequence-based alignment resulted in an structure alignment with MaxSub score of 0.1. (MaxSub ranges from 0-1, where 0 is insignificant and 1 very significant.) -non-homologs (FPs) are those domains from different classes AND zero MaxSub score. Domains not in these two categories are grouped as unknow. Figure 3 of the paper displays the new results. Although the figures 2 and 3 look similar there are few main points concluded from these diagrams: 1- At the same error rate all methods except BLAST detect more homologs compared to Figure 2 of the paper. 2- With the new definitions, sensitive tools improve more than others in homolog detection such as HHsearch 3 and 4. Since the new set of homologs are harder to detect and therefore only sensitive tools can detect them 3- COMPASS and PRO-SIM perform better than HHsearch 0 which means they are better at remote homolog detection. To check this hypothesis, they draw TP vs FP (Figure 4 of paper) only for TPs from different families. Not only they confirm the hypothesis, they notice that using SS (HHsearch 3 and 4) they detect more TPs. Interesting enough as the pairs under study evolutionary diverge, the power of SS in detection becomes bolder. Alignment quality:

    The quality of a homology model really depends on how well the query protein is aligned with the homolog. Therefore, in this paper all of the methods are compared on their ability on building accurate alignments. To do so, using an aligned pair of sequences, the 3D structures of two proteins
    are superposed and the spatial distances are evaluated (similar to MaxSub method). The authors also introduced the Sbalance score which is similar to MaxSub but considers over and under predictions that MaxSub fails to consider.

    Comparing alignment qualities using MaxSub score (Fig 5 of paper) we can roughly conclude the performance as:

    BLAST < PSI_BLAST < PROF-SIM < COMPASS < HHsearch0 < HMMER < HHsearch1 < HHsearch2 < HHsearch3 < HHsearch4 again more sensitive methods are able to build better alignments for distant homologs. Comparing alignment qualities using Sbalance score (Fig 6 of paper) we can roughly conclude the performance as: BLAST < PSI_BLAST < HMMER < PROF-SIM < COMPASS < HHsearch0 < HHsearch1 < HHsearch2 < HHsearch4 < HHsearch3 HMMER now is inferior to profile-profile alignment methods. In addition HHsearch 3 is the winner. In general HMM-HMM methods are superior to all other methods for homolog detection and building alignments. HHsearch 4 has shown to be able to detect related structures (more than other methods) in the the same superfamily and folds. Using the same idea more accurate tools have been developed since this paper such as HHblits from the same group. Also, recently a method has introduced Markov Random Fields to detect homologs, with better performance than HHsearch and HHblits.

    3Dsig and ISMB 2014 (Boston)

    This year we head to Boston for 3Dsig and ISMB 2014. The outcome was excellent with James Dunbar giving a talk at the 3Dsig on “Examining variable domain orientations in antigen receptors gives insight into TCR-like antibody design” and Alistair Martin oral poster presentation at ISMB on “Unearthing the structural information contained within mRNA ”. The Deane group received the most votes by different judges for the poster competition at 3Dsig with James Dunbar poster winning the best poster prize and Jinwoo Leem and Reyhaneh Esmaielbeiki posters receiving the honorary mentioned (all presented posters are Here).

    Soooo Excited!

    Soooo Excited!

    This blog post contains a brief description of what we found most interesting at the conference.

    N-terminal domains in two-domain proteins are biased to be shorter and predicted to fold faster than their C-terminal counterparts.
    Authors: Jacob, Etai; Unger, Ron; Horovitz, Amnon

    Chosen by: Alistair Martin

    Is is not surprising that protein misfolding is selected against in the genome, with aggregation of misfolded proteins being associated to a an array of diseases. It is suggested that multi-domain proteins are more prone to misfolding and aggregation due to an effective higher local protein concentration. Jacob et al. investigated what mechanisms have developed to avoid this, focussing on ~3000 two domain proteins contained within Swiss-Prot.

    They found that there are notable differences between the N- and C-terminal domains. Firstly, there exists a large propensity for the C-terminal domain to be longer than the N-terminal domain (1.6 times more likely). Secondly, the absolute contact order (ACO) is greater in the C-terminal domain when compared to the N-terminal domain. Both length and ACO inversely correlate to folding speed and thus they draw the conclusion that the first domain produced by the ribosome is under pressure to fold faster than the latter domain. These observations are enhanced in prokaryotes and lessened in eukaryotes, thereby suggesting a relationship to translational speed and cotranslational folding.

    A novel tool to identify molecular interaction field similarities
    Authors: Matthieu Chartier and Rafael Najmanovich

    Chosen by: Claire Marks

    One of the prize-winning talks at 3Dsig this year was “A novel tool to identify molecular interaction field similarities”, presented by Matthieu Chartier from the Université de Sherbrooke. The talk introduced IsoMIF – a new method for finding proteins that have similar binding pockets to one another. Unlike previous methods, IsoMIF does not require the superposition of the two proteins, and therefore the binding sites of proteins with very different folds can be compared.
    IsoMIF works by placing probes at various positions on a grid in the binding pocket and calculating the interaction potential. Six probe types are used: hydrogen bond donor and acceptor, cation, anion, aromatic group, and hydrophobic group. A graph-matching algorithm then finds similar pockets to a query using the interaction potentials for each probe type at each grid point. On a dataset of nearly 3000 proteins, IsoMIF achieved good AUC values of 0.74-0.92.


    The promise of evolutionary coupling for 3D biology
    Presenter: Debora Marks


    Chosen by: Reyhaneh Esmaielbeiki

    I was impressed by the keynote talk given by Debora Marks at the 3Dsig. She gave an overall talk on how their group have worked on detecting evolutionary couplings (EC) between residues in proteins and how they use this information in predicting folds. In general, looking at the interacting residues in a 3D structure and comparing these position in a MSA displays co-evolving relationship. But the challenge is to solve the inverse, from sequence to structure, since not necessary all co-evolving residues are close in the 3D space (this relationship is shown in the figure below).

    Evoulationary coupling in sequnce and structure

    Evoulationary coupling in sequnce and structure

    Debora showed that previous studies for detecting co-evolving residues used Mutual Information (MI). But, comparing the prediction out of MI with contacts maps shows that these methods perform poorly. This is because MI looks at the statistics of pair of residue at a time while residues in proteins are highly coupled and pairs are not independent from other pairs. Therefore, MI works good for RNA but not for proteins. Debora’s group have used mean field and pseudo likelihood maximization to overcome the limitation of MI and introduced the EVcoupling tool for predicting EC (Marks et al. PLoS One, 6(12), 2011). They have used the predicted EC as a distance restraint to predict the 3D structure of proteins using EVfold. Using EVfold they have managed to build structure with 2-5Å accuracy.

    In a more recent work, they have built EVfold-membrane which is specific for membrane proteins (Hopf et al. Cell, 149(7), 1607-1621, 2012) and they tried modeling membrane proteins with unknown experimental structures. Recently close homologues to these structures were released and comparisons show that EVfold-membrane structures have accuracy of 3 to 4Å.

    She also discussed the usage of detecting EC in identifying functional residues involved in ligand binding and conformational changes with an interesting example of two GPCRs, adrenergic beta-2 receptor and an opioid receptor (paper link).

    She concluded her talk by talking about her recent work EVcomplex (paperlink). The aim is to use the detected EC between two different protein chains and use this information in the docking software as a distance restraint. Although, this method has provided models of the E.coli ATP synthase but there are currently several limitation (all mentioned in the Discussion of the paper) for using this work in large scale.

    in general, EC was a popular topic at the conference with interesting posters from the UCL Bioinformatics group.

    Characterizing changes in the rate of protein-protein dissociation upon interface mutation using hotspot energy and organization
    Authors: Rudi Agius, Mieczyslaw Torchala, Iain H. Moal, Juan Fernández-Recio, Paul A. Bates

    Chosen by:Jinwoo Leem

    This was a paper by Agius et al. published in 2013 in PLOS Comp Biol. Essentially, the work was centralised around finding a set of novel descriptors to characterise mutant proteins and ultimately predict the koff (dissociation rate constant) of mutants. Their approach is quite unique; they perform two rounds of alanine-scanning mutagenesis, one on the wild-type (WT) and one on the mutant protein. They identify ‘hotspot’ residues as those that have a change of G of 2kcal/mol (or more) from alanine scanning, and the descriptor is formed from the summation of the energy changes of the hotspots.

    The results were very encouraging, and from their random forest-trained model with hotspot descriptors, they see correlations to koff up to 0.79. However, the authors show that traditional ‘molecular’ descriptors (e.g. statistical potentials) perform just as well, with a correlation of 0.77. The exact contribution of their ‘hotspot’ descriptors to the prediction of koff seems unclear, especially considering how well molecular descriptors perform. Having said this, the paper shows a very unique way to approach the issue of predicting the effects of mutations on proteins, and on a larger dataset with a more diverse range of proteins (not necessarily mutants, but different proteins altogether!) these ‘hotspot’-specific methods may prove to be much more predictive.

    On the Origin and Completeness of Ligand Binding Pockets with applications to drug discovery
    Authors: Mu Gao & Jeffrey Skolnick.
    Presented by Jeffrey Skolnick

    Chosen by:Nicholas Pearce

    The prevalence of ligand-binding pockets in proteins enables a wide range of biochemical reactions to be catalysed in the cell. Jeffrey Skolnick presented research which proposes that ligand-binding pockets are inherent in proteins. One mechanism that he hypothesised for the creation of these pockets is the mis-stacking of secondary structure elements, leading to imperfections in their surfaces – pockets. Using their method for calculating pocket similarity – APoc – Gao & Skolnick characterised the space of ligand-binding pockets, using artificial structures and structures from the PDB, and find it to be ~400-1000 pockets in size.

    They suggest that the relatively small size of pocket-space could be one of the reasons that such a large amount of off-target promiscuity is seen in drug design attempts.

    From this result, Skolnick went on to discuss several interesting possibilities for the evolutionary history of ligand-binding pockets. One of the interesting hypotheses is that many, if not all, proteins could have inherent low-level catalytic ability across a wide range of biochemical reactions. Motivated by the small size of pocket-space, it is conceivable that one protein would be able to catalyse many different reactions – this could give an insight into the evolutionary history of protein catalysis.

    If primordial proteins could catalyse many different reactions, albeit inefficiently, this would give a possibility for how the first lifeforms developed. Nature need only then work on increasing specificity and efficiency from a background of weak catalytic ability. However, even through the course of evolution to produce more specific proteins, this background of activity could remain, and explain the background of ‘biochemical noise’ that is seen in biological systems.

    Drug promiscuity and inherent reactivity may not only be present due to the small size of pocket-space – they may be a direct consequence of evolution and the fundamental properties of proteins.

    Alistair Martin & codon!

    Alistair Martin & codon!

    James Dunbar presenting good stuff about antibodies

    James Dunbar presenting good stuff about Antibodies

    Stepwise Assembly For Protein Loop Prediction

    INTRODUCTION:

    Loop modeling is used frequently in designing the structure of new proteins or refining protein structures with limited X-ray and NMR data. In 2011, Sripakdeevong et al. introduced a hypothesis called “Stepwise Ansatz” for modeling RNA loops with atomic accuracy. They believed that current knowledge-based RNA loop predictors which aimed at predicting loops with atomic accuracy, failed to sample models within 1.5 Å RMSD of the native structures. The bottleneck in these methods is related to inefficient sampling of the conformational space. To overcome the limitation of sampling, Sripakdeevong et al. introduced an ‘ab initio’ (de novo) buildup strategy to allow for high resolution sampling of loops instead of restricting the search space to available fragments. But with current computational power, exhaustive enumeration of N-length (N>1) loops with atomic resolution is impossible. If N=1, considering all the degrees of freedom for nucleotide will result in 1 million conformations. Performing Rosetta energy minimization on these models will need 1 hour CPU time which is computationally reasonable. Every time a new nucleotide is added the conformational size will be multiplied exponentially by the RNA loop length (for a N=5 computational time ~ 10^23 CPU year).

    Since enumeration of one nucleotide long loop is possible, the entire loop can be modeled by stepwise enumerative building of one nucleotide at a time on low energy conformations which are well-packed and hydrogen bonded. Therefore, their implementation of stepwise assembly (SWA) protocol in a dynamic programming-like recursion style enables sampling of 12 length loops with achievable CPU time. SWA being successful in prediction of RNA loops, was first used to predict protein loops with atomic accuracy by Rhiju Das . Loop regions in protein structures have particular characteristics compared to the regions of regular secondary structure. Loops have similar number of hydrogen bonds (on average 1.1 per residue), mainly contain polar/charged side chains and have less contact with the non-polar core of the protein. Current Loop modeling methods with atomic resolution start off with a reduced representation of the protein with simplified or no side-chains. Although coarse graining of proteins will assist in reducing large number of local minima but will fail in capturing non-polar and hydrogen bond interactions involving side chains.Therefore, SWA is used to build up a loop at its atomic resolution by sampling the possible conformation space which is energetically favorable and also computationally possible.

    SWA PIPELINE:

    SWA is implemented in c++ in Rosetta framework. SWA uses a dynamic programming matrix (example is shown below in Figure 1D for a 6 length loop) to allow de novo buildup of loops from residue k to l. To achieve this, at each step SWA adds loop residue to build up forward from the N-terminus (from residue k-1 to i) and backward from the C-terminus (l+1 to j). Therefore, each circle point in figure 1D represents a (i,j) stage. SWA contains 5 main steps:

    1. Pre-packing the starting point : To start, all atoms of the loop region is removed from the model and side-chains are added to the model. This stage (k-1,l+1) is shown as green circle in figure 1D. Side chains are added and their torsion are minimized. Note that the non-loop backbones are kept fix in all stages of SWA.
    2.  Adding one loop residue to n-terminal: This stage is shown by orange arrows (moving downward) in Figure 1D. To generating possible conformations after adding the loop residue, backbone torsion angles (Φ,Ψ) of the added residue  and the backbone residue before that are sampled (Figure 1A). Φ,Ψ combinations which do not correspond to the Ramachandram are discarded. This procedure, can result in generating tens to thousands of conformations. For all the generated models, side chains are added to the sampled residues (i and i-1) and these side-chain along with the potential neighboring side chains are optimized. Afterward, clustering is performed, in which models are ranked in order of the energy and if a lower energy model has backbone RMSD of residue (i and i-1) <0.10Å compared to a higher energy model then the low energy model is removed (otherwise kept as a seed for a new cluster). After clustering the top 400 models are selected for all atom energy minimization on sampled residue backbone torsion and its neighbouring side-chain. Then, a final clustering is performed on the these models as described above.
    3. Adding one loop residue to c-terminal: This stage is shown by pink arrows (moving left) in Figure 1D. This is similar to step2, in which residue j and j+1 are considered for backbone sampling (Figure 1B), side-chain packing, model clustering and torsional minimization and final clustering.
    4. Closing loop chains :All models where the gap difference between C-terminal and N-terminal are 1,2 or 3 are subjected to chain closure. To generate closed loops, residue i+1 is added to N-terminal and i+2 and j-1 are added to C-terminal. For i+1, Φ and Ψ torsion are sampled by performing grid search as described above while backbone of i+2 and j-1 undergo Cyclic Coordinate Descent (CCD) which changes the Φ and Ψ torsion of i+2 and j-1 till it closes the gap to i+2. Models with chain closure < 0.01Å are then subject to side chain optimization, clustering, and torsional minimization. This procedure differs to above since all loop side chains and all loop backbones are affected by minimization. This stage is shown by blue arrows in Figure 1D just for gap lengths of one. In addition, to this procedure for loop closure, all models were closed by adding the last gap residue by grid search and trying to close the loop by CCD on the preceding residue. Also, models created by only sampling C-terminal or N-terminal are also used along with CCD loop closure to create full length models.
    5. Clustering: For each stage 400 models are generated, where the next stage uses these models to generate new conformations resulting in thousands models. Also several path can be used to get reach a specific stage, adding up to the numbers of generated models. Therefore, since SWA works on only low-energy models, only the 4000 lowest energy models are kept for further clustering. Clustering is similar to procedure above but with RMSD of 0.25Å and is applied on the entire loop segment which is build up to that stage. Then, the 400 lowest energy is used to move on to the next stage. At the loop closure stage also when the whole loops are modeled clustering is also used with RMSD of 1.0Å and the five lowest energy models are considered as SWA prediction.
    journal.pone.0074830.g001

    Figure 1: Stepwise Assembly Schematic Representation

    For short loops of (6 to 9 residue long), it was shown that solutions can be found just by creating models from N-terminal onward and separately by C-terminal backward and joining them by loop closure (or simply be moving just along the first column and first row of the dynamic matrix). Figure 1E shows a directed acyclic graph (DAG) of this procedure. The positive point is that in these cases computational time reduces to O(N) instead of O(N^2). Therefore, for such cases this procedure is tested first. If the difference between the lowest energy model and the second lowest is less than 1 kBT (a Rosetta energy unit is approximately 1 kBT) we can argue that modeling has not converged toward one model and the whole O(N^2) calculation should take place (Except for loops of length 24)

    RESULTS:

    A difficult case study:

    Predicting 1oyc loop (residue 203-214) has always been a challenge by loop predictors since most of its side-chains are polar/charged where hydrogen bonds play an important for stabilising the loop. All these factors are not considered in ab initio predictors with coarse-grained representation. Figure 2 of paper, displays the SWA build up procedure for 1oyc loop.The final best model (Figure 2:I) with the lowest energy has a c-alpha RMSD of 0.39 Å to the native structure. Following the build up path of 1oyc shows that the intermediate steps which lead to this final best model have not always been the lowest energy, therefore it is important to keep all the possible intermediate conformations. It is important to consider that different search paths allows sampling of totally diverse configurations. For example in Figure 2 (below), for 1oyc, 5 different configurations with comparable low energy generated by different build up paths are shown. Two totally different paths (blue and brown) may result in similar configurations while reasonably similar paths (pink, green and orange) have resulted in substantially different loop models.

    SWA for 1oyc. 5 different configurations with comparable low energy

    Figure 2: prediction of SWA for 1oyc loop. Five different configurations with comparable low energy are shown.

    SWA on 35 loop test set:

    SWA was used on a data set of 35 protein loops, where 20 of them allowed comparison with PLOP and Rosetta KIC and 15 where difficult cases with loop ranging between 8 to 24 residue. Comparing the median of RMSDs of lowest energy models (Table S1 of paper) shows SWA achieves better quality models (0.63 Å) with the same computational time as PLOP and Rosetta KIC. For the other 15 difficult cases SWA performance reduced by median RMSD of 1.4 Å for the lowest energy models.But, the highlight of SWA is prediction of 24 residue long loops,where it achieves sub-angstrom accuracy. Since SWA uses the O(n) strategy to solve the problem, in comparison to Rosetta KIC it requires less computational time.

    In total, considering the best of 5 models, for 27 of 35 cases SWA produces sub-angstrom accuracy. But looking at the five best models of these 27 models show that best RMSD does not corresponds to the best lowest energy model. Also, in some cases Rosetta KIC produces better RMSD models while energy wise it is worse than SWA. This shows Rosetta energy function requires improvement specially in its solvent model (where it fails the most).

    SWA and blind test:

    • SWA was used to predict 4 loops of a protein which its native structure was not released. SWA started with a model where the loop regions were removed and achieved sub-angstrom accuracy (Rosetta KIC achieved this for 3 out of the 4 cases).
    • SWA loop prediction accuracy of 0.53 Å for a RNA/protein target on a comparative model (instead of X-ray model) shows its ability in prediction complex structures.

    DISCUSSION:

    SWA method has been successful in predicting protein loops with sub-angstrom accuracy. Of significance are prediction of RNA-Protein model target and loop lengths of 24 residue. Although it provides atomic-accuracy predictions, SWA requires 5000 CPU hours (which is achievable with current processing powers) for 12 length loops. While Monte Carlo and refinement-based methods can predict loops in hundreds of CPU hours. SWA computational time can be improved by considering minimization of several factors in the build up pathway and the use of experimental constraints.

    SWA method can be used to guide and assist ab-initio prediction of protein structures and in protein folding. Also it may have application in ab inito modeling problems such as fitting high-res protein structures in low-res electron density maps or prediction of NMR structures using sparse chemical shift data. In addition, stepwise ansatz offers solutions to design of protein interfaces which require simultaneous optimizing of side-chain conformation, side-chain identity and back bones.

    From Protein Interface Recognition to Protein Complex Prediction

    Similarly to ‘words’, which need to be “assembled into sentences, paragraphs, chapters and books” to tell a story, ‘protein structures’ need to be assembled into protein complexes to perform a specific task. To form complexes, proteins interact with other proteins, DNA, RNA and small molecules using their interface residues. All those types of interactions are under intense scrutiny by the research community, each of them defining a distinct field of research. During my PhD I focused on protein-protein interactions (PPIs) and prediction of their interfaces. Modifications in PPIs affect the events that take place within cells which may lead to critical diseases such as cancer. Therefore, knowledge about PPIs and their resulting 3D complexes can provide key information for drug design.

    Docking is a popular computational method which predicts the possible structure of the complex produced by two proteins using the known 3D structure of the individual proteins. However, docking of two proteins can result in a large number of different conformational models whose majority is far from correct. This highlights one of the main limitations of docking.  Therefore, scoring functions have been proposed which are used to re-score and re-rank docked conformations in order to detect near-native models. One way to distinguish native-like models from false docked poses is to use knowledge of protein interfaces. If one knows the possible location of interface residues on each individual protein, docked complexes which do not involve those interfaces can be rejected. Therefore, accurate prediction of protein interfaces can assist with detection of native-like conformations.

    Various methods have been proposed for predicting protein interfaces as mentioned above. A high number of methods investigate protein sequential or structural features in order to characterise protein interfaces. Usage of 3D structural properties has improved the sequence-based predictions.  Moreover, evolutionary conservation was shown to be an important property. Therefore, methods have integrated various structural features along with evolutionary information to increase performance.

    The combination of different features using various techniques has been investigated by intrinsic-based predictors. However, it seems that these methods have reached their saturation, and combination of more properties does not improve their prediction performance. On the other hand, many studies have investigated the 3D structure of binding sites among protein families. They discovered that the binding site localisation and structure are conserved among homologous. These properties have improved the detection of functional residues and protein-ligand binding sites. Therefore, predictors took advantage of structurally conserved residues among homologous proteins to improve binding site predictions.

    Although homologous template-based predictors improve the predictions, they are limited to those proteins whose homologous structure exists. Therefore, methods have extended their search for templates to structural neighbours, since interface conservation exists even among remote structural neighbours. In addition, with the increase in experimentally determined 3D complexes good quality templates can be found for many proteins. Therefore, usage of structural neighbours is the current focus of template-based protein interfaces predictors.

    Although, template-based methods are currently the predictors under the main focus, one of their main limitations is their dependency to availability of the QP 3D structure. Also, these predictors have not investigated the contribution of interacting partners of structural neighbours in the prediction. In addition, since these methods perform structural comparisons their computational time is high which limits their application to high-throughput predictions.

    One of my PhD contributions was toward developing, T-PIP (Template based Protein Interface Prediction), a novel PIP approach based on homologous structural neighbours’ information. T-PIP addresses the above mentioned limitations by quantifying, first, homology between QP and its structural neighbours and, second, the diversity between the ligands of the structural neighbours (here, ligands refers to the interacting partners of proteins). Finally, predictions can be performed for sequences of unknown structure if that of a homologous protein is available. T-PIP’s main contribution is the weighted score assigned to each residue of QP, which takes into account not only the degree of similarity between structural neighbours, but also the nature of their interacting partners.

    In addition, we used T-PIP prediction to re-rank docking conformations which resulted in T-PioDock (Template based Protein Interface prediction and protein interface Overlap for Docking model scoring), a complete framework for prediction of a complex 3D structure. T-PioDock supports the identification of near-native conformations from 3D models that docking software produced by scoring those models using binding interfaces predicted by T-PIP.

    T-PioDock Pipeline

    T-PioDock Pipeline

    Exhaustive evaluation of interface predictors on standard benchmark datasets has confirmed the superiority of template base approaches and has showed that the T-PIP methodology performs best. Moreover, comparison between T-PioDock and other state-of-the-art scoring methods has revealed that the proposed approach outperforms all its competitors.

    Accurate identification of near-native conformations remains a challenging task. Although availability of 3D complexes will benefit to template based methods such as T-PioDock, we have identified specific limitations which need to be addressed. First, docking software are still not able to produce native like models for every target. Second, current interface predictors do not explicitly refer to pair-wise residue interactions which leaves ambiguity when assessing quality of complex conformations.