Stepwise Assembly For Protein Loop Prediction | Oxford Protein Informatics Group

INTRODUCTION:

Loop modeling is used frequently in designing the structure of new proteins or refining protein structures with limited X-ray and NMR data. In 2011, Sripakdeevong et al. introduced a hypothesis called “Stepwise Ansatz” for modeling RNA loops with atomic accuracy. They believed that current knowledge-based RNA loop predictors which aimed at predicting loops with atomic accuracy, failed to sample models within 1.5 Å RMSD of the native structures. The bottleneck in these methods is related to inefficient sampling of the conformational space. To overcome the limitation of sampling, Sripakdeevong et al. introduced an ‘ab initio’ (de novo) buildup strategy to allow for high resolution sampling of loops instead of restricting the search space to available fragments. But with current computational power, exhaustive enumeration of N-length (N>1) loops with atomic resolution is impossible. If N=1, considering all the degrees of freedom for nucleotide will result in 1 million conformations. Performing Rosetta energy minimization on these models will need 1 hour CPU time which is computationally reasonable. Every time a new nucleotide is added the conformational size will be multiplied exponentially by the RNA loop length (for a N=5 computational time ~ 10^23 CPU year).

Since enumeration of one nucleotide long loop is possible, the entire loop can be modeled by stepwise enumerative building of one nucleotide at a time on low energy conformations which are well-packed and hydrogen bonded. Therefore, their implementation of stepwise assembly (SWA) protocol in a dynamic programming-like recursion style enables sampling of 12 length loops with achievable CPU time. SWA being successful in prediction of RNA loops, was first used to predict protein loops with atomic accuracy by Rhiju Das . Loop regions in protein structures have particular characteristics compared to the regions of regular secondary structure. Loops have similar number of hydrogen bonds (on average 1.1 per residue), mainly contain polar/charged side chains and have less contact with the non-polar core of the protein. Current Loop modeling methods with atomic resolution start off with a reduced representation of the protein with simplified or no side-chains. Although coarse graining of proteins will assist in reducing large number of local minima but will fail in capturing non-polar and hydrogen bond interactions involving side chains.Therefore, SWA is used to build up a loop at its atomic resolution by sampling the possible conformation space which is energetically favorable and also computationally possible.

SWA PIPELINE:

SWA is implemented in c++ in Rosetta framework. SWA uses a dynamic programming matrix (example is shown below in Figure 1D for a 6 length loop) to allow de novo buildup of loops from residue k to l. To achieve this, at each step SWA adds loop residue to build up forward from the N-terminus (from residue k-1 to i) and backward from the C-terminus (l+1 to j). Therefore, each circle point in figure 1D represents a (i,j) stage. SWA contains 5 main steps:

Pre-packing the starting point : To start, all atoms of the loop region is removed from the model and side-chains are added to the model. This stage (k-1,l+1) is shown as green circle in figure 1D. Side chains are added and their torsion are minimized. Note that the non-loop backbones are kept fix in all stages of SWA.
Adding one loop residue to n-terminal: This stage is shown by orange arrows (moving downward) in Figure 1D. To generating possible conformations after adding the loop residue, backbone torsion angles (Φ,Ψ) of the added residue and the backbone residue before that are sampled (Figure 1A). Φ,Ψ combinations which do not correspond to the Ramachandram are discarded. This procedure, can result in generating tens to thousands of conformations. For all the generated models, side chains are added to the sampled residues (i and i-1) and these side-chain along with the potential neighboring side chains are optimized. Afterward, clustering is performed, in which models are ranked in order of the energy and if a lower energy model has backbone RMSD of residue (i and i-1) <0.10Å compared to a higher energy model then the low energy model is removed (otherwise kept as a seed for a new cluster). After clustering the top 400 models are selected for all atom energy minimization on sampled residue backbone torsion and its neighbouring side-chain. Then, a final clustering is performed on the these models as described above.
Adding one loop residue to c-terminal: This stage is shown by pink arrows (moving left) in Figure 1D. This is similar to step2, in which residue j and j+1 are considered for backbone sampling (Figure 1B), side-chain packing, model clustering and torsional minimization and final clustering.
Closing loop chains :All models where the gap difference between C-terminal and N-terminal are 1,2 or 3 are subjected to chain closure. To generate closed loops, residue i+1 is added to N-terminal and i+2 and j-1 are added to C-terminal. For i+1, Φ and Ψ torsion are sampled by performing grid search as described above while backbone of i+2 and j-1 undergo Cyclic Coordinate Descent (CCD) which changes the Φ and Ψ torsion of i+2 and j-1 till it closes the gap to i+2. Models with chain closure < 0.01Å are then subject to side chain optimization, clustering, and torsional minimization. This procedure differs to above since all loop side chains and all loop backbones are affected by minimization. This stage is shown by blue arrows in Figure 1D just for gap lengths of one. In addition, to this procedure for loop closure, all models were closed by adding the last gap residue by grid search and trying to close the loop by CCD on the preceding residue. Also, models created by only sampling C-terminal or N-terminal are also used along with CCD loop closure to create full length models.
Clustering: For each stage 400 models are generated, where the next stage uses these models to generate new conformations resulting in thousands models. Also several path can be used to get reach a specific stage, adding up to the numbers of generated models. Therefore, since SWA works on only low-energy models, only the 4000 lowest energy models are kept for further clustering. Clustering is similar to procedure above but with RMSD of 0.25Å and is applied on the entire loop segment which is build up to that stage. Then, the 400 lowest energy is used to move on to the next stage. At the loop closure stage also when the whole loops are modeled clustering is also used with RMSD of 1.0Å and the five lowest energy models are considered as SWA prediction.

Figure 1: Stepwise Assembly Schematic Representation

For short loops of (6 to 9 residue long), it was shown that solutions can be found just by creating models from N-terminal onward and separately by C-terminal backward and joining them by loop closure (or simply be moving just along the first column and first row of the dynamic matrix). Figure 1E shows a directed acyclic graph (DAG) of this procedure. The positive point is that in these cases computational time reduces to O(N) instead of O(N^2). Therefore, for such cases this procedure is tested first. If the difference between the lowest energy model and the second lowest is less than 1 kBT (a Rosetta energy unit is approximately 1 kBT) we can argue that modeling has not converged toward one model and the whole O(N^2) calculation should take place (Except for loops of length 24)

RESULTS:

A difficult case study:

Predicting 1oyc loop (residue 203-214) has always been a challenge by loop predictors since most of its side-chains are polar/charged where hydrogen bonds play an important for stabilising the loop. All these factors are not considered in ab initio predictors with coarse-grained representation. Figure 2 of paper, displays the SWA build up procedure for 1oyc loop.The final best model (Figure 2:I) with the lowest energy has a c-alpha RMSD of 0.39 Å to the native structure. Following the build up path of 1oyc shows that the intermediate steps which lead to this final best model have not always been the lowest energy, therefore it is important to keep all the possible intermediate conformations. It is important to consider that different search paths allows sampling of totally diverse configurations. For example in Figure 2 (below), for 1oyc, 5 different configurations with comparable low energy generated by different build up paths are shown. Two totally different paths (blue and brown) may result in similar configurations while reasonably similar paths (pink, green and orange) have resulted in substantially different loop models.

SWA for 1oyc. 5 different configurations with comparable low energy

Figure 2: prediction of SWA for 1oyc loop. Five different configurations with comparable low energy are shown.

SWA on 35 loop test set:

SWA was used on a data set of 35 protein loops, where 20 of them allowed comparison with PLOP and Rosetta KIC and 15 where difficult cases with loop ranging between 8 to 24 residue. Comparing the median of RMSDs of lowest energy models (Table S1 of paper) shows SWA achieves better quality models (0.63 Å) with the same computational time as PLOP and Rosetta KIC. For the other 15 difficult cases SWA performance reduced by median RMSD of 1.4 Å for the lowest energy models.But, the highlight of SWA is prediction of 24 residue long loops,where it achieves sub-angstrom accuracy. Since SWA uses the O(n) strategy to solve the problem, in comparison to Rosetta KIC it requires less computational time.

In total, considering the best of 5 models, for 27 of 35 cases SWA produces sub-angstrom accuracy. But looking at the five best models of these 27 models show that best RMSD does not corresponds to the best lowest energy model. Also, in some cases Rosetta KIC produces better RMSD models while energy wise it is worse than SWA. This shows Rosetta energy function requires improvement specially in its solvent model (where it fails the most).

SWA and blind test:

SWA was used to predict 4 loops of a protein which its native structure was not released. SWA started with a model where the loop regions were removed and achieved sub-angstrom accuracy (Rosetta KIC achieved this for 3 out of the 4 cases).
SWA loop prediction accuracy of 0.53 Å for a RNA/protein target on a comparative model (instead of X-ray model) shows its ability in prediction complex structures.

DISCUSSION:

SWA method has been successful in predicting protein loops with sub-angstrom accuracy. Of significance are prediction of RNA-Protein model target and loop lengths of 24 residue. Although it provides atomic-accuracy predictions, SWA requires 5000 CPU hours (which is achievable with current processing powers) for 12 length loops. While Monte Carlo and refinement-based methods can predict loops in hundreds of CPU hours. SWA computational time can be improved by considering minimization of several factors in the build up pathway and the use of experimental constraints.

SWA method can be used to guide and assist ab-initio prediction of protein structures and in protein folding. Also it may have application in ab inito modeling problems such as fitting high-res protein structures in low-res electron density maps or prediction of NMR structures using sparse chemical shift data. In addition, stepwise ansatz offers solutions to design of protein interfaces which require simultaneous optimizing of side-chain conformation, side-chain identity and back bones.

Author

Reyhaneh Esmaielbeiki

View all posts