Author Archives: Laura Depner

Protein structure determination using metagenome sequence data

For this week's journal club, I presented a recent paper from Ovchinnikov and the David Baker group: Protein structure determination using metagenome sequence data. The paper discusses how incorporating metagenome sequence data into multiple sequence alignments can improve residue-residue contact prediction. It concludes with the prediction of over 600 structures from protein families that currently have no solved structures.

The Pfam database contains 14,849 protein families with 50 or more residues. However, only 4752 of these families have at least one member with an experimentally determined structure. Of the remaining 10,097 families, 3984 have reliable comparative models built on the basis of homologs of known structure. Less confident comparative models can be built for a further 902 families; however, this leaves 5211 families with no structural information.

Recent technological advances in genome sequencing have provided an increasingly large number of amino acid sequences to work with. Large numbers of sequences allow the identification of compensatory mutations that have occurred in residues that are in contact with each other. This is called evolutionary co-variance, and it allows relatively accurate prediction of which residues are in contact in a structure. Rosetta utilises these co-evolutionary couplings, along with partial structural matches (found by combining the predicted contacts with contact patterns of known structures, using the map_align algorithm), to predict structures for a number of families with fold-level accuracy (TM-score > 0.5). However, it was unknown whether this method could be used to accurately predict protein structures on a large scale.
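To make the idea of co-variance concrete, here is a minimal sketch that scores pairs of alignment columns by mutual information. This is a much simpler statistic than the coupling analysis used in the paper (it does not separate direct from indirect couplings), and the toy alignment and function names are invented purely for illustration.

```python
from collections import Counter
from itertools import combinations
from math import log


def column(msa, i):
    """Return column i of the alignment as a list of residues."""
    return [seq[i] for seq in msa]


def mutual_information(msa, i, j):
    """Mutual information (in nats) between alignment columns i and j."""
    n = len(msa)
    count_i = Counter(column(msa, i))
    count_j = Counter(column(msa, j))
    count_ij = Counter(zip(column(msa, i), column(msa, j)))
    mi = 0.0
    for (a, b), c in count_ij.items():
        # p(a,b) * log( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * log((c * n) / (count_i[a] * count_j[b]))
    return mi


def rank_column_pairs(msa):
    """Sort all column pairs from most to least co-varying."""
    length = len(msa[0])
    scores = {
        (i, j): mutual_information(msa, i, j)
        for i, j in combinations(range(length), 2)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy alignment: columns 1 and 3 swap K/E together,
# mimicking a compensatory mutation between two contacting residues.
toy_msa = ["AKLE", "AKLE", "AELK", "AELK", "GKVE", "GELK"]

ranked = rank_column_pairs(toy_msa)
top_pair, top_score = ranked[0]
print(top_pair, round(top_score, 3))
```

With real families the signal is far noisier, which is why many diverse sequences are needed before such couplings become reliable.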

One challenge in using co-evolutionary couplings to predict residue-residue contacts is that a large number of sequences (hundreds to thousands) are needed. The accuracy of the predicted contacts is also dependent on the diversity of the sequences in a family, and the length of the protein. Nf is a measure that incorporates all of these factors: the number of non-redundant sequences in the family (after clustering at 80% sequence identity) divided by the square root of the protein length.
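As a rough illustration, the sketch below assumes Nf is the effective number of sequences (down-weighting sequences that share at least 80% identity) divided by the square root of the alignment length. The weighting scheme, the toy alignment and the function names are assumptions made for this example rather than the paper's exact procedure.

```python
from math import sqrt


def identity(a, b):
    """Fraction of aligned positions at which two sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def effective_sequences(msa, threshold=0.8):
    """Sum of sequence weights, where each sequence is weighted by
    1 / (number of sequences at or above the identity threshold to it,
    counting itself). Near-duplicates therefore share one 'vote'."""
    neff = 0.0
    for a in msa:
        neighbours = sum(identity(a, b) >= threshold for b in msa)
        neff += 1.0 / neighbours
    return neff


def nf(msa, threshold=0.8):
    """Assumed definition: effective sequences / sqrt(alignment length)."""
    return effective_sequences(msa, threshold) / sqrt(len(msa[0]))


# Hypothetical toy alignment of length 16: the first two sequences are
# near-duplicates (15/16 identical), the third is unrelated to both.
toy_msa = [
    "ACDEFGHIKLMNPQRS",
    "ACDEFGHIKLMNPQRT",
    "SRQPNMLKIHGFEDCA",
]

print(effective_sequences(toy_msa), nf(toy_msa))
```

Here the two near-duplicates count as one sequence between them, so three raw sequences give two effective sequences, and dividing by the square root of the length penalises longer proteins, which need more sequences for the same contact accuracy.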

Figure 1A shows the dependence of Rosetta structure prediction accuracy on Nf. In general, where Nf ≥ 64, accuracy typical of comparative modelling (TM-score > 0.7) can be achieved. For Nf ≥ 32, fold-level accuracy (TM-score > 0.5) can be achieved; below this, accuracy falls off. Of the 5211 families with no structural information, only ~400 had Nf ≥ 64; therefore accurate structural modelling could not be achieved for the remaining ~4800 families using the sequence data available in UniRef100.

 

Fig 1. (a) Accuracy of predicted structures produced with and without refinement by Rosetta for families with different Nf values. (b) Number of protein families with Nf≥64 between 2009 and 2015 using UniRef100 database, and UniRef100 and Metagenome data. (c) Percentage of protein families with Nf scores 4, 8, 16, 32, and 64 including sequences from UniRef100 and metagenome data.

The addition of metagenome sequence data (from shotgun sequencing of microbial DNA from environmental samples) increased the proportion of families with Nf ≥ 64 from 0.08 to 0.25. The proportion of families with Nf ≥ 32 also increased, from 0.16 to 0.33. The difference in the number of protein families with Nf ≥ 64 before and after the addition of metagenome sequence data can be seen in Figure 1B, and Figure 1C shows the percentage of families with Nf scores above 4, 8, 16, 32 and 64.

After running a set of benchmark calculations, this larger set of sequence data was used to generate models for 921 protein families, which now had Nf ≥ 64 and also had a number of long-range contacts greater than half the number of residues in the protein. Of these 921 protein families, models with predicted TM-scores > 0.65 were generated for 614 families. Although these were only predicted TM-scores, crystal structures for members of 5 of the 614 families have since been published, and these had TM-scores > 0.7 when compared with the corresponding models.

Limitations of this approach include the current lack of eukaryotic genetic information, as well as the lack of explicit modelling of ligands, co-factors and lipids in the Rosetta workflow. However, the fast rate of increase in metagenome sequencing data (as compared to the rate of increase of sequencing data in UniRef100) means that, while these new models fill roughly 12% of the unknown structural information for protein families, the outlook for future structure prediction is bright.
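The family counts quoted in this post can be checked with a few lines of arithmetic. The sketch below assumes the "roughly 12%" figure is the 614 newly modelled families expressed as a fraction of the 5211 families that had no structural information; the variable names are just labels for this example.

```python
# Counts quoted in the post
pfam_families = 14849          # Pfam families with 50 or more residues
with_structure = 4752          # at least one experimentally determined structure
reliable_models = 3984         # reliable comparative models
less_confident_models = 902    # less confident comparative models
new_models = 614               # families with predicted TM-score > 0.65

without_structure = pfam_families - with_structure
no_information = without_structure - reliable_models - less_confident_models
fraction_filled = new_models / no_information

print(without_structure, no_information, round(100 * fraction_filled, 1))
```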

TM-score

 

The similarity between two protein structures can be measured using TM-score (template modelling score). This can be particularly useful when examining the quality of a model, as compared to a target or template structure. One common method of comparing protein structures has been to calculate the root mean squared deviation (RMSD) from the distances between equivalent residues in both structures. An issue with this is that, as all residue pairs are weighted evenly, when the RMSD value is large, it becomes more sensitive to local structure deviation than to the global topology. Other established scoring functions, such as GDT-TS (1) and MaxSub (2), rely on finding substructures of the model where all residues are within a certain threshold distance of the corresponding template residues. However, this threshold distance is subjective and therefore cannot be used "as standard" for all proteins. A major disadvantage of all of these methods is that they display a power-law dependence on the length of the protein.
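The sensitivity of RMSD to local deviations is easy to see in a small example. The sketch below assumes the two structures are already superposed and that residues are paired by index; the coordinates are made up, and a real comparison would first find the optimal superposition.

```python
from math import dist, sqrt


def rmsd(coords_a, coords_b):
    """Root mean squared deviation between two equal-length lists of
    (x, y, z) coordinates, assumed to be already superposed."""
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of residues")
    total = sum(dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return sqrt(total / len(coords_a))


# Hypothetical 10-residue chain, with residues spaced 3.8 Å apart
native = [(3.8 * i, 0.0, 0.0) for i in range(10)]

# Model that is perfect except for one residue displaced by 10 Å
model = list(native)
model[9] = (3.8 * 9, 10.0, 0.0)

print(round(rmsd(model, native), 3))
```

Nine of the ten residues are placed exactly, yet the single displaced residue alone pushes the RMSD above 3 Å, because every squared distance contributes with equal weight.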

TM-score (3) was developed in order to overcome this length dependence. It is a variation of the Levitt-Gerstein (LG) score, which weights shorter distances between corresponding residues more strongly than longer distances. This ensures there is more sensitivity to global topology than to local structure deviations. TM-score is defined as:

$$\text{TM-score} = \mathrm{Max}\left[\frac{1}{L_N}\sum_{i=1}^{L_r}\frac{1}{1+\left(\frac{d_i}{d_0}\right)^{2}}\right]$$

where Max is the maximum value after optimal superposition, $L_N$ is the length of the native structure, $L_r$ is the number of residues aligned to the template structure, $d_i$ is the distance between the $i$th pair of residues and $d_0$ is a scaling factor. In alternative scoring functions, including MaxSub, $d_0$ is taken to be constant. TM-score uses the equation below to define $d_0$:

$$d_0 = 1.24\sqrt[3]{L_N - 15} - 1.8$$

which is an approximation of the average distance of corresponding residue pairs of random related proteins. This removes the dependence of TM-score on protein length.
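As a minimal sketch of how the score behaves, the code below evaluates the TM-score sum for a single, fixed superposition, using the standard scaling factor d0 = 1.24 (L_N - 15)^(1/3) - 1.8. It deliberately skips the search for the optimal superposition (the "Max" step), assumes residues are paired by index, and uses made-up coordinates; the function names and the lower length limit are choices made for this example.

```python
from math import dist


def d0(l_native):
    """Length-dependent distance scale (in Å) for a native structure of
    l_native residues. The formula only gives a positive value from
    19 residues upwards, so this sketch rejects shorter chains."""
    if l_native < 19:
        raise ValueError("this sketch needs l_native >= 19 for a positive d0")
    return 1.24 * (l_native - 15) ** (1.0 / 3.0) - 1.8


def tm_score_for_superposition(model_coords, native_coords, l_native):
    """TM-score sum for ONE given superposition (no maximisation).
    model_coords[i] is taken to be aligned with native_coords[i]."""
    scale = d0(l_native)
    total = sum(
        1.0 / (1.0 + (dist(m, n) / scale) ** 2)
        for m, n in zip(model_coords, native_coords)
    )
    return total / l_native


# Hypothetical 100-residue chain, with residues spaced 3.8 Å apart
native = [(3.8 * i, 0.0, 0.0) for i in range(100)]

# Model that is perfect except for one residue displaced by 10 Å
model = list(native)
model[50] = (3.8 * 50, 10.0, 0.0)

print(round(d0(100), 2))
print(tm_score_for_superposition(native, native, 100))
print(round(tm_score_for_superposition(model, native, 100), 3))
```

Because each residue pair contributes at most 1/L_N to the sum, a single badly placed residue barely lowers the score, whereas squared-distance measures such as RMSD are dominated by it.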

The value of TM-score always lies in the interval (0,1]. Evaluations of TM-score distributions have shown that when the TM-score between two structures is below 0.17, the P-value is close to 1 and the protein structures are indistinguishable from random structure pairs. When the TM-score reaches 0.5, the P-value is vastly reduced and the structures mostly share the same fold (4). It has therefore been suggested that TM-score may be useful not only in the automated assessment of protein structure predictions, but also for identifying similar folds in protein topology classification.

  1. Zemla A, Venclovas Č, Moult J, Fidelis K. Processing and analysis of CASP3 protein structure predictions. Proteins Struct Funct Genet. 1999;37(SUPPL. 3):22–9.
  2. Siew N, Elofsson A, Rychlewski L, Fischer D. MaxSub: an automated measure for the assessment of protein structure prediction quality. Bioinformatics. 2000;16(9):776–85.
  3. Zhang Y, Skolnick J. Scoring function for automated assessment of protein structure template quality. Proteins [Internet]. 2004;57(4):702–10. Available from: http://www.ncbi.nlm.nih.gov/pubmed/15476259
  4. Xu J, Zhang Y. How significant is a protein structure similarity with TM-score = 0.5? Bioinformatics. 2010;26(7):889–95.