Protein structure determination using metagenome sequence data

For this week’s journal club, I presented a recent paper from Ovchinnikov, and the David Baker group – Protein structure determination using metagenome sequencing data. This discussed how incorporating metagenome sequence data into multiple sequence alignments, can assist with and improve residue-residue contact prediction. The paper concludes with the prediction of over 600 structures from protein families that currently have no solved structures.

The Pfam database contains 14,849 protein families with 50 or more residues. However, only 4752 of these families have at least one member with an experimentally determined structure. 3984 of the remaining 10,097 families have reliable comparative models built on the basis of homologs of known structure. Less confident comparative models can be built for a further 902 families, however this leaves 5211 families with no structural information.

The recent technological advances in genome sequencing have provided an increasingly large number of amino acid sequences to work with. Large numbers of sequences allows the identification of compensatory mutations that have occurred in residues that are in contact with each other. This is called evolutionary co-variance and can allow the relatively accurate prediction of residues that are in contact in a structure. Rosetta utilises these co-evolutionary couplings, along with partial structural matches (found by combining the predicted contacts with contact patterns of known structures, using the map_align algorithm ) to predict structures from a number of families with fold-level accuracy ( TM-score > 0.5 ). However, it was unknown if this method could be used to accurately predict protein structures on a large-scale.

One challenge in using co-evolutionary couplings to predict residue-residue contacts is that a large number of sequences (hundreds to thousands) are needed. The accuracy of the predicted contacts is also dependent on the diversity of the sequences in a family, and the length of the protein. Nf is a measure that incorporates all of these factors :

Figure 1A shows the dependence of Rosetta structure prediction accuracy on the Nf. In general, where Nf64, accuracy typical of comparative modelling (TM-score > 0.7) can be achieved. For Nf32, fold-level accuracy (TM-score > 0.5) can be achieved, below this, accuracy falls off. Of the 5211 families with no structural information, only ~400 of these had Nf64; therefore accurate structural modelling could not be achieved for the remaining ~4800 of these families using the sequencing data available on UniRef100.


Fig 1. (a) Accuracy of predicted structures produced with and without refinement by Rosetta for families with different Nf values. (b) Number of protein families with Nf≥64 between 2009 and 2015 using UniRef100 database, and UniRef100 and Metagenome data. (c) Percentage of protein families with Nf scores 4, 8, 16, 32, and 64 including sequences from UniRef100 and metagenome data.

The addition of metagenome sequence data (from shotgun sequencing microbial DNA from environmental samples) increased the proportion of families with Nf64 from 0.08, to 0.25. The proportion of families with Nf32 also increased from 0.16, to 0.33. The difference in the fraction of protein families with Nf64 before and after the addition of metagenome sequence data can be seen in Figure 1B, and Figure 1C shows the percentage of families with Nf scores above 4, 8, 16, 32 and 64.

After running a set of benchmark calculations, this larger set of sequence data were used to generate models for 921 protein families, which now had Nf64 and also had number of long range contacts greater than half the number of residues in the protein. Of these 921 protein families, models with predicted TM scores > 0.65 were generated for 614 families. Although these were only predicted TM scores, crystal structures for members of 5 of the 614 families have since been published and had a TM-score > 0.7 when compared with the corresponding model.

Limitations with this using this data include the lack of eukaryotic genetic information currently, as well as the lack of explicit modeling of ligands, co-factors and lipids using the Rosetta workflow. However, the fast rate of increase in metagenome sequencing data (as compared to the rate of increase of sequencing data in UniRef100) means that while these new models fill roughly 12% of the unknown structural information for protein families, the potential for future structural prediction is bright.