So what is this buzz I hear about predicted protein contacts? Is it really the long awaited solution for one of the biggest open problems in biology today? Has protein structure prediction been solved?
Well, first things first. Let me give you a quick introduction to this predicted protein contact business (probably not quick enough for an elevator pitch, but hopefully you are not reading this in an elevator).
Nowadays, the scientific community has become very good at sequencing things (and by things I mean genetic things, like whole genomes of a bunch of different people and organisms). We are so good at it that mountains of sequence data are now available: genes, mRNAs, protein sequences. The question is what do we do with all this data?
Good scientists are coming up with new and creative ideas to extract knowledge from these mountains of data. For instance, one can build multiple sequence alignments using protein sequences for a given protein family. One of the ways in which information can be extracted from these multiple sequence alignments is by identifying extremely conserved columns (think of the alignment as a big matrix). Residues in these conserved positions are good candidates for being functionally important for the proteins in that particular family.
Another interesting thing that can be done is to look for pairs of residues that are mutating in a correlated fashion. In more practical terms, you are ascertaining how correlated is the information between two columns of a multiple sequence alignment; how often a change in one of them is countered by a change in the other. Why would anyone care about that? Simple. There is an assumption that residues that mutate in a correlated fashion are co-evolving. In other words, they share some sort of functional dependence (i.e. spatial proximity) that is under selective pressure.
Ok, that was a lot of hypotheticals, does it work? For many years, it didn’t. There were lots of issues with the way these correlations were computed and one of the biggest problems was to identify (and correct for) transitivity. Transitivity is the idea that you observe a false correlation between residues A and C because residues A,B and residues B,C are mutating in a correlated fashion. AS more powerful statistical methods were developed (borrowing some ideas from mechanical statistics), the transitivity issue has seemingly been solved.
The newest methods that detect co-evolving residues in a multiple sequence alignment are capable of detecting protein contacts with high precision. In this context, a contact is defined as two residues that are close together in a protein structure. How close? Their C-betas must be 8 Angstroms or less apart. When sufficient sequence information is available (at least 500 sequences in the MSA), the average precision of the predicted contacts can reach 80%.
This is a powerful way of converting sequence information into distance constraints, which can be used for protein structure modelling. If a sufficient number of correct distance constraints is used, we can accurately predict the topology of a protein . Recently, we have also observed great advances in the way that models are refined (that is, refining a model that contains the correct topology to atomic, near-experimental resolution). If you put those two things together, we start to look at a very nice picture.
So what’s the catch? The catch was there. Very subtle. “When sufficient sequence information is available”. Currently, there is an estimate that only 15% of the de novo protein structure prediction cases present sufficient sequence information for the prediction of protein contacts. One potential solution would be to sit and wait for more and more sequences to be obtained. Yet a potential pitfall of sitting and waiting is that there is no guarantee that we will have sufficient sequence information for a large number of protein families, as they may as well present less than 500 members.
Furthermore, scientists are not very good at sitting around and waiting. They need to keep themselves busy. There are many things that the community as whole can invest time on while we wait for more sequences to be generated. For instance, we want to be sure that, for the cases where there is a sufficient number of sequences, that we get the modelling step right (and predict the accurate protein topology). Predicted contacts also show potential as a tool for quality assessment and may prove to be a nice way of ascertaining whether you have confidence that a model with correct topology was created. More than that, model refinement still needs to improve if we want to make sure that we get from the correct topology to near-experimental resolution.
Protein structure prediction is a hard problem and with so much room for improvement, we still have a long way to go. Yet, this predicted contact business is a huge step in the right direction. Maybe, it won’t be long before models generated ab initio are considered as reliable as the ones generated using a template. Who knows what promised the future holds.
 Kim DE, Dimaio F, Yu-Ruei Wang R, Song Y, Baker D. One contact for every twelve residues allows robust and accurate topology-level protein structure modeling. Proteins. 2014 Feb;82 Suppl 2:208-18. doi: 10.1002/prot.24374. Epub 2013 Sep 10.