Inside Memoir: MP-T aligns membrane proteins

Although Memoir has received a lot of air-time on this blog, we haven’t gone into a great deal of detail about how it models membrane proteins. Memoir is a pipeline involving a series of programs iMembrane -> MP-T -> Medeller -> Fread, and in this post I’ll explain the MP-T step (I’ll briefly touch on Medeller too).

Let’s first look at the big-picture. There are several ways of modelling a protein’s 3D structure. In an ideal world we could specify an extended polypeptide, teach a computer some physics, set if off simulating, and watch the exact folding pathway of a protein. This doesn’t work. A second method would be to build up a protein from lots of fragments of unrelated proteins… this is usually what is meant by ‘ab initio’ modelling. The most accurate (and least sophisticated) approach is to find a protein of known structure with similar sequence, align the sequences, and copy over the coordinates of the aligned residues to make a model for the query protein. This is the approach taken by Memoir and is called homology modelling or comparative modelling.

The diagram below shows an example of how homology modelling might work. Four membrane protein sequences are aligned (left) and the alignment specifies a structural superposition (right). Assume now that the red structure is unknown: we could make a good model for it just by copying over the aligned parts of the blue, green and yellow structures.

Screen shot 2013-05-20 at 14.26.06
The greatest difficulty in the modelling described above is making an accurate alignment. As sequences become more distantly related they share less and less sequence identity, and working out the optimum alignment becomes challenging. This problem is especially acute for membrane protein modelling: there are so few structures from which to copy coordinates that a randomly chosen query protein has a good chance of having <30% sequence identity to the nearest related structure.

Although alignment is the most important facet of homology modelling it is not the only consideration. In the above diagram the centres of the proteins are structurally very conserved (so copying coordinates will lead to a good model in this region), but the top of the proteins differ (the stringy loops don’t sit on top of one another). It is the role of coordinate generation software to distinguish which coordinates to copy. It turns out that the pattern of a conserved centre and varying top/bottom is generally true for membrane proteins, and Memoir uses our Medeller coordinate generation software to take advantage of this pattern.

Back then to alignment. The aim of alignment is to work out which amino acids in one protein are related to amino acids in another. All alignment methods have at their heart a set of scores which encode the propensity for one amino acid to mutate to another, and for that mutation to become fixed in a population. These scores form a substitution table (here mutation + fixation = substitution). More sophisticated alignment methods augment these scores in different ways — for example by adding in scoring based on secondary structure, smoothing scores over a window, or estimating a statistical supplement to the score determined from a related set of pre-aligned sequences — but at some level a substitution table is always present. Using a substitution table, the most likely evolutionary relationship between two sequences can be detected and this is reported in the form of an alignment.

So that’s general alignment, now to apply this to membrane proteins. The cell membrane is composed of a lipid bilayer: a sandwich with a hydophobic filling and hydrophilic crusts. The part of a membrane protein that touches the filling will have different preferences for amino acids (and, more importantly, substitutions between these amino acids) than the part of a membrane protein that touches the crust. Similarly there are systematic preferences for amino acid substitutions depending on whether part of a protein is buried or exposed, and on which type of secondary structure it assumes. The figure below shows a membrane protein with different regions of the membrane and different types of secondary structure annotated.

Screen shot 2013-05-20 at 14.27.19 Screen shot 2013-05-20 at 14.28.42

 

It is possible to make separate substitution tables for each environment within a membrane protein, where an environment specifies where the protein sits in the membrane, what secondary structure it has, and whether it is accessible or buried. Below is a principal components analysis of the resulting set of tables: each table is represented by a single point and the axes show the direction of the greatest variation between the tables. The plot on the right shows a separation of the points based on whether they are buried (more hydrophobic) or accessible (more hydrophilic). The hydrophobic centre (red circles) and hydrophilic edges (green circles) of the membrane fall into this general pattern. The table on the left shows that the tables further divide by secondary structure type. In summary there are systematic substitution preferences in practice as well as theory, and for membrane proteins it is most important to consider hydrophobicity when aligning two protein sequences.

Screen shot 2013-05-20 at 14.30.09

On then to modelling. The conventional approach to aligning a pair of sequences for homology modelling is to take a set of pre-aligned sequences (a sequence profile), and use them to estimate a supplement to the standard substitution score for aligning two sequences. This is termed profile-profile alignment. Memoir takes a different approach by using the MP-T program to construct a multiple sequence alignment scored with environment-specific substitution tables. The alignment includes a set of homologous sequences to the pair of interest.

Profile-profile alignment methods and MP-T are very different. It is unclear whether the substitution preferences at a position are best estimated by MP-T’s tables or the supplements derived from sequence profiles, and the answer probably depends on how well the profiles are made — garbage in, garbage out. Similarly the MP-T algorithm only determines the upper limit of alignment accuracy, and the actual accuracy depends on how the homologous sequences in the alignment are chosen.

In general we find little difference between the fraction of an alignment that MP-T and either HHsearch or Promals (profile-profile alignment methods) gets right. However we do find a difference in the fraction of the alignment that these methods get wrong (part of an alignment can be right, wrong or simply not aligned, so it’s possible to get a lower fraction wrong whilst getting the same fraction right). It turns out that on average MP-T gets less of an alignment wrong for simple reasons of combinatorics: for a pair of proteins, the number of possible multiple sequence alignments is much greater than the number of possible profile-profile alignments. This means that, just by chance, the number of incorrectly aligned positions between the two sequences of interest will be lower for MP-T than for a conventional profile-profile alignment method.

Now for a little sales-pitch. The source code for MP-T is freely available and easy to expand (if you have a passing familiarity with Haskell). Only two or three lines of code need to be changed to define a new set of protein environments, and to feed it a substitution table for each environment. I’d be happy to help anyone who wants to try it out.

Author