Author Archives: Jamie Hill

Inside Memoir: MP-T aligns membrane proteins

Although Memoir has received a lot of air-time on this blog, we haven’t gone into a great deal of detail about how it models membrane proteins. Memoir is a pipeline of programs (iMembrane -> MP-T -> Medeller -> Fread), and in this post I’ll explain the MP-T step (I’ll briefly touch on Medeller too).

Let’s first look at the big picture. There are several ways of modelling a protein’s 3D structure. In an ideal world we could specify an extended polypeptide, teach a computer some physics, set it off simulating, and watch the exact folding pathway of a protein. This doesn’t work. A second method would be to build up a protein from lots of fragments of unrelated proteins… this is usually what is meant by ‘ab initio’ modelling. The most accurate (and least sophisticated) approach is to find a protein of known structure with similar sequence, align the sequences, and copy over the coordinates of the aligned residues to make a model for the query protein. This is the approach taken by Memoir and is called homology modelling or comparative modelling.
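
To make the "copy over the coordinates" step concrete, here is a minimal Python sketch. It is not Memoir's or Medeller's code: the function name, the gapped-string alignment format and the coordinate layout are all assumptions made for illustration.

```python
# Hypothetical sketch of the core idea of homology modelling.
# Names and data layout are assumptions, not taken from Memoir.

def copy_aligned_coordinates(query_aln, template_aln, template_coords):
    """Return {query_residue_index: (x, y, z)} for aligned residues.

    query_aln, template_aln: equal-length aligned strings, '-' marks a gap.
    template_coords: one (x, y, z) tuple per template residue, gaps excluded.
    """
    if len(query_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    model = {}
    q_idx = t_idx = 0  # positions in the ungapped sequences
    for q, t in zip(query_aln, template_aln):
        if q != "-" and t != "-":
            # Aligned column: the query residue inherits the template position.
            model[q_idx] = template_coords[t_idx]
        if q != "-":
            q_idx += 1
        if t != "-":
            t_idx += 1
    return model


coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
model = copy_aligned_coordinates("AC-DE", "A-GDE", coords)
```

Query residues that fall opposite a gap (the second residue in this toy example) get no coordinates, which is the part a real pipeline has to fill in by other means.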

The diagram below shows an example of how homology modelling might work. Four membrane protein sequences are aligned (left) and the alignment specifies a structural superposition (right). Assume now that the red structure is unknown: we could make a good model for it just by copying over the aligned parts of the blue, green and yellow structures.

[Figure: alignment of four membrane protein sequences (left) and the corresponding structural superposition (right)]

The greatest difficulty in the modelling described above is making an accurate alignment. As sequences become more distantly related they share less and less sequence identity, and working out the optimum alignment becomes challenging. This problem is especially acute for membrane protein modelling: there are so few structures from which to copy coordinates that a randomly chosen query protein has a good chance of having <30% sequence identity to the nearest related structure.
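
For reference, sequence identity can be computed from a pairwise alignment roughly as follows. This is a sketch under one assumed definition (identical columns divided by columns where both sequences have a residue); other denominators are in common use and give different numbers.

```python
# Sketch: percent sequence identity from two aligned, gapped strings.
# The choice of denominator is an assumption; conventions differ.

def percent_identity(aln_a, aln_b):
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences have a residue.
    aligned = [(a, b) for a, b in zip(aln_a, aln_b) if a != "-" and b != "-"]
    if not aligned:
        return 0.0
    matches = sum(1 for a, b in aligned if a == b)
    return 100.0 * matches / len(aligned)


identity = percent_identity("ACDEF", "AGDEY")  # 3 identical columns out of 5
```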

Although alignment is the most important facet of homology modelling it is not the only consideration. In the above diagram the centres of the proteins are structurally very conserved (so copying coordinates will lead to a good model in this region), but the top of the proteins differ (the stringy loops don’t sit on top of one another). It is the role of coordinate generation software to distinguish which coordinates to copy. It turns out that the pattern of a conserved centre and varying top/bottom is generally true for membrane proteins, and Memoir uses our Medeller coordinate generation software to take advantage of this pattern.

Back then to alignment. The aim of alignment is to work out which amino acids in one protein are related to amino acids in another. All alignment methods have at their heart a set of scores which encode the propensity for one amino acid to mutate to another, and for that mutation to become fixed in a population. These scores form a substitution table (here mutation + fixation = substitution). More sophisticated alignment methods augment these scores in different ways — for example by adding in scoring based on secondary structure, smoothing scores over a window, or estimating a statistical supplement to the score determined from a related set of pre-aligned sequences — but at some level a substitution table is always present. Using a substitution table, the most likely evolutionary relationship between two sequences can be detected and this is reported in the form of an alignment.
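
As an illustration of how a substitution table drives an alignment, below is a textbook Needleman-Wunsch global alignment in Python. It is a generic sketch, not the algorithm used by MP-T: the scoring function is a toy (match +2, mismatch -1) standing in for a real substitution table, and the linear gap penalty is an arbitrary assumption.

```python
# Sketch: global pairwise alignment (Needleman-Wunsch) with a pluggable
# substitution score. Toy scores only; not a real substitution table.

def needleman_wunsch(seq_a, seq_b, score, gap=-2):
    n, m = len(seq_a), len(seq_b)
    # dp[i][j] holds the best score for aligning seq_a[:i] with seq_b[:j].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),  # substitution
                dp[i - 1][j] + gap,                                    # gap in seq_b
                dp[i][j - 1] + gap,                                    # gap in seq_a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]):
            out_a.append(seq_a[i - 1]); out_b.append(seq_b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + gap:
            out_a.append(seq_a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(seq_b[j - 1])
            j -= 1
    return dp[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def toy_score(a, b):
    # Stand-in for a substitution table lookup.
    return 2 if a == b else -1


best, aligned_a, aligned_b = needleman_wunsch("ACDE", "ADE", toy_score)
```

Swapping `toy_score` for a lookup into a real table changes which alignment is judged most likely, without touching the dynamic programming itself.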

So that’s general alignment, now to apply this to membrane proteins. The cell membrane is composed of a lipid bilayer: a sandwich with a hydrophobic filling and hydrophilic crusts. The part of a membrane protein that touches the filling will have different preferences for amino acids (and, more importantly, substitutions between these amino acids) than the part of a membrane protein that touches the crust. Similarly there are systematic preferences for amino acid substitutions depending on whether part of a protein is buried or exposed, and on which type of secondary structure it assumes. The figure below shows a membrane protein with different regions of the membrane and different types of secondary structure annotated.

[Figure: a membrane protein annotated with membrane regions and secondary structure types]

It is possible to make separate substitution tables for each environment within a membrane protein, where an environment specifies where the protein sits in the membrane, what secondary structure it has, and whether it is accessible or buried. Below is a principal components analysis of the resulting set of tables: each table is represented by a single point and the axes show the directions of greatest variation between the tables. The plot on the right shows a separation of the points based on whether they are buried (more hydrophobic) or accessible (more hydrophilic). The hydrophobic centre (red circles) and hydrophilic edges (green circles) of the membrane fall into this general pattern. The plot on the left shows that the tables further divide by secondary structure type. In summary there are systematic substitution preferences in practice as well as theory, and for membrane proteins it is most important to consider hydrophobicity when aligning two protein sequences.
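
One way to picture environment-specific tables is as a dictionary of tables keyed by environment. The Python sketch below is only an illustration of that data structure: the `Environment` fields, the environment labels and every score in it are made up, and MP-T itself is written in Haskell with its own representation.

```python
# Sketch: one substitution table per structural environment.
# Field names, labels and scores are invented for illustration.
from typing import NamedTuple


class Environment(NamedTuple):
    layer: str      # position in the membrane, e.g. "core" or "headgroup"
    secondary: str  # secondary structure, e.g. "helix", "strand", "coil"
    buried: bool    # buried versus accessible


def env_score(tables, env, a, b, default=-1):
    """Score substituting residue a with b in the given environment."""
    table = tables[env]
    # Tables are treated as symmetric, so try both orderings of the pair.
    return table.get((a, b), table.get((b, a), default))


CORE_HELIX = Environment("core", "helix", True)
SURFACE_COIL = Environment("headgroup", "coil", False)

# Toy numbers only: the same substitution scores differently by environment.
TABLES = {
    CORE_HELIX: {("L", "I"): 3},
    SURFACE_COIL: {("L", "I"): 0, ("K", "R"): 3},
}

core_li = env_score(TABLES, CORE_HELIX, "I", "L")
surface_li = env_score(TABLES, SURFACE_COIL, "L", "I")
```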

[Figure: principal components analysis of the environment-specific substitution tables]

On then to modelling. The conventional approach to aligning a pair of sequences for homology modelling is to take a set of pre-aligned sequences (a sequence profile), and use them to estimate a supplement to the standard substitution score for aligning two sequences. This is termed profile-profile alignment. Memoir takes a different approach by using the MP-T program to construct a multiple sequence alignment scored with environment-specific substitution tables. The alignment includes a set of homologous sequences to the pair of interest.

Profile-profile alignment methods and MP-T are very different. It is unclear whether the substitution preferences at a position are best estimated by MP-T’s tables or the supplements derived from sequence profiles, and the answer probably depends on how well the profiles are made — garbage in, garbage out. Similarly the MP-T algorithm only determines the upper limit of alignment accuracy, and the actual accuracy depends on how the homologous sequences in the alignment are chosen.

In general we find little difference between the fraction of an alignment that MP-T and either HHsearch or Promals (profile-profile alignment methods) get right. However we do find a difference in the fraction of the alignment that these methods get wrong (part of an alignment can be right, wrong or simply not aligned, so it’s possible to get a lower fraction wrong whilst getting the same fraction right). It turns out that on average MP-T gets less of an alignment wrong for simple reasons of combinatorics: for a pair of proteins, the number of possible multiple sequence alignments is much greater than the number of possible profile-profile alignments. This means that, just by chance, the number of incorrectly aligned positions between the two sequences of interest will be lower for MP-T than for a conventional profile-profile alignment method.
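
The right/wrong/not-aligned distinction can be made explicit with a small Python sketch. The representation (a dictionary from query residue index to template residue index) and the exact definitions are assumptions for illustration, not the benchmark code behind the numbers above.

```python
# Sketch: classify each reference-aligned query residue as right, wrong
# or unaligned in a predicted alignment. Definitions are assumptions.

def alignment_fractions(predicted, reference):
    """Both arguments map query residue index -> template residue index."""
    total = len(reference)
    if total == 0:
        raise ValueError("reference alignment is empty")
    # Right: the prediction pairs the residue with the same template residue.
    right = sum(1 for q, t in reference.items() if predicted.get(q) == t)
    # Wrong: the prediction pairs the residue with a different template residue.
    wrong = sum(1 for q, t in reference.items() if q in predicted and predicted[q] != t)
    # Unaligned: the prediction leaves the residue opposite a gap.
    unaligned = total - right - wrong
    return right / total, wrong / total, unaligned / total


reference = {0: 0, 1: 1, 2: 2, 3: 3}
method_a = alignment_fractions({0: 0, 1: 1, 2: 3, 3: 4}, reference)
method_b = alignment_fractions({0: 0, 1: 1}, reference)
```

In this toy case both hypothetical methods get half of the alignment right, but the first gets the other half wrong while the second simply leaves it unaligned.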

Now for a little sales-pitch. The source code for MP-T is freely available and easy to expand (if you have a passing familiarity with Haskell). Only two or three lines of code need to be changed to define a new set of protein environments, and to feed it a substitution table for each environment. I’d be happy to help anyone who wants to try it out.

How to make a custom latex bibliography style

Imagine you are writing up your latest thrilling piece of science in your favourite odt or docx format. Nothing comes from nothing so you need to cite the 50 or so people whose ideas you built on, or who came to conclusions that contradict yours. Suddenly you realize that your second sentence needs a reference… and this will require you to renumber the subsequent 50. What a drag! There go 5 minutes of your life that could have been better spent drinking beer.

Had you written your research in latex instead, this drudgery would have been replaced by a range of much more interesting and intractable difficulties. In latex it is easy to automagically renumber everything just by recompiling the document. Unfortunately, it is often hard to direct latex’s magic… just try moving a picture an inch to the right, or reformatting a reference.

Moving figures around is still a black art as far as I’m concerned… but I’ve recently found out an easy way to reformat references. This might be especially handy when you find out that your sort of proteins fall out of the scope of the International Journal of Eating Disorders and you now want to submit to a journal that requires you to list authors in small-caps, and the dates of publication in seconds from the Unix epoch.

A good way of including references in latex is with a “.bib” file and a “.bst” file. An example of the end of a document is shown below.


\bibliographystyle{myfile}
\bibliography{mycollection}

\end{document}

What’s happening here? All my references are stored in bibtex format in a database file called “mycollection.bib”. A separate file “myfile.bst” says how the information in the database should be presented. For example, are references in the text of the form (Blogs et al 2005) or are they numbered (1)? At the end of the text are they grouped in order of appearance, by date of publication or alphabetically? If alphabetically does “de Ville” come under “d” or “v”? To reformat a reference, we simply need to change “myfile.bst”.
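
The “d” or “v” question is really a choice of sort key. The Python sketch below only illustrates that choice; it is not how bibtex or a “.bst” file implements sorting, and the list of name particles is an assumption.

```python
# Sketch: two ways to alphabetise surnames that start with a particle.
# Not bibtex's algorithm; the particle list is illustrative only.

PARTICLES = {"de", "van", "von", "der", "la"}


def sort_key(surname, ignore_particles):
    words = surname.split()
    if ignore_particles:
        # Drop leading particles so "de Ville" files under "v".
        while len(words) > 1 and words[0].lower() in PARTICLES:
            words = words[1:]
    return " ".join(words).lower()


def order_references(surnames, ignore_particles):
    return sorted(surnames, key=lambda s: sort_key(s, ignore_particles))


names = ["Smith", "de Ville", "Evans"]
under_d = order_references(names, ignore_particles=False)
under_v = order_references(names, ignore_particles=True)
```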

Most latex distributions come with a set of bibliography styles. Some examples can be found here (a page which also explains all of the above much better than I have). However, it is very easy to produce a custom file using the custom-bib package. After a one-click download it is as simple as typing:


latex makebst.ins
latex makebst.tex

Here’s a screenshot to prove it. At the bottom is the first of thirty or forty multiple-choice questions about how you want your references to look. If in doubt, just close your eyes and press return to select the default.

[Screenshot: running makebst, with the first multiple-choice question at the bottom]

The problem with a multiple-choice fest is that if you make a poor decision at question 28 you have to go through the whole process again. Fortunately, this can be circumvented — as well as generating a pretty “myfile.bst” file, the custom-bib package generates an intermediate file “myfile.dbj”. Changing your multiple-choice answers is just a matter of commenting out the relevant parts and typing “latex myfile.dbj”. A snippet of a “dbj” file is below:

[Screenshot: snippet of a “dbj” file]

Selected options are those without a “%” sign on the left hand side. Who would have thought that Latex could be so cuddly?

Journal club: Principles for designing ideal protein structures

The goal of protein design is to generate a sequence that assumes a certain structure and/or performs a specific function. A recent paper in Nature has attempted to design sequences for each of five naturally occurring protein folds. The success rate ranges from 10-40%.

This recent work comes from the Baker group, who are best known for Rosetta and have made several previous steps in this direction. In a 2003 paper this group stripped several naturally occurring proteins down to the backbone, and then generated sequences whose side-chains were consistent with these backbone structures. The sequences were expressed and found to fold into proteins, but the structure of these proteins remained undetermined. Later that same year the group designed a protein, Top7, with a novel fold and confirmed that its structure closely matched that of the design (RMSD of 1.2 Å).

The proteins designed in these three pieces of work (the current paper and the two papers from 2003) all tend to be more stable than naturally occurring proteins. This increased stability may explain why, as with the earlier Top7, the final structures in the current work closely match the design (RMSD 1 or 2 Å), despite ab initio structure prediction rarely being this accurate. These structures are designed to sit in a deep potential well in the Rosetta energy function, whereas natural proteins presumably have more complicated energy landscapes that allow for conformational changes and easy degradation. Designing a protein with two or more conformations is a challenge for the future.
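
For readers who haven’t met it, RMSD is the root-mean-square distance between matched atoms in two structures. A minimal Python sketch follows; it assumes the two coordinate sets are already optimally superposed and matched atom for atom, so it leaves out the superposition step that a real comparison needs.

```python
# Sketch: RMSD between two equal-length lists of (x, y, z) coordinates.
# Assumes the structures are already superposed; no fitting is done here.
import math


def rmsd(coords_a, coords_b):
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


design = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
solved = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
deviation = rmsd(design, solved)  # every atom is displaced by 1 unit
```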

In the current work, several sequences were designed for each of the fold types. These sequences have substantial sequence similarity to each other, but do not match existing protein families. The five folds all belong to the alpha + beta or alpha/beta SCOP classes. This is a pragmatic choice: all-alpha proteins often fold into undesired alternative topologies, and all-beta proteins are prone to aggregation. By contrast, rules such as the right-handedness of beta-alpha-beta turns have been known since the 1970s, and can be used to help design a fold.

The authors describe several other rules that influence the packing of beta-alpha-beta, beta-beta-alpha and alpha-beta-beta structural elements. These relate the lengths of the elements and their connective loops with the handedness of the resulting subunit. The rules and their derivations are impressive, but it is not clear to what extent they are applied in the design of the 5 folds. The designed folds contain 13 beta-alpha-beta subunits, but only 2 alpha-beta-beta subunits, and 1 beta-beta-alpha subunit.

An impressive feature of the current work is the use of the Rosetta@home project to select sequences with funnelled energy landscapes, which are less likely to misfold. Each candidate sequence was folded more than 200,000 times from an extended chain. Only ~10% of sequences had a funnelled landscape. It would have been interesting to validate whether the rejected sequences really were less likely to adopt the desired fold, especially given that this selection procedure requires vast computational resources.

The design of these five novel proteins is a great achievement, but even greater challenges remain. The present designs are facilitated by the use of short loops in regions connecting secondary structure elements. Functional proteins will probably require longer loops, more marginal stabilities, and a greater variety of secondary structure subunits.