Author Archives: Hannah Edwards

Submitting your thesis!

Writing and submitting your thesis is (almost) the final stage of completing your PhD. It can be the most stressful and unpleasant part of the process… but it can also be rewarding to see the story of your last three years’ work fall into place.

 "Piled Higher and Deeper" by Jorge Cham (

All I want for christmas is… “Piled Higher and Deeper” by Jorge Cham (

This post is a miscellaneous collection of advice and resources about the submission process, most of which have been passed down from the very first members of OPIG. Hopefully it will be useful to have it all in the same place for present and future members. Feel free to comment here if you have any tips I have missed!

All information and links that I’ve included are correct at the time of writing (for Oxford University Statistics students) but you should always use the university’s guidelines as your primary resource.

The very beginning: the plan

Don’t spend too long on this! But you should have an idea of your planned chapter titles and an overall story for your book. Also useful is a timeline for when you will finish drafts of chapters by. Try to be realistic with this. If you decide to change your thesis title you should fill out an application for change of thesis title form (GSO.6). Make sure you look up any restrictions (word/page limits etc.) which may apply, and confirm your hand-in date.

Starting writing

It’s a good idea to decide what you will use to write your thesis. Most OPIG members use LaTeX. There are some great thesis templates out there but the one most people tend to use is one from Cambridge’s Engineering department. You can do a fair bit of customisation within that template… changing fonts, headers, titles and more, but it’s a great starting point.

When the finish line’s in sight: choosing examiners

A couple of months before you are planning to submit your thesis you should discuss with your supervisor(s) potential examiners. Your supervisor can informally check with them if they are happy to examine you and then you should fill out an appointment of examiners form (GSO.3). You can also change your thesis title on this form without filling in GSO.6.

Finishing writing

Your final document is likely to be over 100 pages with thousands of words (or potential typos as you might come to call them). A great LaTeX spell checker is aspell which should already be installed on your work machine. To spell check a .tex file (ignoring all TeX notation… apart from multiple citations I found!) using a British dictionary simply type:

aspell --lang=en_GB -t -c filename.tex

You’re absolutely guaranteed to still have typos floating around but it’s a decent start. You (and others if you can get them) should manually proof-read as well!

Final Formatting

Your thesis should be set out on numbered, portrait A4 pages. It should be double spaced and the inner (bound) margin should be 3-3.5cm. For more details on the formatting required check out the university’s regulations.

Printing and binding

When you’re happy with your proof-reading (you’re still almost guaranteed to have remaining typos) you’ll have to print and bind your finished book! To comply with university guidelines you will need to submit two copies, for each of your examiners, to the exam schools. You may also like to print a copy for yourself (you will need one to take with you into your viva). Before you start, if you are printing in colour at the department make sure you have enough printer credit by emailing IT (let them know the printer and your Bod card number and they will top you up if necessary).

If you are planning to print your copies double sided you may want to buy your own paper of higher quality than that provided by the department (at least 100gsm). As of October 2014, the Oxford Print Centre was selling the cheapest packs of 100gsm paper we could find but sold out close to deadline day! Also check out WHSmiths or Ryman’s.

You might want to do a test run of a few colour pages of your thesis before you send the whole file to be printed. Printing at 1200dpi (instead of the default 600dpi) can improve the appearance of your figures considerably. You may want to stay late at the office to print so you are not disturbed by other print jobs during office hours.

Your thesis should be securely bound in either hard or soft cover. Loose-leaf or spiral binding won’t be accepted. There are several binding facilities through Oxford but I used the Oxford Print Centre just down the road, which also guarantees a one hour service for soft binding even on submission days.


Submit your completed copies to the exam schools, noting their opening hours (08.30-17:00, Monday to Friday), take the traditional photo, and bask in your newly found FREEDOM (try to forget about the viva!).

MAMMOTH: a case study in protein structure alignment

I’ve talked about protein structure alignment before in the context of a rather novel, mathematical approach. This time I wanted to revisit the topic in a general sense, using a more established algorithm as a case study. MAMMOTH stands for Matching Molecular Models Obtained from Theory and was first published in 2002. Since then it has been cited nearly 400 times and the underlying algorithm has been extended to a multiple alignment program: MAMMOTH-mult.

Establishing biologically relevant and consistent alignments between protein structures is one of the major unsolved problems in computational bioinformatics. However, it’s an important part of many challenges that we face: such as establishing homology between distantly related proteins, functional inference for unannotated proteins, and evaluating the accuracy of models of predicted structure for competitions such as CASP.

Problem Outline

In essence the problem of protein structure alignment can be outlined by considering two ordered sets of coordinates, A = {a1,a2,…,an} and B = {b1,b2,…,bm}, representing points in 3D space. In most cases these points will be the location of the Cα atoms along each structure’s backbone. The sets A and B might be completely different lengths and, if an alignment exists, are almost certainly orientated differently compared to each other.


Establishing an alignment between these sets is equivalent to two steps:

  1. Establish a match M = {(ai,bj) | ai ∈ A, bj ∈ B}
  2. Rotate and translate A onto B so that equivalent atoms are as close as possible.

Of course, it is not immediately clear how to best fulfill these requirements. In particular, we don’t really know what features to prioritise for a biologically relevant match. Should we try to match secondary structure elements and what penalty should we attach to mismatched elements? How about maintaining the correct hydrogen bonding patterns between matched residues? And how much weight should we put on the matched atoms being consecutive in each set (i.e. how should we penalise gaps)?

The second step is equally ambiguous. Especially as there is no consensus on what the correct interpretation of close is. Minimising the RMSD between equivalent atoms is a popular choice of distance measure. However, as the MAMMOTH paper points out, RMSD is often dominated by the mismatched portions of remotely related structures and is thus largely inappropriate in these cases. Furthermore, even if we have a well-defined distance metric, should the superposition prioritise minimising the distances between nearly identical parts of the different structures, at the expense of less similar substructures? Or should the emphasis be on maintaining as lengthy a match as possible at the possible cost of a lower closeness of fit? How about the relative importance of a close fit for atoms in the core of the structure vs. those on the surface?

The majority of these questions remain unanswered and as a result it is often hard to validate alignments as we simply do not know what the right answer is. In fact, in many cases, manual analysis is preferred over any of the available computational techniques.

In this post I’ll go through how the MAMMOTH algorithm approaches each of these steps. For many of the above questions MAMMOTH does not postulate a solution, primarily because, as its name suggests, it was designed to assess prediction models which are often at low resolutions and lacking secondary structure or hydrogen bonding information. I think it’s important to keep these questions in mind, but there’s certainly no necessity to design a programme which deals with them all.

Step 1: Pairing up residues (similarity matrix)

In order to establish a match between equivalent atoms in A and B, MAMMOTH, like several other structural alignment algorithms, uses a well-established alignment technique: a similarity matrix (often inferred from and referenced as a distance matrix). A similarity matrix for alignment is an n x m array where each entry S(ai,bj) represents the pairwise similarity score between residues ai and bj. An alignment between the residues of A and B is any non-decreasing path (that is, a pair (ai,bj) in the path must appear later in the ordering of coordinates of both A and B than the preceding pair of residues in the path) from the top left corner of the array (a1,b1) to the bottom right corner (an,bm). For example the following path can be interpreted as an alignment between A = {a1, …, a11} and B = {b1, …, b8}


Any alignment can be scored by summing up the similarity scores along this path, while penalising any gaps in an appropriate way (normally, these algorithms use trial and error to decide on sensible penalties). For example, the above alignment would have the score S = S(a1,b1) + S(a2,b2) + S(a3,b3) + S(a7,b4) + S(a8,b5) + S(a9,b6) + S(a10,b7) + S(a11,b8) + α + 2β, where α and β are gap opening and gap extension penalties respectively. The optimal alignment is simply the alignment which maximises this score.

For sequence alignments similarity scores can be assigned to residues from substitution tables like BLOSUM. However, it is not immediately clear of an appropriate equivalent for structures. MAMMOTH, like several other algorithms, defines the similarity between different residues by examining their local structural landscape. Specifically, this means comparing fragments of each backbone, centred on the residue of interest. MAMMOTH uses the URMS distance between heptapeptide fragments. This distance is illustrated below using 2D chains and tripeptide fragments.


Comparing residues a2 and b3 involves looking at the directions between each successive residue of the fragment. Each vector is mapped to the unit sphere, beginning at the origin and ending at the surface of the sphere (in this case 2 vectors are considered, and in MAMMOTH’s case 6 different 3D vectors are mapped). The optimal rotation is found, superposing equivalent vectors as best as possible, and then the RMSD of the endpoints on the surface of the sphere is calculated as URMS(ai,bj).

Aside: The optimal superposition of a set of vectors is actually a non-trivial problem. It is essentially equivalent to step 2 in our alignment protocol outlined above, but is significantly easier for the 6 vectors characterising a fragment in MAMMOTH’s algorithm.

Finally, S(ai,bj) is calculated by converting the distance into a similarity measure:


where URMSR is the expected URMS of a random set of vectors and:


The optimal alignment through this MAMMOTH matrix is the path which maximises the sum of similarities between matched residues (each residue being at the centre of the heptapeptide fragment) using gap opening and extension penalties of 7.00 and 0.45 respectively.

Step 2: Global superposition (MaxSub)

The above alignment results in a match M’ optimising the local structural similarity of residues in each structure, however, their is no guarantee that this will result in a set of coordinates close in global space. In order to finalise the match set M ⊆ M’ as well as calculating the optimal superposition of the paired residues of A onto their equivalent points in B, MAMMOTH use the MaxSub algorithm. This is a very efficient algorithm (worth a read if you’re that way inclined) for calculating the maximal subset from a set of paired up atoms which are close in global space. MAMMOTH decide that close means < 4A away after superposition. They do not try to optimise a closer superposition than that but attempt to find the largest possible set of matched residues.

The MaxSub algorithm relies on the assumption (made for computational tractability) that the final subset M ⊆ M’ will have, somewhere, a set of at least four residues consecutive in M’. The algorithm then starts with every possible seed of four consecutive residues (just to illustrate the power of the assumption in reducing computational time: for a 150 residue protein there are just 147 such seeds, but over 2 million sets of four non-consecutive residues!! And it’s a pretty reasonable assumption to make as well). The MaxSub algorithm then calculates the superposition for those four matched pairs, extending the set of residues that are <4A away from their partners, recalculating the superposition using these new pairs as well, then removing any pairs which are no longer within the threshold of each other. It repeats these steps, gradually extending the set M, until the algorithm converges.

Scoring the alignment

Using the two approaches outlined above, MAMMOTH generates an alignment between the two input structures. In order to summarise the significance of this alignment, the algorithm generates the PSI score: the percentage structural identity (which is simply the size of the maximum subset divided by the length of the shortest protein). As a global measure of the strength of similarity the PSI score is poorly constructed and scales with protein length. In order to adjust for this bias, MAMMOTH fits a Gumbel distribution to PSI scores obtained from random structure comparisons between unrelated proteins at bins of different lengths. This results in a z-score measuring, instead of the PSI of an alignment, the likelihood of obtaining a PSI score as good as that by chance between any two proteins of the same lengths.

Journal Club: Quantification and functional analysis of modular protein evolution in a dense phylogenetic tree

For journal club this week I decided to look at a paper by Moore et al. on the modular evolution of proteins.

Modular evolution, or the rearrangement of the domain architecture of a protein, is one of the key drivers behind functional diversification in the protein universe. I used the example in my talk of the multi-domain protein Peptidase T, which contains a catalytic domain homologous to Carboxypeptidase A, a zinc dependent protease. The additional domain in Peptidase T induces the formation of a dimer, which restricts the space around the active site and so affects the specificity of the enzyme.


The multi-domain protein Peptidase T in a dimer (taken from Bashton and Chothia 2007). The active site is circled in green. Carboxypeptidase A is made up of a single domain homologous to the catalytic domain (in blue) of Peptidase T.

I took this case study from a really interesting paper, The generation of new protein functions by the combination of domains (Bashton and Chothia, 2007), which explores several other comparisons between the functions of multi-domain proteins and their single domain homologues.

What this paper does not address however is the directionality of such domain reorganisations. In all these examples, it is not clear whether the multi-domain organisation has evolved from the single domain enzyme or vice versa. Which brings me back to the paper I was presenting on, which attempts a reconstruction of domain arrangements followed by a categorisation of rearrangement events.

Essentially, given a phylogenetic tree of 20 closely related pancrustacean species, the paper takes the different domain arrangements on the genomes (step 1), assigns the presence or absence of each arrangement at interior nodes on the tree (step 2), and then assigns each gained arrangement to one of four possible rearrangement events (step 3).

1. Domain Annotation
The authors use different methods to annotate domains on the genomes. They conclude the most effective methodology is to use the clan level (where families with suspected homologies are joined together… similar to our beloved superfamily classification) of Pfam-A (high quality family alignments with a manually chosen seed sequence). Moreover, they collapse any consecutive stretches of identical domains into one “pseudo-domain”, eliminating the effect of the domain repeat number on an arrangement’s definition.

2. Ancestral State Reconstruction
The ancestral state reconstruction of each domain arrangement (it’s presence/absence at each internal node on the tree) is a result of a 2-pass sweep across the tree: the first from leaves to root, and the second from the root to the leaves. On the first pass the presence of an arrangement on a parent node is decided by majority rule on the state of its children. If the arrangement is present in one child node but absent in the other, the state at the parent node is defined as uncertain. Any uncertain child nodes have a neutral impact on their parent node’s state (i.e. if a parent has a child with the arrangement and a child with an uncertain state the arrangement will be annotated as present in the parent node). On the second pass (from root to leaves) uncertain nodes are decided by the state at their parent node. An uncertain arrangement at the root will be annotated as present. For more details and a clearer explanation see Box 1 in the figure below.


A schematic for the assignment of domain recombination events. Box 1 gives the algorithm for the ancestral state reconstruction. Figure S2 from Moore et al. 2013.

3. Rearrangement events
For each gained event on a particular branch, the authors then postulated one of four simple rearrangement events dependent on the arrangements on the parent’s predicted proteome.

i) Fusion: A gained domain arrangement (A,B,C) on a child’s proteome is the result of fusion if the parent’s proteome contains both the arrangements (A,B) AND (C) (as one example).
ii) Fission: A gained arrangement (A,B,C) is the result of fission if the parent contains the arrangement (A,B,C,D) AND the child also contains the arrangement (D).
iii) Terminal Loss: A gained arrangement (A,B,C) is the result of terminal loss if the parent contains the arrangement (A,B,C,D) AND the child does not contain the arrangement (D).
iv) Domain gain: A gained arrangement (A,B,C) is the result of domain gain if the parent contains (A,B) but not (C).

Any gained arrangement which cannot be explained by these cases (as a single-step solution) is annotated as having no solution.


The authors find, roughly speaking, that the domain arrangements they identify fall into a bimodal distribution. The vast majority are those which are seen on only one genome, of which over 80% are multi-domain arrangements. There are also a sizeable number of arrangements seen on every single genome, the vast majority of which are made up of a single domain. I do wonder though, how much of this signal is due to the relative difficulty of identifying and assigning multiple different domains compared to just a single domain. While it seems unlikely that this would explain the entirety of this observation (on average, 75% of proteins per genome were assigned) it would be interesting to have seen how the authors address this possible bias.

Interestingly, the authors also find a slight inflation in fusion events over fission events across the tree (around 1 more per million years), although there are more fusion events nearer the root of the tree, with fission dominating nearer the leaves, and in particular, on the dense Drosophila subtree.

Finally, the authors performed a functional term enrichment analysis on the domain arrangements gained by fusion and fission events and showed that, in both cases, terms relating to signalling were significantly overrepresented in these populations, emphasising the potential importance that modular evolution may play in this area.

ISMB/ECCB Conference 2013 (Berlin)

It’s that time of the year again… when an intrepid group of OPIGlets trundle back tired but happy from another successful conference (this time it was ISMB/ECCB and its satellite conference 3Dsig in Berlin) armed with their favourite titbits from the presented work. This blog post is a mashup of some of our highlights as presented at the last group meeting.


Post-schnitzel and out and about in Berlin!

Definitely one of the best things for me was getting the chance to hear Sir Tom Blundell (our very own academic grandfather… Charlotte’s supervisor) give the keynote at 3Dsig, talking about everything from the structure of insulin to his deep, underlying love of jazz. Here are some more of our favourite things…

Empirical contact potentials derived from binding free energy changes upon mutation
(poster presentation by Iain H. Moal and Juan Fernández Racio)

Chosen by Jinwoo Leem

I was impressed by Moal (et al.)’s poster on predicting protein-protein binding affinities (in fact, it won the poster prize at 3D-Sig!). The poster describes a statistical potential that considers the number of mutations in a protein, and the type of interatomic contacts. Two variants of the potential were made; one for considering all atoms (atomic potential), and one considering residue side chains, represented as a centroid atom (residue potential). Ultimately, the energy change is represented as:


where N is the matrix of interatomic contacts between atoms i,j and P is a vector of contact types. Using weighted least-squares to minimise the residuals, r, the equation was used to predict affinity (ΔG) and affinity changes following mutations (ΔΔG).


As we can see in the top two graphs, the model shows decent performance for predicting ΔΔG of enzyme-inhibitor interactions, i.e. the model can indicate how a mutation affects binding affinities. Having said this, the ΔΔG predictions for Ab-Ag interactions were poor (Pearson’s correlation = 0.5-0.6).

Moreover, when the same potential was used to predict ΔG (bottom two graphs), the correlations were even worse. In fact, for flexible protein pairs, i.e. receptor-ligand pairs whose interface RMSD > 1.0Å, the correlation has gone to as low as 0.02.

Although the results are disappointing with respect to ΔG prediction, the model raises two interesting points. First, this is one of the few scoring functions that are specifically designed to predict affinity, rather than giving an arbitrary score for low RMSD. In addition, this model re-iterates the challenges in predicting Ab-Ag interactions. The solution for the latter point is not yet clear, but it may be of interest to re-train the model specifically with Ab-Ag complexes, and see if the model’s performance improves!

Predicting protein contact map using evolutionary and physical constraints by integer programming
(paper presentation by Zhiyong Wang and Jinbo Xu)

Chosen by Saulo de Oliveira

Last week, I decided to present a quick overview of a Paper Presentation I attended during the ISMB 2013.

The title of the presentation was “Predicting protein contact map using evolutionary and physical constraints by integer programming.” based on a paper by the same name.

Contact prediction (or evolutionary constraint prediction, a term I am much more fond of) was a trendy topic both at the 3DSig (2013) and at the ISMB (2013), with several presentations and posters on the subject.

In this particular presentation, Zhiyong Wang and Jinbo Xu described a new method to identify evolutionary constraints. The big differential of their talk and their work was approaching the problem in a different angle: their aim was to predict contacts when you have a low number of sequences in the multiple sequence alignment (refer to previous posts in the blog for an introduction to contact prediction).

They proposed a combination of machine learning and integer programming (similar to linear programming, again a topic we discussed previously here) to perform their predictions.

The features of the machine learning did not present any innovation. They were quite standard in the field such as mutation rates on PSIBLAST profiles and the Mutual Information (MI). The results of the Random Forest algorithm was employed to formulate constraints in a linear problem. These constraints were used to enforce physical properties of proteins, based mostly on our understanding of secondary structure.

Results seemed positive in both a random test set (CASP10) and 2 other test sets. By positive, I mean there was an improvement on the current state-of-the-art, especially for proteins with 10-1000 sequences in the MSA. Still, their precision was around 30, 40% for the top L/10 predictions (where L is the protein length). Further improvements are still necessary before we can apply these evolutionary constraints to improve protein structure prediction.

Evolution of drug resistance: structural models
(presentation by Maria Safi)

Chosen by Hannah Edwards

I found this talk by Maria Safi (which won the prize for best non-keynote presentation at 3Dsig) to be a really interesting method, despite my complete lack of background knowledge in the area (what are conferences for but to expose you to new ideas, right?).

Their idea was to produce a computationally viable method for identifying drug resistance in a protein’s mutant space. Drugs work by binding to their target protein in such a way as to inhibit its native function. If the protein mutates so as to maintain its native function but impair its binding to the drug it acquires resistance. The problem is, even within a reasonable distance of the native sequence, a proteins’ mutant space is huge, and it’s by no means trivial to test for maintenance of function and binding energy.

The groups’ solution was to recognise that the vast majority of mutant space would not be of interest. As such they send their candidate mutants through a 2-pass search: the first, a quick and easy algorithm to swiftly eliminate the dead end mutants… those that either are not resistant to drug binding or do not maintain their native function, and the second, a more biochemically accurate yet computationally expensive algorithm to be applied to the shortlist identified during the first pass.

The first algorithm is based on restricted dead-end elimination which aims to minimise a simple energy potential based on the protein’s structural stability and it’s binding energy to the drug. The algorithm keeps the backbone structure constant but by differing the side chain conformations, the mutants result in different energy potentials. A mutation at residue r can then be eliminated if an alternative mutation at r will always result in a lower energy potential.

The second algorithm is based on the more sophisticated methodology of MM-PBSA, combining molecular mechanics with the Poisson-Boltzman Surface Area calculations to estimate the free energy of the compound. This final run identifies the candidate mutants.

A significant strength of their method is that it requires only the crystal structures of the drug and target protein. As a purely structural model it eliminates the need for large amounts of training data, which, for newly emerging diseases and drugs, is often impossible to have access to.

The main focus of Maria’s talk however was using these energy potentials to predict evolutionary pathways from a wild-type protein to a resistant candidate. By treating evolution as a random walk through mutant space, weighted by the energy potentials, and assuming selection pressure of resistance, they were able to computationally simulate evolutionary scenarios.

For example, Maria focussed on the ritonavir-HIV protease complex to illustrate this method. The majority of the mutants with resistance to ritonavir which have been observed in nature were predicted by the method. For the candidates that were predicted but have not been seen, further elucidation could be found from the simulated evolutionary pathways: around 60% of these candidates were not accessible under the evolutionary model.

Sequence comes to the Structural Rescue: Identifying Relevant Protein Interfaces in Crystal Structures
(presentation by Jose M. Duarte)

Chosen by Henry Wilman

Jose Duarte presented a tool, EPPIC, which identifies and classifies protein interfaces from pdb structures. The talk was titled ‘Sequence comes to the Structural Rescue: Identifying Relevant Protein Interfaces in Crystal Structures’, and follows from their publication Protein interface classification by evolutionary analysis, Duarte JM, Srebniak A, Schärer MA, Capitani G. BMC Bioinformatics. 2012 Dec 22..

As the title suggests, this uses both structural and sequence information to classify protein contacts as biological or crystal. There is a webserver, and a downloadable version. There are a number of methods that exist to classify interfaces, and this differs in a few ways.

The previous methods typically rely on the area of the interface. As you see in the paper, even the benchmark sets used to test the other methods are biased such that biological contacts have much greater areas than crystal contacts. When the authors constructed a set where the contact area was similar, they found the previous methods performed generally poorly. However, there are a number of ways that you can define the interface or contact area, and specifically what people call ‘core residues’ of the interface. They found one study performed much better on their training set than the others. This defined core residues as ones that lost the majority of their solvent accessible surface area on binding to the interface. A simple cut off of >= 6 core residues at an interface produced a good classification.

In addition to this, they used sequence information. We know that interfaces are important, and often mutations at interface residues are bad. So, for a biological interface, we would expect residues to be better conserved than non-interacting surface residues. The authors used sequence entropy as a measure of the conservation. They calculated this by collecting homologous sequences with PSI-Blast and aligned them using Clustal-Omega. For each position in the alignment, if x is the occupancy frequency for a given amino acid, the sequence entropy is given by the sum over all amino acids of xlog(x). (They actually use a reduced alphabet for this, to avoid penalising mutations to similar amino acids). They then compare the entropy of the ‘core’ residues in the interface to those on the surface of the protein, and those on the periphery of the interface. If the core residues have lower entropy, then the contact is classed as biological. There are simple thresholds for both of these comparisons.

They have three metrics – one structural (number of core residues), and two sequence (entropy of core residues vs. peripheral residues, and entropy of core residues vs. surface residues). They classify based on a majority vote of the three methods. If there are an insufficient number of homologous sequences (i.e. fewer than 8), then they ignore the sequence scores, and classify using the structure only.

So why do we care about protein interfaces? Lots of us work with experimental protein structures. Many of these come from X-ray crystallography experiments. This means that when the structural information is captured, the protein is not isolated – instead it is packed against many other copies of itself. A bit like a brick in a wall – a wall many rows of bricks deep. So our protein is in contact with many others. Some of these contacts occur within the natural environment of the cell, others are a result of the crystal packing structure.
Now, protein interfaces are important. ‘Why?’, I hear you ask. Interfaces are how proteins interact with each other, and with other molecules. Some interfaces are large, some small, some are involved in transient interactions, others in permanent ones. Many diseases occur due to amino acid mutations at these interfaces. These change how the protein functions, which can cause bad things to happen. Similarly, if we can learn how to interact with an interface, we are (theoretically) able to alter the function with some sort of drug, and cause good things (whilst avoiding bad things).

So, this raises a few questions for people who study protein crystal structures. Things like, which bits of my protein interact? Do these interactions happen in a cell? Is my structure somehow distorted by the crystal packing? This EPPIC tool gives us one way to answer these.


Congratulations on reaching the end of this blog post… as a reward, the epic Brandenburg gate (taken by Jin)

Evolutionary fold space preferences

At group meeting last week I focussed, alongside some metaphysical speculation, on a project which has occupied the first half of my DPhil: namely exploring the preferences of both very old and very young protein structures. This work is currently in preparation for publication so I will give only a brief overview and hopefully update the juicy details later. Feel free to contact me for more information.

Proteins are the molecular machinery of the cell. Their evolution is one of the most fundamental processes which has delivered the diversity and complexity of life that we see around ourselves today. Despite this diversity, protein domains (independent folding units) of known structure fall into just over 1,000 unique SCOP folds.

This project has sought to identify how populations of proteins at different stages of evolution explore their possible structure space.

Superfamily ages

Structural domains are clustered at different levels of similarity within the SCOP classification. At the superfamily level this classification attempts to capture evolutionary relationships through structural and functional similarities even if sequence diversion has occurred.

Evolutionary ages for these superfamilies are then estimated from their phylogenetic profiles across the tree of life. These ages are an estimate of the structural ancestor for a superfamily.




The phylogenetic occurrence profiles are constructed using predictions of superfamilies on completely sequenced genomes using HMMs and taken from the SUPERFAMILY database. Given an occurrence profile and a phylogenetic tree (for robustness we consider several possible reconstructions of the tree of life) we use a maximum parsimony algorithm (proposed by Mirkin et. al) which estimates the simplest scenario of loss events (domain loss on a genome) and gain events (domain gain) at internal nodes on the tree which explains the occurrence profile. The age estimate is the height of the first gain event, normalised between 0 (at the leaves of the tree) and 1 (at the root).

We estimated ages for 1,962 SCOP superfamilies and compared several properties relating to their primary, secondary and tertiary structures, as well as their functions. In particular, we compared two populations of superfamilies: ancients, with an age of 1, and new-borns, with an age < 0.4. Full details of our results will hopefully be published shortly so watch this space!

Journal Club: A mathematical framework for structure comparison

For my turn at journal club I decided to look at this paper by Liu et. al. from FSU.

The paper, efficiently titled ‘A mathematical framework for protein structure comparison’, is motivated by the famously challenging problem of structure comparison. Historically, and indeed presently, the most respected methods of structural classification (dividing up the protein universe by structural similarity) have required a significant amount of manual comparison. This is essentially because we have no formal consensus on what similarity between protein structures truly means and how it should be measured. There are computational challenges to efficient alignment as well but, without a formal framework and a well-defined metric, structure comparison remains an ill-posed problem.

The solution for the authors was to represent each protein structure as a continuous, elastic curve and define the deformation between any two curves as the distance between their structures. We consider, hypothetically at first, a geometric space where each point represents a 3D curve (or potential protein structure). One problem with studying protein structures as geometric objects in Euclidean 3D space is that we really want to ignore certain shape-preserving transformations, such as translations, rotations, scaling and re-parametrization. Ideally therefore, we’d like our geometric space to represent curves unique up to a combination of these transformations. That is, if a protein structure is picked up, turned upside down, then put down somewhere else, it should still be treated as the same shape as it was before. Let’s call this space the shape space S.

Key to the formulation of this space is the representation of a protein structure by its square root velocity function (SRVF). We can represent a protein structure by a continuous function β, which maps the unit interval onto 3D space: β: [0,1] → R3. The SRVF of this curve is then defined as:


where β'(t) is the derivative of β. q(t) contains both the speed and the direction of β but, since it is the derivative, it is invariant to any linear translation of β. So, simply by using the SRVF representation of a curve, we have eliminated one of the shape-preserving transformations. We can eliminate another, rescaling, by requiring β(t) to have unit length: ∫|β'(t)|2dt = ∫|q(t)|2dt = 1.

The space of q‘s is not the shape space S as we still have rotations and re-parametrizations to account for but it is a well-defined space (a unit sphere in the Hilbert space L2 if you were wondering). We call this space the preshape space C. The main advantage of the SRVF representation of curves is that the standard metric on this space explicitly measures the amount of deformation, in terms of stretching and bending, between two curves.

All that is left to do is to eliminate the effects of an arbitrary rotation and re-parametrization. In fact, instead of working directly in S, we choose to remain in the preshape space C as its geometry is so well-defined (well actually, it’s even easier on the tangent space of C but that’s a story for another time). To compare two curves in C we fix the first curve, then find the optimal rotation and parametrization of the second curve so as to minimise the distance between them. Computationally, this is the most expensive step (although almost every other global alignment will need this step and more) and consists of firstly applying singular value decomposition to find the rotation then a dynamic programming algorithm to find the parametrization (this is the matching function between the two backbone structures… you could add secondary structure/sequence constraints here to improve the relevance of the alignment).

Now we can compare q1 and q2 in C. Not only can we calculate the distance between them in terms of the deformation from one to the other, we can also explicitly visualise this optimal deformation as a series of curves between q1 and q2, calculated as the geodesic path between the curves in C:


where θ = cos-1( ∫ <q1,q2>dt) is the distance between q1 and q2 in S.


The geodesic path between protein 1MP6, the left-most structure, and protein 2K98, the right-most structure.

As a consequence of this formal framework for analysing protein structures it’s now possible, in a well-defined way, to treat structures as random variables. Practically, this means we can study populations of protein structures by calculating their mean structure and covariances.