alignment | Oxford Protein Informatics Group

In last week’s group meeting, I introduced our network comparison method (Netdis) and presented some new results that enable the method to be applied to larger networks.

The most tractable methods for network comparison are those which compare at the level of the entire network using statistics that describe global properties, but these statistics are not sensitive enough to be able to reconstruct phylogeny or shed light on evolutionary processes. In contrast, there are several network alignment based methods that compare networks using the properties of the individual proteins (nodes) e.g. local network similarity and/or protein functional or sequence similarity. The aim of these methods is to identify matching proteins/nodes between networks and use these to identify exact or close sub-network matches. These methods are usually computationally intensive and tend to yield an alignment which contains only a relatively small proportion of the network, although this has been alleviated to some extent in more recent methods.

Thus, we do not follow the network alignment paradigm, but instead we take our lead from alignment-free sequence comparison methods that have been used to identify evolutionary relationships. Alignment-free methods based on k-tuple counts (also called k-grams or k-words) have been applied to construct trees from sequence data. A key feature is the standardisation of the counts to separate the signal from the background noise. Inspired by alignment-free sequence comparison we use subgraph counts instead of sequence homology or functional one-to-one matches to compare networks. Our proposed method, Netdis, compares the subgraph content not of the networks themselves but instead of the ensemble of all protein neighbourhoods (ego-networks) in each network, through an averaging many-to-many approach. The comparison between these ensembles is summarised in a Netdis value, which in turn is used as input for phylogenetic tree reconstruction.

Effect of sub-sampling egos on the resulting grouping of networks generated by Netdis. Higher Rand index values indicate better fit to non-sampling results.

Fig1: Effect of sub-sampling egos on the resulting grouping of networks generated by Netdis. Higher Rand index values indicate better fit to non-sampling results.

Extensive tests on simulated and empirical data-sets show that it is not necessary to analyze all possible ego-networks within a network for Netdis to work. Our results indicate that in general, randomly sampling around 10% of egos in each network results in a very similar clustering of networks on average, compared to the tree with 100% sampling (Fig 1). This result has important implications for use-cases where eextremely large graphs are to be compared (e.g > 100,000 nodes). Related to the ego-nework sub-sampling idea is the notion of size-limiting the ego-networks that are to be analyzed by the algorithm. Our tests show that the vast majrity of ego-netowrks in most networks have a relatively low coverage of the overall network. Moreover, by introducing lower-size threshold on the egos, we observe better results on average. Together, this means a limited range of ego-network sizes to be analyzed for each network, which should lead to better statitical properties as the sub-sampling scheme is inspired by bootstrapping.

I’ve talked about protein structure alignment before in the context of a rather novel, mathematical approach. This time I wanted to revisit the topic in a general sense, using a more established algorithm as a case study. MAMMOTH stands for Matching Molecular Models Obtained from Theory and was first published in 2002. Since then it has been cited nearly 400 times and the underlying algorithm has been extended to a multiple alignment program: MAMMOTH-mult.

Establishing biologically relevant and consistent alignments between protein structures is one of the major unsolved problems in computational bioinformatics. However, it’s an important part of many challenges that we face: such as establishing homology between distantly related proteins, functional inference for unannotated proteins, and evaluating the accuracy of models of predicted structure for competitions such as CASP.

Problem Outline

In essence the problem of protein structure alignment can be outlined by considering two ordered sets of coordinates, A = {a₁,a₂,…,a_n} and B = {b₁,b₂,…,b_m}, representing points in 3D space. In most cases these points will be the location of the C_α atoms along each structure’s backbone. The sets A and B might be completely different lengths and, if an alignment exists, are almost certainly orientated differently compared to each other.

Establishing an alignment between these sets is equivalent to two steps:

Establish a match M = {(a_i,b_j) | a_i ∈ A, b_j ∈ B}
Rotate and translate A onto B so that equivalent atoms are as close as possible.

Of course, it is not immediately clear how to best fulfill these requirements. In particular, we don’t really know what features to prioritise for a biologically relevant match. Should we try to match secondary structure elements and what penalty should we attach to mismatched elements? How about maintaining the correct hydrogen bonding patterns between matched residues? And how much weight should we put on the matched atoms being consecutive in each set (i.e. how should we penalise gaps)?

The second step is equally ambiguous. Especially as there is no consensus on what the correct interpretation of close is. Minimising the RMSD between equivalent atoms is a popular choice of distance measure. However, as the MAMMOTH paper points out, RMSD is often dominated by the mismatched portions of remotely related structures and is thus largely inappropriate in these cases. Furthermore, even if we have a well-defined distance metric, should the superposition prioritise minimising the distances between nearly identical parts of the different structures, at the expense of less similar substructures? Or should the emphasis be on maintaining as lengthy a match as possible at the possible cost of a lower closeness of fit? How about the relative importance of a close fit for atoms in the core of the structure vs. those on the surface?

The majority of these questions remain unanswered and as a result it is often hard to validate alignments as we simply do not know what the right answer is. In fact, in many cases, manual analysis is preferred over any of the available computational techniques.

In this post I’ll go through how the MAMMOTH algorithm approaches each of these steps. For many of the above questions MAMMOTH does not postulate a solution, primarily because, as its name suggests, it was designed to assess prediction models which are often at low resolutions and lacking secondary structure or hydrogen bonding information. I think it’s important to keep these questions in mind, but there’s certainly no necessity to design a programme which deals with them all.

Step 1: Pairing up residues (similarity matrix)

In order to establish a match between equivalent atoms in A and B, MAMMOTH, like several other structural alignment algorithms, uses a well-established alignment technique: a similarity matrix (often inferred from and referenced as a distance matrix). A similarity matrix for alignment is an n x m array where each entry S(a_i,b_j) represents the pairwise similarity score between residues a_i and b_j. An alignment between the residues of A and B is any non-decreasing path (that is, a pair (a_i,b_j) in the path must appear later in the ordering of coordinates of both A and B than the preceding pair of residues in the path) from the top left corner of the array (a₁,b₁) to the bottom right corner (a_n,b_m). For example the following path can be interpreted as an alignment between A = {a₁, …, a₁₁} and B = {b₁, …, b₈}

Any alignment can be scored by summing up the similarity scores along this path, while penalising any gaps in an appropriate way (normally, these algorithms use trial and error to decide on sensible penalties). For example, the above alignment would have the score S = S(a₁,b₁) + S(a₂,b₂) + S(a₃,b₃) + S(a₇,b₄) + S(a₈,b₅) + S(a₉,b₆) + S(a₁₀,b₇) + S(a₁₁,b₈) + α + 2β, where α and β are gap opening and gap extension penalties respectively. The optimal alignment is simply the alignment which maximises this score.

For sequence alignments similarity scores can be assigned to residues from substitution tables like BLOSUM. However, it is not immediately clear of an appropriate equivalent for structures. MAMMOTH, like several other algorithms, defines the similarity between different residues by examining their local structural landscape. Specifically, this means comparing fragments of each backbone, centred on the residue of interest. MAMMOTH uses the URMS distance between heptapeptide fragments. This distance is illustrated below using 2D chains and tripeptide fragments.

Comparing residues a₂ and b₃ involves looking at the directions between each successive residue of the fragment. Each vector is mapped to the unit sphere, beginning at the origin and ending at the surface of the sphere (in this case 2 vectors are considered, and in MAMMOTH’s case 6 different 3D vectors are mapped). The optimal rotation is found, superposing equivalent vectors as best as possible, and then the RMSD of the endpoints on the surface of the sphere is calculated as URMS(a_i,b_j).

Aside: The optimal superposition of a set of vectors is actually a non-trivial problem. It is essentially equivalent to step 2 in our alignment protocol outlined above, but is significantly easier for the 6 vectors characterising a fragment in MAMMOTH’s algorithm.

Finally, S(ai,bj) is calculated by converting the distance into a similarity measure:

where URMS^R is the expected URMS of a random set of vectors and:

The optimal alignment through this MAMMOTH matrix is the path which maximises the sum of similarities between matched residues (each residue being at the centre of the heptapeptide fragment) using gap opening and extension penalties of 7.00 and 0.45 respectively.

Step 2: Global superposition (MaxSub)

The above alignment results in a match M’ optimising the local structural similarity of residues in each structure, however, their is no guarantee that this will result in a set of coordinates close in global space. In order to finalise the match set M ⊆ M’ as well as calculating the optimal superposition of the paired residues of A onto their equivalent points in B, MAMMOTH use the MaxSub algorithm. This is a very efficient algorithm (worth a read if you’re that way inclined) for calculating the maximal subset from a set of paired up atoms which are close in global space. MAMMOTH decide that close means < 4A away after superposition. They do not try to optimise a closer superposition than that but attempt to find the largest possible set of matched residues.

The MaxSub algorithm relies on the assumption (made for computational tractability) that the final subset M ⊆ M’ will have, somewhere, a set of at least four residues consecutive in M’. The algorithm then starts with every possible seed of four consecutive residues (just to illustrate the power of the assumption in reducing computational time: for a 150 residue protein there are just 147 such seeds, but over 2 million sets of four non-consecutive residues!! And it’s a pretty reasonable assumption to make as well). The MaxSub algorithm then calculates the superposition for those four matched pairs, extending the set of residues that are <4A away from their partners, recalculating the superposition using these new pairs as well, then removing any pairs which are no longer within the threshold of each other. It repeats these steps, gradually extending the set M, until the algorithm converges.

Scoring the alignment

Using the two approaches outlined above, MAMMOTH generates an alignment between the two input structures. In order to summarise the significance of this alignment, the algorithm generates the PSI score: the percentage structural identity (which is simply the size of the maximum subset divided by the length of the shortest protein). As a global measure of the strength of similarity the PSI score is poorly constructed and scales with protein length. In order to adjust for this bias, MAMMOTH fits a Gumbel distribution to PSI scores obtained from random structure comparisons between unrelated proteins at bins of different lengths. This results in a z-score measuring, instead of the PSI of an alignment, the likelihood of obtaining a PSI score as good as that by chance between any two proteins of the same lengths.

Oxford Protein Informatics Group

or "OPIG" to friends

Tag Archives: alignment

A topology-based distance measure for network data

MAMMOTH: a case study in protein structure alignment