# Finding Communities in Networks: Link Clustering and the Affiliation Graph Model

The research I do revolves around how to break down protein interaction networks (PINs) into functional modules, or communities, using community detection methods. These communities are groups of proteins which interact more with one another than with the rest of the network. We decompose the PINs in this manner, as it is very difficult to determine how a protein functions through its interactions by simply looking at the entire network. The sheer amount of data in one of these networks, clouds the information we would actually like to see, as explained in my previous blog post.

In a journal club in August, I presented the overlapping community detection method Link Clustering by Ahn et al. which is the first community detection method I am aware of that assigns nodes to communities by clustering the edges between the nodes. I contrasted Link Clustering with the Affiliation Graph Model method proposed by Yang and Leskovec, as the authors use the same comparison technique to validate their methods. In a world where studies usually only work on a single network using a specific method, having two independent papers that use the same validation method is something that gets me much more excited than it probably should.

Instead of assigning nodes to communities, Link Clustering focusses on the edges in a network. Clustering edges is essentially an equivalent problem to clustering nodes. This becomes obvious if you take a graph, decide to call all the edges “blobs” and say two of your new “blobs” have an edge between them if they share a node in the original network. This is called creating the “line graph” of the network and performing “blob” clustering on this is the same as performing edge clustering on the initial network.

To come back from my tangent, if an edge is assigned to community A, this means the nodes on either side of this edge are assigned community A by association. By assigning each edge to an individual community, Link Clustering creates overlapping node communities as shown in Figure 1 below.

Figure 1: Example of a simple, link-clustered network. Nodes can be seen to be in multiple communities as edges are assigned to communities (c.f. node 4 in red, green and yellow communities). [Ahn et al., 2010]

In Link Clustering, edges are placed into the same community, based on similarity scores computed between all connected edges. Edge clusters are computed from these similarity scores using single-linkage clustering, where the edges with the highest similarity scores are iteratively merged into the same community. Using this method all edges end up in the same community in the end, so a threshold for the similarity score needs to be found at which the network is partitioned into communities in the best way. This threshold value represents a type of resolution parameter of the for community detection. At a low threshold there are few, large communities as many edges have been clustered together, while at a high threshold only highly similar edges are in the same community and there are many, small communities. Ahn et al. propose maximizing a function called the partition density to find the optimal resolution threshold. This function is simply a weighted average of the density of the communities. For those incredibly keen, it is given by the equation:

$D = \frac{2}{M} \sum_c m_c \frac{m_c - (n_c -1)}{(n_c - 2)(n_c -1)}$

Here, $M$ represents the total number of edges in the network and $m_c$ and $n_c$ represent the number of edges and nodes in the community $c$ respectively.

All of this partitioning is based on a similarity score between two edges. The crux of this method, the scoring measure, is explained in Figure 2.

Figure 2: Schematic Diagram showing how the similarity between edges $e_{ik}$ and and $e_{jk}$ is calculated in Link Clustering. The overlap of the inclusive neighbour sets of nodes i and j is divided by the union of these sets. The inclusive neighbour set of node i is computed by taking the neighbours of node i and including node i itself. The nodes in grey are overlapping between both sets, therefore the similarity between edges $e_{ik}$ and $e_{jk}$ is 4/12 or 1/3. [Ahn et al. 2010]

## Affiliation Graph Model

The method which I would like to compare Link Clustering to is the Affiliation Graph Model (AGM). Briefly, this is a statistical community detection algorithm which generates overlapping communities. Node affiliations are recorded in the Community Affiliation Matrix shown in Figure 2.

Schematic of the Community Affiliation Matrix, which shows nodes (coloured squares) being affiliated with communities (circles A, B, and C) as shown by the arrows. Nodes can have multiple community affiliations as indicated by red and purple node squares. The probability of interaction between two nodes which are both in only community A, is $p_A$. [Yang and Leskovec, 2012]

The AGM has a probability of interaction for each community, denoted $p_A$ for community $A$. For nodes $u$ and $v$, which may share several community affiliations, the probability of interaction, $p(u,v)$ is given by the equation:

$p(u,v) = 1 - \prod_{k \in C_{uv}} (1 - p_k),$

where $k$ denotes a community in the set of communities shared by nodes $u$ and $v$, $C_{uv}$. This model is be fitted to the network using a Metropolis-Hastings algorithm with a Markov chain constructed by pre-defined defined steps in the space of possible Community Affiliation Matrices (c.f. Yang and Leskovec 2012).

## Result Comparison Method

Link Clustering itself has some very interesting results, especially relating to looking at real-world networks at different resolutions (which can be found in the paper). However, I want to focus on the validation method used in the paper which compares results generated from Link Clustering with that of other methods. This comparison method evaluates four different aspects of the partitions generated by different methods. These four aspects are: Community Quality, Overlap Quality, Community Coverage, and Overlap Coverage. The “Quality” aspects relate to network metadata indicating how good the obtained communities are, and the “Coverage” aspects relate to the amount of information extracted from a network.

The metadata referred to above is a general term for additional data available about the nodes in a network. In the case of PINs this metadata is related to the function of proteins in the network (their Gene Ontology annotations). Loosely, proteins with similar functions should be placed in the same community to improve the Community Quality measure. Overlap Quality concerns the boundaries between generated communities. If the functions assigned to proteins show similar boundaries between groups of functions, the Overlap Quality is high.

The “Coverage” values are more basic calculations. A partition has a high Community Coverage if the fraction of nodes assigned to communities with three or more nodes is high (non-trivial communities). A community of size two is uninformative, as it is the result of a single edge being in a community by itself. Overlap Coverage is simply the number of non-trivial community memberships per node. The two “Coverage” values are thus equal for non-overlapping community detection methods.

When comparing community detection methods, these four measures are computed for a partition generated by each method, and then rescaled so that the maximum score achieved by any method is 1 for each measure independently. These values are then added to give a score between 0 and 4 for each community detection method.

## Results

The results shown in Figure 4 were generated by Yang and Leskovec to show their AGM outperforms Link Clustering (and other methods) on a variety of networks. While it is noteable that they used the same metric and networks used by Ahn et al. when proposing Link Clustering, this Figure must be taken with a pinch of salt. Yang and Leskovec acknowledge that the metric proposed by Ahn et al. rewards methods which find more communities in a network and thus do not fit the number of communities judged to be best by the AGM, but instead fit the same number as fitted in Link Clustering. Furthermore, it is peculiar that half of the networks used for comparison by Ahn et al. have disappeared in this comparison.

Figure 4: Community detection method comparison for Link Clustering (L), Clique Percolation (C), Mixed-Membership Stochastic Block Model (M) and the Affiliation Graph Model (A) on different PINs (PPI) and other social networks. Methods are compared by the metric proposed by Ahn et al. [Yang & Leskovec 2012]

To conclude we can say that while it is very interesting that two papers use the same method to compare their methods to others for validation purposes, it wasn’t done completely accurately here. The comparison metric is however an interesting development to create a gold standard for method comparison. For my purposes, Community Quality is the most important, so maybe a weighted version of this may be more interesting.

# ECCB 2014 (Strasbourg)

The European Conference on Computational Biology was held in Strasbourg, France this year. The conference was attended by several members of OPIG including Charlotte Deane, Waqar Ali, Eleanor Law, Jaroslaw Nowak and Cristian Regep. Waqar gave a talk on his paper proposing a new distance measure for network comparison (Netdis). There were many interesting talks and posters at the conference, and brief summaries of the ones we found most relevant are given below.

The impact of incomplete knowledge on the evaluation of protein function prediction: a structured output learning perspective.

Authors: Yuxiang Jiang, Wyatt T. Clark, Iddo Friedberg and Predrag Radivojac

Chosen by: Waqar Ali

The automated functional annotation of biological macromolecules is a problem of computational assignment of biological concepts or ontological terms to genes and gene products. A number of methods have been developed to computationally annotate genes using standardized nomenclature such as Gene Ontology (GO). One important concern is that experimental annotations of proteins are incomplete. This raises questions as to whether and to what degree currently available data can be reliably used to train computational models and estimate their performance accuracy.

In the paper, the authors studied the effect of incomplete experimental annotations on the reliability of performance evaluation in protein function prediction. Using the structured-output learning framework, they provide theoretical analyses and carry out simulations to characterize the effect of growing experimental annotations on the correctness and stability of performance estimates corresponding to different types of methods. They also analyzed real biological data by simulating the prediction, evaluation and subsequent re-evaluation (after additional experimental annotations become available) of GO term predictions. They find a complex interplay between the prediction algorithm, performance metric and underlying ontology. However, using the available experimental data and under realistic assumptions, their results also suggest that current large-scale evaluations are meaningful and surprisingly reliable. The choice of function prediction methods evaluated by the authors is not exhaustive however and it is quite possible that other methods might be much more sensitive to incomplete annotations.

Towards practical, high-capacity, low-maintenance information storage in synthesized DNA.

Authors: Nick Goldman, Paul Bertone, Siyuan Chen, Christophe Dessimoz, Emily M. LeProust, Botond Sipos & Ewan Birney

Chosen by: Jaroslaw Nowak

This was one of the keynote speaker talks. One of the authors, Ewan Birney discussed how viable is storing digital information in DNA code. The paper talked about storing 739 kilobytes of hard-disk storage in the genetic code. The data stored included all 154 of Shakespeare’s sonnets in ASCII text, a scientific paper in PDF format, a medium-resolution colour photograph of the European Bioinformatics Institute in JPEG format, a 26-s excerpt from Martin Luther King’s 1963 ‘I have a dream’ speech in MP3 format and a Huffman code used in their study to convert bytes to base-3 digits (ASCII text). The authors accomplished this by first converting the binary data into base-3 numbers using Huffman coding (which represents the most common piece of information using the least bits). The base-3 numbers where then converted into a genetic code in such a way that produced no homopolymers. The authors also proved that they can read the encoded information and reconstruct the data with 100% accuracy.

Using DNA for information storage could very soon become cost-effective for sub-50 years archiving. The current costs of the process are $12,400/MB for writing and$220/MB for reading information, with negligible costs of copying. Nevertheless, if the current trends persist, we could see a 100 – fold drop in costs in less than a decade. The main advantages of storing information in DNA is the low maintenance and durability of the medium (intact DNA fragments have been recovered from samples that are tens of thousands years old) as well as little physical space required to store the information (~2.2 PB/g)

PconsFold: Improved contact predictions improve protein models.

Authors: Michel M, Hayat S, Skwark MJ, Sander C, Marks DS and Elofsson A.

Chosen by: Eleanor Law

De novo structure prediction by fragment assembly is a very difficult task, but can be aided by contact prediction in cases where there is plenty of sequence data. Contact prediction has also significantly improved recently, using statistical methods to separate direct from indirect contact information.

PSICOV and plmDCA are two such methods, providing contacts which can be used by software such as Rosetta as an additional energy term. PconsFold combines 16 different sets of contact predictions by these programs, built from different sequence alignments, with secondary structure and solvent accessibility prediction. The output of the deep learning process on these inputs is more reliable that the individual contact predictions alone, and produces more accurate models. The authors found that using only 2,000 decoys rather than 20,000 did not greatly harm their results, which is encouraging as the decoy generation stage is the particularly resource intensive stage. Using a balance between the Rosetta energy function and weight of contact prediction, the optimal number of constraints to include was around 2 per residue, compared to 0.5 per residue for PSICOV or plmDCA alone.

The PconsFold pipeline is not always able to make full use of the contact prediction, as accuracy of contact prediction on the true structure can be higher than that in the model. This is a case where the conformational search is not effective enough to reach the correct answer, though it would be scored correctly if it were obtained. All-beta proteins are the most difficult to predict, but PconsFold compares favourably to EVfold-PLM for each of the mainly alpha, mainly beta, and alpha & beta classes.

# A program to aid primary protein structure determination -1962 style.

This year, OPIG have been doing series of weekly lectures on papers we considered to be seminal in the field of protein informatics. I initially started looking at “Comprotein: A computer program to aid primary protein structure determination” as it was one of the earliest (1960s) papers discussing a computational method of discovering the primary structure of proteins. Many bioinformaticians use these well-formed, tidy, sterile arrays of amino acids as the input to their work, for example:

MGLSDGEWQL VLNVWGKVEA DIPGHGQEVL IRLFKGHPET LEKFDKFKHL KSEDEMKASE DLKKHGATVL TALGGILKKK GHHEAEIKPL AQSHATKHKI PVKYLEFISE CIIQVLQSKH PGDFGADAQG AMNKALELFR KDMASNYKEL GFQG
(For those of you playing at home, that’s myoglobin.)

As the OPIG crew come from a diverse background and frequently ask questions well beyond my area of expertise, if for nothing other than posterior-covering, I needed to do some background reading. Though I’m not a researcher by trade any more, I began to realise despite the lectures/classes/papers/seminars I’d been exposed to, regarding all the clever things you do with a sequence when you have it, I didn’t know how you would actually go from a bunch of cells expressing (amongst a myriad of other molecules) the protein you were interested in, to the neat array of characters shown above. So without further ado:

The first stage in obtaining your protein is: cell lysis and there’s not much in it for the cell.
Mangle your cell using chemicals, enzymes, sonification or a French press (not your coffee one).

The second stage is producing a crude extract by centrifuging the above cell-mangle. This, terrifyingly, appears to be done between 10,000G and 100,000G and removes the cellular debris leaving it as a pellet in the bottom of the container, with the supernatant containing little but a mix of the proteins which were present in the cytoplasm along with some additional macromolecules.

Stage three is to purify the crude extract. Depending on the properties of the protein you’re interested in, one or more of the following stages are required:

• Reverse-phase chromatography to separate based on hydrophobicity
• Ion-exchange to separate based on the charge of the proteins
• Gel-filtration to separate based on the size of the proteins

If all of the above are preformed, whilst the sequence of these variously charged/size-sorted/polar proteins will still be still unknown, they will now be sorted into various substrates based upon their properties. This is where the the third stage departs from science and lands squarely in the realm of art. The detergents/protocols/chemicals/enzymes/temperatures/pressures of the above techniques all differ depending on the hydrophobicity/charge/animal source of the type of protein one is aiming to extract.

Since at this point we still don’t know their sequence, working out the concentrations of the various constituent amino acids will be useful. One of the simplest methods of determining the amino acid concentrations of a protein is follow a procedure similar to:

Heat the sample in 6M HCL at at a temperature of 110C for 18-24h (or more) to fully hydrolyse all the peptide bonds. This may require an extended period (over 72h) to hydrolyse peptide bonds which are known to be more stable, such as those involving valine, isoleucine and leucine. This however can degrade Ser/Thr/Tyr/Try/Gln and Cys which will subsequently skew results. An alternative is to raise the pressure in the vessel to allow temperatures of 145-155C to for 20-240 minutes.

TL;DR: Take the glassware that’s been lying about your lab since before you were born, put 6M hydrochloric acid in it and bring to the boil. Take one difficultly refined and still totally unknown protein and put it in your boiling hydrochloric acid. Seal the above glassware in order to use it as a pressure vessel. Retreat swiftly whilst the apparatus builds up the appropriate pressure and cleaves the protein as required. -What could go wrong?

At this point I wondered if the almost exponential growth in PDB entries was due to humanity’s herd of biochemists now having been thinned to those which remained simply being several generations worth of lucky.

Once you have an idea of how many of each type of amino acid comprise your protein, we can potentially rebuild it. However at this point it’s like we’ve got a jigsaw puzzle and though we’ve got all the pieces and each piece can only be one of a limited selection of colours (thus making it a combinatorial problem) we’ve no idea what the pattern on the box should be. To further complicate matters, since this isn’t being done on but a single copy of the protein at a time, it’s like someone has put multiple copies of the same jigsaw into the box.

Once we have all the pieces, to determine the actual sequence, a second technique needs to be used. Though invented in 1950, Edman degradation appears not to have been a particularly widespread protocol, or at least it wasn’t in the National Biomedical Research Foundation from which the above paper emerged. This means of degradation tags the N-terminal amino acid and cleaves it from the rest of the protein. This can then be identified and the protocol repeated. Whilst this would otherwise be ideal, it suffers from a few issues in that it takes about an hour per cycle, only works reliably on sequences of about 30 amino acids and doesn’t work at all for proteins which have their n-terminus bonded or buried.

Instead, the refined protein is cleaved into a number of fragments at known points using a single enzyme. For example, Trypsin will cleave on the carboxyl side of arginine and lysine residues. A second copy of the protein is then cleaved using a different enzyme at a different point. These individual fragments are then sorted as above and their individual (non-sequential) components determined.

For example, if we have a protein which has an initial sequence ABCDE
Which then gets cleaved by two different enzymes to give:
Enzyme 1 : (A, B, C) and (D, E)
Enzyme 2 : (A, B) and (C, D)

We can see that the (C, D) fragment produced by Enzyme 2 overlaps with the (A, B, C) and (D, E) fragments produced by Enzyme 1. However, as we don’t know the order in which the amino acid appear within in each fragment, thus there are a number of different sequences which can come to light:
 Possibility 1 : A B C D E Possibility 2 : B A C D E Possibility 3 : E D C A B Possibility 4 : E D C B A 

At this point the paper comments that such a result highlights to the biochemist that the molecule requires further work for refinement. Sadly the above example whilst relatively simple doesn’t include the whole host of other issues which plague the biochemist in their search for an exact sequence.

# The origins of exponential random graph models

The article An Exponential Family of Probability Distributions for Directed Graphs, published by Holland and Leinhardt (1981), set the foundation for the now known exponential random graph models (ERGM) or p* models, which model jointly the whole adjacency matrix (or graph) $X$. In this article they proposed an exponential family of probability distributions to model $P(X=x)$, where $x$ is a possible realisation of the random matrix $X$.

The article is mainly focused on directed graphs (although the theory can be extended to undirected graphs). Two main effects or patterns are considered in the article: Reciprocity, which relates to appearance of symmetric interactions ($X_{ij}=1 \iff X_{ji}=1$) (see nodes 3-5 of the Figure below).

and, the Differential attractiveness of each node in the graph, which relates to the amount of interactions each node “receives” (in-degree) and the amount of interactions that each node “produces” (out-degree) (the Figure below illustrates the differential attractiveness of two groups of nodes).

The model of Holland and Leinhardt (1981), called p1 model, that considers jointly the reciprocity of the graph and the differential attractiveness of each node is:

$p_1(x)=P(X=x) \propto e^{\rho m + \theta x_{**} + \sum_i \alpha_i x_{i*} + \sum_j \beta_j x_{*j}},$

where $\rho,\theta,\alpha_i,\beta_j$ are parameters, and $\alpha_*=\beta_*=0$ (identifying constrains).  $\rho$ can be interpreted as the mean tendency of reciprocation$\theta$ can be interpreted as the density (size) of the network, $\alpha_i$ can be interpreted as as the productivity (out-degree) of a node, $\beta_j$ can be interpreted as the attractiveness (in-degree) of a node.

The values $m, x_{**}, x_{i*}$ and $x_{*j}$ are: the number of reciprocated edges in the observed graph, the number of edges, the out-degree of node i and the in-degree of node j; respectively.

Taking $D_{ij}=(X_{ij},X_{ji})$, the model assumes that all $D_{ij}$ with $i are independent.

To better understand the model, let’s review its derivation:

Consider the pairs $D_{ij}=(X_{i,j},X_{j,i}),\,i and describe the joint distribution of $\{D_{ij}\}_{ij}$, assuming all $D_{ij}$ are statistically independent. This can be done by parameterizing the probabilities

$P(D_{ij}=(1,1))=m_{ij} \text{ if } i

$P(D_{ij}=(1,0))=a_{ij} \text{ if } i\neq j,$

$P(D_{ij}=(0,0))=n_{ij} \text{ if } i

where $m_{ij}+a_{ij}+a_{ji}+n_{ij}=1,\, \forall \, i.

$P(X=x)=\prod_{i

where $\rho_{ij}=log(m_{ij}n_{ij} / a_{ij}a_{ji})$ for $i, and $\theta_{ij}=log(a_{ij}/n_{ij})$ for $i\neq j$.

It can be seen that the parameters $\rho_{ij}$ and $\theta_{ij}$ can be interpreted as the reciprocity and differential attractiveness, respectively. With a bit of algebra we get:

$exp(\rho_{ij})=[ P(X_{ij}=1|X_{ji}=1)/P(X_{ij}=1|X_{ji}=0) ]/[ P(X_{ij}=1|X_{ji}=0) / P(X_{ij}=0|X_{ji}=0) ],$
and
$exp(\theta_{ij})=P(X_{ij}=1|X_{ji}=0)/P(X_{ij}=0|X_{ji}=0).$

Now, if we consider the following restrictions:

$\rho_{ij}=\rho,\, \forall i, and $\theta_{ij}=\theta+\alpha_i + \beta_j,\, \forall i\neq j$ where $\alpha_*=\beta_*=0$.

With some algebra we get the proposed form of the model

$p_1(x)=P(X=x) \propto e^{\rho m + \theta x_{**} + \sum_i \alpha_i x_{i*} + \sum_j \beta_j x_{*j}}.$

# Modeling of Protein Interaction Networks

In the group meeting on the 20th of August, I presented the paper by Vazquez et al (2002). This was one of the first papers proposing the duplication divergence model of evolution for protein interaction networks, and thus has had a significant impact on the field, inspiring many variants of the basic model. The paper starts out by noting that the yeast protein network has the ‘small world’ property – following along links in the network, it requires only a handful of steps to go from any one protein to any other. Another property is the manner in which the links are shared out among the various proteins: empirically, the probability that a protein interacts with k other proteins follows a power-law distribution.

Vazquez et al. show how evolution can produce scale-free networks. They explore a model for the evolution of protein networks that accurately reproduces the topological features seen in the yeast S. cerevisae. As the authors point out, proteins fall into families according to similarities in their amino-acid sequences and functions, and it is natural to suppose that such proteins have all evolved from a common ancestor. A favoured hypothesis views such evolution as taking place through a sequence of gene duplications – a relatively frequent occurrence during cell reproduction. Following each duplication, the two resulting genes are identical for the moment, and naturally lead to the production of identical proteins. But between duplications, random genetic mutations will lead to a slow divergence of the genes and their associated proteins. This repetitive, two-stage process can be captured in a relatively simple model for the growth of a protein interaction network .

Simulations by the author show that the model, starting out with a seed capture the degree distribution of the yeast network with high fidelity (Fig 1) and also possesses the quality of high tolerance to random node removal seen in biological networks. While the results are more qualitative in nature, the model still serves as the basis of most biologically rooted explanations of protein network evolution, with minor improvements. Some of these additions have been the use of asymmetric divergence, whole genome duplication events as well as interaction site modelling. As the jury is still out on what model (if any) best fits current interaction data, the basic model is still relevant as a benchmark for newer models.

Fig. 1. Zipf plot for the PIN and the DD model with p = 0.1, q = 0.7 with N = 1,825. k is the connectivity of a node and r is its rank in decreasing order of k. Error bars represent standard deviation on a single network realization.

# PLOS Comp Bio’s front page!

The work of our molecular dynamic expert, Bernhard Knapp, has made the front page of the PLOS Computational Biology . The image shows how CDR3 loops scan a peptide/MHC complex (Figure below: Snapshot from the website). Immune cells in the human body screen other cells for possible infections. The binding of T-cell receptors (TCR) and parts of pathogens bound by major histocompatibility complexes (MHC) is one of the activation mechanisms of the immune system. There have been many hypotheses as to when such binding will activate the immune system. In this study we performed the, to our knowledge, largest set of Molecular Dynamics simulations of TCR-MHC complexes. We performed 172 simulations each of 100 ns in length. By performing a large number of simulations we obtain insight about which structural features are frequently present in immune system activating and non-activating TCR-MHC complexes. We show that many previously suggested structural features are unlikely to be causal for the activation of the human immune system.

How CDR3 loops scan a peptide/MHC complex. Rendering is based on the 100 ns simulation of the wild-type TCRpMHC complex. Blue: peptide; Solid white cartoon and transparent surface: first frame of the MHC helices and β-sheet floor; Orange: equally distributed frames of the CDR3s over simulation time. Transparent white cartoon: first frame of all other TCR regions.

# HHSearch

Introduction and Methods:

Today I prsented the paper which introduces HHsearch. HHSearch is mainly used for template detection and still is one of the best known methods. The early methods for template detection simply performed sequence-sequence comparison between a query protein and a database of targets: such as BLAST or FASTA. But then more advanced methods such as PSI-BLAST were introduced which perform sequence-profile comparisons. Profiles are build using multiple sequence alignments (MSA) of protein families which describes the probability of the occurrence of an amino acid in a column of a MSA. In other words, profiles describe the conservation of amino acids among families. A remarkable advance in template detection was introduced by methods performing profile-profile comparisons such as COMPASS, PROF-SIM and LAMA.

Hidden Markov Model (HMM) introduced a new way of building profiles which resulted in methods performing sequence-HMM comparisons to detect templates. A HMM is similar to a state machine and is build using a MSA where each column is given a ‘M’ (match) state. A match state emits amino acids based on probability calculated from the MSA. In addition of a match state, all columns of the MSA will have an ‘I’ (insert) state and ‘D’ (delete/gap) state (See below figure for an example). There is a transition between states (shown by arrows) where all transitions also have probabilities. Having a sequence and a HMM, the sequence can be alignment on the HMM. In other words there is path in the HMM which associates to emitting the sequence and a log-odd score associated with this path. Dynamic programming (Viterbi algorithm) is used to detect this path (similar to needleman and wunsch algorithm for sequence-sequence alignments). More detail can be found here.

Example of an HMM. Taken from Bioinformatics Sequence and Genome Analysis, David W. Mount
(http://compbio.pbworks.com/w/page/16252909/Multiple%20Sequence%20Alignment)

The novel idea of HHsearch was that instead of performing sequence-HMM alignment, HMM-HMM alignments can be used for template detection. Therefore, they first discuss the formula used to calculate the log-odd scores which are required by the Viterbi algorithm to find the best aligned path. The score of aligning two columns in two HMMs (query profile q and template profile t) are calculated as:

Using this score, the best alignment between two profile HMMs is generated by maximizing the log-sum-odds score as in the formula below:

Now that the scoring function is defined, Viterbi algorithm is used to maximize this function. For simplicity HHsearch has described five pair states and the allowed transition between them (see figure 1.C of the paper). Therefore five dynamic program matrices are needed to align the two HMMs.

There are two other main parameters that can contribute to the final score of the HHsearch functions: 1) Correlation score: this score is based on the idea that if two proteins are homologs then once they are aligned high score columns (conserved columns) should cluster together. Which means the higher this score is the more homolog the sequences are. This score can be simply added to the ‘final best alignment score’ from the Viterbi algorithm. 2) While aligning two columns of the two HMMs, Secondary Structure (SS) elemnts are scored using statistical scores generated by the authors, which take into account the confidence values of the SS predictions. HHsearch provides two set of scores for SS comparison: 1- predicted vs predicted 2- predicted vs known. The second one is mainly used when performing 3D structure prediction. These acores are added to the Viterbi algorithm scoring functions in the formula 5,6 and 7 of the paper.

Results:

Dataset:

HHsearch performance was compared to BLAST(sequence-sequence), PSI-BLAST(profile-sequence) , HMMER(HMM-sequence) and COMPASS and PROF-SIM(profile-profile) methods. five version of HHsearch was benchmarked:

HHserach 0 -> basic profile-profile comparison with fix gap penalties
HHserach 1 -> basic HMM-HMM comparison
HHsearch 2 -> HMM-HMM comparison including correlation score
HHsearch 3 -> HMM-HMM comparison including correlation score + predicted vs predicted SS
HHsearch 4 -> HMM-HMM comparison including correlation score + predicted vs known SS

The dataset used in the comparison study consists of 3691 sequences taken from the SCOP database filtered at 20% sequence identity (SCOP-20). For each sequence an alignment is prepared using PSI-BLAST. These alignments are used to compare methods.

Detection of homolog:

The ability of methods in detecting homologs are compared against each other. Homolog and non-homolog definitions are: If two domains of SCOP-20 are in the same SCOP superfamily then they are considered as homologs. If the two domains are from different classes they are considered as non-homologs. All other pairs are classified as unknown.
The performance are compared by drawing TP (number ofhomolog pairs) vs FP (number of non-homolog pairs) curves, for all-against-all comparison of data in SCOP-20. The highest curves represent the best performing method. This curve is shown in Figure 2 of the paper. In this dataset the total number of TP are 41505 and the total number of FP are 1.08 x 10^7. The worse method is BLAST which at a error rate of 10% will only find 22% of the homologs. carrying on from BLAST the detection ability grows as:

BLAST < PSI_BLAST < HMMER < PROF-SIM < COMPASS < HHsearch0 < HHsearch1 < HHsearch2 < HHsearch3 < HHsearch4

Studying the results from HHsearch in detail the authors realised that in some cases HHSearch (with high confidence) groups two domains from different superfamily or even different folds as homologs. Looking at the structures of these proteins the authors noticed that the structure of these proteins are either locally or globally similar. Therefore, proteins defined by SCOP to be in different superfamilies or fold might actually be homologs. Therefore, they repeated the same test but changing the definition of TP and F:

-homologs (TPs) are domains from the same superfamily OR if their sequence-based alignment resulted in an structure alignment with MaxSub score of 0.1. (MaxSub ranges from 0-1, where 0 is insignificant and 1 very significant.)
-non-homologs (FPs) are those domains from different classes AND zero MaxSub score. Domains not in these two categories are grouped as unknow.

Figure 3 of the paper displays the new results. Although the figures 2 and 3 look similar there are few main points concluded from these diagrams:

1- At the same error rate all methods except BLAST detect more homologs compared to Figure 2 of the paper.
2- With the new definitions, sensitive tools improve more than others in homolog detection such as HHsearch 3 and 4. Since the new set of homologs are harder to detect and therefore only sensitive tools can detect them
3- COMPASS and PRO-SIM perform better than HHsearch 0 which means they are better at remote homolog detection.

To check this hypothesis, they draw TP vs FP (Figure 4 of paper) only for TPs from different families. Not only they confirm the hypothesis, they notice that using SS (HHsearch 3 and 4) they detect more TPs. Interesting enough as the pairs under study evolutionary diverge, the power of SS in detection becomes bolder.

Alignment quality:

The quality of a homology model really depends on how well the query protein is aligned with the homolog. Therefore, in this paper all of the methods are compared on their ability on building accurate alignments. To do so, using an aligned pair of sequences, the 3D structures of two proteins
are superposed and the spatial distances are evaluated (similar to MaxSub method). The authors also introduced the Sbalance score which is similar to MaxSub but considers over and under predictions that MaxSub fails to consider.

Comparing alignment qualities using MaxSub score (Fig 5 of paper) we can roughly conclude the performance as:

BLAST < PSI_BLAST < PROF-SIM < COMPASS < HHsearch0 < HMMER < HHsearch1 < HHsearch2 < HHsearch3 < HHsearch4

again more sensitive methods are able to build better alignments for distant homologs.

Comparing alignment qualities using Sbalance score (Fig 6 of paper) we can roughly conclude the performance as:

BLAST < PSI_BLAST < HMMER < PROF-SIM < COMPASS < HHsearch0 < HHsearch1 < HHsearch2 < HHsearch4 < HHsearch3

HMMER now is inferior to profile-profile alignment methods. In addition HHsearch 3 is the winner.

In general HMM-HMM methods are superior to all other methods for homolog detection and building alignments. HHsearch 4 has shown to be able to detect related structures (more than other methods) in the the same superfamily and folds.

Using the same idea more accurate tools have been developed since this paper such as HHblits from the same group. Also, recently a method has introduced Markov Random Fields to detect homologs, with better performance than HHsearch and HHblits.

# DSSP

My talk this week focused on secondary structure (SS) assignment. What do I mean by this? It is assigning SS types (principally α-helices and β-sheets) to protein structures. It can be found hiding in many of the things we do – e.g. alignment and modelling. There are many available methods to do this, of which DSSP (despite being published in 1983) is the most popular.

## DSSP

### How does it work?

The algorithm identifies hydrogen bonds between mainchain carbonyl and amide groups. Partial charges are applied to the amide and carbonyl bonds, and the the C, O, N, and H atoms are assumed to be point charges (hence C has charge +ρ1, O -ρ1, N -ρ2, and H +ρ2. The the electrostatic energy between these 4 atoms is calculated, and if it is < -0.5 kcal/mol, a hydrogen bond exists. This is a relatively relaxed threshold, as a normal hydrogen bond in an \$alpha;-helix is in the region of -3 kcal/mol, so it means that a given residue could have i+3, i+4, and i+5 hydrogen bonds.

Helices and sheets are then identified where there are characteristic hydrogen bond patterns. For example, two consecutive i to i+4 backbone hydrogen bonds indicates an α-helix turn. The algorithm identifies each turn, and each β-bridge, and where several of these overlap, they are combined into single elements.

DSSP has 8 different SS assignments:

• G – 310 helix
• H – α-helix
• I – π-helix
• E – β-sheet
• B – β-bridge
• T – helix turn
• S – bend (high curvature)
• C – coil (none of the above)

These are assigned in an order of preference – HBEGITSC.
Many (but by no means all) SS assignment programs still use this notation.

DSSP is one of the more simple SS assignment programs. Its hydrogen bond energy calculation is distinctly simplistic. It does not (fully) take the angles of the hydrogen bond into account, and provides only a binary classification for each hydrogen bond. However, perhaps surprisingly, DSSP is still the most used method. Why? Probably something to do with them giving it away for free, which resulted in many software suits incorporating it (e.g. JOY, PROMOTIF). As a general rule, if something does not say what it uses for SS assignment, it probably uses DSSP.

## Other Methods

Given the simplicity of DSSP, it is not surprising that there are a large number of other available methods. Indeed, you may notice that different programs will give different assignments (e.g. comparing Pymol to VMD or the PDB annotation).

There have been a vast number of other secondary structure (SS) annotation methods published, including: STRIDE, DEFINE, PROMOTIF, KAKSI, SST, PSSC, P-SEA, SECSTR, XLLSSTR, PALSSE, and STICK. The other two you are likely to come across are STRIDE, and the PDB annotation.

## SS assignment in general

All of the SS assignment methods rely on a combination of three features of SS. These are:

1. Mainchain hydrogen bonds
2. Φ and Ψ angles
3. Inter Cα distances

For all three of these, there are values characteristic of both helices and β-sheets. Every method takes some combination of these three features for each residue, and if they are within the chosen limits, classifies the residue accordingly. The limits are generally chosen to give good agreement with the PDB annotation.
(It should be noted that the hydrogen-bond containing methods use the position of the hydrogen atom, which is rarely present in the crystal structure, and thus must be inferred.)

STRIDE can be described as an updated version of DSSP. It has two main differences – it uses a much fuller description of hydrogen bond geometry, and combines this with a knowledge based φ/ψ angle potential to assign the residue state. It has many more parameters that DSSP, and these are trained based on the PDB annotation. So where does that come from?

This PDB annotation comes from the depositors own annotation. The current guidance (from here) is to use the generated annotation, from PROMOTIF. PROMOTIF uses DSSP, with a slight change – it annotates an extra residue at the end of each structure element. I am in no position to say how well this guidance is adhered to by the depositors, or what their historical behaviour was, but the vast majority of annotations are reasonable.

I guess you are now wondering how different these methods are. Generally they agree in the obvious cases, and disagreement is normally at the ends of SS elements. Other examples (particularly pertinent to my research) occur when one method identifies a single long element, while another method identifies two elements seperated by a coil section. Ultimately there is no ‘right’ answer, so saying one method is right and another wrong is impossible.

To sum up, DSSP is the de facto standard. Ignoring my previous comment, it is probably not the best algorithm, as it is a gross simplification. STRIDE improves on the algorithm (although using more parameters), whilst for specific tasks, one method may be better than all of the others. It is hard to say if one is the best, and if it is important to you, then you should think about which method to use. If you do not think it is, then you should reconsider, and if it really is not important, then just use DSSP like everyone else. This is perhaps an example where willing, free, provision your code to the community results in your method (DSSP) becoming the de facto standard.

# Loopy LaTeX

Something that you may or may not know, is that you can write loops in LaTeX. It is, in fact, a Turing-complete language, so you can write if statements too (which is helpful for making multi-line comments), and do all the other things you can do with a programming language.

So, I guess you are thinking, ‘eh, that’s cool, but why would I do it?’ Indeed, I have known about this for ages, but never found any need for it, until recently. My problem was that I had generated 80 small images, that I wanted to display on a number of pages. I could have played with print settings, and made multiple pages per sheet, but since I wanted two columns, and to fit 16 to a page (the 80 images were in 5 sets of 16, each with two subsets), that was going to be a pain. I also wanted to add some labels to each set, but not have said label on every image. However, I thought that LaTeX could solve my problem.

As ever, there are a number of different latex packages that you can use, but I used the pgffor package. In the example below, my pictures were named [A-E]_[ab][1-8]_picture, e.g. A_b2_picture.png, or D_a3_picture.png. The code produces pages of 16 pictures, with the ‘a’ pictures on the left, and the ‘b’ pictures on the right.
There is a more simple example at the bottom.

Happy LaTeXing..

# OPIG goes Punting!

Last Wednesday, it was the oh-so-very traditional OPIG Punting day (OPunting for those more acronym-prone).

To my surprise,  the weather was spectacular! It was warm and sunny, a true perfect day for punting. We set off from our alcoves offices with determination in our hearts and, more importantly, with delicious snacks and a significant amount of Pimms and G&T.   Everything was set for a truly amazing day. Our group took over 5 punts from the Cherwell Boathouse, constituting what I like to think of as a small fleet of avid punters and merriment-seekers. We punted all the way up the Cherwell, past the lovely Victoria’s Arms into lands unknown (or into some grassy meadows in the vicinities of Oxford). Fortunately no keys were thrown overboard and no one fell off the punts (well, at least not accidentally). Yet, as usual, OPunting was very eventful! Following the customs of our group, we had the traditional punting race. I may have been too busy gorging on Pimms during the race, but if memory does not fail me, the race was won by Hannah (who can be seen in the photo bellow doing her swimming victory lap).

Hannah, in her victory lap swim…

During the punting, we also discovered that Bernhard had some previously unknown Viking ancestry (Austrian vikings?), which manifested in an impetus to ram his punt against others. Suffice to say that he paved the grounds to “Dodgems Punts”, a ride that will become popular in fun fairs and amusement parks in 2027.

Other than that, the weather was so great that many of us decided to go for a lovely swim at the Cherwell.

After a refreshing pint at Victoria’s, we made our way back to conclude yet another successful OPunting day!