# Finding Communities in Networks: Link Clustering and the Affiliation Graph Model

The research I do revolves around how to break down protein interaction networks (PINs) into functional modules, or communities, using community detection methods. These communities are groups of proteins which interact more with one another than with the rest of the network. We decompose PINs in this manner because it is very difficult to determine how a protein functions through its interactions by simply looking at the entire network. The sheer amount of data in one of these networks clouds the information we would actually like to see, as explained in my previous blog post.

In a journal club in August, I presented the overlapping community detection method Link Clustering by Ahn et al., which is the first community detection method I am aware of that assigns nodes to communities by clustering the edges between the nodes. I contrasted Link Clustering with the Affiliation Graph Model method proposed by Yang and Leskovec, as both sets of authors use the same comparison technique to validate their methods. In a world where studies usually only work on a single network using a specific method, having two independent papers that use the same validation method is something that gets me much more excited than it probably should.

Instead of assigning nodes to communities, Link Clustering focusses on the edges in a network. Clustering edges is essentially an equivalent problem to clustering nodes. This becomes obvious if you take a graph, decide to call all the edges “blobs” and say two of your new “blobs” have an edge between them if they share a node in the original network. This is called creating the “line graph” of the network and performing “blob” clustering on this is the same as performing edge clustering on the initial network.
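The line-graph construction described above can be sketched in a few lines of Python. This is a minimal illustration and not code from Ahn et al.; the function name `line_graph` and the toy edge list are assumptions made for the example.

```python
from itertools import combinations


def line_graph(edges):
    """Build the line graph of an undirected graph.

    Each original edge becomes a "blob" (a node of the line graph), and two
    blobs are adjacent if the original edges share an endpoint.
    """
    edges = [tuple(sorted(e)) for e in edges]
    adjacency = {e: set() for e in edges}
    for e1, e2 in combinations(edges, 2):
        if set(e1) & set(e2):  # the two edges share a node
            adjacency[e1].add(e2)
            adjacency[e2].add(e1)
    return adjacency


# Toy network (hypothetical): a triangle 1-2-3 with a tail 3-4.
triangle_with_tail = [(1, 2), (2, 3), (1, 3), (3, 4)]
blobs = line_graph(triangle_with_tail)
for blob, neighbours in sorted(blobs.items()):
    print(blob, "->", sorted(neighbours))
```

Any node clustering method run on `blobs` is then an edge clustering of the original network.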

To come back from my tangent, if an edge is assigned to community A, this means the nodes on either side of this edge are assigned community A by association. By assigning each edge to an individual community, Link Clustering creates overlapping node communities as shown in Figure 1 below.

Figure 1: Example of a simple, link-clustered network. Nodes can be seen to be in multiple communities as edges are assigned to communities (c.f. node 4 in red, green and yellow communities). [Ahn et al., 2010]

In Link Clustering, edges are placed into the same community based on similarity scores computed between all pairs of edges that share a node. Edge clusters are computed from these similarity scores using single-linkage clustering, where the edges with the highest similarity scores are iteratively merged into the same community. Left to run to completion, this procedure puts all edges into a single community, so a threshold for the similarity score needs to be found at which the network is partitioned into communities in the best way. This threshold acts as a resolution parameter for the community detection. At a low threshold there are few, large communities as many edges have been clustered together, while at a high threshold only highly similar edges are in the same community and there are many, small communities. Ahn et al. propose maximizing a function called the partition density to find the optimal resolution threshold. This function is simply a weighted average of the density of the communities. For those incredibly keen, it is given by the equation:

$latex D = \frac{2}{M} \sum_c m_c \frac{m_c - (n_c - 1)}{(n_c - 2)(n_c - 1)}$

Here, $latex M$ represents the total number of edges in the network and $latex m_c$ and $latex n_c$ represent the number of edges and nodes in the community $latex c$ respectively.
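As a rough sketch of how the partition density could be evaluated, the function below takes the edge and node counts of each community and applies the formula directly. The function name and the input format are assumptions made for illustration. Treating a community with only two nodes as contributing zero is also an assumption here, since the formula is otherwise undefined for $latex n_c = 2$.

```python
def partition_density(communities, total_edges):
    """Partition density D for a list of (m_c, n_c) pairs.

    m_c is the number of edges and n_c the number of nodes in community c;
    total_edges is M, the number of edges in the whole network.
    """
    total = 0.0
    for m_c, n_c in communities:
        if n_c <= 2:
            # Assumption: a single-edge community contributes nothing,
            # which also avoids dividing by zero.
            continue
        total += m_c * (m_c - (n_c - 1)) / ((n_c - 2) * (n_c - 1))
    return 2.0 * total / total_edges


# Hypothetical partition: one triangle (3 edges, 3 nodes) and one lone edge.
example_density = partition_density([(3, 3), (1, 2)], total_edges=4)
print(example_density)
```

A fully connected community has density 1 and a tree-like community has density 0, so in this example the triangle contributes with weight 3/4 and the lone edge contributes nothing.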

All of this partitioning is based on a similarity score between two edges. The crux of this method, the scoring measure, is explained in Figure 2.

Figure 2: Schematic diagram showing how the similarity between edges $latex e_{ik}$ and $latex e_{jk}$ is calculated in Link Clustering. The overlap of the inclusive neighbour sets of nodes i and j is divided by the union of these sets. The inclusive neighbour set of node i is computed by taking the neighbours of node i and including node i itself. The nodes in grey are overlapping between both sets, therefore the similarity between edges $latex e_{ik}$ and $latex e_{jk}$ is 4/12 or 1/3. [Ahn et al. 2010]
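The similarity score and the thresholding step can be sketched together in Python. This is a simplified illustration under stated assumptions, not the authors' implementation: it assumes a simple undirected graph stored as a dictionary of neighbour sets, and it uses the fact that cutting a single-linkage clustering at a threshold gives the same clusters as linking every pair of edges whose similarity is at or above that threshold and taking connected components. The function names and the toy network are hypothetical.

```python
from itertools import combinations


def edge_similarity(adjacency, i, j):
    """Similarity of edges (i, k) and (j, k): Jaccard index of the
    inclusive neighbour sets of the two unshared nodes i and j."""
    n_i = adjacency[i] | {i}
    n_j = adjacency[j] | {j}
    return len(n_i & n_j) / len(n_i | n_j)


def link_communities(adjacency, threshold):
    """Group edges whose similarity is >= threshold (single linkage) and
    return the node set of each edge cluster as a sorted list."""
    edges = sorted({tuple(sorted((u, v))) for u in adjacency for v in adjacency[u]})
    parent = {e: e for e in edges}

    def find(e):
        # Union-find lookup with path halving.
        while parent[e] != e:
            parent[e] = parent[parent[e]]
            e = parent[e]
        return e

    for e1, e2 in combinations(edges, 2):
        shared = set(e1) & set(e2)
        if len(shared) != 1:
            continue  # only edges sharing a node are compared
        (i,) = set(e1) - shared
        (j,) = set(e2) - shared
        if edge_similarity(adjacency, i, j) >= threshold:
            parent[find(e1)] = find(e2)

    clusters = {}
    for e in edges:
        clusters.setdefault(find(e), set()).update(e)
    return sorted(sorted(nodes) for nodes in clusters.values())


# Toy network (hypothetical): two triangles sharing node 3.
bowtie = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(link_communities(bowtie, threshold=0.5))
```

At a threshold of 0.5 the two triangles come out as separate communities with node 3 in both, while a very low threshold merges every edge into one community.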

## Affiliation Graph Model

The method which I would like to compare Link Clustering to is the Affiliation Graph Model (AGM). Briefly, this is a statistical community detection algorithm which generates overlapping communities. Node affiliations are recorded in the Community Affiliation Matrix shown in Figure 3.

Figure 3: Schematic of the Community Affiliation Matrix, which shows nodes (coloured squares) being affiliated with communities (circles A, B, and C) as shown by the arrows. Nodes can have multiple community affiliations as indicated by red and purple node squares. The probability of interaction between two nodes which are both in only community A is $latex p_A$. [Yang and Leskovec, 2012]

The AGM has a probability of interaction for each community, denoted $latex p_A$ for community $latex A$. For nodes $latex u$ and $latex v$, which may share several community affiliations, the probability of interaction, $latex p(u,v)$ is given by the equation:

$latex p(u,v) = 1 - \prod_{k \in C_{uv}} (1 - p_k),$

where $latex k$ denotes a community in the set of communities shared by nodes $latex u$ and $latex v$, $latex C_{uv}$. The more communities two nodes share, the more likely they are to interact. This model is fitted to the network using a Metropolis-Hastings algorithm, with a Markov chain constructed from pre-defined steps in the space of possible Community Affiliation Matrices (cf. Yang and Leskovec 2012).
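The edge probability can be evaluated directly from the formula. The snippet below is a minimal sketch of that single calculation only, not of the model fitting; the function name, the community labels and the probability values are made up for illustration.

```python
from math import prod


def edge_probability(communities_u, communities_v, p):
    """p(u, v) = 1 - product over shared communities k of (1 - p_k)."""
    shared = set(communities_u) & set(communities_v)
    # With no shared community the product is empty (equal to 1),
    # so the formula as stated gives a probability of 0.
    return 1.0 - prod(1.0 - p[k] for k in shared)


# Hypothetical per-community interaction probabilities.
p = {"A": 0.5, "B": 0.2}
print(edge_probability({"A", "B"}, {"A", "B"}, p))  # two shared communities
print(edge_probability({"A"}, {"A", "B"}, p))       # one shared community
print(edge_probability({"A"}, {"B"}, p))            # nothing shared
```

Sharing both communities gives 1 - 0.5 × 0.8 = 0.6, which is higher than the probability from either community alone.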

## Result Comparison Method

Link Clustering itself has some very interesting results, especially relating to looking at real-world networks at different resolutions (which can be found in the paper). However, I want to focus on the validation method used in the paper, which compares results generated from Link Clustering with those of other methods. This comparison method evaluates four different aspects of the partitions generated by different methods. These four aspects are: Community Quality, Overlap Quality, Community Coverage, and Overlap Coverage. The “Quality” aspects relate to network metadata indicating how good the obtained communities are, and the “Coverage” aspects relate to the amount of information extracted from a network.

The metadata referred to above is a general term for additional data available about the nodes in a network. In the case of PINs this metadata is related to the function of proteins in the network (their Gene Ontology annotations). Loosely, proteins with similar functions should be placed in the same community to improve the Community Quality measure. Overlap Quality concerns the boundaries between generated communities. If the functions assigned to proteins show similar boundaries between groups of functions, the Overlap Quality is high.

The “Coverage” values are more basic calculations. A partition has a high Community Coverage if the fraction of nodes assigned to non-trivial communities (those with three or more nodes) is high. A community of size two is uninformative, as it is the result of a single edge being in a community by itself. Overlap Coverage is simply the average number of non-trivial community memberships per node. The two “Coverage” values are thus equal for non-overlapping community detection methods.
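A small sketch of the two coverage calculations as I have described them is given below. It reflects my reading of the definitions rather than the exact code used in either paper, and the function name and toy partition are hypothetical.

```python
def coverage_scores(communities, nodes):
    """Return (community_coverage, overlap_coverage) for a partition.

    communities: iterable of node collections (overlaps allowed).
    nodes: all nodes in the network; every community member must be listed.
    """
    # Only communities with three or more nodes count as non-trivial.
    nontrivial = [set(c) for c in communities if len(set(c)) >= 3]
    memberships = {node: 0 for node in nodes}
    for community in nontrivial:
        for node in community:
            memberships[node] += 1
    n = len(memberships)
    community_coverage = sum(1 for count in memberships.values() if count > 0) / n
    overlap_coverage = sum(memberships.values()) / n
    return community_coverage, overlap_coverage


# Hypothetical partition of a six-node network; [5, 6] is a trivial community.
cc, oc = coverage_scores([[1, 2, 3], [3, 4, 5], [5, 6]], nodes=range(1, 7))
print(cc, oc)
```

Here five of the six nodes sit in a non-trivial community, and node 3 counts twice towards the Overlap Coverage because it belongs to two of them.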

When comparing community detection methods, these four measures are computed for a partition generated by each method, and then rescaled so that the maximum score achieved by any method is 1 for each measure independently. These values are then added to give a score between 0 and 4 for each community detection method.
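The rescale-and-sum step can be sketched as follows. The method names and raw values in the example are invented purely to show the arithmetic and are not results from either paper; the function name is also an assumption.

```python
def composite_scores(raw):
    """Rescale each measure so the best method scores 1, then sum per method.

    raw: {method: {measure: value}}, with the same measures for every method.
    """
    measures = list(next(iter(raw.values())).keys())
    totals = {method: 0.0 for method in raw}
    for measure in measures:
        best = max(scores[measure] for scores in raw.values())
        for method, scores in raw.items():
            # Guard against a measure on which every method scores zero.
            totals[method] += scores[measure] / best if best > 0 else 0.0
    return totals


# Invented raw values for two hypothetical methods, X and Y.
raw = {
    "X": {"community_quality": 2.0, "overlap_quality": 1.0,
          "community_coverage": 0.5, "overlap_coverage": 1.0},
    "Y": {"community_quality": 1.0, "overlap_quality": 2.0,
          "community_coverage": 1.0, "overlap_coverage": 2.0},
}
totals = composite_scores(raw)
print(totals)
```

Because each measure is rescaled by the best value achieved among the methods compared, a method's composite score depends on which other methods are included in the comparison.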

## Results

The results shown in Figure 4 were generated by Yang and Leskovec to show their AGM outperforms Link Clustering (and other methods) on a variety of networks. While it is notable that they used the same metric and networks used by Ahn et al. when proposing Link Clustering, this figure must be taken with a pinch of salt. Yang and Leskovec acknowledge that the metric proposed by Ahn et al. rewards methods which find more communities in a network. They therefore do not fit the number of communities judged to be best by the AGM, but instead fit the same number of communities as found by Link Clustering. Furthermore, it is peculiar that half of the networks used for comparison by Ahn et al. have disappeared in this comparison.

Figure 4: Community detection method comparison for Link Clustering (L), Clique Percolation (C), Mixed-Membership Stochastic Block Model (M) and the Affiliation Graph Model (A) on different PINs (PPI) and other social networks. Methods are compared by the metric proposed by Ahn et al. [Yang & Leskovec 2012]

To conclude, while it is very interesting that two papers use the same method to compare their methods to others for validation purposes, the comparison was not carried out on entirely equal terms here. The comparison metric is, however, an interesting step towards a gold standard for method comparison. For my purposes, Community Quality is the most important measure, so a weighted version of the metric may be more interesting.

# ECCB 2014 (Strasbourg)

The European Conference on Computational Biology was held in Strasbourg, France this year. The conference was attended by several members of OPIG including Charlotte Deane, Waqar Ali, Eleanor Law, Jaroslaw Nowak and Cristian Regep. Waqar gave a talk on his paper proposing a new distance measure for network comparison (Netdis). There were many interesting talks and posters at the conference, and brief summaries of the ones we found most relevant are given below.

The impact of incomplete knowledge on the evaluation of protein function prediction: a structured output learning perspective.

Authors: Yuxiang Jiang, Wyatt T. Clark, Iddo Friedberg and Predrag Radivojac

Chosen by: Waqar Ali

The automated functional annotation of biological macromolecules is a problem of computational assignment of biological concepts or ontological terms to genes and gene products. A number of methods have been developed to computationally annotate genes using standardized nomenclature such as Gene Ontology (GO). One important concern is that experimental annotations of proteins are incomplete. This raises questions as to whether and to what degree currently available data can be reliably used to train computational models and estimate their performance accuracy.

In the paper, the authors studied the effect of incomplete experimental annotations on the reliability of performance evaluation in protein function prediction. Using the structured-output learning framework, they provide theoretical analyses and carry out simulations to characterize the effect of growing experimental annotations on the correctness and stability of performance estimates corresponding to different types of methods. They also analyzed real biological data by simulating the prediction, evaluation and subsequent re-evaluation (after additional experimental annotations become available) of GO term predictions. They find a complex interplay between the prediction algorithm, performance metric and underlying ontology. However, using the available experimental data and under realistic assumptions, their results also suggest that current large-scale evaluations are meaningful and surprisingly reliable. The choice of function prediction methods evaluated by the authors is not exhaustive, however, and it is quite possible that other methods might be much more sensitive to incomplete annotations.

Towards practical, high-capacity, low-maintenance information storage in synthesized DNA.

Authors: Nick Goldman, Paul Bertone, Siyuan Chen, Christophe Dessimoz, Emily M. LeProust, Botond Sipos & Ewan Birney

Chosen by: Jaroslaw Nowak

This was one of the keynote speaker talks. One of the authors, Ewan Birney, discussed how viable it is to store digital information in DNA. The paper describes encoding 739 kilobytes of hard-disk storage in DNA. The data stored included all 154 of Shakespeare’s sonnets in ASCII text, a scientific paper in PDF format, a medium-resolution colour photograph of the European Bioinformatics Institute in JPEG format, a 26-s excerpt from Martin Luther King’s 1963 ‘I have a dream’ speech in MP3 format and a Huffman code used in their study to convert bytes to base-3 digits (ASCII text). The authors accomplished this by first converting the binary data into base-3 numbers using Huffman coding (which represents the most common pieces of information using the fewest digits). The base-3 numbers were then converted into a DNA sequence in such a way that no homopolymers (runs of the same base) were produced. The authors also showed that they could read the encoded information back and reconstruct the data with 100% accuracy.
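One way to turn base-3 digits into DNA without ever repeating a base is to let each digit choose among the three bases that differ from the previous one. The sketch below illustrates that idea only. The ordering of the lookup, the starting base and the function names are assumptions and not necessarily the table used in the paper, and the Huffman coding step is left out.

```python
BASES = "ACGT"


def trits_to_dna(trits, previous="A"):
    """Encode base-3 digits so that no base is ever repeated consecutively."""
    dna = []
    for trit in trits:
        # The three bases that differ from the previous one, in fixed order.
        choices = [base for base in BASES if base != previous]
        previous = choices[trit]
        dna.append(previous)
    return "".join(dna)


def dna_to_trits(dna, previous="A"):
    """Invert trits_to_dna, given the same starting base."""
    trits = []
    for base in dna:
        choices = [b for b in BASES if b != previous]
        trits.append(choices.index(base))
        previous = base
    return trits


message = [0, 0, 1, 2, 2, 0]  # hypothetical base-3 digits
encoded = trits_to_dna(message)
print(encoded, dna_to_trits(encoded) == message)
```

Even a run of identical digits, such as the two leading zeros here, is written as alternating bases, which is what keeps homopolymers out of the sequence.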

Using DNA for information storage could very soon become cost-effective for sub-50-year archiving. The current costs of the process are $12,400/MB for writing and $220/MB for reading information, with negligible costs of copying. Nevertheless, if the current trends persist, we could see a 100-fold drop in costs in less than a decade. The main advantages of storing information in DNA are the low maintenance and durability of the medium (intact DNA fragments have been recovered from samples that are tens of thousands of years old) as well as the little physical space required to store the information (~2.2 PB/g).

PconsFold: Improved contact predictions improve protein models.

Authors: Michel M, Hayat S, Skwark MJ, Sander C, Marks DS and Elofsson A.

Chosen by: Eleanor Law

De novo structure prediction by fragment assembly is a very difficult task, but can be aided by contact prediction in cases where there is plenty of sequence data. Contact prediction has also significantly improved recently, using statistical methods to separate direct from indirect contact information.

PSICOV and plmDCA are two such methods, providing contacts which can be used by software such as Rosetta as an additional energy term. PconsFold combines 16 different sets of contact predictions by these programs, built from different sequence alignments, with secondary structure and solvent accessibility prediction. The output of the deep learning process on these inputs is more reliable than the individual contact predictions alone, and produces more accurate models. The authors found that using only 2,000 decoys rather than 20,000 did not greatly harm their results, which is encouraging as decoy generation is a particularly resource-intensive stage. Using a balance between the Rosetta energy function and weight of contact prediction, the optimal number of constraints to include was around 2 per residue, compared to 0.5 per residue for PSICOV or plmDCA alone.

The PconsFold pipeline is not always able to make full use of the contact prediction, as accuracy of contact prediction on the true structure can be higher than that in the model. This is a case where the conformational search is not effective enough to reach the correct answer, though it would be scored correctly if it were obtained. All-beta proteins are the most difficult to predict, but PconsFold compares favourably to EVfold-PLM for each of the mainly alpha, mainly beta, and alpha & beta classes.