Author Archives: Malte Lucken

Counting Threads

When someone talks about “counting threads” the first thing that you think of is probably shopping for bed sheets. But this post is not about the happy feeling of drifting off to sleep on smooth, comfortable Egyptian cotton. This post is about that much less happy feeling when you want to quickly run a bit of code on a couple of data sets to finish the results section of a thesis chapter, and you see this:

Luis blocking the server again….

Obviously someone (*cough*Luis*cough*) is having some fun on the server without nice-ing their code to allow people who are much less organized than they should be (*cough*me*cough*) to do a quick last-minute data analysis run.

The solution: confront the culprit with their excessive server usage (Note: alternatives include manual server restart with a power cord to make the world your enemy – not recommended).

So… now we just need to find out how much of the server “ospina” is using. Screenshots won’t convince him… and we can’t take enough screenshots to show the extent of the server-hogging with his 1000s of processes anyway. We need to count…

Luckily there is a handy function to find out information about processes called pgrep. This is basically a ‘ps | grep’ function which has a bunch of options to reflect the many ways it can be used. We see opsina is running R, so here goes:

pgrep -c R

The -c flag counts processes and the pattern matches the command name that was run. But yeah, it turns out this wasn’t the best idea ever. A lot of people are running R (as might be expected in the Statistics Department), and you get a number that is really too high to be likely. We need to be more specific in our query, so let’s go back to the ps command. Second attempt:

ps -Af | grep ospina | wc -l

What we’re doing now is first showing all processes that are run on the server (ps -A) also showing details of the command run and who ran it (-f flag). Then we’re finding the ones that are labelled with our server culprit (grep ospina) and counting the lines we find. There are annoyingly still a few problems with this approach.

We just ran this command on the server and thus will count a command like grep –color=auto ospina,
User “ospina” is probably running a few more things than just his R command (like ssh-ing into the server and maybe a couple of screens)
We get a number than looks far lower than what we expected just by visual inspection.

So… what happened? We can fix problems 1 and 2 by just piping to a further grep command. But problem 3 is different. As it turns out, our culprit is running multiple threads from the same process (which is also why you find so many chrome instances on htop for example). We just counted processes, when really the server is being occupied by his multi-threading exploits. So… if you want to back up your complaint with a nice number, here’s your baby:

ps -ALf | grep opsina | grep R-3.3 | wc -l

The -L flag displays all threads instead of only the processes. I further used R-3.3 as it turns out he is using a specific version of R, which I can use to specify this command. Otherwise it also helps to use inputs arguments to functions to search against. If your fingers get too tired to press the shift-key that often, ps -ALf is equivalent to ps -eLf.

For now: moan away, folks!

Disclaimer: Any scenarios alluded to in the above text are fictitious and do not represent the behaviour of the individuals mentioned. Read: obviously I do not do my analysis last-minute.

Improving the precision of local community detection

Communities are everywhere. You can find groups of connected people on Facebook who went to university together, communities of people on youtube that like the same band, sets of products on amazon that tend to be purchased together, or modules of proteins that act in the same biological process in the human body. With the amount of data we can collect on these networks (facebook friendship, protein interactions, etc), you would think we can easily recover these groups. You would think wrong…

There are a lot of methods out there to find communities in networks. As a paper that compares community detection methods by Lancichinetti and Fortunato put it: “A thorough analysis of all existing methods would require years of work (during which many new methods would appear), and we do not believe it is feasible”. Thus, clearly we have not found the answer.

So what do people generally do to find these communities? Well… very different things. To explain a few of the approaches, it helps to break them down into broad categories such as global and local methods.

Global Methods

Global community detection methods generally aim to optimized an objective function over the entire network. For example, I could decide that I will score the network partition based on a function I will call “greatness”, G. I will define the greatness of a community by the sum of the number of nodes each node is connected with that are in the community (the sum of the node in-degree). Or simply:

Here, A_ij is the adjacency matrix that contains a 1 where nodes i and j interact, and 0 otherwise. The sigmas in the equation correspond to the community identifiers of nodes i and j respectively, and thus the delta function is only 1 when sigma_i and sigma_j are equal, meaning nodes i and j are in the same community, and 0 otherwise.

If you now come to the conclusion that this is a pretty stupid thing to define, then you’ve understood the concept of a global method (alternatively… get rid of that negativity!). A global community detection method would now look how to partition the network so that we can maximize G. In the case of greatness defined here, everything will be put into the same community, as there’s no penalty for putting two nodes together that shouldn’t belong together. Adding a further node to a community can never decrease the greatness score!

If you want a real overview over these and other methods, check out this 100-page review article.

Local Methods

In contrast to global methods that attempt to gather information on the whole network, local community detection methods dive into the network to try to optimize the community partitioning. This means on the one hand they can be very fast at assigning nodes to communities, as they take into account less information. However, on the other hand using less information in this assignment means they can be less precise.

An example of a local method is link clustering, which I described in an earlier post. Link clustering assesses the similarity of two edges that share a node, rather than comparing the nodes themselves. The similarity of these edges is calculated based on how similar the surroundings of the nodes on the endpoints are. Using this score, all edges that are more similar than a cutoff value, S, can be clustered together to define the edge-communities at resolution S. This partitioning translates to nodes by taking the nodes that the edges in the edge-communities connect.

Improving local community detection

Link clustering is a pretty nice and straightforward example of how you can successfully find communities. However, the authors admit it works better on dense than on sparse networks. This is an expected side-effect of taking into account less information to be faster as mentioned above. So you can improve it with a pretty simple idea: how much more information can I take into account without basically looking at the whole network?

Local community detection methods such as link clustering are generally extended by including more information. One approach is to include information other than network topology in the local clustering process (like this). This approach requires information to be present apart from the network. In my work, I like using orthogonal data sets to validate community detection, rather than integrate it into the process and be stuck without enough of an idea if it actually worked. So, instead I’m trying to include more of the local surroundings of the node to increase the sensitivity of the similarity measure between edges. At some point, this will blow up the complexity of the problem and turn it into a global method. So I’m now walking a line between precision and speed. A line that everyone always seems to be walking. On this line, you can’t really win… but maybe it is possible to optimize the trade-off to suit my purposes. And besides… longer algorithm run times give me more time for lunch breaks ;).

Isoform-resolved protein interaction networks and poker

Every time I talk about protein interaction networks, I put up a the nice little figure below. This figure suggests that experiments are done to detect which proteins interact with each other, and lines are drawn between them to create a network. Sounds pretty easy, right? It does, and it probably should, because otherwise you wouldn’t listen past the words “Yeast Two-Hybrid”. However, sometimes it’s important to hear about how it’s not at all easy, how there are compromises made, and how those compromises mean there are some inherent limitations. This post is one of those things it is important to listen to if you care about protein interaction networks (and if you don’t… they’re pretty cool, so don’t be a downer!).

Classical image showing the design of a protein interaction network

Schematic image showing the assembly of a human protein interaction network with ~25 000 interactions

So what’s wrong with that nice figure? Just to deal with the obvious: the colour scheme isn’t great… there’s a red protein and the interactions are also red… and then a yellow protein and red interactions aren’t that easy on the eyes. But you’re not here to judge my choice of colours (I hope… otherwise you’d be better off criticizing the graphs in this beast). You’re here to hear me rant about networks… much more fun ;). So here goes:

Not all interactions come from the same experiments, they investigate different “types” of binding.
Each experiment has experimental errors associated with it, the interactions are not all correct.
The network is not complete (estimation is that there are somewhere between 150k and 600k interactions depending on what paper you read).
People tend to investigate proteins that they know are associated with a disease, so more interactions are known for certain proteins and their neighbours resulting in an “inspection bias”.
That’s a stupid depiction of a network, it’s just a blue blob and you can’t see anything (hence termed “ridiculogram” by Mark Newman).
The arrow is wrong, you are not reporting interactions between proteins!

Points 1-5 should more or less make sense as they’re written. Point 6 however sounds a little cryptic. They’re called PROTEIN interaction networks, why would they not be interactions between proteins, you might ask.. and you’d be right in asking, because it’s really quite annoying. The problem is one of isoforms. The relation gene to protein is not 1-to-1. After a gene is transcribed into RNA, that piece of RNA is cut up and rearranged into mRNA, which is then translated into a protein. This rearranging process can occur in different ways so that a single gene may encode for different proteins, or as they are called, different protein isoforms. Testing for interactions between isoforms is difficult, so what tends to be done is that people test for interactions for THE ONE isoform of a protein to rule them all (the “reference isoform”) and then report these interactions as interactions for the gene. Sneaky! What you end up seeing are interactions mostly tested between reference isoforms (or any that happened to be in the soup) and reported as interactions for the gene product.

So how much does it matter if we don’t know isoform interaction information? Are there even different interacting partners for different isoforms? Do they have different functions? Well… yes, they can have different interactions and different functions. As to whether they matter… according to Corominas et al that answer is also a resounding yes… or at least in Autism Spectrum Disorder (ASD) it is.

The paper is the result of a 5-year investigation which investigates isoform interactions and the effect of knowing them vs not knowing them on predicting candidate ASD genes of interest. And seeing as a bunch of people spent a lot of time on this stuff, it was definitely worth a read. Corominas et al found that in an ASD-related protein interaction network, there is a significant number of interactions that would not be found if only the reference isoform interactions were used. Furthermore, compared to a “high-quality” literature curated network, the generated isoform-resolved ASD network added a lot of interactions. They then went on to show that these additional interactions played an important role in the decision of which genes to prioritize as important “players in Autism”.

Should these results make us renounce the use of currently available non-isoform-resolved protein interaction networks lest we burn in the depths of bioinformatics hell? Well… probably not. While the paper is very interesting and shows the importance of isoforms, it does so in the context of Autism only. The paper itself states that ASD is a brain-related disease which is an environment known for many isoforms. In many cases, it will likely be the case that the “dominant isoform” is just that… dominant. Moreover, the results may sound a little stronger than they are. The literature curated network that was compared to, to say that this isoform-resolved network is really important, was quoted as being “high-quality”. It is likely that many of the isoform interactions would be included in lower quality networks, but they have simply not been as well-studied as dominant isoforms. Thus, their isoform-resolved network would just confirm lower quality interactions as high-quality ones. That being said, if you want to look at the specific mechanisms causing a phenotype, it is likely that isoform information will be necessary. It really depends on what you want to achieve.

Let’s say you’re playing Texas Hold’em poker and you have two pairs. You’d like to have that full house, but it’s elusive and you’re stuck with the hand you have when your opponent bids high. That’s the situation we were in with protein interaction networks: you know you’re missing something, but you don’t know how bad it is that you’re missing it. This paper addresses part of that problem. We now know that your opponent could have the flush, but possibly only if you’re in Vegas. If you only want to play in a local casino, you’ll likely be fine.

GO annotations: Warning – secondary IDs!

I’m writing this post after a bit of a scare that has cost me at least 2 hours of my life so far, and possibly may cost me a lot more. This is something I wished I had read about before starting to use gene ontology annotations (GO terms), so hopefully this will reach some people that are in the position I was in a couple of years ago. The post assumes you know about the GO directed acyclic graph (DAG), which describes the relationships between GO terms, and a bit about functional similarity. Good reviews on this were written by Pesquita et al (2009) and Guzzi et al (2012)…

Secondary GO-terms

Yesterday evening I discovered GO terms can have secondary accession numbers. For example the GO term GO:0060555 is a secondary ID for GO:0070266 which describes the biological process term “necroptotic process”. The existence of secondary IDs in itself is not particularly surprising given that ontologies like the gene ontology are at the forefront of research and thus also have to be updated with the latest knowledge. As is the case for the UniProt KB and the RefSeq databases identifiers are merged if they are found to refer to the same thing. RefSeq has a good overview for when and why this occurs in their FAQ. Keeping a record of secondary IDs in the database is important for compatibility with past applications of the ontology.

Why is this a problem?

This can become a problem when the reverse compatibility is not fully committed to. For example, if the downloadable annotation set for human proteins contains GO terms that are secondary IDs, while the downloadable ontology structure does not include these secondary IDs in the directed acyclic graph (DAG). This means someone may download all the annotations and check what their parent terms are in the DAG, but as these terms are not included in the DAG, they do not have a parent term set.

So why the scare?

It seems like this behaviour should become obvious very fast. It should either give a key error that you are looking for something that isn’t there, or an empty set should ring some alarm bells. The scare comes in when neither happens, and you just don’t notice it at all, only to find out a couple of years later…

In order to calculate the functional similarity between two proteins, I need the full GO term sets associated with every protein including ancestral terms. I have a local copy of the ontology structure which I can use to query for these ancestral terms based on the GO terms directly associated with the proteins. As I need to do this for between 6800 and 13000 proteins depending on the network, I automated the process. The problem is that MySQL returns an empty set when asking for a match on an accession number that isn’t there. Returning an empty set has the effect that terms not in the ontology structure are automatically filtered out. This came to light now that I’m looking to explicitly calculate the information content of each GO term (a metric for how prevalent a term is in a data set) and three terms could not be found in the ancestral term sets for any proteins.

How do I prevent this from happening to me?

As there are only few secondary ID terms associated with proteins in the downloadable association file, it is easy to manually look for mappings to the respective primary IDs in the EBI QuckGO tool. Then, before querying for ancestral term sets, just map the secondary IDs across to the newer IDs and you’re all set. It’s really not a hassle if you know it’s there. Sadly, there is only a sentence on the gene ontology site about secondary IDs and no mention whatsoever that these IDs are not incorporated in the GO-basic ontology download, which is recommended for finding ancestral term sets.

Are you okay now?

As I hinted at before, there are very few proteins annotated with secondary ID terms. In my case I found three secondary ID GO terms annotated to five different proteins, out of a total 6861 proteins I was checking. One protein was also annotated with the corresponding primary ID term, so there was no effect, and another protein was annotated with the only parent term of the secondary ID, which no other proteins were annotated to. Thus, the effect really boils down to three proteins with the same secondary ID. Of those three, only one is heavily affected in the worst case by its similarity score with 10 other proteins changing from 9.3 to 2.6 without the primary ID annotation (scores range from ~ 1.5 to 12). Those are 10 scores out of a total of approximately 24 000 000… I think I will survive. But I am yet to test on a further data set.

The effect is basically that you ignore a couple of annotations. Given that we do not have the full set of annotations anyway, this change is the same as if an experiment or two hadn’t been conducted. Furthermore, the three proteins I found to be affected had a lot of annotations, so missing one should not have the worst-case effect I investigated. Nonetheless, you may want to do it properly first time round and take all the data into account when you start out. It should save you a lot of hassle later.

Stay vigilant, my friends!

Diseases exploiting the Power of Networks

Networks are pretty amazing. If you have a bunch of nodes that sit there and do very simple tasks that you can train a dog to do, and then you connect them in the right way, you can create a programme that can get through a level of Mario! In neural networks, all the nodes are doing is getting an input, then deciding what works best based on that input, and sending that information somewhere else. It’s like teaching your dog to get help when you’re in trouble. You say “Help!”, it makes a decision where to go to find help, and then someone else is alerted to your situation. Now you link those inputs and outputs, and suddenly you can solve some very interesting problems. Like making your dog get another dog to do something, who tells another dog… okay… at some point the analogy may break down. What I want to get across, is that networks can take very simple things and create complexity by just adding a few connections between them.

Getting back to the biological theme, proteins aren’t really all that simple. From what you’ll have read on other posts in this blog there’s a lot of effort that goes into modelling them. The position of even a small peptide loop can have a big effect on how a protein functions. So if you have a network of 15 – 25 thousand proteins you can imagine that the complexity of that system is quite incredible. That system is the human body.

Looking at systems from the perspective of interactions between components creating complex behaviour, it’s really not that surprising that human diseases often function by disrupting the network of protein interactions as found by this paper recently published in Cell. Assuming there’s a human disease which has survived for generations and generations, it is likely to be quite effective in how it functions. And if you have a complex system like the human body, the easiest way to disrupt that arising behaviour called “a healthy life”, is to alter a few of the interactions. Showing this for a large group of disease-associated missense mutations is however pretty impressive and time-consuming, which is probably why there are about 50 people from over 20 different institutions in the author list.

So what exactly did they show? They showed what happens when there is a mutation in a gene that causes a disease. Such a mutation replaces an amino acid in the protein sequence encoded by the gene. The resulting protein is then somehow different. The group around Marc Vidal and Susan Lindquist showed that rather than affecting the stability of the protein, the system tends to be targeted via the interactions of the protein. What is more, they showed that different mutations of the same gene can affect different sets of interactions. Using their “edgotyping” approach it is possible to characterize the type of effect a mutation will have and possibly predict the change in phenotype (i.e. the disease state).

Edgotyping classifies the effect of disease associated mutations (Sahni et al)

Now if the possibility of predicting how a single mutation (1 in ~300 amino acids), in a single protein (1 in ~ 25,000 proteins) affects the human body doesn’t convince you that networks are pretty powerful, I really don’t know what will. But now back to the important stuff…. how do you make a real life neural network with communicating dogs? And could you make them reach level two on Mario? Or… Super Mario Dog?

What does a farm look like? – Evaluating Community Detection Methods

Let’s assume you have a child. Tom is 5 years old, he’s a piece of work. He loves running around the house and the only time you can get a bit of rest is when he’s drawing. Tom loves drawing, so you use that whenever you need a bit of time to sit down. Today is one of those days, you didn’t sleep well and there was trouble at work. However, when you pick up Tom from daycare, he’s excited and full of energy, so you suggest Tom draw a farm which you can put on the wall in your office at work. You sit down and Tom is busy for half an hour. After half an hour Tom proudly presents his work. There are pigs in sties, horses in stables and cows on the meadow. He looks up at you and asks: “Is this a good farm?”. Of course, you say it is an amazing farm.

Only, what if you didn’t know what a farm looked like? You could ask the neighbour’s kid, Emily, to draw a farm and see what the images have in common? You could do this at a whole different scale and start a drawing competition for all the 5-year-olds in the neighbourhood and look which 5-year-olds draw most accurately to see who you should believe what a farm looks like. Clearly we’re not talking about children drawing farms any more. But what if Tom were a functional similarity metric that has just evaluated the results of a community detection algorithm run on a protein interaction network to generate communities?

That was a bit of a jump there. Let me explain. I have spent the last 2 years looking into how protein interaction networks (pins) can be partitioned into functional biological modules. It is widely believed that functional modules are an important scale of organization in humans or any other organisms (c.f. Hartwell et al 1999). These modules consist of proteins which perform the same or similar biological functions via interacting with each other. Thus, there may be a module which contains proteins that interact to e.g. heal wounds.

Finding Functional Modules

We attempt to find these modules by looking at a network which contains data on which proteins interact with which other ones (pins) and then use algorithms to group proteins together based on these interactions. The resulting protein communities contain proteins that interact more with each other than with the rest of the network. So that’s that, right?

…No, sadly it isn’t. The issue is that there are probably tens, if not hundreds of algorithms to find these communities and they don’t agree with each other. Furthermore, the underlying network contains a lot of false interactions and is missing a lot of true interactions. This affects different community detection methods in different ways. So we need a way to evaluate the results these community detection algorithms are presenting. But how do we judge what a good community is when it is exactly this community that we are looking for? How do we know it is “a good farm”, when we don’t know what a farm looks like?

Evaluating Community Detection by Functional Similarity

Luckily we do have some idea of what is “good”. Most proteins have annotations which suggest what biological functions they are involved in. As we are looking for functional modules which group together proteins involved in the same or a similar function, this is exactly along our alleyway! Lucky us :). Right?

Again, you might have expected this one, the answer is “not entirely”. Not only are there a lot of ways to find communities in a network, there are also a huge number of ways to use the above-mentioned functional annotations (GO annotations, www.geneontology.org) to calculate how “functionally similar” two proteins are. Now you might think that all these functional similarity measures use the same functional annotation data, so they should generally agree on what is a “good” and what is a “bad” community. This was my first intuition, so I checked. Below are two graphs which show the results of this test. In both cases I evaluate exactly the same sets of communities which I get from partitioning a pin using Link Clustering (Ahn et al 2010) at different resolutions. The plots show the average functional homogeneity of communities in different community size bands as judged by the Pandey Method on the left (Pandey et al 2008), and the simGIC method on the right (Pesquita et al 2007).

Functional homogeneity plots of a protein interaction network partitioned into communities by Link Clustering at different resolutions. The average functional homogeneity of communities is shown in different community size bands to give the coloured lines. Functional homogeneity is calculated as the average pairwise functional similarity between all protein pairs in a community. In Figure A) the Pandey method is used to calculate functional similarity between two proteins, while in Figure B) the simGIC method is used

You can clearly see that the two plots look different. Now it would be okay if they looked different and still agreed on the most important part: where is my pin partitioned into communities in the best way? To check this I look for maxima in the plots. Figure A) tells me that at a resolution of about 0.2 communities of size 2-35 are on average quite functionally homogeneous. At a slightly lower resolution (approx 0.1) communities of size 36+ look like they are partitioned decently. So depending on the community sizes I’m looking for, I can say which resolution I should go for. However, these peaks don’t show up in Figure B). We clearly need an evaluation of the evaluation metric. Which one should I believe?

Features that are common to both cases (the yellow peak around 0.4 and the magenta maximum around 0.3) might be “true”. But what is to say that the first set of peaks isn’t also important? In our earlier analogy, the neighbour’s daughter Emily has drawn a farm that looks quite different. You’re not about to say Tom’s wrong only because Emily drew a different picture, right? So right about now we need that drawing competition!

Evaluating Evaluation metrics: The drawing competition

Now we’ve stumbled twice already in how to evaluate our results. This time, we should make sure that we can evaluate the different results confidently. The plan is that we let the kids draw something else. Not a farm¸ because we don’t know how that looks like. We ask them to draw something related, like a house. We live in a house, so that should be easier to evaluate. And apparently houses and farms are not too dissimilar, so that the kids that do well in their house drawings, may be the ones that have the best farm drawings as well (or so we hope).

In terms of PINs, we have to create a network with community structure which looks a bit like a PIN. We can then use a community detection algorithm to find the communities and let our kids (functional similarity metrics) go away and evaluate the communities at different resolutions. This time we however know which maxima should show up, as we created the network and therefore know the community structure that should be found.

To create a PIN-similar network is not a mean feat and can be the topic of a whole PhD thesis, so I have focussed on a small, simple network, which is hopefully PIN-similar enough to be a meaningful evaluation for the functional similarity metrics. My network is generated on the basis of functional labels ordered in the ontology below.

A tree defining the relationship between labels 1-15. Nodes are annotated with these labels to generate a network. Labels which share more parent labels are closer related (i.e. 14 and 15 share the parent label 5).

Each node is assigned one or two labels between 6 and 15 randomly without replacement. Then the ontology is traced upwards to get the ancestral label sets associated with each node (i.e. label 15 has the ancestral labels 5 and 1). Edge probabilities are calculated between two nodes based on the number of common labels these nodes have in their ancestral label sets. Finally edges are added based on the edge probabilities to give a network with a density comparable to a PIN.

When this network is clustered into communities and functionally evaluated at different resolutions, the results look like this:

Average functional homogeneity of communities of size 3+ in a PIN-like random graph of 500 nodes. The top left graph (black) is based on label overlaps and is thus ground truth, while the other three are generated by the Pandey method (red), simGIC (blue), and simUI (green). The coloured graphs are generated by using the above-mentioned functional similarity metrics after 33% of the labels from 6-15 and 20% of the labels from 2-5 where “hidden” (reverted to the parent label).

The black graph shows how many labels nodes in the same community have in common at different resolutions. As these labels overlaps were used to generate the edge probabilities, this graph serves as a gold standard. The other three graphs were generated using the functional similarity metrics to be compared. To make the PIN-like network more realistic some labels were hidden, as we don’t always know the exact function of every protein when we evaluate PINs.

Now that we have the results, we just need to see how all the coloured graphs compare to the black ones, or in our analogy: how all the kid’s drawings compare to what our house looks like. But because we are doing science, one drawing competition sadly won’t do. What if one child draws our house very well, but happens to do a bit worse exactly in the parts where the house is similar to a farm. We’d think he’s the guy to listen to, and start thinking a farm looks like a shed?!

What we really need, is more drawing competitions. Lots of them. And this is where I am happy I’m at a computer where running one of these competitions takes maybe a minute. To get a bit more confidence in the results, I ran every simulation 100 times, for 27 different sets of parameters. And the results? Well, that’s the best bit. The results say Tom was drawing a great farm all along. Tom, the kid you’ve seen grow up and been bringing to daycare for years now, he was right. You weren’t lying at all when you said it was a great farm. Only in real life he’s not called Tom, he’s called the Pandey method ;).

Finding Communities in Networks: Link Clustering and the Affiliation Graph Model

The research I do revolves around how to break down protein interaction networks (PINs) into functional modules, or communities, using community detection methods. These communities are groups of proteins which interact more with one another than with the rest of the network. We decompose the PINs in this manner, as it is very difficult to determine how a protein functions through its interactions by simply looking at the entire network. The sheer amount of data in one of these networks, clouds the information we would actually like to see, as explained in my previous blog post.

In a journal club in August, I presented the overlapping community detection method Link Clustering by Ahn et al. which is the first community detection method I am aware of that assigns nodes to communities by clustering the edges between the nodes. I contrasted Link Clustering with the Affiliation Graph Model method proposed by Yang and Leskovec, as the authors use the same comparison technique to validate their methods. In a world where studies usually only work on a single network using a specific method, having two independent papers that use the same validation method is something that gets me much more excited than it probably should.

Link Clustering

Instead of assigning nodes to communities, Link Clustering focusses on the edges in a network. Clustering edges is essentially an equivalent problem to clustering nodes. This becomes obvious if you take a graph, decide to call all the edges “blobs” and say two of your new “blobs” have an edge between them if they share a node in the original network. This is called creating the “line graph” of the network and performing “blob” clustering on this is the same as performing edge clustering on the initial network.

To come back from my tangent, if an edge is assigned to community A, this means the nodes on either side of this edge are assigned community A by association. By assigning each edge to an individual community, Link Clustering creates overlapping node communities as shown in Figure 1 below.

Example of a simple, link-clustered network. Nodes can be seen to be in multiple communities as edges are assigned to communities (c.f. node 4 in red, green and yellow communities). [Ahn et al., 2010]

Figure 1: Example of a simple, link-clustered network. Nodes can be seen to be in multiple communities as edges are assigned to communities (c.f. node 4 in red, green and yellow communities). [Ahn et al., 2010]

In Link Clustering, edges are placed into the same community, based on similarity scores computed between all connected edges. Edge clusters are computed from these similarity scores using single-linkage clustering, where the edges with the highest similarity scores are iteratively merged into the same community. Using this method all edges end up in the same community in the end, so a threshold for the similarity score needs to be found at which the network is partitioned into communities in the best way. This threshold value represents a type of resolution parameter of the for community detection. At a low threshold there are few, large communities as many edges have been clustered together, while at a high threshold only highly similar edges are in the same community and there are many, small communities. Ahn et al. propose maximizing a function called the partition density to find the optimal resolution threshold. This function is simply a weighted average of the density of the communities. For those incredibly keen, it is given by the equation:

$D = \frac{2}{M} \sum_c m_c \frac{m_c - (n_c -1)}{(n_c - 2)(n_c -1)}$

Here, $M$ represents the total number of edges in the network and $m_c$ and $n_c$ represent the number of edges and nodes in the community $c$ respectively.

All of this partitioning is based on a similarity score between two edges. The crux of this method, the scoring measure, is explained in Figure 2.

Schematic Diagram showing how the similarity between edges e_ik and and e_jk is calculated in Link Clustering. The overlap of the inclusive neighbour sets of nodes i and j is divided by the union of these sets. The inclusive neighbour set of node i is computed by taking the neighbours of node i and including node i itself. The nodes in grey are overlapping between both sets, therefore the similarity between edges e_ik and e_jk is 4/12 or 1/3. [Ahn et al 2010]

Figure 2: Schematic Diagram showing how the similarity between edges $e_{ik}$ and and $e_{jk}$ is calculated in Link Clustering. The overlap of the inclusive neighbour sets of nodes i and j is divided by the union of these sets. The inclusive neighbour set of node i is computed by taking the neighbours of node i and including node i itself. The nodes in grey are overlapping between both sets, therefore the similarity between edges $e_{ik}$ and $e_{jk}$ is 4/12 or 1/3. [Ahn et al. 2010]

Affiliation Graph Model

The method which I would like to compare Link Clustering to is the Affiliation Graph Model (AGM). Briefly, this is a statistical community detection algorithm which generates overlapping communities. Node affiliations are recorded in the Community Affiliation Matrix shown in Figure 2.

Schematic of the Community Affiliation Matrix, which shows nodes (coloured squares) being affiliated with communities (circles A, B, and C) as shown by the arrows. Nodes can have multiple community affiliations as indicated by red and purple node squares. The probability of interaction between two nodes which are both in only community A, is $p_A$ . [Yang and Leskovec, 2012]

The AGM has a probability of interaction for each community, denoted $p_A$ for community $A$ . For nodes $u$ and $v$ , which may share several community affiliations, the probability of interaction, $p(u,v)$ is given by the equation:

$p(u,v) = 1 - \prod_{k \in C_{uv}} (1 - p_k),$

where $k$ denotes a community in the set of communities shared by nodes $u$ and $v$ , $C_{uv}$ . This model is be fitted to the network using a Metropolis-Hastings algorithm with a Markov chain constructed by pre-defined defined steps in the space of possible Community Affiliation Matrices (c.f. Yang and Leskovec 2012).

Result Comparison Method

Link Clustering itself has some very interesting results, especially relating to looking at real-world networks at different resolutions (which can be found in the paper). However, I want to focus on the validation method used in the paper which compares results generated from Link Clustering with that of other methods. This comparison method evaluates four different aspects of the partitions generated by different methods. These four aspects are: Community Quality, Overlap Quality, Community Coverage, and Overlap Coverage. The “Quality” aspects relate to network metadata indicating how good the obtained communities are, and the “Coverage” aspects relate to the amount of information extracted from a network.

The metadata referred to above is a general term for additional data available about the nodes in a network. In the case of PINs this metadata is related to the function of proteins in the network (their Gene Ontology annotations). Loosely, proteins with similar functions should be placed in the same community to improve the Community Quality measure. Overlap Quality concerns the boundaries between generated communities. If the functions assigned to proteins show similar boundaries between groups of functions, the Overlap Quality is high.

The “Coverage” values are more basic calculations. A partition has a high Community Coverage if the fraction of nodes assigned to communities with three or more nodes is high (non-trivial communities). A community of size two is uninformative, as it is the result of a single edge being in a community by itself. Overlap Coverage is simply the number of non-trivial community memberships per node. The two “Coverage” values are thus equal for non-overlapping community detection methods.

When comparing community detection methods, these four measures are computed for a partition generated by each method, and then rescaled so that the maximum score achieved by any method is 1 for each measure independently. These values are then added to give a score between 0 and 4 for each community detection method.

Results

The results shown in Figure 4 were generated by Yang and Leskovec to show their AGM outperforms Link Clustering (and other methods) on a variety of networks. While it is noteable that they used the same metric and networks used by Ahn et al. when proposing Link Clustering, this Figure must be taken with a pinch of salt. Yang and Leskovec acknowledge that the metric proposed by Ahn et al. rewards methods which find more communities in a network and thus do not fit the number of communities judged to be best by the AGM, but instead fit the same number as fitted in Link Clustering. Furthermore, it is peculiar that half of the networks used for comparison by Ahn et al. have disappeared in this comparison.

Community detection method comparison for Link Clustering (L), Clique Percolation (C), Mixed-Membership Stochastic Block Model (M) and the Affiliation Graph Model (A) on different PINs (PPI) and other social networks. Methods are compared by the metric proposed by Ahn et al. [Yang & Leskovec 2012]

Figure 4: Community detection method comparison for Link Clustering (L), Clique Percolation (C), Mixed-Membership Stochastic Block Model (M) and the Affiliation Graph Model (A) on different PINs (PPI) and other social networks. Methods are compared by the metric proposed by Ahn et al. [Yang & Leskovec 2012]

To conclude we can say that while it is very interesting that two papers use the same method to compare their methods to others for validation purposes, it wasn’t done completely accurately here. The comparison metric is however an interesting development to create a gold standard for method comparison. For my purposes, Community Quality is the most important, so maybe a weighted version of this may be more interesting.

Journal Club: A Novel Insight into Gene Ontology Semantic Similarity

In a past journal club, I presented a paper by Xu et al on their method of quantifying semantic similarity of proteins based on the Gene Ontology. The Gene Ontology (GO) is a directed acyclic graph (DAG, c.f. Figure 1) of terms with which a protein or a gene can be annotated. These terms are designed to describe the process the protein is involved in, what role it fulfils in this process, and where in the cell it is localized. Thus, when comparing the function of two proteins in a quantifiable way, it has become standard to refer back to the GO terms these proteins are annotated with and compare these based on their relationship in the DAG.

Figure 1: Schematic Diagram of a Directed Acyclic Graph (DAG).

As opposed to many methods, which measure the importance of a node (GO-term) in the DAG as its information content given an external database, the method proposed by Xu et al measures semantic similarity independently of external resources, which gives it the appearance of an independent measure. Furthermore, it claims to be good at dealing with badly annotated proteins, which is often a big problem in functional similarity calculations.

The similarity measure is a hybrid between node-based and edge-based methods, and is seemingly inspired by Wang et al’s 2007 paper and Shen et al’s 2010 paper. It is based on what they call “Shortest Semantic Differentiation Distance” or (SSDD), which is calculated over the shortest distance between two GO-terms on the DAG. When comparing the GO-terms A and B, the shortest path is measured by traversing the DAG upwards from node A to the lowest common ancestor of both nodes and down to node B.

The SSDD calculation over the shortest path is based on their so-called semantic Totipotency values assigned to the terms in the DAG that are part of the shortest path. The semantic Totipotency, T, of a term is calculated by:

Semantic Totipotency Measure

where the weight, ω, is given by:

Weight, ω

Here, Dst(t) denotest the number of descendents of the term t, and t_pdenotes the parent term of term t. Thus, the T-value of every node is both an expression of the depth of the DAG in this area and the coverage.

Finally, the SSDD is calculated by:

Semantic Similarity Differentiation Distance

Shortest Semantic Differentiation Distance

And subsequently the similarity of two GO terms is measured by:

Results

In their paper Xu et al showed the method to be competitive compared to other methods which compute protein functional similarity by pairwise GO-term comparisons, while also outperforming a common graph-based method in simUI. While these results look promising, the biological interpretability of such a semantic similarity measure remains difficult.

The strongest advantage of the SSDD method proposed was however its alleged invariance to annotation richness of proteins, which was presented as shown in Figure 2 below (Figure 5 in the paper).

Figure 2: The performance of difference methods dealing with sets of proteins with difference annotation richness.

The results in this figure show that SSDD exhibits only a slight decrease in Pearson Correlation Coefficient to a set of reference similarity values for proteins which are less well annotated. This ability to deal with badly annotated proteins is the true value of the SSDD method proposed by Xu et al. However, this investigation was performed by sets of proteins selected by the authors, and should thus be validated independently to confirm these surprising results.

Every Protein needs a Friend – Community Detection in Protein Interaction Networks

To make the OPIG soup, that has tasted of antibodies a lot lately, a little more diverse, I will try to spice things up with a dash of protein interaction networks, a pinch of community detection and a shot of functional similarity evaluation. I hope it remains edible!

In the 10 weeks I have spent at OPIG, my main focus has been on protein interaction networks, or more specifically, on this network:

View of the largest connected component of the HINT binary physical interaction network. Nodes represent proteins and edges are protein interactions.

Viewing this image, a popular German phrase comes to mind, which badly translated means: “As you see, you see nothing”. However, trying to “see” something in this, is what I’ve been trying to do. And as it turns out, I’m not the only person.

If we had a data set which says exactly which protein interacts with which other ones, then surely all biological pathway information must be incorporated in this data, and we should be able to cluster it into smaller modules or communities, which represent a biological function. This Gedankenexperiment is the theory which underlies my approach to these networks.

In reality, however, we don’t have this perfect data set. Protein interaction networks are very noisy with high estimated false positive and false negative rates for interactions, yet community detection algorithms have still been shown to be successful in outputting meaningful partitions of the network into communities. In this context “meaningful” refers to communities which group proteins together that have a similar biological function.

This brings us to a whole new problem. What is a “similar biological function” and how do you measure it? This question cannot be perfectly answered, but it seems the Gene Ontology annotations for biological process are a good place to start. In this framework, proteins are annotated with terms which describe the biological process they participate in. Of course there is not always a consensus about what term is to be assigned to a protein, and it is questionable how precisely a protein’s function within a process can be determined, but it wouldn’t be called work, if it was easy.

In my 10 weeks here, I’ve only scraped the tip of what is detection of functional communities in protein interaction networks, but it looks promising that the communities obtained may have some significance regarding biological modules. It is my hope that I can use data sets such as gene expression studies to further investigate this significance in the future, and maybe, if I’m very lucky, work towards helping people classify macrophage phenotypes or identify cancer in the distant future. The best place to do this, would definitely be in the friendly atmosphere that is OPIG!

Oxford Protein Informatics Group

or "OPIG" to friends