Tag Archives: Protein Interaction Networks

Confidence (scores) in STRING

There are many techniques for inferring protein interactions (be it physical binding or functional associations), and each one has its own quirks: applicability, biases, false positives, false negatives, etc. This means that the protein interaction networks we work with don’t map perfectly to the biological processes they attempt to capture, but are instead noisy observations.

The STRING database tries to quantify this uncertainty by assigning scores to proposed protein interactions based on the nature and quality of the supporting evidence. STRING contains functional protein associations derived from in-house predictions and homology transfers, as well as associations taken from a number of externally maintained databases. Each of these interactions is assigned a score between zero and one, which is (meant to be) the probability that the interaction really exists given the available evidence.

Throughout my short research project with OPIG last year I worked with STRING data for Borrelia hermsii, a relatively small network of scored interactions across 815 proteins. I was working with v.10.0, the latest available database release, but also had the chance to compare it to v.9.1 data. I expected that, with data from new experiments and improved scoring methodologies available, the more recent network would be more or less a re-scored superset of the older one. Even if some low-scored interactions weren’t carried across the update, I didn’t expect them to be any significant proportion of the data. Interestingly enough, this was not the case.

Out of 31 264 scored protein-protein interactions in v.9.1, 10 478 (almost exactly a third of the whole dataset) didn’t make it across the update to v.10.0. The lost interactions don’t seem to have very much in common either: they come from a range of data sources and don’t appear to be located within the same region of the network. The update also includes 21 192 previously unrecorded interactions.
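A comparison like this boils down to set operations on undirected protein pairs. The following is a minimal sketch on made-up toy data; the function names and the (protein, protein, score) tuple layout are assumptions for illustration, not the format of the STRING download files.

```python
def edge_key(a, b):
    """Order-independent key, so that (A, B) and (B, A) count as one interaction."""
    return (a, b) if a <= b else (b, a)


def compare_versions(old_edges, new_edges):
    """Split interactions into lost, gained and kept between two releases.

    Each input is an iterable of (protein_a, protein_b, score) tuples
    (a hypothetical layout chosen for this sketch).
    """
    old = {edge_key(a, b) for a, b, _ in old_edges}
    new = {edge_key(a, b) for a, b, _ in new_edges}
    return old - new, new - old, old & new


# Toy data standing in for two database releases (not real STRING records).
old_edges = [("P1", "P2", 0.90), ("P2", "P3", 0.20), ("P3", "P4", 0.16)]
new_edges = [("P2", "P1", 0.92), ("P4", "P5", 0.40)]

lost, gained, kept = compare_versions(old_edges, new_edges)
lost_fraction = len(lost) / len(old_edges)

# The fraction quoted in the text for Borrelia hermsii:
borrelia_lost_fraction = 10478 / 31264
print(round(borrelia_lost_fraction, 3))
```

Keying each interaction on the sorted pair matters here: without it, an edge listed in the opposite orientation in the newer release would be miscounted as both lost and gained.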


Gaussian kernel density estimates for the score distribution of interactions across the entire v.9.1 Borrelia hermsii dataset (navy) and across the discarded portion of the dataset (dark red). Proportionally more low-scored interactions have been discarded.
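For readers who want to reproduce a plot of this kind, here is a minimal pure-Python sketch of a Gaussian kernel density estimate. The bandwidth rule (a normal-reference rule of thumb) and the toy scores are assumptions; the post does not say which bandwidth or software produced the figure.

```python
import math
import statistics


def gaussian_kde(samples, bandwidth=None):
    """Return a function x -> estimated density, using Gaussian kernels."""
    n = len(samples)
    if bandwidth is None:
        # Normal-reference rule of thumb (an assumption, not the post's choice).
        bandwidth = 1.06 * statistics.stdev(samples) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Toy scores (made up): the whole dataset and the discarded subset.
all_scores = [0.16, 0.18, 0.20, 0.25, 0.40, 0.60, 0.90, 0.95]
discarded_scores = [0.16, 0.18, 0.20, 0.25]

kde_all = gaussian_kde(all_scores)
kde_discarded = gaussian_kde(discarded_scores)

# Evaluate both curves on a grid over the score range, ready for plotting.
grid = [i / 100 for i in range(0, 101)]
curve_all = [kde_all(x) for x in grid]
curve_discarded = [kde_discarded(x) for x in grid]
```

In this toy example the discarded curve sits above the overall curve at low scores and below it at high scores, which is the pattern described in the caption.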

Repeating the comparison with baker’s yeast (Saccharomyces cerevisiae), a much more extensively studied organism, shows this isn’t a one-off case either. The yeast network is much larger (777 589 scored interactions across 6400 proteins in STRING v.9.1), and the changes introduced by v.10.0 appear to be scaled accordingly: 237 427 yeast interactions were omitted in the update, and 399 836 new ones were added.


Kernel density estimates for the score distribution for yeast in STRING v.9.1. While the overall (navy) and discarded (dark red) score distributions differ from the ones for Borrelia hermsii above, a similar trend of omitting more low-scored edges is observed.

So what causes over 30% of the scored interactions in the database to disappear into thin air? At least in part this may have to do with thresholding and small changes to the scoring procedure. STRING only reports interactions with a score above 0.15. Estimating how many low-scored interactions have been lost from the original dataset in this way is difficult, but the wide coverage of gene co-expression data would suggest that they’re a far from negligible proportion of the scored networks. The changes to the co-expression scoring pipeline in the latest release [1], coupled with the relative abundance of co-expression data, could easily have pushed scores close to 0.15 to the other side of the threshold, and might therefore explain some of the dramatic difference.
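The effect of the cut-off can be illustrated with a small sketch: a re-scoring that moves scores by only a few hundredths changes which interactions are reported at all. The scores below are made up, and the strict "above 0.15" comparison simply follows the wording used here.

```python
THRESHOLD = 0.15  # reporting cut-off described in the text


def reported(scores, threshold=THRESHOLD):
    """Keep only the interactions whose score is above the threshold."""
    return {edge for edge, score in scores.items() if score > threshold}


# Hypothetical scores for the same three protein pairs in two releases.
old_scores = {("A", "B"): 0.16, ("A", "C"): 0.14, ("B", "C"): 0.70}
new_scores = {("A", "B"): 0.14, ("A", "C"): 0.17, ("B", "C"): 0.71}

old_reported = reported(old_scores)
new_reported = reported(new_scores)

disappeared = old_reported - new_reported  # looks "lost" in the update
appeared = new_reported - old_reported     # looks "new" in the update
```

Here no score moved by more than 0.03, yet one of the two originally reported interactions vanishes and a previously invisible one appears.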

However, this still doesn’t account for changes introduced in other channels, or for interactions which have non-overlapping types of supporting evidence recorded in the two database versions. Moreover, thresholding at 0.15 adds a layer of uncertainty to the dataset: there is no way to distinguish between interactions for which there is very weak evidence (i.e. a score below 0.15), pairs of proteins that can be safely assumed not to interact (i.e. a “true” score of 0), and pairs of proteins for which there is simply no data available. While very weak evidence might not be of much use when studying a small part of the network, it may have consequences on a larger scale: even if only a very small fraction of these interactions are true, they might be indicative of robustness in the network which can’t otherwise be detected.

In conclusion, STRING is a valuable resource of protein interaction data, but one ought to take the reported scores with a grain of salt if one is to take a stochastic approach to protein interaction networks. Perhaps if scoring pipelines were documented in a way that made them reproducible and if the data weren’t thresholded, we would be able to study the uncertainty in protein interaction networks with a bit more confidence.


[1] Szklarczyk, Damian, et al. “STRING v10: protein–protein interaction networks, integrated over the tree of life.” Nucleic Acids Research (2014): gku1003.

Every Protein needs a Friend – Community Detection in Protein Interaction Networks

To make the OPIG soup, which has tasted a lot of antibodies lately, a little more diverse, I will try to spice things up with a dash of protein interaction networks, a pinch of community detection and a shot of functional similarity evaluation. I hope it remains edible!


In the 10 weeks I have spent at OPIG, my main focus has been on protein interaction networks, or more specifically, on this network:

View of the largest connected component of the HINT binary physical interaction network. Nodes represent proteins and edges are protein interactions.

Viewing this image, a popular German phrase comes to mind, which badly translated means: “As you see, you see nothing”. However, “seeing” something in this is exactly what I’ve been trying to do. And as it turns out, I’m not the only person.

If we had a data set that said exactly which proteins interact with which others, then surely all biological pathway information would be contained in it, and we should be able to cluster it into smaller modules or communities, each representing a biological function. This Gedankenexperiment is the idea underlying my approach to these networks.

In reality, however, we don’t have this perfect data set. Protein interaction networks are very noisy with high estimated false positive and false negative rates for interactions, yet community detection algorithms have still been shown to be successful in outputting meaningful partitions of the network into communities. In this context “meaningful” refers to communities which group proteins together that have a similar biological function.
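One common way to judge whether a partition is more than arbitrary is its modularity, which compares the number of edges inside each community with what would be expected from the node degrees alone. The sketch below computes it for a toy graph; it is an illustration of the general idea, not the specific algorithm or quality measure used in this project, and it assumes an undirected graph without self-loops or duplicate edges.

```python
def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities c of [ L_c / m - (d_c / (2 m))^2 ],
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the total degree of the nodes in c.
    """
    m = len(edges)
    internal = {}  # community -> number of edges with both ends inside it
    degree = {}    # community -> summed degree of its nodes
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree.items()
    )


# Toy network: two triangles joined by a single bridging edge.
edges = [
    ("A", "B"), ("B", "C"), ("A", "C"),  # first triangle
    ("D", "E"), ("E", "F"), ("D", "F"),  # second triangle
    ("C", "D"),                          # bridge
]

two_communities = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}
one_community = {node: 0 for node in "ABCDEF"}

q_two = modularity(edges, two_communities)
q_one = modularity(edges, one_community)
```

Splitting the toy graph into its two triangles gives a clearly positive modularity, while lumping everything into one community gives zero, which matches the intuition that the triangles are the natural modules.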

This brings us to a whole new problem. What is a “similar biological function” and how do you measure it? This question cannot be perfectly answered, but the Gene Ontology annotations for biological process seem to be a good place to start. In this framework, proteins are annotated with terms which describe the biological processes they participate in. Of course there is not always a consensus about which term should be assigned to a protein, and it is questionable how precisely a protein’s function within a process can be determined, but it wouldn’t be called work if it were easy.
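As a rough illustration of how annotation-based similarity could be scored, the sketch below uses the Jaccard index between the sets of terms attached to two proteins and averages it over all pairs in a community. This is only one simple choice and an assumption on my part; it ignores the hierarchy of the ontology, and the term labels and protein names are placeholders rather than real Gene Ontology identifiers.

```python
from itertools import combinations


def jaccard(terms_a, terms_b):
    """Shared terms divided by all terms; 0.0 if both sets are empty."""
    union = terms_a | terms_b
    if not union:
        return 0.0
    return len(terms_a & terms_b) / len(union)


def mean_pairwise_similarity(proteins, annotations):
    """Average Jaccard similarity over all protein pairs in a community."""
    pairs = list(combinations(proteins, 2))
    if not pairs:
        return 0.0
    total = sum(jaccard(annotations[p], annotations[q]) for p, q in pairs)
    return total / len(pairs)


# Placeholder annotations (not real GO terms).
annotations = {
    "P1": {"term_1", "term_2"},
    "P2": {"term_1", "term_2"},
    "P3": {"term_2", "term_3"},
    "P4": {"term_4"},
}

coherent = mean_pairwise_similarity(["P1", "P2", "P3"], annotations)
unrelated = mean_pairwise_similarity(["P1", "P4"], annotations)
```

A community whose members share many terms scores close to one, while a community of proteins with disjoint annotations scores zero.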

In my 10 weeks here, I’ve only scratched the surface of detecting functional communities in protein interaction networks, but it looks promising that the communities obtained may have some significance regarding biological modules. It is my hope that I can use data sets such as gene expression studies to investigate this significance further, and maybe, if I’m very lucky, work towards helping people classify macrophage phenotypes or identify cancer in the distant future. The best place to do this would definitely be in the friendly atmosphere that is OPIG!