Confidence (scores) in STRING

There are many techniques for inferring protein interactions (be it physical binding or functional associations), and each one has its own quirks: applicability, biases, false positives, false negatives, etc. This means that the protein interaction networks we work with don’t map perfectly to the biological processes they attempt to capture, but are instead noisy observations.

The STRING database tries to quantify this uncertainty by assigning scores to proposed protein interactions based on the nature and quality of the supporting evidence. STRING contains functional protein associations derived from in-house predictions and homology transfers, as well as from a number of externally maintained databases. Each interaction is assigned a score between zero and one, which is (meant to be) the probability that the interaction really exists given the available evidence.
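To illustrate how evidence from several channels can be folded into a single probability, here is a minimal Python sketch of a STRING-style combination. The exact rule and the prior value are assumptions based on the STRING papers, not a reproduction of the actual scoring pipeline.

```python
def combined_score(channel_scores, prior=0.041):
    """Combine per-channel evidence scores into one probability.

    Sketch of a STRING-style combination: strip the assumed random-chance
    prior from each channel, combine the channels as independent pieces of
    evidence, then add the prior back once.  The prior value 0.041 is an
    assumption taken from the STRING papers, not read from the database.
    """
    p_no_interaction = 1.0
    for s in channel_scores:
        s_adj = max(0.0, (s - prior) / (1.0 - prior))  # remove the prior
        p_no_interaction *= 1.0 - s_adj                # independent channels
    combined = 1.0 - p_no_interaction
    return combined + prior * p_no_interaction         # re-add the prior once
```

Note that with a single channel the prior removal and re-addition cancel out, so the combined score equals the channel score; additional channels only ever push the score upwards.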

Throughout my short research project with OPIG last year I worked with STRING data for Borrelia hermsii, a relatively small network of scored interactions across 815 proteins. I was working with v10.0, the latest database release available at the time, but also had the chance to compare it to the v9.1 data. I expected that, with data from new experiments and improved scoring methodologies available, the more recent network would be more or less a re-scored superset of the older one. Even if some low-scored interactions weren't carried across the update, I didn't expect them to make up any significant proportion of the data. Interestingly enough, this was not the case.

Of the 31 264 scored protein–protein interactions in v9.1, 10 478 (almost exactly a third of the whole dataset) didn't make it across the update to v10.0. The lost interactions don't seem to have much in common either: they come from a range of data sources and don't appear to be located within the same region of the network. The update also introduced 21 192 previously unrecorded interactions.
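Comparing two releases like this reduces to set operations over their (unordered) protein pairs. A minimal Python sketch, with toy records standing in for the real STRING dump files:

```python
def edge_set(interactions):
    """Normalise (protein_a, protein_b, score) records into a set of
    unordered protein pairs, so that (A, B) and (B, A) compare equal."""
    return {frozenset((a, b)) for a, b, _score in interactions}

def compare_releases(old, new):
    """Split the edges of two releases into kept / discarded / added pairs."""
    old_pairs, new_pairs = edge_set(old), edge_set(new)
    return {
        "kept": old_pairs & new_pairs,
        "discarded": old_pairs - new_pairs,  # present in old, gone in new
        "added": new_pairs - old_pairs,      # introduced by the update
    }
```

Running this over the two Borrelia hermsii edge lists is what produced the counts above.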

[Figure: densityComparison]

Gaussian kernel density estimates for the score distribution of interactions across the entire v9.1 Borrelia hermsii dataset (navy) and across the discarded proportion of the dataset (dark red). Proportionally more low-scored interactions have been discarded.
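Density estimates like these can be computed with a plain Gaussian KDE; a self-contained NumPy sketch (the bandwidth of 0.02 is an arbitrary choice for illustration, not the one used for the plots):

```python
import numpy as np

def gaussian_kde(scores, grid, bandwidth=0.02):
    """Evaluate a Gaussian kernel density estimate of `scores` on `grid`.

    Each data point contributes a normal bump of width `bandwidth`; the
    estimate is the average of these bumps at every grid position.
    """
    scores = np.asarray(scores, dtype=float)[:, None]     # shape (n, 1)
    z = (np.asarray(grid)[None, :] - scores) / bandwidth  # broadcast to (n, m)
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return bumps.mean(axis=0)
```

Evaluating both the full and the discarded score sets on the same grid and plotting the two curves gives a figure like the one above.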

Repeating the comparison with baker's yeast (Saccharomyces cerevisiae), a much more extensively studied organism, shows this isn't a one-off case either. The yeast network is much larger (777 589 scored interactions across 6400 proteins in STRING v9.1), and the changes introduced by v10.0 appear to be scaled accordingly: 237 427 yeast interactions were omitted in the update, and 399 836 new ones were added.

[Figure: discardedYeast]

Kernel density estimates for the score distribution for yeast in STRING v9.1. While the overall (navy) and discarded (dark red) score distributions differ from the ones for Borrelia hermsii above, a similar trend of omitting more low-scored edges is observed.

So what causes over 30% of the scored interactions in the database to disappear into thin air? At least in part, this may have to do with thresholding and small changes to the scoring procedure. STRING truncates reported interactions to those with a score above 0.15. Estimating how many low-scored interactions have been lost from the original dataset in this way is difficult, but the wide coverage of gene co-expression data suggests they are a far from negligible proportion of the scored networks. The changes to the co-expression scoring pipeline in the latest release [1], coupled with the relative abundance of co-expression data, could easily have shifted scores close to 0.15 to the other side of the threshold, and therefore might explain some of the dramatic difference.
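To get a feel for how sensitive the reported network is to the cutoff, one can count the interactions sitting within a small band around it. A hypothetical helper (the band width of 0.02 is an arbitrary illustration, not a STRING parameter):

```python
def near_threshold(scores, threshold=0.15, band=0.02):
    """Return the scores lying within `band` of the reporting cutoff.

    Interactions in this band are the ones a small rescoring change could
    push across the threshold, making them appear in or vanish from a
    release without any change in the underlying evidence.
    """
    return [s for s in scores if abs(s - threshold) <= band]
```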

However, this still doesn't account for changes introduced in other channels, or for interactions which have non-overlapping types of supporting evidence recorded in the two database versions. Moreover, thresholding at 0.15 adds a layer of uncertainty to the dataset: there is no way to distinguish between interactions for which the evidence is very weak (a score below 0.15), pairs of proteins that can safely be assumed not to interact (a "true" score of 0), and pairs of proteins for which there is simply no data available. While very weak evidence might not be of much use when studying a small part of the network, it may have consequences on a larger scale: even if only a very small fraction of these interactions are real, they might be indicative of robustness in the network that can't otherwise be detected.

In conclusion, STRING is a valuable resource of protein interaction data, but one ought to take the reported scores with a grain of salt if one is to take a stochastic approach to protein interaction networks. Perhaps if the scoring pipelines were documented in a way that made them reproducible, and if the data weren't thresholded, we would be able to study the uncertainty in protein interaction networks with a bit more confidence.

References:

[1] Szklarczyk, Damian, et al. "STRING v10: protein–protein interaction networks, integrated over the tree of life." Nucleic Acids Research (2014): gku1003.