Category Archives: Group Meetings

What we discuss during cake at our Tuesday afternoon group meetings

When Does Chemical Elaboration Induce a Ligand To Change Its Binding Mode?

For my journal club in June, I chose to present a Journal of Medicinal Chemistry article entitled “When Does Chemical Elaboration Induce a Ligand To Change Its Binding Mode?” by Malhotra and Karanicolas. This article uses a large-scale collection of ligand pairs to investigate the circumstances in which elaborations of a ligand change the original binding mode.

One of the primary goals in medicinal chemistry is the optimisation of biological activity by chemical elaboration of a hit compound. This hit-to-lead optimisation often assumes that addition of functional groups to a given hit scaffold will not change the original binding mode.

In order to investigate the circumstances in which this assumption holds true and how often it holds true, they built up a large-scale collection of 297 related ligand pairs solved in complex with the same protein partner. Each pair consisted of a larger and smaller ligand; the larger ligand could have arisen from elaboration of the smaller ligand. They found that for 41 out of the 297 pairs (14%), the binding mode changed upon elaboration of the smaller ligand.

They investigated many physicochemical properties of the ligand, the protein-ligand complex and the protein binding pocket. They summarise the statistical significance and predictive power of the investigated properties with the table shown below.

They found that the property with the lowest p-value was the “rmsd after minimisation of the aligned complex” (RMAC). They developed this metric to probe whether the larger ligand could be accommodated in the protein without changing binding mode. They did so by aligning the shared substructure of the larger ligand onto the smaller ligand’s complex and then carrying out an energy minimisation. By monitoring the RMSD difference of the larger ligand relative to the initial pose (RMAC), they can gauge how compatible the larger ligand is with the protein. Larger RMAC values indicate greater incompatibility, hence a greater likelihood for the binding mode to not be preserved.

The authors generated receiver operating characteristic (ROC) plots to compare the predictive power of the properties considered. ROC curves are made by plotting the true positive rate (TPR) against the false positive rate (FPR). A random classifier would yield the dotted line from the bottom left to the top right, shown in the plots below. The best predictors would give a point in the top left corner of the plot. The properties that do well include RMAC, pocket volume, molecular weight, lipophilicity and potency.
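To make the construction of a ROC curve concrete, here is a minimal sketch that sweeps a threshold over a single property and records the (FPR, TPR) point at each step. The RMAC values and labels are invented for illustration and are not taken from the paper, and the function names are my own.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, one per distinct score threshold.

    scores: higher value = stronger prediction of the positive class.
    labels: 1 for positive (e.g. binding mode changed), 0 for negative.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    # Sort by descending score so that lowering the threshold admits more pairs.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only once all pairs sharing this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up RMAC values (in angstroms) and whether the binding mode changed.
rmac = [2.1, 1.7, 0.4, 0.3, 1.2, 0.2]
changed = [1, 1, 0, 0, 0, 1]
curve = roc_points(rmac, changed)
print(curve)
print(round(auc(curve), 3))
```

A property with no predictive power gives an area of about 0.5 (the diagonal), while a perfect predictor gives 1.0.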

They also combined properties to enhance predictive power and conclude that RMAC and molecular weight together offer good predictivity.

Finally, the authors look at the pairs that have low RMAC values (i.e. the elaboration should be compatible with the protein pocket), yet show a change in binding mode. For these cases, a specific substitution may enable formation of a new, stronger interaction or, for pseudosymmetric ligands, the alternate pose can mimic many of the interactions of the original pose.

Antibody Developability: Experimental Screening Assays

[This blog post is centered around the paper “Biophysical properties of the clinical-stage antibody landscape” (http://www.pnas.org/content/114/5/944.abstract) by Tushar Jain and coworkers. It is designed as a very basic intro for computational scientists into the world of experimental biophysical assays.]

A major concern in the development of antibody therapies is being able to predict “developability issues” at the screening stage, to avoid costly Phase II/Phase III clinical trial failures. Examples of such issues include an antibody being difficult to manufacture, possessing unsuitable pharmacodynamic or pharmacokinetic profiles, having a propensity to aggregate (both in storage and in vivo) and being highly immunogenic.

This post is designed to give a clear and concise summary of the principles behind some of the most common biophysical experimental assays used to assess antibody candidates for future developability issues.

1. Ease of manufacture

HEK Titre (HEKt): This assay tests the expression level of the antibody (the higher the better). The heavy and light chain sequences are subcloned into vectors (such as pcDNA 3.4+, ThermoFisher) and these vectors are subsequently transfected into a suspension of Human embryonic kidney (HEK293) cells. After a set number of days the supernatant is harvested to assess the degree of expression.

2. Stability of 3D structure

Melting temperature using Differential Scanning Fluorimetry (Tm with DSF) Assay: This assay tests the thermal stability of the antibody. The higher the thermal stability, the less likely the protein will spontaneously unfold and become immunogenic. The antibody is mixed with a dye, such as SYPRO Orange, that fluoresces when in contact with hydrophobic regions. The mixture is then taken through a range of temperatures (e.g. 40°C -> 95°C at a rate of 0.5°C/2min). As the protein begins to unfold, buried hydrophobic residues become exposed and the level of fluorescence suddenly increases. The temperature at which the increase in fluorescence intensity is greatest gives us the Tm value.

(Further reading: http://www.beta-sheet.org/resources/T22-Niesen-fingerprinting_Oxford.pdf)
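The Tm read-out described above can be sketched as a search for the steepest rise in the melt curve. This is a minimal illustration using finite differences on a synthetic sigmoidal curve; the function name and the data are my own assumptions, and real instrument software will typically smooth or fit the curve first.

```python
from math import exp


def melting_temperature(temps, fluorescence):
    """Estimate Tm as the midpoint of the interval with the steepest rise.

    temps and fluorescence are equal-length lists, with temps increasing.
    """
    best_slope = float("-inf")
    tm = None
    for i in range(len(temps) - 1):
        # Finite-difference approximation of dF/dT on this interval.
        slope = (fluorescence[i + 1] - fluorescence[i]) / (temps[i + 1] - temps[i])
        if slope > best_slope:
            best_slope = slope
            tm = (temps[i] + temps[i + 1]) / 2
    return tm


# Synthetic melt curve: 40 to 95 degrees C in 0.5 degree steps,
# with a sigmoidal unfolding transition centred at 70.25 degrees C.
temps = [40 + 0.5 * i for i in range(111)]
signal = [1 / (1 + exp(-(t - 70.25) / 2)) for t in temps]
print(melting_temperature(temps, signal))
```

On this noise-free curve the estimate recovers the transition midpoint of 70.25°C; with real data the derivative is noisy, so some smoothing before taking the maximum is usually needed.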

3. Stickiness assays (Aggregation propensity/Low solubility/High viscosity)

Affinity-capture Self-interaction Nanoparticle Spectroscopy (AC-SINS) Assay: This assay tests how likely an antibody is to interact with itself. It uses gold nanoparticles that are coated with anti-Fc antibodies. When a dilute solution of antibodies is added, they rapidly become immobilised on the gold beads. If these antibodies subsequently attract one another, it leads to shorter interparticle distances and longer absorption wavelengths that can be detected by spectroscopy.

(Further reading: https://www.ncbi.nlm.nih.gov/pubmed/24492294)

Clone Self-interaction by Bio-layer Interferometry (CSI-BLI) Assay: A more high-throughput method that uses a label-free technology to measure self-interaction. Antibodies are loaded onto the biosensor tip and white light is shone down the instrument to yield an internal reflection interference pattern. Then the tip is inserted into a solution of the same antibody, and if self-interaction occurs, then the interference pattern shifts by an amount proportional to the change in thickness of the biological layer. Images from: http://www.fortebio.com/bli-technology.html

(Further Reading: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3896597/)

Hydrophobic Interaction Chromatography (HIC) Assay: Antibodies are mixed into a polar mobile phase and then washed over a hydrophobic column. UV-absorbance or other techniques can then be used to determine the degree of adhesion.

(Further Reading: https://www.ncbi.nlm.nih.gov/pubmed/4094424)

Standup Monolayer Adsorption Chromatography (SMAC) Assay: Antibodies are injected onto a pre-packed Zenix HPLC column and their retention times are calculated. The longer the retention time, the lower their colloidal stability and the more prone they are to aggregate.

(Further Reading: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4622974/)

Size-exclusion Chromatography (SEC) Assay: Antibodies are flowed through a column consisting of spherical beads with minuscule pores. Non-aggregated antibodies are small enough to get trapped in the pores, whereas aggregated antibodies will flow through the column more rapidly. Percentage aggregation can be worked out from the concentrations of the different fractions.

4. Degree of specificity

Cross-Interaction Chromatography (CIC) Assay: This assay measures an antibody’s retention time as it flows across a column conjugated with polyclonal human serum antibodies. If an antibody takes longer to exit the column, it indicates that its surface is likely to interact with several different in vivo targets.

(Further Reading: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3896597/)

Enzyme-linked Immunosorbent Assay (ELISA) – with common antigens or Baculovirus Particles (BVPs): Common antigens or BVPs are fixed onto a solid surface and then a solution containing the antibody of interest linked to an enzyme (such as horseradish peroxidase, HRP) is washed over them. Incubation lasts for about an hour before any unreacted antibodies are washed off. When the appropriate enzyme substrate is then added, it triggers emission of a visible, fluorescent or luminescent nature, which can be detected. The intensity is proportional to the amount of antibody stuck to the surface.

(Further Reading: https://www.thermofisher.com/uk/en/home/life-science/protein-biology/protein-biology-learning-center/protein-biology-resource-library/pierce-protein-methods/overview-elisa.html)

Poly-Specificity Reagent (PSR) Binding Assay: A more high-throughput method that uses fluorescence-activated cell sorting (FACS), a type of flow cytometry. A PSR is generated by biotinylating soluble membrane proteins (from Chinese hamster ovary (CHO) cells, for example), which is then incubated with IgG-presenting yeast. After washing, a secondary labeling mix is added, and flow cytometry is used to determine a median fluorescence intensity; the higher the median intensity, the greater the chance of non-specific binding.

(Further Reading: https://www.ncbi.nlm.nih.gov/pubmed/24046438)

Le Tour de Farce v5.0

Every summer the OPIGlets go on a cycle ride across the scorched earth of Oxford in search of life-giving beer. Now in its fifth iteration, the annual Tour de Farce took place on Tuesday the 13th of June.

Establishments frequented included The Victoria, The Plough, Jacobs Inn (where we had dinner and didn’t get licked by their goats, certainly not), The Perch and finally The Punter. Whilst there were plans to go to The One for their inimitable “lucky 13s” by 11PM we were alas too late, so doubled down in The Punter.

Highlights of this year’s trip included certain members of the group almost immediately giving up when trying to ride a fixie and subsequently being shown up by our unicycling brethren.

Conformational diversity analysis reveals three functional mechanisms in proteins

This paper was published recently in PLOS Computational Biology and looks at the conformational diversity (flexibility) of protein structures by comparing solved structures of identical sequences.

The premise of the work is that different crystal structures of the same protein represent instances of the conformational space of the protein. These different instances are identical in amino acid sequence but often differ in other ways: they could come from different crystal forms, or the protein could have different co-factors bound or have undergone post-translational modifications.

The data set used in the paper came from the CoDNaS (Conformational Diversity of the Native State) database, URL: http://ufq.unq.edu.ar/codnas.

Only structures solved using X-ray crystallography to a resolution better than 2.5 Å were used, and only proteins for which at least 5 conformers were available (average of 15.53 conformers per protein). Just under 5000 different protein chains made up the set. The measure used to describe the protein chains was maximum conformational diversity (the maximum RMSD between any of the conformers of a given protein chain).
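As a rough illustration of the maximum conformational diversity measure, the sketch below takes the largest RMSD over all pairs of conformers. It assumes the conformers have already been superposed and have matching atoms in the same order, which is a simplification of what a real structural comparison does; the coordinates and function names are invented for the example.

```python
from itertools import combinations
from math import sqrt


def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) tuples.

    Assumes the two structures are already superposed.
    """
    squared = sum((a - b) ** 2
                  for p, q in zip(coords_a, coords_b)
                  for a, b in zip(p, q))
    return sqrt(squared / len(coords_a))


def max_conformational_diversity(conformers):
    """Return (largest pairwise RMSD, (i, j)) over all conformer pairs."""
    return max((rmsd(conformers[i], conformers[j]), (i, j))
               for i, j in combinations(range(len(conformers)), 2))


# Three toy two-atom "conformers" of the same chain.
conformers = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 3.0, 0.0)],
]
value, pair = max_conformational_diversity(conformers)
print(round(value, 3), pair)
```

The pair of indices returned alongside the RMSD is the "maximum RMSD pair" referred to in the classification further down.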

The authors describe a relationship between this maximum conformational diversity and the presence or absence of intrinsically disordered regions (IDRs). An IDR was defined as a segment of at least 5 contiguous residues with missing electron density (the first and last 20 residues of the chain were not included).

The proteins were divided into three groups.

Rigid

  • No IDRs

Partially disordered

  • IDRs in at least one conformer
  • IDR in the maximum RMSD pair of conformational diversity

Malleable

  • IDRs in at least one conformer
  • No IDR in the maximum RMSD pair of conformational diversity
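The IDR definition and the three groups above can be sketched in a few lines. This is my own reading of the rules rather than the authors' code: in particular, I assume that "IDR in the maximum RMSD pair" means at least one of the two conformers in that pair contains an IDR, and the function names are hypothetical.

```python
def conformer_has_idr(missing, min_length=5, terminal_skip=20):
    """True if a conformer contains an IDR.

    missing: list of booleans per residue, True where electron density
    is missing. The first and last `terminal_skip` residues are ignored.
    """
    core = missing[terminal_skip:len(missing) - terminal_skip]
    run = 0
    for is_missing in core:
        run = run + 1 if is_missing else 0
        if run >= min_length:
            return True
    return False


def classify_protein(idr_flags, max_pair):
    """Assign a protein to one of the three groups.

    idr_flags: one boolean per conformer (output of conformer_has_idr).
    max_pair: indices (i, j) of the conformer pair with the largest RMSD.
    """
    if not any(idr_flags):
        return "rigid"
    i, j = max_pair
    # Assumption: an IDR in either member of the pair counts.
    if idr_flags[i] or idr_flags[j]:
        return "partially disordered"
    return "malleable"


# Toy example: three conformers of a 50-residue chain.
ordered = [False] * 50
disordered = [False] * 20 + [True] * 6 + [False] * 24
flags = [conformer_has_idr(m) for m in (ordered, disordered, ordered)]
print(classify_protein(flags, max_pair=(0, 1)))
print(classify_protein(flags, max_pair=(0, 2)))
```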

In general, rigid proteins have lower conformational diversity than partially disordered proteins, which in turn have lower conformational diversity than malleable proteins. The authors describe how these differences are not due to crystallographic conditions, protein length, number of crystal contacts or number of conformers.

The authors then go on to compare other properties based on these three types of protein chains including amino acid composition, loop RMSD and cavities and tunnels.

They summarise their findings with the figure below.

Biophysical Society 61st Annual Meeting – New Orleans, February 2017

As the sole representative of OPIG attending Biophys 2017 in New Orleans, I had to bear the heavy burden of a long and lonely flight and the fear of missing out on a week of the very grey Oxford winter. Having successfully crossed the border into the US, which was thankfully easier for me than it was for some of our scientific colleagues from around the world, I found my first time attending the conference to be full of very interesting and relevant science. While also covering a wide variety of experimental techniques and non-protein topics, the conference is so large and broad that there was more than enough to keep me busy over the five days, featuring folding, structure prediction, docking, networks, and molecular dynamics.

There were several excellent talks on the subject of folding pathways, misfolding and aggregation. A common theme was the importance of the kinetic stability of the native state, and the mechanisms by which it may be prevented from reaching a non-native global thermodynamic minimum. This is particularly important for serpins, large protease inhibitors which inactivate proteases by a suicide mechanism. The native and active state can be transformed into a lower energy conformation over long timescales. However, the same transformation also occurs upon cleavage near the C-terminal end, which allows insertion of the C-terminal tail into a beta sheet and holds the cleaving protease inactive; the stored energy is therefore very important for function. Anne Gershenson described recent simulations and experiments to elucidate the order in which substructures of the complete fold assemble. There are many cooperative substructures in this case, and N-terminal helices form at an early stage. The overall topology appears to be consistent with a cotranslational folding mechanism inside the ER, but requires significant rearrangements after translation for adoption of the full native fold.

Cotranslational folding was also discussed by several others including the following: Patricia Clark is now using the YKB system of alternately folding fluorescent protein to find new translation stalling sequences; Anais Cassaignau described NMR experiments to show the interactions taking place between nascent chains and the ribosome at different stalled positions during translation; and Daniel Nissley presented a model to predict a shift in folding mechanism from post-translational to cotranslational due to specific designed synonymous codon changes, which agreed very well with experimental data.

To look more deeply into the evolution of folding mechanisms and protein stability, Susan Marqusee presented a study of the kinetics of folding of RNases, comparing the properties of inferred ancestral sequences to a present day thermophile and mesophilic E. coli. A number of reconstructed sequences were expressed, and it was found that moving along either evolutionary branch from the ancestor to modern day, folding and unfolding rates had both decreased, but the same three-state folding pathway via an intermediate is conserved for all ancestors. However, the energy transition between the intermediate and the unfolded state has evolved in opposite directions even while the kinetic stability remains similar. This has led to the greater thermodynamic stability seen in the modern day thermophile compared to the mesophile at higher temperatures and concentrations of denaturant.

Panel C shows that kinetic stability (low unfolding rate) seems to be selected for in both environments. Panel D shows that the thermodynamic stability of the intermediate (compared to the unfolded state) accounts for the differences in thermodynamic stability of the native state, when compared to the common ancestor (0,0). Link to paper

There were plenty of talks discussing the problems and mechanisms of protein aggregation, with two focussing on light chain amyloidosis. Marina Ramirez-Alvarado was investigating how fibrils begin to grow and showed using microscopy that both soluble light chains and fibrils (more slowly) are internalised by heart muscle cells. They can then be exposed at the cell surface and become a seed to recruit other soluble light chains to form fibrils. Shannon Esswein presented work on the enhancement of VL-VL dimerisation to prevent amyloid formation. The variable domain of the light chain (VL) can pair with itself in a similar orientation to its pairing with VH domains in normal antibodies, or in a non-canonical orientation. Adding disulphide bonds to stabilise these dimers prevented fibril formation, therefore they carried out a small scale screen of 27 aromatic and hydrophobic ligands to find those which would favour dimer formation by binding at the interface. Sulfasalazine was detected in this screen and was also shown to significantly reduce fibril formation and could therefore be used as a template for future drug design.

A ligand stabilises the dimer therefore fewer light chains are present as monomers, slowing the rate of the only route by which fibrils can be formed. Link to paper

Among the posters, Alan Perez-Rathke presented loop modelling by DiSGro in beta barrel membrane proteins, which showed that the population of structures generated and scored favourably after relaxation at pH 7 led to an open pore more often than at pH 5, consistent with experimental observations. There were two posters on the topic of prediction of membrane protein expression in bacteria and yeast presented by students of Bill Clemons, who also gave a great talk. Shyam Saladi has carefully curated datasets of successes and failures in expression in E. coli and trained a linear SVM on features such as RNA secondary structure and transmembrane segment hydrophobicity to predict the outcome for unknown proteins. This simple approach (preprint available here) achieved an area under the ROC curve of around 0.6 on a separate test set, and using more complex machine learning techniques is likely to improve this. Samuel Schulte is adapting the same method for prediction of expression in yeast.

Overall, it was a great conference and it was nice to hear about plenty of experimental work alongside the more familiar computational work. I would also highly recommend New Orleans as an excellent place to find great food, jazz and sunshine!

Using Antibody Next Generation Sequencing data to aid antibody engineering

I consider myself a wet lab scientist, and I had not used a dynamic programming language such as Python before starting my DPhil. My main interests lie in the development of improved antibody humanization campaigns, rational construction of antibody phage display libraries, and antibody evolution. Having completed an industrial placement at MedImmune, I saw the biotechnology industry from the inside and realized that scientists who can bridge the computer science and wet lab fields are in high demand.

The title of my DPhil is very broad, and the research itself is data driven rather than hypothesis driven. Our research group collaborates with UCB Pharma, which has sequenced whole antibody repertoires across a number of species. Datasets might contain more than 10 million sequences of heavy and light variable chains. But even these datasets do not cover more than 1% of the theoretical repertoire, hence looking at entropies of sequences rather than mere sequences could provide insights into differences between intra- and inter-species datasets.
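One simple way to look at entropy rather than raw sequences is to compute the Shannon entropy at each position of a set of equal-length, consistently numbered sequences. The sketch below is only an illustration of that idea, not the analysis used in the project; the toy CDR-like sequences and the function name are made up.

```python
from collections import Counter
from math import log2


def positional_entropy(sequences):
    """Shannon entropy (in bits) at each column of aligned sequences.

    All sequences must have the same length, e.g. after renumbering
    to a common scheme.
    """
    entropies = []
    for column in zip(*sequences):
        counts = Counter(column)
        total = len(column)
        # H = -sum(p * log2(p)) over the residues seen at this position.
        entropies.append(-sum((n / total) * log2(n / total)
                              for n in counts.values()))
    return entropies


# Toy aligned sequences: position 1 is conserved, position 3 is variable.
toy = ["ARDY", "ARGY", "ARSY", "AKTY"]
print([round(h, 3) for h in positional_entropy(toy)])
```

Conserved positions give an entropy of zero, while a position where all four toy sequences differ gives the maximum of 2 bits for four sequences.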

NGS of antibody repertoires provides snapshots of repertoire diversity and entropy as well as sequences. Reddy, S.T. et al 2010 showed that this information could be successfully used to pull out target-specific variable chains. But most research groups believe that the main application of NGS is immunodiagnostics (Greiff et al., 2015).

My project involves applying software developed by our research group, namely ANARCI (Dunbar J and Deane CM., 2016) and ABodyBuilder (Leem J. et al 2016). Combining the two tools allows analysis of NGS datasets at an unprecedented rate (1 million sequences per 7 hours). A number of manipulations can be performed on datasets to standardize them and make the data reproducible, which is a big issue in science. It is possible to re-assign germlines, numbering schemes and complementarity-determining region (CDR) definitions of a 10 million sequence dataset in less than a day. For instance, the data provided by UCB required the variable chains to be re-numbered according to the IMGT numbering and CDR definition (Lefranc M., 2011). The reason for selecting the IMGT numbering scheme is that it supports symmetrical amino acid numbering of CDRs, which allows for improved assignment of positions to amino acids that are located in the same structural space between CDRs of different lengths (Figure 1).

Figure 1. IMGT numbering and CDR definition of CDR3. Symmetrical assignment of positions to amino acids in HCDR3 allows for better localization of V, D and J genes: the V gene encodes the amino terminus, the J gene encodes the carboxyl terminus of CDR3, and the D gene the mid portion.

To sum up, analysing CDR lengths and CDR and framework amino acid compositions, and finding novel patterns in antibody repertoires, will open up new rational steps in antibody humanization and affinity maturation. The key step will be to determine the amino acid scaffolds that define the humanness of an antibody or, in other words, scaffolds that are not immunogenic in humans.

References:

  1. Dunbar J., and Deane CM., ANARCI: Antigen receptor numbering and receptor classification. Bioinformatics (2016)
  2. Greiff V., A bioinformatic framework for immune repertoire diversity profiling enables detection of immunological status. Genome Medicine (2015)
  3. Leem J., et al. ABodyBuilder: automated antibody structure prediction with data-driven accuracy estimation. mAbs. (2016)
  4. Lefranc M., IMGT, the International ImMunoGeneTics Information System. Cold Spring Harb Protoc. (2011)
  5. Reddy ST., et al. Monoclonal antibodies isolated without screening by analyzing the variable-gene repertoire of plasma cells. Nat Biotech. (2010)

Multiomics data analysis

Cells are the basic functional and structural units of living organisms. They are the location of many different biological processes, which can be probed by various biological techniques. Until recently such data sets have been analysed separately. The aim is to better understand the underlying biological processes and how they influence each other. Therefore techniques that integrate the data from different sources might be applicable [1].

In the image below you see the four main entities that are active throughout the cell: Genome, RNA, proteins, and metabolites. All of them are in constant interaction, for example, some proteins are transcription factors and influence the transcription of DNA into RNA. Metabolites that are present in the cell also influence the activity of proteins as ligands but at the same time are altered through enzymatic activity. This ambiguity of interactions makes it clear that probing the system at a single level gives only limited insight into the structure and function of the cellular processes.

 

[Figure: multiomics schematic]

The different levels of biological information (genome, proteome, …) work mutually and influence each other through processes such as transcription regulation through transcription factors. All levels are influenced by external factors, such as drug treatment or nutrient availability. Multiomics is the measurement of several of those populations and their integrated analysis.

In recent years, different ways to integrate such data have been developed. Broadly speaking there are three levels of data integration: conceptual integration, statistical integration, and model-based integration [2]. Conceptual integration means that the data sets are analysed separately and the conclusions are compared and integrated. This method can easily use already existing analysis pipelines but the way in which conclusions are compared and integrated is non-trivial. Statistical integration combines data sets and analyses them jointly, reaching conclusions that match all data and potentially finding signals that are not observable with the conceptual approach. Model-based integration means analysing the data jointly by training a model, which itself might incorporate prior beliefs about the system.

[1] Gehlenborg, Nils, Seán I. O’donoghue, Nitin S. Baliga, Alexander Goesmann, Matthew A. Hibbs, Hiroaki Kitano, Oliver Kohlbacher et al. “Visualization of omics data for systems biology.” Nature methods 7 (2010): S56-S68.

[2] Cavill, Rachel, Danyel Jennen, Jos Kleinjans, and Jacob Jan Briedé. “Transcriptomic and metabolomic data integration.” Briefings in bioinformatics 17, no. 5 (2016): 891-901.

Prions

The most recent paper presented to the OPIG journal club was from PLOS Pathogens: “The Structural Architecture of an Infectious Mammalian Prion Using Electron Cryomicroscopy”. But prior to that, I presented a bit of background on prions in general.

In the 1960s, work was being undertaken by Tikvah Alper and John Stanley Griffith on the nature of a transmissible infection which caused scrapie in sheep. They were interested in how studies of the infection showed it was somehow resistant to ionizing radiation. Infectious elements such as bacteria or viruses were normally destroyed by radiation with the amount of radiation required having a relationship with the size of the infectious particle. However, the infection caused by the scrapie agent appeared to be too small to be caused by even a virus.

In 1982, Stanley Prusiner successfully purified the infectious agent, discovering that it consisted of a protein. “Because the novel properties of the scrapie agent distinguish it from viruses, plasmids, and viroids, a new term “prion” was proposed to denote a small proteinaceous infectious particle which is resistant to inactivation by most procedures that modify nucleic acids.”
Prusiner’s discovery led to him being awarded the Nobel Prize in 1997.

Whilst there are many different forms of infectious agent, such as parasites, bacteria, fungi and viruses, all of these have a genome. Prions, on the other hand, are just proteins. They come in two forms: the naturally occurring cellular form (PrPC) and the infectious form (PrPSc, where Sc refers to scrapie). Through an as yet unknown mechanism, PrPSc prions are able to reproduce by forcing benign PrPC forms into the wrong conformation. It’s believed that this conformational change causes the following diseases.

  • Bovine Spongiform encephalopathy (mad cow disease)
  • Scrapie in:
    • Sheep
    • Goats
  • Chronic wasting disease in:
    • Deer
    • Elk
    • Moose
    • Reindeer
  • Ostrich spongiform encephalopathy
  • Transmissible mink encephalopathy
  • Feline spongiform encephalopathy
  • Exotic ungulate encephalopathy
    • Nyala
    • Oryx
    • Greater Kudu
  • Creutzfeldt-Jakob disease in humans

Whilst it’s commonly accepted that prions are the cause of the above diseases, there’s still debate over whether the fibrils which are formed when prions misfold are the cause of the disease or are caused by it. Due to the nature of prions, attempting to cure these diseases proves extremely difficult. PrPSc is extremely stable and resistant to denaturation by most chemical and physical agents. “Prions have been shown to retain infectivity even following incineration or after being subjected to high autoclave temperatures”. It is thought that chronic wasting disease is normally transmitted through the saliva and faeces of infected animals. However, it has been proposed that grass plants bind, retain, take up and transport infectious prions, which persist in the environment and cause animals consuming the plants to become infected.

It’s not all doom and gloom, however: lichens may long have had a way to degrade prion fibrils. Not just one way, either; because it’s apparently no big thing to them, they have done so twice. Tests on three different lichen species (Lobaria pulmonaria, Cladonia rangiferina and Parmelia sulcata) indicated at least two logs of reduction (a 100-fold decrease), including reduction “following exposure to freshly-collected P. sulcata or an aqueous extract of the lichen”. This has the potential to inactivate the infectious particles persisting in the landscape, or to be a source of agents that degrade prions.

Using RDKit to load ligand SDFs into Pandas DataFrames

If you have downloaded lots of ligand SDF files from the PDB, then a good way of viewing and comparing all their properties is to load them into a Pandas DataFrame.

RDKit has a very handy function just for this, found in the PandasTools module.

I show an example below within a Jupyter notebook, in which I load in the SDF file, view the table of molecules and apply other RDKit functions to the molecules.

First import the PandasTools module:

from rdkit.Chem import PandasTools

Read in the SDF file:

SDFFile = "./Ligands_noHydrogens_noMissing_59_Instances.sdf"
BRDLigs = PandasTools.LoadSDF(SDFFile)

You can see the whole table by calling the dataframe:

BRDLigs

The ligand properties in the SDF file are stored as columns. You can view what these properties are, and in my case I have loaded 59 ligands each having up to 26 properties:

BRDLigs.info()

It is also very easy to perform other RDKit functions on the dataframe. For instance, I noticed there is no heavy atom column, so I added my own called ‘NumHeavyAtoms’:

# Count heavy (non-hydrogen) atoms for each RDKit molecule in the ROMol column
BRDLigs['NumHeavyAtoms'] = BRDLigs['ROMol'].apply(lambda mol: mol.GetNumHeavyAtoms())

Here is the column added to the table, alongside columns containing the molecules’ SMILES and RDKit molecule:

BRDLigs[['NumHeavyAtoms','SMILES','ROMol']]

Confidence (scores) in STRING

There are many techniques for inferring protein interactions (be it physical binding or functional associations), and each one has its own quirks: applicability, biases, false positives, false negatives, etc. This means that the protein interaction networks we work with don’t map perfectly to the biological processes they attempt to capture, but are instead noisy observations.

The STRING database tries to quantify this uncertainty by assigning scores to proposed protein interactions based on the nature and quality of the supporting evidence. STRING contains functional protein associations derived from in-house predictions and homology transfers, as well as taken from a number of externally maintained databases. Each of these interactions is assigned a score between zero and one, which is (meant to be) the probability that the interaction really exists given the available evidence.

Throughout my short research project with OPIG last year I worked with STRING data for Borrelia hermsii, a relatively small network of scored interactions across 815 proteins. I was working with v.10.0, the latest available database release, but also had the chance to compare this to v.9.1 data. I expected that with data from new experiments and improved scoring methodologies available, the more recent network would be more or less a re-scored superset of the older. Even if some low-scored interactions weren’t carried across the update, I didn’t expect these to be any significant proportion of the data. Interestingly enough, this was not the case.

Out of 31 264 scored protein-protein interactions in v.9.1. there were 10 478, i.e. almost exactly a third of the whole dataset, which didn’t make it across the update to v.10.0. The lost interactions don’t seem to have very much in common either — they come from a range of data sources and don’t appear to be located within the same region of the network. The update also includes 21 192 previously unrecorded interactions.
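The comparison between releases boils down to set operations on undirected edges. Here is a minimal sketch of that bookkeeping; the protein identifiers and scores are invented, and the input is assumed to be a list of (protein A, protein B, score) rows rather than the actual STRING file format.

```python
def load_edges(rows):
    """Map each undirected protein pair to its score.

    Using a frozenset as the key makes (A, B) and (B, A) the same edge.
    """
    return {frozenset((a, b)): score for a, b, score in rows}


def compare_releases(old, new):
    """Return the edges lost, gained and kept between two releases."""
    lost = old.keys() - new.keys()
    gained = new.keys() - old.keys()
    kept = old.keys() & new.keys()
    return lost, gained, kept


# Hypothetical toy networks standing in for v.9.1 and v.10.0.
v9 = load_edges([("P1", "P2", 0.90), ("P2", "P3", 0.20), ("P3", "P4", 0.16)])
v10 = load_edges([("P2", "P1", 0.95), ("P1", "P4", 0.40)])
lost, gained, kept = compare_releases(v9, v10)
print(len(lost), len(gained), len(kept))

# The counts quoted in the post give the fraction of v.9.1 edges lost.
print(round(10478 / 31264, 3))
```

Keeping the scores in the dictionaries also makes it easy to pull out the score distribution of the lost edges for the density plots that follow.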

[Figure: density comparison]

Gaussian kernel density estimates for the score distribution of interactions across the entire v.9.1 Borrelia hermsii dataset (navy) and across the discarded proportion of the dataset (dark red). Proportionally more low-scored interactions have been discarded.
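For readers unfamiliar with kernel density estimates, the sketch below shows the idea behind such a plot: every observed score contributes a small Gaussian bump, and the bumps are averaged. The scores and the bandwidth are made up, and this is not necessarily how the figures in this post were produced.

```python
from math import exp, pi, sqrt


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of `samples`.

    Each sample contributes a Gaussian of standard deviation `bandwidth`.
    """
    norm = len(samples) * bandwidth * sqrt(2 * pi)

    def density(x):
        return sum(exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

    return density


# Hypothetical interaction scores: all edges versus the discarded subset.
all_scores = [0.16, 0.18, 0.22, 0.35, 0.51, 0.74, 0.90]
discarded = [0.16, 0.18, 0.22, 0.35]
kde_all = gaussian_kde(all_scores, bandwidth=0.1)
kde_discarded = gaussian_kde(discarded, bandwidth=0.1)
for x in (0.2, 0.5, 0.8):
    print(x, round(kde_all(x), 3), round(kde_discarded(x), 3))
```

The choice of bandwidth controls how smooth the curve looks, so two density plots are only directly comparable when they use the same bandwidth.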

Repeating the comparison with baker’s yeast (Saccharomyces cerevisiae), a much more extensively studied organism, shows this isn’t a one-off case either. The yeast network is much larger (777 589 scored interactions across 6400 proteins in STRING v.9.1.), and the changes introduced by v.10.0. appear to be scaled accordingly — 237 427 yeast interactions were omitted in the update, and 399 836 new ones were added.

[Figure: discarded yeast interactions]

Kernel density estimates for the score distribution for yeast in STRING v.9.1. While the overall (navy) and discarded (dark red) score distributions differ from the ones for Borrelia hermsii above, a similar trend of omitting more low-scored edges is observed.

So what causes over 30% of the scored interactions in the database to disappear into thin air? At least in part this may have to do with thresholding and small changes to the scoring procedure. STRING truncates reported interactions to those with a score above 0.15. Estimating how many low-scored interactions have been lost from the original dataset in this way is difficult, but the wide coverage of gene co-expression data would suggest that they’re a far from negligible proportion of the scored networks. The changes to the co-expression scoring pipeline in the latest release [1], coupled with the relative abundance of co-expression data, could easily have shifted scores close to 0.15 to the other side of the threshold, and therefore might explain some of the dramatic difference.

However, this still doesn’t account for changes introduced in other channels, or for interactions which have non-overlapping types of supporting evidence recorded in the two database versions. Moreover, thresholding at 0.15 adds a layer of uncertainty to the dataset — there is no way to distinguish between interactions where there is very weak evidence (i.e. score below 0.15), pairs of proteins that can be safely assumed not to interact (i.e. a “true” score of 0), and pairs of proteins for which there is simply no data available. While very weak evidence might not be of much use when studying a small part of the network, it may have consequences on a larger scale: even if only a very small fraction of these interactions are true, they might be indicative of robustness in the network, which can’t be otherwise detected.

In conclusion, STRING is a valuable resource of protein interaction data but one ought to take the reported scores with a grain of salt if one is to take a stochastic approach to protein interaction networks. Perhaps if scoring pipelines were documented in a way that made them reproducible and if the data wasn’t thresholded, we would be able to study the uncertainty in protein interaction networks with a bit more confidence.

References:

[1] Szklarczyk, Damian, et al. “STRING v10: protein–protein interaction networks, integrated over the tree of life.” Nucleic acids research (2014): gku1003