Predicted protein contacts: is it the solution to (de novo) protein structure prediction?

So what is this buzz I hear about predicted protein contacts? Is it really the long awaited solution for one of the biggest open problems in biology today? Has protein structure prediction been solved?

Well, first things first. Let me give you a quick introduction to this predicted protein contact business (probably not quick enough for an elevator pitch, but hopefully you are not reading this in an elevator).

Nowadays, the scientific community has become very good at sequencing things (and by things I mean genetic things, like whole genomes of a bunch of different people and organisms). We are so good at it that mountains of sequence data are now available: genes, mRNAs, protein sequences. The question is what do we do with all this data?

Good scientists are coming up with new and creative ideas to extract knowledge from these mountains of data. For instance, one can build multiple sequence alignments using protein sequences for a given protein family. One of the ways in which information can be extracted from these multiple sequence alignments is by identifying extremely conserved columns (think of the alignment as a big matrix). Residues in these conserved positions are good candidates for being functionally important for the proteins in that particular family.

Another interesting thing that can be done is to look for pairs of residues that are mutating in a correlated fashion. In more practical terms, you are ascertaining how correlated is the information between two columns of a multiple sequence alignment; how often a change in one of them is countered by a change in the other. Why would anyone care about that? Simple. There is an assumption that residues that mutate in a correlated fashion are co-evolving. In other words, they share some sort of functional dependence (i.e. spatial proximity) that is under selective pressure.

Ok, that was a lot of hypotheticals, does it work? For many years, it didn’t. There were lots of issues with the way these correlations were computed and one of the biggest problems was to identify (and correct for) transitivity. Transitivity is the idea that you observe a false correlation between residues A and C because residues A,B and residues B,C are mutating in a correlated fashion. AS more powerful statistical methods were developed (borrowing some ideas from mechanical statistics), the transitivity issue has seemingly been solved.

The newest methods that detect co-evolving residues in a multiple sequence alignment are capable of detecting protein contacts with high precision. In this context, a contact is defined as two residues that are close together in a protein structure. How close?  Their C-betas must be 8 Angstroms or less apart. When sufficient sequence information is available (at least 500 sequences in the MSA), the average precision of the predicted contacts can reach 80%.

This is a powerful way of converting sequence information into distance constraints, which can be used for protein structure modelling. If a sufficient number of correct distance constraints is used, we can accurately predict the topology of a protein [1]. Recently, we have also observed great advances in the way that models are refined (that is, refining a model that contains the correct topology to atomic, near-experimental resolution). If you put those two things together, we start to look at a very nice picture.

So what’s the catch? The catch was there. Very subtle. “When sufficient sequence information is available”. Currently, there is an estimate that only 15% of the de novo protein structure prediction cases present sufficient sequence information for the prediction of protein contacts. One potential solution would be to sit and wait for more and more sequences to be obtained. Yet a potential pitfall of sitting and waiting is that there is no guarantee that we will have sufficient sequence information for a large number of protein families, as they may as well present less than 500 members.

Furthermore, scientists are not very good at sitting around and waiting. They need to keep themselves busy. There are many things that the community as whole can invest time on while we wait for more sequences to be generated. For instance, we want to be sure that, for the cases where there is a sufficient number of sequences, that we get the modelling step right (and predict the accurate protein topology). Predicted contacts also show potential as a tool for quality assessment and may prove to be a nice way of ascertaining whether you have confidence that a model with correct topology was created. More than that, model refinement still needs to improve if we want to make sure that we get from the correct topology to near-experimental resolution.

Protein structure prediction is a hard problem and with so much room for improvement, we still have a long way to go. Yet, this predicted contact business is a huge step in the right direction. Maybe, it won’t be long before models generated ab initio are considered as reliable as the ones generated using a template. Who knows what promised the future holds.

References:

[1] Kim DE, Dimaio F, Yu-Ruei Wang R, Song Y, Baker D. One contact for every twelve residues allows robust and accurate topology-level protein structure modeling. Proteins. 2014 Feb;82 Suppl 2:208-18. doi: 10.1002/prot.24374. Epub 2013 Sep 10.

 

 

 

Modelling antibodies, from Sequence, to Structure…

Antibody modelling has come a long way in the past 5 years. The Antibody Modelling Assessment (AMA) competitions (effectively an antibody version of CASP) have shown that most antibody design methods are capable of modelling the antibody variable fragment (Fv) at ≤ 1.5Å. Despite this feat, AMA-II provided two important lessons:

1. We can still improve our modelling of the framework region and the canonical CDRs.

Stage two of the AMA-II competition showed that CDR-H3 modelling improves once the correct crystal structure was provided (bar the H3 loop, of course). In addition, some of the canonical CDRs (e.g. L1) were modelled poorly, and some of the framework loops had also been poorly modelled.

2. We can’t treat orientation as if it doesn’t exist.

Many pipelines are either vague about how they predict the orientation, or have no explicit explanation on how the orientation will be predicted for the model structure. Given how important the orientation can be toward the antibody’s binding mode (Fera et al., 2014), it’s clear that this part of the pipeline has to be re-visited more carefully.

In addition to these lessons, one question remains:

What do we do with these models?

No pipeline, as far as we are aware, have no comments on what we should do beyond creating the model from a pipeline. What are its implications? Can we even use it for experiments, and use it as a potential therapeutic in the long-term? In light of these lessons and this blaring question, we developed our own method.

Before we begin, how does modelling work?

In my mind, most, if not all, pipelines follow this generic paradigm:pipeline2

Our method, ABodyBuilder, also follows this 4-step workflow;

  1. We choose the template structure based on sequence identity; below a threshold, we predict the structure of the heavy and light chains separately
  2. In the event that we use the structures from separate antibodies, we predict the orientation from the structure with the highest global sequence identity.
  3. We model the loops using FREAD (Choi, Deane, 2011)
  4. Graft the side chains in using SCWRL.

Following the modelling procedure, our method also annotates the accuracy of the model in a probabilistic context — i.e., an estimated probability that a particular region is modelled at a given RMSD threshold. Moreover, we also flag up any issues that an experimentalist can run into should they ever decide to model the antibody.

The accuracy estimation is a data-driven estimation of model quality. Many pipelines end up giving you just a model – but there’s no way of determining model accuracy until the native structure is determined. This is particularly problematic for CDRH3 where RMSDs can reach up to >4.0A between models and native structures, and it would be incredibly useful to have an a priori, expected estimation of model accuracy.

Furthermore, by commenting on motifs that can conflict with antibody development, we aim to offer a convenient solution for users when they are considering in vitro experiments with their target antibody. Ultimately, ABodyBuilder is designed with the user in mind, making an easy-to-use, informative software that facilitates antibody modelling for novel applications.

Le tour d’OPIG 2015

The third iteration of “Let’s use two wheels to transport us to many pubs” took place earlier this summer, on Wednesday 20th May. Following on from the great successes of the last two years, there was much anticipation, and the promise of a brand new route. This year we covered 8 miles, via the Chester, the King’s Arms at Sandford lock, the Prince of Wales in Iffley, and the Magdalen Arms. Nobody fell in the river or went hungry, so it was considered a success!

2015 route

IMG_20150520_210247519

IMG_20150520_181800722

IMG_20150520_190103582

IMG_20150520_190137098

 

 

 

 

 

 

IMG_20150520_203747651

IMG_20150520_210044077

Using B factors to assess flexibility

In my work of analysing antibody loops I have reached the point where I was interested in flexibility, more specifically challenging the somewhat popular belief that they have a high flexibility, especially the H3 loop. I wanted to use for this the B/Temperature/Debye-Waller factor which can be interpreted as a measure of the temperature dependent vibration of the atoms in the crystal, or in more gentler terms the flexibility at a certain position. I was keen to use the backbone atoms, and possibly the Cβ, but unfortunately the B factor shows some biases as it is used to mask other uncertainties due to high resolution, low electron density and as a result poor modelling. If we take a non redundant set of loops and split them in resolution shells of 0.2A we see how pronounced this bias is (Fig. 1 (a)).

b_factor_vs_res

Fig. 1(a) Comparison of average backbone B factors for loops found in structures at increasing resolution. A clear bias can be observed that correlates with the increase in resolution.

norm_b_factor_vs_res

Fig. 1(b) Normalization using the average the Z-score of the B factor of backbone atoms shows no bias at different resolution shells.

Comparing loops in neighbouring shells is virtually uninformative, and can lead to quite interesting results. In one analysis it came up that loops that are directly present in the binding site of antibodies have a higher average B factor than loops in structures without antigen where the movement is less constrained.

The issue here is that a complex structure (antibody-antigen) is larger, and has a poorer resolution, and therefore more biased B factors. To solve this issue I decided to normalize the B factors using the Z-score of the PDB file, where the mean and the standard deviation are computed from all the backbone atoms of amino acids inside the PDB file. This method to my knowledge was first described by (Parthasarathy and Murthy, 1997) [1] , although I came to the result without reading their paper, the normalization being quite intuitive. Using this measure we can finally compare loops from different structures at different resolutions (Fig. 1 (b)) with each other and we see what is expected: loops found in bound structures are less flexible than loops in unbound structures (Fig. 2). We can also answer our original question: does the H3 loop present an increased flexibility? The answer from Fig, 2 is no, if we compare a non-redundant sets of loops from antibodies to general proteins.

norm_b_factor_plot_heavy

Fig. 2 Flexibility comparison using the normalized B factor between a non-redundant set of non-IG like protein loops and different sets of H3 loops: bound to antigen (H3 bound), unbound (H3 unbound), both (H3). For each comparison ten samples with same number of examples and similar length distribution have been generated  and amassed (LMS) to correct for the possibility of length bias induced by the H3 loop which is known to have a propensity for longer loops than average.

References

[1] Parthasarathy, S. ; Murthy, M. R. N. (1997) Analysis of temperature factor distribution in high-resolution protein structures Protein Science, 6 (12). pp. 2561-2567. ISSN 0961-8368

Journal Club: Spontaneous transmembrane helix insertion thermodynamically mimics translocon-guided insertion

Many methods are available for prediction of topology of transmembrane helices, this being one of the success stories of protein structure prediction with accuracies over 90%. However, there are still areas where there is disagreement in some areas about the partitioning between the states of dissolved in water and positioned across a lipid bilayer. Complications arise because there are so many methods of measuring the thermodynamics of this transition – experimental and theoretical, in vivo and in vitro. It is uncertain what difference the translocon makes to the energetics of insertion – is the topology and conformation of a membrane protein the global thermodynamic minimum or just a kinetic product?

This paper uses three approaches to measure partitioning to test the agreement between different methods. The authors aim to reconcile differences calculated so far for insertion of an arginine residue into the membrane (ranging from +2 to +15 kcal/mol). This is an important question, because many transmembrane helices are only marginally hydrophobic and it is not known how and when they insert in the folding process. Arginine is chosen here because the pKa of 12.5 of the side chain is very high so it will not deprotonate in the centre of a bilayer and complications of protonation and deprotonation do not need to be considered. The same peptide is used for each method, of the form LnRLn, and the ratio between the interface and transmembrane states is used to calculate estimates of ΔG. In order to make sure that there were helices with a ΔG close to zero for accurate estimates, they used a range of values of n from 5-8.

The first method was an insertion assay using reconstituted microsomes, where this helix was inserted into the luminal domain of LepB. A glycosylation site was added at each end of the helix, but glycosylation takes place only on sites inside microsomes. Helices inserted into the membrane are only glycosylated once, whereas secreted helices are glycosylated twice and those which did not go through the translocon are not glycosylated. SDS-PAGE can separate these states by mass, and the ratio between single and double glycosylation gives the partitioning between inserted and interface helices out of those which entered the translocon. As expected, the trend is for longer helices with more leucine to favour the transmembrane state.

 Adapted from Figure 4a: The helix, H, either passes through the translocon into the lumen ("S") resulting in two glycosylations (green pentagons), or is inserted (TM) resulting in one glycosylation.

Adapted from Figure 4a: The helix, H, either passes through the translocon into the lumen (“S”) resulting in two glycosylations (green pentagons), or is inserted (TM) resulting in one glycosylation.

The second method was also experimental: oriented synchrotron radiation circular dichroism (ORSCD). Here they used just the peptide with one glycine at each end, as this would be able to equilibrate between the two states quickly. Theoretical spectra can be calculated for a helix , and therefore the ratio in which they must be combined to give the measured spectrum for a given peptide gives the ratio of transmembrane and interface states present.

Figure 2b: TM and IP are the theoretical spectra for the transmembrane and interface states, and the peptides fall somewhere in between.

Figure 2b: TM and IP are the theoretical spectra for the transmembrane and interface states, and the peptides fall somewhere in between.

Finally, the authors present 4 μs molecular dynamics simulations of the same peptides at 140°C, so that equilibration between the two states would be fast. The extended peptide at the start of the simulation quickly associates with the membrane and adopts a helical conformation. An important observation to note is that the transmembrane state is in fact at around 30° to the membrane normal, to allow the charged guanidinium group of the arginine to “snorkel” up to interact with charged phosphate groups of the lipids. Therefore this state is defined as transmembrane, in contrast to the OSRCD experiments where the theoretical TM spectrum was calculated for a perpendicular helix. This may be a source of some inaccuracy in the propensities calculated from OSRCD.

Figure 2c: Equilibration in the simulation for the L<sub>7</sub>RL<sub>7</sub> peptide. Transmembrane and interface states are seen in the partitioning and equilibration phases after the helix has formed.

Figure 2c: Equilibration in the simulation for the L7RL7 peptide. Transmembrane and interface states are seen in the partitioning and equilibration phases after the helix has formed.

Figure 3c: As the simulations run, the proportion of helices in the transmembrane state (PTM) converges to a different value for each peptide.

Figure 3c: As the simulations run, the proportion of helices in the transmembrane state (PTM) converges to a different value for each peptide.

Overall, the ΔG calculated experimental and molecular dynamics (MD) simulations agree very well. In fact, they agree better than those from previous studies of a similar format looking at polyleucine helices, where there was a consistent offset of 2 kcal/mol between the experiment and simulation derived values. The authors are unable to explain why the agreement for this study is better, but they indicate that it is unlikely to be related to any stabilisation by dimerisation in the experimental results, as a 4 μs MD simulation of two helices did not show them forming stable interactions. The calculated difference in insertion energy (ΔΔG) on replacing a leucine with argnine is therefore calculated to be +2.4-4.3 kcal/mol by experiment and +5.4-6.8 by simulation, depending on the length of the peptide (it is a more costly substitution for longer peptides as the charge is buried deeper). The difference between the experimental and simulation results is accounted for by their disagreement in the polyleucine study.

We thought this paper was a great example of experimental design, where the system was carefully chosen so that different experimental and theoretical approaches would be directly comparable. The outcome is good agreement between the methods, demonstrating that the vastly different values recorded previously seem to be because very different questions were being asked.

Protein loops – why do we care?

In my DPhil research, I work on the development of new methods for predicting protein loop structures. But what exactly are loops, and why should we care about their structures?

Many residues in a given protein will form regions of regular structure, in α-helices and β-sheets. The segments of the protein that join these secondary structure elements together, that do not have easily observable regular patterns in their structure, are referred to as loops. This does not mean, though, that loops are only a minor component of a protein structure – on average, half of the residues in a protein are found in loops [1], and they are typically found on the surface of the protein, which is largely responsible for its shape, dynamics and physiochemical properties [2].

Connecting different secondary structures together is often not the only purpose of a loop – they are often vitally important to a protein’s function. For example, they are known to play a role in protein-protein interactions, recognition sites, signalling cascades, ligand binding, DNA binding, and enzyme catalysis [3].

As regular readers of the blog are probably aware by now, one of the main areas of research for our group is antibodies. Loops are vital for an antibody’s function, since its ability to bind to an antigen is mainly determined by six hypervariable loops (the complementarity determining regions). The huge diversity in structure displayed by these loops is the key to how antibodies can bind to such different substances. Knowledge of loop structures is therefore extremely useful, enabling predictions to be made about the protein.

Loops involved in protein function: a methyltransferase binding to DNA (top left, PDB 1MHT); the active site of a triosephosphate isomerase enzyme (bottom left, PDB 1NEY); an antibody binding to its antigen (blue, surface representation) via its complementarity determining regions, shown as the coloured loops (centre, PDB 3NPS); the activation loop of a tyrosine kinase has a different conformation in the active (pink) and inactive (blue) forms (top right, PDBs 1IRK and 1IR3); a zinc finger, where the zinc ion is coordinated by the sidechain atoms of a loop (bottom right, PDB 4YH8).

Loops involved in protein function: a methyltransferase binding to DNA (top left, PDB 1MHT); the active site of a triosephosphate isomerase enzyme (bottom left, PDB 1NEY); an antibody binding to its antigen (blue, surface representation) via its complementarity determining regions, shown as the coloured loops (centre, PDB 3NPS); the activation loop of a tyrosine kinase has a different conformation in the active (pink) and inactive (blue) forms (top right, PDBs 1IRK and 1IR3); a zinc finger, where the zinc ion is coordinated by the sidechain atoms of a loop (bottom right, PDB 4YH8).

More insertions, deletions and substitutions occur in loops than in the more conserved α-helices and β-sheets [4]. This means that, for a homologous set of proteins, the loop regions are the parts that vary the most between structures. While this often makes the protein’s function possible, as in the case of antibodies, it leads to unaligned regions in a sequence alignment, standard homology modelling techniques can therefore not be used. This makes prediction of their structure difficultit is frequently the loop regions that are the least accurate parts of a protein model.

There are two types of loop modelling algorithm: knowledge-based and ab initio. Knowledge-based methods look for appropriate loop structures from a database of previously observed fragments, while ab initio methods generate possible loop structures without prior knowledge. There is some debate about with approach is the best. Knowledge-based methods can be very accurate when the target loop is close in structure to one seen before, but perform poorly when this is not the case; ab initio methods are able to access regions of the conformational space that have not been seen before, but fail to take advantage of any structural data that is available. For this reason, we are currently working on developing a new method that combines aspects of the two approaches, allowing us to take advantage of the available structural data whilst allowing us to predict novel structures.

[1] L. Regad, J. Martin, G. Nuel and A. Camproux, Mining protein loops using a structural alphabet and statistical exceptionality. BMC Bioinformatics, 2010, 11, 75.

[2] A. Fiser and A. Sali, ModLoop: automated modeling of loops in protein structures. Bioinformatics, 2003, 19, 2500-2501.

[3] J. Espadaler, E. Querol, F. X. Aviles and B. Oliva, Identification of function-associated loop motifs and application to protein function prediction. Bioinformatics, 2006, 22, 2237-2243.

[4] A. R. Panchenko and T. Madej, Structural similarity of loops in protein families: toward the understanding of protein evolution. BMC Evolutionary Biology, 2005, 5, 10.

Antibody binding site re-design

In this blog post I describe three successful studies on structure based re-design of antibody binding sites, leading to significant improvements of binding affinity.

In their study Clark et al.[1] re-designed a binding site of antibody AQC2 to improve its binding affinity to the I domain of human integrin VLA1. The authors assessed the effects of the mutations on the binding energy using the CHARMM[2,3] potential with the electrostatic and desolations energies calculated using the ICE software[4]. In total, 83 variants were identified for experimental validation, some of which included multiple mutations. The mutated antibodies were expressed in E. Coli and the affinity to the antigen was measured. The best mutant included a total of four mutations which improved the affinity by approximately one order of magnitude from 7 nM to 850 pM. The crystal structure of the best mutant was solved to further study the interaction of the mutant with the target.

Figure 1: Comparison of calculated and experimental binding free energies. (Lippow et al., 2007)

Figure 1: Comparison of calculated and experimental binding free energies. (Lippow et al., 2007)

Lippow et al.[5] studied the interactions of three antibodies – the anti-epidermal growth factor receptor drug cetuximab[6], the anti-lysozyme antibody D44.1 and the anti-lysosyme antibody D1.3 with their respective antigens. The energy calculations favoured mutations to large amino acids (such as Phe or Trp) of which most were found to be false positives. More accurate results were obtained using only the electrostatic term of the energy function. The authors improved the binding affinity of D44.1 by one order of magnitude and the affinity of centuximab by 2 orders of magnitude. The antibody D1.3 didn’t show many opportunities for electrostatic improvement and the authors suggest it might be an anomalous antibody.

Computational methods have recently been used to successfully introduce non-canonical amino acids (NCAA) into the antibody binding site. Xu et al.[7] introduced L-DOPA (L-3,4-dihydroxephenyalanine) into the CDRs of anti-protective antigen scFv antibody M18 to crosslink it with its native antigen. The authors used the program Rosetta 3.4 to create models of antibody-antigen complex with L-DOPA residues. The distance between L-DOPA and a lysine nucleophile was used as a predictor of crosslinking was. The crosslinking efficiency was quantified as a fraction of antibodies that underwent a mass change, measured using Western blot assays. The measured average efficiency of the mutants was 10% with the maximum efficiency of 52%.

[1]      Clark LA, Boriack-Sjodin PA, Eldredge J, Fitch C, Friedman B, Hanf KJM, et al. Affinity enhancement of an in vivo matured therapeutic antibody using structure-based computational design. Protein Sci 2006;15:949–60. doi:10.1110/ps.052030506.

[2]      Brooks BR, Bruccoleri RE, Olafson DJ, States DJ, Swaminathan S, Karplus M. CHARMM: A Program for Macromolecular Energy, Minimization, and Dynamics Calculations. J Comput Chem 1983;4:187–217.

[3]      MacKerel Jr. AD, Brooks III CL, Nilsson L, Roux B, Won Y, Karplus M. CHARMM: The Energy Function and Its Parameterization with an Overview of the Program. In: v. R. Schleyer et al. P, editor. vol. 1, John Wiley & Sons: Chichester; 1998, p. 271–7.

[4]      Kangas E, Tidor B. Optimizing electrostatic affinity in ligand–receptor binding: Theory, computation, and ligand properties. J Chem Phys 1998;109:7522. doi:10.1063/1.477375.

[5]      Lippow SM, Wittrup KD, Tidor B. Computational design of antibody-affinity improvement beyond in vivo maturation. Nat Biotechnol 2007;25:1171–6. doi:10.1038/nbt1336.

[6]      Sato JD, Kawamoto T, Le AD, Mendelsohn J, Polikoff J, Sato GH. Biological effects in vitro of monoclonal antibodies to human epidermal growth factor receptors. Mol Biol Med 1983;1:511–29.

[7]      Xu J, Tack D, Hughes RA, Ellington AD, Gray JJ. Structure-based non-canonical amino acid design to covalently crosslink an antibody-antigen complex. J Struct Biol 2014;185:215–22. doi:10.1016/j.jsb.2013.05.003.

GO annotations: Warning – secondary IDs!

I’m writing this post after a bit of a scare that has cost me at least 2 hours of my life so far, and possibly may cost me a lot more. This is something I wished I had read about before starting to use gene ontology annotations (GO terms), so hopefully this will reach some people that are in the position I was in a couple of years ago. The post assumes you know about the GO directed acyclic graph (DAG), which describes the relationships between GO terms, and a bit about functional similarity. Good reviews on this were written by Pesquita et al (2009) and Guzzi et al (2012)

 

Secondary GO-terms

Yesterday evening I discovered GO terms can have secondary accession numbers. For example the GO term GO:0060555 is a secondary ID for GO:0070266 which describes the biological process term “necroptotic process”. The existence of secondary IDs in itself is not particularly surprising given that ontologies like the gene ontology are at the forefront of research and thus also have to be updated with the latest knowledge. As is the case for the UniProt KB and the RefSeq databases identifiers are merged if they are found to refer to the same thing. RefSeq has a good overview for when and why this occurs in their FAQ. Keeping a record of secondary IDs in the database is important for compatibility with past applications of the ontology.

 

Why is this a problem?

This can become a problem when the reverse compatibility is not fully committed to. For example, if the downloadable annotation set for human proteins contains GO terms that are secondary IDs, while the downloadable ontology structure does not include these secondary IDs in the directed acyclic graph (DAG). This means someone may download all the annotations and check what their parent terms are in the DAG, but as these terms are not included in the DAG, they do not have a parent term set.

 

So why the scare?

It seems like this behaviour should become obvious very fast. It should either give a key error that you are looking for something that isn’t there, or an empty set should ring some alarm bells. The scare comes in when neither happens, and you just don’t notice it at all, only to find out a couple of years later…

In order to calculate the functional similarity between two proteins, I need the full GO term sets associated with every protein including ancestral terms. I have a local copy of the ontology structure which I can use to query for these ancestral terms based on the GO terms directly associated with the proteins. As I need to do this for between 6800 and 13000 proteins depending on the network, I automated the process. The problem is that MySQL returns an empty set when asking for a match on an accession number that isn’t there. Returning an empty set has the effect that terms not in the ontology structure are automatically filtered out. This came to light now that I’m looking to explicitly calculate the information content of each GO term (a metric for how prevalent a term is in a data set) and three terms could not be found in the ancestral term sets for any proteins.

 

How do I prevent this from happening to me?

As there are only few secondary ID terms associated with proteins in the downloadable association file, it is easy to manually look for mappings to the respective primary IDs in the EBI QuckGO tool. Then, before querying for ancestral term sets, just map the secondary IDs across to the newer IDs and you’re all set. It’s really not a hassle if you know it’s there. Sadly, there is only a sentence on the gene ontology site about secondary IDs and no mention whatsoever that these IDs are not incorporated in the GO-basic ontology download, which is recommended for finding ancestral term sets.

 

Are you okay now?

As I hinted at before, there are very few proteins annotated with secondary ID terms. In my case I found three secondary ID GO terms annotated to five different proteins, out of a total 6861 proteins I was checking. One protein was also annotated with the corresponding primary ID term, so there was no effect, and another protein was annotated with the only parent term of the secondary ID, which no other proteins were annotated to. Thus, the effect really boils down to three proteins with the same secondary ID. Of those three, only one is heavily affected in the worst case by its similarity score with 10 other proteins changing from 9.3 to 2.6 without the primary ID annotation (scores range from ~ 1.5 to 12). Those are 10 scores out of a total of approximately 24 000 000… I think I will survive. But I am yet to test on a further data set.

 

The effect is basically that you ignore a couple of annotations. Given that we do not have the full set of annotations anyway, this change is the same as if an experiment or two hadn’t been conducted. Furthermore, the three proteins I found to be affected had a lot of annotations, so missing one should not have the worst-case effect I investigated. Nonetheless, you may want to do it properly first time round and take all the data into account when you start out. It should save you a lot of hassle later.

 

Stay vigilant, my friends!

Ways to compare networks

Nowadays network comparison is becoming increasingly relevant. Why is this?  Mainly because it is a desirable way to compare complex systems that can often be represented as networks.

Network comparison aims to account for properties that are generated by the simultaneous interaction of all units rather than the properties of each single individual. Here are some cases where network comparison could be helpful:

–  Showing and highlighting “significant” changes on network evolution. A particular characteristic of interest could be the speed with which information flows.

– Quantifying how “close” two networks  are. Even when the networks have a different different number of nodes and edges, or in the case of spatially embedded networks, different scale.

As an example, look at the following two networks. Are the structures of these two road networks similar?

Screen Shot 2015-06-15 at 14.56.21

(Images obtained from the mac Maps app)

Or what about the structure of these two other networks?
two_geometric

One of the difficulties in comparing networks is that there is no clear way to compare networks as complete whole entities. Network comparison methods only compare certain attributes of the network, among these: density of edges, global clustering coefficient, degree distribution, counts of smaller graphs embedded in the network and others. Here are some ideas and methods that have been used as ways to compare networks.

  • Networks can be compared by their global properties and summary statistics, like network density, degree distribution, transitivity, average shortest path length and others. Usually a statistic combining the differences of several global properties was used.
  • Another way to compare networks is based on the fit of a statistical network model (eg.  ERGM) to each network. Then, the coefficients of the fitted models could be compared. Note that in this case, a good fit of the models would be required.
  • Statistics directly built for network comparison via subgraph counts. These statistics do not make any assumptions of the network generation process. For example, Netdis and GCD are two network comparison statistics that try to measure the structural difference between networks. These network comparison measures are based on counts of small subgraphs (3-5 nodes), like triangles, 2-stars, 3-stars, squares, cliques and others (see Figure below). subgraph_plot_jpeg These network comparison statistics create frequency vectors of subgraphs and then compare these frequencies between the networks to obtain an idea of the similarity of the networks relative to their subgraph counts.
  • Lastly, another way to indirectly compare networks is via network alignment methods. The objective of these methods is to create a “mapping”, f, from the node set of one network to the node set of another.  The following Figure shows two networks, light and dark blue. An alignment of the two networks is shown in red.Screen Shot 2015-06-15 at 21.43.29One of the objectives of an alignment is to maximise the number of conserved interactions, that is, if (u,v) is an edge in the first network and f is an alignment, then an edge (u,v) is conserved if (f(u),f(v)) is an edge in the second network. It can be noted that the edge demarcated by the green oval is a non-conserved interaction.
    NETAL is a commonly used network alignment method, although there are several, more recent, network alignment methodologies.

In the end, despite the variety of ways to compare networks, saying that two networks are similar, or different, is not that easy, as all methods face their own particular challenges, like networks that come from the same model but have different node size.

 

Diseases exploiting the Power of Networks

Networks are pretty amazing. If you have a bunch of nodes that sit there and do very simple tasks that you can train a dog to do, and then you connect them in the right way, you can create a programme that can get through a level of Mario! In neural networks, all the nodes are doing is getting an input, then deciding what works best based on that input, and sending that information somewhere else. It’s like teaching your dog to get help when you’re in trouble. You say “Help!”, it makes a decision where to go to find help, and then someone else is alerted to your situation. Now you link those inputs and outputs, and suddenly you can solve some very interesting problems. Like making your dog get another dog to do something, who tells another dog… okay… at some point the analogy may break down. What I want to get across, is that networks can take very simple things and create complexity by just adding a few connections between them.

Getting back to the biological theme, proteins aren’t really all that simple. From what you’ll have read on other posts in this blog there’s a lot of effort that goes into modelling them. The position of even a small peptide loop can have a big effect on how a protein functions. So if you have a network of 15 – 25 thousand proteins you can imagine that the complexity of that system is quite incredible. That system is the human body.

Looking at systems from the perspective of interactions between components creating complex behaviour, it’s really not that surprising that human diseases often function by disrupting the network of protein interactions as found by this paper recently published in Cell. Assuming there’s a human disease which has survived for generations and generations, it is likely to be quite effective in how it functions. And if you have a complex system like the human body, the easiest way to disrupt that arising behaviour called “a healthy life”, is to alter a few of the interactions. Showing this for a large group of disease-associated missense mutations is however pretty impressive and time-consuming, which is probably why there are about 50 people from over 20 different institutions in the author list.

So what exactly did they show? They showed what happens when there is a mutation in a gene that causes a disease. Such a mutation replaces an amino acid in the protein sequence encoded by the gene. The resulting protein is then somehow different. The group around Marc Vidal and Susan Lindquist showed that rather than affecting the stability of the protein, the system tends to be targeted via the interactions of the protein. What is more, they showed that different mutations of the same gene can affect different sets of interactions. Using their “edgotyping” approach it is possible to characterize the type of effect a mutation will have and possibly predict the change in phenotype (i.e. the disease state).

Edgotyping classifies the effect of disease associated mutations (Sahni et al)

Edgotyping classifies the effect of disease associated mutations (Sahni et al)

Now if the possibility of predicting how a single mutation (1 in ~300 amino acids), in a single protein (1 in ~ 25,000 proteins) affects the human body doesn’t convince you that networks are pretty powerful, I really don’t know what will. But now back to the important stuff…. how do you make a real life neural network with communicating dogs? And could you make them reach level two on Mario? Or… Super Mario Dog?