A topology-based distance measure for network data

In last week’s group meeting, I introduced our network comparison method (Netdis) and presented some new results that enable the method to be applied to larger networks.

The most tractable methods for network comparison are those that compare entire networks using statistics describing global properties, but such statistics are not sensitive enough to reconstruct phylogeny or shed light on evolutionary processes. In contrast, several network alignment based methods compare networks using the properties of the individual proteins (nodes), e.g. local network similarity and/or protein functional or sequence similarity. The aim of these methods is to identify matching proteins/nodes between networks and use these to identify exact or close sub-network matches. They are usually computationally intensive and tend to yield an alignment covering only a relatively small proportion of the network, although more recent methods have alleviated this to some extent.

Thus, we do not follow the network alignment paradigm, but instead we take our lead from alignment-free sequence comparison methods that have been used to identify evolutionary relationships. Alignment-free methods based on k-tuple counts (also called k-grams or k-words) have been applied to construct trees from sequence data. A key feature is the standardisation of the counts to separate the signal from the background noise. Inspired by alignment-free sequence comparison we use subgraph counts instead of sequence homology or functional one-to-one matches to compare networks. Our proposed method, Netdis, compares the subgraph content not of the networks themselves but instead of the ensemble of all protein neighbourhoods (ego-networks) in each network, through an averaging many-to-many approach. The comparison between these ensembles is summarised in a Netdis value, which in turn is used as input for phylogenetic tree reconstruction.
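To make the ego-network census concrete, here is a toy sketch in Python using networkx. It counts triangles in every 2-step ego-network and compares standardised ensemble averages between two networks; this illustrates the flavour of the approach, not the published Netdis statistic, and all names are illustrative.

```python
import networkx as nx
import numpy as np

def ego_triangle_counts(G, radius=2):
    """Triangle count in every ego-network of G; a toy stand-in for the
    full census of small subgraphs used by Netdis."""
    return np.array([
        sum(nx.triangles(nx.ego_graph(G, n, radius=radius)).values()) // 3
        for n in G
    ])

def toy_netdis(G1, G2):
    """Compare the ensemble-averaged, standardised subgraph counts of two
    networks. Smaller values mean more similar ego-network ensembles."""
    c1, c2 = ego_triangle_counts(G1), ego_triangle_counts(G2)
    pooled_std = np.sqrt((c1.var() + c2.var()) / 2) + 1e-9
    return abs(c1.mean() - c2.mean()) / pooled_std
```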


Fig1: Effect of sub-sampling egos on the resulting grouping of networks generated by Netdis. Higher Rand index values indicate better fit to non-sampling results.

Extensive tests on simulated and empirical data sets show that it is not necessary to analyze all possible ego-networks within a network for Netdis to work. Our results indicate that, in general, randomly sampling around 10% of the egos in each network yields on average a very similar clustering of networks to that obtained with 100% sampling (Fig 1). This result has important implications for use cases where extremely large graphs are to be compared (e.g. > 100,000 nodes). Related to ego-network sub-sampling is the notion of size-limiting the ego-networks to be analyzed by the algorithm. Our tests show that the vast majority of ego-networks in most networks cover only a small fraction of the overall network. Moreover, by introducing a lower size threshold on the egos, we observe better results on average. Together, this means only a limited range of ego-network sizes needs to be analyzed for each network, which should lead to better statistical properties, as the sub-sampling scheme is inspired by bootstrapping.
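Under the sampling and size-threshold scheme just described, ego selection might look like the following sketch (the 10% fraction and the size cut-off are the tunable parameters discussed above; the exact values used by Netdis may differ):

```python
import random
import networkx as nx

def sample_egos(G, fraction=0.1, min_size=5, seed=0):
    """Randomly sample a fraction of nodes and keep only those 2-step
    ego-networks above a lower size threshold (illustrative values)."""
    rng = random.Random(seed)
    k = max(1, int(fraction * G.number_of_nodes()))
    nodes = rng.sample(list(G.nodes), k)
    egos = (nx.ego_graph(G, n, radius=2) for n in nodes)
    return [e for e in egos if e.number_of_nodes() >= min_size]
```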

Building accurate models of membrane protein structures

Today I gave a talk on the research project I have been working on since joining the group. My research focuses on the modeling of membrane proteins (MPs). Membrane proteins are the main class of drug targets, and their mechanism of function is determined by their 3D structure. Almost 30% of the proteins in sequenced genomes are membrane proteins, yet only ~2% of the experimentally determined structures in the PDB are of membrane proteins. Computational methods have therefore been introduced to deal with this limitation.

Homology modeling is one of the best performing computational methods and gives “accurate” models of proteins. Many homology modeling methods have been developed, with Modeller being one of the best known. But these methods have been tested and customised primarily on soluble proteins, and there are major physical differences between MPs and water-soluble proteins. Therefore, to do homology modeling of membrane proteins, we need a pipeline that takes the unique membrane environment into account at every step.

Memoir is a tool for homology-based prediction of membrane protein structures (figure below). As input, Memoir takes a target sequence and a template structure. First, the lipid bilayer boundaries are detected on the template using iMembrane. Using this information, MP-T, with its membrane-specific substitution matrices, aligns the target and template. Medeller is then used to build the core model, and finally FREAD, a fragment-based loop modeling method, is used to fill in the missing loops.


Memoir Pipeline

The Memoir methodology builds accurate, but potentially incomplete, models. Homology modeling often entails a trade-off between the level of accuracy and the level of coverage that can be achieved in predicted models. We therefore aim to build Memoir 2.0, which increases coverage by modeling the missing structural information only where such a prediction is sensible. To complete the models in the best way we aim to:

1. Examine the best ways to maximise FREAD coverage while maintaining accuracy.
2. Examine the best ab initio loop predictor for membrane proteins.

FREAD has two main parameters which contribute to its accuracy and coverage: the nature of the database in which to look for a loop (membrane or soluble, mem/sol) and the choice of sequence identity (ESSS) cut-off:

  • ESSS >= 25: more accurate loop models are built (Hiacc).
  • ESSS > 0: greater coverage is achieved, but models are not necessarily accurate (Hicov).

To test the effect of these parameters on prediction accuracy and coverage, we chose two test sets:

  • Mem_DS: 280 loops taken from MP X-ray structures.
  • Model_DS: 156 loops from homology models of MPs.

The loop lengths in both sets range from 4 to 17 residues. The comparisons on both data sets confirm that, to achieve the highest accuracy and coverage, the FREAD pipeline should be run in the following order (sketched in code below):

1. Hiacc-mem
2. Hicov-mem
3. Hiacc-sol
4. Hicov-sol

Memoir with the new FREAD pipeline, called Memoir 2.0, achieves higher coverage than the original Memoir 1.0.
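The staged protocol amounts to a simple fall-through cascade. Here is a minimal sketch; fread_search is a hypothetical stand-in for invoking FREAD with a given database and ESSS cut-off, not FREAD's real interface.

```python
# Hypothetical helper: fread_search(loop, database, esss_cutoff) returns a
# loop model, or None if FREAD finds no suitable fragment.
def fread_cascade(target_loop, fread_search):
    """Try the four FREAD configurations in decreasing order of expected
    accuracy and return the first successful prediction."""
    stages = [
        ("mem", 25),  # Hiacc-mem: membrane database, ESSS >= 25
        ("mem", 0),   # Hicov-mem: membrane database, any ESSS
        ("sol", 25),  # Hiacc-sol: soluble database, ESSS >= 25
        ("sol", 0),   # Hicov-sol: soluble database, any ESSS
    ]
    for database, esss_cutoff in stages:
        model = fread_search(target_loop, database, esss_cutoff)
        if model is not None:
            return model
    return None  # fall through to ab initio modeling (see below)
```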

But there are still loops which are not modeled by the FREAD pipeline. These should be modeled using an ab initio method. To test the performance of soluble ab initio loop predictors on membrane proteins, we predicted the loops of the above test sets using six ab initio methods available for download: Loopy, LoopBuilder, Mechano, Rapper, Modeller and Plop.


    Comparison between ab initio methods on membrane proteins

The comparisons in the image above show that:

  • FREAD is more accurate, but does not achieve complete coverage.
  • Greater coverage is achieved using ab initio predictors.
  • Mechano, LoopBuilder and Loopy are the best ab initio predictors.

We have selected Mechano for Memoir 2.0 because it:

  • provides higher coverage than Loopy whilst retaining a similar accuracy;
  • is faster than LoopBuilder (Mechano is ~30 minutes faster on loops of length 12);
  • is able to model terminals.

In Memoir 2.0 the C- and N-termini of up to 8 residues are built using Mechano. Mechano decoys are then ranked by their DFIRE score and accepted only if they have exited the membrane. This check improves the average RMSD by up to 4 Å on the DS_280 terminals.

In conclusion, Memoir 2.0 provides higher-coverage models while maintaining a reasonable level of accuracy. Our comparisons showed that FREAD is significantly more accurate than the ab initio methods, but that greater coverage is achieved using ab initio predictors. The comparison shows that the top ab initio predictors in terms of accuracy are Mechano, LoopBuilder and Loopy; similar patterns were also present in the model dataset. We selected Mechano as it provides higher coverage than Loopy whilst retaining a similar accuracy, and is also much faster than LoopBuilder. Mechano has the further advantage that it is able to model terminals. Only loops shorter than 17 residues were considered for modelling, since above this threshold the accuracy of predicted loops drops significantly.

Hypotheses and Perspectives on de novo protein structure prediction

    Before I start with my musings about my work and the topic of my D. Phil thesis, I would like to direct you to a couple of previous entries here on BLOPIG. If you are completely new to the field of protein structure prediction or if you just need to refresh your brain a bit, here are two interesting pieces that may give you a bit of context:

    A very long introductory post about protein structure prediction

    and

    de novo Protein Structure Prediction software: an elegant “monkey with a typewriter”

    Brilliant! Now, we are ready to start.

    In this OPIG group meeting, I presented some results that were obtained during my long quest to predict protein structures.

    Of course, no good science can happen without the postulation of question-driving hypotheses. This is where I will start my scientific rant: the underlying hypotheses that inspired me to inquire, investigate, explore, analyse, and repeat. A process all so familiar to many.

    As previously discussed (you did read the previous posts as suggested, didn’t you?), de novo protein structure prediction is a very hard problem. Computational approaches often struggle to search the humongous conformational space efficiently. Who can blame them? The number of possible protein conformations is so astronomically large that it would take MUCH longer than the age of the universe to look at every single possible protein conformation.

If we go back to biology, protein molecules are constantly undergoing folding. What is more, they manage to do so efficiently and accurately. How is that possible? And can we use that information to improve our computational methods?

    The initial hypothesis we formulated in the course of my degree was the following:

    “We [the scientific community] can benefit from better understanding the context under which protein molecules are folding in vivo. We can use biology as a source of inspiration to improve existing methods that perform structure prediction.”

    Hence came the idea to look at biology and search for inspiration. [Side note: It is my personal belief that there should be a back and forth process, a communication, between computational methods and biology. Biology can inspire computational methods, which in turn can shed light on biological hypotheses that are hard to validate experimentally]

    To direct the search for biological inspiration, it was paramount to understand the limitations of current prediction methods. I have narrowed down the limitations of de novo protein structure prediction approaches to three major issues:

1- Heuristics that rely on sampling the conformational space using fragments extracted from known structures will fail when those fragments do not encompass or correctly describe the right answer.

    2- Even when the conformational space is reduced, say, to fragment space, the combinatorial problem persists. The energy landscape is rugged and unrepresentative of the actual in vivo landscape. Heuristics are not sampling the conformational space efficiently.

3- Following from the previous point, the energy landscape is unrepresentative of the in vivo landscape because of the inaccuracy of the knowledge-based potentials used in de novo structure prediction.

    Obviously, there are other relevant issues with de novo structure prediction. Nonetheless, I only have a limited amount of time for my D.Phil and those are the limitations I decided to focus on.

To address each of these limitations, we have looked for inspiration in biology.

Our understanding from looking at different protein structures is that several conformational constraints are imposed by alpha-helices and beta-strands. That is a consequence of hydrogen bond formation within secondary structure elements. Unsurprisingly, when looking for fragments that represent the correct structure of a protein, it is much easier to identify good fragments for alpha-helical or beta-strand regions. Loop regions, on the other hand, are much harder to describe correctly with fragments extracted from known structures. We have incorporated this important information into a fragment library generation software in an attempt to address limitation number 1.

    We have investigated the applicability of a biological hypothesis, cotranslational protein folding, into a structure prediction context. Cotranslational protein folding is the notion that some proteins begin their folding process as they are being synthesised. We further hypothesise that cotranslational protein folding restricts the conformational space, promoting the formation of energetically-favourable intermediates, thus steering the folding path towards the right conformation. This hypothesis has been tested in order to improve the efficiency of the heuristics used to search the conformational space.

    Finally, following the current trend in protein structure prediction, we used evolutionary information to improve our knowledge-based potentials. Many methods now consider correlated mutations to improve their predictions, namely the idea that residues that mutate in a correlated fashion present spatial proximity in a protein structure. Multiple sequence alignments and elegant statistical techniques can be used to identify these correlated mutations. There is a substantial amount of evidence that this correlated evolution can significantly improve the output of structure prediction, leading us one step closer to solving the protein structure prediction problem. Incorporating this evolution-based information into our routine assisted us in addressing the lack of precision of existing energy potentials.
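The simplest statistical signal of correlated mutations is the mutual information between pairs of alignment columns; modern methods use more elaborate statistics (e.g. direct coupling analysis) to remove transitive effects, but a minimal sketch of the underlying idea looks like this:

```python
import math
from collections import Counter

def mutual_information(column_i, column_j):
    """Mutual information between two MSA columns (lists of residues, one
    per sequence). High MI suggests correlated mutations, which in turn
    hints at spatial proximity in the structure."""
    n = len(column_i)
    pi, pj = Counter(column_i), Counter(column_j)
    pij = Counter(zip(column_i, column_j))
    mi = 0.0
    for (a, b), count in pij.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((pi[a] / n) * (pj[b] / n)))
    return mi
```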

Well, does it work? Surprisingly or not, in some cases it does! We have participated in a blind competition: the Critical Assessment of protein Structure Prediction (CASP). This event is rather unique: it brings together the whole structure prediction community, and it enables the community to gauge how good we are at predicting protein structures. Working with completely blind predictions, we were able to produce one correct answer, which is a good thing (I guess).

All of this comes together nicely in our biologically inspired pipeline to predict protein structures. I like to think of our computational pipeline as a microscope. We can use it to prod and look at biology. We can tinker with hypotheses, implement potentials and test them, and see what is useful for us and what isn’t. It may not be exactly what gets papers published, but the investigative character of our structure prediction pipeline is definitely my favourite aspect of this work. It is the aspect that makes me feel like a scientist.

    Protein Structure Prediction, my own metaphorical microscope…


    Improving the accuracy of CDR-H3 structure prediction

When designing an antibody for therapeutic use, knowledge of the structure (in particular the binding site) is a huge advantage. Unfortunately, obtaining even one of these structures experimentally, for example by X-ray crystallography, is very difficult and time-consuming – researchers have therefore been turning to models.

The ‘framework’ regions of antibodies are well conserved between structures, and therefore homology modelling can be used successfully. However, problems arise when modelling the six loops that make up the antigen binding site – called the complementarity determining regions, or CDRs. For five of these loops, only a small number of conformations have actually been observed, forming a set of structural classes – these are known as canonical structures. The class that a CDR loop belongs to can be predicted from its sequence, making the prediction of their structures quite accurate. However, this is not the case for the H3 loop (the third CDR of the heavy chain) – there is a much larger structural diversity, making H3 structure prediction a challenging problem.


    Antibody structure, showing the six CDR loops that make up the antigen binding site. The H3 loop is found in the centre of the binding site, shown in pink. PDB entry 1IGT.

    H3 structure modelling can be considered as a specific case of general protein loop modelling. Starting with the sequence of the loop, and the structure of the remaining parts of the protein, there are three stages in a loop modelling algorithm: conformational sampling, the filtering out of physically unlikely structures, and ranking. There are two types of loop modelling algorithm, which differ in the way they perform the conformational sampling step: knowledge-based methods, and ab initio methods. Knowledge-based methods use databases of known structures to produce loop conformations, while ab initio methods do this computationally, without knowledge of existing structures. My research involves the testing and development of these loop modelling algorithms, with the aim of improving the standard of H3 structure prediction.

    A knowledge-based method that I have tested is FREAD. FREAD uses a database of protein fragments that could possibly be used as loop structures. This database is searched, and possible structures are returned depending on the similarity of their sequence to the target sequence, and the similarity of the anchor structures (the two residues on either side of the loop). On a set of 55 unbound H3 loop targets, ranging between 8 and 18 residues long, FREAD (using a database of known H3 structures) produced an average best prediction RMSD of 2.7 Å (the ‘best’ prediction is the loop structure closest to the native of all those returned by FREAD). FREAD is obviously very sensitive to the availability of H3 structures: if no similar structure has been observed before, FREAD will either return a poor answer or fail to find any suitable fragments at all. For this reason there is huge variation in the FREAD results – for example, the best prediction for one target had an RMSD of 0.18 Å, while for another, the best RMSD was 10.69 Å. Fourteen of the targets were predicted with an RMSD of below 1 Å. The coverage for this particular set of targets was 80%, which means that FREAD failed to find an answer for one in five targets.

MECHANO is an ab initio algorithm that we have developed specifically for H3 loop prediction. Loops are built computationally, by adding residues sequentially onto one of the anchors. For each residue, φ/ψ dihedral angles are chosen at random from a distribution – the distributions used by MECHANO are residue-specific, and are a combination of general loop data and H3 loop data. Loop conformations are closed using a modified cyclic coordinate descent (CCD) algorithm, in which the dihedrals of each residue are changed, one at a time, to minimise the distance between the free end of the loop and its anchor point, whilst keeping the dihedral angles in the allowed regions of the Ramachandran plot. I have tested MECHANO on the same set of targets as FREAD, generating 5000 loop conformations per target: the average best prediction RMSD was 2.1 Å, and the results showed a clear length dependence – this is expected, since the conformational space to explore becomes larger as the number of residues increases. Even though MECHANO’s average best prediction RMSD is better than FREAD’s, only one of its best RMSDs was sub-angstrom, compared to 14 for FREAD. Since the MECHANO algorithm does not depend on previously observed structures, predictions were made for all targets (i.e. coverage = 100%).
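To give a feel for the closure step, here is a bare-bones CCD on a chain of pseudo-atoms. It is a sketch of the general CCD idea only: MECHANO's actual implementation additionally samples residue-specific φ/ψ angles and restricts dihedrals to allowed Ramachandran regions, none of which is modelled here.

```python
import numpy as np

def ccd_close(chain, target, max_iters=200, tol=1e-3):
    """Cyclic coordinate descent: rotate about each bond in turn so that
    the free end of the chain (chain[-1]) approaches the target anchor."""
    chain = chain.astype(float).copy()
    for _ in range(max_iters):
        if np.linalg.norm(chain[-1] - target) < tol:
            break
        for i in range(len(chain) - 2):
            pivot = chain[i]
            axis = chain[i + 1] - pivot
            axis = axis / np.linalg.norm(axis)
            r = chain[-1] - pivot                      # free end of the loop
            s = target - pivot                         # desired end position
            r_perp = r - np.dot(r, axis) * axis
            # Rotation angle about this bond that best aligns end with target
            theta = np.arctan2(np.dot(np.cross(axis, r_perp), s),
                               np.dot(r_perp, s))
            # Rodrigues rotation of all downstream atoms about the bond axis
            v = chain[i + 1:] - pivot
            chain[i + 1:] = (pivot + v * np.cos(theta)
                             + np.cross(axis, v) * np.sin(theta)
                             + np.outer(v @ axis, axis) * (1 - np.cos(theta)))
    return chain
```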

    My current work is focused upon developing a ‘hybrid’ method, which combines elements of the FREAD and MECHANO algorithms. In this way, we hope to make predictions with the accuracy that can be achieved by FREAD, whilst maintaining 100% coverage. In its current form, the hybrid method, when tested on the 55-loop dataset from before, produces an average best prediction RMSD of 1.68 Å, with 16 targets having a best RMSD of below 1 Å – a very promising result! However, possibly the most difficult part of loop prediction is the ranking of the generated loop structures; i.e. choosing the conformation that is closest to the native. This is therefore my next challenge!

    Predicting Antibody Affinities – Progress?

Antibodies are an invaluable class of therapeutic molecules — they have the capacity to bind any molecule (Martin and Thornton, 1996), and this comes from an antibody’s remarkable diversity (Georgiou et al., 2014). In particular, antibodies can bind their targets with high affinity, and most drug design campaigns seek to ‘mature’ an antibody, i.e. increase the antibody’s affinity for its target. However, the process of antibody maturation is, in experimental terms, time-consuming and expensive — if an antibody has 6 CDRs (as is typical) with 10 residues each, and any of the 20 amino acids can occupy each CDR position, there are 20^60 mutants to test (and this is before considering any double or triple mutations!)

So hold on, what is affinity exactly? Affinity represents the strength of binding, and it can be calculated either as a ratio of concentrations or as a ratio of rate constants:

K_D = [Ab][Ag] / [AbAg] = k_off / k_on

In the simplest affinity maturation protocol, three steps are compulsory:

    1. Mutate the antibody’s structure correctly
    2. Assess the impact of mutation on KD
    3. Select and repeat.

For the past year, we have focused on step 2 — affinity prediction. This is a fundamental aspect of the affinity maturation pipeline, needed to rationalise ‘good’ and ‘bad’ mutations in the context of maturing an antibody. We developed a statistical potential, CAPTAIN; essentially, the idea is to gather the contact statistics represented in antibody-antigen complexes and use this information to predict affinities.
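The flavour of such a statistical potential can be sketched in a few lines: score each contact type by comparing how often it occurs at antibody-antigen interfaces with how often it occurs in a background set. This is the general inverse-Boltzmann recipe, not the actual CAPTAIN implementation, and the inputs are hypothetical.

```python
import math
from collections import Counter

def contact_potential(interface_contacts, background_contacts):
    """Inverse-Boltzmann contact scores: -ln(observed / expected) for each
    residue-pair type. Negative scores mark contacts enriched at
    antibody-antigen interfaces relative to the background set."""
    obs = Counter(interface_contacts)
    bg = Counter(background_contacts)
    n_obs, n_bg = sum(obs.values()), sum(bg.values())
    scores = {}
    for pair, count in obs.items():
        expected = max(bg[pair], 1) / n_bg
        scores[pair] = -math.log((count / n_obs) / expected)
    return scores

# A complex's predicted binding strength is then the sum of the scores of
# the contacts observed at its interface (lower = more favourable here).
```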

    But why use contact information? Does it provide anything useful? Based on analyses of the interfaces of antibody-antigen complexes in comparison to general protein-protein interfaces, we definitely see that antibodies rely on a different binding mode compared to general protein-protein complexes, and other studies have confirmed this notion (Ramaraj et al., 2012; Kunik and Ofran, 2013; Krawczyk et al., 2013).

    For our methodology, we trained on general protein-protein complexes (as most scoring functions do!) and specifically on antibody-protein complexes from the PDB. For our test set of antibody-protein complexes, we outperformed 16 other published methods, though for our antibody-peptide test set, we were one of the worst performers. We found that other published methods predict antibody-protein affinities poorly, though they make better predictions for antibody-peptide binding affinities. Ultimately, we achieve stronger performance as we use a more appropriate training set (antibody-antigen complexes) for the problem in hand (predicting antibody-antigen affinities). Our correlations were by no means ideal, and we believe that there are other aspects of antibody structures that must be used for improving affinity prediction, such as conformational entropy (Haidar et al., 2013) and VH-VL domain orientation (Dunbar et al., 2013; Fera et al., 2014).

    What’s clear though, is that using the right knowledge base is key to improving predictions for solving greater problems like affinity maturation. At present, most scoring functions are trained on general proteins, but this ‘one-fits-all’ approach has been subject to criticism (Ross et al., 2013). Our work supports the idea that scoring functions should be tailored specifically for the problem in hand.


    Looking for a null model of PPI ego-networks

Protein-protein interaction (PPI) networks describe how proteins are connected to one another in terms of physical interactions. They can be used to aid our understanding of the individual roles of proteins (Sarajlić et al., 2013), the co-functioning properties of sets of proteins (West et al., 2013) and even the operation of the complete system (Janowski et al., 2014).

Different approaches have been proposed to analyse, describe and predict these PPI networks, such as network summary statistics, clustering methods, random graph models and machine learning methods. However, despite the large biological, computational and statistical interest in PPI networks, current models insufficiently describe PPI networks (Winterbach et al., 2013; Ali et al., 2014; Rito et al., 2010). It is commonly accepted that proteins perform functions usually in conjunction with other proteins, forming a functional module (Lewis et al., 2010). Hence local structure is found to be important in protein-protein interaction networks (Planas-Iglesias et al., 2013).

    Here we address the modelling problem locally by modelling the ego-networks of PPI networks by means of random graph models.

    Random graph models

Loosely speaking, a random graph model is a set of rules that defines an edge generation process among a set of nodes. Usually these rules relate to particular characteristics that are embedded in the network generation process. Here are three examples of such characteristics:

  • Independence (each edge has a probability p of being present).
  • Preferential attachment (nodes preferentially form edges with highly connected nodes).
  • Space-based interactions (an edge is present between two nodes if the distance between them is small).

A classical model representing an independence structure is the Erdős–Rényi model ER(n_v, p): a random graph on n_v nodes, where edges are present independently at random with probability p.

[Figure: a realisation of the ER model]

The preferential attachment characteristic can be illustrated by the Chung–Lu model: given a sequence of expected weights {d_1, d_2, ..., d_{n_v}}, the probability of an edge between nodes i and j is P((i,j) ∈ E) = d_i d_j / Σ_k d_k.

[Figure: a realisation of the Chung–Lu model]

Finally, a model representing a space-based network generation process is the geometric model. Here, nodes are placed uniformly at random in the d-dimensional cube [0,1]^d. Given a radius or threshold distance r, an edge is drawn between nodes v_i and v_j (i ≠ j) if d(v_i, v_j) ≤ r.

[Figure: a realisation of the geometric model]
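As a concrete illustration, all three generation processes are available in networkx (the Chung–Lu model under the name expected_degree_graph); the parameter values below are arbitrary:

```python
import networkx as nx

n = 50

# Independence: each edge present independently with probability p
er = nx.erdos_renyi_graph(n, p=0.1)

# Preferential attachment via Chung-Lu: P((i,j) in E) ~ d_i * d_j / sum_k d_k
weights = [10] * 5 + [2] * 45          # expected degree sequence
cl = nx.expected_degree_graph(weights, selfloops=False)

# Space-based: nodes uniform in [0,1]^2, edge if distance <= r
geo = nx.random_geometric_graph(n, radius=0.2)
```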

From the figures above it can be seen that different models lead to different network structures. Thus, although standard random graph models do not reproduce a network structure sufficiently similar to that of PPI networks, they could still be good approximations for different local regions of a PPI network.



    Finding a null model for PPI ego-networks

Our approach consists of finding local regions of the PPI network that are well represented by random graph models. To do so, we extract all 2-step ego-networks and classify them according to some simple characteristic, for example network density.

Once the ego-networks of the PPI network have been extracted and binned according to their network density (ego-density), we assess the fit of a model by comparing each bin of PPI ego-networks to ego-networks extracted from that random graph model. The comparison is based on the difference in subgraph counts (triangles, for example) between the ego-networks within each bin.
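A sketch of the extraction-and-binning step using networkx (the bin count and the triangle summary statistic are illustrative choices):

```python
import networkx as nx

def ego_density_bins(G, n_bins=10):
    """Extract every 2-step ego-network of G and bin it by edge density."""
    bins = {b: [] for b in range(n_bins)}
    for node in G:
        ego = nx.ego_graph(G, node, radius=2)
        b = min(int(nx.density(ego) * n_bins), n_bins - 1)
        bins[b].append(ego)
    return bins

def mean_triangle_count(egos):
    """Average triangle count over a bin of ego-networks; comparing this
    between PPI and model ego-networks gives the per-bin fit assessment."""
    if not egos:
        return 0.0
    return sum(sum(nx.triangles(e).values()) // 3 for e in egos) / len(egos)
```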

    The following figure illustrates the underlying idea of this procedure:


[Figure: binning PPI ego-networks by density and comparing their subgraph counts with those of model ego-networks]

Following this approach, we aim to find bins for which models (possibly different ones per bin) reproduce subgraph counts similar to those obtained from the PPI ego-networks. However, we also expect to find bins for which none of the standard models fit.

    OOMMPPAA: A tool to aid directed synthesis by the combined analysis of activity and structural data

    Motivation

    Recently I published a paper on OOMMPPAA, my 3D matched-molecular pair tool to aid drug discovery. The tool is available to try here, download here and read about here. OOMMPPAA aims to tackle the big-data problem in drug discovery – where X-ray structures and activity data have become increasingly abundant. Below I will explain what a 3D MMP is.

    What are MMPs?

MMPs are pairs of compounds that are identical apart from one structural change, as shown in the example below. Across a set of a thousand compounds, tens of thousands of MMPs can be found. Each pair can be represented as a transformation; the example below would be Br >> OH. Within a dataset, possibly hundreds of Br >> OH pairs will exist.


    An example of an MMP


Each of these transformations will also be associated with a change in a measurable compound property (solubility, acidity, binding energy, number of bromine atoms…). Each general transformation (e.g. Br >> OH) therefore has a distribution of values for a given property. From this distribution we can infer the likely effect of making the change on an unmeasured compound, and from all the distributions (for all the transformations) we can predict the transformation most likely to improve the property we are interested in, as sketched below.
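In code, the per-transformation distributions amount to a simple group-by over a table of matched pairs; the table below is made-up data for illustration only:

```python
import pandas as pd

# Hypothetical matched molecular pairs: one row per pair, recording the
# transformation and the observed change in the property of interest
pairs = pd.DataFrame({
    "transformation": ["Br>>OH", "Br>>OH", "Br>>OH", "H>>F", "H>>F"],
    "delta_property": [0.4, -0.1, 0.6, 0.2, 0.3],
})

# Distribution of property changes for each general transformation;
# ranking by the mean suggests the change most likely to help
summary = pairs.groupby("transformation")["delta_property"].agg(
    ["mean", "std", "count"])
print(summary.sort_values("mean", ascending=False))
```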

    The issue with MMPs

Until recently the MMP approach had been of limited use for predicting compound binding energies, for two core reasons.

    1) Most distributions would be pretty well normally distributed around zero. Binding is complex and the same change can have both positive and negative effects.

2) Those distributions that are overwhelmingly positive produce, by definition, increased binding against many proteins. A key aim of drug discovery is to produce selective compounds that increase binding energy only for the protein you are interested in, so a generic increase in binding energy is not, overall, very useful.

    3D MMPs to the rescue

    3D MMPs aim to resolve these issues and allow MMP analysis to be applied to binding energy predictions. 3D MMPs use structural data to place the transformations in the context of the protein. One method, VAMMPIRE, asks what is the overall effect of Br>>OH when it is near to a Leucine, Tyrosine and Tryptophan (for example). In this way selective changes can be found.

Another method, by BMS, aggregates these changes across a target class, in their case kinases. From this they can ask questions like, “Is it overall beneficial to add a cyclo-propyl amide in this region of the kinase binding site?”

    What does my tool do differently?

OOMMPPAA differs from the above in two core ways. First, OOMMPPAA uses a pharmacophore abstraction to analyse changes between molecules, which is an effective way of increasing the number of observations for each transition. Secondly, OOMMPPAA does not aggregate activity changes into general trends but considers positive and negative activity changes separately. We show in the paper that this allows a more nuanced analysis of confounding factors in the available data.

    How do we measure translation speed?

Two major trains of thought have emerged on how one can define translation speed: one uses the cognate tRNA concentrations, the other the codon bias within the genome. The former is a natural measure: the amount of cognate tRNA available to the ribosome for a given codon has been shown to affect translation, and in the extreme case, when no cognate tRNA is available, the ribosome is even found to detach from the transcript after a period of waiting. The latter, codon bias, is the relative quantity of each codon within a synonymous group. The most frequent codons are assumed to be optimal, as it is hypothesised that the genome is optimised to produce proteins in the fastest, most efficient manner. Lastly, there is a third school of thought where one has to balance both the supply and the usage of any given codon: if a codon is overused, it will actually have a lower tRNA concentration than would be suggested by its tRNA gene copy number (an approximation of the tRNA’s concentration). Each of these three descriptions has been used in its own computational studies to show the association of speed, represented by each measure, with protein structure.
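As a toy illustration of the codon-bias measure, the sketch below computes the relative usage of each codon within its synonymous group for a coding sequence (the codon table is deliberately truncated to two amino acids):

```python
from collections import Counter

# Truncated synonymous codon groups for illustration; a real analysis
# would use the full codon table
SYNONYMS = {
    "Leu": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
    "Phe": ["TTT", "TTC"],
}

def relative_codon_usage(cds):
    """Fraction of usage each codon takes within its synonymous group;
    the most-used codon in a group is assumed to be the 'optimal' one."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    counts = Counter(codons)
    usage = {}
    for group in SYNONYMS.values():
        total = sum(counts[c] for c in group)
        for c in group:
            usage[c] = counts[c] / total if total else 0.0
    return usage
```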


    A simplified schematic of ribosome profiling. Ribosome profiling begins with separating a cell’s polysomes (mRNA with ribosomes attached) from its lysate. Erosion by nuclease digestion removes all mRNA not shielded by a ribosome while also cleaving ribosomes attached to the same mRNA strand. Subsequent removal of the ribosomes leaves behind only the mRNA fragments which were undergoing translation at the point of cell lysis. Mapping these fragments back to the genome gives a codon-level resolution transcriptome-wide overview of the translation occurring within the cell. From this we can infer the translation speed associated with any given codon from any given gene.

However, while these definitions have existed for decades, there has until now been no objective way to test how accurately they measure translation speed. Essentially, we have based years of research on the extrapolation of a few coarse experiments, or in some cases purely theoretical models, to all translation. There now exists an experimental measure of translation occurring in vivo. Ribosome profiling, outlined above, measures the translation occurring within a cell, mapping the position of the ribosome on the genome at the point of cell lysis. Averaging over many cells gives an accurate measure of the expected translation occurring on any given transcript at any time.


Comparing the log-transformed ribosome profile data to the translation speed as defined by each of the algorithms for B. subtilis. We show the mean ribosome occupancy against the mean translation speed when stratified by codon, finding that the values assigned by each algorithm fail to capture the variation in the ribosome profiling data.

As an initial comparison, shown above, we compared some of the most popular speed measures based on the above descriptions to the ribosome profiling data. None of the measures were found to recreate the ribosome profiling data adequately. In fact, while some association is found, it is the opposite of what we would expect: the faster the codon according to the algorithm, the more likely a ribosome is to occupy it! We thought that this might be due to treating all the codons together instead of with respect to the genes they come from. Essentially, is a given codon actually fast if it simply sits within a gene that is in general fast? To test for this, we created a set of models which account for a shift in the ribosome data profile depending on the source gene. However, these showed even less association with the speed algorithms!

These findings suggest that the algorithms on which the scientific community has based its work for the past decades may in fact be poor representations of translation speed. This leads to a conundrum, however, as these measures have been used in experimental studies, notably the paper by Sander et al. (see the journal club entry here). In addition, codon bias matching has been used extensively to increase expression of non-native proteins in bacteria. Clearly these algorithms measure something and, as such, this contradiction needs to be resolved in the near future.

    How to (not) perform a Molecular Dynamics simulation study

    At this week’s group meeting I presented my recently published paper:

Knapp B, Dunbar J, Deane CM. Large scale characterization of the LC13 TCR and HLA-B8 structural landscape in reaction to 172 altered peptide ligands: a molecular dynamics simulation study. PLoS Comput Biol. 2014 Aug 7;10(8):e1003748. doi: 10.1371/journal.pcbi.1003748.

This paper was featured on the front page of PLoS Computational Biology:


    The paper describes Molecular Dynamics simulations (MD) of T-cell receptor (TCR) / peptide / MHC complexes.

MD simulations calculate the movement of the atoms of a given system over time. The zoomed-in movements of the TCR regions (CDRs) recognising a peptide/MHC are shown here:

[Animation: movements of the TCR CDRs over the simulation]

Often scientists perform MD simulations of two highly similar complexes with different immunological outcomes. An example is the Epstein-Barr virus peptide FLRGRAYGL, which induces an immune response when presented by HLA-B*08:01 to the LC13 TCR. If position 7 of the peptide is mutated to A, the peptide no longer induces an immune reaction.
In both simulations the TCR and MHC are identical; only the peptide differs by one amino acid. Once the simulations of those two complexes are done, one could, for example, investigate the root mean square fluctuation (RMSF) of both peptides to see whether one of them is more flexible. And indeed this is the case:

[Figure: RMSF of the two peptides, one simulation each]

    Another type of analysis would be the solvent accessible surface area of the peptide or, as shown below, the CDR regions of the TCR beta chain:

[Figure: SASA of the TCR beta chain CDRs, one simulation each]

    … and again it is obvious that those two simulations strongly differ from each other.

Is it then a good idea to publish an article about these findings and state something like “Our results indicate that peptide flexibility and CDR solvent accessible surface area play an important role in the activation of a T cell”?

As we know from our statistics courses, a 1 vs 1 comparison is an anecdotal example, and such findings often do not hold for larger sample sizes.

So, let’s try it with many more peptides and simulations (performing these simulations took almost 6 months on the Oxford Advanced Research Computing facility). We split the simulations into two categories: those which are more immunogenic and those which are less immunogenic:

[Figure: the full set of simulations, grouped into more and less immunogenic]

    Let’s see what happens to the difference in RMSF that we found before:

[Figure: peptide RMSF over all simulations]

So much for “more immunogenic peptides are more/less flexible than less immunogenic ones”…

    How about the SASA that we “found” before:

[Figure: CDR SASA over all simulations]

    … again almost perfectly identical for larger sample sizes.

And yes, we have found some minor differences between more and less immunogenic peptides that hold for larger sample sizes, but none of them was striking enough to predict peptide immunogenicity based on MD simulations.

What one really learns from this study is that you should not compare one MD simulation against one other MD simulation, as this will almost certainly lead to false-positive findings. Exactly the same applies to experimental data such as X-ray structures, because this is a statistical problem. If you want to make sustainable claims about something you have found, then you need a large sample size, together with a proper statistical sample size estimation and power analysis (as done in clinical studies) before you run any simulations. Comparing structures always involves massive multiple testing, and therefore large sample sizes are needed to draw valid conclusions.
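As a pointer to what such a calculation looks like, here is a standard two-sample power analysis using statsmodels; the effect size is an arbitrary illustrative choice, not one estimated from our data.

```python
from statsmodels.stats.power import TTestIndPower

# Simulations per group needed to detect a small effect (Cohen's d = 0.3)
# at alpha = 0.05 with 80% power, for an unpaired two-sample t-test
n_per_group = TTestIndPower().solve_power(effect_size=0.3,
                                          alpha=0.05, power=0.8)
print(round(n_per_group))  # about 175 per group
```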


    Research Talk: Ligand Fitting in X-ray Crystallography

    In the last group meeting, I reported on the success of ligand-fitting programs for the automated solution of ligand structures.

    In Fragment Screens by X-ray Crystallography, a library of small compounds (fragments) is soaked into protein crystals, and the resulting structures are determined by diffraction experiments. Some of the fragments will bind to the protein (~5% of the library), and these are detected by their appearance in the derived electron density.

The models of binding fragments can be used to guide structure-based drug design efforts, but first they must be built. Due to the large number of datasets (200-1000), automated identification of the fragments that bind, and automated building of atomic models, are required for efficient processing of the data.

    Density Blobs

Anecdotally, available ligand-fitting programs are unreliable when modelling fragments. We tested three ligand-fitting programs by refitting a series of ligand structures, and found that they fail more frequently when the electron density for the ligand is weak. Many fragments that are seen to bind in screens do so only weakly, due to their size. So the weaker a fragment binds, the harder it will be for automated programs to model it.

    Success Rates Identifying the Correct Model

    Models are usually ranked by the Real-Space Correlation Coefficient (RSCC) between the model and the experimental electron density. This metric is good at identifying ‘correct’ models, and an RSCC > 0.7 normally indicates a correct, or at least mostly correct, model.
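Since the RSCC is just the Pearson correlation between model-calculated and experimental density sampled on a common grid, a minimal version is easy to write down (grid preparation and map calculation are assumed to be done elsewhere):

```python
import numpy as np

def rscc(model_density, experimental_density):
    """Pearson correlation between model and experimental electron
    density values sampled on the same grid around the ligand."""
    m = np.ravel(model_density).astype(float)
    e = np.ravel(experimental_density).astype(float)
    m = (m - m.mean()) / m.std()
    e = (e - e.mean()) / e.std()
    return float(np.mean(m * e))

# By the rule of thumb above, rscc(...) > 0.7 flags a (mostly) correct model.
```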

Typically, the binding locations of ligands are found by searching for unmodelled peaks in the electron density map. Models are then generated at these locations, scored, and ranked, and good models can be identified and presented to the user. However, if a ‘good’ model is never generated to be scored and ranked, the RSCCs of the ‘bad’ models will not tell you that there is something to be modelled at a particular place, and binding may be missed…

    This is especially true for weak-binding ligands, which will not give a large electron density peak to give evidence that there is something there to be modelled.

    Currently, all of the datasets must be inspected manually, to check that a weak-binding fragment has not been missed…