Category Archives: Group Meetings

What we discuss during cake at our Tuesday afternoon group meetings

5 Thoughts For… Comparing Crystallographic Datasets

Most of the work I do involves comparing diffraction datasets from protein crystals. We often have two or more different crystals of the same crystal system, and want to spot differences between them. The crystals are nearly isomorphous, so that the structure of the protein (and crystal) is almost identical between the two datasets. However, it’s not just a case of overlaying the electron density maps, subtracting them and looking at the difference. Nor do we necessarily want to calculate Fo-Fo maps, where we calculate the difference by directly subtracting the diffraction data before calculating maps. By the nature of the crystallographic experiment, no two crystals are the same, and two (nearly identical) crystals can lead to two quite different datasets.

So, here’s a list of things I keep in mind when comparing crystallographic datasets…

Control the Resolution Limits

1) Ensure that the resolution limits of the datasets are the same, at both the high AND the low resolution end.

The high resolution limit. The best known, and (usually) the most important, statistic of a dataset. This is a measure of the amount of information that has been collected about the crystal. Higher resolution data give more detail in the electron density. Therefore, if you compare a 3Å map to a 1Å map, you’re comparing fundamentally different objects, and the differences between them will be predominantly due to the different amount of information in each dataset. It’s then very difficult to ascertain what’s interesting and what is an artefact of this difference. As a first step, truncate all datasets at the resolution at which you wish to compare them.

The low resolution limit. At the other end of the dataset, there can be differences in the low resolution data collected. Low resolution reflections correspond to much larger-scale features in the electron density. Therefore, it’s just as important to have the same low resolution limit for both datasets; otherwise you get large “waves” of electron density (low-frequency Fourier terms) in one dataset that are not present in the other. Because low-resolution terms are much stronger than high-resolution reflections, these features stand out very strongly, and can also obscure “real” differences between the datasets you’re trying to compare. Truncate all datasets at the same low resolution limit as well.
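As a minimal illustration (assuming the reflection data are already in hand as parallel numpy arrays of d-spacings and amplitudes, rather than in any particular crystallographic library’s format), applying a common resolution window to both datasets might look like this:

```python
import numpy as np

def truncate_resolution(d_spacings, amplitudes, d_min, d_max):
    """Keep only reflections with d_min <= d <= d_max (in Angstroms).
    d_spacings and amplitudes are parallel numpy arrays; apply the same
    limits to every dataset being compared."""
    d_spacings = np.asarray(d_spacings)
    amplitudes = np.asarray(amplitudes)
    mask = (d_spacings >= d_min) & (d_spacings <= d_max)
    return d_spacings[mask], amplitudes[mask]

# e.g. truncate both datasets to a common (illustrative) 20 A - 2 A window:
# d1_cut, f1_cut = truncate_resolution(d1, f1, d_min=2.0, d_max=20.0)
# d2_cut, f2_cut = truncate_resolution(d2, f2, d_min=2.0, d_max=20.0)
```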

Consider the Unit Cell

2) Even if the resolution limits are the same, the number of reflections used to calculate the maps can be different.

The unit cell size and shape. Even if the crystals you’re using are the same crystal form, no two crystals are the same. The unit cell (the building block of the crystal) can have slightly different sizes and shapes between crystals, varying by a few percent. This can occur for a variety of reasons, from the unpredictable process of cooling the crystal to cryogenic temperatures to entirely stochastic differences arising during crystallisation. Since the “resolution” of a reflection depends on the size of the unit cell, two reflections with the same Miller index can have different “resolutions” when it comes to selecting reflections for map calculation. Therefore, if you’re calculating maps from nearly-isomorphous but non-identical crystals, consider calculating maps based on a high and a low Miller index cutoff, rather than a resolution cutoff. This ensures the same amount of information in each map (the same number of free parameters).
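For an orthorhombic cell the relationship between a Miller index and its resolution is simple, which makes the point easy to demonstrate; this is only a sketch and the cell edges below are invented:

```python
import numpy as np

def d_spacing_orthorhombic(h, k, l, a, b, c):
    """d-spacing (Angstroms) of reflection (h, k, l) for an orthorhombic
    cell with edges a, b, c; lower-symmetry cells need the full
    reciprocal-space metric tensor."""
    return 1.0 / np.sqrt((h / a) ** 2 + (k / b) ** 2 + (l / c) ** 2)

# The same Miller index lands at a slightly different resolution in two
# nearly-isomorphous crystals (made-up cell edges):
print(d_spacing_orthorhombic(10, 0, 0, a=50.0, b=60.0, c=70.0))  # 5.0 A
print(d_spacing_orthorhombic(10, 0, 0, a=51.0, b=60.5, c=70.2))  # 5.1 A
```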

Watch for Missing Reflections

3) Reflections missing from one dataset should be removed from the other as well.

Reflections can be missing from datasets for a number of reasons, such as falling into gaps/dead pixels on the detector. However, this isn’t going to happen systematically with all crystals, as different crystals will be mounted in different orientations. When a reflection is missed in one dataset, it’s best to remove it from the dataset you’re comparing it to as well. This can have an important effect when the completeness of low- or high-resolution shells is low, whatever the reason.
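A sketch of the idea, assuming each dataset is held as a simple dictionary keyed by Miller index (not any particular file format):

```python
def common_reflections(data_a, data_b):
    """Restrict two datasets to the Miller indices observed in both.
    data_a and data_b are dictionaries mapping (h, k, l) to a structure
    factor amplitude; reflections missing from either dataset are dropped
    from both before any maps are calculated."""
    shared = set(data_a) & set(data_b)
    return ({hkl: data_a[hkl] for hkl in shared},
            {hkl: data_b[hkl] for hkl in shared})
```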

Not All Crystal Errors are Created Equal…

4) Different crystals have different measurement errors.

Observation uncertainties of reflections vary from crystal to crystal. This may be due to a poor-quality crystal, or a crystal that has suffered more radiation damage than another. These errors lead to uncertainty and error in the electron density maps. Therefore, if you’re looking for a reference crystal, you probably want to choose one with the smallest possible uncertainties, σ(F), on the reflections.

Proteins are Flexible

5) Even though the crystals are similar, the protein may adopt slightly different conformations.

In real space, the protein structure varies from crystal to crystal. For the same crystal form, there will be the same number of protein copies in the unit cell, and they will be largely in the same conformation. However, the structures are not identical, and the inherent flexibility of the protein means that the conformation seen in the crystal can change slightly from crystal to crystal. This effect is largest in the most flexible regions of the protein, such as unconstrained C- and N-termini, flexible loops and crystal contacts.

SAS-5 assists in building centrioles of the nematode worm Caenorhabditis elegans

We have recently published a paper in eLife describing the structural basis for the role of the protein SAS-5 in initiating the formation of a new centriole, called a daughter centriole. But why do we care, and why is this discovery important?

We humans, like all multicellular organisms, are in constant need of new cells. We need them to grow from an early embryo into an adult, and also to replace dead or damaged cells. Cells don’t just appear from nowhere; they arise through a tightly controlled process called the cell cycle. At the core of the cell cycle lies the segregation of duplicated genetic material into two daughter cells. Pairs of chromosomes need to be pulled apart millions of millions of times a day, and errors can lead to cancer. To avoid this apocalyptic scenario, evolution supplied us with centrioles. These large molecular machines sprout microtubules radially to form characteristic asters, which then bind to individual chromosomes and pull them apart. In order to achieve continuity, centrioles duplicate once per cell cycle.

Like many large macromolecular assemblies, centrioles exhibit symmetry. A few unique proteins come together in multiple copies to build this gigantic cylindrical molecular structure: 250 nm wide and 500 nm long (the size of a centriole in humans). The very core of the centriole looks like a 9-fold symmetric stack of cartwheels, at whose periphery microtubules are vertically installed. We study the protein composition of this fascinating structure in an effort to understand the process of assembling a new centriole.

Molecular architecture of centrioles.

SAS-5 is an indispensable component of C. elegans centriole biogenesis. SAS-5 physically associates with another centriolar protein, called SAS-6, forming a complex which is required to build new centrioles. This process is regulated by phosphorylation events, allowing for subsequent recruitment of SAS-4 and microtubules. In most other systems SAS-6 forms a cartwheel (a central tube in C. elegans), which forms the basis for the 9-fold symmetry of centrioles. Unlike SAS-6, SAS-5 exhibits strong spatial dynamics, shuttling between the cytoplasm and centrioles throughout the cell cycle. Although SAS-5 is an essential protein, depletion of which completely terminates centrosome-dependent cell division, its exact mechanistic role in this process remains obscure.

IN BRIEF: WHAT WE DID
Using X-ray crystallography and a range of biophysical techniques, we have determined the molecular architecture of SAS-5. We show that SAS-5 forms a complex oligomeric structure, mediated by two self-associating domains: a trimeric coiled coil and a novel globular dimeric Implico domain. Disruption of either domain leads to centriole duplication failure in worm embryos, indicating that large SAS-5 assemblies are necessary for function. We propose that SAS-5 provides multivalent attachment sites that are critical for promoting assembly of SAS-6 into a cartwheel, and thus centriole formation.

For details, check out our latest paper 10.7554/eLife.07410!

@kbrogala

Top panel: cartoon overview of the proposed mechanism of centriole formation. In the cytoplasm, SAS-5 exists at low concentrations as a dimer, and each of those dimers can stochastically bind two molecules of SAS-6. Once the SAS-5 / SAS-6 complex is targeted to the centrioles, it starts to self-oligomerise. This self-oligomerisation of SAS-5 allows the attached molecules of SAS-6 to form a cartwheel. Bottom panel: detailed overview of the proposed process of centriole formation. In the cytoplasm, where the concentration of SAS-5 is low, the strong Implico domain (SAS-5 Imp, ZZ shape) of SAS-5 holds the molecule in a dimeric form. Each SAS-5 protomer can bind (through the disordered linker) to the coiled coil of dimeric SAS-6. Once the SAS-5 / SAS-6 complex is targeted to the site where a daughter centriole is to be created, SAS-5 forms higher-order oligomers through self-oligomerisation of its coiled coil domain (SAS-5 CC – triple horizontal bar). Such a large oligomer of SAS-5 provides multiple attachment sites for SAS-6 dimers in a very confined space. This results in a burst in the local concentration of SAS-6 through the avidity effect, allowing an otherwise weak oligomer of SAS-6 to also form larger species. Effectively, this seeds the growth of a cartwheel (or a spiral in C. elegans), which in turn serves as a template for a new centriole.

 

Investigating GPCR kink variation

G-protein coupled receptors (GPCRs) are the target of 50-60% of drugs, including many of those involved in the treatment of cancer and cardiovascular disease. Over 100 GPCR crystal structures are now available, but these are for only around 30 different receptors, and there are still hundreds more receptors for which no structure exists. There is huge diversity in the ligands which bind to GPCRs, so it may often be difficult to predict the shape of a binding pocket for a specific receptor of interest, especially if no close relatives have a structure solved.

Helix kinks (see previous blog posts) are a structural feature of GPCRs which are thought to be important for function. An ability to predict their presence and the magnitude of helix direction change is important for obtaining an accurate structure. A kink prediction method has already been used in the context of GPCR structure prediction, which scored the overall structures after replacing kink segments with others from a database. This made it possible to predict the change in a kink angle based on the stability of the whole GPCR structure.

To better inform this kind of modelling, we wanted to investigate specifically how much variation there is in kink angles between GPCRs. To do this we used the tool Kink Finder to measure angles in all of the transmembrane helices of the GPCRs in the GPCRDB, and estimate a confidence interval on those angles. Then we could state whether the variation that we see in GPCR kink angles is greater than what we would expect from measurement error alone.
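Conceptually, a kink angle can be estimated as the angle between axes fitted to the helix segments on either side of the kink. The sketch below uses a principal-component fit to Cα coordinates; Kink Finder itself fits cylinders and estimates the error on each angle more carefully, so treat this only as an illustration:

```python
import numpy as np

def segment_axis(ca_coords):
    """Approximate a helix segment's axis as the first principal component
    of its C-alpha coordinates, oriented from N- to C-terminus.
    (Kink Finder fits cylinders; this is a simplified stand-in.)"""
    coords = np.asarray(ca_coords, dtype=float)
    centred = coords - coords.mean(axis=0)
    axis = np.linalg.svd(centred)[2][0]
    if np.dot(axis, coords[-1] - coords[0]) < 0:
        axis = -axis
    return axis

def kink_angle(ca_before_kink, ca_after_kink):
    """Angle (degrees) between the axes fitted to the helix segments on
    either side of a putative kink residue."""
    v1 = segment_axis(ca_before_kink)
    v2 = segment_axis(ca_after_kink)
    return float(np.degrees(np.arccos(np.clip(np.dot(v1, v2), -1.0, 1.0))))
```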

Each helix appears to show different behaviour. Some helices were very well conserved, but others showed a huge amount of variation. For these helices with very variable angles, it would be interesting to know whether this reflects sequence differences, or conformational flexibility between more than one preferred conformation. We found an example where significantly different angles occur even within the same receptor. In this case, the size of the kink angle is related to whether the structure has an agonist or an antagonist bound, so we propose that this is a functionally relevant and flexible kink.

We also carried out the same analysis on helices from other families of membrane and soluble proteins, and found many more highly variable kinks (one example shown below). This shows that they should be a very important consideration when carrying out homology modelling, and that their conformational flexibility could also be important for function in many other contexts.

An example of a highly variable (non-conserved) kink.

Journal Club: The Origin of CDR H3 Structural Diversity

The antibody binding site is broadly composed of six hypervariable loops, the CDRs. There are three loops on the antibody light chain (L1, L2 and L3) and three loops on the antibody heavy chain (H1, H2 and H3).

Of the six loops, five (L1, L2, L3, H1 and H2) appear to adopt a constrained set of structural conformations. The conformations of H3 appear much less constrained, which has been suggested to be the result of its higher relative importance in antigen recognition (though this is not a necessary condition). The only general observation to date about the shape of CDR-H3 is the existence of extended and kinked conformations of its anchor.

The function of the kink was investigated recently by Weitzner et al. The authors contrasted the geometry found in antibody CDR-H3 loops with a set of 15k non-antibody polypeptides. They found that even though the extended conformation appears to be more favorable, the kinked one can also be found in many cases, particularly in PDZ domains.

Weitzner et al. find that the extended conformation is much more common in non-antibody loops. However, the kinked conformation, even though less frequent, is not outright rare. The situation is the opposite in antibodies, where the majority of H3 conformations are kinked rather than extended.

The authors contrasted the sequence patterns of kinked antibody loops and kinked non-antibody loops and did not find anything predictive of the kinked conformation — suggesting that the effect might be non-local. Nonetheless, the secondary structure pattern of the kinked H3 and the kinked non-antibody loops appears similar.

Even though there might be no sequence-kink link, the authors indicate how their findings might improve H3 structure prediction. They demonstrate that admixing the kinked non-antibody loops into a template dataset for an H3 modeling software might provide more relevant templates.

In conclusion, the main message of the paper (in my opinion) is the hypothesis it puts forward as to the role of the H3 kink. Since the kink is much more prevalent in H3 than in non-antibody proteins, there is a strong suggestion that it might have a special role. The authors suggest that the kinked conformation allows for more structural diversity, which would otherwise be restricted by the more rigid, beta-stranded extended conformation. Thus, antibodies might have opted for a system wherein they do not need dramatic mutations in H3 in order to gain structural flexibility.

 

A topology-based distance measure for network data

In last week’s group meeting, I introduced our network comparison method (Netdis) and presented some new results that enable the method to be applied to larger networks.

The most tractable methods for network comparison are those which compare at the level of the entire network using statistics that describe global properties, but these statistics are not sensitive enough to be able to reconstruct phylogeny or shed light on evolutionary processes. In contrast, there are several network alignment based methods that compare networks using the properties of the individual proteins (nodes) e.g. local network similarity and/or protein functional or sequence similarity. The aim of these methods is to identify matching proteins/nodes between networks and use these to identify exact or close sub-network matches. These methods are usually computationally intensive and tend to yield an alignment which contains only a relatively small proportion of the network, although this has been alleviated to some extent in more recent methods.

Thus, we do not follow the network alignment paradigm, but instead we take our lead from alignment-free sequence comparison methods that have been used to identify evolutionary relationships. Alignment-free methods based on k-tuple counts (also called k-grams or k-words) have been applied to construct trees from sequence data. A key feature is the standardisation of the counts to separate the signal from the background noise. Inspired by alignment-free sequence comparison we use subgraph counts instead of sequence homology or functional one-to-one matches to compare networks. Our proposed method, Netdis, compares the subgraph content not of the networks themselves but instead of the ensemble of all protein neighbourhoods (ego-networks) in each network, through an averaging many-to-many approach. The comparison between these ensembles is summarised in a Netdis value, which in turn is used as input for phylogenetic tree reconstruction.
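To give a flavour of the idea (this is an illustrative toy, not the actual Netdis statistic, which standardises counts of all 3-5 node subgraphs against a background expectation), one could compare the ego-network ensembles of two networks via their triangle content like this:

```python
import networkx as nx

def ego_triangle_counts(graph, radius=2):
    """Triangle count of every node's ego-network (a stand-in for the full
    set of 3-5 node subgraph counts used by Netdis)."""
    counts = []
    for node in graph:
        ego = nx.ego_graph(graph, node, radius=radius)
        counts.append(sum(nx.triangles(ego).values()) // 3)
    return counts

def toy_network_distance(graph_a, graph_b):
    """Toy comparison of two networks via their mean per-ego triangle
    counts; Netdis itself standardises the counts and combines all
    subgraph types into one statistic."""
    a, b = ego_triangle_counts(graph_a), ego_triangle_counts(graph_b)
    return abs(sum(a) / len(a) - sum(b) / len(b))
```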


Fig1: Effect of sub-sampling egos on the resulting grouping of networks generated by Netdis. Higher Rand index values indicate better fit to non-sampling results.

Extensive tests on simulated and empirical datasets show that it is not necessary to analyse all possible ego-networks within a network for Netdis to work. Our results indicate that, in general, randomly sampling around 10% of egos in each network results on average in a very similar clustering of networks compared to the tree with 100% sampling (Fig 1). This result has important implications for use-cases where extremely large graphs are to be compared (e.g. > 100,000 nodes). Related to the ego-network sub-sampling idea is the notion of size-limiting the ego-networks that are analysed by the algorithm. Our tests show that the vast majority of ego-networks in most networks have a relatively low coverage of the overall network. Moreover, by introducing a lower size threshold on the egos, we observe better results on average. Together, this means that only a limited range of ego-network sizes needs to be analysed for each network, which should lead to better statistical properties, as the sub-sampling scheme is inspired by bootstrapping.

Building accurate models of membrane protein structures

Today I gave a talk on the research project I started when I joined the group. My research focuses on modelling of membrane proteins (MPs). Membrane proteins are the main class of drug targets, and their mechanism of function is determined by their 3D structure. Almost 30% of the proteins in the sequenced genomes are membrane proteins, yet only ~2% of the experimentally determined structures in the PDB are membrane proteins. Therefore, computational methods have been introduced to deal with this limitation.

Homology modelling is one of the best performing computational methods, giving “accurate” models of proteins. Many homology modelling methods have been developed, with Modeller being one of the best known. However, these methods have been tested and customised primarily on soluble proteins, and there are fundamental physical differences between MPs and water-soluble proteins. Therefore, to build a homology modelling pipeline for membrane proteins, we need a pipeline that takes the unique environment of the membrane into account at every step.

Memoir is a tool for homology-based prediction of membrane protein structures (figure below). As input, Memoir takes a target sequence and a template structure. First, iMembrane is used to detect the lipid bilayer boundaries on the template. Using this information, MP-T, with its membrane-specific substitution matrices, aligns the target and template. Medeller is then used to build the core model, and finally FREAD, a fragment-based loop modelling method, fills in the missing loops.


Memoir Pipeline

The Memoir methodology builds accurate, but potentially incomplete, models. Homology modelling often entails a trade-off between the level of accuracy and the level of coverage that can be achieved in predicted models. We therefore aim to build Memoir 2.0, in which we increase coverage by modelling the missing structural information only where such prediction is sensible. To complete the models in the best way, we aim to:

  • 1. Examine the best ways to maximise FREAD coverage while maintaining accuracy
  • 2. Examine the best ab initio loop predictor for membrane proteins

FREAD has two main parameters which contribute to its accuracy and coverage: the nature of the database searched for a loop (membrane or soluble, mem/sol) and the choice of the sequence identity (ESSS) cut-off:

  • ESSS >= 25: more accurate loop models are built (Hiacc)
  • ESSS > 0: greater coverage is achieved, but the models are not necessarily accurate (Hicov)

To test the effect of these parameters on prediction accuracy and coverage, we used two test sets:

  • Mem_DS: 280 loops taken from membrane protein X-ray structures.
  • Model_DS: 156 loops from homology models of membrane proteins. The loop lengths in both sets range from 4 to 17 residues.

The comparisons on both datasets confirm that, to achieve the highest accuracy and coverage, the FREAD pipeline should be run in the following order (a minimal sketch of this cascade is given below):

  1. Hiacc-mem
  2. Hicov-mem
  3. Hiacc-sol
  4. Hicov-sol

Memoir with the new FREAD pipeline, called Memoir 2.0, achieves higher coverage than the original Memoir 1.0.
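The fallback cascade can be sketched as follows; `fread_search` is a hypothetical wrapper around a single FREAD run, not a real function of the tool:

```python
def fread_cascade(loop_target, fread_search):
    """Run the four FREAD configurations in decreasing order of expected
    accuracy, falling back to more permissive settings until a model is
    found. `fread_search(target, database, min_esss)` is a hypothetical
    wrapper returning a loop model or None."""
    stages = [
        ("mem", 25),  # 1. Hiacc-mem: membrane database, ESSS >= 25
        ("mem", 0),   # 2. Hicov-mem: membrane database, any ESSS
        ("sol", 25),  # 3. Hiacc-sol: soluble database, ESSS >= 25
        ("sol", 0),   # 4. Hicov-sol: soluble database, any ESSS
    ]
    for database, min_esss in stages:
        model = fread_search(loop_target, database, min_esss)
        if model is not None:
            return model
    return None  # loop left for the ab initio (Mechano) step
```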

But there are still loops which are not modelled by the FREAD pipeline. These loops should be modelled using an ab initio method. To test the performance of soluble-protein ab initio loop predictors on membrane proteins, we predicted the loops of the above test sets using six ab initio methods available for download: Loopy, LoopBuilder, Mechano, Rapper, Modeller and Plop.


    Comparison between ab initio methods on membrane proteins

The comparisons in the image above show that:

  • FREAD is more accurate, but doesn’t achieve complete coverage.
  • Greater coverage is achieved using ab initio predictors.
  • Mechano, LoopBuilder and Loopy are the best ab initio predictors.

We have selected Mechano for Memoir 2.0 because it:

  • provides higher coverage than Loopy whilst retaining a similar accuracy;
  • is faster than LoopBuilder (Mechano is ~30 min faster on loops of length 12);
  • is able to model terminals.

In Memoir 2.0, the C- and N-termini of up to 8 residues are built using Mechano. Mechano decoys are then ranked by their DFIRE score, and accepted only if they have exited the membrane. This check improves the average RMSD by up to 4Å on the DS_280 terminals.

In conclusion, Memoir 2.0 provides higher-coverage models while maintaining a reasonable level of accuracy. Our comparisons showed that FREAD is significantly more accurate than the ab initio methods, but that greater coverage is achieved using ab initio predictors. The comparison shows that the top ab initio predictors in terms of accuracy are Mechano, LoopBuilder and Loopy, and similar patterns were also present in the model dataset. We selected Mechano as it provides higher coverage than Loopy whilst retaining a similar accuracy, and is also much faster than LoopBuilder. Mechano has the further advantage that it is able to model terminals. Only loops smaller than 17 residues were considered for modelling, since above this threshold the accuracy of predicted loops drops significantly.

Hypotheses and Perspectives on de novo protein structure prediction

    Before I start with my musings about my work and the topic of my D. Phil thesis, I would like to direct you to a couple of previous entries here on BLOPIG. If you are completely new to the field of protein structure prediction or if you just need to refresh your brain a bit, here are two interesting pieces that may give you a bit of context:

    A very long introductory post about protein structure prediction

    and

    de novo Protein Structure Prediction software: an elegant “monkey with a typewriter”

    Brilliant! Now, we are ready to start.

    In this OPIG group meeting, I presented some results that were obtained during my long quest to predict protein structures.

    Of course, no good science can happen without the postulation of question-driving hypotheses. This is where I will start my scientific rant: the underlying hypotheses that inspired me to inquire, investigate, explore, analyse, and repeat. A process all so familiar to many.

    As previously discussed (you did read the previous posts as suggested, didn’t you?), de novo protein structure prediction is a very hard problem. Computational approaches often struggle to search the humongous conformational space efficiently. Who can blame them? The number of possible protein conformations is so astronomically large that it would take MUCH longer than the age of the universe to look at every single possible protein conformation.

If we go back to biology, protein molecules are constantly undergoing folding. What is more, they manage to do so efficiently and accurately. How is that possible? And can we use that information to improve our computational methods?

    The initial hypothesis we formulated in the course of my degree was the following:

    “We [the scientific community] can benefit from better understanding the context under which protein molecules are folding in vivo. We can use biology as a source of inspiration to improve existing methods that perform structure prediction.”

    Hence came the idea to look at biology and search for inspiration. [Side note: It is my personal belief that there should be a back and forth process, a communication, between computational methods and biology. Biology can inspire computational methods, which in turn can shed light on biological hypotheses that are hard to validate experimentally]

    To direct the search for biological inspiration, it was paramount to understand the limitations of current prediction methods. I have narrowed down the limitations of de novo protein structure prediction approaches to three major issues:

1- Heuristics that rely on sampling the conformational space using fragments extracted from known structures will fail when those fragments do not encompass or correctly describe the right answer.

    2- Even when the conformational space is reduced, say, to fragment space, the combinatorial problem persists. The energy landscape is rugged and unrepresentative of the actual in vivo landscape. Heuristics are not sampling the conformational space efficiently.

    3- Following from the previous point, the reason why the energy landscape is unrepresentative of the in vivo landscape is due to the inaccuracy of the knowledge-based potentials used in de novo structure prediction.

    Obviously, there are other relevant issues with de novo structure prediction. Nonetheless, I only have a limited amount of time for my D.Phil and those are the limitations I decided to focus on.

To address each of these limitations, we have looked to biology for inspiration.

Our understanding from looking at different protein structures is that several conformational constraints are imposed by alpha-helices and beta-strands. That is a consequence of hydrogen bond formation within secondary structure elements. Unsurprisingly, when looking for fragments that represent the correct structure of a protein, it is much easier to identify good fragments for alpha-helical or beta-strand regions. Loop regions, on the other hand, are much harder to describe correctly with fragments extracted from known structures. We have incorporated this important information into a fragment library generation software in an attempt to address limitation number 1.

    We have investigated the applicability of a biological hypothesis, cotranslational protein folding, into a structure prediction context. Cotranslational protein folding is the notion that some proteins begin their folding process as they are being synthesised. We further hypothesise that cotranslational protein folding restricts the conformational space, promoting the formation of energetically-favourable intermediates, thus steering the folding path towards the right conformation. This hypothesis has been tested in order to improve the efficiency of the heuristics used to search the conformational space.

Finally, following the current trend in protein structure prediction, we used evolutionary information to improve our knowledge-based potentials. Many methods now consider correlated mutations to improve their predictions, namely the idea that residues that mutate in a correlated fashion tend to be in spatial proximity in a protein structure. Multiple sequence alignments and elegant statistical techniques can be used to identify these correlated mutations. There is a substantial amount of evidence that this correlated evolution can significantly improve the output of structure prediction, leading us one step closer to solving the protein structure prediction problem. Incorporating this evolution-based information into our routine helped us address the lack of precision of existing energy potentials.
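As a toy illustration of the underlying signal (real methods additionally correct for phylogenetic bias and indirect couplings, which this does not), the mutual information between two alignment columns can be computed as follows:

```python
import math
from collections import Counter

def column_mutual_information(msa, i, j):
    """Mutual information between columns i and j of a multiple sequence
    alignment (a list of equal-length strings). High MI is a crude signal
    of correlated mutation between the two positions."""
    n = len(msa)
    pair_counts = Counter((seq[i], seq[j]) for seq in msa)
    counts_i = Counter(seq[i] for seq in msa)
    counts_j = Counter(seq[j] for seq in msa)
    mi = 0.0
    for (a, b), n_ab in pair_counts.items():
        p_ab = n_ab / n
        mi += p_ab * math.log(p_ab / ((counts_i[a] / n) * (counts_j[b] / n)))
    return mi
```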

Well, does it work? Surprisingly or not, in some cases it does! We have participated in a blind competition: the Critical Assessment of protein Structure Prediction (CASP). This event is rather unique and brings together the whole structure prediction community. It also enables the community to gauge how good we are at predicting protein structures. Working with completely blind predictions, we were able to produce one correct answer, which is a good thing (I guess).

All of this comes together nicely in our biologically inspired pipeline to predict protein structures. I like to think of our computational pipeline as a microscope. We can use it to prod and look at biology. We can tinker with hypotheses, implement potentials and test them, see what is useful for us and what isn’t. It may not be exactly what gets the papers published, but the investigative character of our structure prediction pipeline is definitely my favourite aspect of the work. It is the aspect that makes me feel like a scientist.

    Protein Structure Prediction, my own metaphorical microscope…

     

    Improving the accuracy of CDR-H3 structure prediction

When designing an antibody for therapeutic use, knowledge of the structure (in particular the binding site) is a huge advantage. Unfortunately, obtaining even one of these structures experimentally, for example by X-ray crystallography, is very difficult and time-consuming – researchers have therefore been turning to models.

The ‘framework’ regions of antibodies are well conserved between structures, and therefore homology modelling can be used successfully. However, problems arise when modelling the six loops that make up the antigen binding site – called the complementarity determining regions, or CDRs. For five of these loops, only a small number of conformations have actually been observed, forming a set of structural classes – these are known as canonical structures. The class that a CDR loop belongs to can be predicted from its sequence, making the prediction of these loops’ structures quite accurate. However, this is not the case for the H3 loop (the third CDR of the heavy chain) – there is much larger structural diversity, making H3 structure prediction a challenging problem.


    Antibody structure, showing the six CDR loops that make up the antigen binding site. The H3 loop is found in the centre of the binding site, shown in pink. PDB entry 1IGT.

    H3 structure modelling can be considered as a specific case of general protein loop modelling. Starting with the sequence of the loop, and the structure of the remaining parts of the protein, there are three stages in a loop modelling algorithm: conformational sampling, the filtering out of physically unlikely structures, and ranking. There are two types of loop modelling algorithm, which differ in the way they perform the conformational sampling step: knowledge-based methods, and ab initio methods. Knowledge-based methods use databases of known structures to produce loop conformations, while ab initio methods do this computationally, without knowledge of existing structures. My research involves the testing and development of these loop modelling algorithms, with the aim of improving the standard of H3 structure prediction.

    A knowledge-based method that I have tested is FREAD. FREAD uses a database of protein fragments that could possibly be used as loop structures. This database is searched, and possible structures are returned depending on the similarity of their sequence to the target sequence, and the similarity of the anchor structures (the two residues on either side of the loop). On a set of 55 unbound H3 loop targets, ranging between 8 and 18 residues long, FREAD (using a database of known H3 structures) produced an average best prediction RMSD of 2.7 Å (the ‘best’ prediction is the loop structure closest to the native of all those returned by FREAD). FREAD is obviously very sensitive to the availability of H3 structures: if no similar structure has been observed before, FREAD will either return a poor answer or fail to find any suitable fragments at all. For this reason there is huge variation in the FREAD results – for example, the best prediction for one target had an RMSD of 0.18 Å, while for another, the best RMSD was 10.69 Å. Fourteen of the targets were predicted with an RMSD of below 1 Å. The coverage for this particular set of targets was 80%, which means that FREAD failed to find an answer for one in five targets.
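For reference, the RMSD values quoted here can be computed directly, without superposition, because the loop models and the native loop share the same fixed framework; a minimal sketch, assuming matched N x 3 coordinate arrays:

```python
import numpy as np

def loop_rmsd(predicted_coords, native_coords):
    """RMSD (in Angstroms) between a predicted loop and the native loop.
    No superposition is performed: both loops sit in the frame of the
    same, fixed antibody framework, so a direct coordinate comparison
    of matched backbone atoms is appropriate."""
    predicted = np.asarray(predicted_coords, dtype=float)
    native = np.asarray(native_coords, dtype=float)
    return float(np.sqrt(np.mean(np.sum((predicted - native) ** 2, axis=1))))
```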

MECHANO is an ab initio algorithm that we have developed specifically for H3 loop prediction. Loops are built computationally, by adding residues sequentially onto one of the anchors. For each residue, φ/ψ dihedral angles are chosen at random from a distribution – the distributions used by MECHANO are residue-specific, and are a combination of general loop data and H3 loop data. Loop conformations are closed using a modified cyclic coordinate descent (CCD) algorithm, where the dihedrals of each residue are changed, one at a time, to minimise the distance between the free end of the loop and its anchor point, whilst keeping the dihedral angles in the allowed regions of the Ramachandran plot. I have tested MECHANO on the same set of targets as FREAD, generating 5000 loop conformations per target: the average best prediction RMSD was 2.1 Å, and the results showed a clear length dependence – this is expected, since the conformational space to explore becomes larger as the number of residues increases. Even though the average best prediction RMSD is better than that of FREAD, only one of the best RMSDs produced by MECHANO was sub-angstrom, compared to 14 for FREAD. Since the MECHANO algorithm does not depend on previously observed structures, predictions were made for all targets (i.e. coverage = 100%).

    My current work is focused upon developing a ‘hybrid’ method, which combines elements of the FREAD and MECHANO algorithms. In this way, we hope to make predictions with the accuracy that can be achieved by FREAD, whilst maintaining 100% coverage. In its current form, the hybrid method, when tested on the 55-loop dataset from before, produces an average best prediction RMSD of 1.68 Å, with 16 targets having a best RMSD of below 1 Å – a very promising result! However, possibly the most difficult part of loop prediction is the ranking of the generated loop structures; i.e. choosing the conformation that is closest to the native. This is therefore my next challenge!

    Looking for a null model of PPI ego-networks

Protein-protein interaction (PPI) networks describe how proteins are connected to one another by physical interactions. They can be used to aid our understanding of the individual roles of proteins (Sarajlić et al., 2013), the co-functioning properties of sets of proteins (West et al., 2013) and even the operation of the complete system (Janowski et al., 2014).

Different approaches have been proposed to analyse, describe and predict these PPI networks, such as network summary statistics, clustering methods, random graph models and machine learning methods. However, despite the large biological, computational and statistical interest in PPI networks, current models insufficiently describe them (Winterbach et al., 2013; Ali et al., 2014; Rito et al., 2010). It is commonly accepted that proteins usually perform functions in conjunction with other proteins, forming a functional module (Lewis et al., 2010). Hence, local structure is found to be important in protein-protein interaction networks (Planas-Iglesias et al., 2013).

    Here we address the modelling problem locally by modelling the ego-networks of PPI networks by means of random graph models.

    Random graph models

Loosely speaking, a random graph model is a set of rules that defines an edge generation process among a set of nodes. Usually this set of rules relates to particular characteristics that are embedded in the network generation process. Here are three examples of such characteristics:

  • Independence (each edge has a probability p of being present).
  • Preferential attachment (nodes form edges with highly interacting nodes).
  • Space-based interactions (an edge is present between two nodes if the distance between them is small).

A classical model representing an independence structure is the ER(n_v, p) model. This is a random graph on n_v nodes, where edges are present independently at random with probability p.

A realisation of the ER model.

Now, the preferential attachment characteristic can be illustrated by the Chung-Lu model. That is, given a sequence of expected weights \{d_1,d_2,...,d_{n_v}\}, the probability of an edge between nodes i and j is given by P((i,j)\in E)=d_i d_j / \sum_k d_k.

A realisation of the Chung-Lu model.

Finally, a model representing a space-based network generation process is the geometric model. Here, nodes are placed uniformly at random in the d-dimensional unit cube [0,1]^d. Then, given a radius or threshold distance r, edges are drawn between nodes v_i, v_j, i\neq j, if d(v_i,v_j)\leq r.

A realisation of the geometric model.

From the latter figures it can be seen that different models often lead to different network structures. Thus, although standard random graph models do not reproduce a network structure sufficiently similar to that of PPI networks, they could still be good approximations for different local regions of a PPI network.
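As a concrete illustration, small instances of all three models can be generated in a few lines with networkx (the parameter values below are arbitrary choices for demonstration):

```python
import networkx as nx

n = 100
er = nx.gnp_random_graph(n, p=0.05)              # independence: ER(n_v, p)
weights = [(i % 10) + 1 for i in range(n)]       # toy expected-degree sequence
chung_lu = nx.expected_degree_graph(weights, selfloops=False)  # Chung-Lu
geometric = nx.random_geometric_graph(n, radius=0.15)          # geometric model
```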


     

    Finding a null model for PPI ego-networks

Our approach consists of finding local regions of the PPI network that could be represented well by random graph models. To do so, we propose to extract all 2-step ego-networks and classify them according to some simple characteristic, for example network density.

Once the ego-networks of the PPI network have been extracted and binned according to their network density (ego-density), we assess the fit of a model to the PPI network by comparing each bin of PPI ego-networks to the ego-networks extracted from that random graph model. This comparison is made by looking at the difference in the number of subgraphs, triangles for example, in the ego-networks within each bin.
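A sketch of the binning step, using triangle counts as a stand-in for the full set of subgraph counts (the bin boundaries here are an illustrative choice, not the exact procedure):

```python
import networkx as nx

def binned_ego_triangle_counts(graph, radius=2, n_bins=10):
    """Extract every 2-step ego-network, bin it by its edge density and
    record its triangle count."""
    bins = [[] for _ in range(n_bins)]
    for node in graph:
        ego = nx.ego_graph(graph, node, radius=radius)
        idx = min(int(nx.density(ego) * n_bins), n_bins - 1)
        bins[idx].append(sum(nx.triangles(ego).values()) // 3)
    return bins

# The PPI network and a candidate null model are then compared bin by bin,
# e.g. via the difference in mean triangle counts within each density bin.
```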

    The following figure illustrates the underlying idea of this procedure:

     

Schematic of the ego-network binning and comparison procedure.

Following this approach, we aim to find bins for which possibly different models reproduce subgraph counts similar to those obtained from the PPI ego-networks. However, we expect to find bins for which none of the standard models fit.

    How do we measure translation speed?

Two major trains of thought have emerged on how one can define translation speed: one uses the cognate tRNA concentrations, and the other the codon bias within the genome. The former is a natural measure: the amount of cognate tRNA available to the ribosome for a given codon has been shown to affect translation. In the extreme case, when no cognate tRNA is available, the ribosome is even found to detach from the transcript after a period of waiting. The latter, codon bias, is the relative quantity of each codon within its synonymous group. The codons found most often are assumed to be optimal, as it is hypothesised that the genome will be optimised to produce proteins in the fastest, most efficient manner. Lastly, there is a newer third school of thought where one has to balance both the supply and the usage of any given codon: if a codon is overused, it will actually have a lower effective tRNA concentration than would be suggested by its tRNA gene copy number (an approximation of the tRNA’s concentration). Each of these three descriptions has been used in its own computational studies to show an association between speed, represented by that measure, and protein structure.
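As a small illustration of the codon-bias viewpoint, the relative frequency of each codon in a coding sequence can be tallied as follows (a proper codon-bias measure, such as RSCU, would additionally normalise within each synonymous group):

```python
from collections import Counter

def codon_frequencies(coding_sequence):
    """Relative frequency of each codon in an in-frame coding sequence.
    The codon-bias school of thought treats the most frequent codon of
    each synonymous group as the 'optimal' (fast) one."""
    seq = coding_sequence.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    if not codons:
        return {}
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: count / total for codon, count in counts.items()}
```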


    A simplified schematic of ribosome profiling. Ribosome profiling begins with separating a cell’s polysomes (mRNA with ribosomes attached) from its lysate. Erosion by nuclease digestion removes all mRNA not shielded by a ribosome while also cleaving ribosomes attached to the same mRNA strand. Subsequent removal of the ribosomes leaves behind only the mRNA fragments which were undergoing translation at the point of cell lysis. Mapping these fragments back to the genome gives a codon-level resolution transcriptome-wide overview of the translation occurring within the cell. From this we can infer the translation speed associated with any given codon from any given gene.

However, while these definitions have been in existence for the past few decades, there has until now been no objective way to test how accurately they actually measure translation speed. Essentially, we have based years of research on the extrapolation of a few coarse experiments, or in some cases purely theoretical models, to all translation. There now exists an experimental measure of the translation occurring in vivo. Ribosome profiling, outlined above, measures the translation occurring within a cell, mapping the position of the ribosome on the genome at the point of cell lysis. Averaging over many cells gives an accurate measure of the expected translation occurring on any given transcript at any time.


Comparing the log transformed ribosome profile data to the translation speed as defined by each of the algorithms for B. subtilis. We show the mean ribosome occupancy against the mean translation speed when stratified by codon, finding that the values assigned by each algorithm fail to capture the variation in the ribosome profiling data.

As an initial comparison, shown above, we compared some of the most popular speed measures based on the above descriptions to the ribosome profiling data. None of the measures were found to recreate the ribosome profiling data adequately. In fact, while some association is found, it is the opposite of what we would expect: the faster the codon according to the algorithm, the more likely a ribosome is to occupy it! We thought that this may be due to treating all the codons together instead of with respect to the genes they come from. Essentially, is a given codon actually fast if it is simply within a gene that is in general fast? To test for this, we created a set of models which account for a shift in the ribosome profile data depending on the source gene. However, these showed even less association with the speed algorithms!
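The per-codon comparison amounts to something like the following sketch, where the occupancy and speed inputs are assumed to have been extracted beforehand (hypothetical data structures, not our actual pipeline):

```python
import numpy as np
from scipy import stats

def per_codon_correlation(occupancy_by_codon, speed_by_codon):
    """Spearman correlation between mean ribosome occupancy and a predicted
    translation speed, stratified by codon.

    occupancy_by_codon: hypothetical dict mapping a codon to a list of
    (log-transformed) ribosome densities observed at that codon.
    speed_by_codon: hypothetical dict mapping a codon to the speed assigned
    by one of the algorithms. If an algorithm captured the profiling data,
    faster codons should show lower mean occupancy (negative correlation).
    """
    codons = sorted(set(occupancy_by_codon) & set(speed_by_codon))
    mean_occupancy = [np.mean(occupancy_by_codon[c]) for c in codons]
    speeds = [speed_by_codon[c] for c in codons]
    return stats.spearmanr(mean_occupancy, speeds)
```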

These findings suggest that the algorithms on which the scientific community has based its work for the past decades may in fact be poor representations of translation speed. This leads to a conundrum, however, as these measures have been used in experimental studies, notably the paper by Sander et al (see journal club entry here). In addition, codon bias matching has been used extensively to increase expression of non-native proteins in bacteria. Clearly these algorithms are a measure of something and, as such, this contradiction needs to be resolved in the near future.