Monthly Archives: November 2014

How do we measure translation speed?

Two major trains of thought have emerged on how to define translation speed: one uses the cognate tRNA concentrations, the other the codon bias within the genome. The former is a natural measure, as the amount of cognate tRNA available to the ribosome for a given codon has been shown to affect translation; in the extreme case, when no cognate tRNA is available, the ribosome has even been found to detach from the transcript after a period of waiting. The latter, the codon bias, is the relative quantity of each codon within its synonymous group. The most frequently used codons are assumed to be optimal, as it is hypothesised that the genome has been optimised to produce proteins in the fastest, most efficient manner. Lastly, there is a newer third school of thought in which one has to balance both the supply and the usage of any given codon: if a codon is overused, it will actually have a lower tRNA concentration than would be suggested by its tRNA gene copy numbers (an approximation of the tRNA’s concentration). Each of these three descriptions has been used in its own computational studies to show an association between the speed, represented by that measure, and protein structure.
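As a rough illustration of the codon-bias idea, the relative usage of each codon within its synonymous group can be tallied in a few lines of Python. This is only a sketch: the coding sequences and the alanine-only codon table are invented for illustration, and a real analysis would use every open reading frame and every synonymous group.

from collections import Counter

# Hypothetical coding sequences; in practice these would be all ORFs in the genome.
coding_sequences = ["ATGGCTGCAGCGGCCTAA", "ATGGCAGCTGCTGCATAA"]

# One synonymous group only (alanine), for brevity.
synonymous = {"GCT": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala"}

counts = Counter()
for seq in coding_sequences:
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in synonymous:
            counts[codon] += 1

# Relative usage within the synonymous group: under the codon-bias hypothesis,
# the most frequently used codon is taken to be the optimal (fastest) one.
total = sum(counts.values())
for codon, n in counts.most_common():
    print(codon, round(n / total, 2))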

A simplified schematic of ribosome profiling. Ribosome profiling begins with separating a cell’s polysomes (mRNA with ribosomes attached) from its lysate. Erosion by nuclease digestion removes all mRNA not shielded by a ribosome while also cleaving apart ribosomes attached to the same mRNA strand. Subsequent removal of the ribosomes leaves behind only the mRNA fragments that were undergoing translation at the point of cell lysis. Mapping these fragments back to the genome gives a codon-level-resolution, transcriptome-wide overview of the translation occurring within the cell. From this we can infer the translation speed associated with any given codon from any given gene.

However, while these definitions have existed for decades, there has been no objective way, until now, to test how accurately they actually measure translation speed. Essentially, we have based years of research on the extrapolation of a few coarse experiments, or in some cases purely theoretical models, to all translation. There now exists an experimental measure of the translation occurring in vivo. Ribosome profiling, outlined above, measures the translation occurring within a cell, mapping the positions of the ribosomes on the genome at the point of cell lysis. Averaging over many cells gives an accurate measure of the expected translation occurring on any given transcript at any time.
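To give a flavour of how such data are summarised, the mean ribosome occupancy of each codon can be obtained by mapping footprint counts onto codon positions and averaging. This is only a sketch; the per-nucleotide footprint counts and the coding sequence below are invented.

from collections import defaultdict

# Invented inputs: per-nucleotide footprint counts for one transcript and its CDS.
footprint_counts = [0, 0, 3, 3, 3, 5, 5, 5, 1, 1, 1, 0]
cds = "ATGGCTGCAAAA"

occupancy = defaultdict(list)
for i in range(0, len(cds) - 2, 3):
    codon = cds[i:i + 3]
    # Total footprint signal over the three nucleotides of this codon.
    occupancy[codon].append(sum(footprint_counts[i:i + 3]))

# Mean occupancy per codon: higher occupancy suggests slower translation.
for codon, values in occupancy.items():
    print(codon, sum(values) / len(values))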

Comparing the log-transformed ribosome profiling data to the translation speed as defined by each of the algorithms for B. subtilis. We show the mean ribosome occupancy against the mean translation speed when stratified by codon, finding that the values assigned by each algorithm fail to capture the variation in the ribosome profiling data.

As an initial comparison, shown above, we compared some of the most popular speed measures based on the above descriptions to the ribosome profiling data. None of the measures was found to recreate the ribosome profiling data adequately. In fact, while some association is found, it is the opposite of what we would expect: the faster a codon is according to the algorithm, the more likely a ribosome is to occupy it! We thought that this might be due to treating all the codons together rather than with respect to the genes they come from. Essentially, is a given codon actually fast, or is it simply within a gene that is fast in general? To test for this, we created a set of models that account for a shift in the ribosome profile depending on the source gene. However, these showed even less association with the speed algorithms!
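The comparison itself essentially amounts to correlating the per-codon mean of the log-transformed profiling signal with the speed value each algorithm assigns to that codon. A minimal sketch of this stratification, with all numbers invented, might look like this:

from scipy.stats import spearmanr

# Invented data: log-occupancy observations grouped by codon, and the speed
# assigned to each codon by one of the algorithms.
log_occupancy = {
    "GCT": [1.2, 0.9, 1.1],
    "GCC": [0.4, 0.6, 0.5],
    "GCA": [1.8, 2.0, 1.7],
    "GCG": [1.0, 1.3, 1.1],
}
assigned_speed = {"GCT": 0.8, "GCC": 0.9, "GCA": 0.3, "GCG": 0.5}

codons = sorted(log_occupancy)
mean_occupancy = [sum(log_occupancy[c]) / len(log_occupancy[c]) for c in codons]
speeds = [assigned_speed[c] for c in codons]

# If fast codons really were occupied less, a clear negative correlation
# would be expected here.
rho, p_value = spearmanr(mean_occupancy, speeds)
print(rho, p_value)

A gene-level version of the same idea can be obtained by subtracting each gene's mean log occupancy before stratifying by codon, which is in the spirit of the gene-shift models mentioned above.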

These findings suggest that the algorithms the scientific community has based its work on for the past decades may in fact be poor representations of translation speed. This leads to a conundrum, however, as these measures have been used in experimental studies, notably the paper by Sander et al. (see the journal club entry here). In addition, codon bias matching has been used extensively to increase the expression of non-native proteins in bacteria. Clearly these algorithms are a measure of something and, as such, this contradiction needs to be resolved in the near future.

How to (not) perform a Molecular Dynamics simulation study

At this week’s group meeting I presented my recently published paper:

Knapp B, Dunbar J, Deane CM. Large scale characterization of the LC13 TCR and HLA-B8 structural landscape in reaction to 172 altered peptide ligands: a molecular dynamics simulation study. PLoS Comput Biol. 2014 Aug 7;10(8):e1003748. doi: 10.1371/journal.pcbi.1003748.

This paper was presented on the front page of PLoS Computational Biology:

[Image: PLoS Computational Biology cover]

The paper describes Molecular Dynamics simulations (MD) of T-cell receptor (TCR) / peptide / MHC complexes.

MD simulations calculate the movement of the atoms of a given system over time. The zoomed-in movements of the TCR regions (the CDRs) recognizing the peptide/MHC are shown here:
[Animation: zoomed-in movements of the CDRs during the simulation]

Often scientists perform MD simulations of two highly similar complexes with different immunological outcomes. An example is the Epstein-Barr virus peptide FLRGRAYGL, which induces an immune response when presented by HLA-B*08:01 to the LC13 TCR. If position 7 of the peptide is mutated to alanine, the peptide no longer induces an immune reaction.
In both simulations the TCR and MHC are identical; only the peptide differs, in a single amino acid. Once the simulations of those two complexes are done, one could, for example, calculate the root mean square fluctuation (RMSF) of both peptides to see whether one of them is more flexible. And indeed this is the case:

[Figure: RMSF of the two peptides in the 1 vs 1 comparison]
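For reference, the per-residue RMSF is simply the root mean square deviation of each atom from its average position over the trajectory. Assuming the frames have already been superposed onto a common reference, it can be computed with plain numpy; the random array below is only a stand-in for real simulation coordinates.

import numpy as np

# Stand-in trajectory: n_frames x n_atoms x 3 coordinates, assumed to be
# already superposed onto a reference structure.
rng = np.random.default_rng(0)
coords = rng.normal(size=(1000, 9, 3))

mean_structure = coords.mean(axis=0)
# RMSF per atom: square root of the time-averaged squared deviation
# from the average position.
rmsf = np.sqrt(((coords - mean_structure) ** 2).sum(axis=-1).mean(axis=0))
print(rmsf)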

Another type of analysis would be the solvent accessible surface area (SASA) of the peptide or, as shown below, of the CDR regions of the TCR beta chain:

[Figure: SASA of the TCR beta chain CDR regions in the 1 vs 1 comparison]

… and again it is obvious that those two simulations strongly differ from each other.
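The SASA analysis can be sketched in a similar spirit. Here mdtraj's Shrake-Rupley implementation is used as one possible tool (not necessarily the one used in the paper), and the file names and residue range are placeholders.

import mdtraj as md

# Placeholder file names and a hypothetical CDR residue range.
traj = md.load("tcr_pmhc.xtc", top="tcr_pmhc.pdb")
cdr_atoms = traj.topology.select("resid 95 to 107")

# Shrake-Rupley SASA per atom (nm^2), summed over the CDR atoms in each frame.
sasa = md.shrake_rupley(traj)
cdr_sasa = sasa[:, cdr_atoms].sum(axis=1)
print(cdr_sasa.mean(), cdr_sasa.std())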

Is it then a good idea to publish an article about these findings and state something like “Our results indicate that peptide flexibility and CDR solvent accessible surface area play an important role in the activation of a T cell”?

As we know from our statistics courses, a 1 vs 1 comparison is an anecdotal example, and such findings often do not hold true for larger sample sizes.

So, let’s try it with many more peptides and simulations (performing these simulations took almost 6 months on the Oxford Advanced Research Computing facility). We split the simulations into two categories: those with more immunogenic peptides and those with less immunogenic peptides:

[Figure: overview of all simulations, split into more and less immunogenic groups]

Let’s see what happens to the difference in RMSF that we found before:

[Figure: RMSF across all simulations]

So much for “more immunogenic peptides are more/less flexible than less immunogenic ones” …

How about the SASA that we “found” before:

[Figure: SASA across all simulations]

… again almost perfectly identical for larger sample sizes.

And yes, we have found some minor differences between more and less immunogenic peptides that hold true for larger sample sizes, but none of them was striking enough to predict peptide immunogenicity based on MD simulations.

What one really learns from this study is that you should not compare one MD simulation against one other MD simulation, as this will almost certainly lead to false positive findings. Exactly the same applies to experimental data such as X-ray structures, because this is a statistical problem rather than one specific to simulations. If you want to make sustainable claims about something you have found, then you will need a large sample size, and a proper statistical sample size estimation and power analysis (as done in clinical studies) before you run any simulations. Comparing structures always involves massive multiple testing, and therefore large sample sizes are needed to draw valid conclusions.
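To make this concrete, a sample-size estimate and a multiple-testing correction of the kind implied here might look as follows; statsmodels is used only as an example, and the effect size and p-values are invented.

from statsmodels.stats.power import TTestIndPower
from statsmodels.stats.multitest import multipletests

# Simulations per group needed to detect a medium effect (d = 0.5)
# at alpha = 0.05 with 80% power; the effect size is an assumption.
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(round(n_per_group))

# Benjamini-Hochberg correction for comparing many observables at once
# (invented p-values).
p_values = [0.001, 0.01, 0.04, 0.2, 0.5]
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(list(reject), [round(p, 3) for p in p_adjusted])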

Research Talk: Ligand Fitting in X-ray Crystallography

In the last group meeting, I reported on the success of ligand-fitting programs for the automated solution of ligand structures.

In fragment screens by X-ray crystallography, a library of small compounds (fragments) is soaked into protein crystals, and the resulting structures are determined by diffraction experiments. Some of the fragments (~5% of the library) will bind to the protein, and these are detected by their appearance in the derived electron density.

The models of the binding fragments can be used to guide structure-based drug design efforts, but first they must be built. Due to the large number of datasets (200–1000), automated identification of the fragments that bind and automated building of the atomic models are required for efficient processing of the data.

[Figure: electron density blobs]

Anecdotally, the available ligand-fitting programs are unreliable when modelling fragments. We tested three ligand-fitting programs by refitting a series of ligand structures and found that they fail more frequently when the electron density for the ligand is weak. Many fragments that are seen to bind in screens do so only weakly, due to their small size, so the weaker a fragment binds, the harder it will be for the automated programs to model it.

[Figure: success rates at identifying the correct model]

Models are usually ranked by the Real-Space Correlation Coefficient (RSCC) between the model and the experimental electron density. This metric is good at identifying ‘correct’ models, and an RSCC > 0.7 normally indicates a correct, or at least mostly correct, model.
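In essence, the RSCC is a correlation between the observed electron density and the density calculated from the model, evaluated over the grid points around the ligand. A toy version, with random numbers standing in for real map values, would be:

import numpy as np

# Stand-in density values around the ligand: observed map and model-calculated map.
rng = np.random.default_rng(1)
rho_obs = rng.normal(size=500)
rho_calc = 0.8 * rho_obs + rng.normal(scale=0.5, size=500)

# RSCC as the Pearson correlation between the two sets of density values.
rscc = np.corrcoef(rho_obs, rho_calc)[0, 1]
print(rscc, "looks correct" if rscc > 0.7 else "inspect manually")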

Typically, the binding locations of ligands are found by searching for un-modelled peaks in the electron density map. Models are then generated at these locations, scored and ranked, and good models can be identified and presented to the user. However, if a ‘good’ model is never generated to be scored and ranked, the RSCCs of the ‘bad’ models will not tell you that there is something to be modelled at a particular place, and binding may be missed…

This is especially true for weak-binding ligands, which will not give a large electron density peak as evidence that there is something there to be modelled.

Currently, all of the datasets must be inspected manually, to check that a weak-binding fragment has not been missed…

Augmented Modelling with Natural Move Monte Carlo Simulations

In the last group meeting I reported on the progress that I have made regarding the development of a protocol for the systematic use of Natural Move Monte Carlo simulations.

Natural Move Monte Carlo simulations
Natural Moves are degrees of freedom that describe the collective motion of groups of residues. In DNA this might be the concerted motion of a double helix; in proteins it could be the movement of a stable secondary structure element such as a beta-sheet. These segments are joined by so-called melting areas. At each simulation step the segments are propagated independently in a Monte Carlo fashion, and the resulting chain breaks are resolved by a chain closure algorithm that acts on the melting areas. This reduces the number of degrees of freedom by several orders of magnitude, so large complexes and conformational changes can be sampled more effectively.
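As a toy sketch of the propagation step only (the chain-closure step and any energy evaluation are omitted, and the coordinates are random stand-ins), each segment can be moved as a rigid body along its six degrees of freedom:

import numpy as np

def propose_segment_move(coords, rng, max_angle=0.05, max_shift=0.2):
    # Rigid-body proposal for one segment: three rotation angles and a
    # translation vector, i.e. six degrees of freedom per segment.
    a, b, c = rng.uniform(-max_angle, max_angle, size=3)
    rx = np.array([[1, 0, 0], [0, np.cos(a), -np.sin(a)], [0, np.sin(a), np.cos(a)]])
    ry = np.array([[np.cos(b), 0, np.sin(b)], [0, 1, 0], [-np.sin(b), 0, np.cos(b)]])
    rz = np.array([[np.cos(c), -np.sin(c), 0], [np.sin(c), np.cos(c), 0], [0, 0, 1]])
    shift = rng.uniform(-max_shift, max_shift, size=3)
    centre = coords.mean(axis=0)
    return (coords - centre) @ (rz @ ry @ rx).T + centre + shift

rng = np.random.default_rng(0)
segment = rng.normal(size=(20, 3))          # stand-in coordinates of one segment
moved = propose_segment_move(segment, rng)  # chain closure at the melting areas not shown
print(np.abs(moved - segment).max())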

In order to get sensible results, however, the initial decomposition of the system is important. The challenge is to accurately represent the plasticity of the system, while keeping the number of degrees of freedom as small as possible. Detailed insight into the flexibility of the system might be gained from experimental sources such as NMR or computational methods such as MD simulations and Normal Mode Analysis. This can help with defining segments and melting areas. However, there are many systems for which this data is not available. Even if it is, there is no guarantee that the segmentation is correct.

Therefore, I am developing a protocol that allows for the evaluation of a range of different test cases that each reflect a unique set of segments and melting areas.

Augmented Modelling Protocol
This protocol is aimed at the systematic evaluation of NMMC segmentations. It allows researchers to feed experimental information, biological knowledge and educated guesses into molecular simulations and so provides a framework for testing competing hypotheses. The protocol has four steps.

Step 1: Segmentation of the system into low-level segments
The initial segmentation contains all possible areas of flexibility that may play a role in conformational changes in the system of interest. This decision may be influenced by many sources. For now, however, we only consider secondary structure information. Helices and beta strands are treated as potential segments. Unstructured regions such as kinks, loops and random coils are treated as melting areas. For a small fold with four helices we get the segmentation shown in figure 1a.
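A minimal sketch of this first step, assuming a per-residue secondary structure string is available (for example from DSSP; the string below is invented), could be:

from itertools import groupby

# Invented secondary structure string: H = helix, E = strand,
# anything else = unstructured.
ss = "--HHHHHH---HHHHH--HHHHHH----HHHH--"

segments, melting_areas, pos = [], [], 0
for kind, run in groupby(ss, key=lambda s: s if s in "HE" else "-"):
    length = len(list(run))
    region = (pos, pos + length - 1)            # residue index range
    (segments if kind in "HE" else melting_areas).append(region)
    pos += length

print("segments:", segments)
print("melting areas:", melting_areas)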

Step 2: Formulate test cases
Generate multiple test cases that reflect hypotheses about the mechanism of interest. In this step we try to narrow down the degrees of freedom as much as possible in order to retain sampling efficiency. This is done by selectively deactivating some of the melting areas that were defined in step 1. For a system with three melting areas that can each be either on or off, 2^3 = 8 different test cases may be generated (an example is shown in figure 1b).

Figure 1 a) Segmentation of a small α-fold. The blue rectangles represent α-helices. The dashed lines indicate the presence of melting areas I, II and III. Each melting area can be switched on or off (1/0). b) Example of a test case in which the first of the three melting areas is switched off. c) The six degrees of freedom along which a segment is propagated.
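The enumeration of test cases described in step 2 is then straightforward; for example, with three melting areas that can each be switched on or off:

from itertools import product

melting_areas = ["I", "II", "III"]
# Every on/off combination of the melting areas: 2**3 = 8 test cases.
for case in product([0, 1], repeat=len(melting_areas)):
    print(dict(zip(melting_areas, case)))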

Step 3: Perform simulations
Sample the conformational space of all test cases that were generated in step 2. We generally use Parallel Tempering or Simulated Tempering algorithms to accelerate the sampling process. These methods rely on the modulation of temperature to overcome energy barriers.
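For orientation, the replica-exchange (Parallel Tempering) swap criterion at the heart of these methods is a simple Metropolis test on the energies and temperatures of two replicas; this is a minimal sketch with invented numbers and k_B set to 1.

import math
import random

def accept_swap(energy_i, energy_j, temp_i, temp_j):
    # Metropolis criterion for exchanging configurations between two replicas
    # at temperatures temp_i and temp_j (k_B = 1).
    delta = (1.0 / temp_i - 1.0 / temp_j) * (energy_i - energy_j)
    return delta >= 0 or random.random() < math.exp(delta)

# Toy usage with invented energies and temperatures.
print(accept_swap(energy_i=-120.0, energy_j=-118.0, temp_i=300.0, temp_j=330.0))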

Step 4: Evaluate results
Score the results against a given control and rank the test cases accordingly. The scoring might be done by comparing experimental distributions of observables with those generated by the simulations (e.g. using the Kullback-Leibler divergence). A test case that reproduces the desired expectation values of the observables might then be considered a candidate hypothesis for a certain structural mechanism.
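A sketch of such a Kullback-Leibler comparison, with invented samples standing in for the experimental and simulated distributions of one observable, might be:

import numpy as np
from scipy.stats import entropy

# Invented samples of one observable (e.g. a distance) from experiment and
# from one simulated test case, binned onto a common grid.
rng = np.random.default_rng(2)
experimental = rng.normal(10.0, 1.0, size=5000)
simulated = rng.normal(10.3, 1.2, size=5000)
bins = np.linspace(5, 15, 41)
p, _ = np.histogram(experimental, bins=bins, density=True)
q, _ = np.histogram(simulated, bins=bins, density=True)

# Kullback-Leibler divergence D(P || Q); a small pseudocount avoids empty bins.
kl = entropy(p + 1e-9, q + 1e-9)
print(kl)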

What’s next?
I am currently working on example uses for this protocol. These include questions regarding aspects of protein folding and the stability of the empty MHC II binding groove.