Racing along transcripts: Correlating ribosome profiling and protein structure.

A long, long time ago, in a galaxy far, far away, I gave a presentation about the state of my research to the group (can you tell I’m excited for the new Star Wars?). Since then, little has changed due to my absence from Oxford, which means ((un)luckily) the state of the work is by and large the same. My work focuses on the effect that the translation speed of a given mRNA sequence can have on the eventual protein product, specifically through the phenomenon of cotranslational folding. I’ve discussed the evidence behind this in prior posts (see here and here), though I find the video below a good reminder of why we can’t always just go as fast as we like.

So, given that translation speed is important, how do we actually measure it? Traditional measures, such as tAI and CAI, infer it from the codon bias within the genome or by comparing the counts of tRNA genes in a genome. However, while these have been shown to relate somewhat to speed, they are still purely theoretical in their construction. An alternative is ribosome profiling, which I’ve discussed in depth before (see here), and which provides an actual experimental measure of the time taken to translate each codon in an mRNA sequence. In my latest work, I have compiled ribosome profiling data from 7 different experiments covering 6 diverse organisms, and processed them all in the same fashion from their respective raw data. Combined, the dataset gives ribosome profiling “speed” values for approximately 25 thousand genes across the various organisms.
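As a toy illustration of how a theoretical measure can be compared against profiling-derived speeds, one can compute a per-gene rank (Spearman) correlation by hand; all gene values below are invented, not the actual dataset:

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.
    (Ties are not handled; fine for illustration.)"""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

# Invented per-gene values: a theoretical measure (e.g. tAI) versus a
# profiling-derived speed for five hypothetical genes.
tai_scores = [0.31, 0.55, 0.12, 0.78, 0.44]
profiling_speed = [1.2, 2.5, 0.8, 2.1, 1.9]
print(spearman(tai_scores, profiling_speed))  # 0.9: strongly, but not perfectly, rank-correlated
```

In practice one would use a library routine (e.g. `scipy.stats.spearmanr`, which handles ties); the point is only that each traditional measure yields one such number per comparison.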

Figure: correlations between the traditional measures and the ribosome profiling data.

Our first task with this dataset was to see how well the traditional measures compare to the ribosome profiling data. For this, we calculated the correlation against CAI, MinMax, nTE and tAI, with the results presented in the figure above. We find that essentially no measure adequately captures the entirety of the translation speed: some measures fail completely, others obviously capture some part of the behaviour, and some even predict the reverse! Given that no measure captured the behaviour adequately, we realised that existing results relating translation speed to protein structure may, in fact, be wrong. Thus, we decided to recreate those analyses using our dataset to either validate or correct the original observations. To do this we combined our ribosome profiling dataset with matching PDB structures, such that we had the sequence, the structure, and the translation speed for approximately 4500 genes over the 6 species. While I won’t go into details here (see upcoming paper – touch wood), we analysed the relationship between speed and solvent accessibility, secondary structure, and linker regions. We found striking differences from the observations in the literature, which I’ll be excited to share in the near future.

Journal Club: Mechanical force releases nascent chain-mediated ribosome arrest in vitro and in vivo

For this week’s journal club, I presented the paper by Goldman et al., “Mechanical force releases nascent chain-mediated ribosome arrest in vitro and in vivo”. I chose this paper because it discusses an influence on protein folding/creation/translation that is not considered in any of today’s modelling efforts, and I think it is massively important that, every so often, we as a community step back and appreciate the complexity of the system we attempt to understand. This work focuses on the SecM protein, which is known to regulate SecA (part of the translocon), which in turn regulates SecM. The biomechanical manner in which this regulation takes place is not fully understood. However, SecM contains within its sequence a peptide motif that binds so strongly to the ribosome tunnel wall that translation is stopped. It is hypothesised that SecA regulates SecM by applying a force to the nascent chain to pull it past this stalling point and, hence, allow translation to continue.

To begin their study, Goldman and colleagues wanted to confirm that one could advance past the stall point merely by applying force, which they did by attaching the nascent chain to optical tweezers and the ribosome to a micropipette. To confirm that the system was stalled before applying a larger force, they created a sequence that included CaM, a protein which hops repeatedly between a folded and an unfolded state when pulled at 7 pN, followed by the section of SecM which causes the stalling. The tweezers were able to sense the slight variations in length at 7 pN from the unfolding and refolding of CaM, but no continuing extension, which would indicate translation, was found; this showed that the system had truly stalled due to the SecM sequence. Once at this point, Goldman increased the applied force, whereupon the distance between the pipette and the optical tweezers slowly increased until detachment when the stop codon was reached. As well as confirming that force on the nascent chain could push the SecM system past the stalling point, they also noted a force dependence in the speed with which it overcame this barrier.

Protein folding near the ribosome tunnel exit can rescue SecM-mediated stalling

With this force dependence established, they pondered whether a domain folding upstream of the stall point could generate enough force to cause translation to continue. To investigate, Goldman created a protein containing Top7, followed by a linker of variable length, followed by the SecM stalling motif, which was in turn followed by GFP. As shown in the figure above, altering the length of the linker region defines the location of Top7 while it attempts to fold. A long linker allows Top7 to fold completely clear of the ribosome tunnel. A short linker means it can’t fold, because many of its residues are still inside the ribosome tunnel. Between these extremes, however, the protein may have only a few residues within the tunnel, and by stretching the nascent chain it may access them and so be able to fold. Top7 was chosen specifically because it is known to fold even under light force. Hence, by Newton’s third law, as Top7 folds while its C terminus is still under strain into the ribosome, it generates an equal and opposite force on the stalling peptide sequence in the heart of the ribosome tunnel, which should allow translation to proceed past the stall. Crucially, if Top7 folds too far away from the ribosome, this interaction does not occur and translation does not continue.

Goldman’s experiments showed that this is in fact the case; they found that only linkers of 15 to 22 amino acids would successfully complete translation. This confirms that a protein folding at the mouth of the ribosome tunnel can generate a sizeable force (they calculate roughly 12 pN in this instance). I find this whole system especially interesting as I wonder how it may generalise to all translation, both in terms of interactions of the nascent chain with the tunnel wall and of domains folding at the ribosome tunnel mouth. Should I consider these when I calculate translation speeds, for example? Oh well, we need a reasonable model for translation that ignores these special cases before I really need to worry!

A designed conformational shift to control protein binding specificity


Proteins and their binding partners form complexes through complementary shapes. Fischer was onto something when he introduced the “lock and key” mechanism in 1894: for the first time, the shape of molecules was considered crucial for molecular recognition. Since then there have been various attempts to improve upon this theory in order to incorporate the structural diversity of proteins and their binding partners.

Koshland proposed the “induced fit” mechanism in 1958, which states that interacting partners undergo local conformational changes upon complex formation to strengthen binding. An additional mechanism, “conformational selection”, was introduced by Monod, Wyman and Changeux, who argued that the conformational change occurs before binding, driven by the inherent conformational heterogeneity of proteins. Given a protein that fluctuates between two states, A and B, and a substrate C that only interacts with one of these states, the chance of complex formation depends on the probability of the protein being in state A or B. Furthermore, one could imagine a scenario in which a protein has multiple binding partners, each binding to a different conformational state. This means that some proteins exist in an equilibrium of different structural states, which determines the prevalence of interactions with different binding partners.
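The arithmetic behind this picture is simple Boltzmann statistics; here is a minimal sketch of the two-state populations (the free energy values are illustrative, not from the paper):

```python
import math

R = 8.314e-3  # gas constant in kJ/(mol*K)

def population_A(dG_BA, T=298.0):
    """Fraction of a two-state protein found in state A, where
    dG_BA = G_B - G_A in kJ/mol (positive => A is more stable)."""
    return 1.0 / (1.0 + math.exp(-dG_BA / (R * T)))

print(population_A(0.0))            # 0.5: equal free energies, equal populations
print(round(population_A(5.7), 2))  # 0.91: a ~1 kcal/mol shift already skews the equilibrium
```

A mutation that changes this free energy gap by even a few kJ/mol therefore noticeably re-weights which binding-competent state a partner is most likely to encounter.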

Figure1

Figure 1. The “pincer mode”.

Based on this observation, Michielssens et al. used various in silico methods to manipulate the populations of binding-competent states of ubiquitin in order to change its protein-binding behaviour. Ubiquitin is known to adopt two equally visited states along the “pincer mode” (the collective motion describing the first PCA eigenvector): closed and open.


Figure 2. A schematic of the conformational equilibrium of ubiquitin, which can take on a closed or an open state. Depending on its conformation, it can bind different substrates.

Different binding partners prefer either the closed, open or both states. By introducing point mutations in the core of ubiquitin, away from the binding interface, Michielssens et al. managed to shift the conformational equilibrium between open and closed states, thereby changing binding specificity.

Point mutations were selected according to the following criteria:

⁃ residues must be located in the hydrophobic core
⁃ binding interface must be unchanged by the mutation
⁃ only hydrophobic residues may be introduced (as well as serine/threonine)
⁃ glycine and tryptophan were excluded because of their size


Figure 3. Conformational preference of ubiquitin mutants. ddG_mut = dG_open – dG_closed.

Fast growth thermodynamic integration (FGTI), a method based on molecular dynamics, was used to calculate the relative de-/stabilisation of the open/closed state caused by each mutation. Interestingly, most mutations that stabilised the open state were concentrated on one residue, isoleucine 36.
For the 15 most significant mutations, a complete free energy profile was calculated using umbrella sampling.
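The FGTI protocol in the paper is considerably more involved, but the core of fast-growth methods is an exponential average of many short nonequilibrium work measurements (Jarzynski’s equality); a minimal sketch with invented work values:

```python
import numpy as np

def jarzynski_dG(work, kT=2.48):
    """Free-energy difference from nonequilibrium work values via
    Jarzynski's equality: dG = -kT * ln(<exp(-W/kT)>).
    A log-sum-exp shift keeps the average numerically stable.
    kT defaults to ~2.48 kJ/mol (298 K)."""
    w = np.asarray(work, dtype=float) / kT
    m = w.min()
    return float(kT * (m - np.log(np.mean(np.exp(-(w - m))))))

# If every fast-growth pull did the same work, dG equals that work...
print(jarzynski_dG([5.0, 5.0, 5.0]))
# ...while scatter (dissipation) pulls the estimate below the mean work.
print(jarzynski_dG([4.0, 5.0, 6.0]))
```

The exponential average is dominated by the rare low-work trajectories, which is why many repeats are needed in practice.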


Figure 4. Free energy profiles for six different ubiquitin mutants, calculated using umbrella sampling simulations. Mutants preferring the closed substate are shown in red, open substate stabilizing mutants are depicted in blue, those without a preference are shown in gray.


Figure 5. Prediction of binding free energy differences between wild-type ubiquitin and different point mutations (ddG_binding = dG_binding,mutant – dG_binding,wild-type).

To further validate that they had correctly categorised their mutants as stabilising the open or closed state, six X-ray structures of ubiquitin in complex with a binding partner that prefers either the open or closed state were simulated with each of the mutations. Figure 5 shows the change in binding free energy caused by the mutation in compatible, neutral and incompatible complexes (“compatible” here means, for example, an open-favouring mutation (blue) in an open complex (blue), and vice versa).


Figure 6. Comparison of change in binding free energy predicted from the calculated results for ubiquitin and the experimental result.

In their last step, a selection of open and closed mutations was introduced into an open complex and the change in binding free energy was compared between experiment (NMR) and their simulations. For this example their mutants behaved as expected: an increase in binding free energy was observed when the incompatible closed mutations were introduced into the open complex, while only subtle changes were seen for the compatible open mutations.

The authors suggest that in the future this computational protocol may be a cornerstone of designing allosteric switches. However, given that the approach requires pre-existing knowledge and is tailored to proteins with well-defined conformational states, it may take some time until we discover its full potential.

Short project: “Network Approach to Identifying the Mode of Action of Environmental Changes in Yeast”

Figure: distributions of gene-expression correlation for pairs of genes that interact versus those that do not.

I recently had the pleasure of working for 11 weeks with the wonderful people in OPIG. I studied protein interaction networks and how we might discern the parts of the network that are important for disease (and otherwise). In the past, people have looked at differential gene expression or used community detection to this end, but both approaches have drawbacks. The former treats genes one at a time, missing the fact that biological function is rarely down to a single gene or interaction. Community detection addresses this, but in turn does not take into account the dynamic nature of proteins in the cell: how do their interactions change over time? What about interactions or proteins that are only present in some cells? Community detection looks at all proteins at once and ignores important context like this.

My aim was to develop approaches that combined these elements. We used Pearson’s correlation coefficient on gene expression data and community detection on an interaction network. We showed that the distribution of the correlation of pairs of genes is weighted towards 1.0 for those that interact compared to those that do not, and for those in the same community compared to those that are not – see the figure above. We went on to assign a “score” to communities based on their correlation in each set of expression data. For example, one community might have a high score in expression data from cells undergoing amino acid starvation. We ended up with a list of communities which seemed to be important in certain environmental conditions. We made use of functional enrichment – drawing on the lovely Malte’s work – to try and verify these scores.
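A toy version of the first comparison above (interacting pairs should correlate more strongly in expression than non-interacting ones), on invented expression data rather than the real yeast dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

def pearson(a, b):
    """Pearson correlation coefficient of two 1-D arrays."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a @ b) / np.sqrt((a @ a) * (b @ b)))

# Toy expression matrix: rows = genes, columns = conditions.
# Genes 0 and 1 are co-regulated ("interacting"); gene 2 is unrelated.
shared = rng.normal(size=20)
expr = np.vstack([
    shared + 0.1 * rng.normal(size=20),
    shared + 0.1 * rng.normal(size=20),
    rng.normal(size=20),
])

print(pearson(expr[0], expr[1]))  # high: the interacting pair
print(pearson(expr[0], expr[2]))  # near zero: the non-interacting pair
```

Averaging such pairwise correlations within a community, per expression dataset, gives the kind of community “score” described above.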

I had a great time with some lovely people and produced something that I thought was very interesting. I really hope I see this work pop up again and get taken to interesting places! So long, and thanks for all the cookies!

Click here for some more pretty plots and a code repository (by request only).

Journal Club: Accessing Protein Conformational Ensembles using RT X-ray Crystallography

This week I presented a paper that investigates the differences between crystallographic datasets collected from crystals at RT (room-temperature) and crystals at CT (cryogenic temperatures). Full paper here.

The cooling of protein crystals to cryogenic temperatures is widely used as a method of reducing radiation damage and enabling collection of whole datasets from a single crystal. In fact, this approach has been so successful that approximately 95% of structures in the PDB have been collected at CT.

However, the main assumption of cryo-cooling is that the “freezing”/cooling process happens quickly enough that it does not disturb the conformational distributions of the protein, and that the RT ensemble is “trapped” when cooled to CT.

Although it is well established that cryo-cooling of the crystal does not distort the overall structure or fold of the protein, this paper investigates some of the more subtle changes that cryo-cooling can introduce, such as the distortion of sidechain conformations or the quenching of dynamic CONTACT networks. These features of proteins could be important for the understanding of phenomena such as binding or allosteric modulation, and so accurate information about them is essential. If this information is regularly lost in the cryo-cooling process, it could be a strong argument for a return to collection at RT where feasible.

By using the RINGER method, the authors find that the sidechain conformations are commonly affected by the cryo-cooling process: the conformers present at CT are sometimes completely different to the conformers observed at RT. In total, they find that cryo-cooling affects a significant number of residues (predominantly those on the surface of the protein, but also those that are buried). 18.9% of residues have rotamer distributions that change between RT and CT, and 37.7% of residues have a conformer that changes occupancy by 20% or more.

Overall, the authors conclude that, where possible, datasets should be collected at RT, as the derived models offer a more realistic description of the biologically-relevant conformational ensemble of the protein.

At this week’s group meeting I presented on my second SABS short project, which is supervised by Charlotte Deane, Mason Porter, and Jonny Wray from e-Therapeutics. It has the title “Multilayer-Network Analysis of Protein Interaction Networks”.
Protein interactions can be represented using networks. Accordingly, approaches that have been developed in network science are appropriate for the analysis of protein interactions, and they can lead to the detection of new drug targets. Thus far, only ordinary (“monolayer”) protein interaction networks have been exploited for drug discovery. However, because “multilayer networks” allow the representation of multiple types of interactions and of time-dependent interactions, they have the potential to improve insight from network-based approaches [1].
The aim of my project was to apply known multilayer methods to well-established data to investigate potential use cases of multilayer protein interaction networks. We focussed on various community detection methods [3,4] to find groups of proteins as candidates for functional, biological modules. Additionally, temporal centrality measures [5] were used to identify important proteins across time.
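The temporal measures of [5] build on eigenvector centrality; a minimal power-iteration sketch on a single invented layer (not the actual multilayer method) looks like this:

```python
import numpy as np

def eigenvector_centrality(A, iters=200):
    """Eigenvector centrality of a symmetric adjacency matrix via power
    iteration; the identity shift avoids oscillation on bipartite graphs.
    Returned scores are normalised to sum to 1."""
    M = A + np.eye(A.shape[0])
    x = np.ones(A.shape[0])
    for _ in range(iters):
        x = M @ x
        x /= np.linalg.norm(x)
    return x / x.sum()

# Invented 4-protein "layer": protein 0 interacts with 1, 2 and 3.
A = np.array([[0, 1, 1, 1],
              [1, 0, 0, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 0]], dtype=float)
c = eigenvector_centrality(A)
print(c)  # the hub (protein 0) scores highest
```

The multilayer/temporal versions couple such layers together so that a protein’s importance at one time point can influence its importance at neighbouring time points.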

References:
[1] Kivelä, Mikko, et al. “Multilayer networks.” Journal of Complex Networks (2014)
[2] Calvano, Steve E., et al. “A network-based analysis of systemic inflammation in humans.” Nature (2005)
[3] Peixoto, Tiago P. “Efficient Monte Carlo and greedy heuristic for the inference of stochastic block models.” PRE (2014)
[4] Mucha, Peter J., et al. “Community structure in time-dependent, multiscale, and multiplex networks.” Science (2010)
[5] Taylor, Dane, et al. “Eigenvector-Based Centrality Measures for Temporal Networks.” arXiv preprint (2015)

ISMB wrap-up (it was coming, we insist…!)

So ISMB 2015 seems a while ago now (just under two months!), but Dublin was an adventure filled with lots of Guinness, Guinness, and … Guinness. Oh, and of course, science (credits to Eleanor for the science-of-Guinness video)! We definitely had lots of that, and you can see some of our pictures from the week too (credits to Sam: https://goo.gl/photos/2qm9CPbfHtoC3VfH9).

Here are some key pieces of work that stood out to each of us here at OPIG.

Claire:
Jianlin Cheng, from the University of Missouri, presented his research into model quality assessment methods – ways of choosing the model that is closest to the native structure from a set of decoys. Their method (MULTICOM) is a consensus method, which calculates an average rank from 14 different quality assessment methods. By combining this average rank with clustering and model combination to select the five top models for a target, their method produces good results: in CASP11, the group was ranked third when considering the top-ranked models and second when considering the top five.
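The consensus step can be sketched as averaging each decoy’s rank over the individual assessors; the QA methods and model names below are invented for illustration, not MULTICOM’s actual components:

```python
def consensus_rank(rankings):
    """Order models by their average rank over several quality
    assessment methods (each ranking is ordered best-to-worst)."""
    models = rankings[0]
    avg = {m: sum(r.index(m) for r in rankings) / len(rankings)
           for m in models}
    return sorted(models, key=lambda m: avg[m])

# Three hypothetical QA methods ranking four decoys:
qa_rankings = [["m1", "m3", "m2", "m4"],
               ["m3", "m1", "m2", "m4"],
               ["m1", "m2", "m3", "m4"]]
print(consensus_rank(qa_rankings))  # ['m1', 'm3', 'm2', 'm4']
```

A decoy ranked highly by most methods floats to the top even if one assessor disagrees, which is the appeal of consensus scoring.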
Alistair:
Accounting for the cell cycle in single cell RNA-seq
The current ubiquity of RNA-seq throughout academia speaks volumes for the strength of the methodology. It provides a transcript-wide measure of a cell’s expression at the point of cell lysis, from which one can investigate gene fusion, SNPs and changes in expression, to name only a few possibilities. Traditionally, these measurements are made on a cell culture, and as such the expression levels, and derived results, are averages taken over a number of cells. Recent advances have increased the resolution to the point where measurements can instead be made on single isolated cells. With this increase in capability, it should now be possible to measure and identify subpopulations within a given culture. However, the inherent variability of expression, such as that caused by the cell cycle, often overshadows any change that could be attributed to these subpopulations. If one could characterise this variability, it could be removed from the data, and perhaps these subpopulations would then be elucidated.
Oliver Stegle gave a great presentation on doing exactly this for the cell cycle. They modelled the different phases as a set of latent variables, inferred directly from the data (rather than merely observed). Via this model, they estimated that upwards of 30% of the inherent variability could be accounted for, and hence subtracted from the data. Applying the technique to a culture of T cells, they were able to identify the different stages of differentiation of naive T cells into T helper 2 cells; crucially, these would have been obscured had the cell cycle not been accounted for. Given this success with the cell cycle alone, Stegle suggested that the technique could be expanded to elucidate other sources of gene-expression heterogeneity, making it easier to identify these cellular subpopulations.
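The latent-variable model in the talk is considerably richer, but the core idea of removing explained variability can be caricatured with a simple least-squares regression against a (here entirely synthetic) cell-cycle covariate:

```python
import numpy as np

rng = np.random.default_rng(1)
n_cells = 100

# Synthetic data: expression = strong cell-cycle effect + a small
# subpopulation effect (first 50 cells) + measurement noise.
cell_cycle = rng.normal(size=n_cells)
subpop = np.where(np.arange(n_cells) < 50, 1.0, 0.0)
expr = 3.0 * cell_cycle + subpop + 0.1 * rng.normal(size=n_cells)

# Regress expression on the cell-cycle covariate and keep the residual.
X = np.column_stack([np.ones(n_cells), cell_cycle])
beta, *_ = np.linalg.lstsq(X, expr, rcond=None)
residual = expr - X @ beta

# Before correction the subpopulation shift is buried in cycle noise;
# after correction the difference between the two groups stands out.
print(expr[:50].mean() - expr[50:].mean())
print(residual[:50].mean() - residual[50:].mean())
```

The real method infers the cell-cycle variables from the data itself rather than assuming they are measured, which is what makes it broadly applicable.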
Jaro:
Dr. Julia Shifman from the Hebrew University of Jerusalem studies protein-protein interactions. In her 3Dsig presentation she focused on the presence of “cold spots” in protein sequences: positions where in silico mutations to several different amino acids improve the binding affinity. Such cold spots are often observed at the periphery of the complex, where no interaction is observed.
Malte: 
Alex Cornish from Imperial College London presented his work on the structural difference between cell-type specific interaction networks. To generate these, he weighted protein-protein interaction network edges by cell-type specific gene expression data from the FANTOM5 project. Using these cell-type specific networks, he finds that it is possible to associate specific networks with specific diseases based on the distances between disease-associated genes in the networks. Furthermore, these disease – cell type associations can be used to assess the similarity between diseases.
Jin:
Barry Grant presented an overview of the research activity in his group — namely nucleotide switch proteins (e.g. GTPases, such as Ras and Kinesin). In the case of Kinesin, the group used simple statistical methods such as principal components analysis to draw inferences between conformation and functional states. The group then used correlated motions to construct a community network that describes how certain groups of residues behave in certain conformational states.
Sam:
Discovery of CREBBP Bromodomain Inhibitors by High-Throughput Docking and Hit Optimization Guided by Molecular Dynamics
Min Xu, Andrea Unzue, Jing Dong, Dimitrios Spiliotopoulos, Cristina Nevado, and Amedeo Caflisch Paper link
In this paper, MD simulations were used to confirm the binding modes found by in silico docking and to guide the chemical synthesis during hit optimisation. In one example, three binding modes were observed during MD, which helped to identify favourable electrostatic interactions that improved selectivity. For another compound, a favourable polar interaction in the binding site was observed during MD, which helped to increase its potency from micro- to nanomolar. Thus, given a robust in silico screening and docking approach, MD simulations can be a useful addition to confirm binding modes and to inspire derivatives that might have been overlooked in static structures.
Eleanor:
Bumps and traffic lights along the translation of secretory proteins
Shelley Mahlab and Michal Linial. In her talk, Michal described how a theoretical measure of translation speed, tAI, reveals differences between the translation speeds of proteins targeted to different locations. Proteins with a signal peptide (either secreted or transmembrane) have significantly slower codons than proteins without one over approximately the first 30 codons, perhaps allowing more time for binding to the signal recognition particle. Transmembrane proteins had a lower predicted speed overall, which may be important for their insertion and correct folding.

ERC starting grant mock interview

Yesterday’s group meeting was transformed into a mock interview for the final evaluation step of my ERC starting grant application. To be successful, an ERC starting grant application has to pass three evaluation steps.

The first step consists of a short research proposal (max 5 pages), a CV, a record of successful previous grant writing, an early-achievements track record, and a publication list. If a panel of reviewers (usually around 4-6) decides that this is “excellent” (this word is the main evaluation criterion), the application passes to step two.

In step two the full scientific proposal is evaluated. The somewhat unfair part of the procedure is that if step one is unsuccessful, the full proposal is not even read (although it had to be submitted together with step one).

Fortunately, my proposal passed steps one and two. The final hurdle will be a 10-minute interview plus 15 minutes of questions in Brussels, where the final decision will be made.

I have already had one mock interview with some of the 2020 Research Fellows (thanks to Konrad, Remi, and Laurel), one with David Gavaghan, and a third yesterday with our whole research group.

After those three mock interviews I hope to be properly prepared for the real interview!

Molecular Dynamics of Antibody CDRs

Efficient design of antibodies as therapeutic agents requires understanding their structure and behaviour in solution. I have recently performed molecular dynamics simulations to investigate the flexibility and solution dynamics of complementarity-determining regions (CDRs). Eight structures of the Fv region of the antibody SPE7 with identical sequences were found in the Protein Data Bank. Twenty-five replicas of 100 ns simulations were run on the Fv region of one of these structures to investigate whether the CDRs adopted the conformations of the other X-ray structures. The simulations showed the H3 and L3 loops starting in one conformation and adopting another experimentally determined conformation.
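Deciding whether a simulated loop has reached another X-ray conformation boils down to an RMSD after optimal superposition; a minimal Kabsch-algorithm sketch on synthetic coordinates (not the SPE7 data, for which one would use a trajectory library such as MDAnalysis):

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal
    superposition (centring plus the Kabsch rotation)."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    U, S, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(U @ Vt))    # guard against reflections
    R = U @ np.diag([1.0, 1.0, d]) @ Vt   # rotation taking P onto Q
    diff = P @ R - Q
    return float(np.sqrt((diff ** 2).sum() / len(P)))

# A made-up 10-atom "loop", then the same loop rotated and translated:
rng = np.random.default_rng(7)
P = rng.normal(size=(10, 3))
theta = 0.7
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
Q = P @ Rz + np.array([1.0, -2.0, 0.5])
print(kabsch_rmsd(P, Q))  # ~0: identical conformations in different frames
```

Tracking this RMSD against each of the eight crystal conformations over a trajectory shows which experimental state the loop is visiting at each frame.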

This confirms the potential of molecular dynamics to be used to investigate antibody binding and flexibility. Further investigation would involve simulating different systems, for example using solution NMR structures, and comparing the conformations observed here to the canonical forms of CDR loops. Looking forward, it is hoped that molecular dynamics could be used to predict the bound conformation of an antibody from the unbound structure.

Click here for simulation videos.

Slow and steady improvements in the prediction of one-dimensional protein features

What do you do when you have a big, complex problem whose solution is not trivial? You break the problem into smaller, easier-to-solve parts, solve each of these sub-problems, and merge the results to find the solution of the original, bigger problem. This is an algorithm design paradigm known as divide and conquer.
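The canonical textbook example of this paradigm is merge sort, which sorts a list by sorting its halves recursively and merging the results:

```python
def merge_sort(xs):
    """Divide and conquer: split, solve the halves, merge the results."""
    if len(xs) <= 1:               # base case: trivially sorted
        return xs
    mid = len(xs) // 2
    left = merge_sort(xs[:mid])    # divide + conquer each half
    right = merge_sort(xs[mid:])
    merged, i, j = [], 0, 0        # merge: combine the sub-solutions
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

print(merge_sort([5, 2, 8, 1, 9, 3]))  # [1, 2, 3, 5, 8, 9]
```

The protein-informatics version discussed below works the same way in spirit: the sub-problem solutions (one-dimensional features) feed the merge step (a full structural model).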

In protein informatics, we use divide and conquer strategies to deal with a plethora of large and complicated problems. From protein structure prediction to protein-protein interaction networks, we have a wide range of sub and sub-sub problems whose solutions are supposed to help us with the bigger picture.

In particular, prediction of so-called one-dimensional protein features is a fundamental sub-problem with a wide range of applications, such as protein structure modelling, homology detection and functional characterisation. Here, one-dimensional protein features refer to secondary structure, backbone dihedral and C-alpha angles, and solvent accessible surface area.

In this week’s group meeting, I discussed the latest advancements in prediction of one-dimensional features as described in an article published by Heffernan R. and colleagues in Scientific Reports (2015):

“Improving prediction of secondary structure, local backbone angles, and solvent accessible surface area of proteins by iterative deep learning.”

In this article, the authors describe the implementation of SPIDER2, a deep learning approach to predict secondary structure, solvent accessible surface area, and four backbone angles (the traditional dihedrals phi and psi, and the recently explored theta and tau).

“Deep learning” is the buzzword (buzz-two-words or buzzsentence, maybe?) of the moment. For those of you who have no idea what I am talking about, deep learning is an umbrella term for a series of convoluted machine learning methods. The term deep comes from the multiple hidden layers of neurons used during learning.

Deep learning is a very fashionable term for a reason. These methods have been shown to produce state-of-the-art results for a wide range of applications in several fields, including bioinformatics. As a matter of fact, one of the leading methods for contact prediction (previously introduced in this blog post), uses a deep learning approach to improve the precision of predicted protein contacts.

Machine learning has already been explored to predict one-dimensional protein features, showing promising (and more importantly, useful) results. With the emergence of new, more powerful machine learning techniques such as deep learning, previous software are now becoming obsolete.

Based on this premise, Heffernan R. and colleagues implemented and applied their deep learning approach to improve the prediction of one-dimensional protein features. Their training process was rigorous: they performed 10-fold cross-validation using their training set of ~4500 proteins and, on top of that, they also had two independent test sets (a ~1200-protein test set and a set based on the targets of CASP11). Proteins in all sets shared no more than 25% sequence identity (30% for the CASP set) with any other protein in the sets.

The method described in the paper, SPIDER2, was thoroughly compared with state-of-the-art prediction software for each of the one-dimensional protein features it is capable of predicting. The results show that SPIDER2 achieves a small, yet significant, improvement over other methods.

It is just like they say, slow and steady wins the race, right? In this case, I am not so sure. It would be interesting to see how much the small increments in precision obtained by SPIDER2 can improve the bigger picture, whichever your bigger picture is. The thing about divide and conquer is that if you become marginally better at solving one of the parts, that doesn’t necessarily imply that you will improve the solution of the bigger, main problem.

If we think about it, during the “conquer” stage (that is, when you are merging the solution of the smaller parts to get to the bigger picture),  you may make compromises that completely disregard any minor improvements for the sub-problems. For instance, in my bigger picture, de novo protein structure prediction, predicted local properties can be sacrificed to ensure a more globally consistent model. More than that, most methods that perform de novo structure prediction already account for a certain degree of error or uncertainty for, say, secondary structure prediction. This is particularly important for the border regions between secondary structure elements (i.e. where an alpha-helix ends and a loop begins). Therefore, even if you improve the precision of your predictions for those border regions, the best approach for structure prediction may still consider those slightly more precise border predictions as unreliable.

The other moral of this story is far more pessimistic. If you think about it, there have been significant advancements in machine learning, which led to the creation of ever more complicated neural network architectures. However, when we look at how much improvement these highly elaborate techniques brought to an old problem (prediction of one-dimensional protein features), the pay-off doesn’t seem as significant (at least not as much as I would expect). Maybe I am a glass-half-empty kind of guy, but given the buzz surrounding deep learning, I find minor improvements a bit of a letdown. Not to take any credit away from the authors: their work was rigorous and scientifically very sound. It is just that maybe we are reaching our limits when it comes to applying machine learning to secondary structure prediction. Maybe when the next generation of buzzword-worthy machine learning techniques appears, we will observe an even smaller improvement. Which leaves a very bitter unanswered question in all our minds: if machine learning is not the answer, what is?