Journal Club: Accessing Protein Conformational Ensembles using RT X-ray Crystallography

This week I presented a paper that investigates the differences between crystallographic datasets collected from crystals at RT (room-temperature) and crystals at CT (cryogenic temperatures). Full paper here.

The cooling of protein crystals to cryogenic temperatures is widely used as a method of reducing radiation damage and enabling collection of whole datasets from a single crystal. In fact, this approach has been so successful that approximately 95% of structures in the PDB have been collected at CT.

However, the main assumption of cryo-cooling is that the “freezing”/cooling process happens quickly enough that it does not disturb the conformational distributions of the protein, and that the RT ensemble is “trapped” when cooled to CT.

Although it is well established that cryo-cooling of the crystal does not distort the overall structure or fold of the protein, this paper investigates some of the more subtle changes that cryo-cooling can introduce, such as the distortion of sidechain conformations or the quenching of dynamic contact networks. These features of proteins could be important for understanding phenomena such as binding or allosteric modulation, and so accurate information about them is essential. If this information is regularly lost in the cryo-cooling process, it would be a strong argument for a return to collection at RT where feasible.

By using the Ringer method, the authors find that sidechain conformations are commonly affected by the cryo-cooling process: the conformers present at CT are sometimes completely different from the conformers observed at RT. In total, they find that cryo-cooling affects a significant number of residues (predominantly those on the surface of the protein, but also some that are buried). 18.9% of residues have rotamer distributions that change between RT and CT, and 37.7% of residues have a conformer that changes occupancy by 20% or more.
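
Ringer works by scanning the electron density as a function of the sidechain dihedral (chi) angles. As a toy illustration of the geometry involved (my own sketch, not the authors' code), here is how a dihedral angle can be computed from four atom positions, e.g. N, CA, CB, CG for chi1:

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees defined by four points,
    e.g. N, CA, CB, CG for the chi1 angle of a sidechain."""
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # components of b0 and b2 perpendicular to the central bond b1
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return np.degrees(np.arctan2(y, x))
```

With this convention, an eclipsed (cis) arrangement gives 0° and a trans arrangement gives ±180°; rotamer populations can then be read off from histograms of these angles across a structure set.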

Overall, the authors conclude that, where possible, datasets should be collected at RT, as the derived models offer a more realistic description of the biologically-relevant conformational ensemble of the protein.

At this week’s group meeting I presented on my second SABS short project, which is supervised by Charlotte Deane, Mason Porter, and Jonny Wray from e-Therapeutics. It has the title “Multilayer-Network Analysis of Protein Interaction Networks”.
Protein interactions can be represented using networks. Accordingly, approaches that have been developed in network science are appropriate for the analysis of protein interactions, and they can lead to the detection of new drug targets. Thus far, only ordinary (“monolayer”) protein interaction networks have been exploited for drug discovery. However, because “multilayer networks” allow the representation of multiple types of interactions and of time-dependent interactions, they have the potential to improve insight from network-based approaches [1].
The aim of my project was to employ known multilayer methods on well-established data to investigate potential use cases of multilayer protein interaction networks. We focussed on various community detection methods [3,4] to find groups of proteins as candidates for functional biological modules. Additionally, temporal centrality measures [5] were used to identify important proteins across time.

[1] Kivelä, Mikko, et al. “Multilayer networks.” Journal of Complex Networks (2014).
[2] Calvano, Steve E., et al. “A network-based analysis of systemic inflammation in humans.” Nature (2005).
[3] Peixoto, Tiago P. “Efficient Monte Carlo and greedy heuristic for the inference of stochastic block models.” PRE (2014).
[4] Mucha, Peter J., et al. “Community structure in time-dependent, multiscale, and multiplex networks.” Science (2010).
[5] Taylor, Dane, et al. “Eigenvector-Based Centrality Measures for Temporal Networks.” arXiv preprint (2015).

ISMB wrap-up (it was coming, we insist…!)

So ISMB 2015 seems a bit far away now (it was just under two months ago!), but Dublin was an adventure, filled with lots of Guinness, Guinness, and … Guinness. Oh, and of course, science (credit to Eleanor for the science-of-Guinness video)! We definitely had lots of that, and you can see some of our pictures from the week too (credit to Sam).

Here are some key pieces of work that stood out to each of us here at OPIG.

Jianlin Cheng, from the University of Missouri, presented his research into model quality assessment methods – ways of choosing the model that is closest to the native structure from a set of decoys. Their method (MULTICOM) is a consensus method, which calculates an average rank from 14 different quality assessment methods. By combining this average rank with clustering and model combination to select five top models for a target, their method produces good results – in CASP11, the group were ranked third when considering the top-ranked models and second when considering the top five.
Accounting for the cell cycle in single cell RNA-seq
The current ubiquity of RNA-seq throughout academia speaks volumes about the strength of the methodology. It provides a transcriptome-wide measure of a cell’s expression at the point of cell lysis, from which one can investigate gene fusions, SNPs and changes in expression, to name only a few possibilities. Traditionally, these measurements are made on a cell culture, and as such the expression levels, and any derived results, are averages taken over a number of cells. Recent advances have increased the resolution to the point where measurements can now be made on single isolated cells. With this increase in capability, it should now be possible to measure and identify subpopulations within a given culture. However, the inherent variability of expression, such as that caused by the cell cycle, often overshadows any change that could be attributed to these subpopulations. If one could characterise this variability, it could be removed from the data and perhaps these subpopulations would then be elucidated.
Oliver Stegle gave a great presentation on doing exactly this for the cell cycle. They modelled the different phases as a set of latent variables, inferred directly from the data (rather than merely observed). Via this model, they estimated that upwards of 30% of the inherent variability could be accounted for, and hence subtracted from the data. Applying this technique to a culture of T cells, they were able to identify the different stages of differentiation of naive T cells into T helper 2 cells. Crucially, these would have been obscured had the cell cycle not been accounted for. Given this success with the cell cycle alone, Stegle suggested that their technique could be expanded to elucidate other sources of gene-expression heterogeneity, making it easier to identify these cellular subpopulations.
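
The general idea of removing a known confounder can be sketched with simple linear regression (a toy stand-in for illustration only, not the latent-variable model the authors actually used):

```python
import numpy as np

def regress_out(expr, factor):
    """Remove the linear effect of a per-cell confounder from expression data.

    expr:   (cells, genes) expression matrix
    factor: (cells,) estimate of the confounder (e.g. cell-cycle stage)
    Returns the residual expression after fitting intercept + slope per gene.
    """
    X = np.column_stack([np.ones_like(factor), factor])  # design matrix
    beta, *_ = np.linalg.lstsq(X, expr, rcond=None)      # per-gene least squares
    return expr - X @ beta                               # residual variation
```

Whatever structure remains in the residuals (e.g. subpopulations) is then no longer masked by the confounder, which is the spirit of the approach Stegle described.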
Dr. Julia Shifman from the Hebrew University of Jerusalem studies protein-protein interactions. In her 3Dsig presentation she focused on the presence of “cold spots” in protein sequences, positions where in silico mutations to several different amino acids improve the binding affinity. Such cold spots are often observed at the periphery of the complex, where no interaction is observed.
Alex Cornish from Imperial College London presented his work on the structural difference between cell-type specific interaction networks. To generate these, he weighted protein-protein interaction network edges by cell-type specific gene expression data from the FANTOM5 project. Using these cell-type specific networks, he finds that it is possible to associate specific networks with specific diseases based on the distances between disease-associated genes in the networks. Furthermore, these disease – cell type associations can be used to assess the similarity between diseases.
Barry Grant presented an overview of the research activity in his group, namely on nucleotide switch proteins (e.g. GTPases such as Ras, and Kinesin). In the case of Kinesin, the group used simple statistical methods, such as principal component analysis, to draw inferences about the relationship between conformations and functional states. The group then used correlated motions to construct a community network that describes how certain groups of residues behave in different conformational states.
Discovery of CREBBP Bromodomain Inhibitors by High-Throughput Docking and Hit Optimization Guided by Molecular Dynamics
Min Xu, Andrea Unzue, Jing Dong, Dimitrios Spiliotopoulos, Cristina Nevado, and Amedeo Caflisch Paper link
In this paper, MD simulations were used to confirm the binding modes found by in silico docking and to guide the chemical synthesis during hit optimisation. In one example, three binding modes were observed during MD, which helped to identify favourable electrostatic interactions that improved selectivity. For another compound, a favourable polar interaction in the binding site was observed during MD, which helped to increase its potency from micro- to nanomolar. Thus, given a robust in silico screening and docking approach, MD simulations can be a useful addition to confirm binding modes and to inspire derivatives that might have been overlooked in static structures.
Bumps and traffic lights along the translation of secretory proteins
Shelley Mahlab, Michal Linial

In her talk, Michal described how a theoretical measure of translation speed, tAI, shows the differences between the translation speeds of proteins targeted to different locations. Proteins with a signal peptide (either secreted or transmembrane) have significantly slower codons than proteins without one, over approximately the first 30 codons, perhaps allowing more time for binding to the signal recognition particle. Transmembrane proteins had a lower predicted speed overall, which may be important for their insertion and correct folding.

ERC starting grant mock interview

Yesterday’s group meeting got transformed into a mock interview for the final evaluation step of my ERC starting grant application. To be successful with an ERC starting grant application one has to pass three evaluation steps.

The first step consists of a short research proposal (max 5 pages), a CV, a record of successful previous grant writing, an early-achievements track record, and a publication list. If a panel of reviewers (usually around 4-6) decides that this is “excellent” (this word is the main evaluation criterion) then the application is transferred to step two.

In step two, a full scientific proposal is evaluated. The unfair part of the procedure is that if step one is not successful, the full proposal is not even read (although it has to be submitted together with the step-one application).

Fortunately, my proposal passed step one and step two. The final hurdle will be a 10-minute interview plus 15 minutes of questions in Brussels, where the final decision will be made.

I already had one mock interview with some of the 2020 research fellows (thanks to Konrad, Remi, and Laurel), one with David Gavaghan, and the third one took place yesterday with our whole research group.

After those three mock interviews I hope to be properly prepared for the real interview!

Molecular Dynamics of Antibody CDRs

Efficient design of antibodies as therapeutic agents requires an understanding of their structure and behavior in solution. I have recently performed molecular dynamics simulations to investigate the flexibility and solution dynamics of complementarity determining regions (CDRs). Eight structures of the Fv region of antibody SPE7, all with identical sequences, were found in the Protein Data Bank. Twenty-five replicas of 100 ns simulations were performed on the Fv region of one of these structures to investigate whether the CDRs adopted the conformations of the other X-ray structures. The simulations showed the H3 and L3 loops starting in one conformation and adopting another experimentally determined conformation.

This confirms the potential of molecular dynamics to be used to investigate antibody binding and flexibility. Further investigation would involve simulating different systems, for example using solution NMR resolved structures, and comparing the conformations deduced here to the canonical forms of CDR loops. Looking forward it is hoped molecular dynamics could be used to predict the bound conformation of an antibody from the unbound structure.

Click here for simulation videos.

Slow and steady improvements in the prediction of one-dimensional protein features

What do you do when you have a big, complex problem whose solution is not trivial? You break the problem into smaller, easier-to-solve parts, solve each of these sub-problems, and merge the results to find the solution of the original, bigger problem. This is an algorithm design paradigm known as the divide and conquer approach.
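
The classic textbook example of this paradigm is merge sort, sketched here for concreteness:

```python
def merge_sort(xs):
    """Divide and conquer: split the list, solve each half, merge the results."""
    if len(xs) <= 1:                 # base case: trivially solved sub-problem
        return list(xs)
    mid = len(xs) // 2
    left = merge_sort(xs[:mid])      # divide and solve recursively
    right = merge_sort(xs[mid:])
    merged, i, j = [], 0, 0          # conquer: merge the two sorted halves
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]
```

Each half is easier to sort than the whole, and the merge step combines the sub-solutions into the answer to the original problem, exactly the pattern protein informatics applies at a much grander scale.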

In protein informatics, we use divide and conquer strategies to deal with a plethora of large and complicated problems. From protein structure prediction to protein-protein interaction networks, we have a wide range of sub and sub-sub problems whose solutions are supposed to help us with the bigger picture.

In particular, prediction of the so-called one-dimensional protein features comprises fundamental sub-problems with a wide range of applications, such as protein structure modelling, homology detection, and functional characterisation. Here, one-dimensional protein features refer to secondary structure, backbone dihedral and C-alpha angles, and solvent accessible surface area.

In this week’s group meeting, I discussed the latest advancements in prediction of one-dimensional features as described in an article published by Heffernan R. and colleagues in Scientific Reports (2015):

“Improving prediction of secondary structure, local backbone angles, and solvent accessible surface area of proteins by iterative deep learning.”

In this article, the authors describe the implementation of SPIDER2, a deep learning approach to predict secondary structure, solvent accessible surface area, and four backbone angles (the traditional dihedrals phi and psi, and the recently explored theta and tau).

“Deep learning” is the buzzword (buzz-two-words or buzz-sentence, maybe?) of the moment. For those of you who have no idea what I am talking about, deep learning is an umbrella term for a family of machine learning methods built on artificial neural networks. The term “deep” comes from the multiple hidden layers of neurons used during learning.
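
As a minimal illustration of what “multiple hidden layers” means (a toy forward pass, not SPIDER2’s actual architecture):

```python
import numpy as np

def forward(x, weights):
    """Feed-forward pass through a 'deep' network.

    x:       input feature vector
    weights: list of (W, b) pairs, one per layer; all but the last
             are hidden layers with a ReLU non-linearity.
    """
    for W, b in weights[:-1]:
        x = np.maximum(0.0, W @ x + b)   # hidden layer: affine map + ReLU
    W, b = weights[-1]
    return W @ x + b                     # linear output layer
```

Stacking more (W, b) pairs in `weights` makes the network “deeper”; the learning part consists of fitting those matrices to data, which is where all the engineering effort goes.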

Deep learning is a very fashionable term for a reason. These methods have been shown to produce state-of-the-art results for a wide range of applications in several fields, including bioinformatics. As a matter of fact, one of the leading methods for contact prediction (previously introduced in this blog post), uses a deep learning approach to improve the precision of predicted protein contacts.

Machine learning has already been explored for predicting one-dimensional protein features, showing promising (and, more importantly, useful) results. With the emergence of new, more powerful techniques such as deep learning, previous software is now becoming obsolete.

Based on this premise, Heffernan R. and colleagues implemented and applied their deep learning approach to improve the prediction of one-dimensional protein features. Their training process was rigorous: they performed 10-fold cross-validation on their training set of ~4500 proteins and, on top of that, they also used two independent test sets (a ~1200-protein test set and a set based on the targets of CASP11). No protein in any set shared more than 25% sequence identity (30% for the CASP set) with any other protein in any of the sets.

The method described in the paper, SPIDER2, was thoroughly compared with state-of-the-art prediction software for each of the one-dimensional protein features that it is capable of predicting. Results show that SPIDER2 achieves a small, yet significant, improvement over other methods.

It is just like they say: slow and steady wins the race, right? In this case, I am not so sure. It would be interesting to see how much the small increments in precision obtained by SPIDER2 can improve the bigger picture, whatever your bigger picture is. The thing about divide and conquer is that becoming marginally better at solving one of the parts doesn’t necessarily imply that you will improve the solution of the bigger, main problem.

If we think about it, during the “conquer” stage (that is, when you are merging the solution of the smaller parts to get to the bigger picture),  you may make compromises that completely disregard any minor improvements for the sub-problems. For instance, in my bigger picture, de novo protein structure prediction, predicted local properties can be sacrificed to ensure a more globally consistent model. More than that, most methods that perform de novo structure prediction already account for a certain degree of error or uncertainty for, say, secondary structure prediction. This is particularly important for the border regions between secondary structure elements (i.e. where an alpha-helix ends and a loop begins). Therefore, even if you improve the precision of your predictions for those border regions, the best approach for structure prediction may still consider those slightly more precise border predictions as unreliable.

The other moral of this story is far more pessimistic. If you think about it, there have been significant advancements in machine learning, which have led to ever more complicated neural network architectures. However, when we look at how much improvement these highly elaborate techniques yield when applied to an old problem (prediction of one-dimensional protein features), the pay-off doesn’t seem that significant (at least not as much as I would expect). Maybe I am a glass-half-empty kind of guy, but given the buzz surrounding deep learning, I think minor improvements are a bit of a letdown. Not to take any credit away from the authors: their work was rigorous and scientifically very sound. It is just that maybe we are reaching our limits when it comes to applying machine learning to secondary structure prediction. Maybe when the next generation of buzzword-worthy machine learning techniques appears, we will observe an even smaller improvement. Which leaves a very bitter unanswered question in all our minds: if machine learning is not the answer, what is?

Predicted protein contacts: is it the solution to (de novo) protein structure prediction?

So what is this buzz I hear about predicted protein contacts? Is it really the long awaited solution for one of the biggest open problems in biology today? Has protein structure prediction been solved?

Well, first things first. Let me give you a quick introduction to this predicted protein contact business (probably not quick enough for an elevator pitch, but hopefully you are not reading this in an elevator).

Nowadays, the scientific community has become very good at sequencing things (and by things I mean genetic things, like whole genomes of a bunch of different people and organisms). We are so good at it that mountains of sequence data are now available: genes, mRNAs, protein sequences. The question is what do we do with all this data?

Good scientists are coming up with new and creative ideas to extract knowledge from these mountains of data. For instance, one can build multiple sequence alignments using protein sequences for a given protein family. One of the ways in which information can be extracted from these multiple sequence alignments is by identifying extremely conserved columns (think of the alignment as a big matrix). Residues in these conserved positions are good candidates for being functionally important for the proteins in that particular family.

Another interesting thing that can be done is to look for pairs of residues that are mutating in a correlated fashion. In more practical terms, you are measuring how correlated two columns of a multiple sequence alignment are: how often a change in one of them is countered by a change in the other. Why would anyone care about that? Simple. There is an assumption that residues that mutate in a correlated fashion are co-evolving. In other words, they share some sort of functional dependence (e.g. spatial proximity) that is under selective pressure.
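
As a toy sketch of the idea, here is the simple mutual-information measure of column co-variation (one of the early approaches, not the transitivity-corrected methods the field has since moved to):

```python
from collections import Counter
from math import log2

def column_mi(col_a, col_b):
    """Mutual information (in bits) between two alignment columns,
    each given as an equal-length string of residues (one per sequence)."""
    n = len(col_a)
    pa, pb = Counter(col_a), Counter(col_b)          # marginal residue counts
    pab = Counter(zip(col_a, col_b))                 # joint residue-pair counts
    mi = 0.0
    for (a, b), count in pab.items():
        p_joint = count / n
        mi += p_joint * log2(p_joint / ((pa[a] / n) * (pb[b] / n)))
    return mi
```

Two columns that always change together score high; independent columns score near zero. The catch, as discussed next, is that this raw score also picks up indirect (transitive) correlations.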

Ok, that was a lot of hypotheticals. Does it work? For many years, it didn’t. There were lots of issues with the way these correlations were computed, and one of the biggest problems was identifying (and correcting for) transitivity. Transitivity is the observation of a spurious correlation between residues A and C simply because residues A,B and residues B,C are mutating in a correlated fashion. As more powerful statistical methods were developed (borrowing some ideas from statistical mechanics), the transitivity issue has seemingly been solved.

The newest methods that detect co-evolving residues in a multiple sequence alignment are capable of detecting protein contacts with high precision. In this context, a contact is defined as two residues that are close together in a protein structure. How close?  Their C-betas must be 8 Angstroms or less apart. When sufficient sequence information is available (at least 500 sequences in the MSA), the average precision of the predicted contacts can reach 80%.
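
Given C-beta coordinates, turning this definition into a contact map is straightforward. A sketch (the minimum sequence separation is my own assumption, a common convention rather than something stated above):

```python
import numpy as np

def contact_map(cb_coords, cutoff=8.0, min_seq_sep=6):
    """Boolean contact map from C-beta coordinates.

    cb_coords:   (N, 3) array of C-beta positions, one per residue
    cutoff:      contact distance threshold in Angstroms (8 A, as above)
    min_seq_sep: exclude trivially close sequence neighbours (an assumed
                 convention, commonly used when scoring predicted contacts)
    """
    diff = cb_coords[:, None, :] - cb_coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))                 # pairwise distances
    idx = np.arange(len(cb_coords))
    sep = np.abs(idx[:, None] - idx[None, :])                # sequence separation
    return (dist <= cutoff) & (sep >= min_seq_sep)
```

Predicted-contact methods are then evaluated by how many of their top-scoring pairs fall inside this boolean map.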

This is a powerful way of converting sequence information into distance constraints, which can be used for protein structure modelling. If a sufficient number of correct distance constraints is used, we can accurately predict the topology of a protein [1]. Recently, we have also observed great advances in the way that models are refined (that is, refining a model that contains the correct topology to atomic, near-experimental resolution). If you put those two things together, we start to look at a very nice picture.

So what’s the catch? The catch was there all along, very subtle: “when sufficient sequence information is available”. Current estimates suggest that only 15% of de novo protein structure prediction cases present sufficient sequence information for the prediction of protein contacts. One potential solution would be to sit and wait for more sequences to be obtained. Yet a pitfall of sitting and waiting is that there is no guarantee we will ever have sufficient sequence information for a large number of protein families, as they may well have fewer than 500 members.

Furthermore, scientists are not very good at sitting around and waiting; they need to keep themselves busy. There are many things the community as a whole can invest time in while we wait for more sequences to be generated. For instance, we want to be sure that, for the cases where there are sufficient sequences, we get the modelling step right (and predict the correct protein topology). Predicted contacts also show potential as a tool for quality assessment, and may prove a nice way of ascertaining whether a model with the correct topology has been created. More than that, model refinement still needs to improve if we want to get from the correct topology to near-experimental resolution.

Protein structure prediction is a hard problem, and with so much room for improvement, we still have a long way to go. Yet this predicted-contact business is a huge step in the right direction. Maybe it won’t be long before models generated ab initio are considered as reliable as those generated using a template. Who knows what promises the future holds.


[1] Kim DE, Dimaio F, Yu-Ruei Wang R, Song Y, Baker D. One contact for every twelve residues allows robust and accurate topology-level protein structure modeling. Proteins. 2014 Feb;82 Suppl 2:208-18. doi: 10.1002/prot.24374. Epub 2013 Sep 10.

Modelling antibodies, from Sequence, to Structure…

Antibody modelling has come a long way in the past five years. The Antibody Modelling Assessment (AMA) competitions (effectively an antibody version of CASP) have shown that most antibody modelling methods are capable of modelling the antibody variable fragment (Fv) at ≤ 1.5 Å. Despite this feat, AMA-II provided two important lessons:

1. We can still improve our modelling of the framework region and the canonical CDRs.

Stage two of the AMA-II competition showed that CDR-H3 modelling improves once the correct crystal structure is provided (bar the H3 loop, of course). In addition, some of the canonical CDRs (e.g. L1) were modelled poorly, as were some of the framework loops.

2. We can’t treat orientation as if it doesn’t exist.

Many pipelines are either vague about how they predict the orientation, or give no explicit explanation of how it will be predicted for the model structure. Given how important the orientation can be to the antibody’s binding mode (Fera et al., 2014), it’s clear that this part of the pipeline has to be revisited more carefully.

In addition to these lessons, one question remains:

What do we do with these models?

No pipeline, as far as we are aware, comments on what we should do beyond creating the model. What are its implications? Can we use it for experiments, or as a potential therapeutic in the long term? In light of these lessons and this blaring question, we developed our own method.

Before we begin, how does modelling work?

In my mind, most, if not all, pipelines follow this generic paradigm:

Our method, ABodyBuilder, also follows this four-step workflow:

  1. Choose the template structure based on sequence identity; below a threshold, the structures of the heavy and light chains are predicted separately.
  2. If structures from separate antibodies are used, predict the orientation from the structure with the highest global sequence identity.
  3. Model the loops using FREAD (Choi & Deane, 2011).
  4. Graft the side chains on using SCWRL.
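
The decision logic of the four steps above can be sketched as follows (the function name, step strings, and identity threshold are all invented placeholders for illustration; this is not the actual ABodyBuilder code):

```python
def model_antibody(template_identity, threshold=0.8):
    """Return the sequence of modelling steps that would run for a
    template with the given sequence identity (threshold is assumed)."""
    steps = []
    if template_identity >= threshold:
        steps.append("single template for Fv")         # step 1, high identity
    else:
        steps.append("separate heavy/light templates") # step 1, low identity
        steps.append("predict orientation from best global-identity structure")  # step 2
    steps.append("model loops (FREAD)")                # step 3
    steps.append("graft side chains (SCWRL)")          # step 4
    return steps
```

The point of the sketch is simply that the orientation-prediction step only fires when the heavy and light chains come from different template structures.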

Following the modelling procedure, our method also annotates the accuracy of the model in a probabilistic context — i.e., an estimated probability that a particular region is modelled at a given RMSD threshold. Moreover, we also flag up any issues that an experimentalist can run into should they ever decide to model the antibody.

The accuracy estimation is a data-driven estimate of model quality. Many pipelines end up giving you just a model, with no way of determining its accuracy until the native structure is determined. This is particularly problematic for CDR-H3, where RMSDs between models and native structures can exceed 4.0 Å, and an a priori, expected estimate of model accuracy would be incredibly useful.

Furthermore, by commenting on motifs that can conflict with antibody development, we aim to offer a convenient solution for users who are considering in vitro experiments with their target antibody. Ultimately, ABodyBuilder is designed with the user in mind: an easy-to-use, informative piece of software that facilitates antibody modelling for novel applications.

Le tour d’OPIG 2015

The third iteration of “Let’s use two wheels to transport us to many pubs” took place earlier this summer, on Wednesday 20th May. Following on from the great successes of the last two years, there was much anticipation, and the promise of a brand new route. This year we covered 8 miles, via the Chester, the King’s Arms at Sandford lock, the Prince of Wales in Iffley, and the Magdalen Arms. Nobody fell in the river or went hungry, so it was considered a success!

2015 route

Using B factors to assess flexibility

In my work analysing antibody loops, I have reached the point where I am interested in flexibility; more specifically, in challenging the somewhat popular belief that these loops, especially H3, are highly flexible. For this I wanted to use the B (temperature / Debye-Waller) factor, which can be interpreted as a measure of the temperature-dependent vibration of the atoms in the crystal or, in gentler terms, the flexibility at a certain position. I was keen to use the backbone atoms, and possibly the Cβ, but unfortunately the B factor shows some biases, as it is also used to mask other uncertainties arising from poor resolution or low electron density and the resulting poor modelling. If we take a non-redundant set of loops and split them into resolution shells of 0.2 Å, we see how pronounced this bias is (Fig. 1(a)).


Fig. 1(a) Comparison of average backbone B factors for loops found in structures at increasing resolution. A clear bias can be observed that correlates with the increase in resolution.


Fig. 1(b) Normalization using the Z-score of the B factor of backbone atoms shows no bias across resolution shells.

Comparing loops across neighbouring resolution shells is therefore virtually uninformative, and can lead to some rather curious results. In one analysis, it appeared that loops directly present in the binding site of antibodies have a higher average B factor than loops in structures without antigen, where the movement is less constrained.

The issue here is that a complex structure (antibody-antigen) is larger and typically has poorer resolution, and therefore more biased B factors. To solve this, I decided to normalize the B factors using the Z-score within each PDB file, where the mean and the standard deviation are computed from all the backbone atoms of the amino acids in the file. To my knowledge, this method was first described by Parthasarathy and Murthy (1997) [1], although I arrived at it without reading their paper, the normalization being quite intuitive. Using this measure we can finally compare loops from different structures at different resolutions (Fig. 1(b)), and we see what is expected: loops found in bound structures are less flexible than loops in unbound structures (Fig. 2). We can also answer our original question: does the H3 loop present increased flexibility? From Fig. 2, the answer is no, if we compare non-redundant sets of loops from antibodies and from general proteins.
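
The normalization itself is a one-liner. A sketch (assuming the backbone-atom B factors have already been extracted from the PDB file):

```python
import numpy as np

def normalise_b_factors(b_factors):
    """Z-score normalisation of B factors within one structure:
    (B - mean) / std, with mean and std computed over all backbone
    atoms of the file. This removes the per-structure, resolution-
    dependent offset so loops from different structures can be compared."""
    b = np.asarray(b_factors, dtype=float)
    return (b - b.mean()) / b.std()
```

After this, every structure contributes B factors on the same scale (mean 0, standard deviation 1), so an average over a loop reflects its flexibility relative to the rest of its own structure.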


Fig. 2 Flexibility comparison using the normalized B factor between a non-redundant set of non-Ig-like protein loops and different sets of H3 loops: bound to antigen (H3 bound), unbound (H3 unbound), and both (H3). For each comparison, ten samples with the same number of examples and similar length distributions were generated and amassed (LMS) to correct for possible length bias, as the H3 loop is known to have a propensity for longer loops than average.


[1] Parthasarathy, S. and Murthy, M. R. N. (1997). Analysis of temperature factor distribution in high-resolution protein structures. Protein Science, 6(12), 2561–2567. ISSN 0961-8368.