Category Archives: Group Meetings

What we discuss during cake at our Tuesday afternoon group meetings

Journal Club: Mechanical force releases nascent chain-mediated ribosome arrest in vitro and in vivo

For this week’s journal club, I presented the paper by Goldman et al., “Mechanical force releases nascent chain-mediated ribosome arrest in vitro and in vivo”. I chose this paper because it discusses an influence on protein translation and folding that is not considered in any of today’s modelling efforts, and I think it is massively important that, every so often, we as a community step back and appreciate the complexity of the system we attempt to understand. The work focuses on the SecM protein, which regulates SecA (part of the translocon), which in turn regulates SecM. The biomechanical manner in which this regulation takes place is not fully understood. However, SecM contains within its sequence a peptide motif that binds so strongly to the ribosome tunnel wall that translation is stopped. It is hypothesised that SecA regulates SecM by applying a force to the nascent chain, pulling it past this stalling point and hence allowing translation to continue.

To begin their study, Goldman et al. wanted to confirm that translation could advance past the stall point merely through the application of force, which they achieved by attaching the nascent chain to optical tweezers and the ribosome to a micro-pipette. To confirm that the system was genuinely stalled before applying a larger force, they created a construct containing CaM, a protein which hops periodically between a folded and an unfolded state when pulled at 7 pN, followed by the section of SecM that causes the stalling. The optical tweezers were able to sense the slight variations in length at 7 pN from the unfolding and refolding of CaM, but no continuing extension, which would indicate translation, was observed. This showed that the system had truly stalled at the SecM sequence. From this point, they increased the applied force, and the distance between the pipette and the optical tweezers slowly increased until detachment occurred when the stop codon was reached. As well as confirming that force on the nascent chain can push the SecM system past the stalling point, they also noted a force dependence in the speed with which it overcomes this barrier.
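
A force dependence of this kind is commonly described by a Bell-type model for force-assisted escape over a barrier. I am not claiming this is the exact functional form fitted in the paper; it is just the standard way such data are usually rationalised, with k0 the zero-force escape rate and Δx the distance to the transition state along the pulling coordinate:

```latex
k(F) \;=\; k_0 \exp\!\left(\frac{F\,\Delta x}{k_{\mathrm{B}} T}\right)
```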

Protein folding near the ribosome tunnel exit can rescue SecM-mediated stalling

With this force dependence established, they asked whether a domain folding upstream of the stall point could generate enough force to make translation continue. To investigate, Goldman et al. created a construct containing Top7, followed by a linker of variable length, followed by the SecM stalling motif, which was in turn followed by GFP. As shown in the figure above, altering the length of the linker region defines where Top7 sits while it attempts to fold. A long linker allows Top7 to fold completely clear of the ribosome tunnel. A short linker means it cannot fold, because many of its residues are still inside the tunnel. Between these extremes, the protein may have only a few residues within the tunnel, and by stretching the nascent chain it may access them and so be able to fold. Top7 was chosen specifically because it is known to fold even under a modest applied force. Hence, even while its C-terminus is under strain into the ribosome, Top7 can still fold and, by Newton’s third law, it generates an equal and opposite force on the stalling sequence deep within the ribosome tunnel, which should allow translation to proceed past the stall. Crucially, if Top7 folds too far away from the ribosome, this interaction does not occur and translation does not continue.

Goldman’s experiments showed that this is indeed the case: only linkers of 15 to 22 amino acids allowed translation to complete successfully. This confirms that a protein folding at the mouth of the ribosome tunnel can generate a sizeable force (they calculate roughly 12 pN in this instance). I find this whole system especially interesting, as I wonder how it may generalise to translation more broadly, both in terms of interactions of the nascent chain with the tunnel wall and of domains folding at the tunnel mouth. Should I consider these when I calculate translation speeds, for example? Oh well, we need a reasonable model for translation that ignores these special cases before I really need to worry!
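
As a rough back-of-the-envelope check on the ~12 pN figure mentioned above (the numbers here are purely illustrative and not taken from the paper): if folding releases on the order of 10 kcal/mol while pulling roughly 5 nm of chain out of the tunnel, the implied force is

```latex
F \;\approx\; \frac{\Delta G_{\mathrm{fold}}}{\Delta x}
  \;\approx\; \frac{10\ \mathrm{kcal\,mol^{-1}} / N_A}{5\ \mathrm{nm}}
  \;\approx\; \frac{7 \times 10^{-20}\ \mathrm{J}}{5 \times 10^{-9}\ \mathrm{m}}
  \;\approx\; 14\ \mathrm{pN}
```

which is at least the right order of magnitude.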

Short project: “Network Approach to Identifying the Mode of Action of Environmental Changes in Yeast”


I recently had the pleasure of working for 11 weeks with the wonderful people in OPIG. I studied protein interaction networks and how we might discern the parts of the network that are important for disease (and otherwise). In the past, people have looked at differential gene expression or used community detection to this end, but both of these approaches have drawbacks. The former misses the fact that biological systems are rarely just collections of binary, pairwise interactions. Community detection addresses this, but it in turn does not take into account the dynamic nature of proteins in the cell: how do their interactions change over time? What about interactions or proteins that are only present in some cells? Community detection tries to look at all proteins at once and ignores important context like this.

My aim was to develop approaches that combined these elements. We used Pearson’s correlation coefficient on gene expression data and community detection on an interaction network. We showed that the distribution of the correlation of pairs of genes is weighted towards 1.0 for those that interact compared to those that do not, and for those in the same community compared to those that are not – see the figure above. We went on to assign a “score” to communities based on their correlation in each set of expression data. For example, one community might have a high score in expression data from cells undergoing amino acid starvation. We ended up with a list of communities which seemed to be important in certain environmental conditions. We made use of functional enrichment – drawing on the lovely Malte’s work – to try and verify these scores.
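
As a minimal sketch of the kind of comparison and scoring described above (function and variable names are my own, and the inputs are hypothetical, not the actual data or code used in the project):

```python
import numpy as np
from scipy.stats import pearsonr

def pair_correlations(expr, pairs):
    """Pearson correlation across conditions for each gene pair.

    expr  : dict mapping gene name -> 1D array of expression values
    pairs : iterable of (geneA, geneB) tuples
    """
    return np.array([pearsonr(expr[a], expr[b])[0]
                     for a, b in pairs if a in expr and b in expr])

def community_score(expr, community_genes):
    """Score a community as the mean pairwise correlation of its members."""
    genes = [g for g in community_genes if g in expr]
    cors = [pearsonr(expr[a], expr[b])[0]
            for i, a in enumerate(genes) for b in genes[i + 1:]]
    return float(np.mean(cors)) if cors else np.nan

# Hypothetical usage: compare the correlation distribution of interacting
# pairs against non-interacting pairs, then rank communities within one
# expression data set (e.g. amino acid starvation). The pair lists,
# community memberships and expression matrix would come from the
# interaction network, community detection and expression data respectively.
```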

I had a great time with some lovely people and produced something that I thought was very interesting. I really hope I see this work pop up again and get taken to interesting places! So long, and thanks for all the cookies!

Click here for some more pretty plots and a code repository (by request only).

Journal Club: Accessing Protein Conformational Ensembles using RT X-ray Crystallography

This week I presented a paper that investigates the differences between crystallographic datasets collected from crystals at RT (room-temperature) and crystals at CT (cryogenic temperatures). Full paper here.

The cooling of protein crystals to cryogenic temperatures is widely used as a method of reducing radiation damage and enabling collection of whole datasets from a single crystal. In fact, this approach has been so successful that approximately 95% of structures in the PDB have been collected at CT.

However, the main assumption of cryo-cooling is that the “freezing”/cooling process happens quickly enough that it does not disturb the conformational distributions of the protein, and that the RT ensemble is “trapped” when cooled to CT.

Although it is well established that cryo-cooling of the crystal does not distort the overall structure or fold of the protein, this paper investigates some of the more subtle changes that cryo-cooling can introduce, such as the distortion of sidechain conformations or the quenching of dynamic CONTACT networks. These features of proteins could be important for the understanding of phenomena such as binding or allosteric modulation, and so accurate information about the protein is essential. If this information is regularly lost in the cryo-cooling process, it could be a strong argument for a return to collection at RT where feasible.

By using the RINGER method, the authors find that the sidechain conformations are commonly affected by the cryo-cooling process: the conformers present at CT are sometimes completely different to the conformers observed at RT. In total, they find that cryo-cooling affects a significant number of residues (predominantly those on the surface of the protein, but also those that are buried). 18.9% of residues have rotamer distributions that change between RT and CT, and 37.7% of residues have a conformer that changes occupancy by 20% or more.
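
A toy version of the bookkeeping behind those percentages might look like the sketch below. The occupancy tables are invented, and RINGER itself works on electron density, which this sketch does not touch:

```python
def changed_residues(occ_rt, occ_ct, threshold=0.2):
    """Count residues whose rotamer occupancies differ between RT and CT.

    occ_rt, occ_ct : dict mapping residue id -> dict of rotamer -> occupancy
    Returns residues where any rotamer's occupancy changes by >= threshold.
    """
    changed = []
    for res, rot_rt in occ_rt.items():
        rot_ct = occ_ct.get(res, {})
        rotamers = set(rot_rt) | set(rot_ct)
        if any(abs(rot_rt.get(r, 0.0) - rot_ct.get(r, 0.0)) >= threshold
               for r in rotamers):
            changed.append(res)
    return changed

# Example with made-up occupancies for two residues:
rt = {"A45": {"t": 0.7, "g+": 0.3}, "A80": {"t": 1.0}}
ct = {"A45": {"t": 0.2, "g+": 0.8}, "A80": {"t": 1.0}}
print(changed_residues(rt, ct))   # ['A45']
```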

Overall, the authors conclude that, where possible, datasets should be collected at RT, as the derived models offer a more realistic description of the biologically-relevant conformational ensemble of the protein.

At this week’s group meeting I presented on my second SABS short project, which is supervised by Charlotte Deane, Mason Porter, and Jonny Wray from e-Therapeutics. It has the title “Multilayer-Network Analysis of Protein Interaction Networks”.
Protein interactions can be represented using networks. Accordingly, approaches that have been developed in network science are appropriate for the analysis of protein interactions, and they can lead to the detection of new drug targets. Thus far, only ordinary (“monolayer”) protein interaction networks have been exploited for drug discovery. However, because “multilayer networks” allow the representation of multiple types of interactions and of time-dependent interactions, they have the potential to improve insight from network-based approaches [1].
The aim of my project was to apply known multilayer methods to well-established data to investigate potential use cases of multilayer protein interaction networks. We focussed on various community detection methods [3,4] to find groups of proteins that are candidates for functional biological modules. Additionally, temporal centrality measures [5] were used to identify important proteins across time.
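
For a flavour of the temporal-centrality idea, here is a generic supra-adjacency construction with a leading-eigenvector centrality on top. This is not the exact formulation of Taylor et al. [5], and the coupling scheme and any data fed to it are assumptions of mine:

```python
import numpy as np

def supra_adjacency(layers, omega=1.0):
    """Couple T adjacency matrices (one per time layer) into one big matrix.

    layers : list of (n x n) symmetric numpy arrays, one per time point
    omega  : coupling strength between copies of the same node in
             consecutive layers
    """
    T, n = len(layers), layers[0].shape[0]
    S = np.zeros((T * n, T * n))
    for t, A in enumerate(layers):
        S[t * n:(t + 1) * n, t * n:(t + 1) * n] = A
    idx = np.arange(n)
    for t in range(T - 1):                      # inter-layer coupling
        S[t * n + idx, (t + 1) * n + idx] = omega
        S[(t + 1) * n + idx, t * n + idx] = omega
    return S

def temporal_centrality(layers, omega=1.0):
    """Leading-eigenvector centrality of the supra-adjacency matrix,
    reshaped to (time, node) so each protein gets a score per layer."""
    S = supra_adjacency(layers, omega)
    vals, vecs = np.linalg.eigh(S)
    v = np.abs(vecs[:, np.argmax(vals)])
    return v.reshape(len(layers), -1)
```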

References:
[1] Kivelä, Mikko, et al. “Multilayer networks.” Journal of Complex Networks (2014)
[2] Calvano, Steve E., et al. “A network-based analysis of systemic inflammation in humans.” Nature (2005)
[3] Peixoto, Tiago P. “Efficient Monte Carlo and greedy heuristic for the inference of stochastic block models.” PRE (2014)
[4] Mucha, Peter J., et al. “Community structure in time-dependent, multiscale, and multiplex networks.” Science (2010)
[5] Taylor, Dane, et al. “Eigenvector-Based Centrality Measures for Temporal Networks.” arXiv preprint (2015)

ERC starting grant mock interview

Yesterday’s group meeting got transformed into a mock interview for the final evaluation step of my ERC starting grant application. To be successful with an ERC starting grant application one has to pass three evaluation steps.

The first step consists of a short research proposal (max 5 pages), a CV, a record of successful previous grant applications, an early achievements track record, and a publication list. If a panel of reviewers (usually around 4-6) decides that this is “excellent” (this word being the main evaluation criterion), the application is transferred to step two.

In step two, the full scientific proposal is evaluated. An unfair part of the procedure is that if step one is not successful, the full proposal is not even read (although it had to be submitted together with step one).

Fortunately, my proposal passed steps one and two. The final hurdle will be a 10-minute interview plus 15 minutes of questions in Brussels, where the final decision will be made.

I already had one mock interview with some of the 2020 research fellows (thanks to Konrad, Remi, and Laurel), one with David Gavaghan, and the third one took place yesterday with our whole research group.

After those three mock interviews I hope to be properly prepared for the real interview!

Slow and steady improvements in the prediction of one-dimensional protein features

What do you do when you have a big, complex problem whose solution is not necessarily trivial? You break the problem into smaller, easier-to-solve parts, solve each of these sub-problems, and merge the results to find the solution of the original, bigger problem. This is an algorithm design paradigm known as the divide and conquer approach.

In protein informatics, we use divide and conquer strategies to deal with a plethora of large and complicated problems. From protein structure prediction to protein-protein interaction networks, we have a wide range of sub and sub-sub problems whose solutions are supposed to help us with the bigger picture.

In particular, prediction of the so-called one-dimensional protein features is a fundamental sub-problem with a wide range of applications, such as protein structure modelling, homology detection, and functional characterisation. Here, one-dimensional protein features refer to secondary structure, backbone dihedral and C-alpha angles, and solvent accessible surface area.

In this week’s group meeting, I discussed the latest advancements in prediction of one-dimensional features as described in an article published by Heffernan R. and colleagues in Scientific Reports (2015):

“Improving prediction of secondary structure, local backbone angles, and solvent accessible surface area of proteins by iterative deep learning.”

In this article, the authors describe the implementation of SPIDER2, a deep learning approach to predict secondary structure, solvent accessible surface area, and four backbone angles (the traditional dihedrals phi and psi, and the recently explored theta and tau).

“Deep learning” is the buzzword (buzz-two-words or buzz-sentence, maybe?) of the moment. For those of you who have no idea what I am talking about, deep learning is an umbrella term for a family of (admittedly convoluted) machine learning methods based on neural networks. The term “deep” comes from the multiple hidden layers of neurons used during learning.
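
To make “multiple hidden layers” concrete, here is a bare-bones sketch of a window-based per-residue classifier of the general flavour used for secondary structure prediction. It is numpy-only, with untrained random weights, and is emphatically not SPIDER2’s architecture or feature set; the window size, layer sizes and feature count are all placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# A sliding window of per-residue features (e.g. a PSSM column per residue)
# is flattened and passed through several hidden layers; the output layer
# gives probabilities for helix (H), strand (E) and coil (C).
window, n_features, n_classes = 15, 20, 3
sizes = [window * n_features, 150, 150, n_classes]     # two hidden layers
weights = [rng.normal(0, 0.1, (a, b)) for a, b in zip(sizes, sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

def predict(window_features):
    """window_features: (window, n_features) array centred on one residue."""
    h = window_features.reshape(-1)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)
    return softmax(h @ weights[-1] + biases[-1])

probs = predict(rng.normal(size=(window, n_features)))
print(dict(zip("HEC", probs.round(3))))
```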

Deep learning is a very fashionable term for a reason. These methods have been shown to produce state-of-the-art results for a wide range of applications in several fields, including bioinformatics. As a matter of fact, one of the leading methods for contact prediction (previously introduced in this blog post) uses a deep learning approach to improve the precision of predicted protein contacts.

Machine learning has already been explored to predict one-dimensional protein features, showing promising (and, more importantly, useful) results. With the emergence of new, more powerful machine learning techniques such as deep learning, previous software is becoming obsolete.

Based on this premise, Heffernan R. and colleagues implemented and applied their deep learning approach to improve the prediction of one-dimensional protein features. Their training process was rigorous: they performed a 10-fold cross validation using their training set of ~4500 proteins and, on top of that, they also had two independent test sets (a ~1200-protein test set and a set based on the targets of CASP11). Proteins in all sets shared no more than 25% sequence identity (30% for the CASP set) with any other protein in any of the sets.

The method described in the paper, SPIDER2, was thoroughly compared with state-of-the-art prediction software for each of the one-dimensional protein features that it is capable of predicting. The results show that SPIDER2 achieves a small, yet significant, improvement compared to other methods.

It is just like they say, slow and steady wins the race, right? In this case, I am not so sure. It would be interesting to see how much the small increments in precision obtained by SPIDER2 can improve the bigger picture, whichever your bigger picture is. The thing about divide and conquer is that if you become marginally better at solving one of the parts, that doesn’t necessarily imply that you will improve the solution of the bigger, main problem.

If we think about it, during the “conquer” stage (that is, when you are merging the solution of the smaller parts to get to the bigger picture),  you may make compromises that completely disregard any minor improvements for the sub-problems. For instance, in my bigger picture, de novo protein structure prediction, predicted local properties can be sacrificed to ensure a more globally consistent model. More than that, most methods that perform de novo structure prediction already account for a certain degree of error or uncertainty for, say, secondary structure prediction. This is particularly important for the border regions between secondary structure elements (i.e. where an alpha-helix ends and a loop begins). Therefore, even if you improve the precision of your predictions for those border regions, the best approach for structure prediction may still consider those slightly more precise border predictions as unreliable.

The other moral of this story is far more pessimistic. If you think about it, there have been significant advancements in machine learning, which have led to the creation of ever more complicated neural network architectures. However, when we look at how much improvement these highly elaborate techniques brought when applied to an old problem (prediction of one-dimensional protein features), the pay-off wasn’t as significant as I would expect. Maybe I am a glass-half-empty kind of guy, but given the buzz surrounding deep learning, I think minor improvements are a bit of a let-down. That is not to take any credit away from the authors: their work was rigorous and scientifically very sound. It is just that maybe we are reaching our limits when it comes to applying machine learning to predict secondary structure. Maybe when the next generation of buzzword-worthy machine learning techniques appears, we will observe an even smaller improvement in secondary structure prediction. Which leaves a very bitter unanswered question in all our minds: if machine learning is not the answer, what is?

Predicted protein contacts: is it the solution to (de novo) protein structure prediction?

So what is this buzz I hear about predicted protein contacts? Is it really the long awaited solution for one of the biggest open problems in biology today? Has protein structure prediction been solved?

Well, first things first. Let me give you a quick introduction to this predicted protein contact business (probably not quick enough for an elevator pitch, but hopefully you are not reading this in an elevator).

Nowadays, the scientific community has become very good at sequencing things (and by things I mean genetic things, like whole genomes of a bunch of different people and organisms). We are so good at it that mountains of sequence data are now available: genes, mRNAs, protein sequences. The question is what do we do with all this data?

Good scientists are coming up with new and creative ideas to extract knowledge from these mountains of data. For instance, one can build multiple sequence alignments using protein sequences for a given protein family. One of the ways in which information can be extracted from these multiple sequence alignments is by identifying extremely conserved columns (think of the alignment as a big matrix). Residues in these conserved positions are good candidates for being functionally important for the proteins in that particular family.

Another interesting thing that can be done is to look for pairs of residues that mutate in a correlated fashion. In more practical terms, you are ascertaining how correlated the information in two columns of a multiple sequence alignment is: how often a change in one of them is countered by a change in the other. Why would anyone care about that? Simple. There is an assumption that residues that mutate in a correlated fashion are co-evolving. In other words, they share some sort of functional dependence (e.g. spatial proximity) that is under selective pressure.
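
A very small illustration of measuring how correlated two alignment columns are, using mutual information on a toy alignment (the real methods discussed below are considerably more sophisticated, but the basic idea of comparing columns is the same):

```python
from collections import Counter
from math import log2

def column_mutual_information(col_a, col_b):
    """Mutual information between two columns of an MSA (lists of residues)."""
    n = len(col_a)
    pa, pb = Counter(col_a), Counter(col_b)
    pab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in pab.items():
        p_ab = c / n
        mi += p_ab * log2(p_ab / ((pa[a] / n) * (pb[b] / n)))
    return mi

# Toy alignment of 4 sequences: positions i and j change together,
# position k varies independently of i.
col_i = ["A", "A", "S", "S"]
col_j = ["L", "L", "F", "F"]   # perfectly co-varying with i
col_k = ["G", "P", "G", "P"]   # unrelated to i
print(column_mutual_information(col_i, col_j))  # 1.0 bit
print(column_mutual_information(col_i, col_k))  # 0.0 bits
```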

Ok, that was a lot of hypotheticals; does it work? For many years, it didn’t. There were lots of issues with the way these correlations were computed, and one of the biggest problems was to identify (and correct for) transitivity. Transitivity is the effect whereby you observe a spurious correlation between residues A and C simply because residues A,B and residues B,C are mutating in a correlated fashion. As more powerful statistical methods were developed (borrowing some ideas from statistical mechanics), the transitivity issue has seemingly been solved.

The newest methods that detect co-evolving residues in a multiple sequence alignment are capable of detecting protein contacts with high precision. In this context, a contact is defined as two residues that are close together in a protein structure. How close?  Their C-betas must be 8 Angstroms or less apart. When sufficient sequence information is available (at least 500 sequences in the MSA), the average precision of the predicted contacts can reach 80%.
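
The contact definition itself is simple to compute from a structure. Here is a sketch using the C-beta ≤ 8 Å criterion, taking pre-extracted C-beta coordinates as input and ignoring details such as glycines (which have no C-beta); the sequence-separation filter and the precision helper are my own additions for illustration:

```python
import numpy as np

def contact_map(cb_coords, cutoff=8.0, min_separation=6):
    """Boolean contact map from C-beta coordinates.

    cb_coords      : (N, 3) array of C-beta positions, one row per residue
    cutoff         : contact distance threshold in Angstroms
    min_separation : ignore pairs close in sequence (trivial contacts)
    """
    diff = cb_coords[:, None, :] - cb_coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    n = len(cb_coords)
    seq_sep = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    return (dist <= cutoff) & (seq_sep >= min_separation)

def precision(predicted_pairs, true_map):
    """Fraction of predicted residue pairs that are true contacts."""
    hits = sum(true_map[i, j] for i, j in predicted_pairs)
    return hits / len(predicted_pairs) if predicted_pairs else 0.0
```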

This is a powerful way of converting sequence information into distance constraints, which can be used for protein structure modelling. If a sufficient number of correct distance constraints is used, we can accurately predict the topology of a protein [1]. Recently, we have also observed great advances in the way that models are refined (that is, refining a model that contains the correct topology to atomic, near-experimental resolution). If you put those two things together, we start to look at a very nice picture.

So what’s the catch? The catch was there. Very subtle. “When sufficient sequence information is available”. Currently, it is estimated that only 15% of de novo protein structure prediction cases present sufficient sequence information for the prediction of protein contacts. One potential solution would be to sit and wait for more and more sequences to be obtained. Yet a potential pitfall of sitting and waiting is that there is no guarantee that we will ever have sufficient sequence information for a large number of protein families, as many may well have fewer than 500 members.

Furthermore, scientists are not very good at sitting around and waiting. They need to keep themselves busy. There are many things that the community as a whole can invest time in while we wait for more sequences to be generated. For instance, we want to be sure that, for the cases where there is a sufficient number of sequences, we get the modelling step right (and predict the accurate protein topology). Predicted contacts also show potential as a tool for quality assessment and may prove to be a nice way of ascertaining how much confidence to place in a model’s topology. More than that, model refinement still needs to improve if we want to make sure that we get from the correct topology to near-experimental resolution.

Protein structure prediction is a hard problem and, with so much room for improvement, we still have a long way to go. Yet this predicted contact business is a huge step in the right direction. Maybe it won’t be long before models generated ab initio are considered as reliable as the ones generated using a template. Who knows what promises the future holds.

References:

[1] Kim DE, Dimaio F, Yu-Ruei Wang R, Song Y, Baker D. One contact for every twelve residues allows robust and accurate topology-level protein structure modeling. Proteins. 2014 Feb;82 Suppl 2:208-18. doi: 10.1002/prot.24374. Epub 2013 Sep 10.


Modelling antibodies, from Sequence to Structure…

Antibody modelling has come a long way in the past 5 years. The Antibody Modelling Assessment (AMA) competitions (effectively an antibody version of CASP) have shown that most antibody modelling methods can model the antibody variable fragment (Fv) to ≤ 1.5 Å. Despite this feat, AMA-II provided two important lessons:

1. We can still improve our modelling of the framework region and the canonical CDRs.

Stage two of the AMA-II competition showed that CDR-H3 modelling improved once the correct crystal structure was provided (bar the H3 loop, of course). In addition, some of the canonical CDRs (e.g. L1) were modelled poorly, as were some of the framework loops.

2. We can’t treat orientation as if it doesn’t exist.

Many pipelines are either vague about how they predict the orientation or give no explicit explanation of how it is predicted for the model structure. Given how important the orientation can be for the antibody’s binding mode (Fera et al., 2014), it’s clear that this part of the pipeline has to be revisited more carefully.

In addition to these lessons, one question remains:

What do we do with these models?

As far as we are aware, no pipeline comments on what we should do beyond creating the model. What are its implications? Can we even use it for experiments, or take it forward as a potential therapeutic in the long term? In light of these lessons and this blaring question, we developed our own method.

Before we begin, how does modelling work?

In my mind, most, if not all, pipelines follow this generic paradigm:

Our method, ABodyBuilder, also follows this 4-step workflow:

  1. We choose the template structure based on sequence identity; below a threshold, we predict the structures of the heavy and light chains separately (a small sketch of this step follows the list below)
  2. In the event that we use the structures from separate antibodies, we predict the orientation from the structure with the highest global sequence identity.
  3. We model the loops using FREAD (Choi, Deane, 2011)
  4. We graft the side chains on using SCWRL.
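
For step 1, a miniature version of template selection by sequence identity might look like this. The sequences are toy, pre-aligned fragments and the template dictionary is made up; this is not ABodyBuilder’s actual code, which of course aligns properly and uses a curated template database:

```python
def percent_identity(seq_a, seq_b):
    """Percent identity of two pre-aligned, equal-length sequences."""
    matches = sum(a == b for a, b in zip(seq_a, seq_b) if a != "-" and b != "-")
    aligned = sum(1 for a, b in zip(seq_a, seq_b) if a != "-" and b != "-")
    return 100.0 * matches / aligned if aligned else 0.0

def choose_template(query, templates, threshold=80.0):
    """Return the best template, or None if nothing passes the threshold
    (in which case the chains would be modelled separately)."""
    best_name, best_id = max(
        ((name, percent_identity(query, seq)) for name, seq in templates.items()),
        key=lambda x: x[1])
    return (best_name, best_id) if best_id >= threshold else (None, best_id)

# Made-up, pre-aligned toy sequences:
templates = {"1ABC": "EVQLVESGGGLVQ", "2XYZ": "QVQLQQSGAELVR"}
print(choose_template("EVQLVESGGGLVK", templates))   # ('1ABC', ~92.3)
```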

Following the modelling procedure, our method also annotates the accuracy of the model in a probabilistic context, i.e. an estimated probability that a particular region is modelled within a given RMSD threshold. Moreover, we also flag up any issues that an experimentalist could run into should they decide to develop the antibody.

The accuracy estimation is a data-driven estimate of model quality. Many pipelines just give you a model, with no way of determining its accuracy until the native structure is determined. This is particularly problematic for CDR-H3, where RMSDs between models and native structures can exceed 4.0 Å, and it would be incredibly useful to have an a priori estimate of expected model accuracy.
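
One way to think about such a data-driven estimate is as an empirical P(RMSD ≤ threshold) computed from previously benchmarked models in a comparable situation. The sketch below is only an illustration of that idea, with invented benchmark values, and is not how ABodyBuilder’s annotation is actually computed:

```python
import numpy as np

def p_within(benchmark_rmsds, threshold):
    """Empirical probability that a model of this kind falls within
    `threshold` Angstroms of the native structure, given benchmark RMSDs
    of comparable models (e.g. same loop length, similar template identity)."""
    r = np.asarray(benchmark_rmsds, dtype=float)
    return float((r <= threshold).mean())

# Invented benchmark RMSDs for CDR-H3 loops of a given length:
cdrh3_rmsds = [1.2, 2.8, 0.9, 3.5, 4.2, 1.7, 2.1, 5.0]
print(p_within(cdrh3_rmsds, 2.5))   # 0.5
```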

Furthermore, by commenting on motifs that can conflict with antibody development, we aim to offer a convenient solution for users when they are considering in vitro experiments with their target antibody. Ultimately, ABodyBuilder is designed with the user in mind, making an easy-to-use, informative software that facilitates antibody modelling for novel applications.

Journal Club: Spontaneous transmembrane helix insertion thermodynamically mimics translocon-guided insertion

Many methods are available for predicting the topology of transmembrane helices; this is one of the success stories of protein structure prediction, with accuracies over 90%. However, there is still disagreement in some areas about the partitioning between the states of being dissolved in water and being positioned across a lipid bilayer. Complications arise because there are so many methods of measuring the thermodynamics of this transition: experimental and theoretical, in vivo and in vitro. It is also uncertain what difference the translocon makes to the energetics of insertion: is the topology and conformation of a membrane protein the global thermodynamic minimum, or just a kinetic product?

This paper uses three approaches to measure partitioning and to test the agreement between different methods. The authors aim to reconcile the differences calculated so far for insertion of an arginine residue into the membrane (ranging from +2 to +15 kcal/mol). This is an important question because many transmembrane helices are only marginally hydrophobic, and it is not known how and when they insert during the folding process. Arginine was chosen here because the pKa of its side chain (around 12.5) is so high that it will not deprotonate in the centre of a bilayer, so complications of protonation and deprotonation do not need to be considered. The same peptide is used for each method, of the form LnRLn, and the ratio between the interface and transmembrane states is used to calculate estimates of ΔG. To make sure there were helices with a ΔG close to zero, allowing accurate estimates, they used a range of values of n from 5 to 8.
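
For reference, the partitioning ratio is converted to an apparent free energy in the standard way (notation mine, with f_TM and f_interface the measured fractions in the transmembrane and interface states):

```latex
\Delta G_{\mathrm{app}} \;=\; -RT \,\ln\!\left(\frac{f_{\mathrm{TM}}}{f_{\mathrm{interface}}}\right)
```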

The first method was an insertion assay using reconstituted microsomes, in which the helix was inserted into the luminal domain of LepB. A glycosylation site was added at each end of the helix, but glycosylation takes place only on sites inside the microsome. Helices inserted into the membrane are glycosylated once, whereas secreted helices are glycosylated twice, and those that did not go through the translocon are not glycosylated at all. SDS-PAGE can separate these states by mass, and the ratio between single and double glycosylation gives the partitioning between inserted and interface helices among those that entered the translocon. As expected, the trend is for longer helices with more leucine to favour the transmembrane state.

Adapted from Figure 4a: The helix, H, either passes through the translocon into the lumen (“S”) resulting in two glycosylations (green pentagons), or is inserted (TM) resulting in one glycosylation.

The second method was also experimental: oriented synchrotron radiation circular dichroism (OSRCD). Here they used just the peptide, with one glycine at each end, as this would be able to equilibrate between the two states quickly. Theoretical spectra can be calculated for a helix in the transmembrane and interface orientations, and the ratio in which these must be combined to reproduce the measured spectrum for a given peptide therefore gives the ratio of transmembrane and interface states present.

Figure 2b: TM and IP are the theoretical spectra for the transmembrane and interface states, and the peptides fall somewhere in between.

Finally, the authors present 4 μs molecular dynamics simulations of the same peptides at 140°C, so that equilibration between the two states would be fast. The extended peptide at the start of the simulation quickly associates with the membrane and adopts a helical conformation. An important observation is that the transmembrane state actually sits at around 30° to the membrane normal, allowing the charged guanidinium group of the arginine to “snorkel” up and interact with the charged phosphate groups of the lipids. This tilted state is defined as transmembrane, in contrast to the OSRCD experiments, where the theoretical TM spectrum was calculated for a helix perpendicular to the membrane. This may be a source of some inaccuracy in the propensities calculated from OSRCD.

Figure 2c: Equilibration in the simulation for the L7RL7 peptide. Transmembrane and interface states are seen in the partitioning and equilibration phases after the helix has formed.

Figure 3c: As the simulations run, the proportion of helices in the transmembrane state (PTM) converges to a different value for each peptide.
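
Extracting PTM from such a simulation reduces to classifying frames and counting. The schematic below assumes the frame classification has already been boiled down to a per-frame helix tilt angle and arginine depth (both invented quantities and thresholds, not the authors’ exact criteria):

```python
import numpy as np

def p_tm(tilt_deg, arg_depth_nm, tilt_cutoff=60.0, depth_cutoff=0.5):
    """Fraction of frames in the transmembrane state.

    tilt_deg     : per-frame helix tilt angle from the membrane normal
    arg_depth_nm : per-frame distance of the arginine from the bilayer centre
    A frame counts as TM if the helix is roughly upright and the arginine
    sits near the bilayer centre (thresholds are arbitrary here).
    """
    tm = (np.asarray(tilt_deg) < tilt_cutoff) & \
         (np.asarray(arg_depth_nm) < depth_cutoff)
    return float(tm.mean())

def running_p_tm(tilt_deg, arg_depth_nm):
    """Convergence check: P_TM computed over a growing window of frames."""
    tm = (np.asarray(tilt_deg) < 60.0) & (np.asarray(arg_depth_nm) < 0.5)
    return np.cumsum(tm) / np.arange(1, len(tm) + 1)
```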

Overall, the ΔG values calculated from experiment and from molecular dynamics (MD) simulation agree very well. In fact, they agree better than those from previous studies of a similar format looking at polyleucine helices, where there was a consistent offset of 2 kcal/mol between the experiment- and simulation-derived values. The authors are unable to explain why the agreement in this study is better, but they indicate that it is unlikely to be related to any stabilisation by dimerisation in the experimental results, as a 4 μs MD simulation of two helices did not show them forming stable interactions. The difference in insertion energy (ΔΔG) on replacing a leucine with arginine is calculated to be +2.4 to +4.3 kcal/mol by experiment and +5.4 to +6.8 kcal/mol by simulation, depending on the length of the peptide (it is a more costly substitution for longer peptides, as the charge is buried deeper). The difference between the experimental and simulation results is accounted for by their disagreement in the polyleucine study.

We thought this paper was a great example of experimental design, where the system was carefully chosen so that different experimental and theoretical approaches would be directly comparable. The outcome is good agreement between the methods, demonstrating that the vastly different values recorded previously seem to be because very different questions were being asked.

Journal Club: AbDesign. An algorithm for combinatorial backbone design guided by natural conformations and sequences

Computational protein design methods often use a known molecule with a well-characterised structure as a template or scaffold. The chosen scaffold is modified so that its function (e.g. what it binds) is repurposed. Ideally, one wants to be confident that the expressed protein’s structure is going to be the same as the designed conformation. Therefore, successful designed proteins tend to be rigid, formed of collections of regular secondary structure (e.g. α-helices and β-sheets) and have active site shapes that do not perturb far from the scaffold’s backbone conformation (see this review).

A recent paper (Lapidoth et al 2015) from the Fleishman group proposes a new protocol to incorporate backbone variation (read loop conformations) into computational protein design (Figure 1). Using an antibody as the chosen scaffold, their approach aims to design a molecule that binds a specific patch (epitope) on a target molecule (antigen).

Figure 1 from Lapidoth et al 2015 shows an overview of the AbDesign protocol

Protein design works in the opposite direction to structure prediction, i.e. given a structure, tell me what sequence will allow me to achieve that shape and to bind a particular patch in the way I have chosen. To do this, one first needs to select a shape that could feasibly be achieved in vivo. We would hope that a backbone conformation previously seen in the Protein Data Bank is one of such a set of feasible shapes.

Lapidoth et al sample conformations by constructing a backbone torsion angle database derived from known antibody structures in the PDB. From the work of North et al and others, we also know that certain loop shapes can be achieved by multiple different sequences (see KK’s recent post). The authors therefore reduce the number of possible backbone conformations by clustering them by structural similarity. Each conformational cluster is represented by a representative structure and a position-specific substitution matrix (PSSM). The PSSM captures how the sequence can vary whilst maintaining the shape.
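
A stripped-down version of that clustering-plus-PSSM idea is sketched below: naive k-means on flattened torsion vectors (which ignores angle periodicity, unlike any serious implementation) followed by a per-cluster position-specific frequency matrix. The clustering metric, pseudocount and all inputs are my own placeholders, not the authors’ protocol:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(sequences, pseudocount=1.0):
    """Position-specific frequency matrix (L x 20) from same-length sequences
    of loops that share a backbone conformation cluster."""
    L = len(sequences[0])
    pssm = np.full((L, len(AMINO_ACIDS)), pseudocount)
    for seq in sequences:
        for i, aa in enumerate(seq):
            pssm[i, AMINO_ACIDS.index(aa)] += 1
    return pssm / pssm.sum(axis=1, keepdims=True)

def cluster_by_torsions(torsions, n_clusters=3, n_iter=50, seed=0):
    """Naive k-means on flattened (phi, psi) vectors of equal-length loops."""
    X = np.asarray(torsions, dtype=float)
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None, :] - centres[None]) ** 2).sum(-1), axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centres[k] = X[labels == k].mean(axis=0)
    return labels

# Hypothetical use: cluster loops of equal length by their torsion vectors,
# then build one PSSM per cluster from the corresponding loop sequences.
```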

The Rosetta design pipeline that follows uses the pre-computed torsion database to make a scaffold antibody structure (1x9q) adopt different backbone conformations. Proposed sequence mutations are sampled from the PSSM corresponding to each conformation. Shapes, and the sequences that can adopt them, are ranked with respect to a docked pose with the antigen using several structure-based filters and Rosetta energy scores. A trade-off is made between predicted binding and stability energies using a ‘fuzzy logic’ scheme.

After several rounds of optimisation the pipeline produces a predicted structure and sequence that should bind the chosen epitope patch and fold to form a stable protein when expressed. The benchmark results show promise in terms of structural similarity to known molecules that bind the same site (polar interactions, buried surface area). Sequence similarity between the predicted and known binders is perhaps lower than expected. However, as different natural antibody molecules can bind the same antigen, convergence between a ‘correct’ design and the known binder may not be guaranteed anyway.

In conclusion, my take home message from this paper is that, to sensibly sample backbone conformations for protein design, one should use the variation seen in known structures. The method presented demonstrates a way of predicting more structurally diverse designs and sampling the sequences that will allow the protein to adopt these shapes. Although, as the authors highlight, it is difficult to assess the performance of the protocol without experimental validation, important lessons can be learned for the computational design of both antibodies and proteins in general.