PLOS Comp Bio’s front page!

The work of our molecular dynamic expert, Bernhard Knapp, has made the front page of the PLOS Computational Biology . The image shows how CDR3 loops scan a peptide/MHC complex (Figure below: Snapshot from the website). Immune cells in the human body screen other cells for possible infections. The binding of T-cell receptors (TCR) and parts of pathogens bound by major histocompatibility complexes (MHC) is one of the activation mechanisms of the immune system. There have been many hypotheses as to when such binding will activate the immune system. In this study we performed the, to our knowledge, largest set of Molecular Dynamics simulations of TCR-MHC complexes. We performed 172 simulations each of 100 ns in length. By performing a large number of simulations we obtain insight about which structural features are frequently present in immune system activating and non-activating TCR-MHC complexes. We show that many previously suggested structural features are unlikely to be causal for the activation of the human immune system.

How CDR3 loops scan a peptide/MHC complex. Rendering is based on the 100 ns simulation of the wild-type TCRpMHC complex. Blue: peptide; Solid white cartoon and transparent surface: first frame of the MHC helices and β-sheet floor; Orange: equally distributed frames of the CDR3s over simulation time. Transparent white cartoon: first frame of all other TCR regions.

How CDR3 loops scan a peptide/MHC complex. Rendering is based on the 100 ns simulation of the wild-type TCRpMHC complex. Blue: peptide; Solid white cartoon and transparent surface: first frame of the MHC helices and β-sheet floor; Orange: equally distributed frames of the CDR3s over simulation time. Transparent white cartoon: first frame of all other TCR regions.


Introduction and Methods:

Today I prsented the paper which introduces HHsearch. HHSearch is mainly used for template detection and still is one of the best known methods. The early methods for template detection simply performed sequence-sequence comparison between a query protein and a database of targets: such as BLAST or FASTA. But then more advanced methods such as PSI-BLAST were introduced which perform sequence-profile comparisons. Profiles are build using multiple sequence alignments (MSA) of protein families which describes the probability of the occurrence of an amino acid in a column of a MSA. In other words, profiles describe the conservation of amino acids among families. A remarkable advance in template detection was introduced by methods performing profile-profile comparisons such as COMPASS, PROF-SIM and LAMA.

Hidden Markov Model (HMM) introduced a new way of building profiles which resulted in methods performing sequence-HMM comparisons to detect templates. A HMM is similar to a state machine and is build using a MSA where each column is given a ‘M’ (match) state. A match state emits amino acids based on probability calculated from the MSA. In addition of a match state, all columns of the MSA will have an ‘I’ (insert) state and ‘D’ (delete/gap) state (See below figure for an example). There is a transition between states (shown by arrows) where all transitions also have probabilities. Having a sequence and a HMM, the sequence can be alignment on the HMM. In other words there is path in the HMM which associates to emitting the sequence and a log-odd score associated with this path. Dynamic programming (Viterbi algorithm) is used to detect this path (similar to needleman and wunsch algorithm for sequence-sequence alignments). More detail can be found here.

Example of an HMM. Taken from Bioinformatics Sequence and Genome Analysis, David W. Mount (

Example of an HMM. Taken from Bioinformatics Sequence and Genome Analysis, David W. Mount

The novel idea of HHsearch was that instead of performing sequence-HMM alignment, HMM-HMM alignments can be used for template detection. Therefore, they first discuss the formula used to calculate the log-odd scores which are required by the Viterbi algorithm to find the best aligned path. The score of aligning two columns in two HMMs (query profile q and template profile t) are calculated as:


Using this score, the best alignment between two profile HMMs is generated by maximizing the log-sum-odds score as in the formula below:


Now that the scoring function is defined, Viterbi algorithm is used to maximize this function. For simplicity HHsearch has described five pair states and the allowed transition between them (see figure 1.C of the paper). Therefore five dynamic program matrices are needed to align the two HMMs.

There are two other main parameters that can contribute to the final score of the HHsearch functions: 1) Correlation score: this score is based on the idea that if two proteins are homologs then once they are aligned high score columns (conserved columns) should cluster together. Which means the higher this score is the more homolog the sequences are. This score can be simply added to the ‘final best alignment score’ from the Viterbi algorithm. 2) While aligning two columns of the two HMMs, Secondary Structure (SS) elemnts are scored using statistical scores generated by the authors, which take into account the confidence values of the SS predictions. HHsearch provides two set of scores for SS comparison: 1- predicted vs predicted 2- predicted vs known. The second one is mainly used when performing 3D structure prediction. These acores are added to the Viterbi algorithm scoring functions in the formula 5,6 and 7 of the paper.



HHsearch performance was compared to BLAST(sequence-sequence), PSI-BLAST(profile-sequence) , HMMER(HMM-sequence) and COMPASS and PROF-SIM(profile-profile) methods. five version of HHsearch was benchmarked:

HHserach 0 -> basic profile-profile comparison with fix gap penalties
HHserach 1 -> basic HMM-HMM comparison
HHsearch 2 -> HMM-HMM comparison including correlation score
HHsearch 3 -> HMM-HMM comparison including correlation score + predicted vs predicted SS
HHsearch 4 -> HMM-HMM comparison including correlation score + predicted vs known SS

The dataset used in the comparison study consists of 3691 sequences taken from the SCOP database filtered at 20% sequence identity (SCOP-20). For each sequence an alignment is prepared using PSI-BLAST. These alignments are used to compare methods.

Detection of homolog:

The ability of methods in detecting homologs are compared against each other. Homolog and non-homolog definitions are: If two domains of SCOP-20 are in the same SCOP superfamily then they are considered as homologs. If the two domains are from different classes they are considered as non-homologs. All other pairs are classified as unknown.
The performance are compared by drawing TP (number ofhomolog pairs) vs FP (number of non-homolog pairs) curves, for all-against-all comparison of data in SCOP-20. The highest curves represent the best performing method. This curve is shown in Figure 2 of the paper. In this dataset the total number of TP are 41505 and the total number of FP are 1.08 x 10^7. The worse method is BLAST which at a error rate of 10% will only find 22% of the homologs. carrying on from BLAST the detection ability grows as:

BLAST < PSI_BLAST < HMMER < PROF-SIM < COMPASS < HHsearch0 < HHsearch1 < HHsearch2 < HHsearch3 < HHsearch4

Studying the results from HHsearch in detail the authors realised that in some cases HHSearch (with high confidence) groups two domains from different superfamily or even different folds as homologs. Looking at the structures of these proteins the authors noticed that the structure of these proteins are either locally or globally similar. Therefore, proteins defined by SCOP to be in different superfamilies or fold might actually be homologs. Therefore, they repeated the same test but changing the definition of TP and F:

-homologs (TPs) are domains from the same superfamily OR if their sequence-based alignment resulted in an structure alignment with MaxSub score of 0.1. (MaxSub ranges from 0-1, where 0 is insignificant and 1 very significant.)
-non-homologs (FPs) are those domains from different classes AND zero MaxSub score. Domains not in these two categories are grouped as unknow.

Figure 3 of the paper displays the new results. Although the figures 2 and 3 look similar there are few main points concluded from these diagrams:

1- At the same error rate all methods except BLAST detect more homologs compared to Figure 2 of the paper.
2- With the new definitions, sensitive tools improve more than others in homolog detection such as HHsearch 3 and 4. Since the new set of homologs are harder to detect and therefore only sensitive tools can detect them
3- COMPASS and PRO-SIM perform better than HHsearch 0 which means they are better at remote homolog detection.

To check this hypothesis, they draw TP vs FP (Figure 4 of paper) only for TPs from different families. Not only they confirm the hypothesis, they notice that using SS (HHsearch 3 and 4) they detect more TPs. Interesting enough as the pairs under study evolutionary diverge, the power of SS in detection becomes bolder.

Alignment quality:

The quality of a homology model really depends on how well the query protein is aligned with the homolog. Therefore, in this paper all of the methods are compared on their ability on building accurate alignments. To do so, using an aligned pair of sequences, the 3D structures of two proteins
are superposed and the spatial distances are evaluated (similar to MaxSub method). The authors also introduced the Sbalance score which is similar to MaxSub but considers over and under predictions that MaxSub fails to consider.

Comparing alignment qualities using MaxSub score (Fig 5 of paper) we can roughly conclude the performance as:

BLAST < PSI_BLAST < PROF-SIM < COMPASS < HHsearch0 < HMMER < HHsearch1 < HHsearch2 < HHsearch3 < HHsearch4

again more sensitive methods are able to build better alignments for distant homologs.

Comparing alignment qualities using Sbalance score (Fig 6 of paper) we can roughly conclude the performance as:

BLAST < PSI_BLAST < HMMER < PROF-SIM < COMPASS < HHsearch0 < HHsearch1 < HHsearch2 < HHsearch4 < HHsearch3

HMMER now is inferior to profile-profile alignment methods. In addition HHsearch 3 is the winner.

In general HMM-HMM methods are superior to all other methods for homolog detection and building alignments. HHsearch 4 has shown to be able to detect related structures (more than other methods) in the the same superfamily and folds.

Using the same idea more accurate tools have been developed since this paper such as HHblits from the same group. Also, recently a method has introduced Markov Random Fields to detect homologs, with better performance than HHsearch and HHblits.


My talk this week focused on secondary structure (SS) assignment. What do I mean by this? It is assigning SS types (principally α-helices and β-sheets) to protein structures. It can be found hiding in many of the things we do – e.g. alignment and modelling. There are many available methods to do this, of which DSSP (despite being published in 1983) is the most popular.


How does it work?

The algorithm identifies hydrogen bonds between mainchain carbonyl and amide groups. Partial charges are applied to the amide and carbonyl bonds, and the the C, O, N, and H atoms are assumed to be point charges (hence C has charge +ρ1, O -ρ1, N -ρ2, and H +ρ2. The the electrostatic energy between these 4 atoms is calculated, and if it is < -0.5 kcal/mol, a hydrogen bond exists. This is a relatively relaxed threshold, as a normal hydrogen bond in an $alpha;-helix is in the region of -3 kcal/mol, so it means that a given residue could have i+3, i+4, and i+5 hydrogen bonds.

Helices and sheets are then identified where there are characteristic hydrogen bond patterns. For example, two consecutive i to i+4 backbone hydrogen bonds indicates an α-helix turn. The algorithm identifies each turn, and each β-bridge, and where several of these overlap, they are combined into single elements.

DSSP has 8 different SS assignments:

  • G – 310 helix
  • H – α-helix
  • I – π-helix
  • E – β-sheet
  • B – β-bridge
  • T – helix turn
  • S – bend (high curvature)
  • C – coil (none of the above)

These are assigned in an order of preference – HBEGITSC.
Many (but by no means all) SS assignment programs still use this notation.

DSSP is one of the more simple SS assignment programs. Its hydrogen bond energy calculation is distinctly simplistic. It does not (fully) take the angles of the hydrogen bond into account, and provides only a binary classification for each hydrogen bond. However, perhaps surprisingly, DSSP is still the most used method. Why? Probably something to do with them giving it away for free, which resulted in many software suits incorporating it (e.g. JOY, PROMOTIF). As a general rule, if something does not say what it uses for SS assignment, it probably uses DSSP.

Other Methods

Given the simplicity of DSSP, it is not surprising that there are a large number of other available methods. Indeed, you may notice that different programs will give different assignments (e.g. comparing Pymol to VMD or the PDB annotation).

There have been a vast number of other secondary structure (SS) annotation methods published, including: STRIDE, DEFINE, PROMOTIF, KAKSI, SST, PSSC, P-SEA, SECSTR, XLLSSTR, PALSSE, and STICK. The other two you are likely to come across are STRIDE, and the PDB annotation.

SS assignment in general

All of the SS assignment methods rely on a combination of three features of SS. These are:

  1. Mainchain hydrogen bonds
  2. Φ and Ψ angles
  3. Inter Cα distances

For all three of these, there are values characteristic of both helices and β-sheets. Every method takes some combination of these three features for each residue, and if they are within the chosen limits, classifies the residue accordingly. The limits are generally chosen to give good agreement with the PDB annotation.
(It should be noted that the hydrogen-bond containing methods use the position of the hydrogen atom, which is rarely present in the crystal structure, and thus must be inferred.)

STRIDE can be described as an updated version of DSSP. It has two main differences – it uses a much fuller description of hydrogen bond geometry, and combines this with a knowledge based φ/ψ angle potential to assign the residue state. It has many more parameters that DSSP, and these are trained based on the PDB annotation. So where does that come from?

This PDB annotation comes from the depositors own annotation. The current guidance (from here) is to use the generated annotation, from PROMOTIF. PROMOTIF uses DSSP, with a slight change – it annotates an extra residue at the end of each structure element. I am in no position to say how well this guidance is adhered to by the depositors, or what their historical behaviour was, but the vast majority of annotations are reasonable.

I guess you are now wondering how different these methods are. Generally they agree in the obvious cases, and disagreement is normally at the ends of SS elements. Other examples (particularly pertinent to my research) occur when one method identifies a single long element, while another method identifies two elements seperated by a coil section. Ultimately there is no ‘right’ answer, so saying one method is right and another wrong is impossible.

To sum up, DSSP is the de facto standard. Ignoring my previous comment, it is probably not the best algorithm, as it is a gross simplification. STRIDE improves on the algorithm (although using more parameters), whilst for specific tasks, one method may be better than all of the others. It is hard to say if one is the best, and if it is important to you, then you should think about which method to use. If you do not think it is, then you should reconsider, and if it really is not important, then just use DSSP like everyone else. This is perhaps an example where willing, free, provision your code to the community results in your method (DSSP) becoming the de facto standard.

Loopy LaTeX

Something that you may or may not know, is that you can write loops in LaTeX. It is, in fact, a Turing-complete language, so you can write if statements too (which is helpful for making multi-line comments), and do all the other things you can do with a programming language.

So, I guess you are thinking, ‘eh, that’s cool, but why would I do it?’ Indeed, I have known about this for ages, but never found any need for it, until recently. My problem was that I had generated 80 small images, that I wanted to display on a number of pages. I could have played with print settings, and made multiple pages per sheet, but since I wanted two columns, and to fit 16 to a page (the 80 images were in 5 sets of 16, each with two subsets), that was going to be a pain. I also wanted to add some labels to each set, but not have said label on every image. However, I thought that LaTeX could solve my problem.

As ever, there are a number of different latex packages that you can use, but I used the pgffor package. In the example below, my pictures were named [A-E]_[ab][1-8]_picture, e.g. A_b2_picture.png, or D_a3_picture.png. The code produces pages of 16 pictures, with the ‘a’ pictures on the left, and the ‘b’ pictures on the right.
There is a more simple example at the bottom.

Happy LaTeXing..

OPIG goes Punting!

Last Wednesday, it was the oh-so-very traditional OPIG Punting day (OPunting for those more acronym-prone).


To my surprise,  the weather was spectacular! It was warm and sunny, a true perfect day for punting. We set off from our alcoves offices with determination in our hearts and, more importantly, with delicious snacks and a significant amount of Pimms and G&T.   Everything was set for a truly amazing day. 20140730_194600 Our group took over 5 punts from the Cherwell Boathouse, constituting what I like to think of as a small fleet of avid punters and merriment-seekers. We punted all the way up the Cherwell, past the lovely Victoria’s Arms into lands unknown (or into some grassy meadows in the vicinities of Oxford). Fortunately no keys were thrown overboard and no one fell off the punts (well, at least not accidentally). Yet, as usual, OPunting was very eventful! Following the customs of our group, we had the traditional punting race. I may have been too busy gorging on Pimms during the race, but if memory does not fail me, the race was won by Hannah (who can be seen in the photo bellow doing her swimming victory lap).

Hannah, in her victory lap swim...

Hannah, in her victory lap swim…

During the punting, we also discovered that Bernhard had some previously unknown Viking ancestry (Austrian vikings?), which manifested in an impetus to ram his punt against others. Suffice to say that he paved the grounds to “Dodgems Punts”, a ride that will become popular in fun fairs and amusement parks in 2027.

Other than that, the weather was so great that many of us decided to go for a lovely swim at the Cherwell.


After a refreshing pint at Victoria’s, we made our way back to conclude yet another successful OPunting day!

3Dsig and ISMB 2014 (Boston)

This year we head to Boston for 3Dsig and ISMB 2014. The outcome was excellent with James Dunbar giving a talk at the 3Dsig on “Examining variable domain orientations in antigen receptors gives insight into TCR-like antibody design” and Alistair Martin oral poster presentation at ISMB on “Unearthing the structural information contained within mRNA ”. The Deane group received the most votes by different judges for the poster competition at 3Dsig with James Dunbar poster winning the best poster prize and Jinwoo Leem and Reyhaneh Esmaielbeiki posters receiving the honorary mentioned (all presented posters are Here).

Soooo Excited!

Soooo Excited!

This blog post contains a brief description of what we found most interesting at the conference.

N-terminal domains in two-domain proteins are biased to be shorter and predicted to fold faster than their C-terminal counterparts.
Authors: Jacob, Etai; Unger, Ron; Horovitz, Amnon

Chosen by: Alistair Martin

Is is not surprising that protein misfolding is selected against in the genome, with aggregation of misfolded proteins being associated to a an array of diseases. It is suggested that multi-domain proteins are more prone to misfolding and aggregation due to an effective higher local protein concentration. Jacob et al. investigated what mechanisms have developed to avoid this, focussing on ~3000 two domain proteins contained within Swiss-Prot.

They found that there are notable differences between the N- and C-terminal domains. Firstly, there exists a large propensity for the C-terminal domain to be longer than the N-terminal domain (1.6 times more likely). Secondly, the absolute contact order (ACO) is greater in the C-terminal domain when compared to the N-terminal domain. Both length and ACO inversely correlate to folding speed and thus they draw the conclusion that the first domain produced by the ribosome is under pressure to fold faster than the latter domain. These observations are enhanced in prokaryotes and lessened in eukaryotes, thereby suggesting a relationship to translational speed and cotranslational folding.

A novel tool to identify molecular interaction field similarities
Authors: Matthieu Chartier and Rafael Najmanovich

Chosen by: Claire Marks

One of the prize-winning talks at 3Dsig this year was “A novel tool to identify molecular interaction field similarities”, presented by Matthieu Chartier from the Université de Sherbrooke. The talk introduced IsoMIF – a new method for finding proteins that have similar binding pockets to one another. Unlike previous methods, IsoMIF does not require the superposition of the two proteins, and therefore the binding sites of proteins with very different folds can be compared.
IsoMIF works by placing probes at various positions on a grid in the binding pocket and calculating the interaction potential. Six probe types are used: hydrogen bond donor and acceptor, cation, anion, aromatic group, and hydrophobic group. A graph-matching algorithm then finds similar pockets to a query using the interaction potentials for each probe type at each grid point. On a dataset of nearly 3000 proteins, IsoMIF achieved good AUC values of 0.74-0.92.

The promise of evolutionary coupling for 3D biology
Presenter: Debora Marks

Chosen by: Reyhaneh Esmaielbeiki

I was impressed by the keynote talk given by Debora Marks at the 3Dsig. She gave an overall talk on how their group have worked on detecting evolutionary couplings (EC) between residues in proteins and how they use this information in predicting folds. In general, looking at the interacting residues in a 3D structure and comparing these position in a MSA displays co-evolving relationship. But the challenge is to solve the inverse, from sequence to structure, since not necessary all co-evolving residues are close in the 3D space (this relationship is shown in the figure below).

Evoulationary coupling in sequnce and structure

Evoulationary coupling in sequnce and structure

Debora showed that previous studies for detecting co-evolving residues used Mutual Information (MI). But, comparing the prediction out of MI with contacts maps shows that these methods perform poorly. This is because MI looks at the statistics of pair of residue at a time while residues in proteins are highly coupled and pairs are not independent from other pairs. Therefore, MI works good for RNA but not for proteins. Debora’s group have used mean field and pseudo likelihood maximization to overcome the limitation of MI and introduced the EVcoupling tool for predicting EC (Marks et al. PLoS One, 6(12), 2011). They have used the predicted EC as a distance restraint to predict the 3D structure of proteins using EVfold. Using EVfold they have managed to build structure with 2-5Å accuracy.

In a more recent work, they have built EVfold-membrane which is specific for membrane proteins (Hopf et al. Cell, 149(7), 1607-1621, 2012) and they tried modeling membrane proteins with unknown experimental structures. Recently close homologues to these structures were released and comparisons show that EVfold-membrane structures have accuracy of 3 to 4Å.

She also discussed the usage of detecting EC in identifying functional residues involved in ligand binding and conformational changes with an interesting example of two GPCRs, adrenergic beta-2 receptor and an opioid receptor (paper link).

She concluded her talk by talking about her recent work EVcomplex (paperlink). The aim is to use the detected EC between two different protein chains and use this information in the docking software as a distance restraint. Although, this method has provided models of the E.coli ATP synthase but there are currently several limitation (all mentioned in the Discussion of the paper) for using this work in large scale.

in general, EC was a popular topic at the conference with interesting posters from the UCL Bioinformatics group.

Characterizing changes in the rate of protein-protein dissociation upon interface mutation using hotspot energy and organization
Authors: Rudi Agius, Mieczyslaw Torchala, Iain H. Moal, Juan Fernández-Recio, Paul A. Bates

Chosen by:Jinwoo Leem

This was a paper by Agius et al. published in 2013 in PLOS Comp Biol. Essentially, the work was centralised around finding a set of novel descriptors to characterise mutant proteins and ultimately predict the koff (dissociation rate constant) of mutants. Their approach is quite unique; they perform two rounds of alanine-scanning mutagenesis, one on the wild-type (WT) and one on the mutant protein. They identify ‘hotspot’ residues as those that have a change of G of 2kcal/mol (or more) from alanine scanning, and the descriptor is formed from the summation of the energy changes of the hotspots.

The results were very encouraging, and from their random forest-trained model with hotspot descriptors, they see correlations to koff up to 0.79. However, the authors show that traditional ‘molecular’ descriptors (e.g. statistical potentials) perform just as well, with a correlation of 0.77. The exact contribution of their ‘hotspot’ descriptors to the prediction of koff seems unclear, especially considering how well molecular descriptors perform. Having said this, the paper shows a very unique way to approach the issue of predicting the effects of mutations on proteins, and on a larger dataset with a more diverse range of proteins (not necessarily mutants, but different proteins altogether!) these ‘hotspot’-specific methods may prove to be much more predictive.

On the Origin and Completeness of Ligand Binding Pockets with applications to drug discovery
Authors: Mu Gao & Jeffrey Skolnick.
Presented by Jeffrey Skolnick

Chosen by:Nicholas Pearce

The prevalence of ligand-binding pockets in proteins enables a wide range of biochemical reactions to be catalysed in the cell. Jeffrey Skolnick presented research which proposes that ligand-binding pockets are inherent in proteins. One mechanism that he hypothesised for the creation of these pockets is the mis-stacking of secondary structure elements, leading to imperfections in their surfaces – pockets. Using their method for calculating pocket similarity – APoc – Gao & Skolnick characterised the space of ligand-binding pockets, using artificial structures and structures from the PDB, and find it to be ~400-1000 pockets in size.

They suggest that the relatively small size of pocket-space could be one of the reasons that such a large amount of off-target promiscuity is seen in drug design attempts.

From this result, Skolnick went on to discuss several interesting possibilities for the evolutionary history of ligand-binding pockets. One of the interesting hypotheses is that many, if not all, proteins could have inherent low-level catalytic ability across a wide range of biochemical reactions. Motivated by the small size of pocket-space, it is conceivable that one protein would be able to catalyse many different reactions – this could give an insight into the evolutionary history of protein catalysis.

If primordial proteins could catalyse many different reactions, albeit inefficiently, this would give a possibility for how the first lifeforms developed. Nature need only then work on increasing specificity and efficiency from a background of weak catalytic ability. However, even through the course of evolution to produce more specific proteins, this background of activity could remain, and explain the background of ‘biochemical noise’ that is seen in biological systems.

Drug promiscuity and inherent reactivity may not only be present due to the small size of pocket-space – they may be a direct consequence of evolution and the fundamental properties of proteins.

Alistair Martin & codon!

Alistair Martin & codon!

James Dunbar presenting good stuff about antibodies

James Dunbar presenting good stuff about Antibodies

Antibody CDR-H3 Modelling with Prime

In a blog post from last month, Konrad discussed the most recent Antibody Modelling Assessment (AMA-II), a CASP-like blind prediction study designed to test the current state-of-the-art in antibody modelling. In the second round of this assessment, participants were given the crystal structure of ten antibodies with their H3 loops missing – the loop usually found in the centre of the binding site that is largely responsible for the binding properties of the antibody. The groups of researchers were asked to model this loop in its native environment. Modelling this loop is challenging, since it is much more variable in sequence and structure than the other five loops in the binding site.

For eight out of the ten loops, the Prime software from Schrodinger (the non-commercial version of which is called PLOP) produced the most accurate predictions. Prime is an ab initio method, meaning that loop conformations are generated from scratch (unlike knowledge-based methods, which  use databases of known loop structures). In this algorithm, described here,  a  ‘full’ prediction job is made up of consecutive ‘standard’ prediction jobs. A standard prediction job involves building loops from dihedral angle libraries – for each residue in the sequence, random phi/psi angles are chosen from the libraries. Loops are built in halves – lots of conformations of the first half are generated, along with many of the second half, and then all the first halves are cross-checked against the second halves to see whether any of them meet in the middle. If so, then the two halves are melded and a full loop structure is made. All loop structures are then clash-checked using an overlap factor (a cutoff on how close two atoms can get to each other). Finally, the loops are clustered, and a representative structure has its side chain conformations predicted and its energy minimised.

A full loop prediction job is made up of a series of standard jobs, with the goal of guiding the conformational search to focus on structures with low energy. The steps are as follows:

  • Initial – five standard jobs are run, with slightly different overlap factors.
  • Ref1 – the first refinement stage. The conformational space around the top 10 loops from each standard job of the Initial stage is explored further by constraining the distance between Ca atoms.
  • Fixed – the top 10 loops of all those generated so far are passed to this series of stages. To begin with, the first and last residues of the loop are excluded from the prediction and the rest of the loop is re-modelled. The top 10 loops after this are then taken to the second Fixed stage, where two residues at each end of the loop are kept fixed. This is repeated five times, with the number of fixed residues at each end of the loop being increased by one each time.
  • Ref2 – a second refinement stage, which is the same as the first, except tighter distance constraints are used.
  • Final  – all the loop structures generated are ranked according to their energy, and the lowest energy conformation is chosen as the final prediction.

In a recent paper, Prime was used to predict the structures of 53 antibody H3 loops (using the dataset of a previous RosettaAntibody paper). 91% of the targets were predicted with sub 2-angstrom accuracy, and 81% predictions were sub-angstrom. Compared to RosettaAntibody, which achieved 53% and 17% for predictions below 2A and 1A respectively, this is very impressive. For AMA-II, however, where each group was required to give five predictions, and some poor models were included in each group’s top five, it is apparent that ranking loop conformations is still a major challenge in loop modelling.

Sampling Conformations of Antibodies using MOSAICS

Much work has been done to study the conformational changes taking place in antibodies, particularly during the event of binding to an antigen. This has been done through comparison of crystal structures, circular dichroism, and recently with high resolution single particle electron microscopy. The ability to resolve domains within an antibody from single particles without any averaging  made it possible to show distributions of properties such as the shape of a Fab domain, measured by the ratio of width to length. Some of the variation in structure seen involves very large scale motions, but it is not known how conformational changes may be transmitted from the antigen binding region to the Fc, and therefore influence effector function. Molecular dynamics simulations have been performed on some large antibody systems, however none have been possible on a time scale which would be able to provide information on the converged distributions of large scale properties such as the angle between the Fab and Fc fragments.

In my short project with Peter Minary, I used MOSAICS to investigate the dynamics of an antibody Fab fragment, using the coarse-grained natural move Monte Carlo approach described by Sam a few weeks ago. This makes it possible to split a structure into units which are believed to move in a correlated way, and propose moves for the components of each region together. The rate of sampling is accelerated in degrees of freedom which may have functional significance, for example the movement of the domains in a Fab fragment relative to one another (separate regions shown in the diagram below). I used ABangle to analyse the output of each sampling trajectory and observe any changes in the relative orientations of The VH and VL domains.

Region definitions for MOSAICS

Fab region definitions for MOSAICS

Of particular interest would be any correlations between conformational changes in the variable and constant parts of the Fab fragment, as these could be involved in transmitting conformational changes between remote parts of the antibody. We also hoped to see in our model some effect of including the antigen in the simulation, bound to the antibody fragment as seen in the crystal structure. In the time available for the project, we was able to  set up a model representing the Fab fragment and run some relatively short simulations to explore favoured conformational states and see how the set up of regions affects distributions seen. In order to draw conclusions about the meaning of the results, a much greater number of simulations will need to be run to ensure sampling of the whole conformational space.

Computational Antibody Affinity Maturation

In this week’s journal club, we reviewed a paper by Lippow et al. in Nature Biotechnology, which features a computational pipeline that is capable of maturing antibodies (Abs) by up to 140-fold. The paper itself discusses 4 test case Abs (D44.1, cetuximab, 4-4-20, bevacizumab) and uses changes in electrostatic energy to identify favourable mutations. Up to the point when this paper was published back in 2007, computational antibody design was an (almost) unexplored field of research – except for a study by Clark et al. in 2006, no one else had done anything like the work presented in this paper.

The idea behind the paper is to identify certain positions within the Ab structure for mutation and hopefully find an Ab with a higher binding affinity.

The idea behind the paper is to identify certain positions within the Ab structure for mutation and hopefully find an Ab with a higher binding affinity.


Briefly speaking, the group generated a mutant Ab-antigen (Ag) complex using a series of algorithms (dead-end elimination and A*), which was then scored by the group’s energy function for identifying favourable mutations. Lippow et al. used the electrostatics term of their binding affinity prediction in order to estimate the effects of mutations on an Ab’s binding affinity. In other words, instead of examining their entire scoring function, which includes terms such as van der Waal’s energy, the group only used changes in the electrostatic energy term as an indicator for proposing mutations. Overall, in 2 of the 4 mentioned test cases (D44.1 & cetuximab), the proposed mutations were experimentally tested to confirm their computational design pipeline – a brief overview of these two case studies will be described.


In the case of the D44.1 anti-lysozyme Ab, the group proposed 9 single mutations by their electrostatics-based calculation method; 6/9 single mutants were confirmed to be beneficial (i.e., the mutant had an increased binding affinity). The beneficial single mutants were combined, ultimately leading to a quadruple mutant structure with a 100-fold improvement in affinity. The quadruple mutant was then subjected to a second round of computer-guided affinity maturation, leading to a new variant with six mutations (effectively a 140-fold improvement over the wild-type Ab). This case study was a solid testimony to the validity of their method; since anti-lysozyme Abs are often used as model systems, these results demonstrated that their design pipeline had taken, in principle, a suitable approach to maturing Abs in silico.

The second case study with cetuximab was arguably the more interesting result. Like the D44.1 case above, mutations were proposed to increase the Ab’s binding affinity on the basis of the changes in electrostatics. Although the newly-designed triple mutant only showed a 10-fold improvement over its wild-type counterpart, the group showed that their protocols can work for therapeutically-relevant Abs. The cetuximab example was a perfect complement to the previous case study — it demonstrated the practical implications of the method, and how this pipeline could potentially be used to mature existing Abs within the clinic today.

Effectively, the group suggested that mutations that either introduce hydrophobicity or a net charge at the binding interface tend to increase an Ab’s binding affinity. These conclusions shouldn’t come with huge surprise, but it was remarkable that the group had reached these conclusions with just one term from their energy function.


Effectively, the paper set off a whole new series of possibilities and helped us to widen our horizons. The paper was by no means perfect, especially with respect to predicting the precise binding affinities of mutants – much of this error could be bottled down to the modelling stage of their pipeline. However, the paper showed that computational affinity maturation is not just a dream – in fact, the paper showed that it’s perfectly doable, and immediately applicable. Interestingly, Lippow et al.’s manipulation of an Ab’s electrostatics seemed to be a valid approach, with recent publications on Ab maturation showing that introducing charged residues can enhance binding affinity (e.g. Kiyoshi et al., 2014).

More importantly, the paper was a beautiful showcase of how computational analyses could inform the decision making process in an in vitro framework, and I believe it exemplified how we should approach our problems in bioinformatics. We should not think of proteins as mere text files and numbers, but realise that they are living systems, and we’re not yet at a point where we fully understand how proteins behave. This shouldn’t discourage us from research; instead, it should give us the incentive to take things more slowly, and develop a method/product that could be used to solve greater, pragmatic problems.

Le Tour de Farce v2.0

In what is becoming the highlight of the year and a regular occurrence for the OPIGlets, Le Tour de Farce – The annual OPIG bike ride, took place on the 4th of June. Now in its 2.0 revision but maintaining a route similar to last year, 9.5 miles and several pints later, approximately 20 of us took in some distinctly pretty Oxfordshire scenery, not to mention The White Hart, The Trout, Jacobs Inn and for some, The One and The Punter too.