# PLOS Comp Bio’s front page!

The work of our molecular dynamic expert, Bernhard Knapp, has made the front page of the PLOS Computational Biology . The image shows how CDR3 loops scan a peptide/MHC complex (Figure below: Snapshot from the website). Immune cells in the human body screen other cells for possible infections. The binding of T-cell receptors (TCR) and parts of pathogens bound by major histocompatibility complexes (MHC) is one of the activation mechanisms of the immune system. There have been many hypotheses as to when such binding will activate the immune system. In this study we performed the, to our knowledge, largest set of Molecular Dynamics simulations of TCR-MHC complexes. We performed 172 simulations each of 100 ns in length. By performing a large number of simulations we obtain insight about which structural features are frequently present in immune system activating and non-activating TCR-MHC complexes. We show that many previously suggested structural features are unlikely to be causal for the activation of the human immune system.

How CDR3 loops scan a peptide/MHC complex. Rendering is based on the 100 ns simulation of the wild-type TCRpMHC complex. Blue: peptide; Solid white cartoon and transparent surface: first frame of the MHC helices and β-sheet floor; Orange: equally distributed frames of the CDR3s over simulation time. Transparent white cartoon: first frame of all other TCR regions.

# HHSearch

Introduction and Methods:

Today I prsented the paper which introduces HHsearch. HHSearch is mainly used for template detection and still is one of the best known methods. The early methods for template detection simply performed sequence-sequence comparison between a query protein and a database of targets: such as BLAST or FASTA. But then more advanced methods such as PSI-BLAST were introduced which perform sequence-profile comparisons. Profiles are build using multiple sequence alignments (MSA) of protein families which describes the probability of the occurrence of an amino acid in a column of a MSA. In other words, profiles describe the conservation of amino acids among families. A remarkable advance in template detection was introduced by methods performing profile-profile comparisons such as COMPASS, PROF-SIM and LAMA.

Hidden Markov Model (HMM) introduced a new way of building profiles which resulted in methods performing sequence-HMM comparisons to detect templates. A HMM is similar to a state machine and is build using a MSA where each column is given a ‘M’ (match) state. A match state emits amino acids based on probability calculated from the MSA. In addition of a match state, all columns of the MSA will have an ‘I’ (insert) state and ‘D’ (delete/gap) state (See below figure for an example). There is a transition between states (shown by arrows) where all transitions also have probabilities. Having a sequence and a HMM, the sequence can be alignment on the HMM. In other words there is path in the HMM which associates to emitting the sequence and a log-odd score associated with this path. Dynamic programming (Viterbi algorithm) is used to detect this path (similar to needleman and wunsch algorithm for sequence-sequence alignments). More detail can be found here.

Example of an HMM. Taken from Bioinformatics Sequence and Genome Analysis, David W. Mount
(http://compbio.pbworks.com/w/page/16252909/Multiple%20Sequence%20Alignment)

The novel idea of HHsearch was that instead of performing sequence-HMM alignment, HMM-HMM alignments can be used for template detection. Therefore, they first discuss the formula used to calculate the log-odd scores which are required by the Viterbi algorithm to find the best aligned path. The score of aligning two columns in two HMMs (query profile q and template profile t) are calculated as:

Using this score, the best alignment between two profile HMMs is generated by maximizing the log-sum-odds score as in the formula below:

Now that the scoring function is defined, Viterbi algorithm is used to maximize this function. For simplicity HHsearch has described five pair states and the allowed transition between them (see figure 1.C of the paper). Therefore five dynamic program matrices are needed to align the two HMMs.

There are two other main parameters that can contribute to the final score of the HHsearch functions: 1) Correlation score: this score is based on the idea that if two proteins are homologs then once they are aligned high score columns (conserved columns) should cluster together. Which means the higher this score is the more homolog the sequences are. This score can be simply added to the ‘final best alignment score’ from the Viterbi algorithm. 2) While aligning two columns of the two HMMs, Secondary Structure (SS) elemnts are scored using statistical scores generated by the authors, which take into account the confidence values of the SS predictions. HHsearch provides two set of scores for SS comparison: 1- predicted vs predicted 2- predicted vs known. The second one is mainly used when performing 3D structure prediction. These acores are added to the Viterbi algorithm scoring functions in the formula 5,6 and 7 of the paper.

Results:

Dataset:

HHsearch performance was compared to BLAST(sequence-sequence), PSI-BLAST(profile-sequence) , HMMER(HMM-sequence) and COMPASS and PROF-SIM(profile-profile) methods. five version of HHsearch was benchmarked:

HHserach 0 -> basic profile-profile comparison with fix gap penalties
HHserach 1 -> basic HMM-HMM comparison
HHsearch 2 -> HMM-HMM comparison including correlation score
HHsearch 3 -> HMM-HMM comparison including correlation score + predicted vs predicted SS
HHsearch 4 -> HMM-HMM comparison including correlation score + predicted vs known SS

The dataset used in the comparison study consists of 3691 sequences taken from the SCOP database filtered at 20% sequence identity (SCOP-20). For each sequence an alignment is prepared using PSI-BLAST. These alignments are used to compare methods.

Detection of homolog:

The ability of methods in detecting homologs are compared against each other. Homolog and non-homolog definitions are: If two domains of SCOP-20 are in the same SCOP superfamily then they are considered as homologs. If the two domains are from different classes they are considered as non-homologs. All other pairs are classified as unknown.
The performance are compared by drawing TP (number ofhomolog pairs) vs FP (number of non-homolog pairs) curves, for all-against-all comparison of data in SCOP-20. The highest curves represent the best performing method. This curve is shown in Figure 2 of the paper. In this dataset the total number of TP are 41505 and the total number of FP are 1.08 x 10^7. The worse method is BLAST which at a error rate of 10% will only find 22% of the homologs. carrying on from BLAST the detection ability grows as:

BLAST < PSI_BLAST < HMMER < PROF-SIM < COMPASS < HHsearch0 < HHsearch1 < HHsearch2 < HHsearch3 < HHsearch4 Studying the results from HHsearch in detail the authors realised that in some cases HHSearch (with high confidence) groups two domains from different superfamily or even different folds as homologs. Looking at the structures of these proteins the authors noticed that the structure of these proteins are either locally or globally similar. Therefore, proteins defined by SCOP to be in different superfamilies or fold might actually be homologs. Therefore, they repeated the same test but changing the definition of TP and F: -homologs (TPs) are domains from the same superfamily OR if their sequence-based alignment resulted in an structure alignment with MaxSub score of 0.1. (MaxSub ranges from 0-1, where 0 is insignificant and 1 very significant.) -non-homologs (FPs) are those domains from different classes AND zero MaxSub score. Domains not in these two categories are grouped as unknow. Figure 3 of the paper displays the new results. Although the figures 2 and 3 look similar there are few main points concluded from these diagrams: 1- At the same error rate all methods except BLAST detect more homologs compared to Figure 2 of the paper. 2- With the new definitions, sensitive tools improve more than others in homolog detection such as HHsearch 3 and 4. Since the new set of homologs are harder to detect and therefore only sensitive tools can detect them 3- COMPASS and PRO-SIM perform better than HHsearch 0 which means they are better at remote homolog detection. To check this hypothesis, they draw TP vs FP (Figure 4 of paper) only for TPs from different families. Not only they confirm the hypothesis, they notice that using SS (HHsearch 3 and 4) they detect more TPs. Interesting enough as the pairs under study evolutionary diverge, the power of SS in detection becomes bolder. Alignment quality:

The quality of a homology model really depends on how well the query protein is aligned with the homolog. Therefore, in this paper all of the methods are compared on their ability on building accurate alignments. To do so, using an aligned pair of sequences, the 3D structures of two proteins
are superposed and the spatial distances are evaluated (similar to MaxSub method). The authors also introduced the Sbalance score which is similar to MaxSub but considers over and under predictions that MaxSub fails to consider.

Comparing alignment qualities using MaxSub score (Fig 5 of paper) we can roughly conclude the performance as:

BLAST < PSI_BLAST < PROF-SIM < COMPASS < HHsearch0 < HMMER < HHsearch1 < HHsearch2 < HHsearch3 < HHsearch4 again more sensitive methods are able to build better alignments for distant homologs. Comparing alignment qualities using Sbalance score (Fig 6 of paper) we can roughly conclude the performance as: BLAST < PSI_BLAST < HMMER < PROF-SIM < COMPASS < HHsearch0 < HHsearch1 < HHsearch2 < HHsearch4 < HHsearch3 HMMER now is inferior to profile-profile alignment methods. In addition HHsearch 3 is the winner. In general HMM-HMM methods are superior to all other methods for homolog detection and building alignments. HHsearch 4 has shown to be able to detect related structures (more than other methods) in the the same superfamily and folds. Using the same idea more accurate tools have been developed since this paper such as HHblits from the same group. Also, recently a method has introduced Markov Random Fields to detect homologs, with better performance than HHsearch and HHblits.

# DSSP

My talk this week focused on secondary structure (SS) assignment. What do I mean by this? It is assigning SS types (principally α-helices and β-sheets) to protein structures. It can be found hiding in many of the things we do – e.g. alignment and modelling. There are many available methods to do this, of which DSSP (despite being published in 1983) is the most popular.

## DSSP

### How does it work?

The algorithm identifies hydrogen bonds between mainchain carbonyl and amide groups. Partial charges are applied to the amide and carbonyl bonds, and the the C, O, N, and H atoms are assumed to be point charges (hence C has charge +ρ1, O -ρ1, N -ρ2, and H +ρ2. The the electrostatic energy between these 4 atoms is calculated, and if it is < -0.5 kcal/mol, a hydrogen bond exists. This is a relatively relaxed threshold, as a normal hydrogen bond in an \$alpha;-helix is in the region of -3 kcal/mol, so it means that a given residue could have i+3, i+4, and i+5 hydrogen bonds.

Helices and sheets are then identified where there are characteristic hydrogen bond patterns. For example, two consecutive i to i+4 backbone hydrogen bonds indicates an α-helix turn. The algorithm identifies each turn, and each β-bridge, and where several of these overlap, they are combined into single elements.

DSSP has 8 different SS assignments:

• G – 310 helix
• H – α-helix
• I – π-helix
• E – β-sheet
• B – β-bridge
• T – helix turn
• S – bend (high curvature)
• C – coil (none of the above)

These are assigned in an order of preference – HBEGITSC.
Many (but by no means all) SS assignment programs still use this notation.

DSSP is one of the more simple SS assignment programs. Its hydrogen bond energy calculation is distinctly simplistic. It does not (fully) take the angles of the hydrogen bond into account, and provides only a binary classification for each hydrogen bond. However, perhaps surprisingly, DSSP is still the most used method. Why? Probably something to do with them giving it away for free, which resulted in many software suits incorporating it (e.g. JOY, PROMOTIF). As a general rule, if something does not say what it uses for SS assignment, it probably uses DSSP.

## Other Methods

Given the simplicity of DSSP, it is not surprising that there are a large number of other available methods. Indeed, you may notice that different programs will give different assignments (e.g. comparing Pymol to VMD or the PDB annotation).

There have been a vast number of other secondary structure (SS) annotation methods published, including: STRIDE, DEFINE, PROMOTIF, KAKSI, SST, PSSC, P-SEA, SECSTR, XLLSSTR, PALSSE, and STICK. The other two you are likely to come across are STRIDE, and the PDB annotation.

## SS assignment in general

All of the SS assignment methods rely on a combination of three features of SS. These are:

1. Mainchain hydrogen bonds
2. Φ and Ψ angles
3. Inter Cα distances

For all three of these, there are values characteristic of both helices and β-sheets. Every method takes some combination of these three features for each residue, and if they are within the chosen limits, classifies the residue accordingly. The limits are generally chosen to give good agreement with the PDB annotation.
(It should be noted that the hydrogen-bond containing methods use the position of the hydrogen atom, which is rarely present in the crystal structure, and thus must be inferred.)

STRIDE can be described as an updated version of DSSP. It has two main differences – it uses a much fuller description of hydrogen bond geometry, and combines this with a knowledge based φ/ψ angle potential to assign the residue state. It has many more parameters that DSSP, and these are trained based on the PDB annotation. So where does that come from?

This PDB annotation comes from the depositors own annotation. The current guidance (from here) is to use the generated annotation, from PROMOTIF. PROMOTIF uses DSSP, with a slight change – it annotates an extra residue at the end of each structure element. I am in no position to say how well this guidance is adhered to by the depositors, or what their historical behaviour was, but the vast majority of annotations are reasonable.

I guess you are now wondering how different these methods are. Generally they agree in the obvious cases, and disagreement is normally at the ends of SS elements. Other examples (particularly pertinent to my research) occur when one method identifies a single long element, while another method identifies two elements seperated by a coil section. Ultimately there is no ‘right’ answer, so saying one method is right and another wrong is impossible.

To sum up, DSSP is the de facto standard. Ignoring my previous comment, it is probably not the best algorithm, as it is a gross simplification. STRIDE improves on the algorithm (although using more parameters), whilst for specific tasks, one method may be better than all of the others. It is hard to say if one is the best, and if it is important to you, then you should think about which method to use. If you do not think it is, then you should reconsider, and if it really is not important, then just use DSSP like everyone else. This is perhaps an example where willing, free, provision your code to the community results in your method (DSSP) becoming the de facto standard.

# Loopy LaTeX

Something that you may or may not know, is that you can write loops in LaTeX. It is, in fact, a Turing-complete language, so you can write if statements too (which is helpful for making multi-line comments), and do all the other things you can do with a programming language.

So, I guess you are thinking, ‘eh, that’s cool, but why would I do it?’ Indeed, I have known about this for ages, but never found any need for it, until recently. My problem was that I had generated 80 small images, that I wanted to display on a number of pages. I could have played with print settings, and made multiple pages per sheet, but since I wanted two columns, and to fit 16 to a page (the 80 images were in 5 sets of 16, each with two subsets), that was going to be a pain. I also wanted to add some labels to each set, but not have said label on every image. However, I thought that LaTeX could solve my problem.

As ever, there are a number of different latex packages that you can use, but I used the pgffor package. In the example below, my pictures were named [A-E]_[ab][1-8]_picture, e.g. A_b2_picture.png, or D_a3_picture.png. The code produces pages of 16 pictures, with the ‘a’ pictures on the left, and the ‘b’ pictures on the right.
There is a more simple example at the bottom.

\documentclass[a4paper,10pt]{article}
\usepackage[utf8]{inputenc}
\usepackage{pgffor}
\usepackage{subfigure}
\usepackage{graphicx}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{fullpage}

\begin{document}
\section{}

\foreach \method in {A, B, C, D, E}{ %looping over each set of 16
\begin{figure}
\foreach \i in {1,...,8}{ %looping over each subset, the a set first, then the b set
\centering
\subfigure[]{\label{fig:\method \i a}\includegraphics[width=0.45\textwidth]{\method _a\i _picture}}
\subfigure[]{\label{fig:\method \i b}\includegraphics[width=0.45\textwidth]{\method _b\i _picture}}
}
\caption{\method}
\end{figure}
}

%a more simple example
\foreach \number in {1,...,10}{
\number \ elephant
}

\end{document}

Happy LaTeXing..

# OPIG goes Punting!

Last Wednesday, it was the oh-so-very traditional OPIG Punting day (OPunting for those more acronym-prone).

To my surprise,  the weather was spectacular! It was warm and sunny, a true perfect day for punting. We set off from our alcoves offices with determination in our hearts and, more importantly, with delicious snacks and a significant amount of Pimms and G&T.   Everything was set for a truly amazing day. Our group took over 5 punts from the Cherwell Boathouse, constituting what I like to think of as a small fleet of avid punters and merriment-seekers. We punted all the way up the Cherwell, past the lovely Victoria’s Arms into lands unknown (or into some grassy meadows in the vicinities of Oxford). Fortunately no keys were thrown overboard and no one fell off the punts (well, at least not accidentally). Yet, as usual, OPunting was very eventful! Following the customs of our group, we had the traditional punting race. I may have been too busy gorging on Pimms during the race, but if memory does not fail me, the race was won by Hannah (who can be seen in the photo bellow doing her swimming victory lap).

Hannah, in her victory lap swim…

During the punting, we also discovered that Bernhard had some previously unknown Viking ancestry (Austrian vikings?), which manifested in an impetus to ram his punt against others. Suffice to say that he paved the grounds to “Dodgems Punts”, a ride that will become popular in fun fairs and amusement parks in 2027.

Other than that, the weather was so great that many of us decided to go for a lovely swim at the Cherwell.

After a refreshing pint at Victoria’s, we made our way back to conclude yet another successful OPunting day!

# 3Dsig and ISMB 2014 (Boston)

This year we head to Boston for 3Dsig and ISMB 2014. The outcome was excellent with James Dunbar giving a talk at the 3Dsig on “Examining variable domain orientations in antigen receptors gives insight into TCR-like antibody design” and Alistair Martin oral poster presentation at ISMB on “Unearthing the structural information contained within mRNA ”. The Deane group received the most votes by different judges for the poster competition at 3Dsig with James Dunbar poster winning the best poster prize and Jinwoo Leem and Reyhaneh Esmaielbeiki posters receiving the honorary mentioned (all presented posters are Here).

Soooo Excited!

This blog post contains a brief description of what we found most interesting at the conference.

N-terminal domains in two-domain proteins are biased to be shorter and predicted to fold faster than their C-terminal counterparts.
Authors: Jacob, Etai; Unger, Ron; Horovitz, Amnon

Chosen by: Alistair Martin

Is is not surprising that protein misfolding is selected against in the genome, with aggregation of misfolded proteins being associated to a an array of diseases. It is suggested that multi-domain proteins are more prone to misfolding and aggregation due to an effective higher local protein concentration. Jacob et al. investigated what mechanisms have developed to avoid this, focussing on ~3000 two domain proteins contained within Swiss-Prot.

They found that there are notable differences between the N- and C-terminal domains. Firstly, there exists a large propensity for the C-terminal domain to be longer than the N-terminal domain (1.6 times more likely). Secondly, the absolute contact order (ACO) is greater in the C-terminal domain when compared to the N-terminal domain. Both length and ACO inversely correlate to folding speed and thus they draw the conclusion that the first domain produced by the ribosome is under pressure to fold faster than the latter domain. These observations are enhanced in prokaryotes and lessened in eukaryotes, thereby suggesting a relationship to translational speed and cotranslational folding.

A novel tool to identify molecular interaction field similarities
Authors: Matthieu Chartier and Rafael Najmanovich

Chosen by: Claire Marks

One of the prize-winning talks at 3Dsig this year was “A novel tool to identify molecular interaction field similarities”, presented by Matthieu Chartier from the Université de Sherbrooke. The talk introduced IsoMIF – a new method for finding proteins that have similar binding pockets to one another. Unlike previous methods, IsoMIF does not require the superposition of the two proteins, and therefore the binding sites of proteins with very different folds can be compared.
IsoMIF works by placing probes at various positions on a grid in the binding pocket and calculating the interaction potential. Six probe types are used: hydrogen bond donor and acceptor, cation, anion, aromatic group, and hydrophobic group. A graph-matching algorithm then finds similar pockets to a query using the interaction potentials for each probe type at each grid point. On a dataset of nearly 3000 proteins, IsoMIF achieved good AUC values of 0.74-0.92.

The promise of evolutionary coupling for 3D biology
Presenter: Debora Marks

Chosen by: Reyhaneh Esmaielbeiki

I was impressed by the keynote talk given by Debora Marks at the 3Dsig. She gave an overall talk on how their group have worked on detecting evolutionary couplings (EC) between residues in proteins and how they use this information in predicting folds. In general, looking at the interacting residues in a 3D structure and comparing these position in a MSA displays co-evolving relationship. But the challenge is to solve the inverse, from sequence to structure, since not necessary all co-evolving residues are close in the 3D space (this relationship is shown in the figure below).

Evoulationary coupling in sequnce and structure

Debora showed that previous studies for detecting co-evolving residues used Mutual Information (MI). But, comparing the prediction out of MI with contacts maps shows that these methods perform poorly. This is because MI looks at the statistics of pair of residue at a time while residues in proteins are highly coupled and pairs are not independent from other pairs. Therefore, MI works good for RNA but not for proteins. Debora’s group have used mean field and pseudo likelihood maximization to overcome the limitation of MI and introduced the EVcoupling tool for predicting EC (Marks et al. PLoS One, 6(12), 2011). They have used the predicted EC as a distance restraint to predict the 3D structure of proteins using EVfold. Using EVfold they have managed to build structure with 2-5Å accuracy.

In a more recent work, they have built EVfold-membrane which is specific for membrane proteins (Hopf et al. Cell, 149(7), 1607-1621, 2012) and they tried modeling membrane proteins with unknown experimental structures. Recently close homologues to these structures were released and comparisons show that EVfold-membrane structures have accuracy of 3 to 4Å.

She also discussed the usage of detecting EC in identifying functional residues involved in ligand binding and conformational changes with an interesting example of two GPCRs, adrenergic beta-2 receptor and an opioid receptor (paper link).

She concluded her talk by talking about her recent work EVcomplex (paperlink). The aim is to use the detected EC between two different protein chains and use this information in the docking software as a distance restraint. Although, this method has provided models of the E.coli ATP synthase but there are currently several limitation (all mentioned in the Discussion of the paper) for using this work in large scale.

in general, EC was a popular topic at the conference with interesting posters from the UCL Bioinformatics group.

Characterizing changes in the rate of protein-protein dissociation upon interface mutation using hotspot energy and organization
Authors: Rudi Agius, Mieczyslaw Torchala, Iain H. Moal, Juan Fernández-Recio, Paul A. Bates

Chosen by:Jinwoo Leem

This was a paper by Agius et al. published in 2013 in PLOS Comp Biol. Essentially, the work was centralised around finding a set of novel descriptors to characterise mutant proteins and ultimately predict the koff (dissociation rate constant) of mutants. Their approach is quite unique; they perform two rounds of alanine-scanning mutagenesis, one on the wild-type (WT) and one on the mutant protein. They identify ‘hotspot’ residues as those that have a change of G of 2kcal/mol (or more) from alanine scanning, and the descriptor is formed from the summation of the energy changes of the hotspots.

The results were very encouraging, and from their random forest-trained model with hotspot descriptors, they see correlations to koff up to 0.79. However, the authors show that traditional ‘molecular’ descriptors (e.g. statistical potentials) perform just as well, with a correlation of 0.77. The exact contribution of their ‘hotspot’ descriptors to the prediction of koff seems unclear, especially considering how well molecular descriptors perform. Having said this, the paper shows a very unique way to approach the issue of predicting the effects of mutations on proteins, and on a larger dataset with a more diverse range of proteins (not necessarily mutants, but different proteins altogether!) these ‘hotspot’-specific methods may prove to be much more predictive.

On the Origin and Completeness of Ligand Binding Pockets with applications to drug discovery
Authors: Mu Gao & Jeffrey Skolnick.
Presented by Jeffrey Skolnick

Chosen by:Nicholas Pearce

The prevalence of ligand-binding pockets in proteins enables a wide range of biochemical reactions to be catalysed in the cell. Jeffrey Skolnick presented research which proposes that ligand-binding pockets are inherent in proteins. One mechanism that he hypothesised for the creation of these pockets is the mis-stacking of secondary structure elements, leading to imperfections in their surfaces – pockets. Using their method for calculating pocket similarity – APoc – Gao & Skolnick characterised the space of ligand-binding pockets, using artificial structures and structures from the PDB, and find it to be ~400-1000 pockets in size.

They suggest that the relatively small size of pocket-space could be one of the reasons that such a large amount of off-target promiscuity is seen in drug design attempts.

From this result, Skolnick went on to discuss several interesting possibilities for the evolutionary history of ligand-binding pockets. One of the interesting hypotheses is that many, if not all, proteins could have inherent low-level catalytic ability across a wide range of biochemical reactions. Motivated by the small size of pocket-space, it is conceivable that one protein would be able to catalyse many different reactions – this could give an insight into the evolutionary history of protein catalysis.

If primordial proteins could catalyse many different reactions, albeit inefficiently, this would give a possibility for how the first lifeforms developed. Nature need only then work on increasing specificity and efficiency from a background of weak catalytic ability. However, even through the course of evolution to produce more specific proteins, this background of activity could remain, and explain the background of ‘biochemical noise’ that is seen in biological systems.

Drug promiscuity and inherent reactivity may not only be present due to the small size of pocket-space – they may be a direct consequence of evolution and the fundamental properties of proteins.

Alistair Martin & codon!

James Dunbar presenting good stuff about Antibodies