Author Archives: Henry Wilman

DSSP

My talk this week focused on secondary structure (SS) assignment. What do I mean by this? It is assigning SS types (principally α-helices and β-sheets) to protein structures. It can be found hiding in many of the things we do – e.g. alignment and modelling. There are many available methods to do this, of which DSSP (despite being published in 1983) is the most popular.

DSSP

How does it work?

The algorithm identifies hydrogen bonds between mainchain carbonyl and amide groups. Partial charges are applied to the amide and carbonyl bonds, and the the C, O, N, and H atoms are assumed to be point charges (hence C has charge +ρ1, O -ρ1, N -ρ2, and H +ρ2. The the electrostatic energy between these 4 atoms is calculated, and if it is < -0.5 kcal/mol, a hydrogen bond exists. This is a relatively relaxed threshold, as a normal hydrogen bond in an $alpha;-helix is in the region of -3 kcal/mol, so it means that a given residue could have i+3, i+4, and i+5 hydrogen bonds.

Helices and sheets are then identified where there are characteristic hydrogen bond patterns. For example, two consecutive i to i+4 backbone hydrogen bonds indicates an α-helix turn. The algorithm identifies each turn, and each β-bridge, and where several of these overlap, they are combined into single elements.

DSSP has 8 different SS assignments:

  • G – 310 helix
  • H – α-helix
  • I – π-helix
  • E – β-sheet
  • B – β-bridge
  • T – helix turn
  • S – bend (high curvature)
  • C – coil (none of the above)

These are assigned in an order of preference – HBEGITSC.
Many (but by no means all) SS assignment programs still use this notation.

DSSP is one of the more simple SS assignment programs. Its hydrogen bond energy calculation is distinctly simplistic. It does not (fully) take the angles of the hydrogen bond into account, and provides only a binary classification for each hydrogen bond. However, perhaps surprisingly, DSSP is still the most used method. Why? Probably something to do with them giving it away for free, which resulted in many software suits incorporating it (e.g. JOY, PROMOTIF). As a general rule, if something does not say what it uses for SS assignment, it probably uses DSSP.

Other Methods

Given the simplicity of DSSP, it is not surprising that there are a large number of other available methods. Indeed, you may notice that different programs will give different assignments (e.g. comparing Pymol to VMD or the PDB annotation).

There have been a vast number of other secondary structure (SS) annotation methods published, including: STRIDE, DEFINE, PROMOTIF, KAKSI, SST, PSSC, P-SEA, SECSTR, XLLSSTR, PALSSE, and STICK. The other two you are likely to come across are STRIDE, and the PDB annotation.

SS assignment in general

All of the SS assignment methods rely on a combination of three features of SS. These are:

  1. Mainchain hydrogen bonds
  2. Φ and Ψ angles
  3. Inter Cα distances

For all three of these, there are values characteristic of both helices and β-sheets. Every method takes some combination of these three features for each residue, and if they are within the chosen limits, classifies the residue accordingly. The limits are generally chosen to give good agreement with the PDB annotation.
(It should be noted that the hydrogen-bond containing methods use the position of the hydrogen atom, which is rarely present in the crystal structure, and thus must be inferred.)

STRIDE can be described as an updated version of DSSP. It has two main differences – it uses a much fuller description of hydrogen bond geometry, and combines this with a knowledge based φ/ψ angle potential to assign the residue state. It has many more parameters that DSSP, and these are trained based on the PDB annotation. So where does that come from?

This PDB annotation comes from the depositors own annotation. The current guidance (from here) is to use the generated annotation, from PROMOTIF. PROMOTIF uses DSSP, with a slight change – it annotates an extra residue at the end of each structure element. I am in no position to say how well this guidance is adhered to by the depositors, or what their historical behaviour was, but the vast majority of annotations are reasonable.

I guess you are now wondering how different these methods are. Generally they agree in the obvious cases, and disagreement is normally at the ends of SS elements. Other examples (particularly pertinent to my research) occur when one method identifies a single long element, while another method identifies two elements seperated by a coil section. Ultimately there is no ‘right’ answer, so saying one method is right and another wrong is impossible.

To sum up, DSSP is the de facto standard. Ignoring my previous comment, it is probably not the best algorithm, as it is a gross simplification. STRIDE improves on the algorithm (although using more parameters), whilst for specific tasks, one method may be better than all of the others. It is hard to say if one is the best, and if it is important to you, then you should think about which method to use. If you do not think it is, then you should reconsider, and if it really is not important, then just use DSSP like everyone else. This is perhaps an example where willing, free, provision your code to the community results in your method (DSSP) becoming the de facto standard.

Loopy LaTeX

Something that you may or may not know, is that you can write loops in LaTeX. It is, in fact, a Turing-complete language, so you can write if statements too (which is helpful for making multi-line comments), and do all the other things you can do with a programming language.

So, I guess you are thinking, ‘eh, that’s cool, but why would I do it?’ Indeed, I have known about this for ages, but never found any need for it, until recently. My problem was that I had generated 80 small images, that I wanted to display on a number of pages. I could have played with print settings, and made multiple pages per sheet, but since I wanted two columns, and to fit 16 to a page (the 80 images were in 5 sets of 16, each with two subsets), that was going to be a pain. I also wanted to add some labels to each set, but not have said label on every image. However, I thought that LaTeX could solve my problem.

As ever, there are a number of different latex packages that you can use, but I used the pgffor package. In the example below, my pictures were named [A-E]_[ab][1-8]_picture, e.g. A_b2_picture.png, or D_a3_picture.png. The code produces pages of 16 pictures, with the ‘a’ pictures on the left, and the ‘b’ pictures on the right.
There is a more simple example at the bottom.

\documentclass[a4paper,10pt]{article}
\usepackage[utf8]{inputenc}
\usepackage{pgffor}
\usepackage{subfigure}
\usepackage{graphicx}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{fullpage}

\begin{document}
\section{}

\foreach \method in {A, B, C, D, E}{ %looping over each set of 16 
  \begin{figure}
  \foreach \i in {1,...,8}{ %looping over each subset, the a set first, then the b set
    \centering     
    \subfigure[]{\label{fig:\method \i a}\includegraphics[width=0.45\textwidth]{\method _a\i _picture}} 
    \subfigure[]{\label{fig:\method \i b}\includegraphics[width=0.45\textwidth]{\method _b\i _picture}}
  }
  \caption{\method}
  \end{figure}
}

%a more simple example
\foreach \number in {1,...,10}{
\number \ elephant
}  

\end{document}

Happy LaTeXing..

AH^AH – Alpha Helices Assessed by Humans.

Can you help our research, by sparing about 15 minutes of your time to participate in our crowd-sourcing experiment? Click here!

We have created a game-like web-application to classify protein helices. If you participate it will take you about 15 minutes and afterwards we will show you the relation between your results and all other participants. You do not need any specific training to participate – your natural human puzzle solving skills are sufficient. Try to beat the others!

blog3

We are interested in kinked α-helices in proteins. Kinks are important for protein function, and understanding them will help to treat and design therapies for diseases. One hindrance to our work is that helix kinks are not well defined – there are many helices that are classed as kinked by some scientists, but not by others. Classifying helices as kinked or curved or straight is difficult computationally. There are lots of different way to do it, but they all give different answers. We think that humans can do it better.

Can you help us by sparing about 15 minutes of your time to assess some α-helices? Click here!

Edit: To participate in this, you need to register. While the data will be made anonymous, registration is required for a number of reasons. It allows you to log back in, and view you results at a later date, it gives us a record that you have participated (so that we can acknowledge your help), and it allows us to compare the results from people with different educational backgrounds. There are also some legal reasons, and it is much more simple if we can record that you have agreed to the privacy and cookies policy, and do not have to ask permission every time you return. We will not pass your data on to third parties, and the data is stored securely (and the passwords encrypted).

CABS-flex

One of the still open questions in bioinformatics is that of the flexibility of proteins, and it is one in which I am quite interested. Our main source of structural information is X-ray diffraction experiments, in which the crystalline protein is intrinsically rigid. While there is an ever increasing body of NMR data, and Molecular Dynamics simulations are becoming faster and cheaper, complete information about the dynamics of every PDB structure is a long way off. All atom MD simulations for a complete protein still take either days, or a powerful computer. So, naturally, any papers that talk about ways to get flexibility information catch my attention.

The paper that caught my attention was CABS-flex: Server for fast simulation of protein structure fluctuations.. There also is a connected paper – Consistent View of Protein Fluctuations from All-Atom Molecular Dynamics and Coarse-Grained Dynamics with Knowledge-Based Force-Field – which benchmarks their method against MD simulations.

The CABS model pops up in a number of other places – there is a CABS-fold server, and the method was first described in the paper Consistent View of Protein Fluctuations from All-Atom Molecular Dynamics and Coarse-Grained Dynamics with Knowledge-Based Force-Field from 2004.

What does the webserver do?

You give it a coordinate file, wait some hours, and it gives you back a series of (clustered) structures, a C&alpha trajectory, a residue flexibility profile, and a nice little video. Which is nice. It is, however, a little limited; you can only do a single chain (so no modelling of small molecule or peptide interactions), there is no way of fixing part of the model so it does not flex, and it can be picky about your PDB files – no chain gaps, no strange amino acids. You can tell it to analyse a PDB chain by just entering the code, but these are frequently rejected for having unacceptable coordinate files.

How does it work?

The CABS model makes a reduced representation of the protein, and explores its conformational space by making moves – which are accepted or rejected based on a (comparatively complex) force field. Many moves are attempted, and over a large number of iterations, this builds up an ensemble of structures, which show the possible confirmations of the protein. So, its a bit like MD, but rather than moving atoms based on calculated forces, you make random moves, and accept the sensible ones.

hello

The model is shown above. The protein structure is reduced to 4 atoms per residue. There are 5 types of move, and in each step, a random move of each type is tried ~n times (n being the number of amino acids in the protein). These are accepted or rejected based on their force field, and the process is repeated several hundred times. The force field used to accept/reject moves is quite complex – there are short range (sequence dependent and sequence independent), hydrogen bond, and long range (again, sequence dependent and sequence independent) interactions. There are sub-categories within these, and the relative contributions of the interactions can be changed. There is no solvent, so the long range interactions have to provide the forces to keep the protein together, and the hydrogen bond interactions act to maintain the secondary structure elements. More detail can be found in the 2004 paper.

Is it true?

The authors justify the method by comparing it to MD simulations taken from a database, and conclude that CABS-flex gives answers as similar to MD simulations as MD simulations using different force fields are to each other. Simply put, the same residues move in their simulations as move in MD simulations. They move more, which backs up their claim that they demonstrate flexibility over a longer time-frame than MD simulations. They do admit that they get the best agreement when they heavily restrict their secondary structure – raising the question of how much of the agreement between all of the methods is down to all the simulations having the same secondary structure.

To conclude, this could be a useful method, particularly in that it can give long time-frame flexibility – providing you treat it as any other measure of flexibility, with a pinch of salt. It is a bit of a shame that the interface is (perhaps frustratingly) simple, and you need to pre-process your coordinate files, as many PDB coordinate files will not be accepted.

What is a kink?

If you ask someone who works with membrane protein structures what they think a helix kink is, you will probably get a fairly vague answer. They would probably tell you something like a kink is a point in the helix where there is a sharp change in the direction of the helix axis. Like the helix on the right in the picture.

A couple of helices

Two example helices


They might also tell you that they frequently (or always) involve Proline, and that they feature the absence of some of the backbone hydrogen bonds that you would otherwise expect to observe in a helix. But that is about as far as the consensus goes.

There are a number of published methods to identify kinks in helices (ProKink, Helanal, MC-Helan, and TMKink, to name a few). The detail of identifying kinks (and hence the definition of kinks) differs a lot between the methods. That would be fine if they all identified the same helices as having kinks, and the same positions in each helix as kinked, but they do not. Even the number of helices that are kinked differs a lot – anything from 6 to 60%. It would be easy if all α-helices were like the two in the figure – clearly kinked or straight. Generally the methods agree on these ‘easy’ helices. But there are lots of helices that are difficult to classify.

Why do we want to computationally identify them? Well, kinks are functionally important in membrane proteins. Membrane proteins are both medicinally important, and hard to experimentally determine the structure of. So, modelling their structures is a useful thing to do, and their structure includes kinks. But also, they are a structural motif. If we can write down what we think a kink is, use that definition to identify kinks, and then use information from those kinks we identified to (reliably) predict the existence of other kinks, we will know that we fully understand to motif.

There are a remarkably large number of perfectly sensible ways to identify a sharp change in the axis of an α-helix, given the coordinates of the backbone atoms.
The published methods (mentioned above) are a bit cumbersome (they require you to create input files, and parse at least one, if not many, output files), so from my experience people tend to make their own methods (That is not just me with Kink Finder, but many of the people who need to identify kinks that I have spoken to at conferences do not use a published method).

A short summary of ways to identify kinks:

  1. Fit axes to short sections of helix (4-8 residues). Give each residue an angle calculated from the angle between the axis of the section before it, and the axis of the section after it. If the largest angle in the helix is greater than a threshold, annotate the helix as kinked, at the residue with the largest angle
  2. Methods similar to (1), but using a measure other than an angle. For example, the curvature of a line fitted to the helix, or the distance between the ith and (i+4)th Cα residue
  3. Identify sections of the helix that are ‘straight’. Where a helix contains more than on section, it is kinked
  4. Look at them, and classify

Even for (1), there are lots of ways to calculate a helix axis. You can us least-squares (very sensitive to the number of residues used in the fit), or fitting to a cylinder (better, but slower), or a vector method using 4 Cαs. Similarly for (2), you could fit a spline, or just a polynomial. This is before you decide on (e.g.) how many residues you are going to fit axes to (6?, 7? 8?), or what smoothing parameter to use in the spline fit, or what order polynomial.

How should we classify this helix?

How should we classify this helix?

The point is, there are lots of ways to do it, but there is no consensus definition. You can use one method, or 2, or more, but they will give you different answers. Because we have no concise definition, it is hard to say which helix classification is ‘right’ and which one is ‘wrong’. Take this helix, it is not perfectly straight, but is it not straight enough to be kinked? Or is the change in axis over a number of turns, making it curved rather than kinked?

There are a few ways round this problem, where your computer program is struggling to ‘correctly’ classify helices. Rather than using a computer, you could manually curate a set, which has been done. However, there are a few issues here. First, their definition of a kink was a bit vague – and certainly difficult to turn into a computer program. Second, humans are very good at seeing patterns, even when they are not really there. Third, the end aim, at least for me, is to incorporate this into structure prediction, where a computational formulation of a kink is necessary. Another solution is to accept that your program is deficient, and put the borderline (i.e. difficult) cases into a third group, and removing them from your analysis. This does help to improve your signal, but is a little unsatisfactory.

To conclude, there are not two types of helices, there is more of a spectrum, from very straight, through a bit kinked, to very kinked. Classifying them is a hard problem, with lots a different approaches available. But, while the methods give different answers (particularly in identifying the position of the kink in the helix), and they do not indicate the same causes of kinks, there is work to be done.

PredyFlexy – Predicting Protein Flexibility

PredyFlexyResulteg

An example output of PredyFlexy

My presentation focused on a method to predict protein flexibility – PredyFlexy. There is a webserver, and it is described in their paper (de Brevern et. al 2012). The method is covered much more explicitly in the paper Predicting protein flexibility through the prediction of local structures (Bornot et al. 2011). This work builds on earlier papers (which are mentioned on the front page of the webserver, some of which I will mention later). In terms of people, Alexandre G. de Brevern, Aurelie Bornot, and Catherine Etchebest are authors common to all of these papers.

The concept is simple; there is a library of ‘Local Structure Prototypes’ or LSPs. These LSPs are 11 residue fragments, and have structure and flexibility associated with them, which are derived from a set of protein structures and molecular dynamics simulations. Each LSP has a SVM (support vector machine) ‘expert’ to score how likely a given sequence profile is to have said structure. A 21 residue window is used to determine the LSP of the central 11 residues within this window.

So, the user inputs a sequence, which gets Psi-Blasted to give a sequence profile. The 5 most probable LSPs for each atom are determined using the SVM experts. Then the predicted flexibility of each atom is given by the average flexibility of these 5 LSPs. There is a confidence index to the predictions, which comes from assessing the discriminative power of the SVMs. Regions predicted to have LSP with more accurate SVMs will have a high confidence index.

Local Structure Prototypes!

Examples of Local Structure Prototypes. Taken from the supplementary information of ‘Predicting protein flexibility through the prediction of local structures’ – Aurélie Bornot, Catherine Etchebest, Alexandre G. de Brevern, Proteins 79, 3, 839–852, 2011

So, the concept seems simple, but you are probably wondering, what are these LSPs? To answer that, we have to delve into the literature. ‘Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks‘, published in 2000, and with Alexandre G. de Brevern as first author, is the sensible place to start. This introduced the concept of protein blocks, which is in effect a structural alphabet. These are 5 residue fragments (described by 8 dihedral angles), and there are 16 of them (there are pretty pictures in the supplementary data 1 of this paper). Local structure prototypes are made up of these protein blocks, which were first used to predict structure, in a 2006 paper – Assessing a Novel Approach for Predicting Local 3D Protein Structures from Sequence (again, with the same authors). LSPs are 11 residue fragments, made from 7 overlapping protein blocks. Obviously there are lots of combinations to the protein blocks that can give 11 residue fragments. These combinations are clustered into 120 groups. Each group is represented by the fragment within the cluster that is closest to all other fragments within the cluster, based on C-alpha RMSD. Hence 120 LSPs.

The other pertinent question is, where does the flexibility data come from? Well, they did some MD simulations (see Bornot et al. 2011 for details) and took b-factors from the structures. They normalised these, by calculating the number of standard deviations away from the mean for each structure. Each LSP was attributed a b-factor and RMSF, which was the mean value for the central residue of every instance of the LSP in the data set. Additionally, each fragment in the data set was  classed as ‘rigid’, ‘flexible’, or ‘intermediate’ based on its normalized RMSF and B-factor. Each LSP was given a probability of belonging to each of these classes based on the frequency of fragments that belonged to that LSP being in each class. The figure here (taken from Bornot et al. 2011) shows the interesting weak correlation between normalised B-factor and normalised RMSF.

 

Normalised B-factor values according to normalized RMSF values as determined from molecular dynamics simulations. From Bornot et al 2011. Blue points represent rigid fragments, red flexible ones, and green points intermediate fragments.

Normalised B-factor values according to normalized RMSF values as determined from molecular dynamics simulations. From Bornot et al 2011. Blue points represent rigid fragments, red flexible ones, and green points intermediate fragments.

Bornot et al. 2011 also gives us a guide to the ability of this prediction method (see table II). In predicting the class of fragments (rigid, intermediate or flexible), it gets the correct class about half the time. For 40% of rigid and flexible cases, the class is predicted as ‘intermediate’. Prediction rate is also strongly correlated with flexibility – more flexible regions have much poorer prediction rates. Which is not great, as we already know that most alpha helices are rigid. However, the confidence index does give a good guide as to what to trust. I could speculate that might results in an output that tells us that helices and sheets are definitely rigid, and other elements are possibly flexible. Which would not be particularly useful, but given there are few comparable tools, something is better than nothing. 

Protein flexibility is hard; experimentally determining it is difficult (and even MD simulations take a while), and people can argue about how relevant the experimental methods are (and we frequently do in our group meetings). So, like most predictive methods, a relatively fast (and simple) way to get some information about your problem is always going to be useful. If only to guide you to where you might focus your attention.