Category Archives: Group Meetings

What we discuss during cake at our Tuesday afternoon group meetings

Research Talk: High Resolution Antibody Modelling

In keeping with the other posts in recent weeks, and providing a certain continuity, this post also focusses on antibodies. For those of you that have read the last few weeks’ posts, you may wish to skip the first few paragraphs, otherwise things may get repetitive…

Antibodies are key components of the immune system, with almost limitless potential variability. This means that the immune system is capable of producing antibodies with the ability to bind to almost any target. Antibodies exhibit very high specificity and very high affinity towards their targets, and this makes them excellent at their job – of marking their targets (antigens) to identify them to the rest of the immune system, either for modification or destruction.

Immunoglobulin G (IgG) Structure

(left) The Immunoglobulin (IgG) fold, the most common fold for antibodies. It is formed of four chains, two heavy and two light. The binding regions of the antibody are at the ends of the variable domains VH and VL, located at the ends of the heavy and light chains respectively. (right) The VH domain. At the end of both the VH and the VL domains are three hypervariable loops (CDRs) that account for most of the structural variability of the binding site. The CDRs are highlighted in red; the rest of the domain (coloured cyan) is known as the framework.

Over the past few years, the use of antibodies as therapeutic agents has increased. It is now at the point where we are beginning to computationally design antibodies to bind to specific targets. Whether they are designed to target cancer cells or viruses, the task of designing the CDRs to complement the antigen perfectly is a very difficult one. Computationally, the best way of predicting the affinity of an antibody for an antigen is through the use of docking programs.

For best results, high-resolution and very accurate models of both the antibody and the antigen are needed. This is because small changes in an antibody's sequence have been seen, experimentally, to produce large changes in affinity.

Many antibody modelling protocols currently exist, including WAM, PIGS, and RosettaAntibody. These use a variety of approaches. WAM and PIGS use homology modelling approaches to model the framework, augmented with expert knowledge-based rules to model the CDRs. RosettaAntibody also uses homology modelling to model the framework of the antibody, but then uses the Rosetta protocol to perform an exploration of the conformational space to find the lowest energy conformation.

However, there are several problems that remain. The orientation between the VH domain and the VL domain is shown to be instrumental in the high binding affinity of the antibody. Mutations to framework residues that change the orientation of the VH and VL domains have been shown to cause significant changes to the binding affinity.

Because of the multi-chain modelling problem, which currently has no general solution, the current approach is often to copy the orientation across from the template antibody to create the orientation of the target antibody. (The three examples above do perform some extent of orientation optimisation using conserved residues at the VH-VL interface.)

However, before we begin to consider how to effect the modelling of the VH-VL interface, we must first build the VH and the VL separately. All of the domain folds in the IgG structure are very similar, consisting of two anti-parallel beta sheets sandwiched together. These beta sheets are very well conserved. The VH domain is harder to model because it contains the CDR H3 – which is the longest and most structurally variable of the 6 CDRs – so we may as well start there…


Framework structural alignment of 605 non-redundant VHs (made non-redundant @95% sequence identity). The beta sheet cores are very well conserved, but the loops exhibit more structural variability (although not that much by general protein standards…). The stumps where the CDRs have been removed are labelled.

But even before we start modelling the VH, how hard is the homology modelling problem likely to be for the average VH sequence we come across? Extracting all of the VH sequences from the IMGT database (72,482 sequences), we find, for each sequence, the structure in SAbDab (the Structural Antibody Database) that exhibits the highest sequence identity to it. This is the structure that would generally be used as the template for modelling. Results below…

[Figure: distribution of best-template sequence identity across the IMGT VH sequences]

Most of the sequences have a best template with over 70% sequence identity, so modelling them with low RMSDs (< 1 Angstrom) should be possible. However, there are still those that have lower sequence identity. These could be problematic…
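For the curious, the template search itself is conceptually simple. Below is a rough sketch of the idea, not the actual analysis: the file names are placeholders, and the sequences are assumed to already sit on a common antibody numbering scheme, so that percentage identity can be read straight off the aligned positions.

```python
from Bio import SeqIO

def identity(a, b):
    """Percentage identity over positions where both sequences have a residue."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

# hypothetical files of VH sequences, pre-aligned onto a common numbering
templates = {rec.id: str(rec.seq) for rec in SeqIO.parse("sabdab_vh_aligned.fasta", "fasta")}

def best_template(query_seq):
    """Return (percentage identity, template id) of the closest template."""
    return max((identity(query_seq, t_seq), t_id) for t_id, t_seq in templates.items())

for rec in SeqIO.parse("imgt_vh_aligned.fasta", "fasta"):
    pct, template = best_template(str(rec.seq))
    print(rec.id, template, f"{pct:.1f}%")
```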

When we are analysing the accuracy of our models, we often generate models for which we have experimentally derived crystal structures, and then compare them. But a crystal structure is not necessarily the native conformation of the protein, and some of the solvents added to aid the crystallisation could well distort the structure in some small (or possibly large) way. Or perhaps the protein is just flexible, and so we wouldn’t expect it to adopt just one conformation.

Again using SAbDab to help generate our datasets, we found the maximum variation (backbone RMSD) between sequence-identical VH domains, for the framework region only. How different can 100% identical sequences get? Again, results are below…

[Figure: maximum framework backbone RMSD between sequence-identical VH domains]

We see that even for 100% identical domains, the conformations can be different enough to give a significant RMSD. The pair with a 1.4 Å RMSD (PDB entries 4fqc and 4fq1) differs because one of the framework loops adopts a completely different conformation.
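The comparison behind these numbers can be sketched with Biopython. The snippet below superposes the framework Cα atoms of the two VH domains mentioned above; the chain identifiers and the framework residue ranges are assumptions (a real analysis would take them from a proper antibody numbering), and only Cα atoms are used here rather than the full backbone.

```python
from Bio.PDB import PDBParser, Superimposer

def ca_atoms(pdb_file, chain_id):
    """Map residue number -> C-alpha atom for one chain."""
    structure = PDBParser(QUIET=True).get_structure("x", pdb_file)
    return {res.id[1]: res["CA"] for res in structure[0][chain_id] if "CA" in res}

# assumed (Chothia-like) framework residue ranges, i.e. everything but the CDRs
framework = set(range(1, 27)) | set(range(39, 52)) | set(range(57, 95)) | set(range(103, 114))

a = ca_atoms("4fqc.pdb", "H")          # chain IDs are assumptions
b = ca_atoms("4fq1.pdb", "H")
common = sorted(framework & a.keys() & b.keys())

sup = Superimposer()
sup.set_atoms([a[i] for i in common], [b[i] for i in common])   # fixed, moving
print(f"framework C-alpha RMSD over {len(common)} residues = {sup.rms:.2f} A")
```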

So, although antibody modelling is easy in some respects – high conservation, large number of available structures for templates – it is not just a matter of getting it ‘close’, or even ‘good’. It’s about getting it as near to perfect as possible… (even though perfect may be ~ 0.4 A RMSD over the framework…)

Watch this space…

“Perfection is not attainable, but if we chase perfection we can catch excellence.”

(Vince Lombardi )

Building an Antibody Benchmark Set

In this so-called ‘big data’ age, the quest to find the signal amidst the noise is becoming more difficult than ever. Though we have sophisticated systems that can extract and parse data incredibly efficiently, the amount of noise has expanded equally, if not more, masking the signals that we crave. Oddly enough, it sometimes seems that we are churning out and gathering a vast amount of data just for the sake of it, rather than looking for highly relevant, high-quality data.

One such example is antibody (Ab) binding data. Even though there are several Ab-specific databases (e.g. AbySis, IMGT), none of these, to our knowledge, has any information on an Ab’s binding affinity to its antigen (Ag), despite the fact that an Ab’s affinity is one of the few quantitative metrics of its performance. Therefore, gathering Ab binding data would not only help us to create more accurate models of Ab binding, it would, in the long term, facilitate the in silico maturation and design/re-design of Abs. If this seems like a dream, have a read of this paper – they made an incredibly effective Ab from computationally-inspired methods.

Given the tools at our disposal, and the fact that several protein-protein binding databases are available in the public domain, this task may seem somewhat trivial. However, there’s the ever-present issue of gathering only the highest quality data points in order to perform some of the applications mentioned earlier.

Over the past few weeks, we have gathered the binding data for 228 Ab-Ag complexes across two major protein-protein binding databases: PDB-Bind and the structure-based benchmark from Kastritis et al. Ultimately, 36 entries were removed from further analyses as they had irrelevant data (e.g. IC50 instead of KD; IC50 relates to inhibition, which is not the same as the Ab’s affinity for its Ag). Given the dataset, we performed some initial tests on existing energy functions and docking programs to see if there is any correlation between the programs’ scores and protein binding affinities.
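As a rough illustration of that correlation check (not the actual benchmark pipeline), KD values can be converted to binding free energies and compared against a program's score with a Pearson correlation. The file name and column names below are hypothetical.

```python
import csv
import math
from scipy.stats import pearsonr

R, T = 1.987e-3, 298.0                      # gas constant in kcal/(mol K), temperature in K

scores, dgs = [], []
with open("ab_ag_affinities.csv") as fh:    # hypothetical columns: pdb_id, kd_molar, score
    for row in csv.DictReader(fh):
        kd = float(row["kd_molar"])
        scores.append(float(row["score"]))
        dgs.append(R * T * math.log(kd))    # deltaG = RT ln(KD); more negative = tighter binding

r, p = pearsonr(scores, dgs)
print(f"Pearson r = {r:.2f} (p = {p:.3g})")
```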


Blue = Abs binding to proteins, Red = Abs binding to peptides

As the graphs show, there is no distinctive correlation between a program/function’s score and the affinity of an Ab. Having said this, these programs were trained on general protein-protein interfaces (though that does occasionally include Abs!) and we thus trained DCOMPLEX and RAPDF specifically for Ab structures (~130 structures). The end results were poor nonetheless (top-centre and top-right graphs, above), but the interatomic heatmaps show clear differences in the interaction patterns between Ab-Ag interfaces and general protein-protein interfaces.


Interatomic contact map between Ab-Ag or two general proteins. Warmer colours represent higher counts.

Now, with this new information, the search for signals continues. It is evident that Ab binding has distinctive differences with respect to general protein-protein interfaces. Therefore, the next step is to gather more high-quality data and see if there is any correlation between an Ab’s distinct binding mode and its affinity. However, we are not interested in just getting whatever affinity data is available. As we have done for the past few weeks, the rigorous standards we have used for building the current benchmark set must be maintained – otherwise we risk masking the signal with unnecessary noise.

Currently, the results are disappointing, but if the past few weeks in OPIG have taught me anything, this is only the beginning of a long and difficult search for a good model. BUT – this is what makes research so exciting! We learn from the low Pearson correlation coefficients, the (almost) random distribution of data, and the not-so-pretty plots of our data in order to form useful models for practical applications like Ab design. I think a quote from The Great Gatsby accurately ‘models’ my optimism for making sense of the incoming stream of data:

Gatsby believed in the green light, the orgastic future that year by year recedes before us. It eluded us then, but that’s no matter — to-morrow we will run faster, stretch out our arms farther. . . . And one fine morning ——

So we beat on, boats against the current, borne back ceaselessly into the past.

Evolutionary fold space preferences

At group meeting last week I focussed, alongside some metaphysical speculation, on a project which has occupied the first half of my DPhil: namely exploring the preferences of both very old and very young protein structures. This work is currently in preparation for publication so I will give only a brief overview and hopefully update the juicy details later. Feel free to contact me for more information.

Proteins are the molecular machinery of the cell. Their evolution is one of the most fundamental processes which has delivered the diversity and complexity of life that we see around ourselves today. Despite this diversity, protein domains (independent folding units) of known structure fall into just over 1,000 unique SCOP folds.

This project has sought to identify how populations of proteins at different stages of evolution explore their possible structure space.

Superfamily ages

Structural domains are clustered at different levels of similarity within the SCOP classification. At the superfamily level this classification attempts to capture evolutionary relationships through structural and functional similarities even if sequence divergence has occurred.

Evolutionary ages for these superfamilies are then estimated from their phylogenetic profiles across the tree of life. These ages estimate when the structural ancestor of a superfamily first appeared.

 


 

The phylogenetic occurrence profiles are constructed using predictions of superfamilies on completely sequenced genomes using HMMs, taken from the SUPERFAMILY database. Given an occurrence profile and a phylogenetic tree (for robustness we consider several possible reconstructions of the tree of life) we use a maximum parsimony algorithm (proposed by Mirkin et al.) which estimates the simplest scenario of loss events (domain loss on a genome) and gain events (domain gain) at internal nodes on the tree that explains the occurrence profile. The age estimate is the height of the first gain event, normalised between 0 (at the leaves of the tree) and 1 (at the root).
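As a toy illustration of that age estimate (not the actual Mirkin et al. implementation, which weighs gain and loss events explicitly), one can take the Dollo-style shortcut of placing a single gain at the last common ancestor of all genomes that contain the superfamily, and reading off that node's normalised height. The tree and occurrence profiles below are invented.

```python
# child -> parent relationships and node heights (0 at the leaves, 1 at the root)
parent = {"human": "mammals", "mouse": "mammals", "yeast": "root",
          "ecoli": "root", "mammals": "root"}
height = {"human": 0.0, "mouse": 0.0, "yeast": 0.0, "ecoli": 0.0,
          "mammals": 0.3, "root": 1.0}

def ancestors(node):
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def lca(leaves):
    """Last common ancestor = the lowest node lying on every leaf's path to the root."""
    common = set(ancestors(leaves[0])).intersection(*(set(ancestors(l)) for l in leaves[1:]))
    return min(common, key=lambda n: height[n])

def superfamily_age(occurrence_profile):
    """Height of the single assumed gain event, normalised by the root height."""
    present = [leaf for leaf, occ in occurrence_profile.items() if occ]
    return height[present[0]] if len(present) == 1 else height[lca(present)]

print(superfamily_age({"human": 1, "mouse": 1, "yeast": 0, "ecoli": 0}))   # 0.3, a 'new-born'
print(superfamily_age({"human": 1, "mouse": 0, "yeast": 1, "ecoli": 1}))   # 1.0, an 'ancient'
```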

We estimated ages for 1,962 SCOP superfamilies and compared several properties relating to their primary, secondary and tertiary structures, as well as their functions. In particular, we compared two populations of superfamilies: ancients, with an age of 1, and new-borns, with an age < 0.4. Full details of our results will hopefully be published shortly so watch this space!

Protein kinases, the PIM story

Last week I presented my DPhil work. In one of my projects I address the reasons for inhibitor selectivity in the PIM protein kinase family. PIM kinases play key roles in signalling pathways and were identified as oncogenes long ago. Their ATP-binding sites, which are slightly unusual for protein kinases, together with their roles in cancer, have prompted the investigation of potential PIM-selective inhibitors for anticancer therapy. Due to the overlapping functions of the three PIM isoforms, efficacious inhibitors should bind to all three isozymes. However, most reported inhibitors show considerable selectivity for PIM1 and PIM3 over PIM2, and the mechanisms leading to this selectivity remain unclear.


Figure 1. Workflow of the sequence and structure analysis of PIM kinases

To establish the sequence determinants of inhibitor selectivity we investigated the phylogenetic relationships of PIM kinases and their structural conformations upon ligand binding (Figure 1). Together with my OPIG supervisor Charlotte Deane we predicted a set of candidates for site-directed mutagenesis as illustrated in Figure 2. The mutants were designed to convert PIM1 residues into analogous PIM2 residues at the same positions.

I then moved to the wetlab to test the hypotheses experimentally. Under the guidance of Oleg Fedorov, I screened the SGC library of kinase inhibitors using differential scanning fluorimetry (DSF). After comparing melting temperature shift values across the PIM kinases and mutants, a set of potent inhibitors with different chemical scaffolds was selected for quantitative binding analysis. I worked with Peter Drueker’s team at Novartis on PIM enzymology, where I measured activities, Km values for ATP and IC50s using a mobility shift assay. For my final set of measurements I performed isothermal titration calorimetry (ITC) experiments back at the SGC and determined binding constants and enthalpic/entropic contributions to the total free energy of ligand binding.


Figure 2. An overlay of PIM1 and PIM2 structures (P-loop and hinge regions), the mutated residues are shown as sticks

The data are yet to be published, so I only briefly state the results here. The hinge mutant E124L demonstrated reduced thermal stability, probably due to removal of the E124-R122 salt bridge. The P-loop mutants had Km ATP values intermediate between PIM1 and PIM2, indicating that those residues could be responsible for stronger ATP binding in PIM2. As shown in Figure 2, the residues are located at the tip of the P-loop and might be involved in P-loop movement. Importantly, three mutants showed reduced affinity for inhibitors, validating my initial hypotheses.

Ideally, having PIM1 and PIM2 co-crystal structures with the same inhibitors would allow a direct comparison of the binding modes. So far I have been able to solve an apo-PIM2 structure, in addition to the single existing PIM2 PDB entry; it will be deposited shortly.

I will update you soon about my second project, which involves more mutants, type II inhibitors, equilibrium shifts and speculations about conformational transitions. Keep visiting us!

Free food!

Yesterday I walked into Group Meeting not having read Bernhard‘s paper (shameful, I know), and I was immediately asked “Where is the Daleks post on the blog?”.  To which I mumbled something incoherent, because I am not sure what a Dalek is and because I didn’t know we were taking post requests.

Anyway, at every group meeting one of us is responsible for organising the talk and another for supplying food.  The only current rule (since the well-received demise of the “No alcohol” one) is: “No tomatoes“.  We’ve had a number of original and tasty contributions: Domino’s pizza, Ben’s and Millie’s cookies, truckloads of Haribos, Krispy Kreme doughnuts, sushi, Nutella baguettes and home-baked delights.

But Eoin‘s contribution takes the prize this round (a small trophy in Lab Room #1).


Eoin’s Dr. Who Dalek sugar-rush-inducing cakes.

So, a small pointer to OPIG prospective students – “baking” and “creative thinking” skills are really well appreciated and look good on your CV!

 

Journal Club: Molecular Dynamics simulations of TCRpMHC

Introduction

T cells recognize fragments of pathogen (peptides) presented by the Major Histocompatibility Complex (MHC) via their T-cell receptor (TCR). This interaction is commonly considered one of the most important events taking place in the adaptive immune response.

[Figure: the TCR-pMHC complex]

Molecular Dynamics (MD) simulations are a computational technique for simulating the movement of atoms over time. At each iteration, the interaction energies (bonded and non-bonded) between the atoms are calculated and the atomic positions are adjusted accordingly. Such simulations are very resource- and time-consuming, but provide insights into interaction processes which cannot be obtained by any currently available experimental technique.
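To make the 'calculate energies, adjust positions, repeat' loop concrete, here is a toy sketch of a velocity Verlet integrator with a Lennard-Jones potential standing in for the non-bonded terms. Real MD engines add bonded terms, electrostatics, thermostats, constraints and much more; this is only meant to show the shape of each iteration.

```python
import numpy as np

def lj_forces(pos, epsilon=1.0, sigma=1.0):
    """Pairwise Lennard-Jones forces, standing in for the non-bonded terms."""
    forces = np.zeros_like(pos)
    n = len(pos)
    for i in range(n):
        for j in range(i + 1, n):
            r = pos[i] - pos[j]
            d2 = np.dot(r, r)
            s6 = (sigma ** 2 / d2) ** 3
            f = 24.0 * epsilon * (2.0 * s6 ** 2 - s6) / d2 * r
            forces[i] += f
            forces[j] -= f
    return forces

def velocity_verlet(pos, vel, mass, dt, n_steps):
    f = lj_forces(pos)
    for _ in range(n_steps):
        vel += 0.5 * dt * f / mass      # half kick
        pos += dt * vel                 # drift to new positions
        f = lj_forces(pos)              # recompute forces at the new positions
        vel += 0.5 * dt * f / mass      # second half kick
    return pos, vel

# nine toy 'atoms' on a small grid, initially at rest
pos = 1.5 * np.array([[i, j, 0.0] for i in range(3) for j in range(3)], dtype=float)
vel = np.zeros_like(pos)
mass = np.ones((len(pos), 1))
pos, vel = velocity_verlet(pos, vel, mass, dt=0.005, n_steps=200)
print(pos.round(2))
```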

In this journal club we discussed 3 different papers dealing with MD simulations of the TCRpMHC complex:

A typical story

Epitope Flexibility and Dynamic Footprint Revealed by Molecular Dynamics of a pMHC-TCR Complex
Reboul et al., Plos Comp. Biol. 2012

As many other authors have done before, Reboul et al. performed MD simulations of two different (though very similar) MHCs in complex with the same viral peptide. While no immune reaction is triggered when the peptide is presented by HLA-B*3501, a reaction is induced when it is presented in the context of HLA-B*3508.

In their MD simulations the authors find minor differences in the RMSF and claim these to be systematic and the cause of the different behaviour.

An innovative story

Toward an atomistic understanding of the immune synapse: Large-scale molecular dynamics simulation of a membrane embedded TCR–pMHC–CD4 complex
Wan et al., Molecular Immunology 2008

While several PDB structures of parts of the core of the immunological synapse are available (see image below), an overall structure had not been published before this paper. The authors address this by means of superimposition, modelling of the linking and trans-membrane regions, and subsequent MD simulation. The resulting structure seems to be in good agreement with experimental electron microscopy data.

[Figure: assembly of the membrane-embedded TCR–pMHC–CD4 complex]

My story

Early relaxation dynamics in the LC 13 T cell receptor in reaction to 172 altered peptide ligands: A molecular dynamics simulation study
Knapp et al., Plos One 2013

In most studies, authors compare the same MHC with two or three different peptides, or the same peptide bound to two MHCs. In some cases the same peptide and MHC are also simulated in interaction with two different TCRs. Given that the TCRpMHC complex consists of roughly 800 amino acids, one will almost certainly find some differences between those two or three simulations (multiple testing). Differences would also be present if one simulated the same complex twice with different starting velocities, or, more extreme still, with identical starting velocities but on different hardware; yes, even this can lead to slightly different results. On this basis, such studies (if published without further experimental data to support the findings) are at best anecdotal stories.

Therefore we intended to address this challenge in a more systematic way: we simulated the LC 13 TCR / HLA-B*08:01 system in complex with all possible single point mutations of the EBV peptide FLRGRAYGL. This leads to a total of 172 highly related MD simulations, for each of which the experimental immunogenicity is known. Based on their immunogenicity, we assigned each simulation to either the more immunogenic (moreI) or the less immunogenic (lessI) group; this was repeated for several thresholds. Further analysis on the basis of RMSD maps and permutation tests showed that the moreI and lessI groups were significantly different in their initial relaxation dynamics away from the (perturbed) X-ray structure.
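The permutation-test idea can be sketched as follows. The study itself compared per-residue RMSD maps across the 172 simulations; the numbers below are invented and stand in for a single observable, but the logic (compare the observed group difference against differences obtained after shuffling the group labels) is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
moreI = rng.normal(1.2, 0.3, size=90)     # invented per-simulation values for the moreI group
lessI = rng.normal(1.0, 0.3, size=82)     # invented per-simulation values for the lessI group

observed = moreI.mean() - lessI.mean()
pooled = np.concatenate([moreI, lessI])

n_perm, hits = 10000, 0
for _ in range(n_perm):
    rng.shuffle(pooled)                   # break any real group structure
    diff = pooled[:len(moreI)].mean() - pooled[len(moreI):].mean()
    if abs(diff) >= abs(observed):
        hits += 1

p_value = (hits + 1) / (n_perm + 1)
print(f"observed difference = {observed:.3f}, permutation p = {p_value:.4f}")
```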


They were not only significantly different but they also showed a quite interesting pattern in their most frequently different regions (highlighted in green):

[Figure: regions most frequently different between the moreI and lessI groups, highlighted in green]

Journal Club: Evolutionary conservation of codon optimality reveals hidden signatures of cotranslational folding

This paper by Pechmann et al. discusses the relationship between codons and the co-translational regulation of protein folding. Every amino acid apart from methionine and tryptophan has multiple codons, and it is well established that codons are translated at varying speeds and thus influence local translational kinetics.


This codon multiplicity and speed variation may be important for protein folding, as several experiments have shown that synonymous substitutions (changing the codon but not the amino acid) alter folding and/or function.

Codon translation efficiency depends on tRNA supply and demand

The new idea presented in this paper is a translational efficiency scale. This is an attempt to calculate the efficiency with which a codon will be translated by considering both the supply of tRNA and the demand for that tRNA. They calculate their new measure nTE for all of the coding sequences in 10 closely related yeast species.
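In the spirit of that supply-and-demand idea (though not the paper's exact formula), a normalised efficiency can be sketched as the ratio of a tRNA supply proxy to a codon demand proxy, rescaled to lie between 0 and 1. All numbers below are invented.

```python
def normalised_efficiency(supply, demand):
    """Supply/demand ratio per codon, rescaled so the values span 0 to 1."""
    raw = {codon: supply[codon] / demand[codon] for codon in supply}
    lo, hi = min(raw.values()), max(raw.values())
    return {codon: (v - lo) / (hi - lo) for codon, v in raw.items()}

trna_supply = {"GAA": 10, "GAG": 2, "AAA": 7, "AAG": 8}                  # e.g. tRNA gene copies
codon_demand = {"GAA": 0.031, "GAG": 0.019, "AAA": 0.042, "AAG": 0.031}  # e.g. usage weighted by mRNA abundance

for codon, nte in sorted(normalised_efficiency(trna_supply, codon_demand).items()):
    print(codon, round(nte, 2))
```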

The distribution of the nTE values is unlike that of previous scales, as the majority of codons occur in a middle plateau region. The authors suggest that this is due to cost-effective proteome maintenance, i.e. for most codons tRNA supply and demand are closely matched.

They go on to look for the previously observed “ramp”, a slowly translated region at the start of coding sequences. They identify a ramp region which is approximately 10 codons long (significantly shorter than the 35-50 codon ramp seen in other analyses). This shorter region relates to two other observations: firstly, the distance between the peptidyl transferase centre and the constriction site in the ribosome is approximately 10 amino acids, and secondly, ribosomes are experimentally observed to pause near the very start of coding sequences.

The codons are then divided into two categories based on their nTE score: optimal codons, those with high nTE values that should be translated rapidly and accurately, and non-optimal codons. The authors found that codon optimality was conserved between orthologs in their set at rates far higher than those expected by chance (for both optimal and non-optimal codons). When considering those proteins with structural information available, they were also able to observe conservation of the positioning of codon types with respect to secondary structures. This evolutionary conservation suggests an evolved function for codon optimality in regulating the rhythm of elongation in order to facilitate co-translational protein folding.

 

Pechmann S, Frydman J. Evolutionary conservation of codon optimality reveals hidden signatures of cotranslational folding. Nat Struct Mol Biol. 2013 Feb;20(2):237-43.

Viewing ligands in twilight electron density

In this week’s journal club we discussed an excellent review paper by E. Pozharski, C. X. Weichenberger and B. Rupp investigating crystallographic approaches to protein-ligand complex elucidation. The paper assessed and highlighted the shortcomings of deposited PDB structures containing ligand-protein complexes. It then made suggestions for the community as a whole and for researchers making use of ligand-protein complexes in their work.

The paper discussed:

  • The difficulties in protein ligand complex elucidation
  • The tools to assess the quality of protein-ligand structures both qualitative and quantitative
  • The methods used in their analysis of certain PDB structures
  • Some case studies visually demonstrating these issues
  • Some practical conclusions for the crystallographic community
  • Some practical conclusions for non-crystallographer users of protein-ligand complex structures from the PDB

The basic difficulties of ligand-protein complex elucidation

  • Ligands have less than 100% occupancy – sometimes significantly less and thus will inherently show up less clearly in the overall electron density.
  • Ligands make small contributions to the overall structure, and thus global quality measures, such as R-factors, will be affected only minutely by the ligand portion of the structure being wrong
  • The original basis model needs to be used appropriately: the R-free data from the original apo model should be used to avoid model bias

The following are the tools available to inspect the quality of agreement between protein structures and their associated data.

  • Visual inspection of the Fo-Fc and 2Fo-Fc maps, using software such as COOT, is essential to assess qualitatively whether a structure is justified by the evidence.
  • Use of local measures of quality, for example real-space correlation coefficients (RSCC)
  • Their own tool, which makes use of the above as well as the global quality measure of resolution

Methods and results

In a separate publication they had analysed every PDB entry containing both ligands and published structure factors. In this sample, 7.6% had RSCC values of less than 0.6, the arbitrary cut-off they use to determine whether the experimental evidence supports the modelled coordinates.


An incorrectly oriented ligand (a) and its correction (b). In all of these figures, blue is the 2mFo-DFc map contoured at 1σ, and green and red are positive and negative contouring of the mFo-DFc map at 3σ.

In this publication they visually inspected a subset of structures to assess in more detail how effective that arbitrary cutoff is and ascertain the reason for poor correlation. They showed the following:

(i) Ligands incorrectly identified as questionable, i.e. false positives (7.4%)
(ii) Incorrectly modelled ligands (5.2%)
(iii) Ligands with partially missing density (29.2%)
(iv) Glycosylation sites (31.3%)
(v) Ligands placed into electron density that is likely to originate from mother-liquor components
(vi) Incorrect ligands (4.7%)
(vii) Ligands that are entirely unjustified by the electron density (11.9%)

The first point to note from the above data is that the false-positive rate using an RSCC cut-off of 0.6 is 7.4%, demonstrating that this value alone is not sufficient to accurately identify incorrect ligand coordinates. Within the other categories, all errors can be attributed to one, or a combination, of the following two factors:

  • The inexperience of the crystallographer being unable to understand the data in front of them
  • The wilful denial of the data in front of the crystallographer in order that they present the data they wanted to see

A ligand incorrectly placed in density for a sulphate ion from the mother liquor (a) and its correction (b)

The paper observed that a disproportionate number of poor models involved glycosylation sites. In some instances these observations had been used to inform the biochemistry of the protein in question. Interestingly, this follows observations made almost a decade ago; however, many of the examples in the Twilight paper date from 2008 or later. This indicates the community as a whole is not reacting to this problem and needs further prodding.

An incomplete glycosylation site inaccurately modelled

Conclusions and suggestions

For inexperienced users looking at ligand-protein complexes from the PDB:

  • Inspect the electron density map, using COOT if it is available, to determine qualitatively whether there is evidence for the ligand being there
  • If using large numbers of ligand-protein complexes, use a script such as Twilight to find the RSCC value for each ligand, to give some confidence that a ligand is actually present as stated
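As a minimal sketch of that bulk check, one could loop over a per-ligand RSCC table and flag anything below the 0.6 cut-off discussed above. The file name and column layout here are assumptions, not the actual output format of the Twilight script.

```python
import csv

CUTOFF = 0.6

with open("ligand_rscc.csv") as fh:        # assumed columns: pdb_id, ligand, rscc
    for row in csv.DictReader(fh):
        rscc = float(row["rscc"])
        if rscc < CUTOFF:
            print(f"{row['pdb_id']} {row['ligand']}: RSCC = {rscc:.2f} "
                  "- inspect the density before trusting this ligand")
```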

For the crystallographic community:

  • Improved training of crystallographers to ensure errors due to genuine misinterpretation of the underlying data are minimised
  • More submission of electron-density maps, even if not publically available they should form part of initial structure validation
  • Software is easy to use, but analysing its output correctly remains difficult

GPGPUs for bioinformatics

As the clock speed in computer Central Processing Units (CPUs) began to plateau, their data and task parallelism was expanded to compensate. These days (2013) it is not uncommon to find upwards of a dozen processing cores on a single CPU, with each core capable of performing 8 calculations as a single operation. Graphics Processing Units were originally intended to assist CPUs by providing hardware optimised to speed up rendering highly parallel graphical data into a frame buffer. As graphical models became more complex, it became difficult to provide a single piece of hardware which implemented an optimised design for every model and every calculation the end user may desire. Instead, GPU designs evolved to be more readily programmable and exhibit greater parallelism. Top-end GPUs are now equipped with over 2,500 simple cores and have their own CUDA or OpenCL programming languages. This newfound programmability allowed users the freedom to take non-graphics tasks which would otherwise have saturated a CPU for days and to run them on the highly parallel hardware of the GPU. This technique proved so effective for certain tasks that GPU manufacturers have since begun to tweak their architectures to be suitable not just for graphics processing but also for more general purpose tasks, thus beginning the evolution of the General Purpose Graphics Processing Unit (GPGPU).
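To give a flavour of what this programmability looks like from Python, here is a minimal sketch using Numba's CUDA interface (assuming an NVIDIA GPU and the numba package are available): every element of the output is computed by its own lightweight GPU thread.

```python
import numpy as np
from numba import cuda

@cuda.jit
def scale(x, out, factor):
    i = cuda.grid(1)                 # this thread's global index
    if i < x.size:
        out[i] = x[i] * factor       # one element of work per GPU thread

n = 1_000_000
x = np.arange(n, dtype=np.float32)

d_x = cuda.to_device(x)              # copy the input to the GPU
d_out = cuda.device_array_like(x)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
scale[blocks, threads_per_block](d_x, d_out, 2.0)

print(d_out.copy_to_host()[:5])      # copy the result back and inspect it
```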

Improvements in data capture and model generation have caused an explosion in the amount of bioinformatic data now available – data which is increasing in volume faster than CPUs are increasing in either speed or parallelism. An example of this can be found here, which displays a graph of the number of proteins stored in the Protein Data Bank per year. To process this vast volume of data, many of the common tools for structure prediction, sequence analysis, molecular dynamics and so forth have now been ported to the GPGPU. The following tools are now GPGPU-enabled and offer significant speed-ups compared to their CPU-based counterparts:

Application | Description | Expected Speed-Up | Multi-GPU Support
Abalone | Models molecular dynamics of biopolymers for simulations of proteins, DNA and ligands | 4-29x | No
ACEMD | GPU simulation of molecular mechanics force fields, implicit and explicit solvent | 160 ns/day (GPU version only) | Yes
AMBER | Suite of programs to simulate molecular dynamics on biomolecules | 89.44 ns/day (JAC NVE) | Yes
BarraCUDA | Sequence mapping software | 6-10x | Yes
CUDASW++ | Open source software for Smith-Waterman protein database searches on GPUs | 10-50x | Yes
CUDA-BLASTP | Accelerates NCBI BLAST for scanning protein sequence databases | 10x | Yes
CUSHAW | Parallelized short read aligner | 10x | Yes
DL-POLY | Simulate macromolecules, polymers, ionic systems, etc. on a distributed memory parallel computer | 4x | Yes
GPU-BLAST | Local search with fast k-tuple heuristic | 3-4x | No
GROMACS | Simulation of biochemical molecules with complicated bond interactions | 165 ns/day (DHFR) | No
GPU-HMMER | Parallelized local and global search with profile Hidden Markov models | 60-100x | Yes
HOOMD-Blue | Particle dynamics package written from the ground up for GPUs | 2x | Yes
LAMMPS | Classical molecular dynamics package | 3-18x | Yes
mCUDA-MEME | Ultrafast scalable motif discovery algorithm based on MEME | 4-10x | Yes
MUMmerGPU | An open-source high-throughput parallel pairwise local sequence alignment program | 13x | No
NAMD | Designed for high-performance simulation of large molecular systems | 6.44 ns/day (STMV, 585x 2050s) | Yes
OpenMM | Library and application for molecular dynamics for HPC with GPUs | Implicit: 127-213 ns/day; Explicit: 18-55 ns/day (DHFR) | Yes
SeqNFind | A commercial GPU-accelerated sequence analysis toolset | 400x | Yes
TeraChem | A general purpose quantum chemistry package | 7-50x | Yes
UGENE | Open-source Smith-Waterman for SSE/CUDA, suffix array based repeats finder and dotplot | 6-8x | Yes
WideLM | Fits numerous linear models to a fixed design and response | 150x | Yes

It is important to note, however, that due to how GPGPUs handle floating point arithmetic compared to CPUs, results can and will differ between architectures, making a direct comparison impossible. Instead, interval arithmetic may be useful to sanity-check that the results generated on the GPU are consistent with those from a CPU-based system.

PredyFlexy – Predicting Protein Flexibility


An example output of PredyFlexy

My presentation focused on a method to predict protein flexibility – PredyFlexy. There is a webserver, and it is described in their paper (de Brevern et al. 2012). The method is covered much more explicitly in the paper Predicting protein flexibility through the prediction of local structures (Bornot et al. 2011). This work builds on earlier papers (which are mentioned on the front page of the webserver, some of which I will mention later). In terms of people, Alexandre G. de Brevern, Aurelie Bornot, and Catherine Etchebest are authors common to all of these papers.

The concept is simple; there is a library of ‘Local Structure Prototypes’ or LSPs. These LSPs are 11 residue fragments, with structure and flexibility associated with them, derived from a set of protein structures and molecular dynamics simulations. Each LSP has an SVM (support vector machine) ‘expert’ to score how likely a given sequence profile is to have said structure. A 21 residue window is used to determine the LSP of the central 11 residues within this window.

So, the user inputs a sequence, which is run through PSI-BLAST to give a sequence profile. The 5 most probable LSPs for each position are determined using the SVM experts, and the predicted flexibility of each residue is then given by the average flexibility of these 5 LSPs. There is a confidence index for the predictions, which comes from assessing the discriminative power of the SVMs: regions predicted to have LSPs with more accurate SVMs will have a higher confidence index.
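Schematically, the prediction step looks something like the sketch below. This is not the actual PredyFlexy code: the trained SVM experts are replaced by random linear scorers, the per-LSP flexibility table is invented, and the 'confidence' is a crude stand-in, but it shows the flow of scoring all experts, keeping the five best and averaging their flexibilities.

```python
import numpy as np

rng = np.random.default_rng(1)
N_LSP, WINDOW, ALPHABET, TOP_K = 120, 21, 20, 5

expert_weights = rng.normal(size=(N_LSP, WINDOW * ALPHABET))   # stand-in for the trained SVM experts
lsp_flexibility = rng.uniform(0.0, 2.0, size=N_LSP)            # stand-in for the per-LSP flexibility table

def predict_flexibility(profile_window):
    """profile_window: a 21 x 20 PSI-BLAST profile slice centred on the residue of interest."""
    scores = expert_weights @ profile_window.reshape(-1)        # one score per LSP 'expert'
    top = np.argsort(scores)[-TOP_K:]                           # the five most probable LSPs
    return lsp_flexibility[top].mean(), scores[top].mean()      # flexibility, crude confidence proxy

profile = rng.random((WINDOW, ALPHABET))                        # a fake profile window
flexibility, confidence = predict_flexibility(profile)
print(f"predicted flexibility ~ {flexibility:.2f} (confidence proxy {confidence:.2f})")
```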


Examples of Local Structure Prototypes. Taken from the supplementary information of ‘Predicting protein flexibility through the prediction of local structures’ – Aurélie Bornot, Catherine Etchebest, Alexandre G. de Brevern, Proteins 79, 3, 839–852, 2011

So, the concept seems simple, but you are probably wondering, what are these LSPs? To answer that, we have to delve into the literature. ‘Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks‘, published in 2000, and with Alexandre G. de Brevern as first author, is the sensible place to start. This introduced the concept of protein blocks, which is in effect a structural alphabet. These are 5 residue fragments (described by 8 dihedral angles), and there are 16 of them (there are pretty pictures in the supplementary data 1 of this paper). Local structure prototypes are made up of these protein blocks, which were first used to predict structure in a 2006 paper – Assessing a Novel Approach for Predicting Local 3D Protein Structures from Sequence (again, with the same authors). LSPs are 11 residue fragments, made from 7 overlapping protein blocks. Obviously there are lots of combinations of protein blocks that can give 11 residue fragments. These combinations are clustered into 120 groups. Each group is represented by the fragment within the cluster that is closest to all other fragments within the cluster, based on C-alpha RMSD. Hence 120 LSPs.
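That cluster-representative step can be sketched as a simple medoid selection: within a cluster, pick the fragment whose total Cα RMSD to all other members is smallest. The fragments below are random coordinates, and the RMSD assumes they are already superposed, so this only illustrates the selection logic.

```python
import numpy as np

def rmsd(a, b):
    """RMSD between two (n_atoms x 3) coordinate arrays, assumed already superposed."""
    return np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1)))

def medoid_index(fragments):
    """Index of the fragment with the smallest total RMSD to all other members."""
    n = len(fragments)
    total = np.zeros(n)
    for i in range(n):
        for j in range(i + 1, n):
            d = rmsd(fragments[i], fragments[j])
            total[i] += d
            total[j] += d
    return int(np.argmin(total))

rng = np.random.default_rng(0)
cluster = [rng.random((11, 3)) for _ in range(30)]    # thirty fake 11-residue C-alpha fragments
print("representative fragment:", medoid_index(cluster))
```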

The other pertinent question is, where does the flexibility data come from? Well, they did some MD simulations (see Bornot et al. 2011 for details) and took b-factors from the structures. They normalised these, by calculating the number of standard deviations away from the mean for each structure. Each LSP was attributed a b-factor and RMSF, which was the mean value for the central residue of every instance of the LSP in the data set. Additionally, each fragment in the data set was  classed as ‘rigid’, ‘flexible’, or ‘intermediate’ based on its normalized RMSF and B-factor. Each LSP was given a probability of belonging to each of these classes based on the frequency of fragments that belonged to that LSP being in each class. The figure here (taken from Bornot et al. 2011) shows the interesting weak correlation between normalised B-factor and normalised RMSF.
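The normalisation they describe is just a per-structure z-score, which can be sketched as below (the example B-factors are invented).

```python
import numpy as np

def normalise_bfactors(bfactors_by_structure):
    """Per-structure z-scores: standard deviations away from that structure's mean B-factor."""
    return {pdb_id: (np.asarray(b, dtype=float) - np.mean(b)) / np.std(b)
            for pdb_id, b in bfactors_by_structure.items()}

example = {"1abc": [20.1, 25.4, 31.0, 18.7], "2xyz": [45.2, 50.9, 39.8]}   # invented values
for pdb_id, z in normalise_bfactors(example).items():
    print(pdb_id, z.round(2))
```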

 


Normalised B-factor values according to normalized RMSF values as determined from molecular dynamics simulations. From Bornot et al 2011. Blue points represent rigid fragments, red flexible ones, and green points intermediate fragments.

Bornot et al. 2011 also gives us a guide to the ability of this prediction method (see table II). In predicting the class of fragments (rigid, intermediate or flexible), it gets the correct class about half the time. For 40% of rigid and flexible cases, the class is predicted as ‘intermediate’. Prediction rate is also strongly correlated with flexibility – more flexible regions have much poorer prediction rates. Which is not great, as we already know that most alpha helices are rigid. However, the confidence index does give a good guide as to what to trust. I could speculate that this might result in an output that tells us that helices and sheets are definitely rigid, and other elements are possibly flexible. Which would not be particularly useful, but given there are few comparable tools, something is better than nothing.

Protein flexibility is hard; experimentally determining it is difficult (and even MD simulations take a while), and people can argue about how relevant the experimental methods are (and we frequently do in our group meetings). So, like most predictive methods, a relatively fast (and simple) way to get some information about your problem is always going to be useful. If only to guide you to where you might focus your attention.