Monthly Archives: August 2013

3035 – OPIG Beer & Cycling

Other than my customary Saturday morning hangover, today I woke up with aching limbs – so I am feeling rather poor indeed. The source of this pain, other than the five and a half pints of premium-quality lager, is the OPIG beer and cycling tour of Oxford (aptly named “Le Tour de Farce”). What can I say – over here they make you work for your beer.


We started off at the Medawar building, where we sit on weekdays and occasionally on weekends. Then we cycled to The Fishes, The White Hart, The Trout (the Carne pizza is a must), yet another White Hart, and the vegetarian and vegan Gardeners Arms (I know, I know – I was drunk and they forced me in there). For those not familiar with Oxford custom – these are some of the most beautiful (and pricey) pubs this land has to offer.

OPIG members love a laugh and their beer on the cycling trip


The bike hike (devised by Charlotte) was 9.535 miles. On 5.5 pints of beer (Nick spilt his pint, like an amateur – so I gave him half of mine), that means I run on 13.87 miles per gallon (9.535 / 0.6875). So I roughly have the fuel economy of a 2013 Ferrari 458 Spider. Say what you want about these blog posts, but you cannot say I am not thorough about the research which goes on behind them.
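For the sceptics, here is the back-of-the-envelope arithmetic, assuming 8 pints to the gallon:

```python
# Back-of-the-envelope fuel economy for the Tour de Farce,
# assuming 8 pints to the gallon.
distance_miles = 9.535
pints = 5.5
gallons = pints / 8                      # 0.6875 gallons
mpg = distance_miles / gallons
print(f"{mpg:.2f} miles per gallon")     # → 13.87 miles per gallon
```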

9.5 miles in Malta

This is how far yesterday’s bike ride would have got me in my country (Malta).

This morning I have also noticed my calves are an inch thicker in diameter, which of course means I will go and throw away all my socks and make that long overdue trip to Bicester Village. Perhaps I will cycle there.

A very long introductory post about protein structure prediction

If you are a protein informatician, bioinformatician, biochemist, biologist or simply a person well informed about science, you have probably heard about protein structure prediction. If that is the case, you might be wondering what all the fuss is about. If you have never heard those terms before, don’t panic! You are about to find out what protein structure prediction is all about!

Based on my group meeting presentation last Wednesday, this blog entry will discuss why protein structure prediction is important and the potential limitations of existing methods. I will also discuss how the quality of the input may be a source of inaccuracy in existing software.

First, let us remember a little biology: our genetic code encodes the inner workings of a complicated cellular machinery, tightly regulated by other (macro)molecules such as proteins and RNAs. These two types of macromolecules are the agents that carry out the set of instructions written in DNA. Basically, RNAs and proteins are involved in a series of processes that regulate cellular function and control how the genetic code is accessed and used.

For that reason, a huge chunk of genomic data is not that useful if considered on its own. Scientists around the globe have invested millions of moneys and a huge chunk of time in order to amass piles and piles of genome sequencing data. To be fair, this whole “gotta sequence ’em all” mania did not provide us with the fundamental answers everybody was hoping for. Cracking the genetic code was like watching an episode of Lost: we were left with more questions than answers. We got a very complicated map that we can’t really understand just yet.

For that reason, I feel obliged to justify myself: protein structures ARE useful. If we know a protein structure, we can formulate a very educated guess about that protein’s function. Combine that with empirical data (e.g. where and when the protein is expressed) and it can help us unveil a lot of info about the protein’s role in cellular processes. Basically, it can answer some of the questions about the (genomic) map. If only we could do that with Lost…

There is also evidence that knowing a protein’s structure can help us design specific drugs to target and inhibit that protein. Although the evidence of such biomedical application is sparse, I believe that with development of the field, there is a trend for protein structures to become more and more important in drug discovery protocols.

Still, if we look at the number of known genome sequences and known protein structures, and at the growth of those figures over the past decade, we face a drastic scenario:

Growth of Sequences vs Structures

There is a tendency for the gap between the number of protein sequences and protein structures to increase. Hence, we are getting more and more questions and little to no answers. Observe how the green line (the protein sequences associated with a known or predicted function) is very close to the red line (the number of known protein structures). However, there is a growing gap between the red and the blue line (the number of protein sequences). Source:

Well, gathering protein structure data is just as important, if not more important, than gathering sequence data. This motivated the creation of structural genomics consortia (SGCs): facilities that specialise in solving protein structures.

I am sorry to tell you that this is all old news. We have known this for years. Nonetheless, the graph above hasn’t changed. Why? The cost limitations and the experimental difficulties associated with protein structure determination are holding us back. Solving protein structures in the lab is hard and time consuming and we are far from being as efficient at structure determination as we are at genome sequencing.

There is a possible solution to the problem: you start with a protein sequence (a sequential list of amino acids) and you try to predict its structure. This is known as protein structure prediction or protein structure modelling. Well, we have a limited number of building blocks (20) and a good understanding of their physicochemical properties, so it shouldn’t be that hard, right?

Unfortunately, modelling protein structure is not as simple as calculating how fast a block slides on an inclined plane. Predicting protein structure from sequence is a very hard problem indeed! It has troubled a plethora of minds throughout the past decades, making people lose many nights of sleep (I can vouch for that).

We can attribute that to two major limitations:

1- There are so many possible ways one can combine 20 “blocks” in a sequence of hundreds of amino acids. Each amino acid can also assume a limited range of conformations. We are looking at a massive combinatorial problem. The conformational space (the space of valid conformations a protein with a given sequence can assume) is so large that if you could check a single conformation every nanosecond, it would still take longer than the age of the universe to probe all possible conformations.
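To get a feel for the numbers, here is a minimal back-of-the-envelope sketch; the three conformations per residue and the 100-residue chain are illustrative assumptions, not real figures:

```python
# Levinthal-style estimate: exhaustive enumeration of conformations is
# hopeless even under very optimistic assumptions.
conformations_per_residue = 3      # assumed: a very coarse discretisation
chain_length = 100                 # assumed: a modest-sized protein
ns_per_conformation = 1e-9         # one conformation checked per nanosecond

total_conformations = conformations_per_residue ** chain_length
seconds_needed = total_conformations * ns_per_conformation
age_of_universe_s = 4.35e17        # roughly 13.8 billion years, in seconds

print(f"{seconds_needed / age_of_universe_s:.1e} times the age of the universe")
```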

2- Our physical models (and statistical potentials) are inaccurate. We perform so many approximations in order to make the calculations feasible on current computers that we end up with very inaccurate models.

Ok! So now you should know what protein structure prediction is, why it is important and, more importantly, why it is such a hard problem to solve. I am going to finish off by giving you a brief overview of the two most common approaches to protein structure prediction: template-based modelling (also known as homology modelling) and de novo structure prediction.

There is a general understanding that if two proteins have very similar sequences (namely, if they are homologs), then they will have similar structures. So, we can use known structures of homologs as templates to predict other structures. This is known as homology modelling.

One can do a lot of fancy talk to justify why this works. There is the evolutionary argument: “selective pressure acts on the phenotype level (which can encompass a protein structure) rather than the genotype level. Hence protein structures tend to be more conserved than sequence. For that reason and considering that sequence alone is enough to determine structure, similar sequences will have even more similar structures.”

One can also formulate some sort of physics argument: “a similar amino acid composition will lead to a similar behaviour of the interacting forces that keep the protein structure packed together. Furthermore, the energy minimum where a certain protein structure sits is so stable that it would take quite a lot of changes in the sequence to disturb that minimum energy conformation drastically.”

Probably the best argument in favour of homology modelling is that it works reasonably well. Of course, the accuracy of the models depends strongly on the sequence similarity, but for proteins with more than 40% sequence identity this method tends to give good results.
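For reference, the quantity that the 40% rule of thumb refers to can be computed from a pairwise alignment. A minimal sketch (ignoring gapped positions is a simplifying assumption; the toy sequences are invented):

```python
# Percentage sequence identity between two aligned sequences
# (gaps written as '-'); gapped positions are simply skipped.

def percent_identity(seq_a, seq_b):
    assert len(seq_a) == len(seq_b), "sequences must be aligned"
    aligned = [(a, b) for a, b in zip(seq_a, seq_b)
               if a != "-" and b != "-"]
    matches = sum(a == b for a, b in aligned)
    return 100.0 * matches / len(aligned)

print(round(percent_identity("MKT-LLIA", "MRTALLSA"), 1))   # → 71.4
```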

This raises another issue: what if we can’t find a homolog with known structure? How can we model our template-less protein sequence then? Well, it turns out that if we group proteins into families based on their sequence similarity, more than half of the families would not have a member with known structure. [This data was obtained by looking at the representation of Pfam (a protein family database) in the PDB (a protein structure database).]

Ergo, for a majority of cases we have to perform predictions from scratch (known as free modelling or de novo modelling).

Well, not necessarily from scratch. There is a specific approach to free modelling where we can build our models using existing knowledge. We can use chunks of protein, contiguous fragments extracted from known structures, to generate models. This is known as a fragment-based approach to de novo protein structure prediction. And that is one big name!

One can think of this as a small scale homology modelling, where both the physics and evolutionary arguments should still hold true to some degree. And how do we do? Can we generate good models? We perform appallingly! Accuracies are too low to generate any useful knowledge in a majority of cases. The problem with the rare cases when you get it right is that you have no means to know if you actually got the right answer.

The poor quality of the results can be justified by the two big limitations discussed above. Yet something else might be in play. In homology modelling, if you use a bad template, you will almost certainly get a bad model. In a similar way, using a bad set of fragments will lead you to a very poor final model.

Considering we already have the other two big issues (the size of the conformational space and the accuracy of current potentials) to worry about, we should aim to use the best fragment library we possibly can. This has been the recent focus of my work: an attempt to make a small contribution to solving such a hard problem.

I would love to detail my work on finding better fragments here, but I believe this post is already far too long for anyone to actually endure it and read it until the end. So, congratulations if you made it through!

ISMB/ECCB Conference 2013 (Berlin)

It’s that time of the year again… when an intrepid group of OPIGlets trundle back tired but happy from another successful conference (this time it was ISMB/ECCB and its satellite conference 3Dsig in Berlin) armed with their favourite titbits from the presented work. This blog post is a mashup of some of our highlights as presented at the last group meeting.


Post-schnitzel and out and about in Berlin!

Definitely one of the best things for me was getting the chance to hear Sir Tom Blundell (our very own academic grandfather… Charlotte’s supervisor) give the keynote at 3Dsig, talking about everything from the structure of insulin to his deep, underlying love of jazz. Here are some more of our favourite things…

Empirical contact potentials derived from binding free energy changes upon mutation
(poster presentation by Iain H. Moal and Juan Fernández-Recio)

Chosen by Jinwoo Leem

I was impressed by Moal et al.’s poster on predicting protein-protein binding affinities (in fact, it won the poster prize at 3Dsig!). The poster describes a statistical potential that considers the number of mutations in a protein and the type of interatomic contacts. Two variants of the potential were made: one considering all atoms (the atomic potential), and one considering residue side chains represented as a centroid atom (the residue potential). Ultimately, the energy change is represented as:


where N is the matrix of interatomic contacts between atoms i,j and P is a vector of contact types. Using weighted least-squares to minimise the residuals, r, the equation was used to predict affinity (ΔG) and affinity changes following mutations (ΔΔG).
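To make the fitting step concrete, here is a toy numpy sketch: find contact-type parameters P such that the contact-count matrix N reproduces the observed affinity changes, ddG ≈ N·P, via weighted least squares. The matrix sizes, the synthetic data and the uniform weights are my own invented stand-ins, not the authors' setup:

```python
import numpy as np

# Synthetic "observations": contact counts N, true parameters P_true,
# and noisy affinity changes ddG_obs = N @ P_true + noise.
rng = np.random.default_rng(0)
n_mutants, n_contact_types = 50, 5       # assumed sizes, for illustration
N = rng.poisson(3, size=(n_mutants, n_contact_types)).astype(float)
P_true = rng.normal(size=n_contact_types)
ddG_obs = N @ P_true + rng.normal(scale=0.1, size=n_mutants)

# Weighted least squares: scale rows by sqrt(weight), then solve.
w = np.ones(n_mutants)                   # per-measurement weights (uniform here)
W_sqrt = np.sqrt(w)
P_fit, *_ = np.linalg.lstsq(W_sqrt[:, None] * N, W_sqrt * ddG_obs, rcond=None)
```

With real data, N would hold the observed contact counts per complex or mutant and the weights would reflect experimental uncertainty.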


As we can see in the top two graphs, the model shows decent performance for predicting ΔΔG of enzyme-inhibitor interactions, i.e. the model can indicate how a mutation affects binding affinities. Having said this, the ΔΔG predictions for Ab-Ag interactions were poor (Pearson’s correlation = 0.5-0.6).

Moreover, when the same potential was used to predict ΔG (bottom two graphs), the correlations were even worse. In fact, for flexible protein pairs, i.e. receptor-ligand pairs whose interface RMSD > 1.0 Å, the correlation dropped to as low as 0.02.

Although the results are disappointing with respect to ΔG prediction, the model raises two interesting points. First, this is one of the few scoring functions specifically designed to predict affinity, rather than giving an arbitrary score for low RMSD. In addition, this model reiterates the challenges in predicting Ab-Ag interactions. The solution for the latter point is not yet clear, but it may be of interest to re-train the model specifically with Ab-Ag complexes and see if its performance improves!

Predicting protein contact map using evolutionary and physical constraints by integer programming
(paper presentation by Zhiyong Wang and Jinbo Xu)

Chosen by Saulo de Oliveira

Last week, I decided to present a quick overview of a paper presentation I attended at ISMB 2013.

The title of the presentation was “Predicting protein contact map using evolutionary and physical constraints by integer programming”, based on a paper of the same name.

Contact prediction (or evolutionary constraint prediction, a term I am much more fond of) was a trendy topic both at 3Dsig and at ISMB this year, with several presentations and posters on the subject.

In this particular presentation, Zhiyong Wang and Jinbo Xu described a new method to identify evolutionary constraints. What set their work apart was that they approached the problem from a different angle: their aim was to predict contacts when there is only a low number of sequences in the multiple sequence alignment (refer to previous posts on this blog for an introduction to contact prediction).

They proposed a combination of machine learning and integer programming (similar to linear programming, again a topic we discussed previously here) to perform their predictions.

The machine learning features themselves were nothing new; they were quite standard in the field, such as mutation rates from PSI-BLAST profiles and mutual information (MI). The output of the Random Forest classifier was then used to formulate the constraints of an integer programming problem. These constraints were used to enforce physical properties of proteins, based mostly on our understanding of secondary structure.
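As a concrete illustration of one such standard feature, here is a minimal sketch of mutual information between two alignment columns (the toy columns are invented; real pipelines also apply corrections, such as the average product correction, which are omitted here):

```python
from collections import Counter
from math import log2

def mutual_information(col_i, col_j):
    """MI between two alignment columns: high MI suggests the two
    positions co-vary, a classic (if noisy) signal of a contact."""
    n = len(col_i)
    p_i = Counter(col_i)
    p_j = Counter(col_j)
    p_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in p_ij.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((p_i[a] / n) * (p_j[b] / n)))
    return mi

# Toy columns: position j tracks position i perfectly (two states).
col_i = list("AAAALLLL")
col_j = list("GGGGVVVV")
print(mutual_information(col_i, col_j))   # → 1.0
```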

Results seemed positive on both a random test set (CASP10) and two other test sets. By positive, I mean there was an improvement over the current state of the art, especially for proteins with 10-1000 sequences in the MSA. Still, their precision was around 30-40% for the top L/10 predictions (where L is the protein length). Further improvements are still necessary before we can apply these evolutionary constraints to improve protein structure prediction.

Evolution of drug resistance: structural models
(presentation by Maria Safi)

Chosen by Hannah Edwards

I found this talk by Maria Safi (which won the prize for best non-keynote presentation at 3Dsig) really interesting, despite my complete lack of background knowledge in the area (what are conferences for, if not to expose you to new ideas?).

Their idea was to produce a computationally viable method for identifying drug resistance in a protein’s mutant space. Drugs work by binding to their target protein in such a way as to inhibit its native function. If the protein mutates so as to maintain its native function but impair its binding to the drug, it acquires resistance. The problem is that, even within a reasonable distance of the native sequence, a protein’s mutant space is huge, and it is by no means trivial to test for maintenance of function and binding energy.

The group’s solution was to recognise that the vast majority of mutant space would not be of interest. As such, they send their candidate mutants through a two-pass search: first, a quick and easy algorithm to swiftly eliminate the dead-end mutants (those that either are not resistant to drug binding or do not maintain their native function), and second, a more biochemically accurate yet computationally expensive algorithm applied to the shortlist identified during the first pass.

The first algorithm is based on restricted dead-end elimination, which aims to minimise a simple energy potential based on the protein’s structural stability and its binding energy to the drug. The algorithm keeps the backbone structure constant, but by varying the side chain conformations, different mutants result in different energy potentials. A mutation at residue r can then be eliminated if an alternative mutation at r will always result in a lower energy potential.
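A minimal sketch of that elimination criterion, in the classic Desmet-style form: candidate i at position r is eliminated if some alternative j at r beats it even when i gets its best-case interactions and j its worst-case ones. The positions, choices and energy values below are invented toy numbers, not a real potential:

```python
def dee_eliminate(self_energy, pair_energy, positions):
    """Dead-end elimination over discrete choices at each position.
    self_energy[r][i]: self energy of choice i at position r.
    pair_energy[r][s][i][t]: interaction of i at r with t at s."""
    eliminated = set()
    for r in positions:
        for i in self_energy[r]:
            for j in self_energy[r]:
                if i == j:
                    continue
                # Best case for i vs. worst case for j over all partners.
                best_i = self_energy[r][i] + sum(
                    min(pair_energy[r][s][i].values())
                    for s in positions if s != r)
                worst_j = self_energy[r][j] + sum(
                    max(pair_energy[r][s][j].values())
                    for s in positions if s != r)
                if best_i > worst_j:
                    eliminated.add((r, i))
                    break
    return eliminated

# Toy system: two positions, two choices each; "a" at position 0 is a dead end.
positions = [0, 1]
self_energy = {0: {"a": 5.0, "b": 0.0}, 1: {"c": 0.0, "d": 0.0}}
pair_energy = {
    0: {1: {"a": {"c": 1.0, "d": 2.0}, "b": {"c": 1.0, "d": 1.0}}},
    1: {0: {"c": {"a": 1.0, "b": 1.0}, "d": {"a": 2.0, "b": 1.0}}},
}
print(dee_eliminate(self_energy, pair_energy, positions))   # → {(0, 'a')}
```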

The second algorithm is based on the more sophisticated methodology of MM-PBSA, combining molecular mechanics with Poisson-Boltzmann surface area calculations to estimate the free energy of the compound. This final run identifies the candidate mutants.

A significant strength of their method is that it requires only the crystal structures of the drug and target protein. As a purely structural model, it eliminates the need for large amounts of training data, which, for newly emerging diseases and drugs, is often unavailable.

The main focus of Maria’s talk however was using these energy potentials to predict evolutionary pathways from a wild-type protein to a resistant candidate. By treating evolution as a random walk through mutant space, weighted by the energy potentials, and assuming selection pressure of resistance, they were able to computationally simulate evolutionary scenarios.
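To make the random-walk idea concrete, here is a toy Metropolis-style sketch of my own; the single-preference energy function and the temperature are invented stand-ins for the real stability and binding potentials, not the authors' model:

```python
import math, random

random.seed(1)
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def toy_energy(seq):
    # Assumed stand-in for the real potential: here, simply a
    # preference for 'A' residues (lower energy = "fitter").
    return sum(0.0 if aa == "A" else 1.0 for aa in seq)

def evolve(seq, steps=1000, temperature=0.5):
    """Energy-weighted random walk: propose single-point mutations and
    accept them with a Metropolis probability favouring lower energy."""
    path = [seq]
    for _ in range(steps):
        pos = random.randrange(len(seq))
        mutant = seq[:pos] + random.choice(ALPHABET) + seq[pos + 1:]
        dE = toy_energy(mutant) - toy_energy(seq)
        if dE <= 0 or random.random() < math.exp(-dE / temperature):
            seq = mutant
            path.append(seq)
    return path

path = evolve("KLMNPQ")
print(path[-1], toy_energy(path[-1]))
```

The sequence of accepted mutants in `path` plays the role of a simulated evolutionary pathway; repeating the walk many times gives a picture of which end points are accessible.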

For example, Maria focussed on the ritonavir-HIV protease complex to illustrate this method. The majority of the mutants with resistance to ritonavir which have been observed in nature were predicted by the method. For the candidates that were predicted but have not been seen, further elucidation could be found from the simulated evolutionary pathways: around 60% of these candidates were not accessible under the evolutionary model.

Sequence comes to the Structural Rescue: Identifying Relevant Protein Interfaces in Crystal Structures
(presentation by Jose M. Duarte)

Chosen by Henry Wilman

Jose Duarte presented a tool, EPPIC, which identifies and classifies protein interfaces from PDB structures. The talk was titled ‘Sequence comes to the Structural Rescue: Identifying Relevant Protein Interfaces in Crystal Structures’, and follows from their publication: Duarte JM, Srebniak A, Schärer MA, Capitani G. Protein interface classification by evolutionary analysis. BMC Bioinformatics. 2012 Dec 22.

As the title suggests, this uses both structural and sequence information to classify protein contacts as biological or crystal. There is a webserver, and a downloadable version. There are a number of existing methods to classify interfaces, and this one differs in a few ways.

The previous methods typically rely on the area of the interface. As you can see in the paper, even the benchmark sets used to test the other methods are biased, such that biological contacts have much greater areas than crystal contacts. When the authors constructed a set where the contact areas were similar, they found that the previous methods generally performed poorly. However, there are a number of ways you can define the interface or contact area, and specifically what people call the ‘core residues’ of the interface. They found that the definition from one study performed much better on their training set than the others: it defined core residues as those that lose the majority of their solvent accessible surface area on binding at the interface. A simple cut-off of ≥6 core residues at an interface produced a good classification.
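A minimal sketch of that structural criterion; the SASA values below are invented for illustration, and in practice they would come from a structure-based SASA calculation on the bound and unbound forms:

```python
def core_residues(sasa_unbound, sasa_bound, fraction=0.5):
    """Residues that bury more than `fraction` of their solvent-accessible
    surface area (SASA) upon binding are counted as interface core."""
    return [res for res, free in sasa_unbound.items()
            if free > 0 and (free - sasa_bound[res]) / free > fraction]

def looks_biological(sasa_unbound, sasa_bound, min_core=6):
    """The simple >= 6 core residues rule for calling a contact biological."""
    return len(core_residues(sasa_unbound, sasa_bound)) >= min_core

# Toy interface: seven residues bury 80% of their SASA, one buries only 10%.
unbound = {i: 100.0 for i in range(8)}
bound = {i: 20.0 for i in range(7)}
bound[7] = 90.0
print(looks_biological(unbound, bound))   # → True
```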

In addition to this, they used sequence information. We know that interfaces are important, and mutations at interface residues are often deleterious. So, for a biological interface, we would expect the residues to be better conserved than non-interacting surface residues. The authors used sequence entropy as a measure of conservation. They calculated this by collecting homologous sequences with PSI-BLAST and aligning them with Clustal Omega. For each position in the alignment, if x is the frequency of a given amino acid, the sequence entropy is given by the sum over all amino acids of −x·log(x). (They actually use a reduced alphabet for this, to avoid penalising mutations to similar amino acids.) They then compare the entropy of the ‘core’ residues in the interface to those on the surface of the protein, and to those on the periphery of the interface. If the core residues have lower entropy, then the contact is classed as biological. There are simple thresholds for both of these comparisons.
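A minimal sketch of the per-column entropy calculation, s = −Σ x·log₂(x), where lower entropy means better conservation (EPPIC's reduced amino acid alphabet is omitted here for simplicity, and the toy columns are invented):

```python
from collections import Counter
from math import log2

def column_entropy(column):
    """Shannon entropy of one alignment column, in bits."""
    n = len(column)
    freqs = [count / n for count in Counter(column).values()]
    return sum(-x * log2(x) for x in freqs)

conserved = list("DDDDDDDD")     # perfectly conserved column
variable = list("DEKRSTAG")      # every sequence differs

print(column_entropy(conserved))  # → 0.0
print(column_entropy(variable))   # → 3.0 (8 equally frequent residues)
```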

They have three metrics: one structural (the number of core residues) and two sequence-based (entropy of core residues vs. peripheral residues, and entropy of core residues vs. surface residues). They classify based on a majority vote of the three. If there is an insufficient number of homologous sequences (i.e. fewer than 8), they ignore the sequence scores and classify using the structure alone.

So why do we care about protein interfaces? Lots of us work with experimental protein structures. Many of these come from X-ray crystallography experiments. This means that when the structural information is captured, the protein is not isolated – instead it is packed against many other copies of itself, a bit like a brick in a wall many rows deep. So our protein is in contact with many others. Some of these contacts occur within the natural environment of the cell; others are a result of the crystal packing.

Now, protein interfaces are important. ‘Why?’, I hear you ask. Interfaces are how proteins interact with each other, and with other molecules. Some interfaces are large, some small; some are involved in transient interactions, others in permanent ones. Many diseases occur due to amino acid mutations at these interfaces. These change how the protein functions, which can cause bad things to happen. Similarly, if we can learn how to interact with an interface, we are (theoretically) able to alter the protein’s function with some sort of drug, and cause good things (whilst avoiding bad things).

So, this raises a few questions for people who study protein crystal structures. Things like, which bits of my protein interact? Do these interactions happen in a cell? Is my structure somehow distorted by the crystal packing? This EPPIC tool gives us one way to answer these.


Congratulations on reaching the end of this blog post… as a reward, the epic Brandenburg gate (taken by Jin)