Tag Archives: OPIG

Is “fragment-based” still the way forward in template-free protein structure prediction?

Out of the many questions surrounding the notion that you can predict a protein’s structure from its sequence, there is one in particular that I decided to tackle during last group meeting.

Protein structure prediction is a hard problem (do I sound repetitive?). One of the many cop outs employed by the structure prediction community is the idea that you can break down known structures into fragments and use these protein pieces to perform predictions. This is known as fragment-assembly or fragment-based template-free protein structure prediction.

As absurd as the idea may seem, there is robust evidence that suggests that this is actually a viable strategy. There is a notion that the fragment space is complete; you can reconstruct the backbone of any known structure based on the torsion angles of fragments from other structures. In less technical jargon, you can effectively use fragments and combine them to re-create any of the protein structures that we know and to a fairly acceptable level of precision.

So, technically, it is possible to predict a protein structure using fragments from other structures. In practice, you are still left with the problem of choosing the right fragments to model your sequence of interest. How easy do you think that is?

We can look at this question in light of observations that were made back in the early 80s. Kabsch and Sander reported that two protein fragments having exactly the same sequence can present completely different structures [1]. This complies with the notion that global properties can affect and even define local structure, which in turn suggests that selecting the right fragments to assemble a structure is not necessarily a straightforward process.

The starting point for protein structure prediction is a sequence. Since we are talking about template-free protein structure prediction, it is safe to assume that there is no good global sequence match to your target with a known structure (otherwise you would use that match/structure as a template). Hence, fragment selection is restricted to local sequence similarity, which, as suggested in the previous paragraph, is not necessarily ideal.

On the other hand, we are becoming increasingly more accurate in inferring one-dimensional properties from a protein’s sequence. These properties can and often are used to enhance our fragment-selection capabilities. Yet, even using the state-of-the-art in secondary structure and torsion angle prediction, fragment selection is still fairly imprecise.

During group meeting I highlighted a possible contrast between practical fragment space and general (or possible) fragment space. My premise is simple.  I define practical fragment space as the fragments that we can accurately select from the possible fragment space to model protein structures. In my opinion, it would be extremely interesting to quantify the difference between the two. This would answer the fundamental question of how useful fragment-assembly actually is. More importantly, it would help the community make an educated decision in regards to whether template-free structure prediction strategies should shift from fragment-based to ones based on distance constraints, an approach that is gaining popularity due to the success of contact predictions.

I am very keen to investigate this further. Maybe for my next blog post, we will have an answer! Stay tuned.

[1] Kabsch, Wolfgang, and Christian Sander. “On the use of sequence homologies to predict protein structure: identical pentapeptides can have completely different conformations.” Proceedings of the National Academy of Sciences  81.4 (1984): 1075­1078.

Slow and steady improvements in the prediction of one-dimensional protein features

What do you do when you have a big, complex problem whose solution is not necessarily trivial? You break the problem into smaller, easier to solve parts,  solve each of these sub-problems and merge the results to find the solution of the original, bigger problem. This is an algorithm design paradigm known as the divide and conquer approach.

In protein informatics, we use divide and conquer strategies to deal with a plethora of large and complicated problems. From protein structure prediction to protein-protein interaction networks, we have a wide range of sub and sub-sub problems whose solutions are supposed to help us with the bigger picture.

In particular, prediction of the so called one-dimensional protein features are fundamental sub-problems with a wide range of applications such as protein structure modelling,  homology detection, functional characterization and others. Here, one-dimensional protein features refer to secondary structure, backbone dihedral and C-alpha angles, and solvent accessible surface area.

In this week’s group meeting, I discussed the latest advancements in prediction of one-dimensional features as described in an article published by Heffernan R. and colleagues in Scientific Reports (2015):

“Improving prediction of secondary structure, local backbone angles, and solvent accessible surface area of proteins by iterative deep learning.”

In this article, the authors describe the implementation of SPIDER2, a deep learning approach to predict secondary structure, solvent accessible surface area, and four backbone angles (the traditional dihedrals phi and psi, and the recently explored theta and tau).

“Deep learning” is the buzzword (buzz-two-words or buzzsentence, maybe?) of the moment. For those of you who have no idea what I am talking about, deep learning is an umbrella term for a series of convoluted machine learning methods. The term deep comes from the multiple hidden layers of neurons used during learning.

Deep learning is a very fashionable term for a reason. These methods have been shown to produce state-of-the-art results for a wide range of applications in several fields, including bioinformatics. As a matter of fact, one of the leading methods for contact prediction (previously introduced in this blog post), uses a deep learning approach to improve the precision of predicted protein contacts.

Machine learning has already been explored to predict one-dimensional protein features, showing promising (and more importantly, useful) results. With the emergence of new, more powerful machine learning techniques such as deep learning, previous software are now becoming obsolete.

Based on this premise, Heffernan R. and colleagues implemented and applied their deep learning approach to improve the prediction of one-dimensional protein features. Their training process was rigorous: they performed a 10-fold cross validation using their training set of ~4500 proteins and, on top of that, they also had two independent test sets (a ~1200 protein test set and a set based on the targets of CASP11).  Proteins in all sets did not share more than 25% (30% sequence identity for the CASP set) to any other protein in any of the sets.

The method described in the paper, SPIDER2, was thoroughly compared with state-of-the art prediction software for each of the one-dimensional protein features that it  is capable of predicting. Results show that SPIDER2 achieves a small, yet significant improvement compared to other methods.

It is just like they say, slow and steady wins the race, right? In this case, I am not so sure. It would be interesting to see how much the small increments in precision obtained by SPIDER2 can improve the bigger picture, whichever your bigger picture is. The thing about divide and conquer is that if you become marginally better at solving one of the parts, that doesn’t necessarily imply that you will improve the solution of the bigger, main problem.

If we think about it, during the “conquer” stage (that is, when you are merging the solution of the smaller parts to get to the bigger picture),  you may make compromises that completely disregard any minor improvements for the sub-problems. For instance, in my bigger picture, de novo protein structure prediction, predicted local properties can be sacrificed to ensure a more globally consistent model. More than that, most methods that perform de novo structure prediction already account for a certain degree of error or uncertainty for, say, secondary structure prediction. This is particularly important for the border regions between secondary structure elements (i.e. where an alpha-helix ends and a loop begins). Therefore, even if you improve the precision of your predictions for those border regions, the best approach for structure prediction may still consider those slightly more precise border predictions as unreliable.

The other moral of this story is far more pessimistic. If you think about it, there were significant advancements in machine learning, which led to the creation of ever-more-so complicated neural network architectures. However, when we look back to how much improvement we observed when these highly elaborate techniques were applied to an old problem (prediction of one-dimensional protein features), it seems that the pay-off wasn’t as significant (at least as I would expect). Maybe, I am a glass half-empty kind of guy, but given the buzz surrounding deep learning, I think minor improvements is a bit of a let down. Not to take any credit away from the authors. Their work was rigorous and scientifically very sound. It is just that maybe we are reaching our limits when it comes to applying machine learning to predict secondary structure. Maybe when the next generation of buzzword-worthy machine learning techniques appear, we will observe an even smaller improvement to secondary structure prediction. Which leaves a very bitter unanswered question in all our minds: if machine learning is not the answer, what is?

Predicted protein contacts: is it the solution to (de novo) protein structure prediction?

So what is this buzz I hear about predicted protein contacts? Is it really the long awaited solution for one of the biggest open problems in biology today? Has protein structure prediction been solved?

Well, first things first. Let me give you a quick introduction to this predicted protein contact business (probably not quick enough for an elevator pitch, but hopefully you are not reading this in an elevator).

Nowadays, the scientific community has become very good at sequencing things (and by things I mean genetic things, like whole genomes of a bunch of different people and organisms). We are so good at it that mountains of sequence data are now available: genes, mRNAs, protein sequences. The question is what do we do with all this data?

Good scientists are coming up with new and creative ideas to extract knowledge from these mountains of data. For instance, one can build multiple sequence alignments using protein sequences for a given protein family. One of the ways in which information can be extracted from these multiple sequence alignments is by identifying extremely conserved columns (think of the alignment as a big matrix). Residues in these conserved positions are good candidates for being functionally important for the proteins in that particular family.

Another interesting thing that can be done is to look for pairs of residues that are mutating in a correlated fashion. In more practical terms, you are ascertaining how correlated is the information between two columns of a multiple sequence alignment; how often a change in one of them is countered by a change in the other. Why would anyone care about that? Simple. There is an assumption that residues that mutate in a correlated fashion are co-evolving. In other words, they share some sort of functional dependence (i.e. spatial proximity) that is under selective pressure.

Ok, that was a lot of hypotheticals, does it work? For many years, it didn’t. There were lots of issues with the way these correlations were computed and one of the biggest problems was to identify (and correct for) transitivity. Transitivity is the idea that you observe a false correlation between residues A and C because residues A,B and residues B,C are mutating in a correlated fashion. AS more powerful statistical methods were developed (borrowing some ideas from mechanical statistics), the transitivity issue has seemingly been solved.

The newest methods that detect co-evolving residues in a multiple sequence alignment are capable of detecting protein contacts with high precision. In this context, a contact is defined as two residues that are close together in a protein structure. How close?  Their C-betas must be 8 Angstroms or less apart. When sufficient sequence information is available (at least 500 sequences in the MSA), the average precision of the predicted contacts can reach 80%.

This is a powerful way of converting sequence information into distance constraints, which can be used for protein structure modelling. If a sufficient number of correct distance constraints is used, we can accurately predict the topology of a protein [1]. Recently, we have also observed great advances in the way that models are refined (that is, refining a model that contains the correct topology to atomic, near-experimental resolution). If you put those two things together, we start to look at a very nice picture.

So what’s the catch? The catch was there. Very subtle. “When sufficient sequence information is available”. Currently, there is an estimate that only 15% of the de novo protein structure prediction cases present sufficient sequence information for the prediction of protein contacts. One potential solution would be to sit and wait for more and more sequences to be obtained. Yet a potential pitfall of sitting and waiting is that there is no guarantee that we will have sufficient sequence information for a large number of protein families, as they may as well present less than 500 members.

Furthermore, scientists are not very good at sitting around and waiting. They need to keep themselves busy. There are many things that the community as whole can invest time on while we wait for more sequences to be generated. For instance, we want to be sure that, for the cases where there is a sufficient number of sequences, that we get the modelling step right (and predict the accurate protein topology). Predicted contacts also show potential as a tool for quality assessment and may prove to be a nice way of ascertaining whether you have confidence that a model with correct topology was created. More than that, model refinement still needs to improve if we want to make sure that we get from the correct topology to near-experimental resolution.

Protein structure prediction is a hard problem and with so much room for improvement, we still have a long way to go. Yet, this predicted contact business is a huge step in the right direction. Maybe, it won’t be long before models generated ab initio are considered as reliable as the ones generated using a template. Who knows what promised the future holds.

References:

[1] Kim DE, Dimaio F, Yu-Ruei Wang R, Song Y, Baker D. One contact for every twelve residues allows robust and accurate topology-level protein structure modeling. Proteins. 2014 Feb;82 Suppl 2:208-18. doi: 10.1002/prot.24374. Epub 2013 Sep 10.

 

 

 

Hypotheses and Perspectives onto de novo protein structure prediction

Before I start with my musings about my work and the topic of my D. Phil thesis, I would like to direct you to a couple of previous entries here on BLOPIG. If you are completely new to the field of protein structure prediction or if you just need to refresh your brain a bit, here are two interesting pieces that may give you a bit of context:

A very long introductory post about protein structure prediction

and

de novo Protein Structure Prediction software: an elegant “monkey with a typewriter”

Brilliant! Now, we are ready to start.

In this OPIG group meeting, I presented some results that were obtained during my long quest to predict protein structures.

Of course, no good science can happen without the postulation of question-driving hypotheses. This is where I will start my scientific rant: the underlying hypotheses that inspired me to inquire, investigate, explore, analyse, and repeat. A process all so familiar to many.

As previously discussed (you did read the previous posts as suggested, didn’t you?), de novo protein structure prediction is a very hard problem. Computational approaches often struggle to search the humongous conformational space efficiently. Who can blame them? The number of possible protein conformations is so astronomically large that it would take MUCH longer than the age of the universe to look at every single possible protein conformation.

If we go back to biology, protein molecules are constantly undergoing folding. More so, they manage to do so efficiently and accurately. How is that possible? And can we use that information to improve our computational methods?

The initial hypothesis we formulated in the course of my degree was the following:

“We [the scientific community] can benefit from better understanding the context under which protein molecules are folding in vivo. We can use biology as a source of inspiration to improve existing methods that perform structure prediction.”

Hence came the idea to look at biology and search for inspiration. [Side note: It is my personal belief that there should be a back and forth process, a communication, between computational methods and biology. Biology can inspire computational methods, which in turn can shed light on biological hypotheses that are hard to validate experimentally]

To direct the search for biological inspiration, it was paramount to understand the limitations of current prediction methods. I have narrowed down the limitations of de novo protein structure prediction approaches to three major issues:

1- The heuristics that rely on sampling the conformational space using fragments extracted from know structures will fail when those fragments do not encompass or correctly describe the right answer.

2- Even when the conformational space is reduced, say, to fragment space, the combinatorial problem persists. The energy landscape is rugged and unrepresentative of the actual in vivo landscape. Heuristics are not sampling the conformational space efficiently.

3- Following from the previous point, the reason why the energy landscape is unrepresentative of the in vivo landscape is due to the inaccuracy of the knowledge-based potentials used in de novo structure prediction.

Obviously, there are other relevant issues with de novo structure prediction. Nonetheless, I only have a limited amount of time for my D.Phil and those are the limitations I decided to focus on.

To counter each of these offsets, we have looked for inspiration in biology.

Our understanding from looking at different protein structures is that several conformational constraints are imposed by alpha-helices and beta-strands. That is a consequence of hydrogen bond formation within secondary structure elements. Unsurprisingly, when looking for fragments that represent the correct structure of a protein, it is much easier to identify good fragments for alpha-helical or beta-strand regions. Loop regions, on the other hand, are much harder to be described correctly by fragments extracted from known structures. We have incorporated this important information into a fragment library generation software in an attempt to address limitation number 1.

We have investigated the applicability of a biological hypothesis, cotranslational protein folding, into a structure prediction context. Cotranslational protein folding is the notion that some proteins begin their folding process as they are being synthesised. We further hypothesise that cotranslational protein folding restricts the conformational space, promoting the formation of energetically-favourable intermediates, thus steering the folding path towards the right conformation. This hypothesis has been tested in order to improve the efficiency of the heuristics used to search the conformational space.

Finally, following the current trend in protein structure prediction, we used evolutionary information to improve our knowledge-based potentials. Many methods now consider correlated mutations to improve their predictions, namely the idea that residues that mutate in a correlated fashion present spatial proximity in a protein structure. Multiple sequence alignments and elegant statistical techniques can be used to identify these correlated mutations. There is a substantial amount of evidence that this correlated evolution can significantly improve the output of structure prediction, leading us one step closer to solving the protein structure prediction problem. Incorporating this evolution-based information into our routine assisted us in addressing the lack of precision of existing energy potentials.

Well, does it work? Surprisingly or not, in some cases it does! We have participated in a blind competition: the Critical Assessment for protein Structure Prediction (CASP). This event is rather unique and it brings together the whole structure prediction community. It also enables the community to gauge at how good we are at predicting protein structures. Working with completely blind predictions, we were able to produce one correct answer, which is a good thing (I guess).

All of this comes together nicely in our biologically inspired pipeline to predict protein structures. I like to think of our computational pipeline as a microscope. We can use it to prod and look at biology. We can tinker with hypotheses, implement potentials and test them, see what is useful for us and what isn’t. It may not be exactly what get the papers published, but the investigative character of our structure prediction pipeline is definitely the favourite aspect of my work. It is the aspect that makes me feel like a scientist.

Protein Structure Prediction, my own metaphorical microscope…

 

Journal Club: Native contacts in protein folding

Like your good old headphone cables, strings of amino acids have the potential to fold into a vast number of different conformations given the appropriate conditions. A conservative estimation for the time it would take a 100 residue protein to explore all theoretically possible conformations would exceed the age of the Universe several times. This is obviously not feasible and was pointed out by Levinthal when he published his “How To Fold Graciously” in 1969.

The so called Protein-Folding Problem has since been under intense study, which inevitably has led to a few theories and models about its nature. Due to the lack of appropriate wet-lab methods to study this phenomenon theoretical, computational approaches have been key to devising impactful frameworks for formally describing protein folding. One of these goes under the name of principle of minimum frustration introduced by Bryngelson and Wolynes in the late 80s (1). It states that proteins by evolution were enriched for sequences with the propensity to fold into low-energy structures, while actively selecting against traps. By avoiding mis-folding and non-native contacts, the theory says, a smooth funnel-like energy landscape with native-state minima is created that ensures robust and fast folding.

This implies that native contacts, i.e. residues that interact in the fully folded protein play a major role in the folding process. Gō models (2), named after Nobuhiro Gō who first proposed this method, are based around this assumption with the energetic contributions of native interactions acting as the sole driving forces in the folding process. While this approach has yielded promising results, many of which were in concordance with experiments, its underlying principles have never been validated in a statistically meaningful way.

native contact schematic

A schematic for native-contact-driven protein folding

In 2013 a study by Best, Hummer and Eaton (3) formally addressed this question. By devising a set of statistical quantities aimed at weighting the importance of native and non-native interactions for folding and applying these to the analysis of several long MD folding simulations they were able to show a “native-centric mechanism” for small fast-folding proteins.

In a first step it was assessed whether the fraction of native contacts  provided a suitable reaction coordinate for the simulated folding events. From their equilibrium simulations two thresholds of native-contact-fractions  were chosen that defined folded and unfolded states (a two-state model is assumed). Overlaying the values for the most visited native-contact-fractions during simulation against these thresholds revealed a strong correlation between the two equilibrium probability density maxima and the protein’s fold state. In addition they showed that the range of native-contact-fractions between those found to represent unfolded and folded thresholds were indicative of being on a transition path (defined as the  “.. regions of the trajectories that cross directly from the unfolded well to the folded well ..”).

A further measure was introduced with the contact lifetime test. The log-ratio of the time a contact spent on a transition path vs the time it existed in the unfolded state was calculated and compared in a heat-map to the native contact map coloured by the number of contacts between residues.

figure2

Contact life time test for a selected protein.
Adapted from (3).

Among others this result revealed a clear connection between contacts with longer transition path life times and the number of contacts they made in the native structure.

So what about non-native interactions?

Screenshot from 2014-03-27 12:47:04

One of the measures addressing this question was the Bayesian measure for non-native contacts on transition paths. In the examples used in this paper, no obvious link between being on a transition path given a non-native contact was found unless they were close to native contacts. Further criteria such as the complementary quantity, which is the probability of being on a transition path when a contact is not made, concluded in a similar fashion.

Interestingly, it was found that the one protein that was influenced by non-native contacts was the designed α3D. Best et al. reasoned that additional frustration introduced when building a protein with artificially introduced stability has led to a shifting of helix register giving rise to this outlier.

When taken together, these results lay a robust foundation for further studies along the same lines. It is too early to accept or reject the presented findings as universal truth, but strong arguments for the native-centric mechanism being a reasonable model in small fast-folding proteins have been made. It would not be far-fetched to think that larger proteins would adhere to similar principles with non-native contacts modulating the landscape, especially when considering individual downhill folding modules.

References:

(1) Bryngelson, J.D. et al., 1995. Funnels, pathways, and the energy landscape of protein folding: a synthesis. Proteins, 21(3), pp.167–95.

(2) Taketomi, H., Ueda, Y. & Gō, N., 1975. Studies on protein folding, unfolding and fluctuations by computer simulation. I. The effect of specific amino acid sequence represented by specific inter-unit interactions. International journal of peptide and protein research, 7(6), pp.445–59.

(3) Best, R.B., Hummer, G. & Eaton, W.A., 2013. Native contacts determine protein folding mechanisms in atomistic simulations. Proceedings of the National Academy of Sciences of the United States of America, 110(44), pp.17874–9.

Kinetic Modelling of Co-translational Protein Folding (Journal Club)

Following up on last week’s entry, this post will explore the same topic: polypeptide chains assuming native-like conformations as they are extruded from the ribosome, or for the less intimate with the concept, co-translational protein folding.

Before addressing some important questions concerning co-translational protein folding, I would like to make a parenthesis: I want to dedicate a paragraph or two to talk about time.

Biological processes are dynamic. They are events that occur over a period of time. For instance, one can quantify the effect of mutations propagated and accumulated over millions of years of evolution. One can also quantify the femtoseconds in which subtle conformational changes occur in photoreceptor proteins like rhodopsin, when they respond to light. Time is fundamental to understand and model any sort of biological event.

Albeit it might seem obvious to the reader that time is so crucial to amass biological knowledge, those of us more theoretically inclined (bioinformaticians, computational biologists, biostatisticians,  mathematical biologists and so on and so forth) are usually  presented with models that tend to over-simplify reality. Surprisingly enough, there are many over-simplistic models that neglect the effect of time in order to “better” represent whatever they claim to model. Take Protein Docking for instance. The biological process at hand presents a complicated dynamic. There is a kinetic equilibrium, in which a vast amount of protein and ligand molecules interact, associating into complexes and dissociating. Nonetheless, Protein Docking is traditionally reduced to the binding affinity between a pair of molecules. As one might say, this is only a problem if I can present a solution… Luckily, Protein Docking is not my subject of expertise, so I will leave this question open to more tenacious minds than my own.

One of the areas in which I am truly interested in is the co-translational aspect of protein folding. If one performs a quick Google Images search, using the terms “Protein Synthesis” or “Protein Translation”, the results tell a very interesting story.  The vast majority of nascent protein chains are represented as fully elongates peptide chains. In a majority of pictures, the growing peptides do not even present secondary structure. They are mostly represented by long, unfolded, almost linear polymers.

Now, any first year Biochemistry student learns about something called Hydrophobicity (or hydrophilicity depending on whether you are a glass half empty or half full type of person). It is biochemistry-introductory-text-book stuff that some residues are polar and some residues are apolar, and hence will hide from water, forming a hydrophobic core. That (hydrophobicity) is one of the main driving forces of  protein folding.

Hence, most of the images that appear in our Google Images search are not very representative. They are plain wrong. It is simple physics that the growing peptide chains will form secondary and tertiary structures during the process of protein synthesis. One has to remember that this process is dynamic, it is happening over time. Under these circumstances, time should not be neglected. The time scale at which extrusion occurs is slow enough to allow the nascent chain to probe conformations and simply abide to the laws of physics. A fully elongated, completely unfolded and denatured peptide chain would not exist during protein synthesis. These nascent chains would adopt intermediate conformations simply as a result of apolar residues trying to hide from water.

Ok. Now, the BIG question that can be raised is whether those intermediate conformations actually resemble the native state of the fully elongated protein. I do not want to incur in Baby Kicking, but one thing that evolution has taught us is that cells have evolved to be highly efficient systems. There is no room for wasted energy. It makes sense to hypothesize that over millions of years, the cellular machinery has adapted to explore these intermediate conformations in order to make the process of protein folding more efficient.

Over the past couple of years, substantial evidence has been amassed that codon usage and the degeneracy of the genetic code could be exploited by cells to ensure that protein folding occurs accurately and efficiently. There are many theoretical ways that such exploitation could occur: the codon translation speed could facilitate the formation of certain intermediates that are beneficial for protein folding, that increase stability or that prevent protein aggregation. There is even a biomedical impact given that some observed pathologies have been associated with synonymous codon mutations that may lead to misfolded proteins.

In the paper I presented during this journal club [1], O’Brien and colleagues have devised and described a very interesting kinetic model for protein translation. Their model was used to describe possible scenarios in which both fast and slow translation speed codons are coordinators of co-translational protein folding. Please note that, in this context, co-translational protein folding is perceived as an enrichment of intermediate conformations of  the nascent chains, which resemble the native structure of the fully elongated protein.

In the model described in the paper, they opted for a probabilistic approach instead of an analytical (differential equations) approach. The time is modelled by the use of probabilities. The authors derived a formula to quantify the expected proportion of nascent chains of a given length that would be in a Folded intermediate state (one that resembles the native structure). They have managed to express this in terms of a rate of codon translation. Therefore, they stablish a direct relationship between Co-Translational protein folding and codon translation speed.

Their analysis is robust as none of the constants and kinetic rates need to be experimentally derived in order to provide insights about the protein folding process. Overall, I think the way the model was built was quite ingenious and very interesting. I would suggest any interested reader to read the article if they want to understand how the whole modelling was carried out.

Overall, I think the authors present a compelling argument for how cells could explore codon degeneracy and co-translational aspects of protein folding to improve folding efficiency. One of their results present a scenario in which fast translation speed codons can be used to assist in the fold of unstable protein regions, preventing the formation of misfolded intermediates.

One of the many functions of mathematical models is to provide insights into the underlying biology of the phenomena they attempt to model. The lack of any experimental evidence to support this paper’s results does not make it any less interesting. The article presents to the readers a sound and solid mathematical argument as to how co-translational aspects of protein folding could be beneficial for cell efficiency. If anything, they provide interesting hypotheses that might drive experimentalists in the future.

[1] Kinetic modelling indicates that fast-translating codons can coordinate cotranslational protein folding by avoiding misfolded intermediates.

Journal Club: Can Linear Progamming (LP) be useful to us?

Linear programming (LP) is known as a fast and powerful computational technique. It has been applied to a large range of problems in finances and economics, but it is not very popular among us bioinformaticians, computational biologists, and the likes.

Source: http://hotmath.com/hotmath_help/topics/linear-programming.html

Linear Programming is all about find feasible solutions that satisfy a series of constraints (usually represented by inequalities). Does it sound like a familiar problem to bioinformaticians and computational biologists out there?

Source: http://hotmath.com/hotmath_help/topics/linear-programming.html

This leaves room for some questioning: can biological phenomena be modelled or simplified under the assumption of linearity? Furthermore, can LP be used to tackle the many difficult problems posed in our field? Perhaps an even bigger question: why would any of us use Linear Programming instead of another type of linear modelling? What are the advantages of it?

I will not incur in explaining the particulars of LP here. There is a plethora of materials available online (Wikipedia and Wolfram are accessible starting points) that detail Linear Programming. For those eager for something more substantial, V. Chvatal’s Linear Programming and Dantzig’s Linear Programming and Extensions are two good texts on the subject.

During this week’s journal club, I discussed an article that attempted to use Linear Programming to devise knowledge-based Docking Potentials (DP) tailored for transient protein-protein complexes. Transient complexes tend to be under-represented on the PDB, mostly due to the inherent difficulties of crystallizing such complexes. Hence, the development of knowledge-based potentials for these special cases of protein interaction is drastically hindered by a sample size limitation.

Source: Bizzarri AR, Brunori E, Bonanni B, Cannistraro S. Docking and molecular dynamics simulation of the Azurin–Cytochrome c551 electron transfer complex. J. Mol. Recognit. 2007; 20: 122–131

A cartoon representation of a transient complex between Azurin (cyan) and its partner Cytochrome C551 (dark blue) from Pseudomonas aeruginosa. Transient protein complexes are hard to crystallize, hence, under-represented on the PDB.

Source: Bizzarri AR, Brunori E, Bonanni B, Cannistraro S. Docking and molecular dynamics simulation of the Azurin–Cytochrome c551 electron transfer complex. J. Mol. Recognit. 2007; 20: 122–131

To offset such limitation, it would be ideal if one could extrapolate information from decoys (non-native conformations obtained from computational docking tools) in order to improve the Docking potentials. Furthermore, in an ideal world, one would also address the bias introduced by homology/sequence similarity between the existing proteins in the available structures of transient complexes.

The author of the article “Designing coarse grained-and atom based-potentials for protein-protein docking – Tobi D. – BMC Structural Biology 2010, 10:40 doi:10.1186/1472-6807-10-40 ” claims that LP can address such issues by incorporating information from the decoys as linear constraints to the model. The article describes a linear problem, in which the aim is to minimize the variance of how much the non-native energy potentials differ from the native ones. Also, they impose the constraints that native structures must have a lower energy than all of the non-native structures for a given complex (lower in this case is good).

The energy is defined as a weighted sum of the counts of specific interaction types on the complex interface. In their work, the author employed two models: an atom-based model and a side chain-based model. These models are used to classify atoms into groups and to simplify calculations. Initially, they define boolean (one-step) interactions: two atoms interact if they are within a cutoff distance of each other. This cutoff varies according to the type of atoms involved. The initial model led to a state of infeasibility, and it was then replaced by a two-step model, where you have strong and weak interactions and two sets of cutoff (this leads to twice as many unknowns in the LP model).

Well, does it work? How does it fair against other existing knowledge-based DPs?

Source: Designing coarse grained-and atom based-potentials for protein-protein docking. - Tobi D. - BMC Structural Biology 2010, 10:40 doi:10.1186/1472-6807-10-40Source: Designing coarse grained-and atom based-potentials for protein-protein docking. – Tobi D. – BMC Structural Biology 2010, 10:40 doi:10.1186/1472-6807-10-40

Despite the lack of brilliant results or any apparent improvement compared to the state-of-art, the potentials described in the article seem to slightly outperform ZDOCK2.3’s scoring functions.

This may actually speak in favour of the applicability of LP to problems in our area. In the case presented during the journal club, an LP approach produced comparable results to more conventional techniques.

Perhaps the best answer to “why should I use LP?” is that it is an unconventional, creative solution. It is significantly fast and, therefore, easy to try out depending on your problem. Science is all about experimentation, after all. Why would you not try a different technique if you have the means to?

Image Source: http://www.designthenewbusiness.com/blog/documenting/thinking-inside-the-box.html

The moral of the story: it is good to think outside the box, as long as you keep your feet on the ground.

Image Source: http://www.designthenewbusiness.com/blog/documenting/thinking-inside-the-box.html

Check the article discussed in the post here.