
OPIG Algorithm Club Problem #1: The 3n+1 problem

In the first meeting of the OPIG Algorithm Club, we tackled the 3n+1 problem from the Sphere Online Judge (SPOJ).

— Background —

The problem relates to the Collatz conjecture. Before describing the conjecture, let us define a very simple recursive relation:

  • If a number is odd, multiply it by 3 and add 1 (hence 3n+1).
  • If a number is even, divide it by 2.

The Collatz conjecture states that for any positive integer, you can repeat this recursion indefinitely and you will always reach the number 1.

The 3n+1 problem requires yet another concept: the cycle length, i.e. the number of terms in the sequence generated from a given number until 1 is reached (counting both the starting number and the final 1). For example, starting from 22 the sequence is 22, 11, 34, 17, 52, 26, 13, 40, 20, 10, 5, 16, 8, 4, 2, 1, so the cycle length of 22 is 16.

— Goal —

The aim of the problem is: given two integers i and j, find the maximum cycle length over all numbers in the interval between the two integers (endpoints included).

— Important Issues —

Two things are worth mentioning before attempting to implement this problem:

  • i could be either greater than or less than j.
  • Despite the fact that the problem description states that all operations can be performed using int, there are certain inputs between 100,000 and 1,000,000 that violate that statement (hence the use of unsigned int in the code below).

— Implementation —

A simple solution can be implemented by using a recursive function to compute the cycle length:

int Cycle_Length (unsigned int n) /* Note the unsigned type: intermediate values of 3*n+1 can exceed the range of a signed int */
{
    unsigned int aux;

    /* Stopping condition - we don't want the recursion to run forever */
    if (n == 1) return 1;

    /* The recursion: the first branch handles odd n, the second handles even n */
    aux = (n % 2) ? (1 + Cycle_Length(3*n + 1)) : (1 + Cycle_Length(n >> 1));
    /* Note that division by two is performed by shifting the bits one position to the right */
    return aux;
}

Now, there are two ways we can optimise this function and make it more efficient.

The first one relates to a property of odd numbers: we know that if n is odd, then 3*n+1 is going to be even. Therefore we can skip one iteration of the recursion straight away:

aux = (n%2) ? (2+Cycle_Length((3*n+1)>>1)) : (1+Cycle_Length(n>>1)) ;

We can further optimise our code by using dynamic programming: we store the cycle lengths we have already computed so that we never need to compute them again.

Using dynamic programming and the tricks above, we can get to the final solution (which runs pretty fast on the Online Judge platform):

#include <stdio.h>

/* Memoisation table: cycle lengths for every number up to 100,000 */
int CycleLength[100001];

int Cycle_Length (unsigned int n)
{
	int aux;
	/* Numbers above 100,000 are not memoised; compute them recursively */
	if (n > 100000)
	{
		aux = (n%2) ? (2 + Cycle_Length((3*n+1)>>1)) : (1 + Cycle_Length(n>>1));
		return aux;
	}
	/* Compute and store the cycle length only if we have not seen n before */
	if (!CycleLength[n])
		CycleLength[n] = (n%2) ? (2 + Cycle_Length((3*n+1)>>1)) : (1 + Cycle_Length(n>>1));
	return CycleLength[n];
}

/* Return the maximum cycle length over the interval [a, b] */
int findmax(int a, int b)
{
	int max = 0, aux, i;
	for (i = a; i <= b; i++)
	{
		aux = Cycle_Length(i);
		if (max < aux)
			max = aux;
	}
	return max;
}

int main()
{
	int a, b;
	CycleLength[1] = 1;
	/* Read pairs of integers until the end of the input; i may be greater or smaller than j */
	while (scanf("%d %d", &a, &b) == 2)
		a < b ? printf("%d %d %d\n", a, b, findmax(a, b))
		      : printf("%d %d %d\n", a, b, findmax(b, a));
	return 0;
}

Ta daam!

Kinetic Modelling of Co-translational Protein Folding (Journal Club)

Following up on last week’s entry, this post will explore the same topic: polypeptide chains assuming native-like conformations as they are extruded from the ribosome or, for those less familiar with the concept, co-translational protein folding.

Before addressing some important questions concerning co-translational protein folding, I would like to make a brief aside: I want to dedicate a paragraph or two to talking about time.

Biological processes are dynamic. They are events that occur over a period of time. For instance, one can quantify the effect of mutations propagated and accumulated over millions of years of evolution. One can also quantify the femtoseconds in which subtle conformational changes occur in photoreceptor proteins like rhodopsin, when they respond to light. Time is fundamental to understand and model any sort of biological event.

Although it might seem obvious to the reader that time is crucial to amass biological knowledge, those of us who are more theoretically inclined (bioinformaticians, computational biologists, biostatisticians, mathematical biologists and so on and so forth) are usually presented with models that tend to over-simplify reality. Surprisingly enough, there are many over-simplistic models that neglect the effect of time in order to “better” represent whatever they claim to model. Take Protein Docking for instance. The biological process at hand presents complicated dynamics. There is a kinetic equilibrium, in which a vast number of protein and ligand molecules interact, associating into complexes and dissociating. Nonetheless, Protein Docking is traditionally reduced to the binding affinity between a single pair of molecules. Of course, one might say that pointing out a problem is only fair if I can present a solution… Luckily, Protein Docking is not my subject of expertise, so I will leave this question open to more tenacious minds than my own.

One of the areas in which I am truly interested is the co-translational aspect of protein folding. If one performs a quick Google Images search using the terms “Protein Synthesis” or “Protein Translation”, the results tell a very interesting story. The vast majority of nascent protein chains are represented as fully elongated peptide chains. In most pictures, the growing peptides do not even present secondary structure; they are represented as long, unfolded, almost linear polymers.

Now, any first-year Biochemistry student learns about something called hydrophobicity (or hydrophilicity, depending on whether you are a glass-half-empty or glass-half-full type of person). It is biochemistry-introductory-textbook stuff that some residues are polar and some are apolar, and that the apolar ones will hide from water, forming a hydrophobic core. Hydrophobicity is one of the main driving forces of protein folding.

Hence, most of the images that appear in our Google Images search are not very representative. They are plain wrong. It is simple physics that growing peptide chains will form secondary and tertiary structure during protein synthesis. One has to remember that this process is dynamic: it happens over time. Under these circumstances, time should not be neglected. The time scale at which extrusion occurs is slow enough to allow the nascent chain to probe conformations and simply abide by the laws of physics. A fully elongated, completely unfolded and denatured peptide chain would not exist during protein synthesis. Nascent chains adopt intermediate conformations simply as a result of apolar residues trying to hide from water.

Ok. Now, the BIG question that can be raised is whether those intermediate conformations actually resemble the native state of the fully elongated protein. I do not want to engage in baby kicking, but one thing that evolution has taught us is that cells have evolved to be highly efficient systems. There is no room for wasted energy. It makes sense to hypothesize that, over millions of years, the cellular machinery has adapted to exploit these intermediate conformations in order to make protein folding more efficient.

Over the past couple of years, substantial evidence has been amassed that codon usage and the degeneracy of the genetic code could be exploited by cells to ensure that protein folding occurs accurately and efficiently. There are many theoretical ways that such exploitation could occur: the codon translation speed could facilitate the formation of certain intermediates that are beneficial for protein folding, that increase stability or that prevent protein aggregation. There is even a biomedical impact given that some observed pathologies have been associated with synonymous codon mutations that may lead to misfolded proteins.

In the paper I presented during this journal club [1], O’Brien and colleagues have devised and described a very interesting kinetic model for protein translation. Their model was used to describe possible scenarios in which both fast and slow translation speed codons are coordinators of co-translational protein folding. Please note that, in this context, co-translational protein folding is perceived as an enrichment of intermediate conformations of  the nascent chains, which resemble the native structure of the fully elongated protein.

In the model described in the paper, the authors opted for a probabilistic approach instead of an analytical (differential equation) approach: time is modelled through probabilities. They derived a formula to quantify the expected proportion of nascent chains of a given length that would be in a folded intermediate state (one that resembles the native structure), and they managed to express it in terms of the rate of codon translation. They therefore establish a direct relationship between co-translational protein folding and codon translation speed.
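To give a flavour of how time enters this kind of model, here is a generic competing-rates illustration of my own (it is not the expression derived in the paper). Suppose that, while the ribosome sits at codon position i, the nascent chain folds into a native-like intermediate with rate kF, and the next codon is translated with rate kA(i). If both waiting times are exponentially distributed, the probability that the chain folds before the next residue is added is

P_fold(i) = kF / (kF + kA(i))

so slowly translated codons (small kA(i)) enrich the folded intermediate at that position, while fast codons push the chain onwards before it can fold. The authors work with proper state-to-state probabilities along the whole chain, but this captures the qualitative interplay between folding rates and translation rates.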

Their analysis is robust, since none of the constants and kinetic rates needs to be experimentally determined for the model to provide insights about the protein folding process. Overall, I think the way the model was built is quite ingenious and very interesting. I would suggest that any interested reader look at the article to understand how the whole modelling was carried out.

Overall, I think the authors present a compelling argument for how cells could exploit codon degeneracy and co-translational aspects of protein folding to improve folding efficiency. One of their results presents a scenario in which fast-translating codons can be used to assist the folding of unstable protein regions, preventing the formation of misfolded intermediates.

One of the many functions of mathematical models is to provide insights into the underlying biology of the phenomena they attempt to model. The lack of any experimental evidence to support this paper’s results does not make it any less interesting. The article presents to the readers a sound and solid mathematical argument as to how co-translational aspects of protein folding could be beneficial for cell efficiency. If anything, they provide interesting hypotheses that might drive experimentalists in the future.

[1] O’Brien et al. Kinetic modelling indicates that fast-translating codons can coordinate cotranslational protein folding by avoiding misfolded intermediates.

A very long introductory post about protein structure prediction

If you are a protein informatician, bioinformatician, biochemist, biologist or simply a person well informed about science, you have probably heard about protein structure prediction. If that is the case, you might be wondering what all the fuss is about, right? If you have never heard these terms before, don’t panic! You are about to find out what protein structure prediction is all about!

Based on my group meeting’s presentation last Wednesday, this blog entry will discuss why protein structure prediction is important and the potential limitations of existing methods. I will also discuss how the quality of input may be a potential source for lack of accuracy in existing software.

First, let us remember a little biology: our genetic code encodes the inner workings of a complicated cellular machinery, tightly regulated by other (macro)molecules such as proteins and RNAs. These two types of macromolecules are the agents that carry out the set of instructions written in DNA. Basically, RNAs and proteins are involved in a series of processes that regulate cellular function and control how the genetic code is accessed and used.

For that reason, a huge chunk of genomic data is not that useful if considered on its own. Scientists around the globe have invested millions of moneys and a huge amount of time amassing piles and piles of genome sequencing data. To be fair, this whole “gotta sequence ’em all” mania did not provide us with the fundamental answers everybody was hoping for. Cracking the genetic code was like watching an episode of Lost: we were left with more questions than answers. We got a very complicated map that we can’t really understand just yet.

For that reason, I feel obliged to justify myself: protein structures ARE useful. If we know a protein structure, we can formulate a very educated guess about that protein’s function. Combine that with empirical data (e.g. where and when the protein is expressed) and it can help us unveil a lot of info about the protein’s role in cellular processes. Basically, it can answer some of the questions about the (genomic) map. If only we could do that with Lost…

There is also evidence that knowing a protein’s structure can help us design specific drugs to target and inhibit that protein. Although the evidence for such biomedical applications is still sparse, I believe that, as the field develops, protein structures will become more and more important in drug discovery protocols.

Still, if we look at the number of known genome sequences and known protein structures, and at how those figures have grown over the past decade, the scenario is drastic:

[Figure: Growth of Sequences vs Structures]


There is a tendency for the gap between the number of protein sequences and the number of protein structures to increase. Hence, we are getting more and more questions and very few answers. Observe how the green line (protein sequences associated with a known or predicted function) stays very close to the red line (the number of known protein structures), while the gap between the red line and the blue line (the total number of known protein sequences) keeps growing. Source: http://gorbi.irb.hr/en/method/growth-of-sequence-databases/

Well, gathering protein structure data is just as important as, if not more important than, gathering sequence data. This motivated the creation of Structural Genomics Consortia (SGC): facilities that specialize in solving protein structures.

I am sorry to tell you that this is all old news. We have known this for years. Nonetheless, the graph above hasn’t changed. Why? The cost limitations and the experimental difficulties associated with protein structure determination are holding us back. Solving protein structures in the lab is hard and time consuming and we are far from being as efficient at structure determination as we are at genome sequencing.

There is a possible solution to this problem: you start with a protein sequence (a sequential list of amino acids) and you try to predict its structure. This is known as protein structure prediction or protein structure modelling. Well, we have a limited number of building blocks (20) and a good understanding of their physicochemical properties, so it shouldn’t be that hard, right?

Unfortunately, modelling protein structure is not as simple as calculating how fast a block slides on an inclined plane. Predicting protein structure from sequence is a very hard problem indeed! It has troubled a plethora of minds throughout the past decades, making people lose many nights of sleep (I can vouch for that).

We can attribute that to two major limitations:

1- There are so many possible ways one can combine 20 “blocks” in a sequence of hundreds of amino acids, and each amino acid can also assume a limited range of conformations. We are looking at a massive combinatorial problem. The conformational space (the space of valid conformations a protein with a given sequence can assume) is so large that if you could check a single conformation every nanosecond, it would still take longer than the age of the universe to probe all possible conformations (a back-of-the-envelope calculation follows after this list).

2- Our physics (and our statistics) are inaccurate. We perform so many approximations in order to make the calculations feasible with current computers that we end up with very inaccurate models.
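To put limitation 1 into perspective, here is the back-of-the-envelope calculation promised above, written as a tiny C program in the same spirit as the code in the first post on this page. The numbers are illustrative assumptions (roughly three backbone conformations per residue, a 100-residue chain, one conformation checked every nanosecond), not exact figures:

#include <stdio.h>
#include <math.h>

int main(void)
{
	/* Illustrative assumptions: ~3 conformations per residue, 100 residues,
	   and one conformation checked every nanosecond */
	double conformations = pow(3.0, 100.0);            /* ~5e47 conformations      */
	double seconds = conformations / 1e9;              /* at 1e9 checks per second */
	double years = seconds / (365.25 * 24.0 * 3600.0);

	printf("Conformations to check: %.2e\n", conformations);
	printf("Time needed: %.2e years\n", years);        /* ~1.6e31 years            */
	printf("Age of the universe: ~1.4e10 years\n");
	return 0;
}

Even under these very optimistic assumptions, the exhaustive search would take on the order of 10^31 years, roughly twenty orders of magnitude longer than the age of the universe.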

Ok! So now you should know what protein structure prediction is, why it is important and, more importantly, why it is such a hard problem to solve. I am going to finish off by giving you a brief overview of the two most common approaches to protein structure prediction: template-based modelling (also known as homology modelling) and de novo structure prediction.

There is a general understanding that if two proteins have very similar sequences (namely, if they are homologs), then they will have similar structures. So we can use the known structures of homologs as templates to predict other structures. This is known as homology modelling.

One can do a lot of fancy talk to justify why this works. There is the evolutionary argument: “selective pressure acts on the phenotype level (which can encompass a protein structure) rather than the genotype level. Hence protein structures tend to be more conserved than sequence. For that reason and considering that sequence alone is enough to determine structure, similar sequences will have even more similar structures.”

One can also formulate some sort of physics argument: “a similar amino acid composition will lead to a similar behaviour of the interacting forces that keep the protein structure packed together. Furthermore, the energy minimum in which a certain protein structure sits is so stable that it would take quite a lot of changes in the sequence to disturb that minimum-energy conformation drastically.”

Probably the best argument in favour of homology modelling is that it works reasonably well. Of course, the accuracy of the models depends strongly on the sequence similarity, but for proteins with more than 40% sequence identity this method tends to produce good results.

This raises another issue: what if we can’t find a homolog with a known structure? How can we model our template-less protein sequence then? Well, it turns out that if we group proteins into families based on their sequence similarity, more than half of the families do not have a single member with known structure. [This figure was obtained by looking at how well the families in Pfam (a protein family database) are represented in the PDB (a protein structure database).]

Ergo, for a majority of cases we have to perform predictions from scratch (known as free modelling or de novo modelling).

Well, not necessarily from scratch. There is a specific approach to free modelling where we can build our models using existing knowledge. We can use chunks of protein, contiguous fragments extracted from known structures, to generate models. This is known as a fragment-based approach to de novo protein structure prediction. And that is one big name!
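To make the idea a bit more tangible, here is a deliberately toy sketch of fragment assembly, written in C in the same spirit as the code in the first post on this page. Everything in it is a stand-in: the protein is reduced to a list of (phi, psi) torsion angles, the “fragment library” is generated at random rather than extracted from known structures, and the scoring function is a placeholder that simply rewards helix-like angles. Real methods plug proper fragment libraries and knowledge-based potentials into exactly those two slots.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define LEN   60    /* residues in the toy protein   */
#define FLEN  9     /* fragment length (9-mers)      */
#define NFRAG 200   /* fragments in the toy library  */
#define STEPS 5000  /* Monte Carlo steps             */

typedef struct { double phi, psi; } Torsion;

static Torsion model[LEN];
static Torsion library[NFRAG][FLEN];

/* Placeholder score: a real method would use a knowledge-based or
   physics-based potential; this one just rewards helix-like angles */
static double score(const Torsion *m)
{
	double s = 0.0;
	int i;
	for (i = 0; i < LEN; i++)
		s += fabs(m[i].phi + 60.0) + fabs(m[i].psi + 45.0);
	return s; /* lower is better */
}

static double rand_angle(void)
{
	return -180.0 + 360.0 * rand() / RAND_MAX;
}

int main(void)
{
	int f, i, pos, step;
	double current, trial, temperature = 10.0;
	Torsion backup[FLEN];

	srand(42);

	/* Toy fragment library and a random starting model */
	for (f = 0; f < NFRAG; f++)
		for (i = 0; i < FLEN; i++) {
			library[f][i].phi = rand_angle();
			library[f][i].psi = rand_angle();
		}
	for (i = 0; i < LEN; i++) {
		model[i].phi = rand_angle();
		model[i].psi = rand_angle();
	}

	current = score(model);

	for (step = 0; step < STEPS; step++) {
		/* Pick a random fragment and a random insertion window */
		f = rand() % NFRAG;
		pos = rand() % (LEN - FLEN + 1);

		for (i = 0; i < FLEN; i++) {
			backup[i] = model[pos + i];
			model[pos + i] = library[f][i];
		}

		/* Metropolis criterion: accept improvements, occasionally accept
		   worse models to avoid getting trapped in local minima */
		trial = score(model);
		if (trial < current ||
		    exp((current - trial) / temperature) > (double)rand() / RAND_MAX)
			current = trial;
		else
			for (i = 0; i < FLEN; i++)
				model[pos + i] = backup[i]; /* reject: restore the window */
	}

	printf("Final toy score: %.2f\n", current);
	return 0;
}

The real difficulty, of course, hides in the two placeholders: picking fragments that are actually compatible with the target sequence and scoring the assembled models accurately, which is where the limitations discussed above come back to bite us.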

One can think of this as small-scale homology modelling, where both the physics and the evolutionary arguments should still hold true to some degree. And how do we do? Can we generate good models? We perform appallingly! Accuracies are too low to generate any useful knowledge in the majority of cases. The problem with the rare cases where you get it right is that you have no means of knowing whether you actually got the right answer.

The poor quality of the results can be explained by the two big limitations discussed above. Yet something else might be at play. In homology modelling, if you use a bad template, you will most certainly get a bad model. In a similar way, using a bad set of fragments will lead you to a very poor final model.

Considering we already have the other two big issues (the size of the conformational space and the accuracy of current potentials) to worry about, we should aim to use the best fragment library we possibly can. This has been the recent focus of my work: an attempt to make a small contribution towards solving such a hard problem.

I would love to detail my work on finding better fragments here, but I believe this post is already far too long for anyone to actually endure it and read it until the end. So, congratulations if you made it through!

Journal Club: Can Linear Programming (LP) be useful to us?

Linear programming (LP) is known as a fast and powerful computational technique. It has been applied to a wide range of problems in finance and economics, but it is not very popular among us bioinformaticians, computational biologists, and the like.

Source: http://hotmath.com/hotmath_help/topics/linear-programming.html

Linear Programming is all about finding the best feasible solution, one that optimises a linear objective while satisfying a series of constraints (usually represented by linear inequalities). Does that sound like a familiar kind of problem to the bioinformaticians and computational biologists out there?


This leaves room for some questions: can biological phenomena be modelled or simplified under the assumption of linearity? Furthermore, can LP be used to tackle the many difficult problems posed in our field? Perhaps an even bigger question: why would any of us use Linear Programming instead of another type of linear modelling? What are its advantages?

I will not go into the particulars of LP here. There is a plethora of material available online (Wikipedia and Wolfram are accessible starting points) that details Linear Programming. For those eager for something more substantial, V. Chvatal’s Linear Programming and Dantzig’s Linear Programming and Extensions are two good texts on the subject.

During this week’s journal club, I discussed an article that attempted to use Linear Programming to devise knowledge-based Docking Potentials (DP) tailored for transient protein-protein complexes. Transient complexes tend to be under-represented on the PDB, mostly due to the inherent difficulties of crystallizing such complexes. Hence, the development of knowledge-based potentials for these special cases of protein interaction is drastically hindered by a sample size limitation.

Source: Bizzarri AR, Brunori E, Bonanni B, Cannistraro S. Docking and molecular dynamics simulation of the Azurin–Cytochrome c551 electron transfer complex. J. Mol. Recognit. 2007; 20: 122–131

A cartoon representation of a transient complex between Azurin (cyan) and its partner Cytochrome C551 (dark blue) from Pseudomonas aeruginosa. Transient protein complexes are hard to crystallize, hence, under-represented on the PDB.


To offset this limitation, it would be ideal if one could extrapolate information from decoys (non-native conformations obtained from computational docking tools) in order to improve the docking potentials. Furthermore, in an ideal world, one would also address the bias introduced by homology/sequence similarity among the proteins in the available structures of transient complexes.

The author of the article “Designing coarse grained-and atom based-potentials for protein-protein docking – Tobi D. – BMC Structural Biology 2010, 10:40 doi:10.1186/1472-6807-10-40” claims that LP can address such issues by incorporating information from the decoys as linear constraints in the model. The article describes a linear problem in which the aim is to minimize the variance of the differences between the non-native and native energies, under the constraint that, for each complex, the native structure must have a lower energy than all of its non-native structures (lower being better in this case).

The energy is defined as a weighted sum of the counts of specific interaction types at the complex interface. In their work, the author employed two models: an atom-based model and a side-chain-based model. These models are used to group atoms into types and to simplify the calculations. Initially, they defined boolean (one-step) interactions: two atoms interact if they are within a cutoff distance of each other, where the cutoff varies according to the types of the atoms involved. This initial model led to an infeasible linear problem, and it was then replaced by a two-step model with strong and weak interactions and two sets of cutoffs (which leads to twice as many unknowns in the LP model).
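To make the setup a bit more concrete, here is a generic sketch of how the decoys enter such a linear programme. This is my own schematic rendering of the idea, using a standard slack-variable objective; it is not necessarily the exact formulation used in the paper. Writing the energy of a docked pose x as a weighted sum over interaction types t,

E(x) = sum over t of w_t * n_t(x),

where n_t(x) counts the interactions of type t at the interface and the weights w_t are the unknowns, one requires that, for every complex c and every decoy d of that complex,

E(d) - E(native_c) >= 1 - s(c,d),   with s(c,d) >= 0,

and minimizes the total slack, i.e. the sum of s(c,d) over all complexes and decoys. Each decoy contributes one linear constraint, which is exactly how information from computationally generated decoys is folded into the training of the potential; the slack variables keep the problem solvable even when, as the author found for the initial one-step model, not every native structure can be ranked below all of its decoys.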

Well, does it work? How does it fare against other existing knowledge-based DPs?

Source: Designing coarse grained-and atom based-potentials for protein-protein docking – Tobi D. – BMC Structural Biology 2010, 10:40 doi:10.1186/1472-6807-10-40

Despite the lack of brilliant results or of any striking improvement over the state of the art, the potentials described in the article do seem to slightly outperform ZDOCK2.3’s scoring functions.

This may actually speak in favour of the applicability of LP to problems in our area. In the case presented during the journal club, an LP approach produced comparable results to more conventional techniques.

Perhaps the best answer to “why should I use LP?” is that it is an unconventional, creative solution. It is also very fast and therefore easy to try out, depending on your problem. Science is all about experimentation, after all. Why not try a different technique if you have the means to?

Image Source: http://www.designthenewbusiness.com/blog/documenting/thinking-inside-the-box.html

The moral of the story: it is good to think outside the box, as long as you keep your feet on the ground.


Check the article discussed in the post here.