# de novo Protein Structure Prediction software: an elegant “monkey with a typewriter”

In this week’s OPIG group meeting, I discussed the inner workings of ROSETTA, one of the best-known software packages for de novo protein structure prediction, and the algorithm behind it.

Before we even attempt to understand how ROSETTA works, let us start with a theorem.

Theorem: given an infinite number of monkeys with typewriters and an infinite amount of time, they are very likely to recreate the works of William Shakespeare.

Monkey with a typewriter… Time to write that Shakespeare!

Well, let us be a little more modest and attempt to recreate just a phrase of old Bill, instead of his whole works:

# “The fool doth think he is wise, but the wise man knows himself to be a fool.”

Well, if we exclude spaces and punctuation marks, that leaves us with 58 positions in our phrase (the length of the quote). Considering we have 26 possible letters for each position, we would expect to generate this phrase at random only once in every 26^58 attempts. Wow!
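To get a feeling for how big 26^58 really is, here is a quick back-of-the-envelope check in Python. It is only a sketch: it assumes we ignore case and count nothing but the letters a to z, as described above.

```python
import math

phrase = "The fool doth think he is wise, but the wise man knows himself to be a fool."

# Keep only the letters: spaces and punctuation marks are excluded.
letters = [c for c in phrase.lower() if c.isalpha()]
n_positions = len(letters)

# 26 possible letters at each position.
n_sequences = 26 ** n_positions

print(n_positions)                                # 58
print(f"about 10^{math.log10(n_sequences):.0f}")  # about 10^82
```

In other words, a purely random Monkey would need on the order of 10^82 attempts.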

That means that we need to evolve from monkeys (pun intended) and appeal to our over-developed encephalon!

In order to steer our Monkey typewriter, we can reduce this problem to a Global Optimisation problem. In a Global Optimisation problem, we define a function f (named an objective function) which we want to minimise for a given set of parameters x. Bear in mind that if we want to maximise a given function f, we can define g = -f and minimise g instead.

In other words, we are interested in finding the values of x that minimise the function f(x).

Now, all we need is to define an objective function in order to guide our Monkey typewriter towards the right answer.

Let us define the following objective function: given our Shakespearean phrase and a sequence of 58 letters, the value of the objective function equals the number of letters that are different between the phrase and the sequence of letters.
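As a minimal sketch, this objective function could be written in Python as follows. The function name `score` and the choice of plain strings for both arguments are mine, picked only for illustration.

```python
def score(candidate: str, target: str) -> int:
    """Number of positions at which candidate and target differ."""
    if len(candidate) != len(target):
        raise ValueError("candidate and target must have the same length")
    # Compare letter by letter and count the mismatches.
    return sum(a != b for a, b in zip(candidate, target))
```

A score of 0 means we have typed the phrase exactly, and the worst possible score equals the length of the phrase.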

We can now proceed to define a slightly more refined Monkey Typewriter:

1- Start with a random sequence of 58 letters.
2- WHILE sequence != shakespearean_phrase:
3-________ Select a random position in the sequence.
4-________ Assign a new letter to that position.
5-________ IF score of new sequence < score of old sequence:
6-__________________ Accept the change.
7-________ ELSE:
8-__________________ Reject the change (keep the old sequence).

This way we can steer our Monkeys and reduce the time it would take to generate our Shakespearean phrase to a more feasible time.
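The refined Monkey Typewriter can be sketched in Python along these lines. This is an illustration rather than a reference implementation: the function names, the fixed random seed and the lowercase-only alphabet are all assumptions of mine.

```python
import random
import string

PHRASE = "The fool doth think he is wise, but the wise man knows himself to be a fool."
TARGET = "".join(c for c in PHRASE.lower() if c.isalpha())  # letters only


def score(candidate, target=TARGET):
    """Number of positions where candidate differs from target."""
    return sum(a != b for a, b in zip(candidate, target))


def monkey_typewriter(target=TARGET, seed=0):
    rng = random.Random(seed)
    # Start from a completely random sequence of the right length.
    sequence = [rng.choice(string.ascii_lowercase) for _ in target]
    current = score(sequence, target)
    steps = 0
    while current > 0:
        # Select a random position and assign a new letter to it.
        position = rng.randrange(len(sequence))
        old_letter = sequence[position]
        sequence[position] = rng.choice(string.ascii_lowercase)
        new = score(sequence, target)
        if new < current:
            current = new                    # accept the change
        else:
            sequence[position] = old_letter  # reject the change
        steps += 1
    return "".join(sequence), steps


result, steps = monkey_typewriter()
print(result == TARGET, steps)
```

On a rough estimate, this needs a few thousand steps rather than something close to 26^58, because every accepted change is kept instead of being thrown away with the next attempt.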

Now, let’s talk about protein structure prediction (PSP). More specifically, let us talk about de novo protein structure prediction (different flavours of protein structure prediction have been discussed previously here).

One of the great ideas of ROSETTA’s creators was to combine two different techniques to address the big problems of protein structure prediction:

1- Problem #1 of PSP is the size of the conformational space. A protein can be represented by its backbone atoms, which, in turn, can be reconstructed from a sequence of torsion angles. A set of 3 torsion angles can be used to represent every protein residue. Therefore, for a protein with 100 residues, we would have a total of 300 angles. If we approximate each angle to assume one of 360 values (one per degree), that gives us 360^300 possible conformations (not huge at all, huh?).

One of the main ideas behind ROSETTA was to reduce the search space by using fragments extracted from known structures. The use of fragments restricts the possible angles to a set of values that are known to occur in nature. Therefore, instead of looking at 360^300 possible conformations, we deal with a much more feasible search space.

The name ROSETTA is based on the Rosetta Stone, an archaeological artefact that allowed modern civilisation to interpret and convert between different alphabets. In reality, ROSETTA can be seen as a very elegant monkey typewriter. ROSETTA uses sequence and structure similarity to define a structural alphabet. For every single position in our protein sequence, we have a set of fragments extracted from known protein structures to represent that position. Originally, each position would be represented by 25 fragments (letters?). If you combine the different pieces of known structures in the right order, you will get your Shakespearean Phrase in the end (the correct Protein Structure!).
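To make the "structural alphabet" idea a bit more concrete, here is a toy sketch of a fragment library and a fragment insertion move. It is emphatically not ROSETTA's code or data format: the fragment length of 9 residues, the function names and the randomly generated angles (standing in for angles taken from known structures) are all assumptions made for illustration.

```python
import random

N_RESIDUES = 100
FRAGMENTS_PER_POSITION = 25  # the number quoted in the text
FRAGMENT_LENGTH = 9          # assumption made for this sketch


def placeholder_fragment(rng, length):
    """Stand-in for a fragment from a known structure:
    one (phi, psi, omega) torsion triple per residue."""
    return [(rng.uniform(-180, 180), rng.uniform(-180, 180), 180.0)
            for _ in range(length)]


def build_library(rng):
    """One list of candidate fragments per window start position."""
    n_windows = N_RESIDUES - FRAGMENT_LENGTH + 1
    return [[placeholder_fragment(rng, FRAGMENT_LENGTH)
             for _ in range(FRAGMENTS_PER_POSITION)]
            for _ in range(n_windows)]


def insert_fragment(conformation, library, rng):
    """Return a copy of the conformation in which one window of torsion
    angles is overwritten by a randomly chosen fragment for that window."""
    start = rng.randrange(len(library))
    fragment = rng.choice(library[start])
    new_conformation = list(conformation)
    new_conformation[start:start + len(fragment)] = fragment
    return new_conformation


rng = random.Random(0)
library = build_library(rng)
extended = [(-150.0, 150.0, 180.0)] * N_RESIDUES  # arbitrary starting angles
moved = insert_fragment(extended, library, rng)
```

Each move swaps in a whole "letter" at once, so the Monkey only ever types angle combinations that appear in the library.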

2- Well, we still have a pretty big conformational space considering we have 25 fragments per position (approximately 25^100 possible conformations, for a protein with 100 residues). The second technique employed by ROSETTA is Simulated Annealing.
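To put the two search-space sizes quoted so far side by side, here is a quick calculation. It works with logarithms so the numbers stay readable, and it takes the counts straight from the text (100 residues, 3 angles per residue, 360 values per angle, 25 fragments per position).

```python
import math

# Full torsion-angle space: 360 values for each of 300 angles.
log10_full = 300 * math.log10(360)

# Fragment-based space: roughly 25 choices for each of 100 positions.
log10_fragments = 100 * math.log10(25)

print(f"360^300 is about 10^{log10_full:.0f}")      # 10^767
print(f"25^100 is about 10^{log10_fragments:.0f}")  # 10^140
```

Fragments shrink the problem by hundreds of orders of magnitude, and yet 10^140 is still far too large to enumerate.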

Simulated Annealing is a Global Optimisation heuristic. It attempts to find a good enough solution to the problem of minimising a given function f. It is very similar to our Monkey Typewriter algorithm. The main difference is that Simulated Annealing implements some tricks to avoid local minima entrapment. In simpler terms, if we ONLY accept favourable changes (Line 5 of Monkey Typewriter pseudo-code), once we reach a local minimum, we get trapped. No possible change would lead to an improvement, yet we are still far from finding the global minimum.

In order to mitigate that entrapment effect, Simulated Annealing defines a probability of accepting an unfavourable change. This probability is higher at the beginning of the simulation and it becomes lower and lower as the simulation progresses. This process is usually referred to as “cooling down”.
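Here is a minimal sketch of that idea in Python. It assumes the commonly used Metropolis acceptance rule (accept an unfavourable change of size delta with probability exp(-delta / T)) and a geometric cooling schedule. Both are choices made for this illustration and not necessarily what ROSETTA does, and all names and default parameter values are hypothetical.

```python
import math
import random


def accept(delta, temperature, rng):
    """Always accept improvements; accept an unfavourable change
    with probability exp(-delta / temperature)."""
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / temperature)


def simulated_annealing(f, x0, propose, t_start=1.0, t_end=1e-3,
                        n_steps=10_000, seed=0):
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_fx = x, fx
    for step in range(n_steps):
        # "Cooling down": the temperature decays from t_start to t_end.
        fraction = step / max(n_steps - 1, 1)
        temperature = t_start * (t_end / t_start) ** fraction
        candidate = propose(x, rng)
        f_candidate = f(candidate)
        if accept(f_candidate - fx, temperature, rng):
            x, fx = candidate, f_candidate
            if fx < best_fx:
                best_x, best_fx = x, fx
    return best_x, best_fx


# Usage on a bumpy one-dimensional function with several local minima.
def bumpy(x):
    return x * x + 10 * math.sin(3 * x)


def small_step(x, rng):
    return x + rng.uniform(-0.5, 0.5)


best_x, best_value = simulated_annealing(bumpy, x0=4.0, propose=small_step)
print(best_x, best_value)
```

At high temperature almost any change is accepted, so the search can climb out of a local minimum. As the temperature drops, the rule behaves more and more like Line 5 of the Monkey Typewriter.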

Ok! So we reduced our PSP problem to an elegant Monkey Typewriter. We have our Monkeys working to create the best possible Shakespeare, in a pretty clever and sophisticated manner. Well, we should be able to create some fine piece of literature, correct?

Not quite!

There are still several problems with this whole pipeline. I will mention a few:

• When you define your structural alphabet, you may not have the right fragment to represent a certain position. This would be the same as trying to get to a Shakespearean phrase without using vowels for the first 10 letters or only using consonants in the middle of the sentence. It would never happen…
• Despite the many efforts to define a very good objective function, no current software has a function that truly mimics the behaviour of the real energy function. This implies that we have a vague idea of what the Shakespearean phrase should look like, but we cannot precisely pinpoint where each letter goes.
• No matter how elegant our Monkey typewriter becomes, the combinatorial problem still persists. We are still dealing with 25^100 possible conformations and it is impossible to try every single conformation.
• The objective function, if plotted in a graph, would look completely hideous (unlike the picture above). We are talking about a gigantic multi-dimensional surface, filled with local minima that confuse and entrap our simulations. Combine that with the fact that our objective function is not accurate and you waste most of your computing power on generating solutions that are completely useless.
• A common technique to address the previous limitations is to increase the number of Monkeys in order to speed up the search process. If you use thousands and thousands of Monkeys (multiple runs of ROSETTA), each individual Monkey will get to a local minimum (decoy = something that looks like a phrase). In recent years, it has become common to generate tens of thousands of decoys in order to predict a single structure. A new problem arises, because out of these tens of thousands of phrases, we cannot tell Hamlet apart from Twilight. We don’t know which Monkeys got close to the right answer. All we know is that in some cases some of them did.

In conclusion, de novo Protein Structure Prediction still has a long way to go.