Monthly Archives: May 2014

Protein Folding: Man vs Machine

In 1996 Gary Kasparov, the reigning world chess champion, played IBM’s Deep blue, a computer whose sole purpose was to play chess better than any human. Losing the first match, Gary sprung back swiftly defeating Deep Blue 4-2 over the remaining matches. However, his success was short lived. In a rematch with an updated Deep Blue the following year, the score was 3.5-2.5 to the computer. The media (and IBM) declared this as a pivotal moment in history, where a machine had proven itself better than humanities champion at a game deemed a highly intellectual pursuit. The outcry was that the age of machines had arrived. Was it true? Should humanity have surrendered to machine overloads at that moment? Obviously the answer is a large and resounding no. However, this competition allows for insightful comparison between the manner in which humans and computers play chess and think. By comparing the two, we learn the strengths and weaknesses of both parties from which we can make combined approaches that may exceed either.

Firstly, lets discuss the manner in which a computer “plays” chess. They simply search all possible configurations of moves that are available and pick the most optimal. However, things are not that simple. Consider only the opening sequence, there are 20 possible moves a player can make, so after only a single move by each player there is 400 possible chess positions. This count grows exponentially fast, after 5 moves by each player there is approximately 5 million combinations. For example, it was estimated that Deep Blue could analyse 2 million positions per second. However, since this is not nearly fast enough to examine all possible games from start to end in a reasonable time scale, computers cannot foresee lines of plays which are far in the distance. To overcome this, in the early game the computer will use a reference table developed by grandmasters that list both common openings and the assumed best manner to respond to them. Obviously, these are only assumed as optimal and have never been completely tested. In short, machines participate through a brute force, utilising their intricate ability to perform calculations at high speed to find the best move. However, the search is too large in the initial and end stages of a game to be completely thorough, a reference table is instead used to “inform” of the correct move at these times.

takahashi03

While a human can quite easily see that the following board leads to a draw, computers cannot draw the same conclusion without huge effort.

In contrast, human players use far more visual and spatial recognition alongside both memory and calculation to pick their moves. Like a computer, a player will analyse a portion of the moves available at any given moment. Though since a human cannot compare on computation speed to that of a computer, they cannot analyse nearly the same magnitude of moves. Hence, this subset of moves chosen for analysis must contain the most optimal move(s) to compete against the computer’s raw power. This is where the visual and spatial recognition abilities of humans come to bare. Firstly, a human can easily dissect the board into pieces worth considering and those to be ignored. For example, consider a possible move that would result in your queen being exposed and then taken. A human would conclude this as bad (normally) and discard further moves leading from such a play. A computer, however, would explore the resultant board state. One can see how this immediately and drastically reduces the required search. Another human ability is that a player will often be able to able to see sub-structures within a full set-up that are common in the game and hence can be processed in a known manner. In other words, the game is broken down into fragments which can be processed far easier and with less computation. Obviously, both of the above techniques rely on prior knowledge of chess to be useful, but they based upon our human ability to perceive both the substructure of the game and the overall picture with relative ease.

So how does all this chess talk relate to protein folding? In 2010, the Baker group and creators of the ROSETTA protein fold prediction program produced the protein folding game “Foldit”. In Foldit the general public could attempt to fold proteins for themselves and try to get closer to the native structure than the computer algorithms. Obviously, simplified in presentation to that of academic structural biology, it was hoped that the visual and spatial reasoning abilities of humans, the same ones that differentiated them from machines at chess, would prove useful in protein structure prediction. A key issue within ROSETTA drove this train of thought, the fact that is is relatively bad at exploring fully the confirmation space. Often, it will get stuck in the one general configuration and not explore the fold space fully. Furthermore, due to the size of configuration space, this is not easily overcome with simulated annealing due to the sheer scale of the problem. The ability of humans to view the overall picture meant that it should be easier for them to see other possible configurations. As end goals for Foldit, it was hoped that structures that proved unsolvable by current algorithms would be solved by humans and also that new techniques would emerge as “moves” employed by players to achieve high scores could be studied.

To make a comparison of the structures produced by Foldit players and ROSETTA viable, the underlying energy “scores” that judge a structure is the same between the programs. It is assumed, though is not always true, that the better the score the closer you are to the native fold. In addition,  Foldit players were also able to use a set of optimisation tools that were deterministic and would alter the backbone and side chains to the most optimal local configuration to the arrangement the player would make. This meant that players could focus predominantly on altering the overall structure of the protein rather than the fine detail, such as the position of sidec hains. To make the game as approachable as possible, technical terms were replaced by common analogues and visual cues where displayed to highlight poor scoring areas of the protein. For example, clashes between atoms are shown via large spiked red orbs, while the backbone is coloured from green to red depending on how well buried the hydrophobic residues on that segment are. To drive players, gamification elements were also included such as leader boards and rewarding “fireworks” as graphical effects.

To objectively compare the ability of the player base to that of the ROSETTA algorithm, they performed blind predictions on a set of 10 proteins whose structure were not in the public domain. This was run in a similar manner to CASP for those familiar with that set-up. The results exemplified the innate human ability of visual and spatial recognition. In 5 of the cases the playerbase performed significantly better than the ROSETTA program. In 3 of the cases they performed similar. And in the remaining 2 cases the ROSETTA algorithm performed better, though in both of these the model produced was still extremely far from the native structure. Looking through the cases individually, it was identified that the most crucial element used by players was that they were able to deal with large rearrangements that ROSETTA struggled to deal with, including register shifts and strand swapping. This highlights the ability of humans to view the overall picture and to persevere through “bad scoring patches” to reach a more optimal configuration.

foldit_protein

Comparison of foldit player’s solutions (green) to ROSETTA’s solutions (red) and the native 2KPO protein structure (blue). The players correctly identified a strand swap needed to reach the native form, while this large reconfiguration was not seen by ROSETTA.

Since the release of the game and the accompanying paper in 2010, Foldit has received much praise in conveying the field of protein folding in an approachable manner to so many people. In addition, the player base has contributed to science as whole. In 2011 the player base successfully solved the structure of a M-PMV protein, a retrovirus whose structure was unobtainable via normal means. Then in 2012, by analysing the common set of moves employed by the player base, they collectively produced an algorithm that outperforms previously published fold prediction methods. Personally, I think of Foldit as a fun and relative intuitive game that introduces the core elements of the protein folding problem. As to its scientific merit, I’m unsure as to how much impact it will continue to have. As Saulo discussed last week, if infinite monkeys have infinite time then Shakespeare will be reproduced. Likewise, if enough people manipulate a protein structure, eventually the best structure will be found. Though who am I to judge, if people find the game fun, then there are far worse past-times one can have than trying to solve structures. As a finishing note I would be extremely interested in using Foldit to teach structural biology in the future, though feel it is overall too simple for a university setting.

de novo Protein Structure Prediction software: an elegant “monkey with a typewriter”

In this week’s OPIG group meeting, I discussed the inner-works and the algorithm behind ROSETTA, one of the most well-known software for de novo protein structure prediction.

Before we even attempt to understand how ROSETTA works, let us start with a theorem.

Theorem: given an infinite number of monkeys with typewriters and an infinite amount of time, they are very likely to recreate the works of William Shakespeare.

Monkey with a typewriter… Time to write that Shakespeare!

Well, let us be a little more modest and attempt to recreate just a phrase of old Bill, instead of his whole works:

“The fool doth think he is wise, but the wise man knows himself to be a fool.”

Well, if we exclude spaces and punctuation marks, that leaves us 58 positions in our phrase (the length of the quote). Considering we have 26 possible letters for each position, we would expect to generate this phrase at random once every of 26^58 times. Wow!

That means that we need to evolve from monkeys (pun intended) and appeal to our over-developed encephalon!

In order to steer our Monkey typewriter, we can reduce this problem to a Global Optimisation problem. In a Global Optimisation problem, we define a function f (named an objective function) which we want to minimise for a given set of parameters x. Bare in mind that if we want to maximise a given function fwe can define g = -f 

In a global optimisation problem, we are interested in finding the values of X that minimise the function f(X).

Now, all we need is to define an objective function in order to guide our Monkey typewriter towards the right answer.

Let us define the following objective function: given our Shakespearean phrase and a sequence of 58 letters, the value of the objective function equals the number of letters that are different between the phrase and the sequence of letters.

We can now proceed to define a slightly more refined Monkey Typewriter:

1- Start with a random sequence of letters.
2- WHILE sequence != shakespearean_phrase:
3-________ Select a random position in the sequence.
4-________ Assign a new letter to that position.
5-________ IF score of new sequence < score of old sequence:
6-__________________ Accept the change.
7-________ ELSE:
8-__________________ Discard the change.

This way we can steer our Monkeys and reduce the time it would take to generate our Shakespearean phrase to a more feasible time.

Now, let’s talk about protein structure prediction (PSP). More specifically, let us talk about de novo protein structure prediction (different flavours of protein structure prediction have been discussed previously here).

One of the great ideas behind the creators of ROSETTA, was to use a combination of two different techniques to address the big problems of protein structure prediction:

1- Problem number #1 of PSP is the size of the conformational space. A protein can be represented by it’s backbone atoms, which, in turn, can be reconstructed from a sequence of torsion angles. A set of 3 torsion angles can be used to represent every protein residue. Therefore, for a protein with 100 residues, we would have a total of 300 angles. If we approximate each angle to assume one of 360 values (degrees), that gives us 360^300 possible conformations (not huge at all, han?).

One of the main ideas behind ROSETTA was to reduce the search space by using fragments extracted from known structures. The use of fragments restricts the possible angles to a set of values that are known to occur in nature. Therefore, instead of looking at 360^300 possible angles, we deal with a much more feasible search space.

The name ROSETTA is based on the Rosetta Stone, an archaeological artefact that allowed modern civilisation to interpret and convert between different alphabets. In reality, ROSETTA can be seen as a very elegant monkey typewriter. ROSETTA uses sequence and structure similarity to define a structural alphabet. For every single position in our protein sequence, we have a set of fragments extracted from now protein structures to represent that position.  Originally, each position would be represented by 25 fragments (letters?). If you combine the different pieces of known structures in the right order, you will get your Shakespearean Phrase in the end (the correct Protein Structure!).

2- Well, we still have a pretty big conformational space considering we have 25 fragments per position (approximately 25^100 possible conformations, for a protein with 100 residues). The second technique employed by ROSETTA is Simulated Annealing.

Simulated Annealing is a Global Optimisation heuristic. It attempts to find a good enough solution to the problem of minimising a given function f. It is very similar to our Monkey Typewriter algorithm. The main difference is that Simulated Annealing implements some tricks to avoid local minima entrapment. In simpler terms, if we ONLY accept favourable changes (Line 5 of Monkey Typewriter pseudo-code), once we reach a local minimum, we get trapped. No possible change would lead to an improvement, yet we are still far from finding the global minimum.

In order to mitigate that entrapment effect, Simulated Annealing defines a probability of accepting an unfavourable change. This probability is higher at the beginning of the simulation and it becomes lower and lower as the simulation progresses. This process is usually referred to as “cooling down”.

Ok! So we reduced our PSP problem to an elegant Monkey Typewriter. We have our Monkeys working to create the best possible Shakespeare, in a pretty clever and sophisticated manner. Well, we should be able to create some fine piece of literature, correct?

Not quite!

There are still several problems with this whole pipeline. I will mention a few:

  • When you define your structural alphabet, you may not have the right fragment to represent a certain position. This would be the same as trying to get to a Shakespearean phrase without using vowels for the first 10 letters or only using consonants in the middle of the sentence. It would never happen…
  • Despite the many efforts to define a very good objective function, no current software presents a function that truly mimics the behaviour of an energy function. This implies that we have a vague idea of how the Shakespearean phrase should look like, but we cannot precisely pinpoint where each letter goes.
  • No matter how elegant our Monkey typewriter becomes, the combinatorial problem still persists. We are still dealing with 25^100 possible conformations and it is impossible to try every single conformation.
  • The objective function, if plotted in a graph, would look completely hideous (unlike the picture above). We are talking about a gigantic multi-dimensional surface, filled with local minima that confuse and entrap our simulations. Combine that with the fact that our objective function is not accurate and you waste most of your computing power into generating solutions that are completely useless.
  • Another common technique to address the previous limitations is to increase the number of Monkeys in order to speed up the search process. If you use thousands and thousands of Monkeys (multiple runs of ROSETTA), each individual Monkey will get to a local minimum (decoy = something that looks like a phrase). In recent years, tens of thousands of decoys are generated in order to predict a single structure. A new problem arises, because out of these tens of thousands of phrases, we cannot tell apart Hamlet from Twilight. We don’t know which Monkeys got close to the right answer. All we know is that for some cases some of them did.

In conclusion, de novo Protein Structure Prediction still has a long way to go.

MAMMOTH: a case study in protein structure alignment

I’ve talked about protein structure alignment before in the context of a rather novel, mathematical approach. This time I wanted to revisit the topic in a general sense, using a more established algorithm as a case study. MAMMOTH stands for Matching Molecular Models Obtained from Theory and was first published in 2002. Since then it has been cited nearly 400 times and the underlying algorithm has been extended to a multiple alignment program: MAMMOTH-mult.

Establishing biologically relevant and consistent alignments between protein structures is one of the major unsolved problems in computational bioinformatics. However, it’s an important part of many challenges that we face: such as establishing homology between distantly related proteins, functional inference for unannotated proteins, and evaluating the accuracy of models of predicted structure for competitions such as CASP.

Problem Outline

In essence the problem of protein structure alignment can be outlined by considering two ordered sets of coordinates, A = {a1,a2,…,an} and B = {b1,b2,…,bm}, representing points in 3D space. In most cases these points will be the location of the Cα atoms along each structure’s backbone. The sets A and B might be completely different lengths and, if an alignment exists, are almost certainly orientated differently compared to each other.

coordinates

Establishing an alignment between these sets is equivalent to two steps:

  1. Establish a match M = {(ai,bj) | ai ∈ A, bj ∈ B}
  2. Rotate and translate A onto B so that equivalent atoms are as close as possible.

Of course, it is not immediately clear how to best fulfill these requirements. In particular, we don’t really know what features to prioritise for a biologically relevant match. Should we try to match secondary structure elements and what penalty should we attach to mismatched elements? How about maintaining the correct hydrogen bonding patterns between matched residues? And how much weight should we put on the matched atoms being consecutive in each set (i.e. how should we penalise gaps)?

The second step is equally ambiguous. Especially as there is no consensus on what the correct interpretation of close is. Minimising the RMSD between equivalent atoms is a popular choice of distance measure. However, as the MAMMOTH paper points out, RMSD is often dominated by the mismatched portions of remotely related structures and is thus largely inappropriate in these cases. Furthermore, even if we have a well-defined distance metric, should the superposition prioritise minimising the distances between nearly identical parts of the different structures, at the expense of less similar substructures? Or should the emphasis be on maintaining as lengthy a match as possible at the possible cost of a lower closeness of fit? How about the relative importance of a close fit for atoms in the core of the structure vs. those on the surface?

The majority of these questions remain unanswered and as a result it is often hard to validate alignments as we simply do not know what the right answer is. In fact, in many cases, manual analysis is preferred over any of the available computational techniques.

In this post I’ll go through how the MAMMOTH algorithm approaches each of these steps. For many of the above questions MAMMOTH does not postulate a solution, primarily because, as its name suggests, it was designed to assess prediction models which are often at low resolutions and lacking secondary structure or hydrogen bonding information. I think it’s important to keep these questions in mind, but there’s certainly no necessity to design a programme which deals with them all.

Step 1: Pairing up residues (similarity matrix)

In order to establish a match between equivalent atoms in A and B, MAMMOTH, like several other structural alignment algorithms, uses a well-established alignment technique: a similarity matrix (often inferred from and referenced as a distance matrix). A similarity matrix for alignment is an n x m array where each entry S(ai,bj) represents the pairwise similarity score between residues ai and bj. An alignment between the residues of A and B is any non-decreasing path (that is, a pair (ai,bj) in the path must appear later in the ordering of coordinates of both A and B than the preceding pair of residues in the path) from the top left corner of the array (a1,b1) to the bottom right corner (an,bm). For example the following path can be interpreted as an alignment between A = {a1, …, a11} and B = {b1, …, b8}

similarity_matrix

Any alignment can be scored by summing up the similarity scores along this path, while penalising any gaps in an appropriate way (normally, these algorithms use trial and error to decide on sensible penalties). For example, the above alignment would have the score S = S(a1,b1) + S(a2,b2) + S(a3,b3) + S(a7,b4) + S(a8,b5) + S(a9,b6) + S(a10,b7) + S(a11,b8) + α + 2β, where α and β are gap opening and gap extension penalties respectively. The optimal alignment is simply the alignment which maximises this score.

For sequence alignments similarity scores can be assigned to residues from substitution tables like BLOSUM. However, it is not immediately clear of an appropriate equivalent for structures. MAMMOTH, like several other algorithms, defines the similarity between different residues by examining their local structural landscape. Specifically, this means comparing fragments of each backbone, centred on the residue of interest. MAMMOTH uses the URMS distance between heptapeptide fragments. This distance is illustrated below using 2D chains and tripeptide fragments.

urms

Comparing residues a2 and b3 involves looking at the directions between each successive residue of the fragment. Each vector is mapped to the unit sphere, beginning at the origin and ending at the surface of the sphere (in this case 2 vectors are considered, and in MAMMOTH’s case 6 different 3D vectors are mapped). The optimal rotation is found, superposing equivalent vectors as best as possible, and then the RMSD of the endpoints on the surface of the sphere is calculated as URMS(ai,bj).

Aside: The optimal superposition of a set of vectors is actually a non-trivial problem. It is essentially equivalent to step 2 in our alignment protocol outlined above, but is significantly easier for the 6 vectors characterising a fragment in MAMMOTH’s algorithm.

Finally, S(ai,bj) is calculated by converting the distance into a similarity measure:

similarity

where URMSR is the expected URMS of a random set of vectors and:

delta

The optimal alignment through this MAMMOTH matrix is the path which maximises the sum of similarities between matched residues (each residue being at the centre of the heptapeptide fragment) using gap opening and extension penalties of 7.00 and 0.45 respectively.

Step 2: Global superposition (MaxSub)

The above alignment results in a match M’ optimising the local structural similarity of residues in each structure, however, their is no guarantee that this will result in a set of coordinates close in global space. In order to finalise the match set M ⊆ M’ as well as calculating the optimal superposition of the paired residues of A onto their equivalent points in B, MAMMOTH use the MaxSub algorithm. This is a very efficient algorithm (worth a read if you’re that way inclined) for calculating the maximal subset from a set of paired up atoms which are close in global space. MAMMOTH decide that close means < 4A away after superposition. They do not try to optimise a closer superposition than that but attempt to find the largest possible set of matched residues.

The MaxSub algorithm relies on the assumption (made for computational tractability) that the final subset M ⊆ M’ will have, somewhere, a set of at least four residues consecutive in M’. The algorithm then starts with every possible seed of four consecutive residues (just to illustrate the power of the assumption in reducing computational time: for a 150 residue protein there are just 147 such seeds, but over 2 million sets of four non-consecutive residues!! And it’s a pretty reasonable assumption to make as well). The MaxSub algorithm then calculates the superposition for those four matched pairs, extending the set of residues that are <4A away from their partners, recalculating the superposition using these new pairs as well, then removing any pairs which are no longer within the threshold of each other. It repeats these steps, gradually extending the set M, until the algorithm converges.

Scoring the alignment

Using the two approaches outlined above, MAMMOTH generates an alignment between the two input structures. In order to summarise the significance of this alignment, the algorithm generates the PSI score: the percentage structural identity (which is simply the size of the maximum subset divided by the length of the shortest protein). As a global measure of the strength of similarity the PSI score is poorly constructed and scales with protein length. In order to adjust for this bias, MAMMOTH fits a Gumbel distribution to PSI scores obtained from random structure comparisons between unrelated proteins at bins of different lengths. This results in a z-score measuring, instead of the PSI of an alignment, the likelihood of obtaining a PSI score as good as that by chance between any two proteins of the same lengths.

Distance matrix clustering

In Bioinformatics, we often deal with distance matrices such as:

  • Quantifying pairwise similarities between sequences
  • Structural similarity between proteins (RMSD?)
  • etc.

Next step is to study the groupings within the distance matrix using an appropriate clustering scheme. The obvious issue with most clustering methods is that you would need to specify the number of clusters beforehand (as for K-Means). Assuming that you do not know very much about the data and ‘plotting’ it is not an option, you might try non-parametric hierarchichal clustering such as linkage. The main difference between the two approaches is that using linkage you specify what the maximal distance within each cluster should be and thus the number of clusters will be adjusted accordingly. Par contre, using K-Means you do not have such a distance-guarantee within each cluster since the number of groups is predefined.

Here I will provide a short piece of python code that employs the hcluster library to perform linkage clustering.

Installation

Download hcluster, unpack it and inside the unpacked folder type:

python setup.py install

Alternatively, if you’re not an admin on your machine type:

python setup.py install --user

 Example Code

The purpose of the example bit of code is to generate a random set of points within (0,10) in the 2D space and cluster them according to user’s euclidean distance cutoff.

import matplotlib.pyplot as plt
from matplotlib.pyplot import show
from hcluster import pdist, linkage, dendrogram
import numpy
import random
import sys

#Input: z= linkage matrix, treshold = the treshold to split, n=distance matrix size
def split_into_clusters(link_mat,thresh,n):
   c_ts=n
   clusters={}
   for row in link_mat:
      if row[2] < thresh:
          n_1=int(row[0])
          n_2=int(row[1])

          if n_1 >= n:
             link_1=clusters[n_1]
             del(clusters[n_1])
          else:
             link_1= [n_1]

          if n_2 >= n:
             link_2=clusters[n_2]
             del(clusters[n_2])
          else:
             link_2= [n_2]

	  link_1.extend(link_2)
          clusters[c_ts] = link_1
          c_ts+=1
      else:
          return clusters

#Size of the point matrix
rows = 10 #number of points
columns = 2 #number of dimensions - 2=2D, 3=3D etc.
samples = numpy.empty((rows, columns))

#Initialize a random points matrix with values between 0, 10 (all points in the upper right 0,10 quadrant)
for i in xrange(0,rows):
    for j in xrange(0,columns):
       samples[i][j] = random.randint(0,10)

#Show the points we have randomly generated
print "Samples:\n ", samples

#Create the distance matrix for the array of sample vectors.
#Look up 'squareform' if you want to submit your own distance matrices as they need to be translated into reduced matrices
dist_mat = pdist(samples,'euclidean')

#Perform linkage clustering - here we use 'average', but other options are possible which you can explore here: http://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.cluster.hierarchy.linkage.html
z = linkage(dist_mat, method='average',metric='euclidean')

#Specify a cutoff that will define the clustering - command line argument: 
#python ClusterExample.py 3.0
cutoff = float(sys.argv[1])
clustering = split_into_clusters(z,cutoff,rows)
if clustering==None:
	print "No clusters! Most likely your distance cutoff is too high - all falls into the same bag!"
	quit()

#Print the potential singletons - in magenta
for i in xrange(0,rows):
	plt.plot(samples[i][0],samples[i][1], marker='o', color='m', ls='')
	plt.text(samples[i][0],samples[i][1], str(i), fontsize=12)
#If there are more clusters than these the code will fail!
colors = ['b','r','g','y','c','k']

cluster_num = 0
for cluster in clustering:
   print "Cluster: ",cluster_num
   cluster_num+=1
   for i in clustering[cluster]:
	print "-->",i
	plt.plot(samples[i][0],samples[i][1], marker='o', color=colors[cluster_num], ls='')

#Set the axis limits
plt.xlim([-1,12])
plt.ylim([-1,12])

show()
#Alternatively plot it as dendogram to see where your distance cutoff performed the tree cut
dendrogram(z)
show()

When I ran the code above (python [whatever you call the script].py 2.0) that’s what I got (colors correspond to clusters with ‘magenta’ being singletons):

figure_1

And there is a dendogram command on the bottom of the script to see what the clustering has actually done and where it performed the cut according to your specified cutoff (colors DO NOT correspond to clusters here):

figure_2

 

Hcluster library forms part of scipy with very useful methods for data analysis. You can modify the above code to use a variety of other hierarchichal clustering methods which you can  further explore here.

Freezing python code

Many of us in the group use python as our primary programming language. It is in my opinion an awesome language for lots of reasons. However what happens when you write an application and want to share it with the world? Simply distributing the source code requires a great deal of configuration by the end user. I’m sure you’ve all been there, you have version 1.5.1 they use version 1.6.3. However to download and install this breaks every other bit of code you are using. Creating virtual environments can help towards this, but then do you really want to go towards all the hassle of this for every application you want to use? In the end I have given up trying on a number of projects, which is a fate you would never want for your own code!

From my point of view there are three ways of counteracting this issue.

  1. Make limited use of libraries and imports
  2. Have incredibly clear instructions on how to set up the virtual env
  3. Freeze your code!

The first solution is sometimes just not possible or desirable. For example if you want to use a web framework or connect to third party database engines. The second could be massively time consuming and it is virtually impossible to cover all bases. For example, RDKit, my favourite cheminformatics package, has a lengthy install process with platform specific quirks and many of its own dependencies.

In my project I opted for solution number three. I use PyInstaller however there are many others available (cx_freeze, py2apppy2exe). I used PyInstaller because my application uses the Django project and they offer extra Django support. Also PyInstaller is cross-platform, allowing me (in theory) to package applications for Windows, Mac and Linux using the same protocol.

Here I will briefly outline how to set freeze your code using PyInstaller. This application validates a smiles string and shows you the RDKit canonical form of the smiles string.
This is the structure of the code:

src/
  main.py
  module/
    __init__.py
    functions.py
build/
dist/

main.py is:

import sys
from module.functions import my_fun
if len(sys.argv) > 1:
  smiles = sys.argv[1]
  print my_fun(smiles)
else:
  print "No smiles string requested for validation"

functions.py is:

from rdkit import Chem
def my_fun(smiles):
  mol = Chem.MolFromSmiles(smiles)
  if mol is None:
    return "Invalid smiles"
  else:
    return "Valid smiles IN:  " + smiles + "  OUT: " + Chem.MolToSmiles(mol,isomericSmiles=True) 

  1. Download and install PyInstaller 
  2. Type the following (assuming main.py is your python script)
  3. pyinstaller src\main.py --name frozen.exe --onefile
      This will produce a the following directory structure:

      src/
        main.py
        module/
          functions.py
      build/
        frozen/
      dist/
        frozen.exe
      frozen.spec
      

      frozen.spec is a file containing the options for building the application:

      a = Analysis(['src\\main.py'],
                   pathex=['P:\\PATH\\TO\\HEAD'],
                   hiddenimports=[],
                   hookspath=None,
                   runtime_hooks=None)
      pyz = PYZ(a.pure)
      exe = EXE(pyz,
                a.scripts,
                a.binaries,
                a.zipfiles,
                a.datas,
                name='frozen.exe',
                debug=False,
                strip=None,
                upx=True,
                console=True )
      

      “build” contains files used in the building of the executable

      “dist” contains the executable that you can distribute freely around. Because I used the “–onefile” option above it creates one single .exe file. This makes the file very easy to ship – HOWEVER for large programmes this isn’t totally ideal. All the dependencies are compressed into the .exe and uncompressed into a temporary folder at runtime. If there are lots of files, this process can be VERY slow.

      So now we can run the program:

      dist/frozen.exe c1ccccc1 

      Running dist/frozen.exe returns the error: ImportError: numpy.core.multiarray failed to import
      This is because the RDKit uses this module and it is not packaged up in the frozen code. The easiest way to resolve this is to include this import in main.py:

      from rdkit import Chem
      import numpy
      import sys
      from module.functions import my_fun
      if len(sys.argv) > 1:
        smiles = sys.argv[1]
        print my_fun(smiles)
      else:
        print "No smiles string requested for validation"
      

      And there you have it. “frozen.exe” can be passed around to anyone using windows (in this case) and will work on their box.

      Obviously this is a very simple application. However I have used this to package Django applications, using Tornado web servers and with multiple complex dependencies to produce native windows desktop applications. It works! Any questions, post below!

OPIG Algorithm Club Problem #1: The 3n+1 problem

In the first meeting of the OPIG Algotirhm Club, we tackled the problem 3n+1 from the Sphere Online Judge (SPOJ).

 —Background —

The problem relates to the Collatz conjecture. Before describing the conjecture, let us define a very simple recursive relation:

  • If a number is odd, multiply it by 3 and add 1 (hence 3n+1).
  • If a number is even, divide it by 2.

The Collatz conjecture states that for any given integer, you can repeat this recursion indefinetely and you will always reach the number 1.

The 3n+1 problem requires yet another concept which is the cycle length (i.e. the number of times you have to repeat the recursion for a given number until you reach 1).

— Goal —

The aim of the problem is: given 2 integers and j, find the number with the highest cycle length that lies in the interval comprised between the 2 integers (them included).

— Important Issues —

Two things are worth mentioning before attempting to implement this problem:

  •  could be either greater than or lesser than j.
  • Despite the fact that the description of the problem states that all operations can be performed using int, there are certain inputs between 100,000 and 1,000,000 that violate that statement. 

— Implementation —

A simple solution can be implemented by using a recursive function to compute the cycle length:

int Cycle_Length (unsigned int n) /* Note the unsigned precision */
{
    	unsigned int aux;

        /* Stopping Condition - Don't want the recursion to run for forever */
        if (n == 1) return 1;

        /* Here is the recursion, if n is odd  else     if n is even   */
	    aux = (n%2) ? (1 + Cycle_Length(3*n+1) ) : (1+Cycle_Length(n>>1));
        /* Note that division by two is performed by shifitng a bit to the right */
	    return aux;
}

Now, there are two ways we can optimise this function and make it more efficient.

The first one relates to a certain property of odd numbers: we know that if n is odd, than 3*n+1 is going to be even. Therefore we can simply skip one iteration of the cycle:

aux = (n%2) ? (2+Cycle_Length((3*n+1)>>1)) : (1+Cycle_Length(n>>1)) ;

We can further optimise our code by using Dynamic Programming. We can store the Cycle Lengths we already computed so we don’t need to compute them ever again.

Using dynamic programming and the tricks above, we can get to the final solution (which runs pretty fast on the Online Judge platform):

#include<stdio.h>
#include<stdlib.h>

int CycleLength[100001];

int Cycle_Length (unsigned int n)
{
	int aux;
	if(n>100000)
	{
		aux = (n%2) ? (2+Cycle_Length((3*n+1)>>1)) : (1+Cycle_Length(n>>1)) ;
		return aux;
	}
	if(!CycleLength[n]) 	
		CycleLength[n] = (n%2) ? (2+Cycle_Length((3*n+1)>>1)) : (1+Cycle_Length(n>>1)) ;
	return CycleLength[n];
}

int findmax(int a, int b)
{
	int max=0,aux,i;
	for(i=a;i<=b;i++)
	{
		aux=Cycle_Length(i);
		if(max<aux)
			max=aux;
	}
	return max;	
}

int main()
{
	int a,b;
	CycleLength[1]=1;
	while(scanf("%d %d",&a,&b)!=EOF)
		a<b?printf("%d %d %d\n",a,b,findmax(a,b)):printf("%d %d %d\n",a,b,findmax(b,a));
	return 0;
}

Ta daam!

Introduction to the protein folding problem

Recently (read: this week), I had to give a presentation on my research to my college. We were informed that the audience would be non-specialist, which in fact turned out to be an understatement. For example, my presentation followed on from a discussion of the differences in the education systems of North and South Korea for the period 1949-1960. Luckily, I had tailored my entire talk to be understandable by all and devoid of all Jargon. I choose to deviate from the prescribed topic and Instead of talking about my research specifically, I choose to discuss the protein folding problem in general. Below you’ll find the script I wrote, which I feel gives a good introduction to the core problem of this field.

———————-

The protein folding problem is one of the great projects within the life sciences. Studied by vast numbers of great scientists over the last half century, with backgrounds including chemistry, physics, maths and biology, all were beaten by the sheer complexity of the problem. As a community we still have only scraped the surface with regards to solving it. While I could, like many of you here, go into my own research in great detail and bore you wholeheartedly for the next 10 minutes with technical details and cryptic terminology, I will instead try to give an overview of the problem and why thousands of scientists around the globe are still working on cracking this.

First of all, I guess that a few of you are trying to remind yourself what a protein is; the horror that is high school biology crawling back from that area in your brain you keep for traumatic experiences like family gatherings. Luckily, I’m fairly new to the topic myself, my background being in physics and chemistry, so hopefully my explanation will be still in the naive terms that I use to explain the core concepts to myself. Proteins are the micro-machines of your body, the cogs that keep the wheels turning, the screws that hold the pieces together and the pieces themselves. Proteins run nearly all aspects of your body and biochemistry, your immune system, your digestion and your heart beating. There are approximately between 20 to 30 thousand different proteins in your body, depending on who you ask, and trillions overall. In fact, if we take every protein in our body and scale it up to the size of a penny, the proteins in a single human, albeit a rather dead human, would be enough to fill the entire pacific ocean. Basically, there is a hell of a lot of proteins, with a vast range of different types, each of which is very individual, both in its compositions and function, and, crucially, they are nearly all essential. The loss of any protein can lead to dramatic consequences including heart disease, cancer, and even death.

So now that you know that they are important and there are lots of them, what exactly is a protein? The easiest analogy I have is that of a pearl necklace, a long string of beads in a chain. Now consider your significant other has gone slightly insane and instead of purchasing jewelry for you that consists of a single bead type, or even two if you have slightly exotic tastes, they have been shopping at one of the jewellery stores found in the part of town that smells rather “herby”. You receive a necklace which has different beads across the entire length of the necklace. We have blues, yellows, pinks, and so on and so forth. In fact, we have 20 different types of beads, each with its own colour. This is basically a protein chain: each of the beads represents one of the twenty essential amino acids, each of which has its own chemical and physical properties. Now suppose you can string those pearls together in any order: red, green, blue, blue, pink etc. It turns out that the specific order that these beads are arranged along the length of the protein chain define exactly how this chain “crumples” into a 3D shape. If you think that adding an extra dimension is impossible, just consider crumpling a piece of paper; that is 2D –> 3D transition (mathematicians please bite your tongue). Now one string of colours, blue, blue, pink for example, will crunch into one shape, and that shape may become your muscle, while a different sequence, say, green, blue, orange, will crumple down into something different, for example an antibody to patrol your blood stream.

So essentially we have this “genetic code”, the sequence of amino acids (or beads), which in turn defines the shape that the protein will take. We in fact know that it is this shape that is the most important aspect of any protein, this having been found to define the protein’s actual function. This is because, returning to the bead analogy, we can change up to 80% of the beads to different colour while still retaining the same shape and function. This is amazing when you consider how many other objects can have their baseline composition changed to the same extent while still retaining the same function. The humble sausage is one of those objects (actually below 40% meat content they are referred to as “bangers”), but even then would you want 80% of you sausage to be filler. There is a reason Tesco value sausages taste so different to the nice ones you buy at the butchers. Returning to proteins, we are not trying to say that the sequence isn’t important, sometimes changing just a single bead can lead to a completely different shape. Instead it says that the shape is the critical aspect which defines the function. To summarise, sequence leads to shape, which in turn leads to function.

This is unfortunate, because while it is getting increasingly simple to experimentally determine the sequence of a protein, that is the exact order of coloured beads, the cost and time of getting the corresponding structure (shape) is still extremely prohibitive. In fact, we can look at two of the major respective databases, the PDB, which contains all known proteins structures, and GENBANK, which contains all known protein sequences, and compare the respective number of entries. The disparity between the two is huge, we are talking orders of magnitude huge, 10^15 huge, i.e the number of humans on the planet squared huge. AND this gap is growing larger every year. Basically, people in the last few years have suddenly gained access to cheap and fast tools to get a protein’s sequence, to the extent that people are widely taking scoops of water across the world and sequencing everything, not even bothering to separate the cells and microorganisms beforehand. Nothing analogous exists to get the structure of a protein. The process taking months to years, depending on many factors, each of which may be “something” for one protein and then a completely different “something” for a similar protein. This has led to a scenario where we know the sequence of every protein in the human genome, yet we only know the structure of only about 10% of them. This is utterly preposterous in my opinion given how important this information is to us! We basically don’t know what 90% of our DNA does!

Basically, until an analogous method for structure determination is produced, we have no choice but to turn to predictive methods to suggest the function of proteins that we do not have the structure for. This is important as it allows us, to some degree, target proteins that we “think” may have an important effect. If we didn’t do this, we simply would be searching for needles in haystacks. This is where my research, and that of my group, kicks in. We attempt to take these sequences, these string of beads, and predict the shape that they produce. Unfortunately, the scientific community as a whole still relatively sucks at this. Currently, we are only successful in predicting structures for very small proteins, and when anything more complex is attempted, we, in general, fail utterly miserably. In my opinion, this is because the human body is by far the most complex system on the planet and so far we have tried to simply supplant physics on top of the problem. This has failed miserably due to the sheer complexity and multitude of factors involved. Physics has mostly nice vacuums and pleasant equations, however ask a physicist about a many body system and they will cry. So many factors are involved that we must integrate them altogether which is why there are so many people are working on this, and will be for many years to come. Well, I guess that’s good news for my future academic career.

Anyway, I hope this talk has given you some degree of insight into the work I do and you have learned something about how your body works. For those extremely interested, please feel free to approach me later and I will happily regale you with the exact aspect of protein folding I work on. But for now I would love to try and answer any questions you all have on the content contained in this talk.