de novo Protein Structure Prediction software: an elegant “monkey with a typewriter”

In this week’s OPIG group meeting, I discussed the inner workings of the algorithm behind ROSETTA, one of the best-known software packages for de novo protein structure prediction.

Before we even attempt to understand how ROSETTA works, let us start with a theorem.

Theorem: given an infinite number of monkeys with typewriters and an infinite amount of time, they will almost surely recreate the works of William Shakespeare.

Monkey with a typewriter… Time to write that Shakespeare!

Well, let us be a little more modest and attempt to recreate just a phrase of old Bill, instead of his whole works:

“The fool doth think he is wise, but the wise man knows himself to be a fool.”

Well, if we exclude spaces and punctuation marks, that leaves us with 58 positions in our phrase (the length of the quote). Considering we have 26 possible letters for each position, we would expect to generate this phrase at random roughly once in every 26^58 attempts. Wow!

That means that we need to evolve from monkeys (pun intended) and appeal to our over-developed encephalon!

In order to steer our Monkey typewriter, we can reduce this problem to a Global Optimisation problem. In a Global Optimisation problem, we define a function f (named an objective function) which we want to minimise for a given set of parameters x. Bear in mind that if we want to maximise a given function f, we can simply minimise g = -f instead.

In other words, we are interested in finding the values of x that minimise the function f(x).

Now, all we need is to define an objective function in order to guide our Monkey typewriter towards the right answer.

Let us define the following objective function: given our Shakespearean phrase and a sequence of 58 letters, the value of the objective function equals the number of letters that are different between the phrase and the sequence of letters.

We can now proceed to define a slightly more refined Monkey Typewriter:

1- Start with a random sequence of letters.
2- WHILE sequence != shakespearean_phrase:
3-     Select a random position in the sequence.
4-     Assign a new letter to that position.
5-     IF score of new sequence < score of old sequence:
6-         Accept the change.
7-     ELSE:
8-         Discard the change.

This way we can steer our Monkeys and reduce the time it takes to generate our Shakespearean phrase to something feasible.
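To make this concrete, here is a minimal Python sketch of the steered Monkey Typewriter described in the pseudo-code above (the random seed is an arbitrary choice, and the target string is simply the quote with spaces and punctuation stripped):

import random
import string

# The target phrase, stripped of spaces and punctuation (58 letters).
TARGET = "thefooldoththinkheiswisebutthewisemanknowshimselftobeafool"

def score(seq):
    # Objective function: number of positions differing from the target phrase.
    return sum(1 for a, b in zip(seq, TARGET) if a != b)

def monkey_typewriter(seed=0):
    random.seed(seed)
    # 1- Start with a random sequence of letters.
    seq = [random.choice(string.ascii_lowercase) for _ in TARGET]
    steps = 0
    # 2- Keep mutating until the Shakespearean phrase is recovered.
    while score(seq) > 0:
        steps += 1
        pos = random.randrange(len(seq))                      # 3- random position
        new_seq = list(seq)
        new_seq[pos] = random.choice(string.ascii_lowercase)  # 4- new letter
        if score(new_seq) < score(seq):                       # 5/6- accept improvements
            seq = new_seq                                     # 7/8- otherwise discard
    return "".join(seq), steps

phrase, steps = monkey_typewriter()
print(phrase, "reached in", steps, "mutations")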

Now, let’s talk about protein structure prediction (PSP). More specifically, let us talk about de novo protein structure prediction (different flavours of protein structure prediction have been discussed previously here).

One of the great ideas of the creators of ROSETTA was to use a combination of two different techniques to address the big problems of protein structure prediction:

1- Problem #1 of PSP is the size of the conformational space. A protein can be represented by its backbone atoms, which, in turn, can be reconstructed from a sequence of torsion angles. A set of 3 torsion angles can be used to represent every protein residue. Therefore, for a protein with 100 residues, we would have a total of 300 angles. If we approximate each angle to assume one of 360 values (degrees), that gives us 360^300 possible conformations (not huge at all, huh?).

One of the main ideas behind ROSETTA was to reduce the search space by using fragments extracted from known structures. The use of fragments restricts the possible angles to a set of values that are known to occur in nature. Therefore, instead of looking at 360^300 possible angles, we deal with a much more feasible search space.

The name ROSETTA is based on the Rosetta Stone, an archaeological artefact that allowed modern scholars to translate between different scripts. In reality, ROSETTA can be seen as a very elegant monkey typewriter. ROSETTA uses sequence and structure similarity to define a structural alphabet. For every single position in our protein sequence, we have a set of fragments extracted from known protein structures to represent that position. Originally, each position would be represented by 25 fragments (letters?). If you combine the different pieces of known structures in the right order, you will get your Shakespearean Phrase in the end (the correct Protein Structure!).

2- Well, we still have a pretty big conformational space considering we have 25 fragments per position (approximately 25^100 possible conformations, for a protein with 100 residues). The second technique employed by ROSETTA is Simulated Annealing.

Simulated Annealing is a Global Optimisation heuristic. It attempts to find a good enough solution to the problem of minimising a given function f. It is very similar to our Monkey Typewriter algorithm. The main difference is that Simulated Annealing implements some tricks to avoid local minima entrapment. In simpler terms, if we ONLY accept favourable changes (Line 5 of Monkey Typewriter pseudo-code), once we reach a local minimum, we get trapped. No possible change would lead to an improvement, yet we are still far from finding the global minimum.

In order to mitigate that entrapment effect, Simulated Annealing defines a probability of accepting an unfavourable change. This probability is higher at the beginning of the simulation and it becomes lower and lower as the simulation progresses. This process is usually referred to as “cooling down”.
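As a rough illustration (this is not ROSETTA's actual move set, energy function or cooling schedule), the acceptance rule and a simple geometric cooling schedule can be sketched in Python like this:

import math
import random

def accept(delta, temperature):
    # Favourable changes (delta <= 0) are always accepted; unfavourable ones
    # are accepted with probability exp(-delta / T), which shrinks as the
    # temperature is lowered ("cooling down").
    if delta <= 0:
        return True
    return random.random() < math.exp(-delta / temperature)

def simulated_annealing(initial, propose, score, t_start=10.0, t_end=0.01, steps=10000):
    # Generic annealing loop: `propose` suggests a modified solution and
    # `score` is the objective function to minimise.
    current, current_score = initial, score(initial)
    for i in range(steps):
        # Geometric cooling schedule from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (i / (steps - 1))
        candidate = propose(current)
        delta = score(candidate) - current_score
        if accept(delta, t):
            current, current_score = candidate, current_score + delta
    return current, current_score

# For the Monkey Typewriter, `propose` would mutate one random letter and
# `score` would count the mismatches with the target phrase.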

Ok! So we reduced our PSP problem to an elegant Monkey Typewriter. We have our Monkeys working to create the best possible Shakespeare, in a pretty clever and sophisticated manner. Well, we should be able to create some fine piece of literature, correct?

Not quite!

There are still several problems with this whole pipeline. I will mention a few:

  • When you define your structural alphabet, you may not have the right fragment to represent a certain position. This would be the same as trying to get to a Shakespearean phrase without using vowels for the first 10 letters or only using consonants in the middle of the sentence. It would never happen…
  • Despite the many efforts to define a very good objective function, no current software presents a function that truly mimics the behaviour of an energy function. This implies that we have a vague idea of what the Shakespearean phrase should look like, but we cannot precisely pinpoint where each letter goes.
  • No matter how elegant our Monkey typewriter becomes, the combinatorial problem still persists. We are still dealing with 25^100 possible conformations and it is impossible to try every single conformation.
  • The objective function, if plotted as a graph, would look completely hideous. We are talking about a gigantic multi-dimensional surface, filled with local minima that confuse and entrap our simulations. Combine that with the fact that our objective function is not accurate, and you waste most of your computing power on generating solutions that are completely useless.
  • Another common technique to address the previous limitations is to increase the number of Monkeys in order to speed up the search process. If you use thousands and thousands of Monkeys (multiple runs of ROSETTA), each individual Monkey will get to a local minimum (decoy = something that looks like a phrase). In recent years, tens of thousands of decoys are generated in order to predict a single structure. A new problem arises, because out of these tens of thousands of phrases, we cannot tell apart Hamlet from Twilight. We don’t know which Monkeys got close to the right answer. All we know is that for some cases some of them did.

In conclusion, de novo Protein Structure Prediction still has a long way to go.

MAMMOTH: a case study in protein structure alignment

I’ve talked about protein structure alignment before in the context of a rather novel, mathematical approach. This time I wanted to revisit the topic in a general sense, using a more established algorithm as a case study. MAMMOTH stands for Matching Molecular Models Obtained from Theory and was first published in 2002. Since then it has been cited nearly 400 times and the underlying algorithm has been extended to a multiple alignment program: MAMMOTH-mult.

Establishing biologically relevant and consistent alignments between protein structures is one of the major unsolved problems in computational bioinformatics. However, it’s an important part of many challenges that we face, such as establishing homology between distantly related proteins, functional inference for unannotated proteins, and evaluating the accuracy of models of predicted structure for competitions such as CASP.

Problem Outline

In essence, the problem of protein structure alignment can be outlined by considering two ordered sets of coordinates, A = {a1,a2,…,an} and B = {b1,b2,…,bm}, representing points in 3D space. In most cases these points will be the locations of the Cα atoms along each structure’s backbone. The sets A and B might be of completely different lengths and, if an alignment exists, are almost certainly orientated differently relative to each other.

[Figure: the two sets of coordinates, A and B, as points in 3D space]

Establishing an alignment between these sets is equivalent to two steps:

  1. Establish a match M = {(ai,bj) | ai ∈ A, bj ∈ B}
  2. Rotate and translate A onto B so that equivalent atoms are as close as possible.

Of course, it is not immediately clear how to best fulfill these requirements. In particular, we don’t really know what features to prioritise for a biologically relevant match. Should we try to match secondary structure elements and what penalty should we attach to mismatched elements? How about maintaining the correct hydrogen bonding patterns between matched residues? And how much weight should we put on the matched atoms being consecutive in each set (i.e. how should we penalise gaps)?

The second step is equally ambiguous, especially as there is no consensus on what the correct interpretation of ‘close’ is. Minimising the RMSD between equivalent atoms is a popular choice of distance measure. However, as the MAMMOTH paper points out, RMSD is often dominated by the mismatched portions of remotely related structures and is thus largely inappropriate in these cases. Furthermore, even if we have a well-defined distance metric, should the superposition prioritise minimising the distances between nearly identical parts of the different structures, at the expense of less similar substructures? Or should the emphasis be on maintaining as lengthy a match as possible at the possible cost of a lower closeness of fit? How about the relative importance of a close fit for atoms in the core of the structure vs. those on the surface?

The majority of these questions remain unanswered and as a result it is often hard to validate alignments as we simply do not know what the right answer is. In fact, in many cases, manual analysis is preferred over any of the available computational techniques.

In this post I’ll go through how the MAMMOTH algorithm approaches each of these steps. For many of the above questions MAMMOTH does not postulate a solution, primarily because, as its name suggests, it was designed to assess prediction models which are often at low resolutions and lacking secondary structure or hydrogen bonding information. I think it’s important to keep these questions in mind, but there’s certainly no necessity to design a programme which deals with them all.

Step 1: Pairing up residues (similarity matrix)

In order to establish a match between equivalent atoms in A and B, MAMMOTH, like several other structural alignment algorithms, uses a well-established alignment technique: a similarity matrix (often inferred from and referenced as a distance matrix). A similarity matrix for alignment is an n x m array where each entry S(ai,bj) represents the pairwise similarity score between residues ai and bj. An alignment between the residues of A and B is any non-decreasing path (that is, a pair (ai,bj) in the path must appear later in the ordering of coordinates of both A and B than the preceding pair of residues in the path) from the top left corner of the array (a1,b1) to the bottom right corner (an,bm). For example the following path can be interpreted as an alignment between A = {a1, …, a11} and B = {b1, …, b8}

[Figure: an example alignment path through the similarity matrix]

Any alignment can be scored by summing up the similarity scores along this path, while penalising any gaps in an appropriate way (normally, these algorithms use trial and error to decide on sensible penalties). For example, the above alignment would have the score S = S(a1,b1) + S(a2,b2) + S(a3,b3) + S(a7,b4) + S(a8,b5) + S(a9,b6) + S(a10,b7) + S(a11,b8) + α + 2β, where α and β are gap opening and gap extension penalties respectively. The optimal alignment is simply the alignment which maximises this score.
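To illustrate how an optimal path can actually be found, here is a minimal dynamic-programming sketch in Python. For simplicity it uses a single linear gap penalty, whereas MAMMOTH-style scoring uses separate gap-opening and gap-extension penalties:

import numpy as np

def optimal_alignment_score(S, gap=1.0):
    # S is an n x m similarity matrix; F[i, j] holds the best score of any
    # alignment of the first i residues of A with the first j residues of B.
    n, m = S.shape
    F = np.zeros((n + 1, m + 1))
    F[:, 0] = -gap * np.arange(n + 1)   # aligning a prefix of A against nothing
    F[0, :] = -gap * np.arange(m + 1)   # aligning a prefix of B against nothing
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i, j] = max(F[i - 1, j - 1] + S[i - 1, j - 1],  # match a_i with b_j
                          F[i - 1, j] - gap,                  # gap in B
                          F[i, j - 1] - gap)                  # gap in A
    return F[n, m]

# Toy usage: a random 11 x 8 similarity matrix, as in the example path above.
S = np.random.rand(11, 8)
print(optimal_alignment_score(S))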

For sequence alignments, similarity scores can be assigned to residues from substitution tables like BLOSUM. However, it is not immediately clear what an appropriate equivalent for structures would be. MAMMOTH, like several other algorithms, defines the similarity between different residues by examining their local structural landscape. Specifically, this means comparing fragments of each backbone, centred on the residue of interest. MAMMOTH uses the URMS distance between heptapeptide fragments. This distance is illustrated below using 2D chains and tripeptide fragments.

[Figure: the URMS distance illustrated with 2D chains and tripeptide fragments]

Comparing residues a2 and b3 involves looking at the directions between each successive residue of the fragment. Each vector is mapped to the unit sphere, beginning at the origin and ending at the surface of the sphere (in this case 2 vectors are considered, and in MAMMOTH’s case 6 different 3D vectors are mapped). The optimal rotation is found, superposing equivalent vectors as best as possible, and then the RMSD of the endpoints on the surface of the sphere is calculated as URMS(ai,bj).

Aside: The optimal superposition of a set of vectors is actually a non-trivial problem. It is essentially equivalent to step 2 in our alignment protocol outlined above, but is significantly easier for the 6 vectors characterising a fragment in MAMMOTH’s algorithm.
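For the unit-vector case, the optimal rotation can be obtained with the standard SVD-based (Kabsch-type) solution. The sketch below is only meant to illustrate the idea behind the URMS distance; the fragment handling here is my own simplification, not MAMMOTH's implementation (which uses heptapeptide fragments, i.e. 6 unit vectors per fragment):

import numpy as np

def urms(frag_a, frag_b):
    # frag_a and frag_b are (k x 3) arrays of consecutive C-alpha coordinates
    # for two fragments of equal length k.
    def unit_vectors(frag):
        d = np.diff(frag, axis=0)                  # successive C-alpha -> C-alpha vectors
        return d / np.linalg.norm(d, axis=1, keepdims=True)

    u, v = unit_vectors(frag_a), unit_vectors(frag_b)
    # Kabsch-style SVD solution for the rotation minimising the RMSD between
    # the two sets of unit vectors (no translation needed, since every vector
    # starts at the origin).
    u_mat, _, vt = np.linalg.svd(u.T @ v)
    d = np.sign(np.linalg.det(vt.T @ u_mat.T))     # avoid improper rotations (reflections)
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u_mat.T
    diff = u @ rot.T - v
    return np.sqrt((diff ** 2).sum() / len(u))

# Toy usage: two random 4-residue fragments (3 unit vectors each).
a = np.cumsum(np.random.randn(4, 3), axis=0)
b = np.cumsum(np.random.randn(4, 3), axis=0)
print(urms(a, b))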

Finally, S(ai,bj) is calculated by converting the distance into a similarity measure:

[Equation: the similarity score S(ai,bj) as a function of URMS(ai,bj) and URMSR]

where URMSR is the expected URMS of a random set of vectors and:

[Equation: the definition of the delta term used in the similarity measure]

The optimal alignment through this MAMMOTH matrix is the path which maximises the sum of similarities between matched residues (each residue being at the centre of the heptapeptide fragment) using gap opening and extension penalties of 7.00 and 0.45 respectively.

Step 2: Global superposition (MaxSub)

The above alignment results in a match M’ optimising the local structural similarity of residues in each structure; however, there is no guarantee that this will result in a set of coordinates close in global space. In order to finalise the match set M ⊆ M’, as well as calculating the optimal superposition of the paired residues of A onto their equivalent points in B, MAMMOTH uses the MaxSub algorithm. This is a very efficient algorithm (worth a read if you’re that way inclined) for calculating the maximal subset from a set of paired-up atoms which are close in global space. MAMMOTH decides that ‘close’ means < 4 Å away after superposition. It does not try to optimise a closer superposition than that but attempts to find the largest possible set of matched residues.

The MaxSub algorithm relies on the assumption (made for computational tractability) that the final subset M ⊆ M’ will have, somewhere, a set of at least four residues consecutive in M’. The algorithm then starts with every possible seed of four consecutive residues (just to illustrate the power of the assumption in reducing computational time: for a 150 residue protein there are just 147 such seeds, but over 2 million sets of four non-consecutive residues!! And it’s a pretty reasonable assumption to make as well). The MaxSub algorithm then calculates the superposition for those four matched pairs, extending the set of residues that are <4A away from their partners, recalculating the superposition using these new pairs as well, then removing any pairs which are no longer within the threshold of each other. It repeats these steps, gradually extending the set M, until the algorithm converges.
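A rough sketch of this kind of seed-and-extend search is given below. This is not the published MaxSub implementation (which has its own details and optimisations), just an illustration of the superpose-extend-prune loop described above; the coordinate arrays A and B and the initial match list `pairs` are assumed to come from Step 1:

import numpy as np

def superpose(P, Q):
    # Least-squares superposition of paired coordinates P onto Q (Kabsch with
    # centring). Returns rotation R and translation t so that P @ R.T + t ~ Q.
    pc, qc = P.mean(axis=0), Q.mean(axis=0)
    u_mat, _, vt = np.linalg.svd((P - pc).T @ (Q - qc))
    d = np.sign(np.linalg.det(vt.T @ u_mat.T))
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u_mat.T
    return rot, qc - pc @ rot.T

def maxsub_like(A, B, pairs, threshold=4.0, seed_len=4, max_iter=20):
    # Try every seed of `seed_len` consecutive matched pairs, iteratively grow
    # the subset of pairs within `threshold` Angstroms after superposition,
    # and keep the largest subset found over all seeds.
    pairs = list(pairs)
    idx_a = [i for i, _ in pairs]
    idx_b = [j for _, j in pairs]
    best = []
    for s in range(len(pairs) - seed_len + 1):
        subset = pairs[s:s + seed_len]
        for _ in range(max_iter):
            sa, sb = zip(*subset)
            rot, t = superpose(A[list(sa)], B[list(sb)])
            moved = A[idx_a] @ rot.T + t
            dists = np.linalg.norm(moved - B[idx_b], axis=1)
            new_subset = [p for p, dist in zip(pairs, dists) if dist < threshold]
            converged = (new_subset == subset)
            subset = new_subset
            if converged or len(subset) < seed_len:
                break
        if len(subset) > len(best):
            best = subset
    return best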

Scoring the alignment

Using the two approaches outlined above, MAMMOTH generates an alignment between the two input structures. In order to summarise the significance of this alignment, the algorithm generates the PSI score: the percentage structural identity (which is simply the size of the maximal subset divided by the length of the shorter protein). As a global measure of the strength of similarity, the raw PSI score is a poor choice on its own, as it scales with protein length. In order to adjust for this bias, MAMMOTH fits a Gumbel distribution to PSI scores obtained from random structure comparisons between unrelated proteins, binned by length. This results in a z-score measuring, instead of the PSI of an alignment, the likelihood of obtaining a PSI score at least as good by chance between any two proteins of the same lengths.
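The idea of calibrating a raw score against random comparisons can be sketched as follows; this is only an illustration using scipy's extreme-value distribution, not MAMMOTH's own per-length-bin parameterisation, and the background scores below are made-up numbers:

import numpy as np
from scipy.stats import gumbel_r, norm

def psi_zscore(psi, random_psis):
    # Fit a Gumbel (extreme-value) distribution to PSI scores from random,
    # unrelated comparisons of similar length, then convert the survival
    # probability of the observed PSI into an equivalent one-sided Z-score.
    loc, scale = gumbel_r.fit(random_psis)
    p_value = gumbel_r.sf(psi, loc=loc, scale=scale)   # P(random PSI >= psi)
    return norm.isf(p_value)

# Toy usage with simulated (hypothetical) background PSI scores.
background = gumbel_r.rvs(loc=15.0, scale=3.0, size=1000, random_state=0)
print(psi_zscore(30.0, background))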

Distance matrix clustering

In Bioinformatics, we often deal with distance matrices such as:

  • Quantifying pairwise similarities between sequences
  • Structural similarity between proteins (RMSD?)
  • etc.

The next step is to study the groupings within the distance matrix using an appropriate clustering scheme. The obvious issue with most clustering methods is that you need to specify the number of clusters beforehand (as for K-Means). Assuming that you do not know very much about the data and ‘plotting’ it is not an option, you might try non-parametric hierarchical clustering such as linkage. The main difference between the two approaches is that using linkage you specify what the maximal distance within each cluster should be, and the number of clusters is adjusted accordingly. In contrast, using K-Means you do not have such a distance guarantee within each cluster, since the number of groups is predefined.

Here I will provide a short piece of python code that employs the hcluster library to perform linkage clustering.

Installation

Download hcluster, unpack it and inside the unpacked folder type:

python setup.py install

Alternatively, if you’re not an admin on your machine type:

python setup.py install --user

 Example Code

The purpose of the example bit of code is to generate a random set of points within (0,10) in 2D space and cluster them according to the user’s Euclidean distance cutoff.

import matplotlib.pyplot as plt
from matplotlib.pyplot import show
from hcluster import pdist, linkage, dendrogram
import numpy
import random
import sys

#Input: link_mat = linkage matrix, thresh = the threshold to split, n = distance matrix size
def split_into_clusters(link_mat, thresh, n):
    c_ts = n
    clusters = {}
    for row in link_mat:
        if row[2] < thresh:
            n_1 = int(row[0])
            n_2 = int(row[1])

            if n_1 >= n:
                link_1 = clusters[n_1]
                del(clusters[n_1])
            else:
                link_1 = [n_1]

            if n_2 >= n:
                link_2 = clusters[n_2]
                del(clusters[n_2])
            else:
                link_2 = [n_2]

            link_1.extend(link_2)
            clusters[c_ts] = link_1
            c_ts += 1
        else:
            return clusters

#Size of the point matrix
rows = 10 #number of points
columns = 2 #number of dimensions - 2=2D, 3=3D etc.
samples = numpy.empty((rows, columns))

#Initialize a random points matrix with values between 0, 10 (all points in the upper right 0,10 quadrant)
for i in xrange(0,rows):
    for j in xrange(0,columns):
       samples[i][j] = random.randint(0,10)

#Show the points we have randomly generated
print "Samples:\n ", samples

#Create the distance matrix for the array of sample vectors.
#Look up 'squareform' if you want to submit your own distance matrices as they need to be translated into reduced matrices
dist_mat = pdist(samples,'euclidean')

#Perform linkage clustering - here we use 'average', but other options are possible which you can explore here: http://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.cluster.hierarchy.linkage.html
z = linkage(dist_mat, method='average',metric='euclidean')

#Specify a cutoff that will define the clustering - command line argument: 
#python ClusterExample.py 3.0
cutoff = float(sys.argv[1])
clustering = split_into_clusters(z,cutoff,rows)
if clustering is None:
    print "No clusters! Most likely your distance cutoff is too high - everything falls into the same bag!"
    quit()

#Print the potential singletons - in magenta
for i in xrange(0,rows):
    plt.plot(samples[i][0],samples[i][1], marker='o', color='m', ls='')
    plt.text(samples[i][0],samples[i][1], str(i), fontsize=12)

#If there are more clusters than there are colours, the code will fail!
colors = ['b','r','g','y','c','k']

cluster_num = 0
for cluster in clustering:
    print "Cluster: ",cluster_num
    cluster_num+=1
    for i in clustering[cluster]:
        print "-->",i
        plt.plot(samples[i][0],samples[i][1], marker='o', color=colors[cluster_num], ls='')

#Set the axis limits
plt.xlim([-1,12])
plt.ylim([-1,12])

show()
#Alternatively, plot it as a dendrogram to see where your distance cutoff performed the tree cut
dendrogram(z)
show()

When I ran the code above (python [whatever you call the script].py 2.0), this is what I got (colours correspond to clusters, with magenta marking singletons):

[Figure: the randomly generated points, coloured by cluster]

And there is a dendrogram command at the bottom of the script to see what the clustering has actually done and where it performed the cut according to your specified cutoff (colours DO NOT correspond to clusters here):

[Figure: the corresponding dendrogram]

 

The functionality of the hcluster library also forms part of scipy (scipy.cluster.hierarchy), which provides very useful methods for data analysis. You can modify the above code to use a variety of other hierarchical clustering methods, which you can explore further here.

Freezing python code

Many of us in the group use python as our primary programming language. It is, in my opinion, an awesome language for lots of reasons. However, what happens when you write an application and want to share it with the world? Simply distributing the source code requires a great deal of configuration by the end user. I’m sure you’ve all been there: you have version 1.5.1 of a library, they use version 1.6.3, and downloading and installing the right version breaks every other bit of code you are using. Creating virtual environments can help with this, but do you really want to go through all that hassle for every application you want to use? In the end I have given up trying on a number of projects, which is a fate you would never want for your own code!

From my point of view there are three ways of counteracting this issue.

  1. Make limited use of libraries and imports
  2. Have incredibly clear instructions on how to set up the virtual env
  3. Freeze your code!

The first solution is sometimes just not possible or desirable. For example if you want to use a web framework or connect to third party database engines. The second could be massively time consuming and it is virtually impossible to cover all bases. For example, RDKit, my favourite cheminformatics package, has a lengthy install process with platform specific quirks and many of its own dependencies.

In my project I opted for solution number three. I use PyInstaller, however there are many others available (cx_Freeze, py2app, py2exe). I used PyInstaller because my application uses the Django project and they offer extra Django support. Also, PyInstaller is cross-platform, allowing me (in theory) to package applications for Windows, Mac and Linux using the same protocol.

Here I will briefly outline how to freeze your code using PyInstaller. This application validates a smiles string and shows you the RDKit canonical form of the smiles string.
This is the structure of the code:

src/
  main.py
  module/
    __init__.py
    functions.py
build/
dist/

main.py is:

import sys
from module.functions import my_fun
if len(sys.argv) > 1:
  smiles = sys.argv[1]
  print my_fun(smiles)
else:
  print "No smiles string requested for validation"

functions.py is:

from rdkit import Chem
def my_fun(smiles):
  mol = Chem.MolFromSmiles(smiles)
  if mol is None:
    return "Invalid smiles"
  else:
    return "Valid smiles IN:  " + smiles + "  OUT: " + Chem.MolToSmiles(mol,isomericSmiles=True) 

  1. Download and install PyInstaller 
  2. Type the following (assuming main.py is your python script)
  3. pyinstaller src\main.py --name frozen.exe --onefile
      This will produce the following directory structure:

      src/
        main.py
        module/
          functions.py
      build/
        frozen/
      dist/
        frozen.exe
      frozen.spec
      

      frozen.spec is a file containing the options for building the application:

      a = Analysis(['src\\main.py'],
                   pathex=['P:\\PATH\\TO\\HEAD'],
                   hiddenimports=[],
                   hookspath=None,
                   runtime_hooks=None)
      pyz = PYZ(a.pure)
      exe = EXE(pyz,
                a.scripts,
                a.binaries,
                a.zipfiles,
                a.datas,
                name='frozen.exe',
                debug=False,
                strip=None,
                upx=True,
                console=True )
      

      “build” contains files used in the building of the executable

      “dist” contains the executable that you can distribute freely around. Because I used the “--onefile” option above, it creates one single .exe file. This makes the file very easy to ship. HOWEVER, for large programmes this isn’t totally ideal: all the dependencies are compressed into the .exe and uncompressed into a temporary folder at runtime. If there are lots of files, this process can be VERY slow.

      So now we can run the program:

      dist/frozen.exe c1ccccc1 

      Running dist/frozen.exe returns the error: ImportError: numpy.core.multiarray failed to import
      This is because the RDKit uses this module and it is not packaged up in the frozen code. The easiest way to resolve this is to include this import in main.py:

      from rdkit import Chem
      import numpy
      import sys
      from module.functions import my_fun
      if len(sys.argv) > 1:
        smiles = sys.argv[1]
        print my_fun(smiles)
      else:
        print "No smiles string requested for validation"
      

      And there you have it. “frozen.exe” can be passed around to anyone using windows (in this case) and will work on their box.
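      An alternative, if you would rather not edit main.py at all, is to declare the missing module as a hidden import in the spec file (the hiddenimports list is already present in the generated frozen.spec shown above). A sketch of that approach:

      a = Analysis(['src\\main.py'],
                   pathex=['P:\\PATH\\TO\\HEAD'],
                   hiddenimports=['numpy'],
                   hookspath=None,
                   runtime_hooks=None)

      Rebuilding with “pyinstaller frozen.spec” should then package numpy without any change to the source.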

      Obviously this is a very simple application. However I have used this to package Django applications, using Tornado web servers and with multiple complex dependencies to produce native windows desktop applications. It works! Any questions, post below!

OPIG Algorithm Club Problem #1: The 3n+1 problem

In the first meeting of the OPIG Algorithm Club, we tackled the 3n+1 problem from the Sphere Online Judge (SPOJ).

— Background —

The problem relates to the Collatz conjecture. Before describing the conjecture, let us define a very simple recursive relation:

  • If a number is odd, multiply it by 3 and add 1 (hence 3n+1).
  • If a number is even, divide it by 2.

The Collatz conjecture states that, for any given positive integer, if you repeat this recursion you will always eventually reach the number 1.

The 3n+1 problem requires yet another concept which is the cycle length (i.e. the number of times you have to repeat the recursion for a given number until you reach 1).

— Goal —

The aim of the problem is: given 2 integers i and j, find the highest cycle length among all the numbers that lie in the interval between the 2 integers (both included).

— Important Issues —

Two things are worth mentioning before attempting to implement this problem:

  • i could be either greater than or less than j.
  • Despite the fact that the description of the problem states that all operations can be performed using int, there are certain inputs between 100,000 and 1,000,000 that violate that statement. 

— Implementation —

A simple solution can be implemented by using a recursive function to compute the cycle length:

int Cycle_Length (unsigned int n)     /* Note the unsigned precision */
{
    unsigned int aux;

    /* Stopping condition - we don't want the recursion to run forever */
    if (n == 1) return 1;

    /* The recursion: the first branch applies if n is odd, the second if n is even.
       Division by two is performed by shifting the bits one position to the right. */
    aux = (n%2) ? (1 + Cycle_Length(3*n+1)) : (1 + Cycle_Length(n>>1));

    return aux;
}

Now, there are two ways we can optimise this function and make it more efficient.

The first one relates to a certain property of odd numbers: we know that if n is odd, then 3*n+1 is going to be even. Therefore we can simply skip one iteration of the cycle:

aux = (n%2) ? (2+Cycle_Length((3*n+1)>>1)) : (1+Cycle_Length(n>>1)) ;

We can further optimise our code by using Dynamic Programming. We can store the Cycle Lengths we already computed so we don’t need to compute them ever again.

Using dynamic programming and the tricks above, we can get to the final solution (which runs pretty fast on the Online Judge platform):

#include<stdio.h>
#include<stdlib.h>

int CycleLength[100001];

int Cycle_Length (unsigned int n)
{
	int aux;
	if(n>100000)
	{
		aux = (n%2) ? (2+Cycle_Length((3*n+1)>>1)) : (1+Cycle_Length(n>>1)) ;
		return aux;
	}
	if(!CycleLength[n]) 	
		CycleLength[n] = (n%2) ? (2+Cycle_Length((3*n+1)>>1)) : (1+Cycle_Length(n>>1)) ;
	return CycleLength[n];
}

int findmax(int a, int b)
{
	int max=0,aux,i;
	for(i=a;i<=b;i++)
	{
		aux=Cycle_Length(i);
		if(max<aux)
			max=aux;
	}
	return max;	
}

int main()
{
	int a,b;
	CycleLength[1]=1;
	while(scanf("%d %d",&a,&b)!=EOF)
		a<b?printf("%d %d %d\n",a,b,findmax(a,b)):printf("%d %d %d\n",a,b,findmax(b,a));
	return 0;
}

Ta daam!

Introduction to the protein folding problem

Recently (read: this week), I had to give a presentation on my research to my college. We were informed that the audience would be non-specialist, which in fact turned out to be an understatement. For example, my presentation followed on from a discussion of the differences in the education systems of North and South Korea for the period 1949-1960. Luckily, I had tailored my entire talk to be understandable by all and devoid of all jargon. I chose to deviate from the prescribed topic and, instead of talking about my research specifically, I chose to discuss the protein folding problem in general. Below you’ll find the script I wrote, which I feel gives a good introduction to the core problem of this field.

———————-

The protein folding problem is one of the great projects within the life sciences. It has been studied over the last half century by vast numbers of great scientists, with backgrounds including chemistry, physics, maths and biology, and all have been beaten by the sheer complexity of the problem. As a community we still have only scraped the surface with regards to solving it. While I could, like many of you here, go into my own research in great detail and bore you wholeheartedly for the next 10 minutes with technical details and cryptic terminology, I will instead try to give an overview of the problem and why thousands of scientists around the globe are still working on cracking it.

First of all, I guess that a few of you are trying to remind yourself what a protein is; the horror that is high school biology crawling back from that area in your brain you keep for traumatic experiences like family gatherings. Luckily, I’m fairly new to the topic myself, my background being in physics and chemistry, so hopefully my explanation will be still in the naive terms that I use to explain the core concepts to myself. Proteins are the micro-machines of your body, the cogs that keep the wheels turning, the screws that hold the pieces together and the pieces themselves. Proteins run nearly all aspects of your body and biochemistry, your immune system, your digestion and your heart beating. There are approximately between 20 to 30 thousand different proteins in your body, depending on who you ask, and trillions overall. In fact, if we take every protein in our body and scale it up to the size of a penny, the proteins in a single human, albeit a rather dead human, would be enough to fill the entire pacific ocean. Basically, there is a hell of a lot of proteins, with a vast range of different types, each of which is very individual, both in its compositions and function, and, crucially, they are nearly all essential. The loss of any protein can lead to dramatic consequences including heart disease, cancer, and even death.

So now that you know that they are important and there are lots of them, what exactly is a protein? The easiest analogy I have is that of a pearl necklace, a long string of beads in a chain. Now consider your significant other has gone slightly insane and instead of purchasing jewelry for you that consists of a single bead type, or even two if you have slightly exotic tastes, they have been shopping at one of the jewellery stores found in the part of town that smells rather “herby”. You receive a necklace which has different beads across the entire length of the necklace. We have blues, yellows, pinks, and so on and so forth. In fact, we have 20 different types of beads, each with its own colour. This is basically a protein chain: each of the beads represents one of the twenty essential amino acids, each of which has its own chemical and physical properties. Now suppose you can string those pearls together in any order: red, green, blue, blue, pink etc. It turns out that the specific order that these beads are arranged along the length of the protein chain define exactly how this chain “crumples” into a 3D shape. If you think that adding an extra dimension is impossible, just consider crumpling a piece of paper; that is 2D –> 3D transition (mathematicians please bite your tongue). Now one string of colours, blue, blue, pink for example, will crunch into one shape, and that shape may become your muscle, while a different sequence, say, green, blue, orange, will crumple down into something different, for example an antibody to patrol your blood stream.

So essentially we have this “genetic code”, the sequence of amino acids (or beads), which in turn defines the shape that the protein will take. We in fact know that it is this shape that is the most important aspect of any protein, this having been found to define the protein’s actual function. This is because, returning to the bead analogy, we can change up to 80% of the beads to a different colour while still retaining the same shape and function. This is amazing when you consider how many other objects can have their baseline composition changed to the same extent while still retaining the same function. The humble sausage is one of those objects (actually, below 40% meat content they are referred to as “bangers”), but even then, would you want 80% of your sausage to be filler? There is a reason Tesco value sausages taste so different to the nice ones you buy at the butchers. Returning to proteins, we are not trying to say that the sequence isn’t important; sometimes changing just a single bead can lead to a completely different shape. Instead it says that the shape is the critical aspect which defines the function. To summarise, sequence leads to shape, which in turn leads to function.

This is unfortunate, because while it is getting increasingly simple to experimentally determine the sequence of a protein, that is the exact order of coloured beads, the cost and time of getting the corresponding structure (shape) is still extremely prohibitive. In fact, we can look at two of the major respective databases, the PDB, which contains all known protein structures, and GENBANK, which contains all known protein sequences, and compare the respective number of entries. The disparity between the two is huge, we are talking orders of magnitude huge, 10^15 huge, i.e. the number of humans on the planet squared huge. AND this gap is growing larger every year. Basically, people in the last few years have suddenly gained access to cheap and fast tools to get a protein’s sequence, to the extent that people are widely taking scoops of water across the world and sequencing everything, not even bothering to separate the cells and microorganisms beforehand. Nothing analogous exists to get the structure of a protein. The process takes months to years, depending on many factors, each of which may be “something” for one protein and then a completely different “something” for a similar protein. This has led to a scenario where we know the sequence of every protein in the human genome, yet we know the structure of only about 10% of them. This is utterly preposterous in my opinion given how important this information is to us! We basically don’t know what 90% of our DNA does!

Basically, until an analogous method for structure determination is produced, we have no choice but to turn to predictive methods to suggest the function of proteins that we do not have the structure for. This is important as it allows us, to some degree, to target proteins that we “think” may have an important effect. If we didn’t do this, we would simply be searching for needles in haystacks. This is where my research, and that of my group, kicks in. We attempt to take these sequences, these strings of beads, and predict the shape that they produce. Unfortunately, the scientific community as a whole still relatively sucks at this. Currently, we are only successful in predicting structures for very small proteins, and when anything more complex is attempted, we, in general, fail utterly miserably. In my opinion, this is because the human body is by far the most complex system on the planet and so far we have tried to simply impose physics on top of the problem. This has failed miserably due to the sheer complexity and multitude of factors involved. Physics has mostly nice vacuums and pleasant equations; however, ask a physicist about a many-body system and they will cry. So many factors are involved that we must integrate them all together, which is why so many people are working on this, and will be for many years to come. Well, I guess that’s good news for my future academic career.

Anyway, I hope this talk has given you some degree of insight into the work I do and you have learned something about how your body works. For those extremely interested, please feel free to approach me later and I will happily regale you with the exact aspect of protein folding I work on. But for now I would love to try and answer any questions you all have on the content contained in this talk.

Quick Standalone BLAST Setup for Ubuntu Linux

Some people run into trouble trying to set up a standalone version of BLAST using the NCBI instructions. Here a streamlined process will be presented, targeted at Ubuntu.

I assume that you are aware of the paradigms of BLAST, meaning that there are several executables for searching nucleic acids or proteins and there are different databases you can blast against. Otherwise, you should read up on the available search tools and databases before you attempt to install BLAST. NB, throughout this document, I am using protein blast and protein input; changing to nucleotide sequences is trivial as you just change blastp to blastn and ‘prot’ to ‘nt’ in obvious places (and of course you use different queries and target databases).
Without further ado, Blast setup for UNIX.
There are two components for the installation:
  1. Executables (blastn, blastp etc.)
  2. Databases (nr, nt etc.)
Both are described below with follow-up examples of usage.
Ad.1 The executables can be downloaded and compiled from here (download the source, run ./configure then make and finally make install in the directory of the untarred file). However a much easier way to do it under Ubuntu is:
sudo apt-get install ncbi-blast+
This automatically installs everything. In both cases to check if all went ok, type:
which blastp
If you get a directory such as /usr/local/bin then all went well and that's where your executables are.
Ad.2 First, you need to decide where to store the databases. Do this by setting the environment variable:
export BLASTDB=/path/to/blastdbs/of/your/chosing
Now, we can either use one of the ncbi-curated databases or create our own. We will do both.
A) Downloading and using an ncbi-curated database.
The databases can be downloaded using the update_blastdb script. As an example I will download a non redundant protein database which is referred to as ‘nr’:
cd $BLASTDB
sudo update_blastdb --passive --timeout 300 --force --verbose nr
ls *.gz |xargs -n1 tar -xzvf
rm *.gz.*

The penultimate command extracts all the files you have downloaded and the last one removes the downloaded archives.

Now you should be able to use your new database by executing (where somesequence.fasta is your sample query):

blastp -db nr -query somesequence.fasta

Done.

B) Creating your own database.

Firstly, put a bunch of fasta protein sequences into a file called sample.fa

Next, execute the following

makeblastdb -in sample.fa -dbtype 'prot' -out NewDb
mv NewDb* $BLASTDB/

We have now created a blast protein database from your fasta file, called NewDb. The last line simply moves all the blast files to the database directory.

Now you should be able to use your new database by executing (where somesequence.fasta is your sample query):

blastp -db NewDb -query somesequence.fasta

Done.

Afterword

These instructions are the shortest way I could find to get a working stand-alone BLAST application. If you require more info, you can look here.

 

Journal Club: Statistical Quality Indicators for Electron Density Maps

This week I presented Ian Tickle’s 2012 paper “Statistical quality indicators for electron-density maps”. This paper presented new, statistically robust metrics for describing the agreement between an atomic model and the experimentally derived electron density.

Previous metrics such as the Real-Space R (RSR) and Real-Space Correlation Coefficient (RSCC) (Brandon & Jones, 1991, and others) are popular electron density metrics, and can inform on the quality of an atomic model. However, as Tickle argues, they cannot tell us why a model is good or bad, as they give no indication of either the accuracy or the precision of the electron density.

Accuracy:

Ian Tickle describes accuracy as – “How close are the results on average to the truth (regardless of precision)?” This is more often referred to as ‘error’. The most accurate model is the one that best agrees with the electron density.

Precision:

Precision is described as – “If you were to repeat the experiment, how much would you expect the results to vary (regardless of accuracy)?” This is more often described as ‘uncertainty’. Precision is a property of the crystal and the experiment. It is independent of the model.

A pictographic representation is shown below –

Pictographic representation of accuracy and precision. Taken from Tickle, 2012.

Before the discussion of the new metrics proposed, there are several assumptions that must be made and several influencing factors to be considered.

Assumptions:

  • The electron density, and the phases used to generate it, are accurate. This assumption is reasonable because density-based validation is generally done near to the end of refinement when the model is mostly correct.

Metric usefulness depends critically on:

  • Accurate calculation and representation of the electron density from our atomic model.
  • Accurate scaling of the observed and model density (neither the calculated nor the observed density is on an absolute scale).
  • Accurate determination of the area to be considered for the calculation of the metric. If too large an area is considered, noise and density from other features will influence the metric. Too small an area will not encompass the whole model and its environment.

Calculating the Model Density:

Accurate calculation of the model’s electron density is essential, as the profile of the atoms will of course affect the comparison of the model to the experimental density. Often (as in Jones 1991, and others) a fixed profile is assumed for all atoms. Of course, in reality the profile will depend on atom type, B-factors, data completeness, and resolution limits.

Due to the resolution limits, the electron density from an atom is the convolution of a 3d gaussian and a sphere of constant scattering power (Blundell & Johnson, 1976). The truncated density function for an atom then becomes:

[Equation: the truncated atomic density function]

Scaling the calculated density:

This, fortunately, is already available and calculated by refinement programs (when calculating 2mFo – DFc maps), and the correct scaling factor is the resolution-dependent D.

Visualising the quality of a model:

To demonstrate how the (global) quality of a model can easily be seen, Tickle calculates and normalises difference density maps for a good, and a bad, model. If the model is ‘correct’, then the difference density should be gaussian noise, but if the model is ‘incorrect’, it will be skewed. This can easily be seen in Figure 8 from the paper.

[Figure 8 from the paper: normalised difference density distributions for a good model (a) and a bad model (b)]

A difference density map is calculated, sorted by value and normalised to give a density distribution. For a good model, this should look like (a), where the density function is a gaussian ~ N(0,1). For a bad model, (b), the distribution is skewed.

The main feature that appears in a ‘bad’ model is the increased weight in the tails of the distribution. Extra weight on the left-hand side indicates modelled density that is not supported by the evidence, and extra weight on the right-hand side indicates under-modelled density.

The New Accuracy Metric

Using the ideas from above (that the difference density between a model and the experimental density should be distributed as a gaussian) Tickle goes on to develop metrics for determining the likelihood that a model density and an experimental density differ only by the addition of random noise.

The metric he describes tests a hypothesis – Does the distribution of the difference density reflect that obtained from the propagation of random errors in the experimental data (and phases)?

To do this, statistical tests are developed. First we define the difference density Z-score (ZD):

ZD = Δρ / σ(Δρ)

This quantity is the difference between the calculated electron density and the experimental density (delta rho), divided by the error in the difference density, giving the normal definition of a normalised Z-score.

The difference density (the numerator) has been discussed above, so we now discuss the error in the difference density. Assuming that the experimental data and the phases are ‘correct’, any non-random errors arise only from the model.

That is, errors arising in the experimental data will appear only as random noise, whereas errors in the model will manifest as the signal that we are trying to detect.

To calculate the strength of the noise (that of the experimental data and the phases), we look at the bulk-solvent regions. Here, the atoms are disordered, and so should give uniform density. Any deviations from uniformity should be approximately the random noise from the experimental data and the phases.

Maximum Z-score analysis

Tickle considers using the maximum ZD of a sample as a test for model accuracy, and discusses merits and failings. In brief, if we were to sample from a difference density distribution, and take only the most significant ZD score, “focusing only on the maximum value inevitably overstates the significance of the results”.

A Chi-Squared test for ZD scores

The solution that Tickle proposes is to allow that all sample values may be significant (rather than just the largest values). He creates a joint probability density function of the absolute sample values (assumed half-normal and iid). The sum of squares of these values then follows a chi-squared distribution.

By calculating the CDF of the chi-squared distribution (a lower regularised gamma function), Tickle is able to obtain p-values for a set of observations.

These can then be converted back to Z-scores, which crystallographers are more comfortable using. As Tickle states, just because the metric is in terms of Z-scores does not mean that the distribution is normal (here it is clearly a chi-squared).
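To make the mechanics concrete, here is a small Python/scipy sketch of the simple version of this test (the straightforward chi-squared combination described above, not the corrected statistic developed later in the paper):

import numpy as np
from scipy.stats import chi2, norm

def combined_zscore(zd_values):
    # Treat the absolute difference-density Z-scores as half-normal samples,
    # combine them via a chi-squared statistic, and convert the resulting
    # p-value back into a single equivalent Z-score.
    zd = np.asarray(zd_values, dtype=float)
    stat = np.sum(zd ** 2)                # sum of squared N(0,1) values ~ chi2(n)
    p_value = chi2.sf(stat, df=zd.size)   # probability of a result this extreme by chance
    return norm.isf(p_value)

# Toy usage: pure noise should give an unremarkable score; a few large
# residuals should push it up.
print(combined_zscore(np.random.randn(50)))
print(combined_zscore(np.concatenate([np.random.randn(50), [5.0, 6.0]])))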

Problem: Large Samples

The problem with the above is that for large samples, a small number of significant values will be drowned out by noise and the signal may be missed. The failure of the above test in this situation is put down to the choice of null hypothesis. Multiple null hypotheses are needed in order to cover all possibilities.

When distinguishing between multiple hypotheses, we must aim to avoid type II errors wherever possible, whilst attempting to minimise type I errors. We must select the hypothesis “that maximises the probability of obtaining a result less extreme than the one actually observed (…) or equivalently the one that minimises the probability of obtaining a result more extreme than that observed”.

Solution: A new JPDF and CDF

To solve this, Tickle takes a subset of the highest values of the original sample of n, say from i=k to n (the n-k highest scores), and calculates the chi-squared and its associated cumulative probability. We will then choose the value of k such that it gives us the highest probability,

[Equation: the cumulative probability maximised over the choice of k]

However, the cumulative probability of the chi-squared statistic is no longer the regularised gamma function, due to the bias introduced by selecting the largest values. Recalculating the JPDF and integrating analytically and numerically to obtain the CDF, we can arrive at a result. This, however, has the problem of high dimensionality, which requires the use of very accurate Monte Carlo integration (an accuracy of much better than 0.27% is required, since we are interested in p-values between 0.9973 and 1, i.e. greater than 3 sigma).

Fortunately, an approximation can be made to bring about a practical solution.

Examples of Significant Distributions:

Tickle generates a table which gives several scenarios that will give a significant result, for different extremes of Z-value and different sample sizes. One particularly key point is:

…small samples are statistically less reliable so require a higher proportion of significant data points to achieve the same overall level of significance. Large samples require relatively fewer data points but they must have higher values to overcome the ‘multiple comparisons’ effect, where large values are more likely to occur purely as a result of random error.

Summary

B-factors:

Tickle shows early in the paper that the RSR and RSCC are correlated with the B-factor of the model. RSZD shows no correlation with the B-factor, as is desired.

RSZD+ & RSZD-:

More useful scores can be generated by scoring the negative and positive values of delta-rho separately. This gives two scores, RSZD+ and RSZD-. RSZD+ gives the significance/prevalence of unexplained density (missing atoms) and RSZD- gives the significance/prevalence of unjustified model/misplaced atoms.

Precision & Reliability:

Although not discussed in as much depth in the paper, Tickle also proposes a metric to account for the precision of the electron density

[Equation: RSZO, the ratio of the average observed density in a specified region to the noise level]

This is clearly independent of the model, and is the signal-to-noise ratio of the average observed density in a specified region. Weak density (or large noise) will lead to a small RSZO, implying that any model placed here should be considered unreliable.

Structural Biology Module @ the DTC

As part of the DTC Structural Biology module (Feb 2014), first-year PhD students were given 3 days to answer one of several questions from fields within structural biology. The format had to be an automated presentation, and it had to be ENTERTAINING.

Video 1: Is Your Ligand Really There?

The pilot episode of the award-winning series “Protein Hour”…

Video 2: Protein-Protein Docking

Do not attempt to spoof “The Matrix” – That is impossible…

http://www.youtube.com/watch?v=1hVddocYpwI

Video 3: Are Membrane Proteins Special?

An appeal from “Protein Relief 2014”…

Video 4: Structure-based and fragment-based drug design – do they really work?

Is stop-motion animation the next blockbuster in drug design?

http://www.youtube.com/watch?v=-YHicyspTR4

Journal Club: Native contacts in protein folding

Like your good old headphone cables, strings of amino acids have the potential to fold into a vast number of different conformations given the appropriate conditions. A conservative estimate of the time it would take a 100-residue protein to explore all theoretically possible conformations exceeds the age of the Universe several times over. This is obviously not feasible, as Levinthal pointed out when he published his “How To Fold Graciously” in 1969.

The so-called Protein Folding Problem has since been under intense study, which has inevitably led to a few theories and models about its nature. Due to the lack of appropriate wet-lab methods to study this phenomenon, theoretical and computational approaches have been key to devising impactful frameworks for formally describing protein folding. One of these goes under the name of the principle of minimum frustration, introduced by Bryngelson and Wolynes in the late 80s (1). It states that evolution has enriched proteins for sequences with the propensity to fold into low-energy structures, while actively selecting against traps. By avoiding mis-folding and non-native contacts, the theory says, a smooth funnel-like energy landscape with native-state minima is created that ensures robust and fast folding.

This implies that native contacts, i.e. residues that interact in the fully folded protein, play a major role in the folding process. Gō models (2), named after Nobuhiro Gō who first proposed this method, are based around this assumption, with the energetic contributions of native interactions acting as the sole driving forces in the folding process. While this approach has yielded promising results, many of which were in concordance with experiments, its underlying principles have never been validated in a statistically meaningful way.


A schematic for native-contact-driven protein folding

In 2013 a study by Best, Hummer and Eaton (3) formally addressed this question. By devising a set of statistical quantities aimed at weighting the importance of native and non-native interactions for folding, and applying these to the analysis of several long MD folding simulations, they were able to show a “native-centric mechanism” for small fast-folding proteins.

In a first step it was assessed whether the fraction of native contacts provided a suitable reaction coordinate for the simulated folding events. From their equilibrium simulations, two thresholds of native-contact fraction were chosen that defined the folded and unfolded states (a two-state model is assumed). Overlaying the most visited native-contact fractions during simulation against these thresholds revealed a strong correlation between the two equilibrium probability density maxima and the protein’s fold state. In addition, they showed that the range of native-contact fractions between the unfolded and folded thresholds was indicative of being on a transition path (defined as the “.. regions of the trajectories that cross directly from the unfolded well to the folded well ..”).
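As a concrete (if simplified) illustration, the fraction of native contacts for a single simulation frame can be computed along the following lines; the hard distance cutoff and the sequence-separation threshold are illustrative choices, not the contact definition used in the paper:

import numpy as np

def native_contacts(native_coords, cutoff=8.0, min_separation=3):
    # Residue pairs in contact in the native structure: C-alpha distance below
    # `cutoff` Angstroms and at least `min_separation` apart in sequence.
    d = np.linalg.norm(native_coords[:, None] - native_coords[None, :], axis=-1)
    i, j = np.triu_indices(len(native_coords), k=min_separation)
    return [(a, b) for a, b in zip(i, j) if d[a, b] < cutoff]

def fraction_native(frame_coords, contacts, cutoff=8.0):
    # Fraction of the native contacts that are also formed in a given frame.
    d = np.linalg.norm(frame_coords[:, None] - frame_coords[None, :], axis=-1)
    formed = sum(1 for a, b in contacts if d[a, b] < cutoff)
    return formed / len(contacts)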

A further measure was introduced with the contact lifetime test. The log-ratio of the time a contact spent on a transition path vs the time it existed in the unfolded state was calculated and compared in a heat-map to the native contact map coloured by the number of contacts between residues.


Contact life time test for a selected protein.
Adapted from (3).

Among other things, this result revealed a clear connection between contacts with longer transition-path lifetimes and the number of contacts they made in the native structure.

So what about non-native interactions?


One of the measures addressing this question was the Bayesian measure for non-native contacts on transition paths. In the examples used in this paper, no obvious link was found between forming a non-native contact and being on a transition path, unless the non-native contacts were close to native ones. Further criteria, such as the complementary quantity (the probability of being on a transition path when a contact is not made), led to similar conclusions.

Interestingly, it was found that the one protein that was influenced by non-native contacts was the designed α3D. Best et al. reasoned that the additional frustration introduced when building a protein with artificially introduced stability had led to a shift in helix register, giving rise to this outlier.

When taken together, these results lay a robust foundation for further studies along the same lines. It is too early to accept or reject the presented findings as universal truth, but strong arguments for the native-centric mechanism being a reasonable model in small fast-folding proteins have been made. It would not be far-fetched to think that larger proteins would adhere to similar principles with non-native contacts modulating the landscape, especially when considering individual downhill folding modules.

References:

(1) Bryngelson, J.D. et al., 1995. Funnels, pathways, and the energy landscape of protein folding: a synthesis. Proteins, 21(3), pp.167–95.

(2) Taketomi, H., Ueda, Y. & Gō, N., 1975. Studies on protein folding, unfolding and fluctuations by computer simulation. I. The effect of specific amino acid sequence represented by specific inter-unit interactions. International journal of peptide and protein research, 7(6), pp.445–59.

(3) Best, R.B., Hummer, G. & Eaton, W.A., 2013. Native contacts determine protein folding mechanisms in atomistic simulations. Proceedings of the National Academy of Sciences of the United States of America, 110(44), pp.17874–9.