Author Archives: Alistair Martin

Colour page counter

So you’ve written the thesis, you’ve been examined, the corrections are done, and now you are left with just wearing the silly clown robes to get a piece of paper with your name on it. However, you’ve been informed that you aren’t allowed to don the silly robes until you print the damn thing (again) and submit it to the Bod to be ignored for generations to come. Oh, and the added bonus is that you have to pay for it. Naturally, you want high-quality printing and paper to match for the final version, but it’s all so expensive. At least you can save a few meagre pounds by specifying only the pages that need colour printing. Equally naturally, I decided to spend far more time writing a script than it would have taken to count them myself (which I did anyway, to verify it works). Enjoy.


#!/usr/bin/env Rscript

library(data.table)

#Ask ghostscript's inkcov device to report the CMYK ink coverage of each page
#(one line per page); awk keeps only the C, M and Y columns.
args <- commandArgs(trailingOnly=TRUE)
x <- system(paste("gs -q -o - -sDEVICE=inkcov",args[1],"| awk '{print $1,$2,$3}'"),intern=TRUE)

#Split the three coverage values into columns and convert them to numerics.
x <- as.data.table(tstrsplit(x,' '))
x[,c("V1","V2","V3"):=.(as.numeric(V1),as.numeric(V2),as.numeric(V3))]

#A page needs colour printing if any of the C, M or Y channels is non-zero.
print(paste("Colour pages total:",sum(rowSums(x)!=0)))
print(paste("Colour pages:", paste(which(rowSums(x) != 0),collapse=', ')))

Processing large files using python: part deux

Last week I wrote a post on some of the methods I use in python to efficiently process very large datasets. You can find that here. Roughly, it details how one can break a large file into chunks which can then be passed onto multiple cores to do the number crunching. Below I expand upon this, first creating a parent class which turns a given (large) file into chunks. I construct it in a manner in which child classes can be easily created and tailored for specific file types, and give some examples. Finally, I give some wrapping functions for use in conjunction with any of the chunkers so that the chunks can be processed using multiple cores.

First, and as an aside, I was asked after the previous post at what scale these methods should be considered. A rough answer would be: when the size of the data becomes comparable to the available RAM. A better answer would be: when the overhead of reading each individual line (or entry) is greater than the operation performed on that entry. Here is an example of this case, though it isn’t really that fair a comparison:


>>> import timeit,os.path
>>> os.path.getsize("Saccharomyces_cerevisiae.R64-1-1.cds.all.fa")
10095955
>>> timeit.timeit('f = open("Saccharomyces_cerevisiae.R64-1-1.cds.all.fa");sum([l.count(">") for l in f])',number=10)
0.8403599262237549
>>> timeit.timeit('f = open("Saccharomyces_cerevisiae.R64-1-1.cds.all.fa");sum([c.count(">") for c in iter(lambda: f.read(1024*1024),"")])',number=10)
0.15671014785766602

For a small 10MB FASTA file, we count the number of sequences present in a fifth of the time using chunks. I should be honest though, and state that the speedup is mostly due to not having to identify newline characters in the chunking method; but nevertheless, it shows the power one can have using chunks. For a 14GB FASTA file, the times for the chunking (1MB chunks) and non-chunking methods are 55s and 130s respectively.

Getting back on track, let’s turn the chunking method into a parent class which we can build on:


import os.path

class Chunker(object):

    #Iterator that yields start and end locations of a file chunk of default size 1MB.
    @classmethod
    def chunkify(cls,fname,size=1024*1024):
        fileEnd = os.path.getsize(fname)
        with open(fname,'r') as f:
            chunkEnd = f.tell()
            while True:
                chunkStart = chunkEnd
                f.seek(size,1)
                cls._EOC(f)
                chunkEnd = f.tell()
                yield chunkStart, chunkEnd - chunkStart
                if chunkEnd >= fileEnd:
                    break

    #Move file pointer to end of chunk
    @staticmethod
    def _EOC(f):
        f.readline()

    #read chunk
    @staticmethod
    def read(fname,chunk):
        with open(fname,'r') as f:
            f.seek(chunk[0])
            return f.read(chunk[1])

    #iterator that splits a chunk into units
    @staticmethod
    def parse(chunk):
        for line in chunk.splitlines():
            yield line

In the above, we create the class Chunker, which has the class method chunkify as well as the static methods _EOC, read, and parse. The method chunkify does the actual chunking of a given file, returning an iterator that yields tuples containing a chunk’s start and size. It’s a class method so that it can make use of the _EOC (end of chunk) static method to move the pointer to a suitable location to split the chunks. For the simplest case, this is just the position straight after the next newline. The read and parse methods respectively read a given chunk from a file and split it into units (single lines in the simplest case). We make the non-chunkify methods static so that they can be called without the overhead of creating an instance of the class.
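
To make this concrete, here is a minimal usage sketch (input.txt being a hypothetical stand-in for whatever file you want to process) which counts the lines of a file chunk by chunk:


total = 0
for chunk in Chunker.chunkify("input.txt"):
    data = Chunker.read("input.txt",chunk)          #pull one chunk into memory
    total += sum(1 for line in Chunker.parse(data)) #split the chunk into lines
print(total)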

Let’s now create some children of this class for specific types of files. First, one of the most well-known file types in bioinformatics, FASTA. Below is a segment of a FASTA file. Each entry has a header line, which begins with a ‘>’, followed by a unique identifier for the sequence. After the header line, one or more lines follow giving the sequence. Sequences may be either protein or nucleic acid sequences, and they may contain gaps and/or alignment characters.


>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

And here is the file type specific chunker:


from Bio import SeqIO
from cStringIO import StringIO

class Chunker_FASTA(Chunker):

    @staticmethod
    def _EOC(f):
        l = f.readline() #incomplete line
        p = f.tell()
        l = f.readline()
        while l and l[0] != '>': #find the start of sequence
            p = f.tell()
            l = f.readline()
        f.seek(p) #revert one line

    @staticmethod
    def parse(chunk):
        fh = StringIO(chunk)
        for record in SeqIO.parse(fh,"fasta"):
            yield record
        fh.close()

We update the _EOC method to find where one entry finishes and the next begins by locating “>”, following which we rewind the file handle pointer to the start of that line. We also update the parse method to use the FASTA parser from the BioPython module, yielding SeqRecord objects for each entry in the chunk.
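
As a quick sanity check, something like the following should print every sequence identifier in a file (toy.fa being a hypothetical file holding the example entries above):


for chunk in Chunker_FASTA.chunkify("toy.fa"):
    for record in Chunker_FASTA.parse(Chunker_FASTA.read("toy.fa",chunk)):
        print(record.id) #SEQUENCE_1, SEQUENCE_2, ...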

For a second, slightly harder example, here is one designed to work with the output produced by bowtie, an aligner of short reads from NGS data. The format consists of tab-separated columns, with the id of each read located in the first column. Note that a single read can align to multiple locations (up to 8 by default!), hence the same id appearing on multiple lines. A small example section of the output is given below.


SRR014374.1 HWI-EAS355_3_Nick_1_1_464_1416 length=36	+	RDN25-2	2502 GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.1 HWI-EAS355_3_Nick_1_1_464_1416 length=36	+	RDN37-2	4460	GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.1 HWI-EAS355_3_Nick_1_1_464_1416 length=36	+	RDN25-1	2502	GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.1 HWI-EAS355_3_Nick_1_1_464_1416 length=36	+	RDN37-1	4460	GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.2 HWI-EAS355_3_Nick_1_1_341_1118 length=36	+	RDN37-2	4460	GTTTCTTTACTTATTCAATGAAGCG	IIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.2 HWI-EAS355_3_Nick_1_1_341_1118 length=36	+	RDN25-1	2502	GTTTCTTTACTTATTCAATGAAGCG	IIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.2 HWI-EAS355_3_Nick_1_1_341_1118 length=36	+	RDN37-1	4460	GTTTCTTTACTTATTCAATGAAGCG	IIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.2 HWI-EAS355_3_Nick_1_1_341_1118 length=36	+	RDN25-2	2502	GTTTCTTTACTTATTCAATGAAGCG	IIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.8 HWI-EAS355_3_Nick_1_1_187_1549 length=36	+	RDN25-2	2502	GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3	
SRR014374.8 HWI-EAS355_3_Nick_1_1_187_1549 length=36	+	RDN37-2	4460	GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3

with the corresponding chunker given by:


class Chunker_BWT(Chunker):

    @staticmethod
    def _EOC(f):
        l = f.readline() #incomplete line
        l = f.readline()
        if not l: return #EOF
        readID = l.split()[0]
        p = f.tell()
        l = f.readline()
        while l and (l.split()[0] == readID): #keep reading lines until the read ID changes
            p = f.tell()
            l = f.readline()
        f.seek(p) #revert to the start of the first line with a new read ID

    @staticmethod
    def parse(chunk):
        lines = chunk.splitlines()
        N = len(lines)
        i = 0 
        while i < N:
            readID = lines[i].split('\t')[0]
            j = i
            while lines[j].split('\t')[0] == readID:
                j += 1
                if j == N:
                    break
            yield lines[i:j]
            i = j

This time, the end of a chunk is located by reading lines until there is a switch in the read id, whereupon we revert one line. For parsing, we yield all the different locations a given read aligns to as a single entry.
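
As a quick illustration, using made-up minimal lines rather than real bowtie output, parse groups consecutive lines that share a read id:


chunk = "r1\tchrI\t100\nr1\tchrII\t200\nr2\tchrI\t300"
for entry in Chunker_BWT.parse(chunk):
    print(entry)
#['r1\tchrI\t100', 'r1\tchrII\t200']
#['r2\tchrI\t300']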

Hopefully these examples show you how the parent class can easily be expanded upon for most file types. Let’s now combine these various chunkers with the code from the previous post to show how we can enable multicore parallel processing of the chunks they yield. The code below contains a few generalised wrapper functions which work in tandem with any of the above chunkers to allow most tasks to be parallelised.


import multiprocessing as mp, sys

def _printMP(text):
    print text
    sys.stdout.flush()

def _workerMP(chunk,inFile,jobID,worker,kwargs):
    _printMP("Processing chunk: "+str(jobID))
    output = worker(chunk,inFile,**kwargs)
    _printMP("Finished chunk: "+str(jobID))
    return output	

def main(inFile,worker,chunker=Chunker,cores=1,kwargs={}):
    pool = mp.Pool(cores)

    jobs = []
    for jobID,chunk in enumerate(chunker.chunkify(inFile)):
        job = pool.apply_async(_workerMP,(chunk,inFile,jobID,worker,kwargs))
        jobs.append(job)

    output = []
    for job in jobs:
        output.append( job.get() )

    pool.close()
    
    return output

The main function should be recognisable as the code from the previous post. It generates the pool of workers, ideally one for each core, before using the given chunker to split the corresponding file into a series of chunks for processing. Unlike before, we collect the output given by each job and return it after processing is finished. The main function acts as a wrapper, allowing us to specify different processing functions and different chunkers via the variables worker and chunker respectively. We have wrapped the processing function call within the function _workerMP, which prints to the terminal as tasks are completed. It uses the function _printMP to do this, as you need to flush stdout after a print statement when using multicore processing, otherwise nothing appears until all tasks are completed.

Let’s finish by showing an example of how we would use the above to do the same task as we did at the start of this post, counting the sequences within a FASTA file, using the base chunker:


def seq_counter(chunk,inFile):
    data = Chunker.read(inFile,chunk)
    return data.count('>')

and using the FASTA chunker:


def seq_counter_2(chunk,inFile):
    data = list(Chunker_FASTA.parse(Chunker_FASTA.read(inFile,chunk)))
    return len(data)

And here is the time they take to count the sequences within the 14GB file from before:


>>> os.path.getsize("SRR951828.fa")
14944287128
>>> x=time.time();f = open("SRR951828.fa");sum([l.count(">") for l in f]);time.time()-x
136829250
190.05533599853516
>>> x=time.time();f = open("SRR951828.fa");sum([c.count(">") for c in iter(lambda: f.read(1024*1024),"")]);time.time()-x
136829250
26.343637943267822
>>> x=time.time();sum(main("SRR951828.fa",seq_counter,cores=8));time.time()-x
136829250
4.36846399307251
>>> x=time.time();main("SRR951828.fa",seq_counter_2,Chunker_FASTA,8);time.time()-x
136829250
398.94060492515564

Let’s ignore that last one, as the slowdown is due to turning the entries into BioPython SeqRecords. The prior one, which combines chunking and multicore processing, gives over a 40-fold speed-up (190s down to 4.4s). I’m sure this could be further reduced by using more cores and/or optimising the chunk size; however, this difference alone can change something from being computationally implausible to plausible. Not bad for only a few lines of code.

Anyway, as before, I hope that some of the above was either new or even perhaps helpful to you. If you know of a better way to do things (in python), then I’d be very interested to hear about it. If I feel like it, I may follow this up with a post about how to integrate a queue into the above which outputs the result of each job as it is produced. At present, we collate all the results in memory, which has the potential to cause a memory overflow depending on what is being returned.

Processing large files using python

In the last year or so, and with my increased focus on ribo-seq data, I have come to fully appreciate what the term big data means. The ribo-seq studies in their raw forms can easily reach into hundreds of GBs, which means that processing them in both a timely and efficient manner requires some thought. In this blog post, and hopefully those following, I want to detail some of the methods I have come up with (read: pieced together from multiple stack exchange posts) that help me take on data of this magnitude. Specifically, I will be detailing methods for python and R, though some of the methods are transferable to other languages.

My first big data tip for python is learning how to break your files into smaller units (or chunks) in a manner that lets you make use of multiple processors. Let’s start with the simplest way to read a file in python.


with open("input.txt") as f:
    data = f.readlines()
    for line in data:
        process(line)

The mistake made above, with regards to big data, is that it reads all the data into RAM before attempting to process it line by line. This is likely the simplest way to cause the memory to overflow and an error to be raised. Let’s fix this by reading the data in line by line, so that only a single line is stored in RAM at any given time.


with open("input.txt") as f:
    for line in f:
        process(line)

This is a big improvement, namely that it doesn’t crash when fed a big file (it’s also shorter!). Next we should attempt to speed this up a bit by making use of all those otherwise idle cores.


import multiprocessing as mp

#init objects
cores = mp.cpu_count() #for example, one worker per core
pool = mp.Pool(cores)
jobs = []

#create jobs
with open("input.txt") as f:
    for line in f:
        jobs.append( pool.apply_async(process,(line,)) )

#wait for all jobs to finish
for job in jobs:
    job.get()

#clean up
pool.close()

Provided the order in which you process the lines doesn’t matter, the above generates a set (pool) of workers, ideally one for each core, before creating a bunch of tasks (jobs), one for each line, for the workers to do. I tend to use the Pool object provided by the multiprocessing module due to its ease of use; however, you can spawn and control individual workers using mp.Process if you want finer control. For mere number crunching, the Pool object is very good.
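
For reference, here is a minimal sketch of the mp.Process route mentioned above; the worker function and its inputs are placeholders rather than anything from this post:


import multiprocessing as mp

def work(line):
    pass #placeholder: the actual processing would go here

if __name__ == '__main__':
    #spawn and manage one process per task by hand
    procs = [mp.Process(target=work,args=(line,)) for line in ("foo","bar")]
    for p in procs:
        p.start()
    for p in procs:
        p.join() #wait for each worker to finish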

While the Pool version above now makes use of all those cores, it sadly runs into memory problems once again. We specifically use the apply_async function so that the pool isn’t blocked while each line is processed. However, in doing so, all the data is read into memory once again; this time stored as the individual lines associated with each job, waiting in line to be processed. As such, the memory will again overflow. Ideally the method would only read a line into memory when it is its turn to be processed.


import multiprocessing as mp

def process_wrapper(lineID):
    with open("input.txt") as f:
        for i,line in enumerate(f):
            if i != lineID:
                continue
            else:
                process(line)
                break

#init objects
pool = mp.Pool(cores)
jobs = []

#create jobs
with open("input.txt") as f:
    for ID,line in enumerate(f):
        jobs.append( pool.apply_async(process_wrapper,(ID,)) )

#wait for all jobs to finish
for job in jobs:
    job.get()

#clean up
pool.close()

Above we’ve now changed the function fed to the pool of workers to include opening the file, locating the specified line, reading it into memory, and then processing it. The only input now stored for each job spawned is the line number, thereby preventing the memory overflow. Sadly, the overhead involved in having to locate the line by reading iteratively through the file for each job is untenable, getting progressively more time-consuming as you get further into the file. To avoid this we can use the seek function of file objects, which skips you to a particular location within a file. Combining this with the tell function, which returns the current location within a file, gives:


import multiprocessing as mp

def process_wrapper(lineByte):
    with open("input.txt") as f:
        f.seek(lineByte)
        line = f.readline()
        process(line)

#init objects
pool = mp.Pool(cores)
jobs = []

#create jobs
with open("input.txt") as f:
    nextLineByte = f.tell()
    for line in f:
        jobs.append( pool.apply_async(process_wrapper,(nextLineByte,)) )
        nextLineByte = f.tell()

#wait for all jobs to finish
for job in jobs:
    job.get()

#clean up
pool.close()

Using seek we can move directly to the correct part of the file, whereupon we read a line into memory and process it. We have to be careful to correctly handle the first and last lines, but otherwise this does exactly what we set out to do, namely using all the cores to process a given file while not overflowing the memory.

I’ll finish this post with a slight upgrade to the above, as there is a reasonable amount of overhead associated with opening and closing the file for each individual line. If we process multiple lines of the file at a time as a chunk, we can reduce these operations. The biggest technicality when doing this is noting that when you jump to a location in a file, you are likely not located at the start of a line. For a simple file, as in this example, this just means you need to call readline, which reads to the next newline character. More complex file types likely require additional code to locate a suitable location to start/end a chunk.


import multiprocessing as mp,os

def process_wrapper(chunkStart, chunkSize):
    with open("input.txt") as f:
        f.seek(chunkStart)
        lines = f.read(chunkSize).splitlines()
        for line in lines:
            process(line)

def chunkify(fname,size=1024*1024):
    fileEnd = os.path.getsize(fname)
    with open(fname,'r') as f:
        chunkEnd = f.tell()
        while True:
            chunkStart = chunkEnd
            f.seek(size,1)  #jump ahead by roughly one chunk
            f.readline()    #read to the end of the current line
            chunkEnd = f.tell()
            yield chunkStart, chunkEnd - chunkStart
            if chunkEnd >= fileEnd:
                break

#init objects
pool = mp.Pool(cores)
jobs = []

#create jobs
for chunkStart,chunkSize in chunkify("input.txt"):
    jobs.append( pool.apply_async(process_wrapper,(chunkStart,chunkSize)) )

#wait for all jobs to finish
for job in jobs:
    job.get()

#clean up
pool.close()

Anyway, I hope that some of the above was either new or even perhaps helpful to you. If you know of a better way to do things (in python), then I’d be very interested to hear about it. In another post coming in the near future, I will expand on this code, turning it into a parent class from which to create multiple children to use with various file types.

Racing along transcripts: Correlating ribosome profiling and protein structure.

A long long time ago, in a galaxy far, far away, I gave a presentation about the state of my research to the group (can you tell I’m excited for the new Star Wars?). Since then, little has changed due to my absenteeism from Oxford, which means ((un)luckily) the state of the work is by and large the same. My work focusses on the effect that the translation speed of a given mRNA sequence can have on the eventual protein product, specifically through the phenomenon of cotranslational folding. I’ve discussed the evidence behind this in prior posts (see here and here), though I find the video below a good reminder of why we can’t always just go as fast as we like.

So given that translation speed is important, how do we in fact measure it? Traditional measures, such as tAI and CAI, infer it from the codon bias within the genome or by comparing the counts of tRNA genes in a genome. However, while these have been shown to relate somewhat to speed, they are still solely theoretical in their construction. An alternative is ribosome profiling, which I’ve discussed in depth before (see here), and which provides an actual experimental measure of the time taken to translate each codon in an mRNA sequence. In my latest work, I have compiled ribosome profiling data from 7 different experiments, covering 6 diverse organisms, and processed them all in the same fashion from their respective raw data. Combined, the dataset gives ribosome profiling “speed” values for approximately 25 thousand genes across the various organisms.

[Figure: correlations between the ribosome profiling data and the CAI, MinMax, nTE, and tAI measures.]

Our first task with this dataset is to see how well the traditional measures compare to the ribosome profiling data. For this, we calculated the correlation against CAI, MinMax, nTE, and tAI, with the results presented in the figure above. We find that basically no measure adequately captures the entirety of the translation speed: some measures fail completely, others obviously capture some part of the behaviour, and some others even predict the reverse! Given that no measure captured the behaviour adequately, we realised that existing results relating translation speed to protein structure may, in fact, be wrong. Thus, we decided to recreate those analyses using our dataset to either validate or correct the original observations. To do this we combined our ribosome profiling dataset with matching PDB structures, such that we had the sequence, the structure, and the translation speed for approximately 4500 genes over the 6 species. While I won’t go into details here (see upcoming paper – touch wood), we analysed the relationship between the speed and the solvent accessibility, the secondary structure, and linker regions. We found striking differences to the observations found in the literature that I’ll be excited to share in the near future.

Journal Club: Mechanical force releases nascent chain-mediated ribosome arrest in vitro and in vivo

For this week’s journal club, I presented the paper by Goldman et al., “Mechanical force releases nascent chain-mediated ribosome arrest in vitro and in vivo”. The reason for choosing this paper is that it discusses an influence on protein folding/creation/translation that is not considered in any of today’s modelling efforts, and I think it is massively important that every so often we, as a community, step back and appreciate the complexity of the system we attempt to understand. This work focuses on the SecM protein, which is known to regulate SecA (part of the translocon), which in turn regulates SecM. The biomechanical manner in which this regulation takes place is not fully understood. However, SecM contains within its sequence a peptide motif that binds so strongly to the ribosome tunnel wall that translation is stopped. It is hypothesised that SecA regulates SecM by applying a force to the nascent chain to pull it past this stalling point and, hence, allow translation to continue.

To begin their study, Goldman wanted to confirm that one could advance past the stall point merely by the application of force, which they could do by attaching the nascent chain to optical tweezers and the ribosome to a micropipette. However, to confirm that the system was stalled before applying a (larger) force, they created a sequence which included CaM, a protein which periodically hops between a folded and unfolded state when pulled at 7pN, followed by the section of SecM which causes the stalling. The tweezers were able to sense the slight variations in length at 7pN from the unfolding and refolding of CaM, though no continuing extension, which would indicate translation, was found. This indicated the system had truly stalled due to the SecM sequence. Once at this point, Goldman increased the applied force, at which point the distance between the pipette and the optical tweezers slowly increased until detachment when the stop codon was reached. As well as confirming that force on the nascent chain could make the SecM system proceed past the stalling point, they also noted a force dependence to the speed with which it would overcome this barrier.

Protein folding near the ribosome tunnel exit can rescue SecM-mediated stalling

With this force dependence established, they pondered whether a domain folding upchain of the stall point could generate enough force to cause translation to continue. To investigate, Goldman created a protein that contained Top7, followed by a linker of variable length, followed by the SecM stalling motif, which was in turn followed by GFP. Shown in the figure above, altering the length of the linker region defined the location of Top7 while it attempts to fold. A long linker allows Top7 to fold completely clear of the ribosome tunnel. A short linker means that it can’t fold, due to many of its residues being inside the ribosome tunnel. Between these extremes, however, the protein may have only a few residues within the tunnel, and by stretching the nascent chain it may access them so as to be able to fold. In addition, Top7 was chosen specifically as it was known to fold even under light force. Hence, by Newton’s third law, as Top7 folds while its C-terminus is under strain into the ribosome, it generates an equal and opposite force on the stalling peptide sequence within the heart of the ribosome tunnel, which should allow translation to proceed past the stall. Crucially, if Top7 folded too far away from the ribosome, this interaction would not occur and translation would not continue.

Goldman’s experiments showed that this is in fact the case; they found that only linkers of 15 to 22 amino acids would successfully complete translation. This confirms that a protein folding at the mouth of the ribosome tunnel can generate sizeable force (they calculate roughly 12pN in this instance). Now, I find this whole system especially interesting as I wonder how it may generalise to all translation, both in terms of interactions of the nascent chain with the tunnel wall and of domains folding at the ribosome tunnel mouth. Should I consider these when I calculate translation speeds, for example? Oh well, we need a reasonable model for translation that ignores these special cases first, before I really need to worry!

How do we measure translation speed?

Two major trains of thought have emerged on how one can define translation speed: one uses the cognate tRNA concentrations and the other the codon bias within the genome. The former is a natural measure; the amount of cognate tRNA available to the ribosome for a given codon has been shown to affect translation. In the extreme case, when no cognate tRNA is available, the ribosome is even found to detach from the transcript after a period of waiting. The latter, the codon bias, is the relative quantities of codons found within a synonymous group. The codons found most often are assumed to be optimal, as it is hypothesised that the genome will be optimised to produce proteins in the fastest, most efficient manner. Lastly, there is a newer third school of thought where one has to balance both the supply and the usage of any given codon: namely, if a codon is overused it will actually have a lower tRNA concentration than would be suggested by its tRNA gene copy number (an approximation of the tRNA’s concentration). Each of these three descriptions has been used in its own respective computational studies to show the association of the speed, represented as each measure, to protein structure.

A simplified schematic of ribosome profiling. Ribosome profiling begins with separating a cell’s polysomes (mRNA with ribosomes attached) from its lysate. Erosion by nuclease digestion removes all mRNA not shielded by a ribosome while also cleaving ribosomes attached to the same mRNA strand. Subsequent removal of the ribosomes leaves behind only the mRNA fragments which were undergoing translation at the point of cell lysis. Mapping these fragments back to the genome gives a codon-level resolution transcriptome-wide overview of the translation occurring within the cell. From this we can infer the translation speed associated with any given codon from any given gene.

However, while these definitions have been in existence for the past few decades, there has been no objective way, till now, to test how accurately they actually measure translation speed. Essentially, we have based years of research on the extrapolation of a few coarse experiments, or in some cases purely theoretical models, to all translation. There now exists an experimental measure of the translation occurring in vivo. Ribosome profiling, outlined above, measures the translation occurring within a cell, mapping the positions of the ribosomes on the genome at the point of cell lysis. Averaging over many cells gives an accurate measure of the expected translation occurring on any given transcript at any time.

Comparing the log-transformed ribosome profile data to the translation speed as defined by each of the algorithms for B. subtilis. We show the mean ribosome occupancy against the mean translation speed when stratified by codon, finding that the assigned values for each algorithm fail to capture the variation of the ribosome profiling data.

As an initial comparison, shown above, we compared some of the most popular speed measures based on the above descriptions to the ribosome profiling data. None of the measures were found to recreate the ribosome profiling data adequately. In fact, while some association is found, it is opposite to what we would expect: the faster the codon according to the algorithm, the more likely a ribosome is to occupy it! We thought that this may be due to treating all the codons together instead of with respect to the genes they are from. Essentially, is a given codon actually fast if it is just within a gene that is in general fast? To test for this, we created a set of models which account for a shift in the ribosome profile data depending on the source gene. However, these showed even less association to the speed algorithms!

These findings suggest that the algorithms that the scientific community has based its work on for the past decades may in fact be poor representations of translation speed. This leads to a conundrum, however, as these measures have been made use of in experimental studies, namely the paper by Sander et al. (see journal club entry here). In addition, codon bias matching has been used extensively in increasing expression of non-native proteins in bacteria. Clearly these algorithms are a measure of something and, as such, this contradiction needs to be resolved in the near future.

Protein Folding: Man vs Machine

In 1996 Garry Kasparov, the reigning world chess champion, played IBM’s Deep Blue, a computer whose sole purpose was to play chess better than any human. Losing the first game, Garry sprang back swiftly, defeating Deep Blue 4-2 over the remaining games. However, his success was short-lived. In a rematch with an updated Deep Blue the following year, the score was 3.5-2.5 to the computer. The media (and IBM) declared this a pivotal moment in history, where a machine had proven itself better than humanity’s champion at a game deemed a highly intellectual pursuit. The outcry was that the age of machines had arrived. Was it true? Should humanity have surrendered to machine overlords at that moment? Obviously the answer is a large and resounding no. However, this competition allows for insightful comparison between the manner in which humans and computers play chess and think. By comparing the two, we learn the strengths and weaknesses of both parties, from which we can make combined approaches that may exceed either.

Firstly, let’s discuss the manner in which a computer “plays” chess. It simply searches the possible configurations of moves that are available and picks the most optimal. However, things are not that simple. Consider only the opening sequence: there are 20 possible moves a player can make, so after only a single move by each player there are 400 possible chess positions. This count grows exponentially fast; after just 5 single moves (5 ply) there are approximately 5 million possible positions. For comparison, it was estimated that Deep Blue could analyse 200 million positions per second. However, since this is not nearly fast enough to examine all possible games from start to end in a reasonable time scale, computers cannot foresee lines of play which are far in the distance. To overcome this, in the early game the computer will use a reference table developed by grandmasters that lists both common openings and the assumed best manner to respond to them. Obviously, these are only assumed optimal and have never been completely tested. In short, machines participate through brute force, utilising their intricate ability to perform calculations at high speed to find the best move. However, the search is too large in the initial and end stages of a game to be completely thorough, so a reference table is instead used to “inform” the correct move at these times.

While a human can quite easily see that such a board position leads to a draw, computers cannot reach the same conclusion without huge effort.

In contrast, human players use far more visual and spatial recognition, alongside both memory and calculation, to pick their moves. Like a computer, a player will analyse a portion of the moves available at any given moment. Though since a human cannot compete with a computer on computation speed, they cannot analyse nearly the same magnitude of moves. Hence, this subset of moves chosen for analysis must contain the most optimal move(s) to compete against the computer’s raw power. This is where the visual and spatial recognition abilities of humans come to bear. Firstly, a human can easily dissect the board into pieces worth considering and those to be ignored. For example, consider a possible move that would result in your queen being exposed and then taken. A human would conclude this is bad (normally) and discard further moves leading from such a play. A computer, however, would explore the resultant board state. One can see how this immediately and drastically reduces the required search. Another human ability is that a player will often be able to see sub-structures within a full set-up that are common in the game and hence can be processed in a known manner. In other words, the game is broken down into fragments which can be processed far more easily and with less computation. Obviously, both of the above techniques rely on prior knowledge of chess to be useful, but they are based upon our human ability to perceive both the substructure of the game and the overall picture with relative ease.

So how does all this chess talk relate to protein folding? In 2010, the Baker group, creators of the ROSETTA protein fold prediction program, produced the protein folding game “Foldit”. In Foldit the general public could attempt to fold proteins for themselves and try to get closer to the native structure than the computer algorithms. Though obviously simplified in presentation compared to academic structural biology, it was hoped that the visual and spatial reasoning abilities of humans, the same ones that differentiated them from machines at chess, would prove useful in protein structure prediction. A key issue within ROSETTA drove this train of thought: the fact that it is relatively bad at fully exploring the conformation space. Often, it will get stuck in one general configuration and not explore the fold space fully. Furthermore, this is not easily overcome with simulated annealing due to the sheer scale of the problem. The ability of humans to view the overall picture meant that it should be easier for them to see other possible configurations. As end goals for Foldit, it was hoped that structures that proved unsolvable by current algorithms would be solved by humans, and also that new techniques would emerge as the “moves” employed by players to achieve high scores could be studied.

To make a comparison of the structures produced by Foldit players and ROSETTA viable, the underlying energy “scores” that judge a structure are the same between the programs. It is assumed, though is not always true, that the better the score the closer you are to the native fold. In addition, Foldit players were also able to use a set of deterministic optimisation tools that would alter the backbone and side chains to the most optimal local configuration for the arrangement the player had made. This meant that players could focus predominantly on altering the overall structure of the protein rather than the fine detail, such as the positions of side chains. To make the game as approachable as possible, technical terms were replaced by common analogues and visual cues were displayed to highlight poorly scoring areas of the protein. For example, clashes between atoms are shown via large spiked red orbs, while the backbone is coloured from green to red depending on how well buried the hydrophobic residues on that segment are. To drive players, gamification elements were also included, such as leader boards and rewarding “fireworks” as graphical effects.

To objectively compare the ability of the player base to that of the ROSETTA algorithm, they performed blind predictions on a set of 10 proteins whose structures were not in the public domain. This was run in a similar manner to CASP, for those familiar with that set-up. The results exemplified the innate human ability of visual and spatial recognition. In 5 of the cases the player base performed significantly better than the ROSETTA program. In 3 of the cases they performed similarly. And in the remaining 2 cases the ROSETTA algorithm performed better, though in both of these the model produced was still extremely far from the native structure. Looking through the cases individually, it was identified that the most crucial element was that players were able to deal with large rearrangements that ROSETTA struggled with, including register shifts and strand swapping. This highlights the ability of humans to view the overall picture and to persevere through “bad scoring patches” to reach a more optimal configuration.

Comparison of Foldit players’ solutions (green) to ROSETTA’s solutions (red) and the native 2KPO protein structure (blue). The players correctly identified a strand swap needed to reach the native form, while this large reconfiguration was not seen by ROSETTA.

Since the release of the game and the accompanying paper in 2010, Foldit has received much praise for conveying the field of protein folding in an approachable manner to so many people. In addition, the player base has contributed to science as a whole. In 2011 the player base successfully solved the structure of an M-PMV protein, a retroviral protein whose structure was unobtainable via normal means. Then in 2012, by analysing the common sets of moves employed by the player base, they collectively produced an algorithm that outperforms previously published fold prediction methods. Personally, I think of Foldit as a fun and relatively intuitive game that introduces the core elements of the protein folding problem. As to its scientific merit, I’m unsure how much impact it will continue to have. As Saulo discussed last week, if infinite monkeys have infinite time then Shakespeare will be reproduced. Likewise, if enough people manipulate a protein structure, eventually the best structure will be found. Though who am I to judge; if people find the game fun, then there are far worse pastimes one can have than trying to solve structures. As a finishing note, I would be extremely interested in using Foldit to teach structural biology in the future, though I feel it is overall too simple for a university setting.

Introduction to the protein folding problem

Recently (read: this week), I had to give a presentation on my research to my college. We were informed that the audience would be non-specialist, which in fact turned out to be an understatement. For example, my presentation followed on from a discussion of the differences in the education systems of North and South Korea for the period 1949-1960. Luckily, I had tailored my entire talk to be understandable by all and devoid of all jargon. I chose to deviate from the prescribed topic and, instead of talking about my research specifically, to discuss the protein folding problem in general. Below you’ll find the script I wrote, which I feel gives a good introduction to the core problem of this field.

———————-

The protein folding problem is one of the great projects within the life sciences. It has been studied by vast numbers of great scientists over the last half century, with backgrounds including chemistry, physics, maths and biology, and all have been beaten by the sheer complexity of the problem. As a community we still have only scraped the surface with regards to solving it. While I could, like many of you here, go into my own research in great detail and bore you wholeheartedly for the next 10 minutes with technical details and cryptic terminology, I will instead try to give an overview of the problem and why thousands of scientists around the globe are still working on cracking this.

First of all, I guess that a few of you are trying to remind yourselves what a protein is; the horror that is high school biology crawling back from that area of your brain you keep for traumatic experiences like family gatherings. Luckily, I’m fairly new to the topic myself, my background being in physics and chemistry, so hopefully my explanation will still be in the naive terms that I use to explain the core concepts to myself. Proteins are the micro-machines of your body: the cogs that keep the wheels turning, the screws that hold the pieces together, and the pieces themselves. Proteins run nearly all aspects of your body and biochemistry: your immune system, your digestion and your heart beating. There are approximately between 20 and 30 thousand different proteins in your body, depending on who you ask, and trillions overall. In fact, if we take every protein in our body and scale it up to the size of a penny, the proteins in a single human, albeit a rather dead human, would be enough to fill the entire Pacific Ocean. Basically, there are a hell of a lot of proteins, with a vast range of different types, each of which is very individual, both in its composition and function, and, crucially, they are nearly all essential. The loss of any protein can lead to dramatic consequences including heart disease, cancer, and even death.

So now that you know that they are important and there are lots of them, what exactly is a protein? The easiest analogy I have is that of a pearl necklace, a long string of beads in a chain. Now consider that your significant other has gone slightly insane and, instead of purchasing jewellery for you that consists of a single bead type, or even two if you have slightly exotic tastes, they have been shopping at one of the jewellery stores found in the part of town that smells rather “herby”. You receive a necklace which has different beads across its entire length. We have blues, yellows, pinks, and so on and so forth. In fact, we have 20 different types of beads, each with its own colour. This is basically a protein chain: each of the beads represents one of the twenty essential amino acids, each of which has its own chemical and physical properties. Now suppose you can string those pearls together in any order: red, green, blue, blue, pink, etc. It turns out that the specific order in which these beads are arranged along the length of the protein chain defines exactly how this chain “crumples” into a 3D shape. If you think that adding an extra dimension is impossible, just consider crumpling a piece of paper; that is a 2D –> 3D transition (mathematicians, please bite your tongues). Now one string of colours, blue, blue, pink for example, will crumple into one shape, and that shape may become your muscle, while a different sequence, say green, blue, orange, will crumple down into something different, for example an antibody to patrol your blood stream.

So essentially we have this “genetic code”, the sequence of amino acids (or beads), which in turn defines the shape that the protein will take. We in fact know that this shape is the most important aspect of any protein, having been found to define the protein’s actual function. This is because, returning to the bead analogy, we can change up to 80% of the beads to different colours while still retaining the same shape and function. This is amazing when you consider how few other objects can have their baseline composition changed to the same extent while still retaining the same function. The humble sausage is one of those objects (actually, below 40% meat content they are referred to as “bangers”), but even then, would you want 80% of your sausage to be filler? There is a reason Tesco value sausages taste so different to the nice ones you buy at the butchers. Returning to proteins, we are not trying to say that the sequence isn’t important; sometimes changing just a single bead can lead to a completely different shape. Instead, it says that the shape is the critical aspect which defines the function. To summarise: sequence leads to shape, which in turn leads to function.

This is unfortunate because, while it is getting increasingly simple to experimentally determine the sequence of a protein, that is, the exact order of coloured beads, the cost and time of getting the corresponding structure (shape) are still extremely prohibitive. In fact, we can look at the two major respective databases, the PDB, which contains all known protein structures, and GENBANK, which contains all known protein sequences, and compare the respective numbers of entries. The disparity between the two is huge; we are talking orders of magnitude huge, 10^15 huge, i.e. the-number-of-humans-on-the-planet-squared huge. AND this gap is growing larger every year. Basically, people in the last few years have suddenly gained access to cheap and fast tools to get a protein’s sequence, to the extent that people are widely taking scoops of water across the world and sequencing everything, not even bothering to separate the cells and microorganisms beforehand. Nothing analogous exists to get the structure of a protein. The process takes months to years, depending on many factors, each of which may be “something” for one protein and then a completely different “something” for a similar protein. This has led to a scenario where we know the sequence of every protein in the human genome, yet we know the structure of only about 10% of them. This is utterly preposterous in my opinion, given how important this information is to us! We basically don’t know what 90% of our DNA does!

Basically, until an analogous method for structure determination is produced, we have no choice but to turn to predictive methods to suggest the function of proteins for which we do not have the structure. This is important as it allows us, to some degree, to target proteins that we “think” may have an important effect. If we didn’t do this, we would simply be searching for needles in haystacks. This is where my research, and that of my group, kicks in. We attempt to take these sequences, these strings of beads, and predict the shape that they produce. Unfortunately, the scientific community as a whole still relatively sucks at this. Currently, we are only successful in predicting structures for very small proteins, and when anything more complex is attempted we, in general, fail utterly miserably. In my opinion, this is because the human body is by far the most complex system on the planet and, so far, we have tried to simply superimpose physics on top of the problem. This has failed miserably due to the sheer complexity and multitude of factors involved. Physics has mostly nice vacuums and pleasant equations; however, ask a physicist about a many-body system and they will cry. So many factors are involved that we must integrate them all together, which is why there are so many people working on this, and will be for many years to come. Well, I guess that’s good news for my future academic career.

Anyway, I hope this talk has given you some degree of insight into the work I do and you have learned something about how your body works. For those extremely interested, please feel free to approach me later and I will happily regale you with the exact aspect of protein folding I work on. But for now I would love to try and answer any questions you all have on the content contained in this talk.

Expanding Anfinsen’s Principle (Journal Club)

Paper: Expanding Anfinsen’s Principle: Contributions of Synonymous Codon Selection to Rational Protein Design.

In 1961, Anfinsen performed his now (in)famous denaturing experiment upon ribonuclease A, a small one hundred residue globular protein. He showed that it could both unfold and refold via the addition and subsequent removal of chemical substances. From this he concluded that a protein’s fold is that of its global free energy minimum and, consequently, all the information required to know the folded structure of a given protein is solely encoded within its sequence of amino acids. In 1972, Anfinsen was awarded the Nobel prize for this work from which stemmed the vast field of protein folding prediction, a global arms race to see who could best predict/find the elusive global minimum for any given protein.

Unfortunately, protein fold prediction is still in its infancy with regards to its predictive power. As a scientific community, we have made huge progress using homology models, whereby we use the structure of a protein with a similar sequence to the one under investigation to provide a reasonable starting point for refinement. However, when there is no similar structure in existence, we struggle abysmally, being forced to resort to de novo models. This lack of ability when we are given solely a sequence to work with shows that our fundamental understanding of the protein folding process must be incomplete.

An increasingly common viewpoint, one that is at odds with Anfinsen’s conclusions, is that additional information is required for a protein to fold. One suggested source of information is the production of the protein itself at the ribosome. In a process coined cotranslational folding, it has been shown that a protein undergoing synthesis will fold as it emerges from the ribosome, not waiting until the entire sequence is synthesised. As such, the energy landscape that the protein must explore to fold is under constant change and development as more and more of the protein emerges from the ribosome. It is an iterative process of smaller searches as the energy landscape is modulated in steps towards that of the complete amino acid sequence.

Another suggested source of information is within the degeneracy observed within the genetic code. Each amino acid is encoded by up to 6 different codons, and as such, one can never determine exactly the coding DNA that created a given protein. While this degeneracy has been suggested as merely an effect to reduce the deleterious nature of point mutations, it has also been found that each of these codons is translated at a different rate. It is evident that information is consumed when RNA is converted into protein at the ribosome, since reverse translation is impossible, and it is hypothesised that these variations in speed can alter the final protein structure.

Figure 1. Experimental design for kinetically controlled folding. (a) Schematic of YKB, which consists of three half-domains connected by flexible (AGQ)5 linkers (black lines). The Y (yellow) and B (blue) half-domains compete to form a mutually exclusive kinetically trapped folded domain with the central K (black) half-domain. The red wedge indicates the location of synonymous codon substitutions (see text). (b) Energy landscapes for proteins that fold under kinetic control have multiple deep minima, representing alternative folded structures, separated by large barriers. The conformations of the unfolded protein and early folding intermediates (colored arrows) determine the final folded state of the protein. Forces that constrict the unfolded ensemble (grey cone) can bias folding toward one structure. (c) During translation of the nascent chain by the ribosome (orange), folding cannot be initiated from the untranslated C-terminus, which restricts the ensemble of unfolded states and leads to the preferential formation of one folded structure. Image sourced from J. Am. Chem. Soc., 2014, 136(3).

The journal club paper by Sander et al. looked experimentally at whether both cotranslational folding and codon choice can have an effect on the resultant protein structure. This was achieved through the construction of a toy model protein, consisting of three half-domains as shown in Figure 1. Each of these half-domains was sourced from bifluorescent proteins, a group of protein half-domains that fluoresce when combined. The second half-domain (K) could combine with either the first (Y) or the third (B) half-domain to create a fluorophore, crucially in a non-reversible fashion, such that once a full domain was formed it could not form the other. By choosing fluorophores that differed in wavelength, it was simple to measure the ratio in which the two species, YK-B and Y-KB, were formed.

They found that the ratio of these two species differed between in-vitro and in-vivo formation. When denatured Y-K-B species were allowed to refold, an even mixture was produced, with both species found to be equally likely to form. In contrast, when synthesised at the ribosome, the protein showed an extreme bias towards forming the YK-B species, as shown in Figure 2. They concluded that this is caused by cotranslational folding, the half-domains Y and K having time to form the YK species before B had finished being produced. As pointed out by some members of the OPIG group, it would have been interesting to see whether similar results were produced if the order were reversed, such that B was synthesised first and Y last, but this point does not invalidate what was reported.

Figure 2. Translation alters YKB folded structure. (a) Fluorescence emission spectra of intact E. coli expressing the control fluorescent protein constructs YK (yellow) or KB (cyan). (b) Fluorescence emission spectra of intact E. coli expressing YKB constructs with common or rare codon usage (green versus red solid lines) versus the same YKB constructs folded in vitro upon dilution from a chemical denaturant (dashed lines). Numbers in parentheses correspond to synonymous codon usage; larger positive numbers correspond to more common codons. (c) E. coli MG1655 relative codon usage(3) for codons encoding three representative YKB synonymous mutants: (+65) (light green), (−54) (red), and (−100) (pink line). Image sourced from J. Am. Chem. Soc., 2014, 136(3).

Following the above, they also probed the role of codon choice using this toy model system. They varied the codon choice over a small segment of residues between the K and B half-domains, such that they had a multitude of species encoded either “faster” or “slower” across this region. Codon usage was used as the measure of speed, though its appropriateness as such has yet to be established within the literature. They found that the slower species increased the bias towards YK-B over Y-KB, while the faster species reduced it. This experiment shows clearly that codon choice plays a role in a protein’s final structure, though they only show a large global effect. My work is primarily on whether codon choice has a role at the secondary structure level, so I will be avidly hoping that more experiments follow that show the role of codons at finer levels.
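To make the speed measure concrete, below is a minimal sketch of scoring synonymous sequences by mean codon usage, in the spirit of the paper’s (+65)/(−54)/(−100) constructs. The usage values and toy sequences are hypothetical placeholders, not the actual E. coli MG1655 table used by the authors.


# A minimal sketch of codon usage as a proxy for translation speed.
# The usage values are hypothetical placeholders, not real E. coli
# MG1655 frequencies.

CODON_USAGE = {
    "CTG": 0.50, "CTA": 0.04,  # Leu: common vs rare
    "GGT": 0.34, "GGA": 0.11,  # Gly: common vs rare
    "CGT": 0.38, "AGA": 0.04,  # Arg: common vs rare
}

def speed_score(cds, usage=CODON_USAGE):
    """Mean relative codon usage over a coding sequence; higher scores
    mean more common codons, which are assumed to translate faster."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    scores = [usage[c] for c in codons if c in usage]
    return sum(scores) / len(scores)

# Two synonymous sequences encoding Leu-Gly-Arg:
print(speed_score("CTGGGTCGT"))  # common codons -> "fast" variant
print(speed_score("CTAGGAAGA"))  # rare codons   -> "slow" variant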

In conclusion, Sander et al. performed one of the cleanest experimental proofs of cotranslational folding to date. Older evidence is more anecdotal in nature, with reports of protein X or Y changing in response to a single synonymous mutation. In contrast, the experiment reported here is systematic in its approach and leaves little room for doubt over the results. Secondly, and more groundbreaking, is the (again) systematic manner in which codon choice is investigated and shown to affect the global protein structure. This is one of those rare pieces of science in which the conclusions are clear and forthcoming to all readers.

Conservation of codon optimality

The unidirectional process of translation converts one-dimensional mRNA sequences into three-dimensional protein structures. The reactant, the mRNA, is built from 64 codon variants, while the product, the protein, is constructed from the 20 biologically relevant amino acids. This reduction of 64 variants to a mere 20 is indicative of the degeneracy contained within the genetic code: multiple codons encode the same amino acid, and hence any given protein can be encoded by a multitude of synonymous mRNA sequences. Current structural biology research often assumes that these synonymous mRNA sequences are equivalent, the ability of certain proteins to unfold and refold without detrimental effects having led to the assumption that the information pertaining to the folded structure is contained exclusively within the protein sequence. In actuality, mounting experimental evidence shows that synonymous sequences can in fact produce proteins with different properties; synonymous codon switches have been observed to cause a wide range of detrimental effects, such as decreases in enzyme activity, reductions in solubility, and even misfolding.
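As a quick illustration of the scale of this degeneracy, the following sketch enumerates the synonymous codons for each amino acid from the standard NCBI translation table; nothing here is specific to the project described below.


# A minimal sketch of the degeneracy of the standard genetic code.
# AAS gives the amino acid (or '*' for stop) for each of the 64 codons
# in standard NCBI table order: first base varies slowest, third fastest.
from collections import defaultdict

AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
BASES = "TCAG"

synonyms = defaultdict(list)
for i, aa in enumerate(AAS):
    codon = BASES[i // 16] + BASES[(i // 4) % 4] + BASES[i % 4]
    synonyms[aa].append(codon)

for aa, codons in sorted(synonyms.items()):
    print(aa, len(codons), " ".join(codons))

# Leucine, serine and arginine each have six synonymous codons, while
# methionine and tryptophan have only one; the number of synonymous mRNA
# sequences for a protein is thus the product of these degeneracies.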

The ribosome (yellow) passes along the mRNA, turning the sequence of codons into a protein. Under our model, the speed of the translation for each codon varies, in turn differing the time available to the nascent peptide chain to explore the fold space. Through this method codon choice becomes an additional source of structural information.

For my 10-week DTC project within the Deane group, I was tasked with resurrecting the Coding Sequence and Structure database (CSandS; read: Sea Sands) and using it to test for the evolutionary conservation of codon choice over groups of proteins with similar structures. With experimental differences noted in protein product structure between synonymous sequences, we wished to investigate whether codon choice is an evolutionary constraint on protein structure, and if so, to what degree and in what manner. To test for evolutionary conservation we combined the theories of codon optimality and cotranslational folding, our hypothesis being that the choice of codon directly affects the translation efficiency of the ribosome; consequently, different codons give the nascent polypeptide emerging from the ribosome varying amounts of time to explore the fold-space. By assigning optimality scores to each codon, we can search for regions of fast and slow translation, and then, by looking for correlations within aligned sets of structurally similar proteins, we can identify sites where codon choice is of importance. While the 10-week project focussed mainly on methodology and implementation, my preliminary results indicate that a large percentage of proteins are under selective pressure with regards to codon choice. I found that in most sets of proteins there is a greater amount of correlation than would be expected by chance; crucially, this suggests that there is additional structural information contained within the mRNA that is lost once translation occurs.
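The full methodology is in the slides linked below, but the core test can be sketched in a few lines: assign each codon an optimality score, measure how tightly aligned positions agree across a set of sequences, and compare that against a shuffled null. Everything in this sketch (the optimality values, the spread statistic, the toy alignment) is a hypothetical stand-in, not the CSandS implementation itself.


# A minimal sketch of testing for conserved codon choice across an
# alignment. Optimality values and sequences are hypothetical.
import random
import statistics

OPTIMALITY = {"CTG": 1.0, "CTA": 0.1, "GGT": 0.9,
              "GGA": 0.3, "CGT": 0.8, "AGA": 0.1}

def position_scores(cds):
    """Optimality score for each codon position (0.5 if codon unknown)."""
    return [OPTIMALITY.get(cds[i:i + 3], 0.5)
            for i in range(0, len(cds) - len(cds) % 3, 3)]

def mean_column_spread(alignment):
    """Mean per-column standard deviation of optimality; a low spread
    means the sequences agree on codon choice at each position."""
    columns = zip(*(position_scores(s) for s in alignment))
    return statistics.mean(statistics.pstdev(col) for col in columns)

def permutation_p(alignment, trials=1000):
    """Crude p-value: how often does shuffling codons within each
    sequence agree at least as well as the observed alignment?"""
    observed = mean_column_spread(alignment)
    hits = 0
    for _ in range(trials):
        shuffled = []
        for s in alignment:
            codons = [s[i:i + 3] for i in range(0, len(s), 3)]
            random.shuffle(codons)
            shuffled.append("".join(codons))
        if mean_column_spread(shuffled) <= observed:
            hits += 1
    return hits / trials

aln = ["CTGGGTCGT", "CTGGGTCGT", "CTGGGACGT"]  # toy aligned sequences
print(mean_column_spread(aln), permutation_p(aln))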

For additional information on the background, methodology and implementation of this research, please feel free to view the presentation slides at: http://www.slideshare.net/AlistairMartin/evolut-26981404