Author Archives: Eoin Malins

New toys for OPIG

OPIG recently acquired 55 additional computers, all of the same make and model; they are of a decent specification (for 2015), each with a quad-core i5 processor and 8GB of RAM, but what to do with them? Cluster computing time!
[Image: the new cluster]

Along with a couple of support servers, this provides us with 228 computation cores, 440GB of RAM and >40TB of storage. Whilst this would be a tremendous specification for a single computer, parallel computing on a cluster is a significantly different beast.

This kind of architecture and parallelism really lends itself to certain classes of problems, especially those that have:

  • Independent data
  • Parameter sweeps
  • Multiple runs with different random seeds
  • Dirty great data sets
  • Data that can be snipped up into pieces requiring little inter-processor communication

With a single processor and a single core, a computer looks like this:
[Figure: a single processor with a single core]

These days, when multiple processor cores are integrated onto a single die, the cores are normally independent but share a last-level cache and both can access the same memory. This gives a layout similar to the following:
[Figure: two cores sharing a last-level cache and main memory]

Add more cores or more processors to a single computer and you start to tessellate the above. Each pair of cores has access to its own shared cache and its own memory, and can also access the memory attached to any other cores. However, accessing memory physically attached to other cores comes at the cost of increased latency.
[Figure: four cores across two processors, each pair with its own cache and memory]

Cluster computing, on the other hand, rarely exhibits this flat memory architecture, as no node can directly access another node’s memory. Instead we use the Message Passing Interface (MPI) to pass messages between nodes. Though it takes a little time to wrap your head around working this way, effectively every processor simultaneously runs the exact same piece of code, the sole difference being the “Rank” of the execution core. A simple example of MPI is getting every core to greet us with the traditional “Hello World” and tell us its rank (a minimal sketch of such a program follows the output below). A single execution with mpirun simultaneously executes the code on multiple cores:

$mpirun -n 4 ./helloworld_mpi
Hello, world, from processor 3 of 4
Hello, world, from processor 1 of 4
Hello, world, from processor 2 of 4
Hello, world, from processor 0 of 4
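
For the curious, here is a minimal sketch of such a program in Python using mpi4py (an assumption on my part; the original helloworld_mpi was presumably a compiled C program, but the structure is the same). mpirun starts several copies of the script, each differing only in its rank:

# helloworld_mpi.py - a minimal mpi4py sketch (assumes the mpi4py package)
from mpi4py import MPI

comm = MPI.COMM_WORLD        # the communicator containing every launched copy
rank = comm.Get_rank()       # this copy's "Rank"
size = comm.Get_size()       # how many copies mpirun started

print(f"Hello, world, from processor {rank} of {size}")

Launched with "mpirun -n 4 python helloworld_mpi.py", it produces output equivalent to the above.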

Note that the responses above aren’t in order: some cores may have been busy (for example handling the operating system) and so couldn’t run their code immediately. Another simple example of this would be a sort. We could, for example, tell every processor to take several million values, find the smallest, and send that number in a message to whichever core has “Rank 0”. The core at Rank 0 then sorts the much smaller set of values it receives (a sketch of this idea follows the results figure). Below is the kind of speedup achieved by simply splitting the same problem over 4 physically independent computers of the cluster.

[Figure: speed-up from splitting the same problem over four cluster nodes]
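
As promised, a hedged mpi4py sketch of that gather-and-sort idea (an illustration written for this post, not the code used to produce the results above):

# min_gather.py - split a fixed pile of values across the ranks; Rank 0 collects
import random
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

random.seed(rank)                                     # a different stream of values per rank
chunk = [random.random() for _ in range(4_000_000 // size)]
local_minimum = min(chunk)                            # the expensive part runs in parallel

minima = comm.gather(local_minimum, root=0)           # every rank sends its answer to Rank 0
if rank == 0:
    print(sorted(minima))                             # Rank 0 sorts the handful of local minima

Run with "mpirun -n 4 python min_gather.py"; splitting the same fixed problem across more nodes is where the speedup above comes from.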

As not everyone in the group will have the time or inclination to MPI-ify their code, there is also HTCondor. HTCondor is a workload management system for compute-intensive jobs which allows jobs to be queued, scheduled, assigned priorities and distributed from a single head node to processing nodes, with the results copied back on demand. The server OPIG provides the job distribution system, whilst SkyOctopus provides shared storage on every computation node. Should a required package not currently be available on all of the computation nodes, SkyOctopus can reach down and remotely modify the software installations on all of the lesser computation nodes.

A program to aid primary protein structure determination – 1962 style.

This year, OPIG have been doing a series of weekly lectures on papers we consider to be seminal in the field of protein informatics. I initially started looking at “Comprotein: A computer program to aid primary protein structure determination” as it was one of the earliest (1960s) papers discussing a computational method of discovering the primary structure of proteins. Many bioinformaticians use these well-formed, tidy, sterile arrays of amino acids as the input to their work, for example:

MGLSDGEWQL VLNVWGKVEA DIPGHGQEVL IRLFKGHPET LEKFDKFKHL KSEDEMKASE DLKKHGATVL TALGGILKKK GHHEAEIKPL AQSHATKHKI PVKYLEFISE CIIQVLQSKH PGDFGADAQG AMNKALELFR KDMASNYKEL GFQG
(For those of you playing at home, that’s myoglobin.)

As the OPIG crew come from diverse backgrounds and frequently ask questions well beyond my area of expertise, if for nothing other than posterior-covering, I needed to do some background reading. Though I’m not a researcher by trade any more, I began to realise that, despite all the lectures/classes/papers/seminars I’d been exposed to regarding the clever things you do with a sequence once you have it, I didn’t know how you would actually go from a bunch of cells expressing (amongst a myriad of other molecules) the protein you were interested in to the neat array of characters shown above. So, without further ado:

The first stage in obtaining your protein is cell lysis, and there’s not much in it for the cell: mangle your cells using chemicals, enzymes, sonication or a French press (not your coffee one).

The second stage is producing a crude extract by centrifuging the above cell-mangle. This, terrifyingly, appears to be done at between 10,000g and 100,000g, and removes the cellular debris, leaving it as a pellet in the bottom of the container, with the supernatant containing little but a mix of the proteins which were present in the cytoplasm, along with some additional macromolecules.

Stage three is to purify the crude extract. Depending on the properties of the protein you’re interested in, one or more of the following stages are required:

  • Reverse-phase chromatography to separate based on hydrophobicity
  • Ion-exchange to separate based on the charge of the proteins
  • Gel-filtration to separate based on the size of the proteins

If all of the above are performed, whilst the sequence of these variously charged/size-sorted/polar proteins will still be unknown, they will now be sorted into various fractions based upon their properties. This is where the third stage departs from science and lands squarely in the realm of art. The detergents/protocols/chemicals/enzymes/temperatures/pressures of the above techniques all differ depending on the hydrophobicity/charge/animal source of the type of protein one is aiming to extract.

Since at this point we still don’t know their sequence, working out the concentrations of the various constituent amino acids will be useful. One of the simplest methods of determining the amino acid concentrations of a protein is to follow a procedure similar to:

Heat the sample in 6M HCl at a temperature of 110°C for 18-24h (or more) to fully hydrolyse all the peptide bonds. This may require an extended period (over 72h) to hydrolyse peptide bonds which are known to be more stable, such as those involving valine, isoleucine and leucine. This, however, can degrade Ser/Thr/Tyr/Trp/Gln and Cys, which will subsequently skew results. An alternative is to raise the pressure in the vessel to allow temperatures of 145-155°C for 20-240 minutes.

TL;DR: Take the glassware that’s been lying about your lab since before you were born, put 6M hydrochloric acid in it and bring to the boil. Take one laboriously refined and still totally unknown protein and put it in your boiling hydrochloric acid. Seal the above glassware in order to use it as a pressure vessel. Retreat swiftly whilst the apparatus builds up the appropriate pressure and cleaves the protein as required. What could go wrong?

At this point I wondered if the almost exponential growth in PDB entries was due to humanity’s herd of biochemists having been thinned until those who remain are simply several generations’ worth of lucky.

Once we have an idea of how many of each type of amino acid make up the protein, we can potentially rebuild it. However, at this point it’s like being handed a jigsaw puzzle: we have all the pieces, and each piece can only be one of a limited selection of colours (making it a combinatorial problem), but we have no idea what the pattern on the box should be. To further complicate matters, since this isn’t being done on a single copy of the protein at a time, it’s as if someone has put multiple copies of the same jigsaw into the box.

Once we have all the pieces, a second technique is needed to determine the actual sequence. Though invented in 1950, Edman degradation appears not to have been a particularly widespread protocol, or at least it wasn’t at the National Biomedical Research Foundation from which the above paper emerged. This means of degradation tags the N-terminal amino acid and cleaves it from the rest of the protein. The cleaved residue can then be identified and the protocol repeated. Whilst this would otherwise be ideal, it suffers from a few issues: it takes about an hour per cycle, only works reliably on sequences of about 30 amino acids, and doesn’t work at all for proteins whose N-terminus is bonded or buried.

Instead, the refined protein is cleaved into a number of fragments at known points using a single enzyme. For example, trypsin will cleave on the carboxyl side of arginine and lysine residues. A second copy of the protein is then cleaved using a different enzyme, which cuts at different points. These individual fragments are then sorted as above and their individual (non-sequential) compositions determined.
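
As an aside, the cleavage rule itself is simple enough to sketch in a few lines of Python (a toy illustration based only on the rule above; real trypsin has exceptions, such as not cleaving before proline):

# A toy in-silico trypsin digest: cleave after every lysine (K) or arginine (R)
def trypsin_digest(sequence):
    fragments, current = [], ""
    for residue in sequence:
        current += residue
        if residue in ("K", "R"):      # cleave on the carboxyl side of Lys/Arg
            fragments.append(current)
            current = ""
    if current:                        # whatever remains after the final cleavage site
        fragments.append(current)
    return fragments

# The first 30 residues of the myoglobin sequence shown earlier:
print(trypsin_digest("MGLSDGEWQLVLNVWGKVEADIPGHGQEVL"))
# ['MGLSDGEWQLVLNVWGK', 'VEADIPGHGQEVL']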

For example, suppose we have a protein with the sequence ABCDE, which then gets cleaved by two different enzymes to give:
Enzyme 1 : (A, B, C) and (D, E)
Enzyme 2 : (A, B) and (C, D)

We can see that the (C, D) fragment produced by Enzyme 2 overlaps with the (A, B, C) and (D, E) fragments produced by Enzyme 1. However, as we don’t know the order in which the amino acids appear within each fragment, there are a number of different sequences consistent with the data (a small brute-force sketch of this enumeration follows the list below):

Possibility 1 : A B C D E
Possibility 2 : B A C D E
Possibility 3 : E D C A B
Possibility 4 : E D C B A
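
A small brute-force sketch (written for this post, not taken from the paper) makes the enumeration concrete: treat each fragment as an unordered bag of residues, and keep every permutation of ABCDE that can be split into both digests. The lone E is assumed to be its own single-residue fragment in the Enzyme 2 digest.

# Enumerate every sequence consistent with both (unordered) fragment sets
from itertools import permutations

def consistent(seq, fragments):
    # Can seq be split, left to right, into pieces matching the fragment
    # compositions (taken in some order)?
    if not fragments:
        return len(seq) == 0
    for i, frag in enumerate(fragments):
        n = len(frag)
        if sorted(seq[:n]) == sorted(frag):
            if consistent(seq[n:], fragments[:i] + fragments[i + 1:]):
                return True
    return False

enzyme1 = ["ABC", "DE"]        # compositions of the Enzyme 1 fragments
enzyme2 = ["AB", "CD", "E"]    # compositions of the Enzyme 2 fragments (E assumed on its own)

for perm in permutations("ABCDE"):
    seq = "".join(perm)
    if consistent(seq, enzyme1) and consistent(seq, enzyme2):
        print(seq)             # prints ABCDE, BACDE, EDCAB and EDCBA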

At this point the paper comments that such a result highlights to the biochemist that the molecule requires further work for refinement. Sadly, the above example, whilst relatively simple, doesn’t include the whole host of other issues which plague the biochemist in their search for an exact sequence.

Le Tour de Farce v2.0

In what is becoming the highlight of the year and a regular occurrence for the OPIGlets, Le Tour de Farce – the annual OPIG bike ride – took place on the 4th of June. Now in its 2.0 revision but maintaining a route similar to last year’s, 9.5 miles and several pints later, approximately 20 of us took in some distinctly pretty Oxfordshire scenery, not to mention The White Hart, The Trout, Jacobs Inn and, for some, The One and The Punter too.

[Photos from the ride]

Protein Interaction Networks

Proteins don’t just work in isolation; they form complex cliques and partnerships, while some particularly gregarious proteins take multiple partners. It’s becoming increasingly apparent that in order to better understand a system, it’s insufficient to understand its component parts in isolation, especially if the simplest cog in the works ends up being part of a system like this.

So we know what an individual protein looks like, but what does it actually do?

On a macroscopic scale, a cell doesn’t care whether the glucose it needs comes from lactose, converted by lactase into galactose and glucose, or from starch converted by amylase, or from glycogen, or from amino acids converted by gluconeogenesis. All it cares about is the glucose. If one of these multiple pathways should become unavailable, as long as the output is the same (glucose), the cell can continue to function. At a lower level, networks of cooperating proteins increase a system’s robustness to change. The internal workings may be rewired, but many systems don’t care where their raw materials come from, just so long as they get them.

Whilst sequence similarity and homology modelling can explain the structure and function of an individual protein, its role in the greater scheme of things may still be in question. By modelling interaction networks, higher-level questions can be asked, such as: ‘What does this newly discovered complex do?’ – ‘I don’t know, but yeast’s got something that looks quite like it.’ Homology modelling therefore isn’t just for single proteins.

Scoring the similarity of proteins in two species can be done using many non-exclusive metrics including:

  • Sequence Similarity – Is this significantly similar to another protein?
  • Gene Ontology – What does it do?
  • Interaction Partners – What other proteins does this one hang around with?

Subsequently, clustering these proteins based on their interaction partners highlights the groups of proteins which form functional units. These are highly connected internally whilst having few edges to adjacent clusters. This can provide insight into previously uninvestigated proteins: by virtue of sitting in a cluster of known purpose, their function can be inferred. A small sketch of this idea follows.
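
As a hedged sketch of that idea in Python with networkx (the edge list and protein names below are entirely invented for illustration):

# Cluster a toy interaction network and infer a role for an unannotated protein
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical network: two known functional modules, plus one uncharacterised
# protein ("unknown_orf") that interacts only with members of the first module.
G = nx.Graph([
    ("glycolysis_A", "glycolysis_B"), ("glycolysis_A", "glycolysis_C"),
    ("glycolysis_B", "glycolysis_C"), ("glycolysis_C", "unknown_orf"),
    ("repair_X", "repair_Y"), ("repair_X", "repair_Z"), ("repair_Y", "repair_Z"),
])

for cluster in greedy_modularity_communities(G):
    print(sorted(cluster))
# unknown_orf falls into the glycolysis cluster, so we would guess it acts there.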

GPGPUs for bioinformatics

As clock speeds in computer Central Processing Units (CPUs) began to plateau, their data and task parallelism was expanded to compensate. These days (2013) it is not uncommon to find upwards of a dozen processing cores on a single CPU, each core capable of performing 8 calculations as a single operation. Graphics Processing Units (GPUs) were originally intended to assist CPUs by providing hardware optimised to speed up the rendering of highly parallel graphical data into a frame buffer. As graphical models became more complex, it became difficult to provide a single piece of hardware which implemented an optimised design for every model and every calculation the end user might desire. Instead, GPU designs evolved to be more readily programmable and to exhibit greater parallelism. Top-end GPUs are now equipped with over 2,500 simple cores and have their own CUDA or OpenCL programming languages. This new-found programmability allowed users the freedom to take non-graphics tasks which would otherwise have saturated a CPU for days and run them on the highly parallel hardware of the GPU. This technique proved so effective for certain tasks that GPU manufacturers have since begun to tweak their architectures to be suitable not just for graphics processing but also for more general-purpose tasks, thus beginning the evolution of the General Purpose Graphics Processing Unit (GPGPU).
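
To give a flavour of what this programmability looks like, here is a minimal data-parallel sketch using Numba’s CUDA support (my own illustration, assuming the numba package and a CUDA-capable GPU; the kernel and sizes are invented). Each of the many thousands of GPU threads computes a single element of the output:

# Element-wise vector addition on the GPU with numba.cuda
import numpy as np
from numba import cuda

@cuda.jit
def add_kernel(a, b, out):
    i = cuda.grid(1)                   # this thread's global index
    if i < out.shape[0]:               # guard the final, partially filled block
        out[i] = a[i] + b[i]

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.zeros_like(a)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
add_kernel[blocks, threads_per_block](a, b, out)   # Numba copies the arrays to and from the device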

Improvements in data capture and model generation have caused an explosion in the amount of bioinformatic data now available; this data is increasing in volume faster than CPUs are increasing in either speed or parallelism. An example of this can be found here, which displays a graph of the number of proteins stored in the Protein Data Bank per year. To process this vast volume of data, many of the common tools for structure prediction, sequence analysis, molecular dynamics and so forth have now been ported to the GPGPU. The following tools are now GPGPU-enabled and offer significant speed-ups compared to their CPU-based counterparts:

Application | Description | Expected Speed-Up | Multi-GPU Support
Abalone | Models molecular dynamics of biopolymers for simulations of proteins, DNA and ligands | 4-29x | No
ACEMD | GPU simulation of molecular mechanics force fields, implicit and explicit solvent | 160 ns/day (GPU version only) | Yes
AMBER | Suite of programs to simulate molecular dynamics on biomolecules | 89.44 ns/day (JAC NVE) | Yes
BarraCUDA | Sequence mapping software | 6-10x | Yes
CUDASW++ | Open source software for Smith-Waterman protein database searches on GPUs | 10-50x | Yes
CUDA-BLASTP | Accelerates NCBI BLAST for scanning protein sequence databases | 10x | Yes
CUSHAW | Parallelized short read aligner | 10x | Yes
DL-POLY | Simulate macromolecules, polymers, ionic systems, etc. on a distributed memory parallel computer | 4x | Yes
GPU-BLAST | Local search with fast k-tuple heuristic | 3-4x | No
GROMACS | Simulation of biochemical molecules with complicated bond interactions | 165 ns/day (DHFR) | No
GPU-HMMER | Parallelized local and global search with profile Hidden Markov models | 60-100x | Yes
HOOMD-Blue | Particle dynamics package written from the ground up for GPUs | 2x | Yes
LAMMPS | Classical molecular dynamics package | 3-18x | Yes
mCUDA-MEME | Ultrafast scalable motif discovery algorithm based on MEME | 4-10x | Yes
MUMmerGPU | An open-source high-throughput parallel pairwise local sequence alignment program | 13x | No
NAMD | Designed for high-performance simulation of large molecular systems | 6.44 ns/day (STMV on 585x 2050s) | Yes
OpenMM | Library and application for molecular dynamics for HPC with GPUs | Implicit: 127-213 ns/day; Explicit: 18-55 ns/day (DHFR) | Yes
SeqNFind | A commercial GPU-accelerated sequence analysis toolset | 400x | Yes
TeraChem | A general purpose quantum chemistry package | 7-50x | Yes
UGENE | Open-source Smith-Waterman for SSE/CUDA, suffix array based repeats finder and dotplot | 6-8x | Yes
WideLM | Fits numerous linear models to a fixed design and response | 150x | Yes

It is important to note, however, that due to how GPGPUs handle floating point arithmetic compared to CPUs, results can and will differ between architectures, making a direct bit-for-bit comparison meaningless. Instead, interval arithmetic may be useful to sanity-check that the results generated on the GPU are consistent with those from a CPU-based system.
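
As a toy illustration of why bit-for-bit agreement is the wrong expectation (plain Python, no GPU required): summing the same values in a different order, much as a GPU’s different accumulation order would, gives a slightly different floating point answer, so a tolerance- or interval-based check is the sensible comparison.

# Reordering a floating point sum changes the result slightly; compare within a tolerance
import math
import random

random.seed(0)
values = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

forward = sum(values)               # one accumulation order (stand-in for the CPU)
backward = sum(reversed(values))    # another order (stand-in for the GPU)

print(forward == backward)                                            # quite possibly False
print(math.isclose(forward, backward, rel_tol=1e-9, abs_tol=1e-12))   # True: they agree within tolerance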