Eoin Malins | Oxford Protein Informatics Group

Proteins don’t just work in isolation. They form complex cliques and partnerships, while some particularly gregarious proteins take multiple partners. It’s becoming increasingly apparent that, in order to better understand a system, it’s insufficient to understand its component parts in isolation, especially if the simplest cog in the works ends up being part of a far larger and more complex system.

So we know what an individual protein looks like, but what does it actually do?

On a macroscopic scale, a cell doesn’t care if the glucose it needs comes from lactose, converted by lactase into galactose and glucose, or from starch converted by amylase, or from glycogen, or from amino acids converted by gluconeogenesis. All it cares about is the glucose. If one of these multiple pathways should become unavailable, as long as the output is the same (glucose) the cell can continue to function. At a lower level, networks of cooperating proteins increase a system’s robustness to change. The internal workings may be rewired, but many systems don’t care where their raw materials come from, just so long as they get them.

Whilst sequence similarity and homology modelling can explain the structure and function of an individual protein, its role in the greater scheme of things may still be in question. By modelling interaction networks, higher level questions can be asked, such as: ‘What does this newly discovered complex do?’ – ‘I don’t know, but yeast’s got something that looks quite like it.’ Homology modelling therefore isn’t just for single proteins.

Scoring the similarity of proteins in two species can be done using many non-exclusive metrics including:

Sequence Similarity – Is this significantly similar to another protein?

Gene Ontology – What does it do?

Interaction Partners – What other proteins does this one hang around with?
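
As a rough illustration of the third metric, the overlap between two proteins' interaction partners can be scored with a Jaccard index. This is only a minimal sketch under stated assumptions, not the method used in any particular tool: the function names, the protein names and the `orthologs` mapping (which translates species-B names into their assumed species-A equivalents) are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def partner_similarity(partners_a, partners_b, orthologs):
    """Compare the interaction partners of a protein in species A with
    those of a protein in species B.

    `orthologs` (hypothetical) maps species-B protein names to their
    assumed species-A equivalents. Unmapped partners keep their own
    name, so they count against the score.
    """
    mapped_b = {orthologs.get(p, p) for p in partners_b}
    return jaccard(partners_a, mapped_b)


# Toy data: every name below is made up for illustration.
human_partners = {"P1", "P2", "P3"}
yeast_partners = {"y1", "y2", "y4"}
yeast_to_human = {"y1": "P1", "y2": "P2"}

score = partner_similarity(human_partners, yeast_partners, yeast_to_human)
print(score)  # 2 shared partners out of 4 distinct ones -> 0.5
```

In practice a score like this would be combined with the sequence similarity and Gene Ontology metrics rather than used on its own.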

Subsequently clustering these proteins based on their interaction partners highlights the groups of proteins which form functional units. These are highly connected internally whilst having few edges to adjacent clusters. This can provide insight into previously un-investigated proteins, whose function can be inferred by virtue of their being in a cluster of known purpose.
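
The two ideas in the paragraph above can be sketched in a few lines, assuming a cluster has already been produced by some clustering algorithm (not shown here). The first function counts edges inside the cluster versus edges crossing its boundary; the second guesses a function for un-annotated members by a simple majority vote. The vote is an illustrative stand-in rather than an established annotation method, and the network, protein names and annotations are all made up.

```python
from collections import Counter


def edge_counts(edges, cluster):
    """Return (internal, boundary) edge counts for `cluster`.

    Internal edges have both endpoints inside the cluster; boundary
    edges have exactly one endpoint inside it.
    """
    cluster = set(cluster)
    internal = boundary = 0
    for u, v in edges:
        inside = (u in cluster) + (v in cluster)
        if inside == 2:
            internal += 1
        elif inside == 1:
            boundary += 1
    return internal, boundary


def infer_function(cluster, annotations):
    """Guess a function for un-annotated proteins by majority vote
    over the annotated members of the same cluster."""
    votes = Counter(annotations[p] for p in cluster if p in annotations)
    if not votes:
        return {}  # nothing known about this cluster, so no guess
    guess, _ = votes.most_common(1)[0]
    return {p: guess for p in cluster if p not in annotations}


# Toy network: two triangles joined by a single edge (C-D).
edges = [("A", "B"), ("A", "C"), ("B", "C"),
         ("C", "D"),
         ("D", "E"), ("D", "F"), ("E", "F")]
cluster = {"A", "B", "C"}
annotations = {"A": "glycolysis", "B": "glycolysis"}  # hypothetical labels

internal, boundary = edge_counts(edges, cluster)
predicted = infer_function(cluster, annotations)
print(internal, boundary)  # 3 internal edges, 1 boundary edge
print(predicted)           # {'C': 'glycolysis'}
```

A cluster with many internal edges and few boundary edges, like the one above, is the kind of unit the text describes as a candidate functional module.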

As the clock speed of computer Central Processing Units (CPUs) began to plateau, their data and task parallelism was expanded to compensate. These days (2013) it is not uncommon to find upwards of a dozen processing cores on a single CPU, with each core capable of performing 8 calculations as a single operation. Graphics Processing Units (GPUs) were originally intended to assist CPUs by providing hardware optimised to speed up rendering highly parallel graphical data into a frame buffer. As graphical models became more complex, it became difficult to provide a single piece of hardware which implemented an optimised design for every model and every calculation the end user might desire. Instead, GPU designs evolved to be more readily programmable and to exhibit greater parallelism. Top-end GPUs are now equipped with over 2,500 simple cores and can be programmed through dedicated frameworks such as CUDA or OpenCL. This newfound programmability gave users the freedom to take non-graphics tasks which would otherwise have saturated a CPU for days and run them on the highly parallel hardware of the GPU. This technique proved so effective for certain tasks that GPU manufacturers have since begun to tweak their architectures to be suitable not just for graphics processing but also for more general purpose tasks, thus beginning the evolution of the General Purpose Graphics Processing Unit (GPGPU).

Improvements in data capture and model generation have caused an explosion in the amount of bioinformatic data which is now available, and this data is increasing in volume faster than CPUs are increasing in either speed or parallelism. An example of this can be seen in graphs of the number of proteins stored in the Protein Data Bank per year. To process this vast volume of data, many of the common tools for structure prediction, sequence analysis, molecular dynamics and so forth have now been ported to the GPGPU. The following tools are now GPGPU enabled and offer significant speed-up compared to their CPU-based counterparts:

Application	Description	Expected Speed-Up or Throughput	Multi-GPU Support
Abalone	Models molecular dynamics of biopolymers for simulations of proteins, DNA and ligands	4-29x	No
ACEMD	GPU simulation of molecular mechanics force fields, implicit and explicit solvent	160 ns/day GPU version only	Yes
AMBER	Suite of programs to simulate molecular dynamics on biomolecules	89.44 ns/day JAC NVE	Yes
BarraCUDA	Sequence mapping software	6-10x	Yes
CUDASW++	Open source software for Smith-Waterman protein database searches on GPUs	10-50x	Yes
CUDA-BLASTP	Accelerates NCBI BLAST for scanning protein sequence databases	10x	Yes
CUSHAW	Parallelized short read aligner	10x	Yes
DL-POLY	Simulate macromolecules, polymers, ionic systems, etc on a distributed memory parallel computer	4x	Yes
GPU-BLAST	Local search with fast k-tuple heuristic	3-4x	No
GROMACS	Simulation of biochemical molecules with complicated bond interactions	165 ns/day DHFR	No
GPU-HMMER	Parallelized local and global search with profile Hidden Markov models	60-100x	Yes
HOOMD-Blue	Particle dynamics package written from the ground up for GPUs	2x	Yes
LAMMPS	Classical molecular dynamics package	3-18x	Yes
mCUDA-MEME	Ultrafast scalable motif discovery algorithm based on MEME	4-10x	Yes
MUMmerGPU	An open-source high-throughput parallel pairwise local sequence alignment program	13x	No
NAMD	Designed for high-performance simulation of large molecular systems	6.44 ns/day STMV 585x 2050s	Yes
OpenMM	Library and application for molecular dynamics for HPC with GPUs	Implicit: 127-213 ns/day; Explicit: 18-55 ns/day DHFR	Yes
SeqNFind	A commercial GPU Accelerated Sequence Analysis Toolset	400x	Yes
TeraChem	A general purpose quantum chemistry package	7-50x	Yes
UGENE	Open-source Smith-Waterman for SSE/CUDA, suffix array based repeats finder and dotplot	6-8x	Yes
WideLM	Fits numerous linear models to a fixed design and response	150x	Yes

It is important to note, however, that because GPGPUs handle floating point arithmetic differently from CPUs, results can and will differ between architectures, so a direct bit-for-bit comparison is not possible. Instead, interval arithmetic may be useful to sanity-check that the results generated on the GPU are consistent with those from a CPU-based system.
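
To make this concrete, here is a small CPU-only sketch. It assumes that one source of CPU/GPU disagreement is the order in which a parallel reduction adds its terms, and emulates that by summing the same numbers forwards and backwards. The `interval_sum` helper is a toy stand-in for a real interval arithmetic library: instead of directed rounding it widens the bounds by one representable value at every step using `math.nextafter` (Python 3.9 or later). All names here are hypothetical.

```python
import math


def naive_sum(values):
    """Plain left-to-right floating point accumulation."""
    total = 0.0
    for x in values:
        total += x
    return total


def interval_sum(values):
    """Return a conservative (lo, hi) enclosure of the exact sum.

    After each addition the lower bound is nudged down and the upper
    bound nudged up by one representable float, so the rounding error
    of that addition is always covered.
    """
    lo = hi = 0.0
    for x in values:
        lo = math.nextafter(lo + x, -math.inf)
        hi = math.nextafter(hi + x, math.inf)
    return lo, hi


def overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


values = [0.1, 0.2, 0.3]
forward = naive_sum(values)            # one summation order
backward = naive_sum(values[::-1])     # the same terms, reversed

print(forward == backward)             # False: the order changes the result
fwd_iv = interval_sum(values)
bwd_iv = interval_sum(values[::-1])
print(overlap(fwd_iv, bwd_iv))         # True: both enclose the exact sum
```

The two plain sums disagree in the last bit, yet their intervals overlap because each encloses the exact result. Intervals that failed to overlap would point to a discrepancy that rounding alone cannot explain.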


Protein Interaction Networks

GPGPUs for bioinformatics