Author Archives: Eoin Malins

Protein Interaction Networks

Proteins don’t work in isolation: they form complex cliques and partnerships, while some particularly gregarious proteins take multiple partners. It’s becoming increasingly apparent that, to better understand a system, it’s insufficient to understand its component parts in isolation, especially if the simplest cog in the works ends up being part of a system like this.

So we know what an individual protein looks like, but what does it actually do?

On a macroscopic scale, a cell doesn’t care if the glucose it needs comes from lactose, converted by lactase into galactose and glucose, or from starch converted by amylase, or from glycogen, or from amino acids converted by gluconeogenesis. All it cares about is the glucose. If one of these pathways should become unavailable, the cell can continue to function as long as the output (glucose) is the same. At a lower level, proteins increase a system’s robustness to change by forming networks of cooperating partners. The internal workings may be rewired, but many systems don’t care where their raw materials come from, just so long as they get them.

Whilst sequence similarity and homology modelling can explain the structure and function of an individual protein, its role in the greater scheme of things may still be in question. By modelling interaction networks, higher level questions can be asked, such as: ‘What does this newly discovered complex do?’ – ‘I don’t know, but yeast’s got something that looks quite like it.’ Homology modelling therefore isn’t just for single proteins.

Scoring the similarity of proteins in two species can be done using many non-exclusive metrics including:

  • Sequence Similarity – Is this significantly similar to another protein?
  • Gene Ontology – What does it do?
  • Interaction Partners – What other proteins does this one hang around with?

Subsequently clustering these proteins based on their interaction partners highlights the groups of proteins which form functional units. These clusters are highly connected internally whilst having few edges to adjacent clusters. This can provide insight into previously un-investigated proteins: if such a protein sits in a cluster of known purpose, its function can be inferred from that of its neighbours.
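To make the "interaction partners" idea a little more concrete, here is a minimal Python sketch. It is only an illustration, not the method used by any particular tool: the toy network and its protein names are made up, Jaccard similarity is just one possible way to compare partner sets, and `partner_similarity` and `edge_counts` are hypothetical helper names.

```python
from itertools import combinations

# Hypothetical toy network: protein -> set of interaction partners.
# The names A-F are invented for illustration only.
interactions = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E", "F"},
    "E": {"D", "F"},
    "F": {"D", "E"},
}


def partner_similarity(p, q, network):
    """Jaccard similarity of two proteins' neighbourhoods.

    Each protein is counted as part of its own neighbourhood, so two
    proteins that interact with each other and share all other partners
    score 1.0.
    """
    a = network[p] | {p}
    b = network[q] | {q}
    return len(a & b) / len(a | b)


def edge_counts(cluster, network):
    """Return (internal, external) edge counts for a set of proteins.

    A well-separated functional unit should have many internal edges
    and few external ones.
    """
    internal = sum(
        1 for p, q in combinations(sorted(cluster), 2) if q in network[p]
    )
    external = sum(
        1 for p in cluster for q in network[p] if q not in cluster
    )
    return internal, external


same_unit = partner_similarity("A", "B", interactions)
different_unit = partner_similarity("A", "F", interactions)
cohesion = edge_counts({"A", "B", "C"}, interactions)
```

In this toy network A and B share their whole neighbourhood while A and F share nothing, and the cluster {A, B, C} has three internal edges against a single edge (C to D) leaving it, which is the "highly connected internally, few edges out" pattern described above.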

    GPGPUs for bioinformatics

    As the clock speed of computer Central Processing Units (CPUs) began to plateau, their data and task parallelism was expanded to compensate. These days (2013) it is not uncommon to find upwards of a dozen processing cores on a single CPU, with each core capable of performing 8 calculations as a single operation. Graphics Processing Units (GPUs) were originally intended to assist CPUs by providing hardware optimised to speed up rendering highly parallel graphical data into a frame buffer. As graphical models became more complex, it became difficult to provide a single piece of hardware which implemented an optimised design for every model and every calculation the end user might desire. Instead, GPU designs evolved to be more readily programmable and to exhibit greater parallelism. Top-end GPUs are now equipped with over 2,500 simple cores and are programmed through their own frameworks, such as CUDA or OpenCL. This newfound programmability gave users the freedom to take non-graphics tasks which would otherwise have saturated a CPU for days and run them on the highly parallel hardware of the GPU. This technique proved so effective for certain tasks that GPU manufacturers have since begun to tweak their architectures to be suitable not just for graphics processing but also for more general purpose tasks, thus beginning the evolution of the General Purpose Graphics Processing Unit (GPGPU).

    Improvements in data capture and model generation have caused an explosion in the amount of bioinformatic data which is now available, and this data is increasing in volume faster than CPUs are increasing in either speed or parallelism. An example of this can be found here, which displays a graph of the number of proteins stored in the Protein Data Bank per year. To process this vast volume of data, many of the common tools for structure prediction, sequence analysis, molecular dynamics and so forth have now been ported to the GPGPU. The following tools are now GPGPU enabled and offer significant speed-up compared to their CPU-based counterparts:

    Application | Description | Expected Speed Up | Multi-GPU Support
    Abalone | Models molecular dynamics of biopolymers for simulations of proteins, DNA and ligands | 4-29x | No
    ACEMD | GPU simulation of molecular mechanics force fields, implicit and explicit solvent | 160 ns/day (GPU version only) | Yes
    AMBER | Suite of programs to simulate molecular dynamics on biomolecules | 89.44 ns/day (JAC NVE) | Yes
    BarraCUDA | Sequence mapping software | 6-10x | Yes
    CUDASW++ | Open source software for Smith-Waterman protein database searches on GPUs | 10-50x | Yes
    CUDA-BLASTP | Accelerates NCBI BLAST for scanning protein sequence databases | 10x | Yes
    CUSHAW | Parallelized short read aligner | 10x | Yes
    DL-POLY | Simulates macromolecules, polymers, ionic systems, etc. on a distributed memory parallel computer | 4x | Yes
    GPU-BLAST | Local search with fast k-tuple heuristic | 3-4x | No
    GROMACS | Simulation of biochemical molecules with complicated bond interactions | 165 ns/day (DHFR) | No
    GPU-HMMER | Parallelized local and global search with profile Hidden Markov models | 60-100x | Yes
    HOOMD-Blue | Particle dynamics package written from the ground up for GPUs | 2x | Yes
    LAMMPS | Classical molecular dynamics package | 3-18x | Yes
    mCUDA-MEME | Ultrafast scalable motif discovery algorithm based on MEME | 4-10x | Yes
    MUMmerGPU | An open-source high-throughput parallel pairwise local sequence alignment program | 13x | No
    NAMD | Designed for high-performance simulation of large molecular systems | 6.44 ns/day (STMV, 585x 2050s) | Yes
    OpenMM | Library and application for molecular dynamics for HPC with GPUs | Implicit: 127-213 ns/day; Explicit: 18-55 ns/day (DHFR) | Yes
    SeqNFind | A commercial GPU-accelerated sequence analysis toolset | 400x | Yes
    TeraChem | A general purpose quantum chemistry package | 7-50x | Yes
    UGENE | Open source Smith-Waterman for SSE/CUDA, suffix array based repeats finder and dotplot | 6-8x | Yes
    WideLM | Fits numerous linear models to a fixed design and response | 150x | Yes

    It is important to note, however, that because GPGPUs handle floating point arithmetic differently from CPUs, results can and will differ between architectures, so a direct bit-for-bit comparison is not possible. Instead, interval arithmetic may be useful to sanity-check that the results generated on the GPU are consistent with those from a CPU based system.
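One common source of such differences is simply the order in which numbers are added, since floating point addition is not associative. The Python sketch below is a rough illustration under stated assumptions rather than a model of any real GPU: `sequential_sum` stands in for a single CPU thread, `pairwise_sum` mimics the tree-style reduction a parallel device might use, and `results_agree` is a hypothetical helper that uses a plain relative tolerance check, which is a simpler stand-in for full interval arithmetic.

```python
import math


def sequential_sum(values):
    """Add values left to right, as a single CPU thread typically would."""
    total = 0.0
    for v in values:
        total += v
    return total


def pairwise_sum(values):
    """Add values in a tree-style reduction, similar in spirit to a
    parallel reduction: neighbours are summed in pairs until one
    value remains."""
    values = list(values)
    if not values:
        return 0.0
    while len(values) > 1:
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            # Odd element out is carried to the next round unchanged.
            paired.append(values[-1])
        values = paired
    return values[0]


def results_agree(a, b, rel_tol=1e-9):
    """Tolerance-based comparison; rel_tol is an arbitrary choice made
    for this sketch, not a recommended value."""
    return math.isclose(a, b, rel_tol=rel_tol)


data = [0.1] * 10
cpu_like = sequential_sum(data)
gpu_like = pairwise_sum(data)

print(cpu_like == gpu_like)               # exact equality fails here
print(results_agree(cpu_like, gpu_like))  # but the values agree within tolerance
```

The same ten numbers give two slightly different totals depending only on the order of addition, which is why comparing CPU and GPU output with `==` is unreliable and some form of tolerance or interval check is needed.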