Parallel Computing: GNU Parallel

Recently I started using the OPIG servers to run the algorithm I have developed (CRANkS) on datasets from DUDE (Database of Useful Decoys Enhanced).

This required learning how to run jobs in parallel. Previously I had been using computer clusters with their own queuing system (Torque/PBS) which allowed me to submit each molecule to be scored by the algorithm as a separate job. The queuing system would then automatically allocate nodes to jobs and execute jobs accordingly. On a side note I learnt how to submit these jobs an array, which was preferable to submitting ~ 150,000 separate jobs:

qsub -t 1:X array_submit.sh

where the contents of array_submit.sh would be:

#!/bin/bash
./$SGE_TASK_ID.sh

which would submit jobs 1.sh to X.sh, where X is the total number of jobs.

However the OPIG servers do not have a global queuing system to use. I needed a way of being able to run the code I already had in parallel with minimal changes to the workflow or code itself. There are many ways to run jobs in parallel, but to minimise work for myself, I decided to use GNU parallel [1].

This is an easy-to-use shell tool, which I found quick and easy to install onto my home server, allowing me to access it on each of the OPIG servers.

To use it I simply run the command:

cat submit.sh | parallel -j Y

where Y is the number of cores to run the jobs on, and submit.sh contains:

./1.sh
./2.sh
...
./X.sh

This executes each job making use of Y number of cores when available to run the jobs in parallel.

Quick, easy, simple and minimal modifications needed! Thanks to Jin for introducing me to GNU Parallel!

[1] O. Tange (2011): GNU Parallel – The Command-Line Power Tool, The USENIX Magazine, February 2011:42-47.