Singularity: a guide for the bewildered bioinformatician

Have you ever worked with a piece of software that is awfully difficult to set up? That legacy code written in FORTRAN 77, that other one that requires significant modifications to compile, or any of those that require a long-winded bash script with a thousand dependencies (which you also have to install!). Would it not be helpful if, when that red-eyed PhD student, the one who just spent three months writing up their thesis, says that they absolutely must use that server where you have installed all your stuff, you could just relocate to another one without trouble? Well, you may be able to do that now. You just need to use containerization.

The idea behind containerization is rather simple. The best way to ensure anyone can reproduce your work is to, well, ship your entire system to whomever needs to use it. You could, for example, pack up your desktop in a box and ship it to your collaborators anywhere in the world. Unfortunately, this idea is quite impractical, not only because of the tedious logistics (ever had to deal with customs?), but also because suddenly you won't be able to run your own pipeline. It is, however, a good enough thought that at some point it made a clever engineer wonder whether there was a way to ship an entire system without physically delivering the computer. And that's exactly what they designed.

[Image: a 40 ft x 8 ft high cube shipping container]
Best way to make sure your collaborators on the other side of the world can run your pipeline — just pack your desktop in one of these, and ship it away!

Containerization systems, like Docker and Singularity, create a minimal operating system, which can be stored in a single, modestly sized file, and provide a simple environment that can be easily transferred between computers and used seamlessly. Anyone with the appropriate container file can spin it up and run your pipeline exactly the way you did. Besides the convenience of fast and easy deployment, containers also ensure reproducibility, which has become a serious problem in modern science. Unsurprisingly, these advantages have made containerization a standard across the tech industry.

In this blog post, I will be adding my two cents towards improving reproducibility in our field. I will explain how you can use Singularity in computational research, and in particular in bioinformatics. Why Singularity, specifically, is a topic that has already been explored on this blog, so I will not entertain it further. My objective is instead to walk you through, first, how to spin up someone else's container, and right thereafter, how to create your own. More than that, I will be aiming to convince you to use Singularity regularly in your pipelines, so much so that the next time you publish a piece of software, you will make sure to include a Singularity recipe.

Preamble

First of all, how do you get Singularity working? The most straightforward way is to install it from a package repository. On Ubuntu, for example, you can add the NeuroDebian repository and install it from there:

$ sudo wget -O- http://neuro.debian.net/lists/xenial.us-ca.full | sudo tee /etc/apt/sources.list.d/neurodebian.sources.list && \
    sudo apt-key adv --recv-keys --keyserver hkp://pool.sks-keyservers.net:80 0xA5D32F012649A5A9 && \
    sudo apt-get update
$ sudo apt-get install singularity-container
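
If everything went well, you should now be able to query the installed version (the exact number will, of course, depend on what your repository ships):

$ singularity --version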

The story is different if you are running Windows or Mac. I am going to be brutally honest and say that, if you are trying to run bioinformatics pipelines on either of them, you have kind of made your own bed. You may want to make the best decision of your life and install a Linux distro on your machine, or at least set up a Linux virtual machine, or the WSL in Windows 10. If you are adamant that only pure Windows or Mac will do, you may give the official guide a try. However, note that Singularity containers can only emulate Linux, so you will have to get used to it anyway.

How to use a Singularity image?

Let’s first assume that someone has given you a container to use. Many wise developers are choosing to release their software as Docker or Singularity containers, both of which can be deployed by Singularity. Many images are also available through repositories like Docker Hub. For this brief guide, I suggest that you download the following lightweight image:

$ singularity build hello-world.img shub://vsoch/hello-world

Then, to spin up your container, you only need to run:

$ singularity shell hello-world.img

And voilà, you will find yourself in a virtual shell that behaves just like whatever system the creator decided to ship. No more “it works on my system”: the developer’s machine has been shipped — virtually, of course — to your desk. The container will also log you into the simulated operating system with your username, and drop you in your home directory, just like this:
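
Something along these lines, for example (the username and path below are placeholders; Singularity will use your own):

$ singularity shell hello-world.img
Singularity> whoami
your_username
Singularity> pwd
/home/your_username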

It is your system, but it is inside my system, which is also your system but also my system. It makes sense, right?

Most containers will provide you with more functionality than the hello-world example, which is, well, nothing. For example, one of my most perused containers includes all the bioinformatics programs that I use regularly: BLAST, DSSP, HHblits, and others. You could, using the same procedure as above, mount this container and run BLAST:
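
An interactive session might look something like this (the query files are made up for illustration, and I am assuming the container puts the BLAST binaries on the PATH):

$ singularity shell tools.simg
Singularity> blastp -query my_protein.fasta -subject another_protein.fasta -outfmt 6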

Incredibly useful execution of BLAST for demonstrative purposes

I should probably mention that you are not limited to working inside your home directory. Most systems will have different folders for long-term storage and/or scratch space, and it is also likely that you store large files, like sequence databases, in a special partition. On the OPIG servers, most of our data is stored under /data/${SERVER_NAME}/${USER_NAME}/. These partitions are not mounted by default when you spin up the container, and need to be treated in a special way.

One of the most useful Singularity options is the --bind flag, which allows you to mount any directory inside your environment. For example, suppose that you want to run your work on /scratch/, and that your databases are in /opt/shared/databases. Then, you can simply run:

$ singularity shell --bind /scratch:/opt/scratch --bind /opt/shared/databases:/opt/databases mybeautifulimage.simg

As an example, let us actually run BLAST using one of the databases on the OPIG servers:
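
Inside the container, each bound directory appears at the path given after the colon, so a session could look roughly like this (the database and file names are hypothetical):

$ singularity shell --bind /scratch:/opt/scratch --bind /opt/shared/databases:/opt/databases tools.simg
Singularity> cd /opt/scratch
Singularity> blastp -query my_protein.fasta -db /opt/databases/some_database -outfmt 6 -out my_hits.tsv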

Clare West, if you read this, I am still using your databases directory

And that’s about it, if you want to run something interactively. Often, however, you would like to run commands non-interactively, for example to include them in a launcher script for a supercomputer. You can achieve this using the exec command:

$ singularity exec --bind /scratch:/opt/scratch --bind /opt/shared/databases:/opt/databases mybeautifulimage.simg echo "Hello world!"

As an example, let’s run BLAST non-interactively using the exec instruction:
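
Something like the following (again, the database and file names are placeholders):

$ singularity exec --bind /scratch:/opt/scratch --bind /opt/shared/databases:/opt/databases tools.simg \
    blastp -query /opt/scratch/my_protein.fasta -db /opt/databases/some_database -outfmt 6 -out /opt/scratch/my_hits.tsv

This is exactly the kind of one-liner you can drop into a cluster submission script.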

At this point, you know just about enough Singularity to use other people’s images effectively. Let us now talk about how to create your own.

How to create a Singularity image?

All of the above is wonderful if you have some sort of senior minion that can package applications for you, but here is the real deal: how can you build a Singularity image yourself? Well, it turns out it is slightly more complicated. First of all, you will need root privileges. In the Department of Statistics, someone made the executive decision (probably with good reason) not to give anyone Super-Cow powers, so we deployed a purpose-built virtual machine for people to build their own images. In practice, you will probably be fine using your own machine, which is what I do.

The second ingredient is an existing Singularity or Docker image that we will build upon. Fortunately, many of these images are available from online repositories like Docker Hub. If you don’t have any specific requirements, I suggest that you go for docker://ubuntu, but if, for example, you will be using deep learning, you will probably want to use the containers provided for TensorFlow or PyTorch, among others.

Provided that you have secured sudo privileges and identified a suitable starting image, (one of) the easiest way(s) to create a Singularity image is by running:

$ sudo singularity build --sandbox myContainerSandbox docker://ubuntu

After running this command, you may be surprised to find that, instead of a fancy binary file with .simg extension, Singularity has created a folder which looks very much like your root directory, with folders like /etc/, /opt/ and others. The reason is that we have created a sandbox environment: an image for interactive development.

Despite the strange-looking nature of a “container folder”, you can spin it up in exactly the same way as you would with a “.img/.simg file”:

$ sudo singularity shell myContainerSandbox

The advantage of this setup is that you can easily configure your environment as you please by spinning up the container in “writable” mode, where all changes made to the environment are persistent:

$ sudo singularity shell --writable myContainerSandbox

Quick note. Remember that, for all intents and purposes, you are the root user inside the environment, so if you set up your software in root’s home directory (/root/), you will be unable to use it on a different machine unless you have sudo privileges there. My recommendation is that you create some folders in the /opt/ directory, but feel free to do as you please!

In this case, let us set up a tools.simg container that allows you to reproduce the contents of the previous section:

$ sudo singularity build --sandbox toolsSandbox docker://ubuntu
$ sudo singularity shell --writable toolsSandbox
Singularity> apt update && apt upgrade && apt install wget
...
Singularity> wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.11.0+-x64-linux.tar.gz
...
Singularity> tar xf ncbi-blast-2.11.0+-x64-linux.tar.gz
Singularity> mv ncbi-blast-2.11.0+-x64-linux/ BLAST/
Singularity> rm -f ncbi-blast-2.11.0+-x64-linux.tar.gz
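
Before leaving the container, it is worth checking that the freshly unpacked binaries actually run (the path below assumes you downloaded the tarball into the directory you landed in; adjust it if you unpacked BLAST elsewhere):

Singularity> ./BLAST/bin/blastp -version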

If you leave the container, say by pressing Ctrl+D, the changes that you made will not disappear, and the next time you spin up the sandbox you will find your environment just as you set it up.

Once we have finished setting up our environment, we can transform it into a read-only Singularity image simply by running:

$ sudo singularity build myContainer.simg myContainerSandbox/

And now you are ready to use your container in whatever application you have in mind. Why don’t you try to create a container with BLAST, and reproduce the results of the previous section?

What is the best way to use Singularity in bioinformatics?

I have never thought of myself as the right person to provide advice, and yet I always find myself giving it out anyway, so this is not going to be any different. What follows is a well-intended reflection on how best to use Singularity for research in computational biology. Many of these ideas should also apply to other fields of computational research.

My first piece of advice is that you spend some time deciding which tools are commonly used by the group you are working with, and set up a common Singularity image for them. This is a good idea per se, if only to avoid tens or hundreds of BLAST installations under different home directories, but it also brings a lot of other benefits. For example, Linus’s law all but guarantees that a commonly used resource will be better maintained. Perhaps more importantly, you will ensure that, after someone leaves, the other group members will not have to spend a week figuring out which intricate dependencies are needed for their code to run.

Of course, the tools that people use will change with time, but the images can easily be updated. Just choose a group member to periodically review them, installing and/or upgrading as needed. For better performance, make sure said group member is provided with a copious amount of baked goods during or after their vital task.

If you are going to generate a number of Singularity images, you probably also want to invest in a good number of wrapper scripts. A large proportion of bioinformatics programs are self-contained and can be driven entirely from the command line. I am thinking of good ol’ DSSP, which just takes a PDB file and spits out some information about secondary structure, solvent accessibility and all that jazz. If you set up a simple dssp.sh script:

#!/bin/bash

singularity exec /opt/shared/simg/protein-structure-tools.simg /opt/dssp/dssp "$@"

Then, even the members of your group who have no time or desire to learn Singularity can use it without even noticing. In fact, if you rename the script to dssp, give it executable permissions, and place it somewhere in the PATH… well, it works pretty much like the DSSP binary! You can easily include additional --bind flags so the Singularity environment includes all of the special partitions in your system, as in the sketch below.
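
For example, a slightly more complete wrapper might look like this (the image path and the bound directories are the same hypothetical ones used above; adjust them to your own system):

#!/bin/bash
# Hypothetical wrapper: run DSSP from a shared Singularity image, making the
# scratch and database partitions visible inside the container.
singularity exec \
    --bind /scratch:/opt/scratch \
    --bind /opt/shared/databases:/opt/databases \
    /opt/shared/simg/protein-structure-tools.simg \
    /opt/dssp/dssp "$@"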

Finally, I would like to make the point that, if you are releasing your software to the community, you should provide a Singularity image, either by uploading it to some hub or, preferably, by providing a container recipe. To the obvious benefits of reproducibility and ease of use (both excellent drivers of science, but also of citation count), you can add that very few people will contact you asking how to set up your program. And, to those who do, you can just send a link to this blog post!
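
In case you have never written one, a Singularity recipe (or definition file) is just a text file that names a base image and lists the commands needed to set up the environment. A minimal sketch, loosely following the BLAST example above (the version number, URL and paths are illustrative), might look like this:

Bootstrap: docker
From: ubuntu:20.04

%post
    # Install the tools needed to fetch BLAST
    apt-get update && apt-get install -y wget
    # Download and unpack BLAST+ under /opt
    cd /opt
    wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.11.0+-x64-linux.tar.gz
    tar xf ncbi-blast-2.11.0+-x64-linux.tar.gz
    rm -f ncbi-blast-2.11.0+-x64-linux.tar.gz
    # Rename the extracted directory (its exact name varies between releases)
    mv ncbi-blast-2.11.0+* BLAST

%environment
    export PATH=/opt/BLAST/bin:$PATH

Saving this as, say, tools.def, you can then build the image in one step with sudo singularity build tools.simg tools.def, and ship the recipe alongside your code.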

Conclusion

Never, ever set up your software unless you are using a Singularity image. You can create images in a matter of minutes, and afterwards you will be able to spin them up on any machine with guarantees of reliability and reproducibility. Paraphrasing that programming language everyone uses but nobody seems to like: “set up once, run anywhere”.
