Using SLURM a little bit more efficiently

Your research group slurmified their servers? You basically have two options now.

Either you install all your necessary things on one of the slurm nodes within an interactive session, e.g.:

srun -p funkyserver-debug --pty --nodes=1 --ntasks-per-node=1 -t 00:10:00 --wait=0 /bin/bash

and always specify this node by adding the ‘#SBATCH --nodelist=funkyserver.cpu.do.work’ line to your sbatch scripts, or you set up some template scripts that help you install all your requirements on multiple nodes, so you can enjoy the full benefits of the slurm system.
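If you go with the first option, a minimal sbatch header pinned to that one node could look something like this (the job name, walltime and the final command are placeholders, not part of the setup above):

#!/bin/bash
#SBATCH -J pinned_job                      # Job name (placeholder)
#SBATCH --time=01:00:00                    # Walltime
#SBATCH --mem=1G
#SBATCH --nodelist=funkyserver.cpu.do.work # run only on the node you set up by hand
#SBATCH --output=slurm_%j.out              # standard output (and errors) go here

# run whatever needs the software installed on funkyserver
python my_analysis.py                      # placeholder command

The rest of this post is about the second, template-based option.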

Here is how I did it; comments and suggestions welcome!

Step 1: Create an sbatch template file (e.g. sbatch_job_on_server.template_sh) on the submission node that does what you want. In the ‘#SBATCH --partition’ or ‘--nodelist’ lines, use a placeholder, e.g. ‘<server>’, instead of funkyserver.

For example, for installing the same conda environment on all nodes that you want to work on:

#!/bin/bash                                                                                         
#SBATCH -J <server>_installer            # Job name
#SBATCH -A opig                          # Project Account                                           
#SBATCH --time=00:10:00                  # Walltime                                                  
#SBATCH --mem=1G
#SBATCH --ntasks=1                       # 1 task
#SBATCH --cpus-per-task=1                # number of cores per task                                 
#SBATCH --nodes=1                        # number of nodes                                           
#SBATCH --chdir=/your/fav/localhost/directory # From where you want the job to be run
#SBATCH --mail-user=dominik@do.work       # set email address  
#SBATCH --verbose
#SBATCH --partition=<server>-debug       # Select a debug partition of server
##SBATCH --nodelist=<server>.cpu.do.work # commented out, use either partition or nodelist
#SBATCH --output=/your/fav/localhost/directory/slurm_outs/slurm_%j_%N_%x_%A_%a.out   # Writes standard output to this file.        
#SBATCH --error=/your/fav/localhost/directory/slurm_outs/slurm_%j_%N_%x_%A_%a.out   # Writes error messages to this file. 

## install conda env from spec file
# copy spec file from (nonslurm) submission node
scp "nonslurmnode:~/Documents/my_env_specs.txt" .

# first source bashrc (with conda.sh), then conda can be used
source ~/.bashrc

# make sure conda base is activated
conda activate

# if already present, the old env is removed (--yes avoids the confirmation prompt in a batch job)
conda remove --yes --name my_env --all
# and the updated one is created from scratch
# (this is necessary if you want to ensure all environments are identical,
# because conda update and conda create --force have a bug)
conda create --yes --name my_env --file my_env_specs.txt

Step 2: Create a create_sbatch_from_template.sh that goes through a list of servers and runs a sed command for each, e.g.:

#!/bin/bash
### This script creates the individual sbatch install file for each server
# 18/02/2020 by Dominik

# define server list
server_list=("funkyserv01" "funkyserv02" "technoserv01")

# go through list
for server in "${server_list[@]}"; do
    # replace placeholders in script and script name
    sed "s/<server>/${server}/g" “~/scripts/templates/sbatch_job_on_server.template_sh" > "~/scripts/sbatch_job_on_${server}.sh"
done

Run it.

Step 3: Create a script that goes through the server list and submits the sbatch files created by Step 2. (This can also be done within the script of Step 2, directly after creating the sbatch file from the template.)

#!/bin/bash
### This script submits sbatch scripts, one for each slurm server that you want to work on
# 18/02/2020 by Dominik

# define server list
server_list=("funkyserv01" "funkyserv02" "technoserv01")

# submit sbatch for each slurm server
for server in "${server_list[@]}"; do
    sbatch "~/scripts/sbatch_job_on_${server}.sh"
done

Run it.
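As mentioned in Step 3, creating and submitting can also be merged into a single loop; a sketch using the same placeholder paths as above:

#!/bin/bash
### This script creates and submits the sbatch install file for each server in one go

# define server list
server_list=("funkyserv01" "funkyserv02" "technoserv01")

for server in "${server_list[@]}"; do
    # fill in the placeholder for this server
    sed "s/<server>/${server}/g" "$HOME/scripts/templates/sbatch_job_on_server.template_sh" > "$HOME/scripts/sbatch_job_on_${server}.sh"
    # and submit it straight away
    sbatch "$HOME/scripts/sbatch_job_on_${server}.sh"
done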

In addition to the programs you need for your research, I find some things very useful to have on multiple servers, e.g. your bashrc, your environments, your ssh token, …
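Your bashrc, for example, can be pulled onto each node with the same scp trick used in the template above (assuming the submission node is still reachable as nonslurmnode); something like this inside the job script:

# copy your bashrc from the (nonslurm) submission node onto this node
scp "nonslurmnode:~/.bashrc" ~/.bashrc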

Miniconda is probably one of the first things you want to have on the nodes. It could be worth installing it on all nodes at the same time, as different conda versions might behave differently. I had some nodes on 4.5 and others on 4.7, which caused me some confusion, as 4.5 had a bug that prevented .conda files in the spec file from being parsed.
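A hedged sketch of what such an install could look like inside the sbatch template; the installer URL is the official Miniconda one, -b runs the installer non-interactively and -p sets the install prefix:

# download and install Miniconda non-interactively into $HOME/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda_installer.sh
bash miniconda_installer.sh -b -p "$HOME/miniconda3"

# let conda set up your ~/.bashrc so later jobs can just 'source ~/.bashrc'
"$HOME/miniconda3/bin/conda" init bash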

Also note that the ‘conda remove --name my_env --all’ followed by a creation from scratch is necessary if you want environments that are identical. Conda 4.4+ has a bug whereby libraries that are not specified in the spec file but were present earlier (e.g. from testing something on that node) will not be uninstalled by conda update (the workaround conda create --force doesn’t work yet; link to github issue missing, apologies).
