{"id":12546,"date":"2025-04-29T09:24:02","date_gmt":"2025-04-29T08:24:02","guid":{"rendered":"https:\/\/www.blopig.com\/blog\/?p=12546"},"modified":"2025-04-29T09:24:04","modified_gmt":"2025-04-29T08:24:04","slug":"slurm-and-snakemake-a-match-made-in-hpc-heaven","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2025\/04\/slurm-and-snakemake-a-match-made-in-hpc-heaven\/","title":{"rendered":"Slurm and Snakemake: a match made in HPC heaven"},"content":{"rendered":"\n<p>Snakemake is an incredibly useful workflow management tool that allows you to run pipelines in an automated way. Simply put, it allows you to define inputs and outputs for different steps that depend on each other, Snakemake will then run jobs only when the required inputs have been generated by previous steps. A previous blog post by Tobias is a good introduction to it &#8211; <a href=\"https:\/\/www.blopig.com\/blog\/2021\/12\/snakemake-better-workflows-with-your-code\/\">https:\/\/www.blopig.com\/blog\/2021\/12\/snakemake-better-workflows-with-your-code\/<\/a>.\u00a0<\/p>\n\n\n\n<p>However, often pipelines are computationally intense and we would like to run them on our HPC. Snakemake allows us to do this on slurm using an extension package called snakemake-executor-plugin-slurm.<\/p>\n\n\n\n<!--more-->\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"bash\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">pip install snakemake-executor-plugin-slurm<\/pre>\n\n\n\n<p>Once set up, each part of the pipeline can be sent off as an individual slurm job. Therefore, the jobs can be run across as many CPUs or GPUs as you have access to.<\/p>\n\n\n\n<p>As a toy example, we have two scripts that download a random number of SMILES strings from the ZINC database. 
Once these SMILES have been downloaded, a second script uses them with RDKit to generate 3D conformations for the compounds. In this example, the second script depends on the output of the first. We can define this in our .smk (the Snakemake file format) file:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import random\n\n# Generate 10 random SMILES counts between 5 and 50\ncounts = random.sample(range(5, 51), 10)\n\n# Export the counts so Snakemake knows the valid wildcard values\nSMILES_COUNTS = [str(n) for n in counts]\n\n# Target rule: checks that the final outputs of the pipeline are generated\nrule all:\n    input:\n        expand(\"conf_{num}\/done.txt\", num=SMILES_COUNTS)\n\n# Each rule declares its inputs and outputs, so Snakemake can decide whether it needs to be run\nrule get_smiles:\n    output:\n        \"smiles_{num}.txt\"\n    shell:\n        \"python -m scripts.get_smiles --num {wildcards.num} --output {output}\"\n\nrule generate_conformers:\n    input:\n        smiles=\"smiles_{num}.txt\"\n    output:\n        done=\"conf_{num}\/done.txt\"\n    shell:\n        \"python -m scripts.generate_conf --input_file {input.smiles} --outdir conf_{wildcards.num}\"<\/pre>\n\n\n\n<p>The second script writes a done.txt file so that Snakemake knows all the conformers have been generated.\u00a0<\/p>\n\n\n\n<p>This Snakemake pipeline can be run locally with a single command:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"bash\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">snakemake -j 4 -s {snakemake_filename}.smk<\/pre>\n\n\n\n<p>However, to run this on Slurm, we need a config.yaml file in which we can define our variables:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" 
data-enlighter-language=\"yaml\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">max-status-checks-per-second: 0.01\nexecutor: slurm\nconda-prefix: {CONDA_ENVS_PATH}\njobscript: slurm_job.sh\n\ndefault-resources:\n runtime: 1h\n time: 30:00:00\n slurm_account: {USERNAME}\n mem: 4G\n cpus_per_task: 1\n slurm_partition: {PARTITION}\n clusters: {CLUSTER}\n\nprintshellcmds: true<\/pre>\n\n\n\n<p>This defines the default resources for each slurm job and also defines the path for the conda-prefix. To allow the Snakemake pipeline to know which conda environment to use we additionally have to add the conda env yaml file to each rule. I haven\u2019t managed to get the conda environment to be automatically downloaded (which Snakemake claims is possible) but it would be possible to add it as a rule within the pipeline. I just installed the environment using the command line. An example of a single rule is below:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">rule get_smiles:\nconda: {PATH_TO_ENV_YAML}\noutput:\n\"smiles_{num}.txt\"\nshell:\n\"python -m scripts.get_smiles --num {wildcards.num} --output {output}\"<\/pre>\n\n\n\n<p>Furthermore, resources can be defined for specific jobs, useful if some jobs need GPUs but others only need CPUs. 
For example, we can change one rule to need only 1G of memory:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">rule get_smiles:\n    conda: {PATH_TO_ENV_YAML}\n    output:\n        \"smiles_{num}.txt\"\n    resources:\n        mem=\"1G\"\n    shell:\n        \"python -m scripts.get_smiles --num {wildcards.num} --output {output}\"<\/pre>\n\n\n\n<p>We can run this all using the following command:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"bash\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">snakemake -s generate_random_mols.smk --configfile config\/config.yaml --jobs 100 --latency-wait 10 --keep-going --rerun-incomplete<\/pre>\n\n\n\n<p>Here we define the maximum number of jobs to run and how long Snakemake waits to check for files\u2019 existence. The last two flags ensure Snakemake keeps going even if a single job fails and reruns any jobs that were interrupted in previous runs.<\/p>\n\n\n\n<p>Happy snake-making on Slurm!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Snakemake is an incredibly useful workflow management tool that allows you to run pipelines in an automated way. Simply put, it allows you to define inputs and outputs for different steps that depend on each other; Snakemake will then run jobs only when the required inputs have been generated by previous steps. 
A previous blog [&hellip;]<\/p>\n","protected":false},"author":80,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"ngg_post_thumbnail":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[739,361,227],"tags":[763,292,451],"ppma_author":[554],"class_list":["post-12546","post","type-post","status-publish","format-standard","hentry","category-cloud-computing","category-data-science","category-python-code","tag-hpc","tag-slurm","tag-snakemake"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":554,"user_id":80,"is_guest":0,"slug":"guy","display_name":"Guy Durant","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/012800a6259061320ac59c0ef0f953daa4c2a0ebc3538e9463c6c215d88ed479?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/12546","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/80"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=12546"}],"version-history":[{"count":5,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/12546\/revisions"}],"predecessor-version":[{"id":12554,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/12546\/revisions\/12554"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/media?parent=12546"}],"wp:term":[{"taxonomy":"category","embeddable":true,"h
ref":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=12546"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=12546"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=12546"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}