{"id":7654,"date":"2021-12-12T15:21:44","date_gmt":"2021-12-12T15:21:44","guid":{"rendered":"https:\/\/www.blopig.com\/blog\/?p=7654"},"modified":"2022-02-08T16:32:32","modified_gmt":"2022-02-08T16:32:32","slug":"snakemake-better-workflows-with-your-code","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2021\/12\/snakemake-better-workflows-with-your-code\/","title":{"rendered":"snakeMAKE better workflows with your code"},"content":{"rendered":"\n<p>When developing a pipeline for processing, annotating and\/or analyzing data, you will probably find yourself re-running it again and again as you play around with your code. This becomes a problem with long pipelines, large datasets and CPUs begging you not to run some pieces of code again.<\/p>\n\n\n\n<p>Luckily, you are not the first to be annoyed by this and related struggles. Some people were so annoyed that they created <a href=\"https:\/\/snakemake.github.io\">Snakemake<\/a>. Snakemake lets you build workflows that solve problems such as the one above. This is done with a Snakefile, which splits your pipeline into \u201crules\u201d. To illustrate how this gives you a better workflow, we will look at the example below.<\/p>\n\n\n\n<!--more-->\n\n\n\n<h2 class=\"wp-block-heading\" id=\"installing-snakemake\">Installing Snakemake<\/h2>\n\n\n\n<p>First we need to install Snakemake. Snakemake can be installed using either <code>pip<\/code> or <code>conda<\/code>. 
However, <code>conda<\/code> is recommended, as the full version of Snakemake includes non-Python dependencies that <code>pip<\/code> does not install, meaning the <code>pip<\/code> version has limited functionality.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"shell\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">conda install -c conda-forge -c bioconda snakemake<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"creating-a-snakefile\">Creating a Snakefile<\/h2>\n\n\n\n<p>The next step is to create the Snakefile, in which you define the rules of your workflow. A rule describes one step of the workflow: the input files it needs, the output files it creates and the command that turns one into the other. Below is a simple example of a Snakefile with two rules. Rules can also contain wildcards: here <code>{csvdata}<\/code> is filled in from the name of the file you are working with.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"raw\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">rule add:\n    input:\n        \"{csvdata}.csv\"\n    output:\n        \"{csvdata}_added.csv\"\n    shell:\n        \"python .\/tools.py {input} {output} add\"\n\nrule multiply:\n    input:\n        \"{csvdata}_added.csv\"\n    output:\n        \"{csvdata}_multiplied.csv\"\n    shell:\n        \"python .\/tools.py {input} {output} multiply\"<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"running-snakemake\">Running Snakemake<\/h2>\n\n\n\n<p>Instead of the usual approach of providing an input and receiving an output, with Snakemake you name the output file you want to create. Snakemake then executes the rule that produces that file. 
<\/p>\n\n\n\n<p>Using the above Snakefile, if we told Snakemake to create <em><code>data_multiplied.csv<\/code><\/em>, it would match <em><code>data_multiplied.csv<\/code><\/em> with <em><code>{csvdata}_multiplied.csv<\/code><\/em> (<code>{csvdata}<\/code> will now be replaced by &#8216;<code>data<\/code>&#8217;). However, to create <em><code>data_multiplied.csv<\/code><\/em> the file <em><code>data_added.csv<\/code><\/em> is needed. If this file does not exist, Snakemake will look for another rule that produces it and execute that one first. In our example, the &#8216;<code>add<\/code>&#8217; rule produces <em><code>data_added.csv<\/code><\/em> when given <em><code>data.csv<\/code><\/em>, which in our case is the data file we start with. Snakemake will therefore first execute the &#8216;<code>add<\/code>&#8217; rule to create <em><code>data_added.csv<\/code><\/em> and then the &#8216;<code>multiply<\/code>&#8217; rule to create <em><code>data_multiplied.csv<\/code><\/em>. If we already had an up-to-date <em><code>data_added.csv<\/code><\/em> file, Snakemake would only run the &#8216;<code>multiply<\/code>&#8217; rule, saving us some computation.<\/p>\n\n\n\n<p>To get a better feel for it, let us try and run our example. 
In order to do so, we first need to create the <code>tools.py<\/code> file.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import sys\nimport pandas as pd\n\ndef add(x):\n    # Add 2 to every value\n    return x + 2\n\ndef multiply(x):\n    # Multiply every value by 2\n    return x * 2\n\nif __name__ == '__main__':\n    # Arguments: input file, output file, method ('add' or 'multiply')\n    inputFilename = sys.argv[1]\n    outputFilename = sys.argv[2]\n    method = sys.argv[3]\n\n    a_input = pd.read_csv(inputFilename)\n\n    if method == 'add':\n        a_input.apply(add).to_csv(outputFilename, index=False)\n    elif method == 'multiply':\n        a_input.apply(multiply).to_csv(outputFilename, index=False)\n    else:\n        # Fail loudly so Snakemake reports the error instead of a missing output file\n        raise ValueError('Unknown method: ' + method)<\/pre>\n\n\n\n<p>And our <code>data.csv<\/code> file.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"raw\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">X\n1\n2\n3\n4\n5\n6\n7\n8\n9\n10<\/pre>\n\n\n\n<p>To run Snakemake you need to be in a folder containing the Snakefile, <code>tools.py<\/code> and <code>data.csv<\/code>, and then run the command below. Recent versions of Snakemake require you to state how many cores it may use, hence <code>--cores 1<\/code>.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"shell\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">snakemake --cores 1 data_multiplied.csv<\/pre>\n\n\n\n<p>This will first generate the <em>data_added.csv<\/em> file and then the <em>data_multiplied.csv<\/em> file. If you run the command again, it will tell you that there is nothing to be done. 
<\/p>\n\n\n\n<p>Snakemake also includes many useful options, such as &#8216;<code>-n<\/code>&#8217;, which shows each step needed to create the output file without running anything, &#8216;<code>-F<\/code>&#8217;, which forces all steps to be re-run, and &#8216;<code>-j<\/code>&#8217;, which sets how many jobs may run in parallel. This barely scratches the surface of what Snakemake has to offer, but I hope this short blog post has illustrated the usefulness of this tool.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>When developing your pipeline for processing, annotating and\/or analyzing data, you will probably find yourself needing to continuously re-run it, as you play around with your code. This can become a problem when working with long pipelines, large datasets and cpu\u2019s begging you not to run some pieces of code again. Luckily, you are not [&hellip;]<\/p>\n","protected":false},"author":79,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"ngg_post_thumbnail":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[361,296,227],"tags":[453,451,452],"ppma_author":[555],"class_list":["post-7654","post","type-post","status-publish","format-standard","hentry","category-data-science","category-hints-and-tips","category-python-code","tag-pipeline","tag-snakemake","tag-workflow"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":555,"user_id":79,"is_guest":0,"slug":"tobias","display_name":"Tobias 
Olsen","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/af2ae0464f59a7d5265c54f1a5ecb21434511b057f7af5971f4052dc60b7da11?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/7654","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/79"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=7654"}],"version-history":[{"count":5,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/7654\/revisions"}],"predecessor-version":[{"id":7787,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/7654\/revisions\/7787"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/media?parent=7654"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=7654"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=7654"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=7654"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}