Snakemake / FAQ

What is the key idea of Snakemake workflows?

The key idea is very similar to GNU Make. The workflow is determined automatically from top (the files you want) to bottom (the files you have), by applying very general rules with wildcards you give to Snakemake:

[Figure: Snakemake idea]

Why Snakemake if there is Galaxy (or some other convenient tool)?

Good question, especially since Snakemake does not offer a graphical user interface (GUI). Instead, Snakemake is like a simple programming language for workflows with direct Python integration. Consider for example Galaxy with a complex workflow of 20 steps on a dataset with 100 different samples, which is not too unusual these days. Galaxy will attempt to keep track of (and display) your analysis history, eat up lots of disk space and overwhelm you with irrelevant information. Also, you have to start the workflow 100 times by clicking somewhere. The goal of Snakemake and Snakefiles is that the final workflow (which will be written incrementally as you develop it) will produce all your output files from the input files (the raw data) just by running Snakemake in the appropriate directory without further interaction.

How do I run my rule on all files of a certain directory?

In Snakemake, similar to GNU Make, the workflow is determined from the top, i.e. from the target files. Imagine you have a directory with files 1.fastq, 2.fastq, 3.fastq, ..., and you want to produce files 1.bam, 2.bam, 3.bam, .... You should specify these as target files, using the ids 1, 2, 3, .... You could end up with at least two rules like this (or any number of intermediate steps):

IDS = "1 2 3 ...".split() # the list of desired ids

# a pseudo-rule that collects the target files
rule all:
   input:  expand("otherdir/{id}.bam", id=IDS)

# a general rule using wildcards that does the work
rule:
   input:  "thedir/{id}.fastq"
   output: "otherdir/{id}.bam"
   shell:  "..."

Snakemake will then go down the line and determine which files it needs from your initial directory.
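The list of target files that expand builds for the single wildcard above can be approximated in plain Python. This is only a sketch of the resulting list, not Snakemake's implementation; the helper name expand_sketch and the concrete ids are made up for illustration.

```python
# Plain-Python approximation of expand("otherdir/{id}.bam", id=IDS)
# for a single wildcard. `expand_sketch` is a hypothetical helper,
# not part of Snakemake.
def expand_sketch(pattern, **wildcards):
    (name, values), = wildcards.items()  # exactly one wildcard is supported here
    return [pattern.format(**{name: value}) for value in values]

IDS = "1 2 3".split()  # illustrative ids
targets = expand_sketch("otherdir/{id}.bam", id=IDS)
print(targets)  # ['otherdir/1.bam', 'otherdir/2.bam', 'otherdir/3.bam']
```

Each entry of this list is a target file for which Snakemake then looks for a rule whose output pattern matches.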

To infer the IDs from the files that are present, Snakemake provides the glob_wildcards function since version 2.4.8, e.g.

IDS, = glob_wildcards("thedir/{id}.fastq")

The function matches the given pattern against the files present in the filesystem and thereby infers the values for all wildcards in the pattern. A named tuple that contains a list of values for each wildcard is returned. Here, this named tuple has only one item, that is the list of values for the wildcard {id}.
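The matching idea can be sketched in plain Python. The helper infer_wildcards below is hypothetical and is not Snakemake's implementation: it assumes each wildcard occurs once in the pattern, and it matches against a given list of paths instead of the filesystem so that the example is self-contained.

```python
import re
from collections import namedtuple

def infer_wildcards(pattern, paths):
    """Match `pattern` (with {name} placeholders) against `paths`."""
    names = re.findall(r"\{(\w+)\}", pattern)
    # Turn "thedir/{id}.fastq" into the regex "thedir/(?P<id>.+)\.fastq".
    parts = re.split(r"\{\w+\}", pattern)
    regex = re.escape(parts[0])
    for name, literal in zip(names, parts[1:]):
        regex += f"(?P<{name}>.+)" + re.escape(literal)
    # One list of values per wildcard, wrapped in a named tuple.
    Wildcards = namedtuple("Wildcards", names)
    result = Wildcards(*[[] for _ in names])
    for path in paths:
        match = re.fullmatch(regex, path)
        if match:
            for name in names:
                getattr(result, name).append(match.group(name))
    return result

paths = ["thedir/1.fastq", "thedir/2.fastq", "thedir/notes.txt"]  # illustrative
IDS, = infer_wildcards("thedir/{id}.fastq", paths)
print(IDS)  # ['1', '2']
```

The trailing comma in `IDS, = ...` unpacks the one-element named tuple, which is why the same idiom appears in the Snakefile line above.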

Is it possible to pass variable values to the workflow via the command line?

Yes, this is possible since version 3.1; see the corresponding section in the documentation. In earlier versions, environment variables had to be used instead. For example, invoke Snakemake as

$ SAMPLES="1 2 3 4 5" snakemake

and have in the Snakefile some Python code that reads this environment variable, i.e.

import os
SAMPLES = os.environ.get("SAMPLES", "10 20").split()  # falls back to "10 20" if unset

I get a NameError with my shell command. Are braces unsupported?

You can use the entire Python format minilanguage in shell commands. Braces in shell commands that are not intended to insert variable values thus have to be escaped by doubling them:

...
shell: "awk '{{print $1}}' {input}"

Here the double braces are escapes, i.e. there will remain single braces in the final command. In contrast, {input} is replaced with an input filename.
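The escaping rule is the same as in Python's built-in str.format, so it can be tried outside of Snakemake. The sketch below uses plain str.format and an illustrative file name; note that plain Python reports unescaped braces as a KeyError, whereas Snakemake reports a NameError as described in the question.

```python
# str.format treats {{ and }} as literal braces and {input} as a field.
command = "awk '{{print $1}}' {input}".format(input="data.txt")
print(command)  # awk '{print $1}' data.txt

# Without escaping, "print $1" is parsed as a field name and the lookup fails.
try:
    "awk '{print $1}' {input}".format(input="data.txt")
except KeyError as error:
    unescaped_error = error
    print("unescaped braces failed:", repr(error))
```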

How do I incorporate files that do not follow a consistent naming scheme?

The best solution is to have a dictionary that translates a sample id to the inconsistently named files and use a function (see the section "functions as input files" in the documentation) to provide an input file like this:

FILENAME = dict(...)  # map sample ids to the irregular filenames here

rule:
  # use a function as input to delegate to the correct filename
  input: lambda wildcards: FILENAME[wildcards.sample]
  output: "somefolder/{sample}.csv"
  shell: ...
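The lookup performed by the input function can be tried in plain Python. In this sketch the dictionary contents are invented for illustration, and a SimpleNamespace stands in for the wildcards object that Snakemake passes to the function.

```python
from types import SimpleNamespace

# Hypothetical mapping from sample id to irregularly named files.
FILENAME = {
    "sampleA": "raw/run1_A_final.txt",
    "sampleB": "raw/B-2013.tsv",
}

def input_for(wildcards):
    # Same lookup as the lambda in the rule above.
    return FILENAME[wildcards.sample]

# Stand-in for the wildcards object of a job producing somefolder/sampleB.csv.
wildcards = SimpleNamespace(sample="sampleB")
print(input_for(wildcards))  # raw/B-2013.tsv
```

A sample id that is missing from the dictionary raises a KeyError here, so it may be worth keeping the mapping complete for every id the workflow requests.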

How do I force Snakemake to rerun all jobs from the rule I just edited?

This can be done by invoking Snakemake with the --forcerules or -R flag, followed by the rules that should be re-executed:

$ snakemake -R somerule

This will cause Snakemake to re-run all jobs of that rule and everything downstream (i.e. everything that depends directly or indirectly on the rule's output).

How do I enable syntax highlighting in Vim for Snakefiles?

Since Snakemake syntax is close to Python, Python syntax highlighting is usually sufficient. To make Vim treat Snakefiles as Python files, add the following line to your .vimrc:

au BufNewFile,BufRead Snakefile set syntax=python

I want to import some helper functions from another Python file. Is that possible?

Yes, from version 2.4.8 on, Snakemake allows importing Python modules (and also simple Python files) from the directory in which the Snakefile resides.

How can I run Snakemake on a cluster where its main process is not allowed to run on the head node?

This can be achieved by submitting the main Snakemake invocation as a job to the cluster. If jobs cannot be submitted from a non-head cluster node, you can provide a submit command that goes back to the head node before submitting:

qsub -N PIPE -cwd -j yes python snakemake --cluster "ssh user@headnode_address 'qsub -N pipe_task -j yes -cwd -S /bin/sh ' " -j

This hint was provided by Inti Pedroso.

Can the output of a rule be a symlink?

Yes. However, you need to make sure that the symlink is newer than the input file (since usually, symlinks share the modification date of the file they refer to). This can be done with the -h flag of the Unix touch command, which updates the symlink itself rather than its target. Here is an example:

rule symlink:
    input:  "path/to/file"
    output: "path/to/symlink"
    shell:  "ln -s ../../{input} {output} && touch -h {output}"
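The difference between the modification time of a symlink and that of its target can be inspected from Python. This is a minimal sketch using temporary files and assuming a platform with symlink support (e.g. Linux); it only illustrates what touch -h changes and makes no claim about how Snakemake itself compares timestamps.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "file")
    link = os.path.join(workdir, "symlink")
    with open(target, "w") as handle:
        handle.write("data\n")
    os.utime(target, (1_000_000_000, 1_000_000_000))  # pretend the file is old
    os.symlink(target, link)

    # stat follows the link, so it reports the target's modification time.
    followed_mtime = os.stat(link).st_mtime
    target_mtime = os.stat(target).st_mtime

    # Counterpart of `touch -h`: update the link itself where supported.
    if os.utime in os.supports_follow_symlinks:
        os.utime(link, None, follow_symlinks=False)
    # lstat does not follow the link and reports the link's own time.
    link_mtime = os.lstat(link).st_mtime

print(followed_mtime == target_mtime)  # True
print(link_mtime > target_mtime)
```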

I would like to receive a mail when Snakemake exits. How can this be achieved?

On Unix, you can make use of the commonly pre-installed mail command:

snakemake 2> snakemake.log
mail -s "snakemake finished" youremail@provider.com < snakemake.log

If your administrator has not set up a working sendmail configuration, you can configure mail to send through an external provider, e.g. Gmail.

I want to pass variables between rules. Is that possible?

Because of the cluster support and the ability to resume a workflow where you stopped last time, Snakemake should in general be used in a way that stores information in the output files of your jobs. Sometimes, however, it can be handy to have a kind of persistent storage for simple values between jobs and rules. Using plain Python objects like a global dict for this will not work, as Snakemake runs each job in a separate process. What helps here is the PersistentDict from the pytools package. Here is an example of a Snakemake workflow using this facility:

from pytools.persistent_dict import PersistentDict

storage = PersistentDict("mystorage")

rule a:
    input: "test.in"
    output: "test.out"
    run:
        myvar = storage.fetch("myvar")
        # do stuff

rule b:
    output: temp("test.in")
    run:
        storage.store("myvar", 3.14)

Here, the output of rule b has to be temp in order to ensure that "myvar" is stored in each run of the workflow, as rule a relies on it. In other words, the PersistentDict is persistent between the job processes, but not between different runs of this workflow. If you need to conserve information between different runs, use output files for it.

I want to configure the behavior of my shell for all rules. How can that be achieved with Snakemake?

You can set a prefix that will be prepended to all shell commands by adding e.g.

shell.prefix("set -o pipefail; ")

to the top of your Snakefile. Make sure that the prefix ends with a semicolon, such that it will not interfere with the subsequent commands. To simulate a bash login shell, you can do the following:

shell.executable("/bin/bash")
shell.prefix("source ~/.bashrc; ")
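The effect of prepending "set -o pipefail; " to a command can be observed by running bash directly. The sketch below assumes bash is installed and uses an illustrative pipeline; it shows only what the prefix changes in the shell, not how Snakemake applies it internally.

```python
import shutil
import subprocess

bash = shutil.which("bash")  # assumes bash is available on this system
prefix = "set -o pipefail; "
command = "false | true"  # first stage fails, last stage succeeds

if bash:
    # Without the prefix, the exit status is that of the last pipeline stage.
    plain = subprocess.run([bash, "-c", command]).returncode
    # With the prefix, a failure in any stage makes the pipeline fail.
    prefixed = subprocess.run([bash, "-c", prefix + command]).returncode
    print(plain, prefixed)  # 0 1
```

The trailing semicolon in the prefix is what keeps it separate from the command that follows.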
