Snakemake / Home

Build systems like GNU Make are frequently used to create complicated workflows, e.g. in bioinformatics. This project aims to reduce the complexity of creating workflows by providing a fast and comfortable execution environment, together with a clean and modern domain-specific language (DSL) in Python style:

rule targets:
    input:
        'plots/dataset1.pdf', 
        'plots/dataset2.pdf'

rule plot:
    input:
        'raw/{dataset}.csv'
    output:
        'plots/{dataset}.pdf'
    shell:
        'somecommand {input} {output}'

As with GNU Make, in Snakemake you first specify the targets in a pseudo-rule and then describe how they are created through one or more successive rule applications. Rules can be generalized via wildcards (here {dataset}). Everything is propagated top-down: here, Snakemake determines that to create the file plots/dataset1.pdf, the rule plot has to be applied to the file raw/dataset1.csv with the wildcard {dataset} = dataset1. How the files are created is specified either with a shell command or with Python code. Further, Snakemake can interface with R so that R code can be specified inside rules. Also see the FAQ to get an impression of the basic idea behind Snakemake.
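The top-down wildcard resolution described above can be illustrated with a few lines of plain Python. This is only a minimal sketch of the idea, not Snakemake's actual implementation: the helper names pattern_to_regex and resolve are hypothetical, and the sketch assumes that each wildcard name occurs only once in the output pattern.

```python
import re


def pattern_to_regex(pattern):
    """Turn 'plots/{dataset}.pdf' into a regex with one named group per wildcard.

    Hypothetical helper for illustration; assumes each wildcard name is unique.
    """
    # Splitting on a capturing group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)          # literal text
        else:
            regex += "(?P<%s>.+)" % part      # wildcard becomes a named group
    return re.compile(regex + "$")


def resolve(target, output_pattern, input_pattern):
    """Return (wildcards, input file) if the rule can create target, else None."""
    match = pattern_to_regex(output_pattern).match(target)
    if match is None:
        return None
    wildcards = match.groupdict()
    return wildcards, input_pattern.format(**wildcards)


# The rule "plot" from the example: output 'plots/{dataset}.pdf', input 'raw/{dataset}.csv'
print(resolve("plots/dataset1.pdf", "plots/{dataset}.pdf", "raw/{dataset}.csv"))
```

Starting from the requested target, the output pattern is matched first and the wildcard values found there are then substituted into the input pattern, which mirrors the top-down propagation.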

Documentation

We provide Documentation, an FAQ, a Tutorial for a particular bioinformatics application, and further Examples. If you have further questions, please feel free to join our Forums.

Talks

Articles

This one provides a general introduction from a user perspective (please use this for citation):

Köster, Johannes and Rahmann, Sven. "Snakemake - A scalable bioinformatics workflow engine". Bioinformatics 2012.

This paper complements the one above with details on algorithms and ideas:

Köster, Johannes and Rahmann, Sven. "Building and Documenting Bioinformatics Workflows with Python-based Snakemake". Proceedings of the GCB 2012.

Algorithmic details about Snakemake can be found in my PhD thesis:

Johannes Köster, "Parallelization, Scalability, and Reproducibility in Next-Generation Sequencing Analysis", 2014

An incomplete list of articles making use of Snakemake can be found here. Please consider adding your own work.

Snakemake Workflow Repository

The Snakemake Workflow Repository provides a collection of high-quality, modularized and reusable rules and workflows. The provided code should also serve as a best-practice example of how to build production-ready workflows with Snakemake. Everybody is invited to contribute.

External Resources

Snakemake Docker Image

For easy deployment of Snakemake, and for a controlled environment for your workflows, have a look at the Snakemake Docker image.

Latest News

  • 18 July 2015: Release 3.4 of Snakemake. This release adds support for executing jobs on clusters in synchronous mode (e.g. qsub -sync). Thanks to David Alexander for implementing this. Further, there is now vim syntax highlighting support (thanks to Jay Hesselberth). Snakemake is now available as a Conda package. Finally, lots of bugs have been fixed. Thanks go to e.g. David Koppstein, Marcel Martin, John Huddleston and Tao Wen for helping with useful reports and debugging.
  • 14 May 2015: Release 3.3 of Snakemake. Snakemake now supports YAML in addition to JSON as a config file format (thanks to David Koppstein). You can now provide a separate --cluster-config that provides rule-specific cluster parameters (thanks to Mattias Franberg). Target rules are now local in a cluster environment. Among various minor bug fixes for cluster support, a problem with overly long file paths was fixed. The Snakemake sources have been reformatted to comply with PEP8 style.
  • 25 Mar 2015: Maintenance release 3.2.2 of Snakemake. This release fixes some bugs, e.g. with config files and logging under Windows. Further, Snakemake now finally works properly if you use dynamic output files with multiple wildcards. In addition, rules without wildcards now beat other rules in case of ambiguity.
  • 22 Jan 2015: Maintenance release 3.2.1 of Snakemake. This release adds the onsuccess and onerror keywords, which allow you to specify Python code that is executed after the workflow execution has finished. Further, a bug leading to spurious AmbiguousRuleExceptions has been fixed.
  • 9 Jan 2015: Release 3.2 of Snakemake. This version improves benchmark support and allows config parameters to be specified and overwritten via the command line. Dependency resolution is now accelerated via a prefix-tree-based index data structure. The workdir and include behavior has been made more intuitive: in particular, includes are now relative to the file they are specified in, which eases the hierarchical composition of workflows. Finally, numerous minor bugs have been fixed.
  • 24 Sep 2014: Maintenance release 3.1.1 of Snakemake. This version improves the traceback in case of errors occurring in input functions. Further, two small bugs were fixed.
  • 27 Aug 2014: Release 3.1 of Snakemake. This release adds support for configuring Snakemake workflows via JSON as well as via the command line (see Documentation). Further, Snakemake now supports benchmarking rules (see Documentation). These allow you to create complex performance analysis workflows and to easily obtain reliable CPU and wall clock run times of individual workflow steps.
  • 3 Jul 2014: Release 3.0 of Snakemake. Snakemake now provides a browser-based GUI that can be activated with --gui. Further, Snakemake can now use the DRMAA 1.0 library for submitting and managing jobs on cluster and batch systems (using the --drmaa command line argument). This is meant as an alternative for the generic --cluster functionality and allows more control over your jobs. Further, the scheduling was refined to be more sensitive to priorities and several small bugs have been fixed.

Features

  • Define workflows in a textual way by writing rules that describe how to create output files from input files in a simple Python-based syntax. In contrast to GNU Make (which is primarily a build system), Snakemake allows a rule to create multiple output files.
  • Snakemake automatically calculates which rules need to be executed to create the desired output.
  • Both shell-based rules and full Python syntax inside a rule are supported. Shell commands have direct access to all local and global Python variables.
  • Like GNU Make, Snakemake can schedule parallel rule executions where possible. Further, inter-rule parallelization can be combined with intra-rule parallelization (e.g. threads), and Snakemake ensures that the number of used cores does not exceed a given threshold.
  • Files can be marked as temporary (i.e. they can be deleted once they are no longer needed) or protected (i.e. they will be write-protected after creation).
  • Input and output files can contain multiple named wildcards.
  • Input and output files can be given names to ease addressing them inside the rule.
  • Map-reduce-like functionality is accomplished with the easy-to-read Python list comprehension syntax or the provided expand function.
  • Snakemake can run on a cluster by specifying the submit command (e.g. qsub for Sun Grid Engine) or using DRMAA.
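The feature list mentions that map-reduce-like behavior can be expressed with a Python list comprehension or with the provided expand function. The following plain-Python sketch illustrates the idea under a simplifying assumption: expand_sketch is a hypothetical stand-in written for this page, not Snakemake's own expand, and it only fills each pattern with every combination of the given wildcard values.

```python
from itertools import product


def expand_sketch(pattern, **wildcards):
    """Fill pattern with every combination of the given wildcard values.

    Hypothetical illustration only; not Snakemake's expand implementation.
    """
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


DATASETS = ["dataset1", "dataset2"]

# "Map" step: one plot per dataset, written as a list comprehension ...
plots_comprehension = ["plots/{}.pdf".format(d) for d in DATASETS]

# ... or with the expand-style helper.
plots_expanded = expand_sketch("plots/{dataset}.pdf", dataset=DATASETS)

print(plots_expanded)
```

A rule that lists all of these files as its input then acts as the "reduce" step, since it can only run once every per-dataset file exists.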

Old News

  • 4 Apr 2014: Maintenance Release 2.5.2.2 of Snakemake, fixing two small bugs found in 2.5.2.
  • 3 Apr 2014: Maintenance Release 2.5.2 of Snakemake. This release provides some performance improvements for very large workflows, especially in the exception handling. The handling of custom resources has been improved. Among others, some bugs in the subworkflow implementation and the tracking of code changes have been fixed.
  • 9 Mar 2014: Release 2.5.1 of Snakemake. While it was a hidden feature for some time, bash completion can now be considered stable (see Documentation). The snakemake scheduler has been improved to better handle keyboard interrupts, resulting in a cleaner shutdown. Some performance improvements for very large workflows have been implemented. Several minor bugs and annoyances have been fixed.
  • 1 Feb 2014: Release 2.5 of Snakemake. Cluster support has been improved by using more compatible job script names (further, job script names can now be configured via the command line). This release is the first with official support for the Snakemake API (see the new API Documentation). The API allows Snakemake workflows to be executed from within Python, without the need to invoke the command line tool. This should make it easy to write e.g. versatile web frontends for Snakemake workflows. Complementing this, the logging subsystem has been rewritten in order to allow users of the Snakemake API to provide custom log handlers.
  • 21 Dec 2013: Maintenance Release 2.4.9 of Snakemake. Fixed --cluster support for certain setups, especially in combination with Slurm. Fixed the handling of named lists of input files. Fixed the --dag and --rulegraph command line flags (they now print the graphs again). Fixed scheduling of multi-thread jobs on clusters (--cluster) when limiting the number of submitted jobs with -j. Improved the handling of missing files from subworkflows.
  • 5 Dec 2013: Release 2.4.8 of Snakemake. Snakemake now supports rule dependencies, i.e. referring directly to the output of other rules when defining input files (see Documentation). This saves writing effort and allows ambiguities to be resolved. Sub-workflow support (see Documentation) now allows wildcards to be used when referring to an output file of the sub-workflow. Snakemake now allows imports of Python modules from the directory where the Snakefile resides. Finally, the new function glob_wildcards can be used to infer wildcard values based on a pattern and matching files in the filesystem. This allows Snakemake to be used in a more batch-like way on a set of files present in the filesystem.
  • 13 Oct 2013: Release 2.4.7.1 of Snakemake. Snakemake now supports sub-workflows. The subworkflow syntax allows you to refer to output files of other workflows. When executed, Snakemake first ensures that the referenced files are up to date (see section "Subworkflows" in the Documentation). Further, you can now define rules to be local (see section "Local rules" in the Documentation). If a workflow is executed on a cluster (i.e. using the --cluster flag), local rules won't be submitted but run on the local host instead. This is useful for target rules that only collect files (like the famous all rule). Finally, Snakemake now saves job properties into the jobscript as JSON. There is a parser function for easy use in any custom job submission script (snakemake.utils.read_job_properties, see Documentation).
  • 31 Aug 2013: Maintenance Release 2.4.6 of Snakemake. Improved support for output file flagging: temp, dynamic and protected can now be mixed and also applied to lists, e.g. the output of expand. HTML reports (snakemake.utils.report) now allow arbitrary metadata (e.g. the author name and email) to be added, which is displayed at the bottom beside the creation date. The --ruledag functionality was replaced with --rulegraph, which also works in any corner case where workflow branches are divergent. Several bugs regarding the cluster support and the immediate-submit parameter were fixed.
  • 31 Aug 2013: Maintenance Release 2.4.5 of Snakemake. Several small speed improvements, some bug fixes, and fixed windows support.
  • 1 Jul 2013: Maintenance Release 2.4.4 of Snakemake. The shell-function was improved to properly handle errors of the called process in iterable or read-mode. A bug with the --cleanup-metadata flag was fixed.
  • 17 Jun 2013: Release 2.4.3 of Snakemake. The scheduling mechanism has been completely rewritten based on a heuristic greedy algorithm that approximates a multi-criterial knapsack problem. Besides speed improvements, this allows the user to specify arbitrary resources in addition to threads, e.g. to make Snakemake aware of hybrid-computing architectures like GPGPU (see the section "Resources" in the Documentation). The scheduler ensures that they are not exceeded. A new flag --ruledag allows you to print a reduced DAG where similar jobs are collapsed to single nodes. This should help visualizing large workflows. Minor improvements: The output has been extended to display counts of jobs to be executed. Snakemake now prints debugging output when invoked with --debug. A typo has been fixed that prevented the use of multiprocessing instead of threading on posix systems. This should result in increased performance for workflows relying heavily on rules with inline python code. Params are now tracked for changes similar to input and code (e.g. for the use of --summary).
  • 26 Mar 2013: Release 2.3 of Snakemake. The parser has been rewritten completely to increase flexibility. The first benefit is support for nesting rules into conditional statements and for-loops (see here for an example). Second, the new locking mechanism has been improved to only report a lock when a running workflow uses conflicting files in the same directory. Further, incomplete files should be reported less often, since cases where the file was correctly deleted by Snakemake after a failing job are now omitted. Finally, some bugs were fixed, and special thanks go to Marcel Martin and Hyeshik Chang for contributing patches.
  • 18 Feb 2013: Release 2.2.2 of Snakemake. A persistence framework has been implemented. For now, this supports three new features: it tracks incompletely written files (e.g. due to power loss), source code changes, and version changes of used tools. These features are exposed by new command line options, e.g. --summary, --list-version-changes and --list-code-changes. Further, the --cluster option has been extended to allow additional parameters to be passed to your favorite qsub command, e.g. snakemake --cluster "qsub -pe threaded {threads}". The check for missing output files after a job has been completed now waits a configurable time in case of failure, in order to deal with filesystem latency. Finally, it is now allowed to use multiple wildcards of the same name in output files. Note that yesterday's release 2.2 contained two small bugs that I have fixed in this updated release 2.2.2.
  • 2 Jan 2013: Release 2.1.1 of Snakemake. Some bug fixes after the major rewrite, especially for dynamic rules. Importantly, snakemake.utils now provides a report function that can be used to easily create HTML reports for conducted analyses, and an R function to specify R code inside rules. Various small improvements: a keep-going feature (-k, similar to the one of GNU Make); multiple wildcards are now allowed for dynamic output files; input files can be grouped into named sublists.
  • 24 Nov 2012: Release 2.0 of Snakemake. The application core was rewritten and cleaned up substantially. Support for dynamic output of rules (i.e. rules where the output files are unknown at workflow start, see here) is considered stable now. It is now possible to assign numeric priorities to rules that will guide the Snakemake scheduler to prefer these rules. Further, it is possible to set files or rules to highest priority upon Snakemake invocation, which will lead to the scheduler trying to complete these with all dependencies first (see here for more details). Finally, a mechanism to specify log files for rules is provided (see here), --stats produces more information, and rule docstrings are printed upon -l. This release also paves the way for other features requested by some users that I hope to complete soon, i.e. version tracking for used tools and scripts, directory locking and atomic creation of output files.
  • 24 Sep 2012: Release 1.2.3 of snakemake. Syntax is extended by the expand keyword that can be used to replace e.g. the "for sample in SAMPLES" expressions in rules (see Documentation). Further, the module snakemake.utils now provides some useful helper functions. Finally, this release provides initial support for dynamic files. This concept is explained in the Documentation as well.
  • 25 July 2012: Maintenance release 1.2.1 of snakemake. Improved error messages for ambiguous rules and in case of a wrong wildcard statement. Fixed some minor bugs.
  • 27 June 2012: Release 1.2 of snakemake. Improved support for ambiguous rules by allowing them to be ignored or prioritized with the ruleorder keyword.
  • 11 June 2012: Maintenance release 1.1.4 of snakemake. Fixed an issue where scheduling considered too many cores in some cases.
  • 6 June 2012: Maintenance release 1.1.3 of snakemake. On top of various bug fixes, the algorithm to determine if a rule needs to be run now correctly ignores intermediate files in certain cases. Further, when an error occurs upon parallel execution, all currently running jobs are finished properly before snakemake exits.
  • 15 May 2012: Release 1.1.2 of snakemake. Instead of using only plain strings, input files can now also be defined as functions or lambda expressions that return a string given the wildcards as an argument. Fixed hangups in parallel execution of a lot of jobs.
  • 15 Apr 2012: Maintenance release 1.0.2 of snakemake. Improved temporary file handling and error handling when running snakemake on clusters.
  • 9 Apr 2012: The first stable release (1.0.1) of snakemake.
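The 2.4.3 entry above describes the scheduler as a greedy heuristic that approximates a multi-criteria knapsack problem while ensuring that resources such as threads are not exceeded. The following plain-Python sketch shows what such a greedy selection could look like. It is an assumption-laden illustration and not Snakemake's actual scheduler: the function select_jobs, the job tuples, the ordering by priority and thread count, and the "gpu" resource name are all hypothetical.

```python
def select_jobs(ready_jobs, available):
    """Greedily pick jobs whose combined resource needs fit into `available`.

    ready_jobs: list of (name, priority, needs) tuples, needs being a dict
                such as {"threads": 4, "gpu": 1}
    available:  dict mapping resource name to the free amount
    Hypothetical illustration only; not Snakemake's scheduling algorithm.
    """
    remaining = dict(available)
    selected = []
    # Assumed greedy order: higher priority first, then jobs using more threads.
    ordered = sorted(ready_jobs, key=lambda job: (-job[1], -job[2].get("threads", 1)))
    for name, _priority, needs in ordered:
        fits = all(remaining.get(res, 0) >= amount for res, amount in needs.items())
        if fits:
            for res, amount in needs.items():
                remaining[res] = remaining.get(res, 0) - amount
            selected.append(name)
    return selected


jobs = [
    ("align_a", 0, {"threads": 4}),
    ("align_b", 0, {"threads": 4}),
    ("train", 1, {"threads": 2, "gpu": 1}),
    ("train_again", 0, {"threads": 1, "gpu": 1}),
]
print(select_jobs(jobs, {"threads": 8, "gpu": 1}))
```

In this example the prioritized GPU job is taken first, one four-thread job still fits into the remaining threads, and the other two jobs have to wait for the next scheduling round because either the threads or the single GPU are exhausted.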
