Building pipelines

pipeline Module

BioLite borrows from Ruffus (http://code.google.com/p/ruffus/) the idea of using Python function decorators to delineate pipeline stages. Pipelines are created with a sequence of ordinary Python functions decorated by a pipeline object, which registers each function as a stage in the pipeline. The pipeline object maintains a persistent, global dictionary, called the state, and runs each stage by inspecting the stage function’s signature and calling it with the values in the state dictionary whose keys match its argument names. This is implemented using the function inspection methods available from the inspect module in the Python standard library. If the stage function returns a dictionary, it is ingested into the pipeline’s state: values are added for any new keys and updated for existing keys. Arguments passed on the command line to the pipeline script form the initial data in the pipeline’s state.

As an example, the following code sets up a pipeline with two command-line arguments and one stage. Note how the variable names in the stage function’s signature match the names of the arguments. The stage uses the ingest call to pull the output path into the pipeline’s state, so that it is accessible to other stages that might be added to this pipeline.

from biolite.pipeline import BasePipeline
from biolite.wrappers import FilterIllumina

pipe = BasePipeline('filter', "Example pipeline")

pipe.add_argument('input', short='i',
      help="Input FASTA or FASTQ file to filter.")

pipe.add_argument('quality', short='q', type=int, metavar='MIN',
      default=28, help="Filter out reads that have a mean quality < MIN.")

@pipe.stage
def filter(input, quality):
      '''
      Filter out low-quality and adapter-contaminated reads
      '''
      output = input + '.filtered'
      FilterIllumina([input], [output], quality=quality)
      # Make 'output' available to later stages via the pipeline's state.
      ingest('output')

if __name__ == "__main__":
      pipe.parse_args()
      pipe.run()

This script is available in examples/filter-pipeline.py and produces the following help message:

$ python examples/filter-pipeline.py -h
usage: filter-pipeline.py [-h] [--restart [CHK]] [--stage N] [--input INPUT]
                          [--quality MIN]

Example pipeline

optional arguments:
  -h, --help            show this help message and exit
  --restart [CHK]       Restart the pipeline from the last available
                        checkpoint, or from the specified checkpoint file CHK.
  --stage N             Start at stage number N. Note that some stages require
                        the output of previous stages, so starting in the
                        middle of a pipeline may not work.
  --input INPUT, -i INPUT
                        Input FASTA or FASTQ file to filter.
  --quality MIN, -q MIN
                        Filter out reads that have a mean quality < MIN. [28]

pipeline stages:
  0) [filter] 
      Filter out low-quality and adapter-contaminated reads

The pipeline module allows you to rapidly create full-featured pipeline scripts with help messages, checkpointing and restart capabilities, and integration with the BioLite diagnostics and catalog databases (using the Pipeline or IlluminaPipeline derived classes).

Meta-Pipelines

Modularity is a key design goal, and it is possible to reuse one or more stages of an existing pipeline when building a new pipeline. It is also possible to build meta-pipelines that connect together several sub-pipelines.
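
For example, a new pipeline could reuse the filter stage defined above via the import_arguments and import_stages methods documented below. The following is a minimal sketch, assuming the filter pipeline script has been saved as a module named filter_pipeline (a hypothetical name) so that its pipe object can be imported:

from biolite.pipeline import BasePipeline
import filter_pipeline  # hypothetical module containing the filter pipeline above

meta = BasePipeline('meta', "Example meta-pipeline")

# Reuse the existing pipeline's command-line arguments and stages.
meta.import_arguments(filter_pipeline.pipe)
meta.import_stages(filter_pipeline.pipe)

@meta.stage
def report(output):
      '''
      Report the location of the filtered reads
      '''
      print "filtered reads written to %s" % output

if __name__ == "__main__":
      meta.parse_args()
      meta.run()

Because the imported filter stage ingests output into the state, the new report stage can pick it up without any explicit wiring between the two pipelines.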

Checkpoints

The pipeline object also incorporates fault tolerance. At the end of each stage, the pipeline stores a checkpoint by dumping its current state to a binary file with the cPickle module. This way, if a run is interrupted, either due to an internal error or to external conditions, such as a kill signal from a batch system or a hardware failure, the run can be restarted from the last completed stage (or, optionally, from any previous stage in the checkpoint).
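
For example, an interrupted run of the filter pipeline above could be resumed from its default checkpoint file (the pipeline’s name followed by ‘.chk’, i.e. filter.chk) or from an explicitly named checkpoint:

$ python examples/filter-pipeline.py --restart
$ python examples/filter-pipeline.py --restart filter.chk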

class biolite.pipeline.BasePipeline(name, desc='')[source]

BasePipeline is the more generic class. It is designed to be used independently of the BioLite diagnostics and catalog features.

import_stages(pipe, start=0)[source]
import_arguments(pipe, names=None)[source]
import_module(module, names=None, start=0)[source]

Imports another pipeline module. Adds the pipeline as a subpipeline and links to the module itself so that it can be referenced later.

import_pipeline(pipe, names=None, start=0)[source]

Imports another pipeline. This should only be used in cases where the pipeline is in the same file as another pipeline.

make_state(*args)[source]
get(key)[source]
stage(func)[source]

Decorator to add functions as stages of this pipeline.

add_stage(func)[source]
list_stages()[source]
size()[source]

Returns the size of the pipeline (the number of stages it contains).

parse_args()[source]

Reads values passed as arguments into the pipeline’s state.

add_argument(name, **kwargs)[source]

Adds an argument --name to the pipeline. The single-character keyword argument ‘short’ is used as the short version of the argument (e.g. short='n' for -n). All other keyword arguments are passed through to the ArgumentParser when parse_args is called.

checkpoint()[source]

Writes a checkpoint file by making a deep copy of the pipeline’s current state and pickling it to the value of chkfile in the state (by default, this is the pipeline’s name followed by ‘.chk’ in the current working directory).
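
A minimal sketch of the mechanism, assuming the state is kept in a dictionary self.state (an illustration, not the actual BioLite code):

import copy
import cPickle

def checkpoint(self):
      # Deep-copy the current state and pickle it to the checkpoint file.
      with open(self.state['chkfile'], 'wb') as f:
            cPickle.dump(copy.deepcopy(self.state), f)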

restart(chkfile)[source]

Restarts the pipeline from the last stage written to the checkpoint file chkfile, which is unpickled and loaded as the current state using a deep copy.

run()[source]

Starts the pipeline at the stage specified with --stage, or at stage 0 if no stage was specified.

rerun(state, start=0, stdout=None)[source]

Starts the pipeline without loading the command-line arguments, and instead uses the provided state (e.g. for calling a full pipeline from within the stage of another pipeline).

The pipeline’s stdout stream can be temporarily redirected to a log file using stdout.

run_stage(func)[source]

Runs the current stage (from self.nstage) by using the inspect module to read the function signature of the decorated stage function, then injecting values from the state where the key matches the variable name in the function signature.
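
In outline, the injection works like the following sketch (again assuming a self.state dictionary; an illustration, not the actual implementation):

import inspect

def run_stage(self, func):
      # Read the stage function's argument names from its signature...
      argnames = inspect.getargspec(func).args
      # ...and call it with the matching values from the state.
      kwargs = dict((name, self.state[name]) for name in argnames)
      return func(**kwargs)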

ingest(*args)[source]

Called from inside a pipeline stage to ingest values back into the pipeline’s state. It uses the inspect module to get the calling function’s (i.e. the stage function’s) local variable dictionary, and copies the variables named in args into the state.
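
A simplified sketch of this frame inspection, under the same self.state assumption as above:

import inspect

def ingest(self, *args):
      # Get the local variable dictionary of the calling stage function...
      caller_locals = inspect.stack()[1][0].f_locals
      # ...and copy each named variable into the pipeline's state.
      for name in args:
            self.state[name] = caller_locals[name]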

class biolite.pipeline.Pipeline(name, desc='')[source]

Bases: biolite.pipeline.BasePipeline

Extends BasePipeline to make use of the BioLite diagnostics and catalog databases.

set_outdir()[source]

Sets up the output directory.

get_file()[source]

Returns the absolute path to the file that this pipeline was created in.

get_all_files()[source]

Returns a flat list of all the files that this pipeline and its sub-pipelines were created in.

run()[source]
finish(*args)[source]
add_stage(func)[source]
class biolite.pipeline.IlluminaPipeline(name, desc='')[source]

Bases: biolite.pipeline.Pipeline

An extension of Pipeline that assumes the input is a forward and reverse FASTQ pair, such as a paired-end Illumina data set.

import_stages(pipe, start=1)[source]
