bioBakery Workflows Tutorial

bioBakery workflows is a collection of workflows and tasks for executing common microbial community analyses using standardized, validated tools and parameters. Quality control and statistical summary reports are automatically generated for most data types, which include 16S amplicons, metagenomes, and metatranscriptomes.




1. Setup

1.1 Requirements

All of the core requirements for bioBakery workflows (including AnADAMA2) are installed automatically along with the package. Depending on the workflows to be run, additional software is required. For this tutorial, we will be running the whole metagenome shotgun workflow, which requires the following software (a quick version check follows the list).

  1. Kneaddata (v0.6.1+)
  2. MetaPhlAn2
  3. HUMAnN2
  4. StrainPhlAn
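
If these tools are already installed, a quick way to verify that they are on your PATH is to print each version. The exact script names can vary by installation method, so adjust these as needed:

$ kneaddata --version
$ metaphlan2.py --version
$ humann2 --version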

1.2 Installation

bioBakery workflows can be installed with Homebrew, Docker, or pip.

Please note, if you are using bioBakery (Vagrant VM or Google Cloud) you do not need to install bioBakery workflows because the tool and its dependencies are already installed.

1.2.1 Install with Homebrew

bioBakery workflows can be installed using Brew (Homebrew for macOS or Linuxbrew for Linux platforms). This will install almost all of the dependencies for all workflows (e.g., Kneaddata, MetaPhlAn2) excluding those dependencies that have licenses. See the bioBakery Homebrew user manual for instructions on manually installing additional dependencies.

To install with brew:

$ brew install biobakery/biobakery/workflows

1.2.2 Install with Docker

bioBakery workflows can be run from a pre-built Docker image. This option is best if you are working in a computing environment where a container platform like Docker is already installed, or if you have the elevated permissions required to install Docker or a similar tool. The image includes all dependencies for all workflows (e.g., Kneaddata, MetaPhlAn2) excluding those dependencies that have licenses.

To install and run with Docker:

$ docker run -it biobakery/workflows bash

1.2.3 Install with pip

bioBakery workflows can be installed with pip. This will only install the core software and dependencies. It will not install the dependencies for the workflows. Refer to the bioBakery workflows user manual on how to install the dependencies for each workflow.

To install with pip:

$ pip install biobakery_workflows

1.2.4 Install dependencies with licenses

The 16S workflow depends on usearch, which requires a license. If you will be running the 16S workflow, obtain a usearch license and install the software manually.

1.2.5 Install databases

All of the data processing workflows require large databases. Due to their size, they are not installed with the Homebrew, Docker, or VM images. A demo set of databases will be used with this tutorial to reduce the run time.

To install the demo shotgun databases:

$ biobakery_workflows_databases --install wmgx_demo

To install the full shotgun databases:

$ biobakery_workflows_databases --install wmgx

To install the full 16s databases:

$ biobakery_workflows_databases --install 16s

2. Metagenome profiling

Raw metagenome shotgun sequencing reads can be processed through the workflow with a single command. The workflow will run quality control, taxonomic profiling, functional profiling, and strain profiling for the full set of samples.

https://bitbucket.org/repo/49y6o9/images/3637597813-wms_workflow.jpg

The raw files can be single or paired-end and can be formatted as fastq or fasta. They can also be compressed with gzip. By default, fastq.gz files are expected. If paired-end, the default pair identifier is ".R1". To change the defaults for these parameters, use the options "--pair-identifier <R1>" and "--input-extension <fasta>".
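
For example, if your paired-end files were named like sample_R1.fasta and sample_R2.fasta (hypothetical names for illustration), the run command might look like the following:

$ biobakery_workflows wmgx --input input --output output_data --pair-identifier _R1 --input-extension fasta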

2.1 Input files

For this tutorial, we will be using a set of sub-sampled files to reduce run time. These files were generated using reads obtained from six samples from healthy and diseased subjects (Alterations of the human gut microbiome in liver cirrhosis. Nature. 2014 Sep 4;513(7516):59-64).

Download all of the files to run the full tutorial. If you have limited time, only download one or two files. Place these files in a folder named input in your current working directory.

Sample names derived from the input files are used to differentiate the per-sample tasks and their corresponding output files. For example, the sample name for input file HD42R4_subsample.fastq.gz is HD42R4_subsample. The task that runs MetaPhlAn2 on this input file is named metaphlan2____HD42R4_subsample, and the output files generated from MetaPhlAn2 for this input file are named HD42R4_subsample_bowtie2.sam and HD42R4_subsample_taxonomic_profile.tsv.

2.2 Run the workflow

Run the following command to start the workflow:

$ biobakery_workflows wmgx --input input --output output_data --bypass-strain-profiling

All input files are located in the folder input and all output files will be written to the folder output_data.

The option to bypass strain profiling reduces the run time. If you have more time or cores to use to run the tutorial, see the sections that follow on different workflow options.

2.2.1 Scaling up

By default, the workflow runs a single task at a time, and each task is allowed a single core. If you have multiple cores or a grid computing environment, add the following options to speed up the run (a combined example follows the list).

  • Add the option --local-jobs 6 to run at most six tasks at once locally.
  • Add the option --threads 2 to give each task two cores.
  • If you are running in an environment where grid computing is available (using a SLURM or SGE system), add the option --grid-jobs 6 --threads 8 to run at most six grid jobs at once each with eight cores.
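
Putting these together, a local run using the options above might look like the following (adjust the numbers to your hardware):

$ biobakery_workflows wmgx --input input --output output_data --bypass-strain-profiling --local-jobs 6 --threads 2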

2.2.2 Expected run time

The run time will vary depending on the number of input files and the workflow options. All run time estimates are based on the use of a single core.

  • Run a single sample
    • 5 minutes: Run bypassing the strain profiling steps (option --bypass-strain-profiling)
    • 15 minutes: Run the full workflow.
  • Run all six samples
    • 30 minutes: Run bypassing the strain profiling steps (option --bypass-strain-profiling)
    • 60 minutes: Run the full workflow.

When running the subsampled files with strain profiling it is recommended to add the option --strain-profiling-options="--marker_in_clade 0.01". This will allow for strains to be identified with lower numbers of markers than the default to account for the subsampled file size.
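
For example, a full run of the subsampled files with strain profiling would look like the following:

$ biobakery_workflows wmgx --input input --output output_data --strain-profiling-options="--marker_in_clade 0.01"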

2.2.3 Standard output

Running the workflow on a single sample, the output to your terminal will look something like the following:

(Jul 18 13:53:24) [ 0/21 -   0.00%] **Ready    ** Task  0: kneaddata____HD42R4_subsample
(Jul 18 13:53:24) [ 0/21 -   0.00%] **Started  ** Task  0: kneaddata____HD42R4_subsample
(Jul 18 13:53:26) [ 1/21 -   4.76%] **Completed** Task  0: kneaddata____HD42R4_subsample
(Jul 18 13:53:26) [ 1/21 -   4.76%] **Ready    ** Task  3: kneaddata_read_count_table
(Jul 18 13:53:26) [ 1/21 -   4.76%] **Started  ** Task  3: kneaddata_read_count_table
(Jul 18 13:53:26) [ 2/21 -   9.52%] **Completed** Task  3: kneaddata_read_count_table
(Jul 18 13:53:26) [ 2/21 -   9.52%] **Ready    ** Task  4: metaphlan2____HD42R4_subsample
(Jul 18 13:53:26) [ 2/21 -   9.52%] **Started  ** Task  4: metaphlan2____HD42R4_subsample
(Jul 18 13:54:18) [ 3/21 -  14.29%] **Completed** Task  4: metaphlan2____HD42R4_subsample
(Jul 18 13:54:18) [ 3/21 -  14.29%] **Ready    ** Task  8: humann2____HD42R4_subsample
(Jul 18 13:54:18) [ 3/21 -  14.29%] **Started  ** Task  8: humann2____HD42R4_subsample
(Jul 18 13:58:39) [ 4/21 -  19.05%] **Completed** Task  8: humann2____HD42R4_subsample
(Jul 18 13:58:39) [ 4/21 -  19.05%] **Ready    ** Task 10: humann2_count_alignments_species
(Jul 18 13:58:39) [ 4/21 -  19.05%] **Started  ** Task 10: humann2_count_alignments_species
(Jul 18 13:58:39) [ 5/21 -  23.81%] **Completed** Task 10: humann2_count_alignments_species
(Jul 18 13:58:39) [ 5/21 -  23.81%] **Ready    ** Task 11: humann2_regroup_UniRef2EC____HD42R4_subsample
(Jul 18 13:58:39) [ 5/21 -  23.81%] **Started  ** Task 11: humann2_regroup_UniRef2EC____HD42R4_subsample
(Jul 18 13:58:40) [ 6/21 -  28.57%] **Completed** Task 11: humann2_regroup_UniRef2EC____HD42R4_subsample
(Jul 18 13:58:40) [ 6/21 -  28.57%] **Ready    ** Task 16: humann2_renorm_ecs_relab____HD42R4_subsample
(Jul 18 13:58:40) [ 6/21 -  28.57%] **Started  ** Task 16: humann2_renorm_ecs_relab____HD42R4_subsample
(Jul 18 13:58:40) [ 7/21 -  33.33%] **Completed** Task 16: humann2_renorm_ecs_relab____HD42R4_subsample
(Jul 18 13:58:40) [ 7/21 -  33.33%] **Ready    ** Task 19: humann2_join_tables_ecs_relab
(Jul 18 13:58:40) [ 7/21 -  33.33%] **Started  ** Task 19: humann2_join_tables_ecs_relab
(Jul 18 13:58:40) [ 8/21 -  38.10%] **Completed** Task 19: humann2_join_tables_ecs_relab
(Jul 18 13:58:40) [ 8/21 -  38.10%] **Ready    ** Task 22: humann2_count_features_ecs
(Jul 18 13:58:40) [ 8/21 -  38.10%] **Started  ** Task 22: humann2_count_features_ecs
(Jul 18 13:58:41) [ 9/21 -  42.86%] **Completed** Task 22: humann2_count_features_ecs
(Jul 18 13:58:41) [ 9/21 -  42.86%] **Ready    ** Task 13: humann2_join_tables_ecs
(Jul 18 13:58:41) [ 9/21 -  42.86%] **Started  ** Task 13: humann2_join_tables_ecs
(Jul 18 13:58:41) [10/21 -  47.62%] **Completed** Task 13: humann2_join_tables_ecs
(Jul 18 13:58:41) [10/21 -  47.62%] **Ready    ** Task 12: humann2_join_tables_genefamilies
(Jul 18 13:58:41) [10/21 -  47.62%] **Started  ** Task 12: humann2_join_tables_genefamilies
(Jul 18 13:58:41) [11/21 -  52.38%] **Completed** Task 12: humann2_join_tables_genefamilies
(Jul 18 13:58:41) [11/21 -  52.38%] **Ready    ** Task 14: humann2_join_tables_pathabundance
(Jul 18 13:58:41) [11/21 -  52.38%] **Started  ** Task 14: humann2_join_tables_pathabundance
(Jul 18 13:58:41) [12/21 -  57.14%] **Completed** Task 14: humann2_join_tables_pathabundance
(Jul 18 13:58:41) [12/21 -  57.14%] **Ready    ** Task 15: humann2_renorm_genes_relab____HD42R4_subsample
(Jul 18 13:58:41) [12/21 -  57.14%] **Started  ** Task 15: humann2_renorm_genes_relab____HD42R4_subsample
(Jul 18 13:58:41) [13/21 -  61.90%] **Completed** Task 15: humann2_renorm_genes_relab____HD42R4_subsample
(Jul 18 13:58:41) [13/21 -  61.90%] **Ready    ** Task 18: humann2_join_tables_genes_relab
(Jul 18 13:58:41) [13/21 -  61.90%] **Started  ** Task 18: humann2_join_tables_genes_relab
(Jul 18 13:58:42) [14/21 -  66.67%] **Completed** Task 18: humann2_join_tables_genes_relab
(Jul 18 13:58:42) [14/21 -  66.67%] **Ready    ** Task 21: humann2_count_features_genes
(Jul 18 13:58:42) [14/21 -  66.67%] **Started  ** Task 21: humann2_count_features_genes
(Jul 18 13:58:42) [15/21 -  71.43%] **Completed** Task 21: humann2_count_features_genes
(Jul 18 13:58:42) [15/21 -  71.43%] **Ready    ** Task 17: humann2_renorm_pathways_relab____HD42R4_subsample
(Jul 18 13:58:42) [15/21 -  71.43%] **Started  ** Task 17: humann2_renorm_pathways_relab____HD42R4_subsample
(Jul 18 13:58:42) [16/21 -  76.19%] **Completed** Task 17: humann2_renorm_pathways_relab____HD42R4_subsample
(Jul 18 13:58:42) [16/21 -  76.19%] **Ready    ** Task 20: humann2_join_tables_pathways_relab
(Jul 18 13:58:42) [16/21 -  76.19%] **Started  ** Task 20: humann2_join_tables_pathways_relab
(Jul 18 13:58:42) [17/21 -  80.95%] **Completed** Task 20: humann2_join_tables_pathways_relab
(Jul 18 13:58:42) [17/21 -  80.95%] **Ready    ** Task 23: humann2_count_features_pathways
(Jul 18 13:58:42) [17/21 -  80.95%] **Started  ** Task 23: humann2_count_features_pathways
(Jul 18 13:58:42) [18/21 -  85.71%] **Completed** Task 23: humann2_count_features_pathways
(Jul 18 13:58:42) [18/21 -  85.71%] **Ready    ** Task 24: humann2_merge_feature_counts
(Jul 18 13:58:42) [18/21 -  85.71%] **Started  ** Task 24: humann2_merge_feature_counts
(Jul 18 13:58:42) [19/21 -  90.48%] **Completed** Task 24: humann2_merge_feature_counts
(Jul 18 13:58:42) [19/21 -  90.48%] **Ready    ** Task  6: metaphlan2_join_taxonomic_profiles
(Jul 18 13:58:42) [19/21 -  90.48%] **Started  ** Task  6: metaphlan2_join_taxonomic_profiles
(Jul 18 13:58:43) [20/21 -  95.24%] **Completed** Task  6: metaphlan2_join_taxonomic_profiles
(Jul 18 13:58:43) [20/21 -  95.24%] **Ready    ** Task  7: metaphlan2_count_species
(Jul 18 13:58:43) [20/21 -  95.24%] **Started  ** Task  7: metaphlan2_count_species
(Jul 18 13:58:43) [21/21 - 100.00%] **Completed** Task  7: metaphlan2_count_species
Run Finished

The output contains the following information.

  • Column 1: (Jul 18 13:58:42) : This is the time stamp for the status message.
  • Column 2: [19/21 -  90.48%] : This is the status of the workflow. The first number is the total tasks completed. The second number is the total number of tasks in the workflow. The last number is the percent of tasks completed.
  • Column 3: **Started  ** : This is the current status of the task. Tasks that are ready have all dependencies available and are waiting for computing resources. Tasks that have started have begun running locally. Tasks that are completed have finished executing.
  • Column 4: Task  6 : This is the task number associated with the status.
  • Column 5: metaphlan2_join_taxonomic_profiles : This is the name of the task. Tasks that are specific to a sample will include the sample name at the end of the task name.

A line will be written to standard output anytime a task changes status. When running on the grid, task status will be polled at minute intervals to track job progress in standard output.

All text written to standard output will also be written to the log.
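
This means you can follow the progress of a run from a second terminal by tailing the log, which is written to the output folder:

$ tail -f output_data/anadama.log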

2.3 Output files

All output files will be located in the output folder specified with --output <folder> when running the workflow. If you used the same command as included above, the output folder will be named output_data.

List the contents of the output folder to see four sub-folders along with the workflow log:

$ ls output_data/
anadama.log
humann2
metaphlan2
kneaddata
strainphlan

Each folder contains data files generated by the corresponding tool. The main sub-folders for each tool contain the core outputs from running each tool on all samples.

2.3.1 Quality control data

Kneaddata is used for quality control. For more information on Kneaddata, see the Kneaddata user manual.

The kneaddata/main folder contains all of the main outputs from running Kneaddata on all samples. The files named kneaddata/main/sample_name.fastq, for example kneaddata/main/HD42R4_subsample.fastq from this tutorial, are the reads for a specific sample that have passed quality control. These are the reads that are used for the next steps in the workflow. For more information on the core output files, see the Kneaddata user manual.

$ ls output_data/kneaddata/main/
HD42R4_subsample.fastq
HD42R4_subsample_Homo_sapiens_demo_bowtie2_contam.fastq
HD42R4_subsample.log
HD42R4_subsample_SILVA_128_LSUParc_SSUParc_ribosomal_RNA_demo_bowtie2_contam.fastq
HD42R4_subsample.trimmed.fastq
HD48R4_subsample.fastq
HD48R4_subsample_Homo_sapiens_demo_bowtie2_contam.fastq
HD48R4_subsample.log
HD48R4_subsample_SILVA_128_LSUParc_SSUParc_ribosomal_RNA_demo_bowtie2_contam.fastq
HD48R4_subsample.trimmed.fastq
... ( a set of five files will be included in this folder for each input file )

The kneaddata/merged folder contains a single merged data file containing read counts for each step of the quality control process for each input file.
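
To view this table, align its columns for readability. This assumes the file is named kneaddata_read_count_table.tsv to match the task name shown earlier; check the folder for the exact file name:

$ column -t -s $'\t' output_data/kneaddata/merged/kneaddata_read_count_table.tsv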

2.3.2 Taxonomic profiling data

MetaPhlAn2 is used for taxonomic profiling. For more information on MetaPhlAn2, see the MetaPhlAn2 user manual or the MetaPhlAn2 tutorial.

The metaphlan2/main folder contains all of the main outputs from running MetaPhlAn2 on all samples. The files named sample_name_taxonomic_profile.tsv contain the taxonomic profile for each sample while the files named sample_name_bowtie2.sam contain the alignments of the reads for a specific sample to the MetaPhlAn2 marker database. The taxonomic profiles are used for the functional profiling steps while the marker alignments are used for the strain profiling steps. For more information on the core output files, see the MetaPhlAn2 user manual.

$ ls output_data/metaphlan2/main/
HD42R4_subsample_bowtie2.sam
HD42R4_subsample_taxonomic_profile.tsv
HD48R4_subsample_bowtie2.sam
HD48R4_subsample_taxonomic_profile.tsv
... ( a set of two files will be included in this folder for each input file )
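
MetaPhlAn2 profiles report clades at every taxonomic rank, and species-level clade names carry the s__ prefix, so a quick way to view just the species rows for one sample is:

$ grep "s__" output_data/metaphlan2/main/HD42R4_subsample_taxonomic_profile.tsv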

The metaphlan2/merged folder contains a file of the merged taxonomic profiles for all samples. It also contains a file of the total number of species identified for each sample.

2.3.3 Functional profiling data

HUMAnN2 is used for functional profiling. For more information on HUMAnN2, see the HUMAnN2 user manual or the HUMAnN2 tutorial.

The humann2/main folder contains all of the output files from running the core HUMAnN2 software. Each input file generates files of gene family abundances, pathway abundances, and pathway coverage, plus a log and a folder of intermediate output files. For details on each of the main HUMAnN2 output files, see the HUMAnN2 user manual section on output files.

$ ls output_data/humann2/main/
HD42R4_subsample_genefamilies.tsv
HD42R4_subsample_humann2_temp
HD42R4_subsample.log
HD42R4_subsample_pathabundance.tsv
HD42R4_subsample_pathcoverage.tsv
HD48R4_subsample_genefamilies.tsv
HD48R4_subsample_humann2_temp
HD48R4_subsample.log
HD48R4_subsample_pathabundance.tsv
HD48R4_subsample_pathcoverage.tsv
... ( a set of four files and a folder will be included in this folder for each input file )

The humann2/merged folder contains the gene families, ecs, and pathways files for all samples merged into a single file. The data sets normalized to relative abundance (relab) are included in this folder. The files for each individual sample are located in the humann2/regrouped and humann2/relab folders.

$ ls output_data/humann2/merged/
ecs_relab.tsv
ecs.tsv
genefamilies_relab.tsv
genefamilies.tsv
pathabundance_relab.tsv
pathabundance.tsv
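
HUMAnN2 stratifies each feature's total abundance over contributing species on lines containing a "|" character, so a quick way to view only the community-level totals in a merged table is:

$ grep -v "|" output_data/humann2/merged/pathabundance_relab.tsv | less -S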

The humann2/counts folder contains feature counts of gene families, ecs, and pathways for all samples. It also contains a file of total species identified after filtering and total reads aligning (for nucleotide and translated search) for each sample.

2.3.4 Strain profiling data

StrainPhlAn is used for strain profiling. For more information on StrainPhlAn, see the StrainPhlAn user manual or the StrainPhlAn tutorial.

The strainphlan folder contains all of the core outputs from running the StrainPhlAn software. For more information on the core output files, see the StrainPhlAn user manual. By default, at most ten species identified in the set of samples are profiled. The RAxML files contain the trees generated for each species. If the markers are not present in sufficient quantities for these ten species, there might not be any trees included in this folder.
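
To check which trees, if any, were generated, you can list the RAxML outputs. The exact file names depend on the species profiled, so this pattern is only a guess at the layout; list the full folder contents if it does not match:

$ ls output_data/strainphlan/RAxML_*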

2.3.5 Log file

The workflow log contains all of the details of the workflow execution including software versions, commands run, and task benchmarking information if the workflow was run in a grid computing environment. All of the command line options are captured in the log. The log is appended to, so if you run the workflow twice you will see two log entries, one for each run.

The beginning of a log file includes the workflow settings. It looks like the following:

2017-07-18 15:30:41,946 LoggerReporter  started INFO: Beginning AnADAMA run with 37 tasks.
2017-07-18 15:30:41,946 LoggerReporter  started INFO: Workflow configuration options
2017-07-18 15:30:41,946 LoggerReporter  started INFO: strain_profiling_options =
2017-07-18 15:30:41,946 LoggerReporter  started INFO: grid_benchmark = off
2017-07-18 15:30:41,946 LoggerReporter  started INFO: grid_partition =
2017-07-18 15:30:41,946 LoggerReporter  started INFO: pair_identifier = .R1
2017-07-18 15:30:41,947 LoggerReporter  started INFO: log_level = INFO
2017-07-18 15:30:41,947 LoggerReporter  started INFO: input_extension = fastq.gz
2017-07-18 15:30:41,947 LoggerReporter  started INFO: dry_run = False
2017-07-18 15:30:41,947 LoggerReporter  started INFO: exclude_target = None
2017-07-18 15:30:41,947 LoggerReporter  started INFO: until_task = None

Each log line includes a time stamp.

Here are two additional log file lines. The first captures the version of software used by the workflow and the second prints the command run for a task.

2017-07-18 15:30:42,988 LoggerReporter  task_command    INFO: Tracked executable version:  kneaddata v0.6.1
2017-07-18 15:30:42,988 LoggerReporter  task_command    INFO: Executing with shell:  kneaddata --input /code/demo/HD42R4_subsample.fastq.gz --output /code/demo/output/kneaddata/main --threads 1 --output-prefix HD42R4_subsample   --reference-db /code/demo/kneaddata_demo_db/Homo_sapiens_demo/
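
Since these messages are written verbatim, you can pull all tool versions or all commands out of the log with grep:

$ grep "Tracked executable version" output_data/anadama.log
$ grep "Executing with shell" output_data/anadama.log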

3. Visualization

Each bioBakery data processing workflow has a corresponding visualization workflow. The visualization workflow uses the products of the data processing workflow to generate a report in pdf or html format.

3.1 Input files

The input files to the visualization workflow are the output files from the data processing workflow. If you have finished running the workflow on the full set of subsampled files, you have all of the input files required to run the visualization workflow. If, to save time, you only ran a few input files or you were not able to run the strain profiling tasks, you can download the output files from the data processing workflow output files section.

3.2 Run the workflow

The command to run the visualization workflow is similar to the command to run the data processing workflow. However, with this command you will also need to provide a name for the project. The project name will be included in the title section of the report generated.

Run the following command to start the workflow:

$ biobakery_workflows wmgx_vis --input output_data --output output_vis --project-name tutorial

All input files are located in the folder output_data and all output files will be written to the folder output_vis. The folder output_data contains the output files from running the data processing workflow.

3.3 Output files

The main output file is a report. By default, the report is in pdf format. To generate an html report, add the option --output-format html. The sub-folder named figures in the output folder contains all of the figures in the report, and all tables in the report are located in the sub-folder named data. The links in the report to data files all point to files in this folder. A zip archive will be generated with the same name as the output folder. This archive is useful when transferring the visualization products because the links in the report (pdf and html format) will be preserved.
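
For example, to generate the html version of the report:

$ biobakery_workflows wmgx_vis --input output_data --output output_vis --project-name tutorial --output-format html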

3.3.1 Report

The report is 22 pages for this demo set. The report sections include 1) quality control 2) taxonomic profiling 3) functional profiling and 4) workflow information. The workflow information section contains all of the commands run by the workflow. This can be excluded from the report by adding the option --exclude-workflow-info. For workflows with hundreds of input files this section can be very long since hundreds of commands will be included.
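
For example, to generate a report without the workflow information section:

$ biobakery_workflows wmgx_vis --input output_data --output output_vis --project-name tutorial --exclude-workflow-info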

Page 1

The first page of the report includes the project name and date. It also includes a table of contents with links to all of the sections of the report.

https://bitbucket.org/repo/49y6o9/images/2933735162-Screenshot%20from%202017-07-18%2019-34-36.png

Pages 2 through 5

The first section includes read counts for the quality control workflow tasks. A table of read counts for each sample for each filtering step is included along with a barplot of this data. Links to the data files appear below each table in the report.

https://bitbucket.org/repo/49y6o9/images/543288979-Screenshot%20from%202017-07-18%2019-39-22.png

Pages 5 through 9

The second section includes taxonomic profiling data. A table of species counts along with an ordination, heatmap and barplot of species abundances are included.

https://bitbucket.org/repo/49y6o9/images/1738102173-Screenshot%20from%202017-07-18%2019-40-30.png

Pages 10 through 15

The third section includes functional profiling information. Heatmaps and tables of pathway abundances are included along with scatter plots of features and alignment rates.

https://bitbucket.org/repo/49y6o9/images/1632855863-Screenshot%20from%202017-07-18%2019-42-02.png

Pages 15 through 22

The final section includes information from the workflow. A list of software versions along with all commands run are included in this section.

https://bitbucket.org/repo/49y6o9/images/894080452-Screenshot%20from%202017-07-18%2019-36-49.png
