bioBakery Workflows Tutorial

bioBakery workflows is a collection of workflows and tasks for executing common microbial community analyses using standardized, validated tools and parameters. Quality control and statistical summary reports are automatically generated for most data types, which include 16S amplicons, metagenomes, and metatranscriptomes.

1. Setup

1.1 Requirements

All of the core requirements for bioBakery workflows (including AnADAMA2) are automatically installed when installing bioBakery workflows. Depending on the workflows to be run, additional software is required. For this tutorial, we will be running the whole metagenome shotgun workflow. It requires the following software:

1. KneadData
2. MetaPhlAn2
3. HUMAnN2
4. StrainPhlAn

1.2 Installation

bioBakery workflows can be installed with Homebrew, Docker, or pip.

Please note, if you are using bioBakery (Vagrant VM or Google Cloud) you do not need to install bioBakery workflows because the tool and its dependencies are already installed.

1.2.1 Install with Homebrew

bioBakery workflows can be installed using Brew (HomeBrew for MacOS or LinuxBrew for Linux platforms). This will install almost all of the dependencies for all workflows (i.e. Kneaddata, MetaPhlAn2, etc.) excluding those dependencies that have licenses. See the bioBakery Homebrew user manual for instructions on manually installing additional dependencies.

To install with brew:



1.2.3 Install with pip

bioBakery workflows can be installed with pip. This will only install the core software and dependencies. It will not install the dependencies for the workflows. Refer to the bioBakery workflows user manual on how to install the dependencies for each workflow.

To install with pip:

$ pip install biobakery_workflows

1.2.4 Install dependencies with licenses

The 16s workflow optionally depends on usearch which requires a license. Obtain a license for usearch and install it manually if you will be running the 16s workflow.

1.2.5 Install databases

All of the data processing workflows require large databases. Due to their size, they are not installed with the Homebrew, Docker, or VM images. A demo set of databases will be used with this tutorial to reduce the run time.

To install the demo shotgun databases:

$ biobakery_workflows_databases --install wmgx_demo


To install the full shotgun databases:

$ biobakery_workflows_databases --install wmgx

To install the full 16s databases:

$ biobakery_workflows_databases --install 16s


2. Metagenome profiling

Raw metagenome shotgun sequencing reads can be processed through the workflow with a single command. The workflow will run quality control, taxonomic profiling, functional profiling, and strain profiling for the full set of samples.

The raw files can be single or paired-end and can be formatted as fasta or fastq. They can also be compressed with gzip. By default fastq.gz files are expected. If paired-end, the default pair identifier is ".R1". To change the defaults for these parameters, use the options --pair-identifier <R1> and --input-extension <fasta>.
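To see what the pair identifier does in practice, the sketch below groups forward/reverse files the way a pair identifier like ".R1" implies. This is illustrative only: the workflows do this matching internally, and the function name here is invented for the example.

```python
def find_pairs(files, pair_identifier=".R1", extension="fastq.gz"):
    """Group input files into (forward, reverse) pairs.

    Illustrative sketch only; the bioBakery workflows perform this
    matching internally. The mate identifier is assumed to be the
    pair identifier with "1" swapped for "2" (".R1" -> ".R2").
    """
    mate_identifier = pair_identifier.replace("1", "2")
    pairs = []
    for name in sorted(files):
        # only consider forward files with the expected extension
        if pair_identifier in name and name.endswith(extension):
            mate = name.replace(pair_identifier, mate_identifier)
            if mate in files:
                pairs.append((name, mate))
    return pairs
```

For example, `find_pairs(["sampleA.R1.fastq.gz", "sampleA.R2.fastq.gz"])` returns the single pair of those two files.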

2.1 Input files

For this tutorial, we will be using a set of sub-sampled files to reduce run time. These files were generated using reads obtained from a set of six healthy and diseased samples (Alterations of the human gut microbiome in liver cirrhosis. Nature. 2014 Sep 4;513(7516):59-64.)

Download all of the files to run the full tutorial. If you have limited time, only download one or two files. Place these files in a folder named input in your current working directory.

The sample names from the input files are used to differentiate the per-sample tasks and their corresponding output files. For example, the sample name for input file HD42R4_subsample.fastq.gz is HD42R4_subsample. The task that runs MetaPhlAn2 on this input file is named metaphlan2____HD42R4_subsample, and the output files generated from MetaPhlAn2 for this input file are named HD42R4_subsample_bowtie2.sam and HD42R4_subsample_taxonomic_profile.tsv.
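The naming convention above can be sketched in a few lines. This is not the workflows' internal code, just an illustration of how the sample name propagates into task and output file names:

```python
def sample_outputs(input_file, extension=".fastq.gz"):
    """Derive the sample name and the MetaPhlAn2 task/output names
    from an input file name.

    Illustrative sketch of the naming convention only; the actual
    bookkeeping is done inside bioBakery workflows."""
    # strip the expected extension to get the sample name
    if input_file.endswith(extension):
        sample = input_file[: -len(extension)]
    else:
        sample = input_file
    return {
        "sample": sample,
        "task": "metaphlan2____" + sample,
        "sam": sample + "_bowtie2.sam",
        "profile": sample + "_taxonomic_profile.tsv",
    }
```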

2.2 Run the workflow

Run the following command to start the workflow:

biobakery_workflows wmgx --input input --output output_data --bypass-strain-profiling

All input files are located in the folder input and all output files will be written to the folder output_data. The option to bypass strain profiling reduces the run time. If you have more time or cores to use to run the tutorial, see the sections that follow on different workflow options.

2.2.1 Scaling up

By default, the workflow will run a single task at a time and each task will be allowed a single core. If you have multiple cores or a grid computing environment, add the following options to your workflow to speed up the run.

• Add the option --local-jobs 6 to run at most six tasks at once locally.
• Add the option --threads 2 to give each task two cores.
• If you are running in an environment where grid computing is available (using a SLURM or SGE system), add the options --grid-jobs 6 --threads 8 to run at most six grid jobs at once, each with eight cores.

2.2.2 Expected run time

The run time will vary depending on the number of input files and the workflow options. All run time estimates are based on the use of a single core.

• Run a single sample
• 5 minutes: Run bypassing the strain profiling steps (option --bypass-strain-profiling)
• 15 minutes: Run the full workflow.
• Run all six samples
• 30 minutes: Run bypassing the strain profiling steps (option --bypass-strain-profiling)
• 60 minutes: Run the full workflow.

When running the subsampled files with strain profiling, it is recommended to add the option --strain-profiling-options="--marker_in_clade 0.01". This will allow strains to be identified with lower numbers of markers than the default, to account for the subsampled file size.
2.2.3 Standard output

Running the workflow on a single sample, the output to your terminal will look something like the following:

(Jul 18 13:53:24) [ 0/21 - 0.00%] **Ready    ** Task 0: kneaddata____HD42R4_subsample
(Jul 18 13:53:24) [ 0/21 - 0.00%] **Started  ** Task 0: kneaddata____HD42R4_subsample
(Jul 18 13:53:26) [ 1/21 - 4.76%] **Completed** Task 0: kneaddata____HD42R4_subsample
(Jul 18 13:53:26) [ 1/21 - 4.76%] **Ready    ** Task 3: kneaddata_read_count_table
(Jul 18 13:53:26) [ 1/21 - 4.76%] **Started  ** Task 3: kneaddata_read_count_table
(Jul 18 13:53:26) [ 2/21 - 9.52%] **Completed** Task 3: kneaddata_read_count_table
(Jul 18 13:53:26) [ 2/21 - 9.52%] **Ready    ** Task 4: metaphlan2____HD42R4_subsample
(Jul 18 13:53:26) [ 2/21 - 9.52%] **Started  ** Task 4: metaphlan2____HD42R4_subsample
(Jul 18 13:54:18) [ 3/21 - 14.29%] **Completed** Task 4: metaphlan2____HD42R4_subsample
(Jul 18 13:54:18) [ 3/21 - 14.29%] **Ready    ** Task 8: humann2____HD42R4_subsample
(Jul 18 13:54:18) [ 3/21 - 14.29%] **Started  ** Task 8: humann2____HD42R4_subsample
(Jul 18 13:58:39) [ 4/21 - 19.05%] **Completed** Task 8: humann2____HD42R4_subsample
(Jul 18 13:58:39) [ 4/21 - 19.05%] **Ready    ** Task 10: humann2_count_alignments_species
(Jul 18 13:58:39) [ 4/21 - 19.05%] **Started  ** Task 10: humann2_count_alignments_species
(Jul 18 13:58:39) [ 5/21 - 23.81%] **Completed** Task 10: humann2_count_alignments_species
(Jul 18 13:58:39) [ 5/21 - 23.81%] **Ready    ** Task 11: humann2_regroup_UniRef2EC____HD42R4_subsample
(Jul 18 13:58:39) [ 5/21 - 23.81%] **Started  ** Task 11: humann2_regroup_UniRef2EC____HD42R4_subsample
(Jul 18 13:58:40) [ 6/21 - 28.57%] **Completed** Task 11: humann2_regroup_UniRef2EC____HD42R4_subsample
(Jul 18 13:58:40) [ 6/21 - 28.57%] **Ready    ** Task 16: humann2_renorm_ecs_relab____HD42R4_subsample
(Jul 18 13:58:40) [ 6/21 - 28.57%] **Started  ** Task 16: humann2_renorm_ecs_relab____HD42R4_subsample
(Jul 18 13:58:40) [ 7/21 - 33.33%] **Completed** Task 16: humann2_renorm_ecs_relab____HD42R4_subsample
(Jul 18 13:58:40) [ 7/21 - 33.33%] **Ready    ** Task 19: humann2_join_tables_ecs_relab
(Jul 18 13:58:40) [ 7/21 - 33.33%] **Started  ** Task 19: humann2_join_tables_ecs_relab
(Jul 18 13:58:40) [ 8/21 - 38.10%] **Completed** Task 19: humann2_join_tables_ecs_relab
(Jul 18 13:58:40) [ 8/21 - 38.10%] **Ready    ** Task 22: humann2_count_features_ecs
(Jul 18 13:58:40) [ 8/21 - 38.10%] **Started  ** Task 22: humann2_count_features_ecs
(Jul 18 13:58:41) [ 9/21 - 42.86%] **Completed** Task 22: humann2_count_features_ecs
(Jul 18 13:58:41) [ 9/21 - 42.86%] **Ready    ** Task 13: humann2_join_tables_ecs
(Jul 18 13:58:41) [ 9/21 - 42.86%] **Started  ** Task 13: humann2_join_tables_ecs
(Jul 18 13:58:41) [10/21 - 47.62%] **Completed** Task 13: humann2_join_tables_ecs
(Jul 18 13:58:41) [10/21 - 47.62%] **Ready    ** Task 12: humann2_join_tables_genefamilies
(Jul 18 13:58:41) [10/21 - 47.62%] **Started  ** Task 12: humann2_join_tables_genefamilies
(Jul 18 13:58:41) [11/21 - 52.38%] **Completed** Task 12: humann2_join_tables_genefamilies
(Jul 18 13:58:41) [11/21 - 52.38%] **Ready    ** Task 14: humann2_join_tables_pathabundance
(Jul 18 13:58:41) [11/21 - 52.38%] **Started  ** Task 14: humann2_join_tables_pathabundance
(Jul 18 13:58:41) [12/21 - 57.14%] **Completed** Task 14: humann2_join_tables_pathabundance
(Jul 18 13:58:41) [12/21 - 57.14%] **Ready    ** Task 15: humann2_renorm_genes_relab____HD42R4_subsample
(Jul 18 13:58:41) [12/21 - 57.14%] **Started  ** Task 15: humann2_renorm_genes_relab____HD42R4_subsample
(Jul 18 13:58:41) [13/21 - 61.90%] **Completed** Task 15: humann2_renorm_genes_relab____HD42R4_subsample
(Jul 18 13:58:41) [13/21 - 61.90%] **Ready    ** Task 18: humann2_join_tables_genes_relab
(Jul 18 13:58:41) [13/21 - 61.90%] **Started  ** Task 18: humann2_join_tables_genes_relab
(Jul 18 13:58:42) [14/21 - 66.67%] **Completed** Task 18: humann2_join_tables_genes_relab
(Jul 18 13:58:42) [14/21 - 66.67%] **Ready    ** Task 21: humann2_count_features_genes
(Jul 18 13:58:42) [14/21 - 66.67%] **Started  ** Task 21: humann2_count_features_genes
(Jul 18 13:58:42) [15/21 - 71.43%] **Completed** Task 21: humann2_count_features_genes
(Jul 18 13:58:42) [15/21 - 71.43%] **Ready    ** Task 17: humann2_renorm_pathways_relab____HD42R4_subsample
(Jul 18 13:58:42) [15/21 - 71.43%] **Started  ** Task 17: humann2_renorm_pathways_relab____HD42R4_subsample
(Jul 18 13:58:42) [16/21 - 76.19%] **Completed** Task 17: humann2_renorm_pathways_relab____HD42R4_subsample
(Jul 18 13:58:42) [16/21 - 76.19%] **Ready    ** Task 20: humann2_join_tables_pathways_relab
(Jul 18 13:58:42) [16/21 - 76.19%] **Started  ** Task 20: humann2_join_tables_pathways_relab
(Jul 18 13:58:42) [17/21 - 80.95%] **Completed** Task 20: humann2_join_tables_pathways_relab
(Jul 18 13:58:42) [17/21 - 80.95%] **Ready    ** Task 23: humann2_count_features_pathways
(Jul 18 13:58:42) [17/21 - 80.95%] **Started  ** Task 23: humann2_count_features_pathways
(Jul 18 13:58:42) [18/21 - 85.71%] **Completed** Task 23: humann2_count_features_pathways
(Jul 18 13:58:42) [18/21 - 85.71%] **Ready    ** Task 24: humann2_merge_feature_counts
(Jul 18 13:58:42) [18/21 - 85.71%] **Started  ** Task 24: humann2_merge_feature_counts
(Jul 18 13:58:42) [19/21 - 90.48%] **Completed** Task 24: humann2_merge_feature_counts
(Jul 18 13:58:42) [19/21 - 90.48%] **Ready    ** Task 6: metaphlan2_join_taxonomic_profiles
(Jul 18 13:58:42) [19/21 - 90.48%] **Started  ** Task 6: metaphlan2_join_taxonomic_profiles
(Jul 18 13:58:43) [20/21 - 95.24%] **Completed** Task 6: metaphlan2_join_taxonomic_profiles
(Jul 18 13:58:43) [20/21 - 95.24%] **Ready    ** Task 7: metaphlan2_count_species
(Jul 18 13:58:43) [20/21 - 95.24%] **Started  ** Task 7: metaphlan2_count_species
(Jul 18 13:58:43) [21/21 - 100.00%] **Completed** Task 7: metaphlan2_count_species
Run Finished

The output contains the following information.

• Column 1: (Jul 18 13:58:42) : This is the time stamp for the status message.
• Column 2: [19/21 - 90.48%] : This is the status of the workflow. The first number is the total tasks completed. The second number is the total number of tasks in the workflow. The last number is the percent of tasks completed.
• Column 3: **Started  ** : This is the current status for the task. Tasks that are ready have all dependencies available and are waiting for computing resources. Tasks that have started are beginning to run locally. Tasks that are completed have finished executing.
• Column 4: Task 6 : This is the task number associated with the status.
• Column 5: metaphlan2_join_taxonomic_profiles : This is the name of the task. Tasks that are specific to a sample will include the sample name at the end of the task name.

A line will be written to standard output anytime a task changes status. When running on the grid, task status will be polled at minute intervals to track job progress in standard output. All text written to standard output will also be written to the log.

2.3 Output files

All output files will be located in the output folder specified with --output <folder> when running the workflow. If you used the same command as included above, the output folder will be named output_data. Run the command to list the contents of the output folder to see four sub-folders along with the workflow log.

$ ls output_data/
anadama.log
humann2
kneaddata
metaphlan2
strainphlan


Each folder contains data files generated by the corresponding tool. The main sub-folders for each tool contain the core outputs from running each tool on all samples.
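The per-task status lines shown in the standard output above have a regular shape, so they can be picked apart with a small regular expression. The parser below is an illustration for post-processing your own logs, not part of the AnADAMA2 package:

```python
import re

# Matches lines like:
# (Jul 18 13:58:42) [19/21 - 90.48%] **Started  ** Task 6: metaphlan2_join_taxonomic_profiles
STATUS = re.compile(
    r"\((?P<time>[^)]+)\) "                     # column 1: time stamp
    r"\[\s*(?P<done>\d+)/(?P<total>\d+) - "     # column 2: tasks done/total
    r"\s*(?P<pct>[\d.]+)%\] "                   # ... and percent complete
    r"\*\*(?P<status>\w+)\s*\*\* "              # column 3: Ready/Started/Completed
    r"Task\s+(?P<task>\d+): (?P<name>\S+)"      # columns 4-5: task number and name
)

def parse_status(line):
    """Return the five columns of a status line as a dict, or None."""
    m = STATUS.match(line)
    return m.groupdict() if m else None
```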

2.3.1 Quality control data

The kneaddata/main folder contains all of the main outputs from running Kneaddata on all samples. The files named kneaddata/main/sample_name.fastq, for example kneaddata/main/HD42R4_subsample.fastq from this tutorial, are the reads for a specific sample that have passed quality control. These are the reads that are used for the next steps in the workflow. For more information on the core output files, see the Kneaddata user manual.

$ ls output_data/kneaddata/main/
HD42R4_subsample.fastq
HD42R4_subsample_Homo_sapiens_demo_bowtie2_contam.fastq
HD42R4_subsample.log
HD42R4_subsample_SILVA_128_LSUParc_SSUParc_ribosomal_RNA_demo_bowtie2_contam.fastq
HD42R4_subsample.trimmed.fastq
HD48R4_subsample.fastq
HD48R4_subsample_Homo_sapiens_demo_bowtie2_contam.fastq
HD48R4_subsample.log
HD48R4_subsample_SILVA_128_LSUParc_SSUParc_ribosomal_RNA_demo_bowtie2_contam.fastq
HD48R4_subsample.trimmed.fastq
... ( a set of five files will be included in this folder for each input file )

The kneaddata/merged folder contains a single merged data file containing read counts for each step of the quality control process for each input file.

2.3.2 Taxonomic profiling data

MetaPhlAn2 is used for taxonomic profiling. For more information on MetaPhlAn2, see the MetaPhlAn2 user manual or the MetaPhlAn2 tutorial.

The metaphlan2/main folder contains all of the main outputs from running MetaPhlAn2 on all samples. The files named sample_name_taxonomic_profile.tsv contain the taxonomic profile for each sample while the files named sample_name_bowtie2.sam contain the alignments of the reads for a specific sample to the MetaPhlAn2 marker database. The taxonomic profiles are used for the functional profiling steps while the marker alignments are used for the strain profiling steps. For more information on the core output files, see the MetaPhlAn2 user manual.

$ ls output_data/metaphlan2/main/
HD42R4_subsample_bowtie2.sam
HD42R4_subsample_taxonomic_profile.tsv
HD48R4_subsample_bowtie2.sam
HD48R4_subsample_taxonomic_profile.tsv
... ( a set of two files will be included in this folder for each input file )


The metaphlan2/merged folder contains a file of the merged taxonomic profiles for all samples. It also contains a file of the total number of species identified for each sample.
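Conceptually, the merged file is just the per-sample profiles joined on clade name, with one column per sample and zeros where a clade was not observed. A minimal sketch of that join (illustrative only; the workflow's metaphlan2_join_taxonomic_profiles task does this on the real files, and the header name here is my own):

```python
def merge_profiles(profiles):
    """Join per-sample {clade: abundance} profiles into one table.

    `profiles` maps sample name -> dict of clade -> relative abundance.
    Returns a list of rows; the first row is the header. Illustrative
    sketch of what the merged taxonomic profile contains, not the
    workflow's actual implementation."""
    clades = sorted({c for p in profiles.values() for c in p})
    samples = sorted(profiles)
    rows = [["#SampleID"] + samples]
    for clade in clades:
        # missing clades get an abundance of 0.0 for that sample
        rows.append([clade] + [profiles[s].get(clade, 0.0) for s in samples])
    return rows
```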

2.3.3 Functional profiling data

HUMAnN2 is used for functional profiling. For more information on HUMAnN2, see the HUMAnN2 user manual or the HUMAnN2 tutorial.

The humann2/main folder contains all of the output files from running the core HUMAnN2 software. Each input file will generate a file of gene family and pathway abundances, pathway coverage, a log, and a folder of intermediate output files. For details on each of the main HUMAnN2 output files, see the HUMAnN2 user manual section on output files.

$ ls output_data/humann2/main/
HD42R4_subsample_genefamilies.tsv
HD42R4_subsample_humann2_temp
HD42R4_subsample.log
HD42R4_subsample_pathabundance.tsv
HD42R4_subsample_pathcoverage.tsv
HD48R4_subsample_genefamilies.tsv
HD48R4_subsample_humann2_temp
HD48R4_subsample.log
HD48R4_subsample_pathabundance.tsv
HD48R4_subsample_pathcoverage.tsv
... ( a set of four files and a folder will be included in this folder for each input file )

The humann2/merged folder contains the gene families, ecs, and pathways files for all samples merged into a single file. The data sets normalized to relative abundance (relab) are included in this folder. The files for each individual sample are located in the humann2/regrouped and humann2/relab folders.

$ ls output_data/humann2/merged/
ecs_relab.tsv
ecs.tsv
genefamilies_relab.tsv
genefamilies.tsv
pathabundance_relab.tsv
pathabundance.tsv


The humann2/counts folder contains feature counts of gene families, ecs, and pathways for all samples. It also contains a file of total species identified after filtering and total reads aligning (for nucleotide and translated search) for each sample.
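The relab tables listed above contain the same values as the unnormalized tables, rescaled so each sample's abundances sum to 1. A minimal sketch of that normalization for a single sample (illustrative only; the workflow's humann2_renorm_* tasks perform this on the real output files):

```python
def renorm_relab(abundances):
    """Rescale one sample's {feature: abundance} values so they sum
    to 1, i.e. convert raw abundances to relative abundances.

    Illustrative sketch of what the *_relab tables contain, not the
    workflow's actual renormalization code."""
    total = sum(abundances.values())
    if total == 0:
        # avoid dividing by zero for an empty sample
        return {k: 0.0 for k in abundances}
    return {k: v / total for k, v in abundances.items()}
```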

2.3.4 Strain profiling data

StrainPhlAn is used for strain profiling. For more information on StrainPhlAn, see the StrainPhlAn user manual or the StrainPhlAn tutorial.

The strainphlan folder contains all of the core outputs from running the StrainPhlAn software. For more information on the core output files, see the StrainPhlAn user manual. By default, at most ten species identified in the set of samples are profiled. The RAxML files contain the trees generated for each species. If the markers are not present in sufficient quantities for these ten species, there might not be any trees included in this folder.

2.3.5 Log file

The workflow log contains all of the details of the workflow execution, including software versions, commands run, and task benchmarking information if the workflow was run in a grid computing environment. All of the command line options are captured in the log. The log is appended to, so if you were to run the workflow twice you would see two log entries, one for each workflow run.

The beginning of a log file includes the workflow settings. It looks like the following:

2017-07-18 15:30:41,946 LoggerReporter  started INFO: Beginning AnADAMA run with 37 tasks.
2017-07-18 15:30:41,946 LoggerReporter  started INFO: Workflow configuration options
2017-07-18 15:30:41,946 LoggerReporter  started INFO: strain_profiling_options =
2017-07-18 15:30:41,946 LoggerReporter  started INFO: grid_benchmark = off
2017-07-18 15:30:41,946 LoggerReporter  started INFO: grid_partition =
2017-07-18 15:30:41,946 LoggerReporter  started INFO: pair_identifier = .R1
2017-07-18 15:30:41,947 LoggerReporter  started INFO: log_level = INFO
2017-07-18 15:30:41,947 LoggerReporter  started INFO: input_extension = fastq.gz
2017-07-18 15:30:41,947 LoggerReporter  started INFO: dry_run = False
2017-07-18 15:30:41,947 LoggerReporter  started INFO: exclude_target = None
2017-07-18 15:30:41,947 LoggerReporter  started INFO: until_task = None


Each log line includes a time stamp.

Here are two additional log file lines. The first captures the version of software used by the workflow and the second prints the command run for a task.

2017-07-18 15:30:42,988 LoggerReporter  task_command    INFO: Tracked executable version:  kneaddata v0.6.1
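Version lines like the one above are easy to extract when auditing which tool versions produced a set of results. The parser below is an illustration, not part of AnADAMA2:

```python
import re

def tracked_version(log_line):
    """Pull (tool, version) out of a 'Tracked executable version'
    log line, or return None if the line is not a version line.

    Illustrative parser for post-processing the workflow log."""
    m = re.search(r"Tracked executable version:\s+(\S+) v([\d.]+)", log_line)
    return (m.group(1), m.group(2)) if m else None
```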


2.4 Visualization

Each bioBakery data processing workflow has a corresponding visualization workflow. The visualization workflow uses the products of the data processing workflow to generate a report in pdf or html format.

2.4.1 Input files

The input files to the visualization workflow are the output files from the data processing workflow. If you have finished running the workflow on the full set of subsampled files, you have all of the input files required to run the visualization workflow. If you only ran with a few input files to save time, or you were not able to run the strain profiling tasks, you can download the output files from the data processing workflow output files section.

2.4.2 Run the workflow

The command to run the visualization workflow is similar to the command to run the data processing workflow. However, with this command you will also need to provide a name for the project. The project name will be included in the title section of the report generated.

Run the following command to start the workflow:

biobakery_workflows wmgx_vis --input output_data --output output_vis --project-name tutorial

All input files are located in the folder output_data and all output files will be written to the folder output_vis. The folder output_data contains the output files from running the data processing workflow.

2.4.3 Output files

The main output file is a report. By default the report is in pdf format. To generate a html report, add the option --output-format html. The sub-folder named figures in the output folder contains all of the figures in the report. All tables in the report are located in the sub-folder named data. The links in the report to data files all link to files in this folder. A zip archive will be generated with the same name as the output folder. This zip archive is useful if transferring the visualization products because the links in the report (pdf and html format) will be preserved.

The report is over 20 pages for this demo set. The report sections include 1) quality control, 2) taxonomic profiling, 3) functional profiling, and 4) workflow information. The workflow information section contains all of the commands run by the workflow. This can be excluded from the report by adding the option --exclude-workflow-info. For workflows with hundreds of input files this section can be very long since hundreds of commands will be included.

Page 1

The first page of the report includes the project name and date. It also includes a table of contents with links to all of the sections of the report.

Section 1

The first section includes read counts for the quality control workflow tasks. A table of read counts for each sample for each filtering step is included along with a barplot of this data. Links to the data files appear below each table in the report.

Section 2

The second section includes taxonomic profiling data. A table of species counts along with an ordination, heatmap, and barplot of species abundances are included.
Section 3

The third section includes functional profiling information. Heatmaps and tables of pathway abundances are included along with scatter plots of features and alignment rates.

Section 4

The final section includes information from the workflow. A list of software versions along with all commands run are included in this section.

3. 16s profiling

16s sequencing reads can be processed through the workflow with a single command. The workflow will run quality control, taxonomic profiling, and functional profiling for the full set of samples.

There are two methods to select from for 16s data sets: UPARSE (VSEARCH or USEARCH) and DADA2. They both require the same type of input files and generate similar output files. You can select which method to run by adding the --method <vsearch/usearch/dada2> option to your command.

3.1 Input files

For the 16s workflow the raw files can be single or paired-end and be multiplexed or de-multiplexed. The raw files can be fastq or fastq.gz formatted. By default fastq.gz files are expected. If paired-end, the default pair identifier is _R1_001. To change the default for the pair identifiers, use the option --pair-identifier <R1>.

If the files are multiplexed, a barcode file must be included in the input file set by adding the option --barcode-file <file>. An index file is also required. To change the default index file identifier, add the option --index-identifier <_I1_001>.

For this tutorial, we will be using a set of sub-sampled data to reduce run time. These files are from the data set generated for the following study: Melanie Schirmer, Umer Z. Ijaz, Rosalinda D'Amore, Neil Hall, William T. Sloan, Christopher Quince; Insight into biases and sequencing errors for amplicon sequencing with the Illumina MiSeq platform, Nucleic Acids Research, Volume 43, Issue 6, 31 March 2015, Pages e37, https://doi.org/10.1093/nar/gku1341.

Download all of the files to run the full tutorial.
If you have limited time, only download one or two pairs of files. Place these files in a folder named input in your current working directory.

3.2 Run the workflow

Run the following command to start the workflow:

biobakery_workflows 16s --input input --output output_data


All input files are located in the folder input and all output files will be written to the folder output_data.

By default the UPARSE method will be run. To run the DADA2 method, add the option --method dada2. By default the DADA2 method will use the SILVA databases. To select a different database add the option --dada-db <silva/gg/rdp>. If you have more time or cores to use to run the tutorial, see the prior section in this tutorial about scaling up.

See the prior section on standard output for the details on the output written to standard out for the bioBakery workflows.

3.2.1 Expected run time

The run time will vary depending on the number of input files and the workflow options. All run time estimates are based on the use of a single core.

• Run a single sample
• 5 minutes: UPARSE method
• 5 minutes: DADA2 method with green genes or rdp databases
• Run all samples
• 10 minutes: UPARSE method
• 10 minutes: DADA2 method with green genes or rdp databases
• 60 minutes: DADA2 method with silva database

3.3 UPARSE output files

The UPARSE method follows the UPARSE pipeline plus functional profiling with PICRUSt and phylogenetic tree generation with FastTree.

All output files will be located in the output folder specified with --output <folder> when running the workflow. If you used the same command as included above, the output folder will be named output_data.

Run the command to list the contents of the output folder to see one sub-folder along with the workflow log.

$ ls output_data
merged_renamed
all_samples_categorize_by_function.biom
all_samples_categorize_by_function.tsv
all_samples_closed_reference.fasta
all_samples_clustalo_aligned_closed_reference.fasta
all_samples_clustalo_aligned_nonchimera.fasta
all_samples_concatenated_discarded.fasta
all_samples_concatenated_filtered.fasta
all_samples_concatenated_truncated.fasta
all_samples_concatenated.fasta
all_samples_concatenated.fastq
all_samples_denovo_otu_table.tsv
all_samples_dereplicated.fasta
all_samples_eestats2.txt
all_samples_green_genes_mapping_results.tsv
all_samples_green_genes_mapping_results.uc
all_samples_normalize_by_copy_number.biom
all_samples_open_reference.fasta
all_samples_otu_mapping_results.tsv
all_samples_otu_mapping_results.uc
all_samples_otus_nonchimeras.fasta
all_samples_predict_metagenomes.biom
all_samples_predict_metagenomes.tsv
all_samples_read_counts.tsv
all_samples_sorted.fasta
all_samples_taxonomy_closed_reference.biom
all_samples_taxonomy_closed_reference.tsv
all_samples_taxonomy_open_reference.tsv
all_samples_uparse_otus.txt
anadama.log
closed_reference.tre

The merged_renamed folder contains merged pairs with sequences renamed to sample names.

3.3.1 Quality control data

Quality of the reads is summarized in all_samples_eestats2.txt. Read counts of original, mapped, and unclassified reads are included in all_samples_read_counts.tsv. For quality control, all reads from all samples are filtered and truncated. All reads, with sequence ids corresponding to sample names, are in all_samples_concatenated.fastq.

all_samples_concatenated_discarded.fasta
all_samples_concatenated_filtered.fasta
all_samples_concatenated_truncated.fasta
all_samples_concatenated.fasta
all_samples_concatenated.fastq
all_samples_dereplicated.fasta
all_samples_sorted.fasta

3.3.2 OTU mapping data

The reads are then clustered, chimeras are removed, and an OTU table is created. Details of the mapping of reads to OTUs can be found in the following files.
all_samples_denovo_otu_table.tsv
all_samples_otu_mapping_results.tsv
all_samples_otu_mapping_results.uc
all_samples_otus_nonchimeras.fasta
all_samples_uparse_otus.txt

3.3.3 Taxonomic profiling data

The following files are the closed and open reference tables after assigning taxonomy using the green genes database. The closed reference table is also converted to biom format for further processing by PICRUSt.

all_samples_closed_reference.fasta
all_samples_green_genes_mapping_results.tsv
all_samples_green_genes_mapping_results.uc
all_samples_open_reference.fasta
all_samples_taxonomy_closed_reference.biom
all_samples_taxonomy_closed_reference.tsv
all_samples_taxonomy_open_reference.tsv

3.3.4 Phylogenetic tree

Clustal Omega’s multiple sequence alignment algorithm is then used to construct the msa files, and FastTree is used to construct the phylogenetic tree.

all_samples_clustalo_aligned_closed_reference.fasta
all_samples_clustalo_aligned_nonchimera.fasta
closed_reference.tre

3.3.5 Functional profiling data

Metagenome functional predictions can be found in these files, generated by PICRUSt’s predict metagenomes and categorize by function modules.

all_samples_predict_metagenomes.biom
all_samples_predict_metagenomes.tsv
all_samples_categorize_by_function.biom
all_samples_categorize_by_function.tsv

3.4 DADA2 output files

The DADA2 method follows the DADA2 pipeline plus phylogenetic tree generation with FastTree.

All output files will be located in the output folder specified with --output <folder> when running the workflow. If you used the same command as included above, the output folder will be named output_data. Run the following command to list the contents of the output folder to see the output files along with the workflow log.

ls output_data
Error_rates_per_sample_FWD.png
Error_rates_per_sample_REV.png
all_samples_SV-counts.tsv
all_samples_clustalo_aligned_nonchimera.fasta
all_samples_taxonomy_closed_reference.tsv
all_samples_taxonomy_closed_reference_taxcolumns.tsv
all_samples_taxonomy_closed_reference_withseqs.tsv
closed_reference.tre
error_ratesFWD.rds
error_ratesREV.rds
filtered_input
mergers.rds
seqtab_final.rds
sequences.fasta


3.4.1 Quality control data

The following plots describe the quality of the reads for each sample. They show the learned error rates per sample for the forward and reverse reads.

Error_rates_per_sample_FWD.png
Error_rates_per_sample_REV.png


The workflow tracks read counts at each step of the processing. The counts are in the following file:

Read_counts_after_filtering.tsv


3.4.2 Taxonomic profiling data

The Amplicon Sequence Variant (ASV) table generated by the DADA2 workflow uses sequences as ids. It is the equivalent of a high-resolution OTU table.

all_samples_SV-counts.tsv


The 16s DADA2 workflow gives the option to assign taxonomy, including species level, using the silva, green genes, or rdp database. The following files are closed reference tables after assigning taxonomy, in 3 different formats. DADA2 uses sequences as ids and has a separate column for each taxonomic level. For users' convenience and consistency with the other workflows, we generate a version of the table with all taxonomic levels combined in one column, and a version where the ids are not sequences but short strings ASV1, ASV2, ... similar to OTU1, OTU2, ....

all_samples_taxonomy_closed_reference.tsv
all_samples_taxonomy_closed_reference_taxcolumns.tsv
all_samples_taxonomy_closed_reference_withseqs.tsv


3.4.3 Phylogenetic tree

All sequences found in samples after merging pairs, de-replicating and removing chimeras are saved in fasta format.

sequences.fasta


Clustal Omega’s multiple sequence alignment algorithm is then used to construct an msa file, and FastTree is used to construct the phylogenetic tree.

all_samples_clustalo_aligned_nonchimera.fasta
closed_reference.tre


3.4.4 RDS files

The following R native format RDS files are also available in the output folder:

Read_counts_filt.rds       read counts after filtering
error_ratesFWD.rds         error rates by sample, forward reads
error_ratesREV.rds         error rates by sample, reverse reads
mergers.rds                merged pairs
seqtab_final.rds           ASV table after merging, de-replicating and removing chimeras


3.5 Visualization

Each bioBakery data processing workflow has a corresponding visualization workflow. The visualization workflow uses the products of the data processing workflow to generate a report in pdf or html format.

3.5.1 Input files

The input files to the visualization workflow are the output files from the data processing workflow. If you have finished running the workflow on the full set of subsampled files, you have all of the input files required to run the visualization workflow. If you only ran with a few input files to save time, you can download the output files from the prior section of this tutorial.

3.5.2 Run the workflow

The command to run the visualization workflow is similar to the command to run the data processing workflow. However, with this command you will also need to provide a name for the project. The project name will be included in the title section of the report generated.

Run the following command to start the workflow:

$ biobakery_workflows 16s_vis --input output_data --output output_vis --project-name tutorial


All input files are located in the folder output_data and all output files will be written to the folder output_vis. The folder output_data contains the output files from running the data processing workflow.

The same command is run for both methods of processing the 16s data: USEARCH and DADA2. The report contents will differ slightly based on the two methods but will contain the same general core of figures and tables.

3.5.3 Output files

The main output file is a report. By default the report is in pdf format. To generate a html report, add the option --output-format html. The sub-folder named figures in the output folder contains all of the figures in the report. All tables in the report are located in the sub-folder named data. The links in the report to data files all link to files in this folder. A zip archive will be generated with the same name as the output folder. This zip archive is useful if transferring the visualization products because the links in the report (pdf and html format) will be preserved.

The report is over 20 pages for this demo set. The report sections include 1) quality control 2) taxonomic profiling and 3) workflow information. The workflow information section contains all of the commands run by the workflow. This can be excluded from the report by adding the option --exclude-workflow-info. For workflows with hundreds of input files this section can be very long since hundreds of commands will be included.

Page 1

The first page of the report includes the project name and date. It also includes a table of contents with links to all of the sections of the report.

Initial sections

The first couple of sections include read counts for the quality control workflow tasks.

Core sections

The core sections include taxonomic profiling data. A table of species counts along with an ordination, heatmap and barplot of species abundances are included.

Final section

The final section includes information from the workflow. A list of software versions along with all commands run are included in this section.
