metagenomics-pipeline

The Metagenomics pipeline processes a 16S or ITS experiment and generates an analysis report.

The Pipeline

The pipeline performs the following steps:

  1. Subsampling: Each fastq file is reduced to a specified number of reads in order to reduce processing time
  2. Quality Control: Per-base and per-read quality score statistics are calculated for each fastq file
  3. Trimming/Filtering/Converting
    1. Overlapping paired-end reads: Read pairs are stitched together and amplicon primers are removed using PandaSeq. Sequence IDs are converted to Qiime format and fastq files are converted to fasta format.
    2. Non-overlapping paired-end reads: Samples with paired-end reads that don't overlap are treated like single-end reads; the second (R2) read is ignored
    3. Single-end reads: 3' ends are quality trimmed and the amplicon primer is removed. Sequence IDs are converted to Qiime format and fastq files are converted to fasta format (using the Qiime scripts convert_fasta_qual_fastq.py and split_libraries.py).
  4. Fasta merge: The individual sample fasta files are concatenated into one fasta file
  5. Chimera Detection: Chimeras are detected using ChimeraSlayer's usearch61 method and removed.
  6. Host Detection: Contaminating host sequence is identified using Bowtie2 and removed (if the --bowtie2index option is used)
  7. Second Subsampling: Each sample is reduced to a specified number of reads in order to reduce processing time. This second subsampling allows you to ensure that the same number of reads are used from each sample in the OTU picking step
  8. OTU Picking: Qiime's pick_open_reference_otus.py script is used to pick OTUs using usearch61.
  9. Qiime Plots: A series of plots based on the OTU table are generated using Qiime
  10. Beta Diversity: Beta diversity is estimated using Qiime
  11. Alpha Diversity: Alpha diversity is estimated using Qiime
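
The main external tool calls behind these steps can be sketched roughly as follows. This is illustrative only: the file names, output paths, and the exact options the pipeline passes to each tool are assumptions, not the pipeline's real commands.

```
# Step 3.1: stitch read pairs and write fasta output with pandaseq
pandaseq -f sample_R1.fastq -r sample_R2.fastq -w sample.fasta

# Step 5: detect chimeras with the usearch61 method via Qiime
identify_chimeric_seqs.py -m usearch61 -i combined.fasta -o chimeras/

# Step 6: screen for host contamination (only when --bowtie2index is given)
bowtie2 -x host_index -f combined.fasta -S host_hits.sam

# Step 8: open-reference OTU picking with usearch61
pick_open_reference_otus.py -i combined.fasta -o otus/ -m usearch61
```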

Known Issues

  • If you have a .qiime_config file in your home directory you may override some Qiime configuration options that the pipeline requires. I recommend moving (or renaming) your .qiime_config file so Qiime can't find it to ensure the pipeline runs correctly.
  • The pipeline should work with single-end read datasets and non-overlapping paired-end read datasets; however, the pipeline hasn't been extensively tested with these types of datasets, so errors may be encountered. Please report any problems.
  • The pipeline may occasionally print out a verbose warning that begins: "An MPI process has executed an operation involving a call to the fork() system call to create a child process". I have been unable to determine the source of the warning, and I am not aware of the issue raised by the warning affecting the pipeline.
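
For the .qiime_config issue above, moving the file aside before a run can be done like this (the .bak name is just a convention; rename the file back when you are done):

```shell
# If a personal Qiime config exists, move it aside so Qiime can't find it
if [ -f "$HOME/.qiime_config" ]; then
    mv "$HOME/.qiime_config" "$HOME/.qiime_config.bak"
fi
```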

Input

Options for metagenomicsQC

--fastqfolder folder
 A folder containing fastq files to process
--samplesheet file
 A Qiime mapping file
--outputfolder folder
 A folder to deposit final results
--samplespernode integer
 Number of samples to process simultaneously on each node (default = 1)
--threadspersample integer
 Number of threads used by each sample
--subsample integer
 Subsample the specified number of reads from each sample. 0 = no subsampling (default = 0)
--subsampletwo integer
 Subsample the specified number of reads from each filtered sample. 0 = no subsampling (default = 0)
--bowtie2index index
 Bowtie index for (host) contamination detection
--emp
 EMP protocol, or any other protocol where primers are not present in the reads
--crop integer
 Crop the specified number of bases from the start of every read (necessary for the "IIS" library prep method)
--referencefasta file
 A fasta file containing reference sequences (default=greengenes 97_otus.fasta)
--referencetaxonomy file
 An id_to_taxonomy_fp file (default=greengenes 97_otu_taxonomy.txt), see Qiime documentation for details
--referencealignedfasta file
 A pynast_template_alignment_fp file (default=greengenes rep_set_aligned/97_otus.fasta), see Qiime documentation for details
--scratchfolder folder
 A temporary/scratch folder
--mockbowtie2index file
 Mock community bowtie index - this option enables the mock analysis pipeline and skips OTU calling
--pandaa
 Enable the "-a" option in pandaseq to strip primers after assembly
--pandamin integer
 minimum fragment length (post-stitching)
--pandamax integer
 maximum fragment length (post-stitching)
--pandathreshold real
 Pandaseq threshold parameter, a number between 0 and 1. Default is 0.6
--name string
 Name of the analysis (default: the fastqfolder name)
--help
 Print usage instructions and exit
--verbose
 Print more information while running

Fastq file support: Overlapping and non-overlapping paired-end fastq files are supported, as well as single-end fastq files. A minimum read length of 70 is required. Fastq files must be named using the BMGC convention: sample_*_R1_*.fastq or sample_*_R1.fastq
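
A quick way to spot files that don't follow the naming convention is a small helper like the one below. The function is hypothetical (it is not part of the pipeline) and the pattern is my reading of the convention stated above.

```shell
# Returns 0 if a fastq file name follows the BMGC convention
# (sample_*_R1_*.fastq or sample_*_R1.fastq, likewise for R2), 1 otherwise.
check_bmgc_name() {
    case "$(basename "$1")" in
        *_R[12]_*.fastq|*_R[12].fastq) return 0 ;;   # matches the convention
        *) return 1 ;;                               # anything else is suspect
    esac
}
```

For example, check_bmgc_name gut1_S1_R1_001.fastq succeeds, while check_bmgc_name gut1.fastq fails.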

Samplesheet

The samplesheet supplied to the program must be a valid Qiime mapping file containing these columns:

  • SampleID: Name of the sample
  • BarcodeSequence: This column should be blank
  • LinkerPrimerSequence: The forward (R1) 16S primer (blank for primerless protocols such as emp)
  • ReversePrimer: The reverse (R2) 16S primer (blank for primerless protocols such as emp; omit this column for single-end read datasets)
  • fastqR1: Name of the R1 fastq file (just the name, not the full path)
  • fastqR2: Name of the R2 fastq file (just the name, not the full path; blank or omitted for single-end datasets)
  • Description: This must be the final column in the mapping file
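
A minimal tab-delimited samplesheet following these columns might look like the example below. The sample name and file names are made up for illustration, and the primer sequences shown are the common 515F/806R 16S pair, used here purely as an example; substitute your own primers (or leave the columns blank for primerless protocols).

```
#SampleID	BarcodeSequence	LinkerPrimerSequence	ReversePrimer	fastqR1	fastqR2	Description
gut1		GTGCCAGCMGCCGCGGTAA	GGACTACHVGGGTWTCTAAT	gut1_S1_R1_001.fastq	gut1_S1_R2_001.fastq	gut sample
```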

If you don't supply the pipeline with a samplesheet using the --samplesheet option, it will run the createsamplesheet.pl script for you and use the samplesheet it generates. You may wish to run createsamplesheet.pl on your own first and manually edit the result to suit your needs.

Running the pipeline

It is recommended that you run the pipeline interactively using the --subsample option to make sure the pipeline works correctly on a small sample of your data before submitting a job to process your entire dataset. This allows you to identify and solve problems quickly. A MiSeq run subsampled to 1000 reads per sample should complete within five minutes for simple (e.g. gut) samples, and within an hour for complex (e.g. soil) samples.

Log in to MSI

  • Open a terminal window (OS X) or PuTTY (Windows, www.putty.org)

  • Open an SSH connection to MSI (replace USERNAME with your MSI username):

    $ ssh USERNAME@login.msi.umn.edu
    
  • Log on to the Mesabi supercomputer:

    $ ssh mesabi
    

Launch a Metagenomics analysis job

Note

Interactive and submitted jobs may start running immediately, or if Mesabi is very busy a job may wait in line for several hours until resources are available to run the job.

Interactive

Start an interactive job on Mesabi:

$ qsub -I -l walltime=8:00:00,nodes=1:ppn=24

Load necessary software modules:

$ module load gopher-pipelines

Run the script. You must specify the location of a folder containing fastq files to process using the "--fastqfolder" option:

$ metagenomics-pipeline --fastqfolder /path/to/fastq/folder
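
Following the earlier recommendation to test on a small subsample first, the same command can be limited with the --subsample option (1000 reads per sample is an illustrative value):

```
$ metagenomics-pipeline --fastqfolder /path/to/fastq/folder --subsample 1000
```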

Submit Job

A PBS script can be submitted to a queue where it will run when resources are available. Create a pbs file named meta.pbs containing the following text:

#!/bin/bash -l
#PBS -l nodes=1:ppn=24,walltime=8:00:00
#PBS -m abe

cd $PBS_O_WORKDIR

module load gopher-pipelines

metagenomics-pipeline --fastqfolder /path/to/fastq/folder

Submit the pbs file to the job queue:

$ qsub meta.pbs

You can check the status of jobs by running qstat (replace USERNAME with your MSI username):

$ qstat -a -u USERNAME

Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory   Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
1023053.node1081.local  jgarbe      small    meta.pbs            --      1     24   50gb  12:00:00 Q       --

The S column indicates if a job is Running or Queued.

Review results

The results of the analysis are located in /panfs/roc/scratch/USERNAME-pipelines/metagenomics-RUNNAME/output. Download the entire output folder to your local computer and open up the html file to see a summary of the analysis results.

Support

You may contact John Garbe directly at jgarbe@umn.edu; response times may vary.
