
Gopher-pipelines Documentation

Contents:

  • align-pipeline
    • FastQC - (Trimmomatic) - BWA or Bowtie2 - Sort and index bam - (Remove duplicates)
  • rnaseq-pipeline
    • FastQC - (Trimmomatic) - Hisat2 or Tophat2 - Subread featureCounts - Cuffquant - Cuffnorm
  • qiime2-pipeline
    • Adapter trim - Read stitching - Chimera detection - Host contamination - Open reference OTUs - Alpha rarefaction - Beta diversity
  • shotgun-pipeline
    • OTU table - Taxonomic classification - Alpha rarefaction - Beta diversity
  • createsamplesheet

Running a pipeline

Log in to MSI

  • Open a terminal window (OSX) or PuTTY (Windows, www.putty.org)
  • Open an SSH connection to MSI (replace USERNAME with your MSI username):

    $ ssh USERNAME@login.msi.umn.edu

  • Log on to the Mesabi supercomputer:

    $ ssh mesabi

Input experimental metadata (optional)

Providing experimental metadata (information about each sample, such as treatment, group, age, gender, individualID, or collection date) to the pipeline will result in more informative output.

Load the gopher-pipelines module:

 $ module load umgc
 $ module load gopher-pipelines

Generate a samplesheet:

 $ createsamplesheet.pl -f /path/to/fastq/folder -o samplesheet.txt

Edit the tab-delimited samplesheet.txt with a text editor, adding additional columns containing metadata about each sample.

When running the pipeline, pass samplesheet.txt to it using the --samplesheet option:

 --samplesheet samplesheet.txt
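
As a sketch, a samplesheet extended with metadata columns might look like the following. The column names treatment and age, and all of the values, are illustrative assumptions, not required names; the only requirement stated above is that columns are tab-delimited:

```shell
# Build a hypothetical tab-delimited samplesheet (the treatment and age
# columns are illustrative assumptions added alongside the generated ones).
printf 'sample\tfastq\ttreatment\tage\n'    >  samplesheet.txt
printf 's1\ts1_R1.fastq.gz\tcontrol\t12\n'  >> samplesheet.txt
printf 's2\ts2_R1.fastq.gz\tdrug\t14\n'     >> samplesheet.txt

# Quick sanity check: show the sample and treatment columns.
cut -f1,3 samplesheet.txt
```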

Select a reference genome (optional)

Gopher-pipelines comes with a selection of reference genomes and annotations from Ensembl, which can be loaded with the "module load ensembl" command. Each species and genome build is available as a separate module. Run "ensembl" to get a list of available reference genomes. Species are named Genus_species, and most have a common-name alias. Each species has at least one genome build available. To use a reference genome with gopher-pipelines, simply load the appropriate module before running a pipeline:

$ module load gopher-pipelines
$ module load ensembl
$ module load human
$ align-pipeline --fastqfolder /path/to/fastq/folder

These three module load commands are equivalent:

$ module load human
$ module load Homo_sapiens
$ module load Homo_sapiens/GRCh38

If you would like to specify a reference of your own, gopher-pipelines supports these options (not all pipelines require all options):

--referencefasta /path/to/reference/genome/fasta
--bwaindex /path/to/bwa/index
--bowtie2index /path/to/bowtie2/index
--hisat2index /path/to/hisat2/index
--gtffile /path/to/gtf/annotation/file
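
As an example, a hypothetical align-pipeline run against a custom reference might combine these options as follows. The paths are placeholders, and which options a given pipeline actually requires varies; check the pipeline-specific documentation:

$ align-pipeline --samplespernode 8 --fastqfolder /path/to/fastq/folder \
    --referencefasta /path/to/reference/genome/fasta \
    --bwaindex /path/to/bwa/index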

Launch an analysis job

.. note:: Interactive and submitted jobs may start running immediately, or, if Mesabi is very busy, a job may wait in the queue for several hours until resources are available to run it.

Interactive
...........

Start an interactive job on Mesabi:

 $ qsub -I -l walltime=8:00:00,nodes=1:ppn=24

Load necessary software modules:

 $ module load umgc
 $ module load human
 $ module load gopher-pipelines

Run the pipeline. You must specify the location of a folder containing fastq files using the "--fastqfolder" option. Specify how many samples to process at a time using the "--samplespernode" option (recommended value for Mesabi: 8). Each pipeline may have additional parameters to specify; refer to the pipeline-specific documentation for details. An example align-pipeline command is shown here:

$ align-pipeline --samplespernode 8 --fastqfolder /path/to/fastq/folder

Submit Job
..........

A PBS script can be submitted to a queue, where it will run when resources become available. Create a PBS file named gpipes.pbs containing the following text (adjust the ppn value to request all cores on a node; the pipeline doesn't work well on partial nodes):

#!/bin/bash -l
#PBS -l nodes=1:ppn=24,walltime=24:00:00
#PBS -m abe

cd $PBS_O_WORKDIR

module load umgc
module load gopher-pipelines
module load ensembl
module load human

align-pipeline --samplespernode 8 --fastqfolder /path/to/fastq/folder

Submit the pbs file to the job queue::

   $ qsub gpipes.pbs

You can check the status of jobs by running qstat (replace USERNAME with your MSI username)::

   $ qstat -a -u USERNAME

Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory   Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
1023053.node1081.local  jgarbe      batch    gpipes.pbs           --      1    24   50gb  12:00:00 Q       --

The S column indicates whether a job is Running (R) or Queued (Q).
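
For scripting, the state column can be pulled out of saved qstat output with awk. The job line below is the example from above, and the field position (10) is an assumption based on that particular column layout:

```shell
# Extract the S (state) column from a qstat job line.
# Field 10 assumes the column layout shown in the example above.
line='1023053.node1081.local  jgarbe  batch  gpipes.pbs  --  1  24  50gb  12:00:00  Q  --'
echo "$line" | awk '{print $10}'   # prints "Q" (queued)
```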

Review results

The results of the analysis are located in /panfs/roc/scratch/USERNAME-pipelines/align-RUNNAME/
