Create Sample Sheet

The createsamplesheet.pl script automatically creates a sample sheet file for a folder of fastq files.

The Script

The script performs the following steps:

  1. Identifies fastq files in the fastq folder
  2. Determines whether the files are uncompressed or gzip-compressed
  3. Determines whether the folder contains paired-end or single-end data
  4. Parses the sample name of each sample from the fastq filename
  5. Reads primers from the primer file and converts bases other than AGCTagct to N (in qiime mode)
  6. Uses Cutadapt to count how many times each primer is seen in the first 4000 reads of each file (configurable using the -s option) (in qiime mode)
  7. Creates a samplesheet using the most frequent primer for each sample (in qiime mode)
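
The primer handling in steps 5 to 7 can be illustrated with a short Python sketch. This is not the script's own code: createsamplesheet.pl is a Perl script that calls Cutadapt, whereas the sketch below substitutes a simple "read starts with primer" regular-expression match for Cutadapt. The function names, the primer dictionary, and the rule that N matches any base are assumptions made for illustration.

```python
import re
from collections import Counter
from itertools import islice


def normalize_primer(seq):
    """Convert every base other than A, G, C, T (either case) to N."""
    return re.sub(r"[^AGCTagct]", "N", seq)


def most_frequent_primer(reads, primers, max_reads=4000):
    """Return the name of the primer that starts the most reads.

    reads   -- iterable of read sequences (strings)
    primers -- dict mapping a hypothetical primer name to its sequence
    Only the first max_reads reads are examined.
    Returns None if no primer matches any read.
    Assumption: prefix matching stands in for Cutadapt here.
    """
    # Assumption: an N in the primer is allowed to match any base in the read.
    patterns = {
        name: re.compile(normalize_primer(seq).upper().replace("N", "[ACGTN]"))
        for name, seq in primers.items()
    }
    counts = Counter()
    for read in islice(reads, max_reads):
        for name, pattern in patterns.items():
            if pattern.match(read.upper()):
                counts[name] += 1
    return counts.most_common(1)[0][0] if counts else None
```

Limiting the count to the first few thousand reads keeps the check fast on large files, which is presumably why the script samples only the start of each file.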

Sample names are not modified to be compatible with Qiime.

Input

Options for createsamplesheet.pl

-f folder A folder containing fastq files to process
-o file Name of the output samplesheet file
-z fastq files are gzip compressed (filenames end with fastq.gz)
-h Print usage instructions and exit
-v Print more information while running (verbose)

Fastq file support: Folders containing either paired-end or single-end fastq files are supported. Files may be compressed (".fastq.gz") or uncompressed, but not a mix of both. Fastq files must have Illumina-formatted names or names of the form sample_*_R1_*.fastq or sample_*_R1.fastq. Otherwise this script cannot determine the sample name of each file or which files are R1 reads and which are R2. In that case, a samplesheet must be created by hand or the files renamed to a parseable format.
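
To check in advance whether a set of filenames is likely to be parseable, a rough Python sketch of the naming rules above may help. It is an approximation, not the script's actual parser: the regular expression, the function name, and the assumption that the sample name is the text before the first underscore are all illustrative.

```python
import re

# Assumed pattern: sample name up to the first underscore, an optional middle
# part, an R1 or R2 token, an optional trailing part, then .fastq or .fastq.gz.
FASTQ_NAME = re.compile(
    r"^(?P<sample>[^_]+)_(?:.*_)?(?P<read>R[12])(?:_[^.]*)?\.fastq(?P<gz>\.gz)?$"
)


def scan_fastq_names(filenames):
    """Group fastq filenames by sample and report layout and compression.

    Returns (samples, paired, gzipped), where samples maps each sample name
    to a dict such as {"R1": filename, "R2": filename}.
    Raises ValueError for unparseable names or mixed compression.
    """
    samples = {}
    compressed = set()
    for name in sorted(filenames):
        match = FASTQ_NAME.match(name)
        if match is None:
            raise ValueError(f"cannot parse fastq filename: {name}")
        compressed.add(match.group("gz") is not None)
        samples.setdefault(match.group("sample"), {})[match.group("read")] = name
    if len(compressed) > 1:
        raise ValueError("mix of compressed and uncompressed fastq files")
    paired = any("R2" in reads for reads in samples.values())
    return samples, paired, compressed == {True}
```

A filename that raises ValueError here is a candidate for renaming before running createsamplesheet.pl, though the script's own rules may be stricter or looser than this sketch.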

Output

Sample sheet file
Named "samplesheet.txt" by default

Columns in the sample sheet file

#sample
The first column in the file. Contains the sample name as parsed from the fastq file name, with forbidden characters converted to "."
fastqR1
Name of the R1 fastq file
fastqR2
Name of the R2 fastq file, only present for paired-end datasets
BarcodeSequence
Qiime-specific column, left empty
LinkerPrimerSequence
Qiime-specific column, contains R1 primer sequence
ReversePrimer
Qiime-specific column, contains R2 primer sequence, only present for paired-end datasets
Group
The first two characters of the sample name (often sufficient to split the samples into appropriate experimental groups)
Description
Contains the sample name before forbidden characters were converted
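
As an illustration of how the #sample, Group, and Description columns relate to one another, here is a small Python sketch. The set of forbidden characters is not specified on this page, so the sketch assumes that anything other than letters, digits, and "." is forbidden. It also assumes that Group is taken from the cleaned name. The function name is hypothetical.

```python
import re


def sample_columns(raw_name):
    """Derive the #sample, Group, and Description values for one sample.

    Assumption: characters other than letters, digits, and "." are forbidden
    and are each converted to ".".
    """
    clean = re.sub(r"[^A-Za-z0-9.]", ".", raw_name)
    return {
        "#sample": clean,          # cleaned sample name
        "Group": clean[:2],        # first two characters (assumed: of the cleaned name)
        "Description": raw_name,   # original name, before conversion
    }
```

For a sample parsed as "WT-rep1", this yields a #sample value of "WT.rep1", a Group of "WT", and a Description of "WT-rep1".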

Running the program

Load necessary software modules:

$ module load riss_util

Run the script. You must specify the location of a folder containing fastq files to process using the "-f" option:

$ createsamplesheet.pl -f /path/to/fastq/folder

Support

If you are having issues, please contact John Garbe at jgarbe@umn.edu
