Create Sample Sheet

The createsamplesheet.pl script automatically creates a sample sheet file for a folder of fastq files.

The Script

The script performs the following steps:

Identifies fastq files in the fastq folder
Determines if the files are uncompressed or gzip compressed
Determines if the folder is paired-end or single-end
For each sample, the sample name is parsed from the fastq filename
Primers are read in from the primer file; bases other than AGCTagct are converted to N (in qiime mode)
Cutadapt counts how many times each primer appears in the first 4000 reads of each file (configurable with the -s option) (in qiime mode)
A samplesheet is created using the most frequent primer for each sample (in qiime mode)

Sample names are modified to be compatible with Qiime: forbidden characters are converted to ".".
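
The primer-handling steps above can be illustrated with a minimal Python sketch. It is not the script's own code (createsamplesheet.pl is written in Perl and relies on Cutadapt for counting). The function names, the example primer, and the example counts are hypothetical; the counts stand in for whatever Cutadapt reports for one sample.

```python
import re
from typing import Dict, Optional


def sanitize_primer(seq: str) -> str:
    """Replace every character other than A, G, C, T (either case) with N."""
    return re.sub(r"[^AGCTagct]", "N", seq)


def most_frequent_primer(counts: Dict[str, int]) -> Optional[str]:
    """Return the primer name with the highest count, or None if no primer was seen."""
    if not counts or max(counts.values()) == 0:
        return None
    # On a tie, max() returns the first name encountered (an arbitrary choice here).
    return max(counts, key=counts.get)


# Hypothetical primer containing the ambiguity codes R and Y.
print(sanitize_primer("ACGTRYACGT"))

# Hypothetical per-primer hit counts for one sample.
print(most_frequent_primer({"primerA": 3120, "primerB": 45}))
```

Ambiguity codes such as R or Y are collapsed to N rather than expanded, which matches the conversion described in the step list.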

Input

Options for createsamplesheet.pl

`-f folder`	A folder containing fastq files to process
`-o file`	Name of the output samplesheet file
`-z`	fastq files are gzip compressed (filenames end with fastq.gz)
`-h`	Print usage instructions and exit
`-v`	Print more information while running (verbose)

Fastq file support: Folders containing either paired-end or single-end fastq files are supported. Files may be compressed (.fastq.gz) or uncompressed, but not a mix of both. Fastq files must have Illumina-formatted names, or names of the form sample_*_R1_*.fastq or sample_*_R1.fastq. Otherwise the script cannot determine the sample name of each file, or which files contain R1 reads and which contain R2 reads. In that case, create the samplesheet by hand or rename the files to a parseable format.
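
The naming rules above can be sketched in Python as follows. This is an illustration only, not the script's actual parsing logic. The regular expression, the function names, and the choice to treat everything before the first underscore as the sample name are assumptions made for the example.

```python
import re
from typing import Iterable, Optional, Tuple

# Assumed pattern: sample_*_R1_*.fastq or sample_*_R1.fastq, optionally ending in .gz.
# The sample name is assumed to be the text before the first underscore.
FASTQ_NAME = re.compile(
    r"^(?P<sample>[^_]+)_.*?_(?P<read>R[12])(?:_[^_]*)?\.fastq(?P<gz>\.gz)?$"
)


def parse_fastq_name(filename: str) -> Optional[Tuple[str, str, bool]]:
    """Return (sample, read, is_gzipped), or None if the name is not parseable."""
    m = FASTQ_NAME.match(filename)
    if m is None:
        return None
    return m.group("sample"), m.group("read"), m.group("gz") is not None


def describe_folder(filenames: Iterable[str]) -> Tuple[str, bool]:
    """Return (layout, is_gzipped) for a folder listing, rejecting unsupported folders."""
    parsed = [parse_fastq_name(name) for name in filenames]
    if not parsed or any(p is None for p in parsed):
        raise ValueError("unparseable fastq filename: rename files or write the samplesheet by hand")
    compressed = {p[2] for p in parsed}
    if len(compressed) > 1:
        raise ValueError("mix of compressed and uncompressed fastq files")
    layout = "paired-end" if any(p[1] == "R2" for p in parsed) else "single-end"
    return layout, compressed.pop()


print(parse_fastq_name("SampleA_S1_L001_R1_001.fastq.gz"))
print(describe_folder(["A_S1_L001_R1_001.fastq.gz", "A_S1_L001_R2_001.fastq.gz"]))
```

A name such as reads.fastq matches neither form, so parse_fastq_name returns None and describe_folder raises an error, mirroring the case where the samplesheet must be created by hand.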

Output

Sample sheet file: Named "samplesheet.txt" by default

Columns in the sample sheet file

#sample: The first column in the file; contains the sample name, as parsed from the fastq file name, with forbidden characters converted to "."
fastqR1: Name of the R1 fastq file
fastqR2: Name of the R2 fastq file, only present for paired-end datasets
BarcodeSequence: Qiime-specific empty column
LinkerPrimerSequence: Qiime-specific column, contains R1 primer sequence
ReversePrimer: Qiime-specific column, contains R2 primer sequence, only present for paired-end datasets
Group: The first two characters of the sample name (often sufficient to split the samples into appropriate experimental groups)
Description: Contains the sample name before forbidden characters were converted
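
As a rough illustration of how these columns relate to one another, the Python sketch below builds a single paired-end row. Several details are assumptions rather than documented behavior: that any character other than a letter, digit, or "." counts as forbidden, that Group is taken from the converted sample name, and that the file is tab-delimited. The function names, sample name, file names, and primer sequences are hypothetical.

```python
import re
from typing import Dict


def sanitize_sample_name(name: str) -> str:
    """Convert forbidden characters to "." (assumed: anything but letters, digits, and ".")."""
    return re.sub(r"[^A-Za-z0-9.]", ".", name)


def samplesheet_row(name: str, fastq_r1: str, fastq_r2: str,
                    r1_primer: str, r2_primer: str) -> Dict[str, str]:
    """Build one paired-end samplesheet row, keyed by column name."""
    sample = sanitize_sample_name(name)
    return {
        "#sample": sample,
        "fastqR1": fastq_r1,
        "fastqR2": fastq_r2,
        "BarcodeSequence": "",           # Qiime-specific empty column
        "LinkerPrimerSequence": r1_primer,
        "ReversePrimer": r2_primer,
        "Group": sample[:2],             # first two characters of the sample name
        "Description": name,             # sample name before conversion
    }


row = samplesheet_row(
    "WT-01",
    "WT-01_S1_L001_R1_001.fastq",
    "WT-01_S1_L001_R2_001.fastq",
    "ACGTNNACGT",
    "TTGCANNGCA",
)
# Assumed tab-delimited layout: header line followed by one line per sample.
print("\t".join(row.keys()))
print("\t".join(row.values()))
```

For a single-end dataset the fastqR2 and ReversePrimer entries would simply be left out, since those columns are only present for paired-end data.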

Running the program

Load necessary software modules:

$ module load riss_util

Run the script. You must specify the location of a folder containing fastq files to process using the "-f" option:

$ createsamplesheet.pl -f /path/to/fastq/folder

Support

If you are having issues, please contact John Garbe at jgarbe@umn.edu
