Welcome to the Fast-GBS wiki!
This section should allow the user to better understand the terms used throughout the file preparation protocol.
FLOWCELL: Sequencing batch. Originally can contain up to 8 lanes
LANES: File containing multiplexed sequences. Several lanes can be associated to a flowcell.
TECHNOLOGY: Sequencing technology, either Illumina or IonTorrent
The main steps in using Fast-GBS are:
Go to or create the directory in which you wish to perform the analysis and store results
Run the script
./makeDir.shthat will create the following directories:
Prepare the reference genome (as detailed below) and place it in the
- Prepare the barcode files (detailed below) and place them in the
- Prepare the data (see details below) and place these in the
- Prepare the parameters file (see details below) and place it in the same directory in
which you launched the
- Run the script
In order to use Fast-GBS, you will need the following:
- Linux with parallel installed (http://www.gnu.org/software/parallel/)
- Python 2.7 or higher (https://www.python.org/)
- sabre (https://github.com/najoshi/sabre)
- cutadapt (https://github.com/marcelm/cutadapt/)
- bwa (https://github.com/lh3/bwa)
- samtools (http://www.htslib.org/)
- platypus (http://www.well.ox.ac.uk/platypus)
- vcf python module (https://github.com/jamescasbon/PyVCF)
- fastgbs.sh (this distribution)
- parameters.txt (this distribution)
- makeDir.sh (this distribution)
- makeBarcodeSabre.py (this distribution)
- vcf2txt.py (this distribution)
- txt2unix.sh (this distribution)
The fastqc software is not included in the pipeline but it is strongly recommended to perform a quality check on your sequence files with it before starting fastgbs (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
Preparation of the reference genome
Move your reference genome file in the
refgenome directory, move to that directory and
index it with the command:
bwa index -a bwtsw refgenome.fasta
The .fai file isn't created as part of "bwa index." To create it, run the command:
samtools faidx refgenome.fasta
That will create a file named refgenome.fasta.fai.
Finally, write the reference genome file name in the parameter file:
Preparation of the data
Move your sequence files in the
data directory. Each file must be in the form:
Suppose that you want to analyse 6 sequence files over 3 flowcells (HADES, HOMER and ULYSSE) and each flowcell has 2 lanes:
HADES: lanes 1 and 2
HOMER: lanes 3 and 4
ULYSSE: lanes 5 and 6
You will have to name your files in that way:
HADES_1.fq.gz HADES_2.fq.gz HOMER_3.fq.gz HOMER_4.fq.gz ULYSSE_5.fq.gz ULYSSE_6.fq.gz
Finally, write the flowcell and lane names at these places in the parameters file:
; FLOWCELL INFORMATION FLOWCELL=ULYSSE HOMER HADES ; LANES INFORMATION ULYSSE_LANES=5 6 HOMER_LANES=3 4 HADES_LANES=1 2
It is important to keep the extension .fq.gz because, for simplicity, it is hardcoded in the pipeline.
Preparation of the barcode files
You can use any spreadsheet software (openOffice, Excel) to format your barcodes files. If you do so, save the active sheet in text (.txt) format with tab separated columns. Then move your files to your Unix computer. It is possible that formatting (the carriage return) is not recognized by Unix system. To convert your file in ASCII, use the script txt2unix.sh available in this distribution:
./txt2unix.sh your_barcodes_file > new_barcodes_file
Continuing with the same example as above, you will have to create one barcode file per sequence file and name them as below:
barcodes_HADES_1 barcodes_HADES_2 barcodes_HOMER_3 barcodes_HOMER_4 barcodes_ULYSSE_5 barcodes_ULYSSE_6
Do not put any extension, like .txt, to the filenames.
Look at the content of one of these file:
CTATTA BC01 TAACGA BC02 CGTGTGGTB BC03
Format all your barcode files in the same manner, in a tab-separated column. Move these
files in the
Warning: Do not use special characters (such as: * / \ | $ " etc.) in the names given to your samples in the Barcode file. These may cause sabre to crash.
makeBarcodeSabre.py will be called by the pipeline fastgbs.sh and will create
this file for
sabre. The last column now contains the name of the file created by sabre
during the demultiplexing phase:
CTATTA HADES_1_BC01.fq TAACGA HADES_1_BC02.fq CGTGTGGT HADES_1_BC03.fq
Preparation of the parameters file
Verify the information in the file
If necessary, change the value given to the variables.
**It is very important not to change the words in capital letters (before the sign =). If you do this, you will get an error message because the pipeline will not find that variable.
Respect what is an integer and a string.
To run the pipeline, just enter the command:
Values of some variables in the pipeline
The parameter file allows the user to change the value of certain variables in the pipeline. However, the list of variables in the parameter file is far from exhaustive. It is therefore important to consider that several parameters of the software used in this pipeline, are hard-coded. We made this choice to make life easier for less experienced users. In line with this philosophy, when the user chooses the ILLUMINA technology in the parameter file, values are automatically given to the following variables in Platypus:
When the IONTORRENT technology is chosen:
These platypus options are hardcoded in the pipeline:
--maxHaplotypes: 70 --originalMaxHaplotypes: 70 --minFlank: 3 --maxVariants: 10 --badReadsWindow: 5 --largeWindows: 1 --longHaps: 1 --filterReadPairsWithSmallInserts: 0 --minVarFreq: 0.002 --assembleBrokenPairs: 1
However, advanced users can easily edit the fastgbs.sh code and insert the values of their choice.
A common problem that can occur with platypus is where the system says it can not find a
file (BAM) in the sample list to use. This problem is caused by the variable that defines
the number of files that can be opened simultaneously. To get the current value used by
the system, run the command
ulimit -a. This value can be of 1024.
To avoid this problem, we must consider the number of CPUs you give to platypus and the number of samples (bam file) you want to analyze. The number of files that must be opened simultaneously be: CPU X Nb Nb samples.
The number of CPUs determine the number of genomic regions analyzed simultaneously. For each analyzed genomic region, all bam files must be opened by the program. For example, if you have 180 samples and you give 10 CPUs, this is 1800 files to be opened simultaneously. If the limit of your OS is 1024, you will get an error message. In this case you can give 1024/180 = 5.6, so 5 CPUs. If you can raise the limit 2048 (ulimit -n 2048) you can give 2048/180 = 11.4, 11 CPUs.
To change the limit, enter this command:
ulimit -S -n 5120
The Fast-GBS steps
Initially, the software check:
1. if all variables have a value in the parameter file
2. if a checkpoint file exists.
Then, it proceeds with the 9 main steps:
1. Production of specific barcodes files with the script makeBarcodeSabre.py
2. Demultiplex samples with the software sabre
3. Removing adaptor with cutadapt
4. Alignment of reads with BWA-MEM
5. Convert sam files into bam files with samtools
6. Sort bam files with samtools
7. Index bam files with samtools
8. Production of the file containing the list of bam files to be process by platypus
9. Search the variants with platypus
After each step that completes correctly (exit code 0), the name of the step is written in the checkpoint file.
The name of this file is the name of your parameters file added to: checkpoint_
You can see an full example in the file checkpoint_var.txt:
BARCODES SABRE CUTADAP ALIGN SAM2BAM SORTBAM INDEXBAM BAMLIST PLATYPUS
Once a step name is written in the checkpoint file, it means that if you want to run the pipeline again, this step will be pass. In the same way, if you want to avoid some steps, for example you want to do the first steps and check for the results before continuing, you can write the step name you don't want to do in the checkpoint file.
Check your files!
Before running BWA (aligning your reads against a reference genome), we strongly recommend you to count the number of reads
in each sample file. Depending on the genome size, the enzyme(s), the sequencing technology and the number of barcodes used, you should
have an basic idea on know how much reads you should expect for each sample. If the yield is low and it is not what you may expect,
you can ask the sequencing platform to make a new run.
When you have all your samples demultiplexed, you can run this script to count the number of reads:
#!/bin/bash cd data for i in $(ls *.fq) do echo $i >> stat_fq_temp.txt cat $i | echo $(($(wc -l)/4)) >> stat_fq_temp.txt done perl -pe 's/fq\n/fq\t/' stat_fq_temp.txt > ../control_quality/stat_fq.txt rm *_temp.txt
It's up to you to determine a threshold for the number of reads accepted. It depends on the length distribution among all your samples. In an nearly ideal situation, you will see the number of reads dropping drastically for a few samples. As a rule of thumb, you can remove samples containing less than 1 or 2% of the number of reads for the sample that contains the maximum.
To count the number of reads that map or unmap on your reference genome, you can run this scripts:
#!/bin/bash cd data for i in $(ls *.sort.bam) do echo $i >> stat_mapped_temp.txt echo $i >> stat_unmapped_temp.txt map=$(samtools view -F4 -c $i) unmap=$(samtools view -f4 -c $i) echo $map >> stat_mapped_temp.txt echo $unmap >> stat_unmapped_temp.txt done perl -pe 's/bam\n/bam\t/' stat_mapped_temp.txt > ../control_quality/stat_mapped.txt perl -pe 's/bam\n/bam\t/' stat_unmapped_temp.txt > ../control_quality/stat_unmapped.txt rm *_temp.txt
Save your disk space by removing temporary files
During the process, several temporary files are created. If you have a lot of samples, these files can take up a lot of disk space. When you are ready to do the platypus step, you can delete these files. They have the following extensions:
.fq .unknown.barcodes .fastq .sam .temp.bam
You can also keep the .fq files to avoid running the demultiplexing step again.
Working on your vcf file
When the platyplus step step is completed, you may want to filter out some loci. Here is an example of vcftools command we used before using Beagle to impute the missing data:
vcftools --vcf your_platypus_file.vcf --remove-filtered-all --max-missing 0.2 --remove-indels --mac 1 --min-alleles 2 --max-alleles 2 --recode --out your_platypus_file
See the vcftools documentation page to learn more about the options of this command.
Also, you may want interested to produce a text file that you can import in your prefered spreadsheat creator software:
cd results grep "#CHROM" your_platypus_file.vcf > vcf_header ../vcf2txt.py your_platypus_file.vcf
Known bugs or unwant behavior of some softwares
The cutadapt software may end with code 1 because it produces multiple warnings of this kind:
WARNING: One or more of your adapter sequences may be incomplete. Please see the detailed output above.
This is not an error, just a warning but the exit code is 1 which is usually used in case of real error. But, Fast-GBS catch the exit code to see if it should end (code 1) or continue (code 0). In this case, the exit code 1 cause Fast-GBS to produce this message and quit :
!!! There is a problem in the cutadapt step
Just check in the log file to see what happen during the cutadapt step. If all the samples were processed, you can be confident that there is no errors because if a major error happen, the software stop. Habitually, this will happen for a specific sample. If you conclude that there is no problem, you can add manually the name CUTADAP in your checkpoint file and relaunch the pipeline.
If you use Fast-GBS in your research, please cite:
Torkamaneh D, Laroche J, Bastien M, Abed A, Belzile F. Fast-GBS: a new pipeline for the efficient and highly accurate calling of SNPs from genotyping-by-sequencing data. BMC Bioinformatics. 2017 Jan 3;18(1):5
O. Tange (2011): GNU Parallel - The Command-Line Power Tool: The USENIX Magazine, February 2011:42-47.
Fast-GBS is licensed under the GNU General Public Licence version 3 (GPL3). See the LICENCE file for more details.