Wiki

Clone wiki

Fast-GBS / Home

Welcome

Welcome to the Fast-GBS wiki!

Glossary

This section should allow the user to better understand the terms used throughout the file preparation protocol.

FLOWCELL: Sequencing batch. Originally can contain up to 8 lanes

LANES: File containing multiplexed sequences. Several lanes can be associated to a flowcell.

TECHNOLOGY: Sequencing technology, either Illumina or IonTorrent

Using Fast-GBS

The main steps in using Fast-GBS are:

  • Go to or create the directory in which you wish to perform the analysis and store results

  • Run the script ./makeDir.sh that will create the following directories:
    refgenome
    data
    barcodes
    results

  • Prepare the reference genome (as detailed below) and place it in the refgenome directory

  • Prepare the barcode files (detailed below) and place them in the barcodes directory
  • Prepare the data (see details below) and place these in the data directory
  • Prepare the parameters file (see details below) and place it in the same directory in which you launched the ./makeDir.sh script
  • Run the script fastgbs.sh

Dependencies

In order to use Fast-GBS, you will need the following:

Quality control

The fastqc software is not included in the pipeline but it is strongly recommended to perform a quality check on your sequence files with it before starting fastgbs (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

Preparation of the reference genome

Move your reference genome file in the refgenome directory, move to that directory and index it with the command:
bwa index -a bwtsw refgenome.fasta

The .fai file isn't created as part of "bwa index." To create it, run the command:
samtools faidx refgenome.fasta

That will create a file named refgenome.fasta.fai.

Finally, write the reference genome file name in the parameter file:

REFGEN=refgenome.fasta

Preparation of the data

Move your sequence files in the data directory. Each file must be in the form:

FLOWCELL_LANE.fq.gz

Suppose that you want to analyse 6 sequence files over 3 flowcells (HADES, HOMER and ULYSSE) and each flowcell has 2 lanes:

HADES: lanes 1 and 2
HOMER: lanes 3 and 4
ULYSSE: lanes 5 and 6

You will have to name your files in that way:

HADES_1.fq.gz  
HADES_2.fq.gz  
HOMER_3.fq.gz  
HOMER_4.fq.gz  
ULYSSE_5.fq.gz  
ULYSSE_6.fq.gz  

Finally, write the flowcell and lane names at these places in the parameters file:

; FLOWCELL INFORMATION
FLOWCELL=ULYSSE HOMER HADES

; LANES INFORMATION
ULYSSE_LANES=5 6
HOMER_LANES=3 4
HADES_LANES=1 2

It is important to keep the extension .fq.gz because, for simplicity, it is hardcoded in the pipeline.

Preparation of the barcode files

You can use any spreadsheet software (openOffice, Excel) to format your barcodes files. If you do so, save the active sheet in text (.txt) format with tab separated columns. Then move your files to your Unix computer. It is possible that formatting (the carriage return) is not recognized by Unix system. To convert your file in ASCII, use the script txt2unix.sh available in this distribution:

./txt2unix.sh your_barcodes_file > new_barcodes_file

Continuing with the same example as above, you will have to create one barcode file per sequence file and name them as below:

barcodes_HADES_1
barcodes_HADES_2
barcodes_HOMER_3
barcodes_HOMER_4
barcodes_ULYSSE_5
barcodes_ULYSSE_6 
Do not put any extension, like .txt, to the filenames.

Look at the content of one of these file:

CTATTA      BC01
TAACGA      BC02
CGTGTGGTB   BC03

Format all your barcode files in the same manner, in a tab-separated column. Move these files in the barcodes directory.

Warning: Do not use special characters (such as: * / \ | $ " etc.) in the names given to your samples in the Barcode file. These may cause sabre to crash.

The script makeBarcodeSabre.py will be called by the pipeline fastgbs.sh and will create this file for sabre. The last column now contains the name of the file created by sabre during the demultiplexing phase:

CTATTA HADES_1_BC01.fq
TAACGA HADES_1_BC02.fq
CGTGTGGT HADES_1_BC03.fq

Preparation of the parameters file

Verify the information in the file parameters.txt. If necessary, change the value given to the variables.

**It is very important not to change the words in capital letters (before the sign =). If you do this, you will get an error message because the pipeline will not find that variable.

Respect what is an integer and a string.

Running fastgbs

To run the pipeline, just enter the command: ./fastgbs.sh parameters.txt

Supplementary information

Values of some variables in the pipeline

The parameter file allows the user to change the value of certain variables in the pipeline. However, the list of variables in the parameter file is far from exhaustive. It is therefore important to consider that several parameters of the software used in this pipeline, are hard-coded. We made this choice to make life easier for less experienced users. In line with this philosophy, when the user chooses the ILLUMINA technology in the parameter file, values are automatically given to the following variables in Platypus:

--genIndels=1
--minMapQual=20
--minBaseQual=20

When the IONTORRENT technology is chosen:
--genIndels=0
--minMapQual=10
--minBaseQual=10

These platypus options are hardcoded in the pipeline:
--maxHaplotypes: 70
--originalMaxHaplotypes: 70
--minFlank: 3
--maxVariants: 10
--badReadsWindow: 5
--largeWindows: 1
--longHaps: 1
--filterReadPairsWithSmallInserts: 0
--minVarFreq: 0.002
--assembleBrokenPairs: 1

However, advanced users can easily edit the fastgbs.sh code and insert the values of their choice.

Platypus

A common problem that can occur with platypus is where the system says it can not find a file (BAM) in the sample list to use. This problem is caused by the variable that defines the number of files that can be opened simultaneously. To get the current value used by the system, run the command ulimit -a. This value can be of 1024.
To avoid this problem, we must consider the number of CPUs you give to platypus and the number of samples (bam file) you want to analyze. The number of files that must be opened simultaneously be: CPU X Nb Nb samples.
The number of CPUs determine the number of genomic regions analyzed simultaneously. For each analyzed genomic region, all bam files must be opened by the program. For example, if you have 180 samples and you give 10 CPUs, this is 1800 files to be opened simultaneously. If the limit of your OS is 1024, you will get an error message. In this case you can give 1024/180 = 5.6, so 5 CPUs. If you can raise the limit 2048 (ulimit -n 2048) you can give 2048/180 = 11.4, 11 CPUs.
To change the limit, enter this command:
ulimit -S -n 5120

The Fast-GBS steps

Initially, the software check:
1. if all variables have a value in the parameter file
2. if a checkpoint file exists.

Then, it proceeds with the 9 main steps:
1. Production of specific barcodes files with the script makeBarcodeSabre.py
2. Demultiplex samples with the software sabre
3. Removing adaptor with cutadapt
4. Alignment of reads with BWA-MEM
5. Convert sam files into bam files with samtools
6. Sort bam files with samtools
7. Index bam files with samtools
8. Production of the file containing the list of bam files to be process by platypus
9. Search the variants with platypus

Checkpoint file

After each step that completes correctly (exit code 0), the name of the step is written in the checkpoint file. The name of this file is the name of your parameters file added to: checkpoint_
You can see an full example in the file checkpoint_var.txt:

BARCODES
SABRE
CUTADAP
ALIGN
SAM2BAM
SORTBAM
INDEXBAM
BAMLIST
PLATYPUS

Once a step name is written in the checkpoint file, it means that if you want to run the pipeline again, this step will be pass. In the same way, if you want to avoid some steps, for example you want to do the first steps and check for the results before continuing, you can write the step name you don't want to do in the checkpoint file.

Check your files!

Before running BWA (aligning your reads against a reference genome), we strongly recommend you to count the number of reads in each sample file. Depending on the genome size, the enzyme(s), the sequencing technology and the number of barcodes used, you should have an basic idea on know how much reads you should expect for each sample. If the yield is low and it is not what you may expect, you can ask the sequencing platform to make a new run.
When you have all your samples demultiplexed, you can run this script to count the number of reads:

#!/bin/bash
cd data
            for i in $(ls *.fq)
            do
                echo $i >> stat_fq_temp.txt
                cat $i | echo $(($(wc -l)/4)) >> stat_fq_temp.txt
            done
perl -pe 's/fq\n/fq\t/' stat_fq_temp.txt > ../control_quality/stat_fq.txt

rm *_temp.txt
It's up to you to determine a threshold for the number of reads accepted. It depends on the length distribution among all your samples. In an nearly ideal situation, you will see the number of reads dropping drastically for a few samples. As a rule of thumb, you can remove samples containing less than 1 or 2% of the number of reads for the sample that contains the maximum.

To count the number of reads that map or unmap on your reference genome, you can run this scripts:

#!/bin/bash
cd data
            for i in $(ls *.sort.bam)
            do
                echo $i >> stat_mapped_temp.txt
                echo $i >> stat_unmapped_temp.txt
                map=$(samtools view -F4 -c $i)
                unmap=$(samtools view -f4 -c $i)
                echo $map >> stat_mapped_temp.txt
                echo $unmap >> stat_unmapped_temp.txt
            done
perl -pe 's/bam\n/bam\t/' stat_mapped_temp.txt > ../control_quality/stat_mapped.txt
perl -pe 's/bam\n/bam\t/' stat_unmapped_temp.txt > ../control_quality/stat_unmapped.txt

rm *_temp.txt

Save your disk space by removing temporary files

During the process, several temporary files are created. If you have a lot of samples, these files can take up a lot of disk space. When you are ready to do the platypus step, you can delete these files. They have the following extensions:

.fq
.unknown.barcodes
.fastq
.sam
.temp.bam
You can also keep the .fq files to avoid running the demultiplexing step again.

Working on your vcf file

When the platyplus step step is completed, you may want to filter out some loci. Here is an example of vcftools command we used before using Beagle to impute the missing data:

vcftools --vcf your_platypus_file.vcf --remove-filtered-all --max-missing 0.2 --remove-indels --mac 1 --min-alleles 2 --max-alleles 2 --recode --out your_platypus_file

See the vcftools documentation page to learn more about the options of this command.

Also, you may want interested to produce a text file that you can import in your prefered spreadsheat creator software:

cd results
grep "#CHROM" your_platypus_file.vcf > vcf_header
../vcf2txt.py your_platypus_file.vcf

Known bugs or unwant behavior of some softwares

The cutadapt software may end with code 1 because it produces multiple warnings of this kind:

WARNING:
    One or more of your adapter sequences may be incomplete.
    Please see the detailed output above.
This is not an error, just a warning but the exit code is 1 which is usually used in case of real error. But, Fast-GBS catch the exit code to see if it should end (code 1) or continue (code 0). In this case, the exit code 1 cause Fast-GBS to produce this message and quit :
!!! There is a problem in the cutadapt step
Just check in the log file to see what happen during the cutadapt step. If all the samples were processed, you can be confident that there is no errors because if a major error happen, the software stop. Habitually, this will happen for a specific sample. If you conclude that there is no problem, you can add manually the name CUTADAP in your checkpoint file and relaunch the pipeline.

Citing

If you use Fast-GBS in your research, please cite:

Torkamaneh, D., Laroche, J., Bastien, M., Abed, A., & Belzile, F. (2017). Fast-GBS: a new pipeline for the efficient and highly accurate calling of SNPs from genotyping-by-sequencing data. BMC bioinformatics, 18(1), 5.
doi:10.1186/s12859-016-1431-9

(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5210301/)

References

Torkamaneh D, Laroche J, Belzile F (2016) Genome-Wide SNP Calling from Genotyping by Sequencing (GBS) Data: A Comparison of Seven Pipelines and Two Sequencing Technologies. PLoS ONE 11(8): e0161333. https://doi.org/10.1371/journal.pone.0161333

Torkamaneh D, Belzile F (2015) Scanning and Filling: Ultra-Dense SNP Genotyping Combining Genotyping-By-Sequencing, SNP Array and Whole-Genome Resequencing Data. PLoS ONE 10(7): e0131533. https://doi.org/10.1371/journal.pone.0131533

O. Tange (2011): GNU Parallel - The Command-Line Power Tool: The USENIX Magazine, February 2011:42-47.

License

Fast-GBS is licensed under the GNU General Public Licence version 3 (GPL3). See the LICENCE file for more details.

Updated