Tassel 5 GBS v2 Pipeline

Introduction

This document describes the GBSv2 pipeline available in TASSEL 5 for species with a reference genome.

GBSv2 Discovery/Production Pipeline Overview

The flow chart below shows the code/data flow for the new pipeline.

Tassel5GBSv2.png

Plugin Command Details

The GBSv2 analysis pipeline is an extension of the Java program TASSEL. For details on executing TASSEL 5 pipeline commands, please see TASSEL 5.0 Pipeline Command Line Interface.

The new pipeline stores its data in an embedded SQLite database. Every step of the pipeline either reads from or writes to this database. The database is created in the GBSSeqToTagDBPlugin step, and each subsequent step reads from, or adds data to, that same database. A diagram of the database is presented below. The tables are color-coded to match the pipeline step that populates each table.

GBSv2DataBaseOverview.png
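
Because every step works against the same SQLite file, generic SQLite tooling can show which tables exist and how many rows each one holds after a given step. The following is a minimal sketch using Python's built-in sqlite3 module. The file name and the two tables created for the demonstration ("tag" and "taxa") are hypothetical stand-ins, not the actual GBSv2 schema shown in the diagram.

```python
import os
import sqlite3
import tempfile


def table_row_counts(db_path):
    """Return {table_name: row_count} for every user table in an SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        # Table names are read from sqlite_master, so they are quoted and reused as-is.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()


# Demonstration database. The tables below are hypothetical placeholders,
# NOT the real GBSv2 schema.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_gbs.db")
conn = sqlite3.connect(demo_path)
conn.execute("CREATE TABLE tag (tagid INTEGER PRIMARY KEY, sequence TEXT)")
conn.execute("CREATE TABLE taxa (taxonid INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO tag (sequence) VALUES (?)", [("CAGCAAAT",), ("CTGCTTGA",)]
)
conn.execute("INSERT INTO taxa (name) VALUES (?)", ("sample_A",))
conn.commit()
conn.close()

counts = table_row_counts(demo_path)
print(counts)
```

Pointing `table_row_counts` at a real pipeline database between steps is one way to confirm that a step populated the tables it was expected to.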

Encoding-Decoding SQLite taxa distribution blob

The plugins for the Discovery and Production pipelines are described in detail in the Discovery Pipeline Overview and Production Pipeline Overview sections below.

Discovery Pipeline Overview

GBSSeqToTagDBPlugin() is the first step in the pipeline. It identifies tags from FASTQ input files, then stores these tags and the taxa in which they appear in a local SQLite database.

GBSv2 Key File

GBSSeqToTagDBPlugin

TagExportToFastqPlugin() pulls the distinct tags from the database and exports them in FASTQ format so that they can be aligned to the reference genome with an aligner (e.g. BWA or Bowtie2).

TagExportToFastqPlugin
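
To make the export step concrete, here is a minimal sketch of writing distinct tag sequences as FASTQ records. It is an illustration of the file format only, not the plugin's implementation. The read-name scheme (`tagSeq=...`) and the constant placeholder quality character are assumptions made for this example.

```python
import io


def write_tags_as_fastq(tags, handle, quality_char="f"):
    """Write each distinct tag as a four-line FASTQ record; return the record count."""
    seen = set()
    written = 0
    for sequence in tags:
        if sequence in seen:
            continue  # export each distinct tag only once
        seen.add(sequence)
        # Hypothetical read name: embedding the sequence lets it be
        # recovered later from the aligner's output.
        handle.write(f"@tagSeq={sequence}\n")
        handle.write(f"{sequence}\n")
        handle.write("+\n")
        # FASTQ requires a quality string as long as the sequence; a constant
        # placeholder is used here as an assumption.
        handle.write(quality_char * len(sequence) + "\n")
        written += 1
    return written


buffer = io.StringIO()
n_written = write_tags_as_fastq(["CAGCAAAT", "CTGCT", "CAGCAAAT"], buffer)
fastq_lines = buffer.getvalue().splitlines()
print(n_written, "records written")
```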

A sequence aligner must now be run to align the exported tags against a reference genome. GBSv2 supports the BWA and Bowtie2 alignment programs.

Run Alignment Program(s)
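
As an illustration of how the exported FASTQ file feeds the aligner, the sketch below assembles the two Bowtie2 command lines involved: building an index from the reference, then aligning unpaired reads to produce a SAM file. It only constructs and prints the commands and does not run them. All file names are hypothetical, and any tuning options a particular data set needs are left out; BWA uses its own, different command syntax.

```python
import shlex


def bowtie2_build_command(reference_fasta, index_prefix):
    """Command that builds a Bowtie2 index from a reference FASTA file."""
    return ["bowtie2-build", reference_fasta, index_prefix]


def bowtie2_align_command(index_prefix, tags_fastq, sam_out):
    """Command that aligns unpaired reads and writes SAM output.

    -x: index basename, -U: unpaired reads file, -S: SAM output file.
    """
    return ["bowtie2", "-x", index_prefix, "-U", tags_fastq, "-S", sam_out]


# Hypothetical file names, for illustration only.
build_cmd = bowtie2_build_command("reference.fa", "reference_index")
align_cmd = bowtie2_align_command(
    "reference_index", "tagsForAlign.fq", "tagsForAlign.sam"
)

for cmd in (build_cmd, align_cmd):
    print(shlex.join(cmd))
```

The commands are kept as argument lists so they could be handed to `subprocess.run` unchanged if the sketch were extended to actually launch the aligner.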

The SAM file produced by the aligner (either BWA or Bowtie2) is then passed to SAMToGBSdbPlugin(). This plugin stores the position information for each aligned tag in the database.

SAMToGBSdbPlugin
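
The position information in question comes from the standard SAM alignment columns. The sketch below shows, under stated assumptions, how a single SAM record yields a chromosome, a position, and a strand. It relies only on the SAM format itself (column order, and FLAG bits 0x4 for "unmapped" and 0x10 for "reverse strand"). It does not reproduce any filtering or position adjustment the plugin may apply, and the `TagAlignment` type and sample records are hypothetical.

```python
from typing import NamedTuple, Optional


class TagAlignment(NamedTuple):
    tag_name: str
    chromosome: str
    position: int  # 1-based leftmost mapping position, as reported in SAM
    forward_strand: bool


def parse_sam_line(line: str) -> Optional[TagAlignment]:
    """Return the alignment for one SAM line, or None for headers and unmapped reads."""
    if line.startswith("@"):
        return None  # header line
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    flag = int(fields[1])
    if flag & 0x4:
        return None  # read (tag) is unmapped
    return TagAlignment(
        tag_name=fields[0],
        chromosome=fields[2],
        position=int(fields[3]),
        forward_strand=not (flag & 0x10),
    )


# Hypothetical records for illustration.
mapped = parse_sam_line(
    "tagSeq=CAGCAAAT\t16\t9\t1200\t42\t8M\t*\t0\t0\tATTTGCTG\tffffffff\n"
)
unmapped = parse_sam_line(
    "tagSeq=CTGCT\t4\t*\t0\t0\t*\t*\t0\t0\tCTGCT\tfffff\n"
)
header = parse_sam_line("@HD\tVN:1.0\tSO:unsorted\n")
print(mapped, unmapped, header)
```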

DiscoverySNPCallerPluginV2() is then used to identify SNPs from the aligned tags stored in the GBS database.

DiscoverySNPCallerPluginV2

SNPQualityProfilerPlugin() scores all discovered SNPs on various coverage, depth and genotypic statistics for a given set of taxa (samples). This plugin takes a GBS database and a tab-delimited taxa annotation file as input and writes its results to the database. If no taxa are specified, the plugin computes the quality metrics across all taxa stored in the database. If an output file is specified, the plugin also writes a CSV (comma-separated values) file containing the quality information stored for each position. These data may be used to create a quality score to be stored with the positions.

SNPQualityProfilerPlugin

UpdateSNPPositionQualityPlugin may be run to store a quality score for SNP positions identified in the database. Most commonly, the user evaluates the statistics file produced by SNPQualityProfilerPlugin to create a quality score for the desired positions. UpdateSNPPositionQualityPlugin takes two input parameters: the name of a database file and the name of a quality score file.

UpdateSNPPositionQualityPlugin
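
The hand-off between the two plugins, from a per-position statistics file to a per-position quality score file, can be sketched as follows. Everything specific here is an assumption for illustration: the statistics column names (`avgDepth`, `minorAlleleFreq`), the pass/fail scoring rule and its thresholds, and the tab-delimited output layout with `CHROM`, `POS` and `QUALITYSCORE` headers. Check the plugin pages for the actual column names and the file format that UpdateSNPPositionQualityPlugin expects.

```python
import csv
import io


def score_positions(stat_handle, out_handle, min_depth=5.0, min_maf=0.05):
    """Read per-position statistics (CSV) and write one quality score per position.

    Hypothetical rule: a position scores 1.0 if it meets both thresholds, else 0.0.
    Returns the number of positions written.
    """
    reader = csv.DictReader(stat_handle)
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["CHROM", "POS", "QUALITYSCORE"])  # assumed header layout
    n_positions = 0
    for row in reader:
        depth = float(row["avgDepth"])        # hypothetical column name
        maf = float(row["minorAlleleFreq"])   # hypothetical column name
        score = 1.0 if depth >= min_depth and maf >= min_maf else 0.0
        writer.writerow([row["CHROM"], row["POS"], score])
        n_positions += 1
    return n_positions


# Hypothetical statistics for two positions.
stat_csv = io.StringIO(
    "CHROM,POS,avgDepth,minorAlleleFreq\n"
    "9,1200,12.5,0.21\n"
    "9,1300,2.0,0.30\n"
)
score_file = io.StringIO()
n_scored = score_positions(stat_csv, score_file)
print(score_file.getvalue())
```

In practice the scoring rule is the part a user would design after inspecting the statistics; the sketch only shows the mechanics of turning one file into the other.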

SNPCutPosTagVerificationPlugin was created to aid in debugging the SNPs and tags stored in the database. When coverage of a particular position is in question, this plugin can be run to verify which tags have mapped to a specified position, identify the taxa in which a particular SNP appears, and identify the alleles called for a specified SNP position.

SNPCutPosTagVerificationPlugin

Tag data is stored in the GBSv2 SQLite database in "blob" format, so it cannot be read directly with a plain SQL query. GetTagSequenceFromDBPlugin was created to let the user retrieve and print the tag sequences from the database. GetTagTaxaDistFromDBPlugin was created later to print both the tags and their taxa distribution to a tab-delimited file.

GetTagSequenceFromDBPlugin

GetTagTaxaDistFromDBPlugin

Production Pipeline Overview

Once a large-scale, species-wide Discovery Pipeline has been run, the variant data now stored in the database can be used to quickly call known SNPs in newly sequenced samples. This is the Production Pipeline. The GBSv2 Production Pipeline consists of a single plugin, described below.

Production SNP Caller
