#Introduction

TIGRA is a computer program that performs targeted local assembly of structural variant (SV) breakpoints from next-generation sequencing short-read data. It takes as input a list of putative SV calls and a set of bam files that contain reads mapped to a reference genome such as NCBI build 36. For each SV call, it assembles the set of reads that were mapped or partially mapped to the region of interest (ROI) in the corresponding bam files. Instead of outputting a single consensus sequence, TIGRA attempts to construct all the alternative alleles in the ROI, provided they received sufficient sequence coverage (usually >= 2x). It also uses the variant type information in the input files to select reads for assembly. TIGRA is effective at improving SV prediction accuracy and resolution in short-read analysis and can produce accurate breakpoint sequences that are useful for understanding the origin, mechanism, and pathology underlying the SVs.

TIGRA was initially developed at the Genome Institute of Washington University in St. Louis and was further developed at the University of Texas MD Anderson Cancer Center.

This is beta release version 0.4.3 (compatible with samtools 1.3.1).

#Install

	Download and compile samtools (version 1.3.1 or above) (http://samtools.sourceforge.net/) on your system.
	Modify the Makefile to point to the samtools folder on your system, then type "make" and press Enter.
	Add $PATH/htslib-1.3.1 to your $LD_LIBRARY_PATH, where $PATH here stands for the directory in which you installed samtools 1.3.1.

#Usage:

        Tigra_sv version-0.4.3
        Arguments: 

       ./tigra_sv [options] <SV file> [<a.bam> <b.bam> ...]


        Options: 
        -l INT  Assembly [500] bp from the SV breakpoints
        -a INT  Assembly [50] bp into the SV breakpoints
        -k STR  Comma separated list of kmers [15,25]
        -c STR  Only assemble calls on chromosome [STR]
        -o FILE Save assembly contigs to [FILE]
        -s INT  Only output contigs longer than [50] bp
        -R FILE Path to the wildtype reference fasta
        -r FILE Create pair-wise local reference sequence fastas in [FILE]
        -w INT  Pad local reference by additional [200] bp on both ends
        -q INT  Only assemble reads with mapping quality > [1]
        -N INT  Highlight segments supported by SVReads that differ from reference by at least [5] mismatches
        -p INT  Ignore cases that have average read depth greater than [10000]
        -d      Dump reads by case into fasta files
        -I STR  Save reads fasta into an existing directory
        -b      The input file is in breakdancer format
        -f      Provide a text file containing rows of sample:bam mapping
        -M INT  Skip SVs shorter than [3] bp
        -h INT  Skip complex contig graphs with more than [100] nodes
        -m      Add mate for assembly, speed might be twice slower when this option is on.
        -S STR  Spec_file:Read_file from the last run with -d turned on. Facilitates quick debug without extracting reads from bam. Spec_file is in the format of stderr.

#Input:

	The minimally required input is an SV file.
	As shown in the usage, a group of bam files can be specified on the command line or through the -f option.

	TIGRA currently recognizes two types of input:

	1. The 1000 Genomes format
	   The SV calls must be recorded in a tab-delimited format with the following columns: 
		CHR     
		START_OUTER     
		START_INNER     
		END_INNER       
		END_OUTER       
		TYPE_OF_EVENT   
		SIZE_PREDICTION 
		MAPPING_ALGORITHM       
		SEQUENCING_TECHNOLOGY   
		SAMPLEs 
		TYPE_OF_COMPUTATIONAL_APPROACH  
		GROUP   
		OPTIONAL_ID

	   It is critical to have accurate information in CHR, START_INNER, END_INNER, TYPE_OF_EVENT, SIZE_PREDICTION, and SAMPLEs.
	   SAMPLEs should be a comma-separated list of sample names.
	   For example:
	   1       829757  829757  829865  829865  DEL       116     MAQ     SLX     NA19238,NA19240    RP      WashU
   
	   To tell the program which bam file belongs to which sample, use the -f option to specify a bam_list_file whose rows are key:value pairs in the following form:
	 	sample_name:bam_file_location 
	   with no space in between. 
	   For example:
	   	NA19238:1000genomes/ftp/data/NA19238/alignment/NA19238.chrom1.SLX.SRP000032.2009_07.bam
	   	NA19240:1000genomes/ftp/data/NA19240/alignment/NA19240.chrom1.SLX.SRP000032.2009_07.bam
	   Each row can declare only one sample.
	   Only the samples and the bams that contain the SV will be assembled.

	2. BreakDancer format
	   Please use option -b to declare the BreakDancer format.
	   You can use either the long format, e.g.,

           10      89690279        +       10      89702321        +       DEL     12042   99      16      example1|16     0.01    2.37
           10      85512695        +       10      85513886        +       DEL     1191    99      18      example1|11:example2|7  0.02    0.35

	   or the short format, e.g.,

           10      89690279        +       10      89702321        +       DEL     12042
           10      85512695        +       10      85513886        +       DEL     1191

	   In the long format, column 11 must list the samples and the number of SV-supporting reads in each sample. Entries for different samples are separated by ":" and, within each entry, the sample name is separated from the read count by "|". The read counts can be placeholders when they are not available, but the sample names (example1, example2) must be meaningful and match the sample names in the bam list file exactly,
	   e.g.,
		example1:example1.bam
		example2:example2.bam
	   Column 11 is used in conjunction with the bam list to selectively assemble the subset of bams that may contain the predicted SV. For example, the first deletion above will be assembled using reads from example1 and the second using reads from both example1 and example2. 
	   When the short format is used, all the bams specified in the bam list and on the command line will be used. The bams specified in the bam list take precedence over those specified on the command line.
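The way samples are tied to bam files can be sketched in a few lines of Python. This is an illustration of the rules described in this section, not TIGRA's own code: the function names are hypothetical, fields are split on whitespace, and a sample missing from the bam list is treated as an error because the names are required to match exactly.

```python
def parse_bam_list(text):
    """Parse 'sample_name:bam_file_location' rows into a dict."""
    mapping = {}
    for row in text.splitlines():
        row = row.strip()
        if not row:
            continue
        # Split on the first ':' only, so the path itself may contain ':'.
        sample, bam = row.split(":", 1)
        mapping[sample] = bam
    return mapping


def samples_1000g(line):
    """Sample names from the SAMPLEs column (column 10) of a 1000 Genomes row."""
    return line.split()[9].split(",")


def samples_breakdancer(line):
    """Sample names from column 11, or None for the short format."""
    fields = line.split()
    if len(fields) < 11:
        return None  # short format: no per-sample information
    # Entries look like 'example1|11:example2|7'.
    return [entry.split("|")[0] for entry in fields[10].split(":")]


def bams_for_call(samples, bam_list):
    """Bams to use for one call; None means every bam in the list."""
    if samples is None:
        return list(bam_list.values())
    # Raises KeyError when a sample name does not match the bam list.
    return [bam_list[sample] for sample in samples]
```

For the second BreakDancer example row above, `samples_breakdancer` yields `example1` and `example2`, so `bams_for_call` picks both bams from the list.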

	
#SV types

	TIGRA can currently assemble the following types of SVs:
	
	DEL: deletion;
	INS: insertion;
	ITX: tandem duplication;
	CTX: transchromosomal translocation.
	
	Note that in the BreakDancer file, the SVs are already recorded with these 3-letter abbreviations. For the 1000 Genomes format, please ensure that the TYPE_OF_EVENT column uses the same abbreviations.
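A quick pre-flight check of the type column can catch unsupported abbreviations before a run. The sketch below is a hypothetical helper, not part of TIGRA. It assumes whitespace-separated fields, takes the type from column 6 (1000 Genomes format) or column 7 (BreakDancer format) as in the examples above, and assumes that lines starting with "#" are header lines to be skipped.

```python
SUPPORTED_TYPES = {"DEL", "INS", "ITX", "CTX"}


def unsupported_types(lines, breakdancer=False):
    """Return (line_number, type) pairs whose type is not supported."""
    type_col = 6 if breakdancer else 5  # 0-based column index
    problems = []
    for number, line in enumerate(lines, start=1):
        # Assumption: blank lines and '#' header lines carry no SV call.
        if not line.strip() or line.startswith("#"):
            continue
        sv_type = line.split()[type_col]
        if sv_type not in SUPPORTED_TYPES:
            problems.append((number, sv_type))
    return problems
```

An empty result means every call uses one of the four abbreviations listed above.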


#Additional comments on usage:
-S

		When this option is on, bam files are not used for read extraction (the current version still accepts a bam file list but does not physically read each bam). It is mainly intended for debugging with quick read retrieval, but users who already have reads in a fasta file can also use it for SV breakpoint assembly. The two uses are as follows.

		1. For debugging: in the previous run, turn on the -d option and redirect stderr to a file, say spec.csv. In the current run, pass the extra option "-S spec.csv:${prefix}.fa", where the read file is named after the SV coordinates in the form "$chr1:$start:$chr2:$end:$type:$qual:$ori.fa". $ori can take four values, ++, --, +-, and -+, depending on the SV type.

		2. For feeding TIGRA with fasta reads directly: create spec.csv by hand. In a text editor, add one line with the following fields joined by tabs: #Reads:${num_reads_in_your_fasta}, #SVReads:${estimated_num_SV_reads_in_your_fasta}, RegionSize:${SV_region_size}, and AvgCoverage:${#Reads/RegionSize}. For example:

		#Reads:57359    #SVReads:25117  RegionSize:1174 AvgCoverage:4206.42

		Save this file as spec.csv and follow the remaining steps in 1, with ${prefix} being your fasta file name.
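The hand-made spec.csv line for -S can also be produced with a short script. The following is a minimal sketch with hypothetical function names. It counts reads as fasta header lines, takes the SV read estimate, region size, and average coverage as given rather than computing them, and assumes that a single line terminated by a newline is sufficient.

```python
def count_fasta_reads(fasta_text):
    """Count reads in fasta text as the number of '>' header lines."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


def spec_line(num_reads, num_sv_reads, region_size, avg_coverage):
    """Build the tab-joined spec line in the field order described above."""
    fields = [
        f"#Reads:{num_reads}",
        f"#SVReads:{num_sv_reads}",
        f"RegionSize:{region_size}",
        f"AvgCoverage:{avg_coverage}",
    ]
    return "\t".join(fields)


def write_spec(path, line):
    """Write the spec line to a file such as spec.csv."""
    # Assumption: one line followed by a newline is all the file needs.
    with open(path, "w") as handle:
        handle.write(line + "\n")
```

Calling `spec_line(57359, 25117, 1174, 4206.42)` reproduces the example line, with tabs between the four fields.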

-R

		To see whether parts of the contigs are novel relative to the reference, i.e., supported by unmapped or poorly mapped reads, provide the reference fasta file, indexed with samtools faidx, through the -R option followed by the full path. The novel parts of the contigs will be in CAPITAL letters, while the parts identical to the reference will be in lower case. This feature facilitates consistency analysis with split-read algorithms (such as Pindel) that directly examine unmapped or poorly mapped reads. It could also help genotyping algorithms that look for reads spanning the breakpoints.
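Because novel sequence is marked purely by letter case, it can be located in a contig with a regular expression. The sketch below is a hypothetical post-processing helper, not something TIGRA provides. It works on a contig sequence already read into a string and treats every run of upper-case letters, including an upper-case N, as a novel segment.

```python
import re


def novel_segments(contig):
    """Return (start, end, sequence) for each run of upper-case bases.

    Coordinates are 0-based and the end is exclusive.
    """
    return [
        (match.start(), match.end(), match.group())
        for match in re.finditer(r"[A-Z]+", contig)
    ]
```

For instance, `novel_segments("acgtACGTTacgGG")` returns `[(4, 9, "ACGTT"), (12, 14, "GG")]`.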

-N

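		Use in conjunction with -R to define the set of poorly mapped reads, i.e., reads that differ from the reference by at least the given number of mismatches (default 5).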

-c

		To parallelize the jobs by chromosome, use option -c followed by the chromosome id, so that the program skips the other chromosomes for this job. Please make sure that the bams in bam_list_file contain the chromosome of interest.
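One way to set up such per-chromosome jobs is to generate one command per chromosome and hand each to a job scheduler. The sketch below only builds the command lines and does not run anything. It uses the -b, -R, -c, -o and -f options from the usage text. The function name, the `contigs.<chr>.fa` output names, and the default binary path `./tigra_sv` (taken from the usage line) are assumptions, and -b assumes BreakDancer input.

```python
import shlex


def per_chromosome_commands(chromosomes, sv_file, bam_list, reference,
                            binary="./tigra_sv"):
    """Build one argument list per chromosome, each restricted with -c."""
    commands = []
    for chrom in chromosomes:
        argv = [
            binary,
            "-b",                          # assumes BreakDancer-format input
            "-R", reference,
            "-c", chrom,                   # assemble only this chromosome
            "-o", f"contigs.{chrom}.fa",   # hypothetical output name
            "-f", bam_list,
            sv_file,
        ]
        commands.append(argv)
    return commands


if __name__ == "__main__":
    jobs = per_chromosome_commands(
        ["1", "2"], "example.breakdancer.sv", "example.bamlist",
        "NCBI36.example.fa",
    )
    for argv in jobs:
        # shlex.join quotes arguments so each line can be pasted into a shell.
        print(shlex.join(argv))
```

Each printed line is an independent job, so the chromosomes can be submitted in any order.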

-r
		
                Use in conjunction with -R. This is useful when you want to obtain a fasta file that contains a matched set of local wild-type sequences for breakpoint annotation. 
        

#Example commands:

	1. Assemble SVs using example1.bam
        tigra -b -R NCBI36.example.fa -o output1.fa example.breakdancer.sv example1.bam 2> output1.log

	2. Assemble SVs from a list of bam files
        tigra -b -R NCBI36.example.fa -o output2.fa -f example.bamlist example.breakdancer.sv  2> output2.log

#Version control:
        0.4.2: Supports chromosome names containing a dot (contig names now use ^ instead of . to avoid chromosome region parse errors); boundary protection added for the -m option. Tested on TCGA data.
        0.4.1: Fixed a division-by-zero bug that occurred when the node number was zero. Tested on 1000 Genomes Project data. Added option -S for future debugging without bam files.
        0.4.0: Compatible with samtools-0.1.19.
        0.3.9: Cluster mate positions before retrieving mates, speeding up the program. 
        0.3.8: Add -m for mate extraction control.
        0.3.7: Stable version without mate extraction. Compatible with samtools-0.1.6.

#Contact
        Ken Chen (kchen3@mdanderson.org)

#Acknowledgement
        MD Anderson Cancer Center
        Washington University in St. Louis
        1000 Genomes project