ReDTandem : Region duplicate in Tandem
This repository holds the source code of the software ReDTandem.
The ReD Tandem application is made available here as a Perl wrapper script together with source files. It is assumed that the following dependencies are satisfied:
- C and C++ compilers (gcc/g++)
- NCBI Blast with 2 executables available in the binaries path:
The executables are installed in the directory.
The "ReDtandem.pl" script executes the ReD Tandem pipeline. To execute it, just type:
The following arguments are expected:
- --dnafile -> the DNA file
- --species -> the name of the species (as in the DNA file, see below)
- --maxchaindist -> maximum distance between chains:
  - arabidopsis thaliana : 150000
  - mouse and human : 3000000
- --maxanchordist -> maximum distance between anchors:
  - arabidopsis thaliana : 40000
  - mouse and human : 300000
- --ratioanchordist -> ratio between anchor's score and distance's score (mean anchor's score = ratio * mean distance's score):
  - arabidopsis thaliana : 1.2
  - mouse and human : 1.4
- --centro -> a file describing the position of centromeres (optional)
The DNA file should contain the genomic sequence considered with one FASTA header per chromosome sequence. The FASTA header should be formatted as ">ath1_1-9639975" where:
* ath : identifies the species considered
* 1 : is the chromosome number
* 1-9639975 : positions of the first and last nucleotide in the sequence.
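The header convention above can be checked before running the pipeline. The following is a minimal Python sketch, not part of ReDTandem itself; the names `ChromHeader` and `parse_header` are hypothetical, and the regular expression assumes the species code is made of letters only and the chromosome identifier of digits only, as in the ">ath1_1-9639975" example.

```python
import re
from typing import NamedTuple


class ChromHeader(NamedTuple):
    species: str  # e.g. "ath"
    chrom: int    # chromosome number
    first: int    # position of the first nucleotide
    last: int     # position of the last nucleotide


# Assumption: letters for the species code, digits for the chromosome,
# then "_" and the "first-last" positions, as in ">ath1_1-9639975".
HEADER_RE = re.compile(r"^>([A-Za-z]+)(\d+)_(\d+)-(\d+)$")


def parse_header(line: str) -> ChromHeader:
    """Split a FASTA header of the form '>ath1_1-9639975' into its parts."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unexpected FASTA header: {line!r}")
    species, chrom, first, last = match.groups()
    return ChromHeader(species, int(chrom), int(first), int(last))


print(parse_header(">ath1_1-9639975"))
# ChromHeader(species='ath', chrom=1, first=1, last=9639975)
```

A header that does not follow the convention raises `ValueError`, which makes it easy to scan a DNA file for malformed headers before starting a long run.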
perl ReDtandem.pl --species ath --dnafile ./example/dna.fa --centro ./example/centro.out
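When running several genomes, the species-specific values listed for --maxchaindist, --maxanchordist and --ratioanchordist can be kept in one place. The sketch below only builds the argument list shown in this README; the preset names ("arabidopsis", "mouse_human") and the helper `build_command` are hypothetical and are not options of ReDtandem.pl.

```python
from typing import Dict, List, Optional

# Values copied from the argument description in this README.
# The preset keys are illustrative names, not ReDtandem options.
PRESETS: Dict[str, Dict[str, object]] = {
    "arabidopsis": {
        "maxchaindist": 150000,
        "maxanchordist": 40000,
        "ratioanchordist": 1.2,
    },
    "mouse_human": {
        "maxchaindist": 3000000,
        "maxanchordist": 300000,
        "ratioanchordist": 1.4,
    },
}


def build_command(species: str, dnafile: str, preset: str,
                  centro: Optional[str] = None) -> List[str]:
    """Return the ReDtandem.pl command line as a list of arguments."""
    cmd = ["perl", "ReDtandem.pl", "--species", species, "--dnafile", dnafile]
    for name, value in PRESETS[preset].items():
        cmd += [f"--{name}", str(value)]
    if centro is not None:  # --centro is optional
        cmd += ["--centro", centro]
    return cmd


print(" ".join(build_command("ath", "./example/dna.fa", "arabidopsis",
                             "./example/centro.out")))
```

The returned list can be passed as-is to `subprocess.run` from the directory that contains ReDtandem.pl.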
The final output file is written to the directory where ReDtandem was executed, as redtandem.outXXXX (where XXXX may vary). This result file is a tabulated text file with a header line of column names, followed by one predicted Tandem Array (TA) per line. For each TA, the columns are:
- chrom: the number of the chromosome where the TA appears
- start: the position of the first base of the TA
- end: the position of the last base of the TA
- u_start: the position of the first base of the reference unit for the TA
- u_end: the position of the last base of the reference unit for the TA
- numDupli: the number of detected duplicated regions in the TA
- dupli: the positions of every Tandem Unit (TU) in the TA. Each TU is described by the positions of its first and last bases separated by "..", and successive TUs are separated by commas.
For example, the following line:
#chrom start end u_start u_end numDupli dupli
1 147972 156372 148332 149762 2 147972..150120,154224..156372,
represents a TA on chromosome 1, starting at position 147972 and ending at position 156372. The reference Tandem Unit spans positions 148332-149762. The TA contains two TUs, at 147972..150120 and 154224..156372 respectively.
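For downstream analysis, a result line can be turned into a small record. This is a minimal Python sketch under stated assumptions: the names `TandemArray` and `parse_ta_line` are hypothetical, fields are assumed to be separated by whitespace (tabs or spaces), and the chromosome is kept as a string in case it is not purely numeric.

```python
from typing import List, NamedTuple, Tuple


class TandemArray(NamedTuple):
    chrom: str
    start: int
    end: int
    u_start: int
    u_end: int
    num_dupli: int
    units: List[Tuple[int, int]]  # (first base, last base) of each TU


def parse_ta_line(line: str) -> TandemArray:
    """Parse one data line of the redtandem.outXXXX result file."""
    # Splitting on any whitespace handles tab- or space-separated fields.
    chrom, start, end, u_start, u_end, num_dupli, dupli = line.split()
    units = []
    for item in dupli.split(","):
        if not item:  # skip the empty piece left by the trailing comma
            continue
        first, last = item.split("..")
        units.append((int(first), int(last)))
    return TandemArray(chrom, int(start), int(end), int(u_start),
                       int(u_end), int(num_dupli), units)


ta = parse_ta_line(
    "1 147972 156372 148332 149762 2 147972..150120,154224..156372,"
)
print(ta.units)  # [(147972, 150120), (154224, 156372)]
```

Lines starting with "#" are header lines and should be skipped before calling the parser.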
For licensing information see license.html
The ReDTandem script produces intermediate files stored in a temporary directory. If you want to keep these files and look at them, use the "--noclean" flag with the ReDtandem.pl script. The following files will then be available in the temporary directory:
- glint.out: output from the genome aligner/anchor detection software glint.
- mdust.out: output of mdust.
- glint_chrom.out: glint output translated in ReD format (an extra "chrom" column is added).
- glint_chrom.red: the previous file, filtered with mdust and possibly for centromeric regions, ready for ReD.
- red.align.out: list of all anchors that have been used by ReD in the graph.
- red.alignUse.out: list of all anchors which have been chained by ReD.
- red.chain.out: list of chains produced by ReD.
- unit.out: a file that contains all detected TAs (Tandem Arrays) with their associated reference units, built from "red.chain.out" and "red.align.out".