ReDTandem : Region duplicate in Tandem
This repository holds the source code of the ReDTandem software.
The ReD Tandem application is made available here as a Perl wrapper script together with source files. The following dependencies are assumed to be satisfied:
- C and C++ compilers (gcc/g++)
- NCBI Blast with 2 executables available in the binaries path:
The executables are installed in the directory.
The "ReDtandem.pl" script executes the ReD Tandem pipeline. To execute
it, just type:
The following arguments are expected:
  --dnafile         -> the DNA file
  --species         -> the name of the species (as in the DNA file, see below)
  --maxchaindist    -> maximum distance between chains :
                         arabidopsis thaliana : 150000
                         mouse and human      : 3000000
  --maxanchordist   -> maximum distance between anchors :
                         arabidopsis thaliana : 40000
                         mouse and human      : 300000
  --ratioanchordist -> ratio between anchor's score and distance's score :
                         mean anchor's score = ratio * mean distance's score
                         arabidopsis thaliana : 1.2
                         mouse and human      : 1.4
  --centro          -> a file describing the position of centromeres (optional)
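As an illustration only, the recommended values listed above can be kept in a small table and turned into a command line. The sketch below is not part of ReDTandem: the preset names ("arabidopsis", "mammal") and the helper build_command are hypothetical, and the function only builds the argument list without running anything.

```python
# Parameter values copied from the argument list above. The preset names
# ("arabidopsis", "mammal") are hypothetical labels, not ReDtandem options.
PRESETS = {
    "arabidopsis": {"maxchaindist": 150000, "maxanchordist": 40000, "ratioanchordist": 1.2},
    "mammal": {"maxchaindist": 3000000, "maxanchordist": 300000, "ratioanchordist": 1.4},
}


def build_command(species, dnafile, preset, centro=None):
    """Return the argument list for a ReDtandem.pl run (nothing is executed)."""
    params = PRESETS[preset]
    cmd = ["perl", "ReDtandem.pl", "--species", species, "--dnafile", dnafile]
    for name in ("maxchaindist", "maxanchordist", "ratioanchordist"):
        cmd += ["--" + name, str(params[name])]
    if centro is not None:
        # --centro is optional, so it is only added when a file is given.
        cmd += ["--centro", centro]
    return cmd
```

The resulting list could be passed to subprocess.run if desired; the species code must still match the one used in the FASTA headers of the DNA file.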
The DNA file should contain the genomic sequence considered with one
FASTA header per chromosome sequence. The FASTA header should be
formatted as ">ath1_1-9639975" where:
* ath : identifies the species considered
* 1 : is the chromosome number
* 1-9639975 : positions of the first and last nucleotide in the sequence.
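To check that a DNA file follows this convention before running the pipeline, a header can be split into its parts as in the following minimal sketch. It assumes that the species code contains only letters and that the chromosome identifier is a plain number, as in the example header; the function name parse_header is hypothetical and is not part of ReDTandem.

```python
import re

# Assumed layout: ">" + species letters + chromosome number + "_" + first-last.
HEADER_RE = re.compile(r"^>([A-Za-z]+)(\d+)_(\d+)-(\d+)$")


def parse_header(header):
    """Split a header such as '>ath1_1-9639975' into its four parts."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError("unexpected FASTA header: %r" % header)
    species, chrom, first, last = match.groups()
    return {
        "species": species,
        "chrom": int(chrom),
        "first": int(first),
        "last": int(last),
    }
```

For the header given above, this yields the species "ath", chromosome 1 and the positions 1 and 9639975.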
perl ReDtandem.pl --species ath --dnafile ./example/dna.fa --centro ./example/centro.out
The final output file will be available in the directory where
ReDtandem was executed as redtandem.outXXXX (where XXXX may vary). This
result file is a tabulated file with column headers, formatted with one
predicted Tandem Array (TA) per line. For each TA, the different
columns are:
- chrom: the number of the chromosome where the TA appears
- start: the position of the first base of the TA
- end: the position of the last base of the TA
- u_start: the position of the first base of the reference unit for
  the TA
- u_end: the position of the last base of the reference unit for
  the TA
- numDupli: the number of detected duplicated regions in the TA
- dupli: the positions of every Tandem Unit (TU) in the TA. Each TU
is described by the position of its first and last bases
separated by "..". Then all TUs are separated using a comma.
For example, the following line:
#chrom start end u_start u_end numDupli dupli
1 147972 156372 148332 149762 2 147972..150120,154224..156372,
represents a TA appearing on chromosome 1, starting at position 147972
and ending at 156372. The reference Tandem Unit detected appears at
position 148332-149762. The TA contains just two TUs, respectively at
147972..150120 and 154224..156372.
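The result lines can be read back with a few lines of code. The sketch below is one possible way to do it and is not shipped with ReDTandem: the function name parse_ta_line is hypothetical, and it assumes that fields are separated by whitespace (tabs or spaces) and that the "dupli" column may end with a trailing comma, as in the example line.

```python
def parse_ta_line(line):
    """Parse one result line (not the '#chrom ...' header) into a dict."""
    fields = line.split()  # assumption: whitespace-separated columns
    chrom, start, end, u_start, u_end, num_dupli = fields[:6]
    units = []
    for item in fields[6].split(","):
        if item:  # the trailing comma leaves an empty last item
            first, last = item.split("..")
            units.append((int(first), int(last)))
    return {
        "chrom": chrom,  # kept as text, in case identifiers are not numeric
        "start": int(start),
        "end": int(end),
        "u_start": int(u_start),
        "u_end": int(u_end),
        "numDupli": int(num_dupli),
        "dupli": units,
    }
```

Applied to the example line, the "dupli" entry holds the two TUs as pairs of first and last positions, and its length matches numDupli.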
For licensing information see license.html
The ReDTandem script produces intermediate files stored in a temporary
directory. If you want to keep these files and look at them, you can
use the "--noclean" flag with the ReDtandem.pl script. The following files
will be available in the temporary directory:
- glint.out: output from the genome aligner/anchor detection software
- mdust.out: output of mdust.
- glint_chrom.out: glint output translated in ReD format (an extra
"chrom" column is added).
- glint_chrom.red: previous file, filtered with mdust and possibly
for centromeric regions, ready for ReD.
- red.align.out: list of all anchors that have been used by ReD in
- red.alignUse.out: list of all anchors which have been chained by
- red.chain.out: list of chains produced by ReD.
- unit.out: a file that contains all detected TA (Tandem Arrays) with
the associated reference unit, built from "red.chain.out" and