dd_detection

Description

dd_detection is a tool for finding dispersed duplications (DDs) in high-throughput sequencing data. This is the stable standalone tool for finding DDs; its functionality will be incorporated into future versions of Pindel (see: https://github.com/genome/pindel).

Installation

dd_detection depends on samtools (available from http://sourceforge.net/projects/samtools). Compiling requires make and a C++ compiler with OpenMP support. To help make find your samtools files, run it the first time with:

make SAMTOOLS=/path/to/samtools

This will generate a local file (Makefile.local). Running make normally will generate the dd_detection executable in the src folder. Executing dd_detection without any command line parameters will show a description of how to use the program.

Usage

After successful installation, one can run dd_detection from the command line with the following parameters:

Synopsis:

dd_detection -f <reference.fasta> -i <input_config.txt> -c <region> 
    -o <output_prefix> [--MAX_DD_BREAKPOINT_DISTANCE] 
    [--MAX_DISTANCE_CLUSTER_READS] [--MIN_DD_CLUSTER_SIZE] 
    [--MIN_DD_BREAKPOINT_SUPPORT] [--MIN_DD_MAP_DISTANCE]
    [--DD_REPORT_DUPLICATION_READS]

Mandatory parameters:

-f <reference.fasta> Location of the reference genome in fasta format.

-i <input_config.txt> Location of the configuration file for the alignments. <input_config.txt> can be any text file containing the path to the BAM/SAM file, the insert size and the sample name, separated by whitespace. E.g.: "path/to/file.bam 500 sample"

-c <region> Region of the genome to analyze in format chr:start-end. -c ALL will analyze the whole genome.
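As a minimal sketch of how the mandatory parameters fit together, the Python snippet below writes a one-line configuration file in the format described for -i and assembles the argument list from the synopsis. The helper names (write_config, build_command) and the example file names are hypothetical and not part of dd_detection; the snippet only prints the command and does not run the tool.

```python
import shlex
import tempfile
from pathlib import Path


def write_config(config_path, alignment_path, insert_size, sample_name):
    """Write the whitespace-separated line described for -i:
    path to the BAM/SAM file, insert size, sample name."""
    line = f"{alignment_path} {insert_size} {sample_name}\n"
    Path(config_path).write_text(line)
    return line


def build_command(reference, config_path, region, output_prefix,
                  executable="dd_detection"):
    """Assemble the mandatory arguments in the order shown in the synopsis."""
    return [executable, "-f", str(reference), "-i", str(config_path),
            "-c", region, "-o", str(output_prefix)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file names, used for illustration only.
        config = Path(tmp) / "input_config.txt"
        write_config(config, "path/to/file.bam", 500, "sample")
        command = build_command("reference.fasta", config, "ALL", "output_prefix")
        print(shlex.join(command))
```

The resulting list can be handed to subprocess.run once the executable has been built; the optional parameters are left out here on purpose.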

Optional parameters:

--MAX_DD_BREAKPOINT_DISTANCE Maximum distance between dispersed duplication breakpoints to assume they refer to the same event. One may increase this value to increase sensitivity in case of low coverage, or decrease this value to decrease potential false positives. (default: 350)

--MAX_DISTANCE_CLUSTER_READS Maximum distance between reads for them to provide evidence for a single breakpoint for dispersed duplications. Increase this value to make the algorithm more lenient in clustering reads that map far apart; this may improve performance on low-coverage data. (default: 100)

--MIN_DD_CLUSTER_SIZE Minimum number of reads needed for calling a breakpoint for dispersed duplications. Increase this value to lower the chance of false positives. Decreasing this value may increase the false positive rate due to alignment errors occurring in the input data. (default: 3)

--MIN_DD_BREAKPOINT_SUPPORT Minimum number of split reads for calling an exact breakpoint for dispersed duplications. Increase this value in case of false positives; decrease it to make the algorithm more sensitive in breakpoint calling based on split reads. (default: 3)

--MIN_DD_MAP_DISTANCE Minimum mapping distance of read pairs for them to be considered discordant. This parameter sets a lower bound on how dispersed the found duplications must be. Decreasing this value too much will generate false positives due to other types of variation (e.g. tandem duplications). (default: 8000)

--DD_REPORT_DUPLICATION_READS Report discordant sequences and positions for mates of reads mapping inside dispersed duplications. (default: false)

Output file format

DD events are reported in a file whose name carries the suffix "_DD". An example output for DD detection looks like this:

####################################################################################################
1   DD  reference   7068    7069    48  19  5   19  5
# Dispersed Duplication insertion (DD) found on chromosome 'reference', breakpoint at 7068 (estimated from + strand), 7069 (estimated from - strand)
# Found 48 supporting reads, of which 19 discordant reads and 5 split reads at 5' end, 19 discordant reads and 5 split reads at 3' end.
# Supporting reads for insertion location (5' end):
# Reference: TCGCCTATCTCACGATCGCCTCAATGCACCCGACGATAGGGCTCCCGTTGACCTTCAACAGCTTCGGTGGCTACTAGATACTCtattaaagggtcattggcgaaaaggcatagttgccgagggctcatggaagccagattcttcgtagattacacgacacagttcgc
#            TCGCCTATCTCACGATCGCCTCAATGCACCCGACGATAGGGCTCCCGTTGACCTTCAACAGCTTCGGTGGCTACTAGATACTCCAATCCTGGCTAATCTC (name: @read_393/2 sample: sample1) 
#                             GCCTCAATGCACCCGACGATAGGGCTCCCGTTGACCTTCAACAGCTTCGGTGGCTACTAGATACTCCAATCCTGGCTAATCTCTCATACCGGCACCGCTC (name: @read_394/2 sample: sample1) 
#                                              GATAGGGCTCCCGTTGACCTTCAACAGCTTCGGTGGCTACTAGATACTCCAATCCTGGCTAATCTCTCATACCGGCACCGCTCTGTCGGTCGCGAAATGC (name: @read_395/2 sample: sample1) 
#                                                              ACCTTCAACAGCTTCGGTGGCTACTAGATACTCCAATCCTGGCTAATCTCTCATACCGGCACCGCTCTGTCGGTCGCGAAATGCAACGCCCACGTTATGG (name: @read_396/2 sample: sample1) 
#                                                                               TGGCTACTAGATACTCCAATCCTGGCTAATCTCTCATACCGGCACCGCTCTGTCGGTCGCGAAATGCAACGCCCACGTTATGGTGGGAGGCTTCCGCAGC (name: @read_397/2 sample: sample1) 
# Supporting reads for insertion location (3' end):
# Reference: atctcacgatcgcctcaatgcacccgacgatagggctcccgttgaccttcaacagcttcggtggctactagatactcTATTAAAGGGTCATTGGCGAAAAGGCATAGTTGCCGAGGGCTCATGGAAGCCAGATTCTTCGTAGATTACACGACACAGTTCGCCACAGC
#                                                                               TCGCGGCATTTATTAAAGGGTCATTGGCGAAAAGGCATAGTTGCCGAGGGCTCATGGAAGCCAGATTCTTCGTAGATTACACGACACAGTTCGCCACAGC (name: @read_457/1 sample: sample1) 
#                                                              TGTTCCCCACACAGCGCTCGCGGCATTTATTAAAGGGTCATTGGCGAAAAGGCATAGTTGCCGAGGGCTCATGGAAGCCAGATTCTTCGTAGATTACACG (name: @read_456/1 sample: sample1) 
#                                             ATAGGATTGGCTCAAACTGTTCCCCACACAGCGCTCGCGGCATTTATTAAAGGGTCATTGGCGAAAAGGCATAGTTGCCGAGGGCTCATGGAAGCCAGAT (name: @read_455/1 sample: sample1) 
#                            ATCCAGCTGGTGTTAATATAGGATTGGCTCAAACTGTTCCCCACACAGCGCTCGCGGCATTTATTAAAGGGTCATTGGCGAAAAGGCATAGTTGCCGAGG (name: @read_454/1 sample: sample1) 
#            TGACCCTCTATCTCAAATCCAGCTGGTGTTAATATAGGATTGGCTCAAACTGTTCCCCACACAGCGCTCGCGGCATTTATTAAAGGGTCATTGGCGAAAA (name: @read_453/1 sample: sample1) 
# All supporting sequences for this insertion (i.e. sequences that map inside the inserted element):
?   ?   ?   @read_457/1 sample1 -   TCGCGGCATT
?   ?   ?   @read_456/1 sample1 -   TGTTCCCCACACAGCGCTCGCGGCATT
?   ?   ?   @read_455/1 sample1 -   ATAGGATTGGCTCAAACTGTTCCCCACACAGCGCTCGCGGCATT
?   ?   ?   @read_454/1 sample1 -   ATCCAGCTGGTGTTAATATAGGATTGGCTCAAACTGTTCCCCACACAGCGCTCGCGGCATT
?   ?   ?   @read_453/1 sample1 -   TGACCCTCTATCTCAAATCCAGCTGGTGTTAATATAGGATTGGCTCAAACTGTTCCCCACACAGCGCTCGCGGCATT
?   ?   ?   @read_393/2 sample1 +   CAATCCTGGCTAATCTC
?   ?   ?   @read_394/2 sample1 +   CAATCCTGGCTAATCTCTCATACCGGCACCGCTC
?   ?   ?   @read_395/2 sample1 +   CAATCCTGGCTAATCTCTCATACCGGCACCGCTCTGTCGGTCGCGAAATGC
?   ?   ?   @read_396/2 sample1 +   CAATCCTGGCTAATCTCTCATACCGGCACCGCTCTGTCGGTCGCGAAATGCAACGCCCACGTTATGG
?   ?   ?   @read_397/2 sample1 +   CAATCCTGGCTAATCTCTCATACCGGCACCGCTCTGTCGGTCGCGAAATGCAACGCCCACGTTATGGTGGGAGGCTTCCGCAGC
reference   136603  -   @read_452/1 sample1 -   TTAATAAATGCCGCGAGCGCTGTGTGGGGAACAGTTTGAGCCAATCCTATATTAACACCAGCTGGATTTGAGATAGAGGGTCAATCGGGTGCCCTGTGAC
reference   136620  -   @read_451/1 sample1 -   CGCTGTGTGGGGAACAGTTTGAGCCAATCCTATATTAACACCAGCTGGATTTGAGATAGAGGGTCAATCGGGTGCCCTGTGACCCCGTAGCATGGGCATA
reference   136637  -   @read_450/1 sample1 -   TTTGAGCCAATCCTATATTAACACCAGCTGGATTTGAGATAGAGGGTCAATCGGGTGCCCTGTGACCCCGTAGCATGGGCATAGGTAAGCTGAGCCTCAT
reference   136654  -   @read_449/1 sample1 -   TTAACACCAGCTGGATTTGAGATAGAGGGTCAATCGGGTGCCCTGTGACCCCGTAGCATGGGCATAGGTAAGCTGAGCCTCATCGTCCGAACTTCCGTCA
reference   136670  -   @read_448/1 sample1 -   TTGAGATAGAGGGTCAATCGGGTGCCCTGTGACCCCGTAGCATGGGCATAGGTAAGCTGAGCCTCATCGTCCGAACTTCCGTCAGGATAAAGGCTGGAAG
reference   136687  -   @read_447/1 sample1 -   TCGGGTGCCCTGTGACCCCGTAGCATGGGCATAGGTAAGCTGAGCCTCATCGTCCGAACTTCCGTCAGGATAAAGGCTGGAAGAAGTTCAGGTTCGCTAG
reference   136704  -   @read_446/1 sample1 -   CCGTAGCATGGGCATAGGTAAGCTGAGCCTCATCGTCCGAACTTCCGTCAGGATAAAGGCTGGAAGAAGTTCAGGTTCGCTAGTGCGGGGAGAAGCGTTC
reference   136721  -   @read_445/1 sample1 -   GTAAGCTGAGCCTCATCGTCCGAACTTCCGTCAGGATAAAGGCTGGAAGAAGTTCAGGTTCGCTAGTGCGGGGAGAAGCGTTCTTCGGCCCAACTAGGAC
reference   136737  -   @read_444/1 sample1 -   CGTCCGAACTTCCGTCAGGATAAAGGCTGGAAGAAGTTCAGGTTCGCTAGTGCGGGGAGAAGCGTTCTTCGGCCCAACTAGGACTCCTCGTTAACTGCCG
reference   136754  -   @read_443/1 sample1 -   GGATAAAGGCTGGAAGAAGTTCAGGTTCGCTAGTGCGGGGAGAAGCGTTCTTCGGCCCAACTAGGACTCCTCGTTAACTGCCGTGCCTCTTTGATTTTTA
reference   136771  -   @read_442/1 sample1 -   AGTTCAGGTTCGCTAGTGCGGGGAGAAGCGTTCTTCGGCCCAACTAGGACTCCTCGTTAACTGCCGTGCCTCTTTGATTTTTATGACGCTGAGAGGCTCG
reference   136788  -   @read_441/1 sample1 -   GCGGGGAGAAGCGTTCTTCGGCCCAACTAGGACTCCTCGTTAACTGCCGTGCCTCTTTGATTTTTATGACGCTGAGAGGCTCGATGATCACTCATATGTC
reference   136804  -   @read_440/1 sample1 -   TTCGGCCCAACTAGGACTCCTCGTTAACTGCCGTGCCTCTTTGATTTTTATGACGCTGAGAGGCTCGATGATCACTCATATGTCCGACGTTGCCACAAGG
reference   136807  +   @read_416/2 sample1 +   GGCCCAACTAGGACTCCTCGTTAACTGCCGTGCCTCTTTGATTTTTATGACGCTGAGAGGCTCGATGATCACTCATATGTCCGACGTTGCCACAAGGTGG
reference   136821  -   @read_439/1 sample1 -   TCCTCGTTAACTGCCGTGCCTCTTTGATTTTTATGACGCTGAGAGGCTCGATGATCACTCATATGTCCGACGTTGCCACAAGGTGGCTAGATCATTTCCC
reference   136823  +   @read_415/2 sample1 +   CTCGTTAACTGCCGTGCCTCTTTGATTTTTATGACGCTGAGAGGCTCGATGATCACTCATATGTCCGACGTTGCCACAAGGTGGCTAGATCATTTCCCGC
reference   136838  -   @read_438/1 sample1 -   GCCTCTTTGATTTTTATGACGCTGAGAGGCTCGATGATCACTCATATGTCCGACGTTGCCACAAGGTGGCTAGATCATTTCCCGCACGCAGGTCATATTG
reference   136840  +   @read_414/2 sample1 +   CTCTTTGATTTTTATGACGCTGAGAGGCTCGATGATCACTCATATGTCCGACGTTGCCACAAGGTGGCTAGATCATTTCCCGCACGCAGGTCATATTGCA
reference   136855  -   @read_437/1 sample1 -   GACGCTGAGAGGCTCGATGATCACTCATATGTCCGACGTTGCCACAAGGTGGCTAGATCATTTCCCGCACGCAGGTCATATTGCATCGTGTGCCAGTAGT
reference   136857  +   @read_413/2 sample1 +   CGCTGAGAGGCTCGATGATCACTCATATGTCCGACGTTGCCACAAGGTGGCTAGATCATTTCCCGCACGCAGGTCATATTGCATCGTGTGCCAGTAGTGT
reference   136871  -   @read_436/1 sample1 -   ATGATCACTCATATGTCCGACGTTGCCACAAGGTGGCTAGATCATTTCCCGCACGCAGGTCATATTGCATCGTGTGCCAGTAGTGTGGCGTATGGCTCGC
reference   136874  +   @read_412/2 sample1 +   ATCACTCATATGTCCGACGTTGCCACAAGGTGGCTAGATCATTTCCCGCACGCAGGTCATATTGCATCGTGTGCCAGTAGTGTGGCGTATGGCTCGCTTC
reference   136888  -   @read_435/1 sample1 -   CGACGTTGCCACAAGGTGGCTAGATCATTTCCCGCACGCAGGTCATATTGCATCGTGTGCCAGTAGTGTGGCGTATGGCTCGCTTCAGGCCTGAGCAAGC
reference   136890  +   @read_411/2 sample1 +   ACGTTGCCACAAGGTGGCTAGATCATTTCCCGCACGCAGGTCATATTGCATCGTGTGCCAGTAGTGTGGCGTATGGCTCGCTTCAGGCCTGAGCAAGCCG
reference   136905  -   @read_434/1 sample1 -   GGCTAGATCATTTCCCGCACGCAGGTCATATTGCATCGTGTGCCAGTAGTGTGGCGTATGGCTCGCTTCAGGCCTGAGCAAGCCGAGCACCGTCACAATC
reference   136907  +   @read_410/2 sample1 +   CTAGATCATTTCCCGCACGCAGGTCATATTGCATCGTGTGCCAGTAGTGTGGCGTATGGCTCGCTTCAGGCCTGAGCAAGCCGAGCACCGTCACAATCAA
reference   136924  +   @read_409/2 sample1 +   CGCAGGTCATATTGCATCGTGTGCCAGTAGTGTGGCGTATGGCTCGCTTCAGGCCTGAGCAAGCCGAGCACCGTCACAATCAATTGCAGTACAAAATTCG
reference   136941  +   @read_408/2 sample1 +   CGTGTGCCAGTAGTGTGGCGTATGGCTCGCTTCAGGCCTGAGCAAGCCGAGCACCGTCACAATCAATTGCAGTACAAAATTCGTGACCGGTCGTCGTATC
reference   136957  +   @read_407/2 sample1 +   GGCGTATGGCTCGCTTCAGGCCTGAGCAAGCCGAGCACCGTCACAATCAATTGCAGTACAAAATTCGTGACCGGTCGTCGTATCACATGGAGCTGTAATG
reference   136974  +   @read_406/2 sample1 +   AGGCCTGAGCAAGCCGAGCACCGTCACAATCAATTGCAGTACAAAATTCGTGACCGGTCGTCGTATCACATGGAGCTGTAATGAGCCGAATCGGTAGCAG
reference   136991  +   @read_405/2 sample1 +   GCACCGTCACAATCAATTGCAGTACAAAATTCGTGACCGGTCGTCGTATCACATGGAGCTGTAATGAGCCGAATCGGTAGCAGTAGCGCTATCCAGGGTC
reference   137008  +   @read_404/2 sample1 +   TGCAGTACAAAATTCGTGACCGGTCGTCGTATCACATGGAGCTGTAATGAGCCGAATCGGTAGCAGTAGCGCTATCCAGGGTCTCAGACGACCCCACAAC
reference   137024  +   @read_403/2 sample1 +   TGACCGGTCGTCGTATCACATGGAGCTGTAATGAGCCGAATCGGTAGCAGTAGCGCTATCCAGGGTCTCAGACGACCCCACAACACTCAACGACGACTGA
reference   137041  +   @read_402/2 sample1 +   ACATGGAGCTGTAATGAGCCGAATCGGTAGCAGTAGCGCTATCCAGGGTCTCAGACGACCCCACAACACTCAACGACGACTGATGCTGCGGAAGCCTCCC
reference   137058  +   @read_401/2 sample1 +   GCCGAATCGGTAGCAGTAGCGCTATCCAGGGTCTCAGACGACCCCACAACACTCAACGACGACTGATGCTGCGGAAGCCTCCCACCATAACGTGGGCGTT
reference   137075  +   @read_400/2 sample1 +   AGCGCTATCCAGGGTCTCAGACGACCCCACAACACTCAACGACGACTGATGCTGCGGAAGCCTCCCACCATAACGTGGGCGTTGCATTTCGCGACCGACA
reference   137091  +   @read_399/2 sample1 +   TCAGACGACCCCACAACACTCAACGACGACTGATGCTGCGGAAGCCTCCCACCATAACGTGGGCGTTGCATTTCGCGACCGACAGAGCGGTGCCGGTATG
reference   137108  +   @read_398/2 sample1 +   ACTCAACGACGACTGATGCTGCGGAAGCCTCCCACCATAACGTGGGCGTTGCATTTCGCGACCGACAGAGCGGTGCCGGTATGAGAGATTAGCCAGGATT

DD calls are separated with a string of hash characters (#). Each DD call starts with a line of tab-separated values summarizing the event. The values are as follows (in this order):

  1. Event identification number (integer).
  2. Type of event (currently simply "DD" for all events).
  3. Sequence name on which DD event is located.
  4. Location of DD event as estimated from evidence based on the forward strand.
  5. Location of DD event as estimated from evidence based on the reverse strand.
  6. Total number of reads (both split and discordant reads) supporting the event.
  7. Number of supporting discordant reads on forward strand.
  8. Number of supporting split reads on forward strand.
  9. Number of supporting discordant reads on reverse strand.
  10. Number of supporting split reads on reverse strand.
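The ten summary fields listed above can be read into a small record. The following Python sketch is not part of dd_detection; the class and field names are chosen here for illustration, and it assumes that sequence names contain no whitespace, so that splitting on any whitespace is safe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DDSummary:
    event_id: int
    event_type: str
    sequence_name: str
    forward_position: int    # location estimated from forward-strand evidence
    reverse_position: int    # location estimated from reverse-strand evidence
    total_support: int       # split plus discordant reads
    forward_discordant: int
    forward_split: int
    reverse_discordant: int
    reverse_split: int


def parse_summary(line):
    """Parse the summary line that opens a DD call."""
    fields = line.split()
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}")
    # Fields 4 to 10 are all integers, in the order declared above.
    numbers = [int(value) for value in fields[3:]]
    return DDSummary(int(fields[0]), fields[1], fields[2], *numbers)


if __name__ == "__main__":
    # Summary line taken from the example output above.
    summary = parse_summary("1\tDD\treference\t7068\t7069\t48\t19\t5\t19\t5")
    print(summary)
```

In the example output, the total of 48 supporting reads equals the sum of the four per-strand counts (19 + 5 + 19 + 5), which can serve as a quick sanity check on parsed records.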

The next few lines for each event are prefixed with a hash character (#) and show the event summary in a human-readable way. Where possible, the supporting split reads are also shown here, aligned to the local reference sequence to depict the likely exact breakpoint position of the event.
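The separator, summary and comment conventions described so far are enough to split an output file into events. The Python sketch below shows one way to do this; it is not part of dd_detection, the function name is hypothetical, and it assumes that a line consisting only of hash characters is always a separator.

```python
def split_events(lines):
    """Group the lines of a "_DD" output file into one dict per event."""
    events = []
    current = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        if set(line.strip()) == {"#"}:
            current = None  # separator: the next line starts a new event
            continue
        if current is None:
            current = {"summary": None, "comments": [], "reads": []}
            events.append(current)
        if line.startswith("#"):
            current["comments"].append(line)  # human-readable description
        elif current["summary"] is None:
            current["summary"] = line         # first data line is the summary
        else:
            current["reads"].append(line)     # remaining per-read lines
    return events


if __name__ == "__main__":
    example = [
        "####################",
        "1\tDD\treference\t7068\t7069\t48\t19\t5\t19\t5",
        "# Dispersed Duplication insertion (DD) found on chromosome 'reference'",
        "?\t?\t?\t@read_457/1\tsample1\t-\tTCGCGGCATT",
    ]
    for event in split_events(example):
        print(event["summary"], len(event["comments"]), len(event["reads"]))
```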

Finally, a number of lines are printed per event that give information on the read ends that (partly) map to the duplicated segment's sequence. Each line holds the following tab-separated values:

  1. Sequence name to which the read end was alternatively mapped ("?" for split reads).
  2. Location on sequence to which the read end was alternatively mapped ("?" for split reads).
  3. Strand to which the read end was alternatively mapped (forward "+", reverse "-", again "?" in case of a split read).
  4. The read name.
  5. Name of the sample from which the read originated.
  6. Strand to which the mate of this read end mapped (forward "+" or reverse "-").
  7. (Part of) the sequence that maps inside the duplicated segment.
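The seven per-read fields listed above can be parsed along the following lines. This Python sketch is not part of dd_detection; the names are hypothetical, the "?" placeholders of split reads are turned into None, and read and sample names are assumed to contain no whitespace.

```python
from typing import NamedTuple, Optional


class SupportingRead(NamedTuple):
    mapped_sequence: Optional[str]   # None for split reads ("?")
    mapped_position: Optional[int]   # None for split reads ("?")
    mapped_strand: Optional[str]     # "+", "-", or None for split reads
    read_name: str
    sample: str
    mate_strand: str                 # "+" or "-"
    sequence: str                    # part that maps inside the duplication

    @property
    def is_split(self):
        """True when the alternative mapping is unknown, i.e. a split read."""
        return self.mapped_sequence is None


def parse_supporting_read(line):
    """Parse one per-read line of a DD call."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    name, position, strand, read_name, sample, mate_strand, sequence = fields
    return SupportingRead(
        None if name == "?" else name,
        None if position == "?" else int(position),
        None if strand == "?" else strand,
        read_name,
        sample,
        mate_strand,
        sequence,
    )


if __name__ == "__main__":
    # Split-read line taken from the example output above.
    read = parse_supporting_read("?\t?\t?\t@read_457/1\tsample1\t-\tTCGCGGCATT")
    print(read.read_name, read.is_split, read.sequence)
```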

Authors

This software package is the result of efforts made by M. Kroon, K. Ye, E.W. Lameijer, N. Lakenberg, J.Y. Hehir-Kwa, D.T. Thung, P.E. Slagboom and J. Kok. (Contacting author: kye@genome.wustl.edu)

This publication was supported by the Dutch national program COMMIT. http://commit-nl.nl