A python utility for the detection of mosaic structural variants
Istituto di Genomica Applicata (IGA) & IGA Technology Services
via J. Linussio 51 - 33100 - Udine - Italy
Authors: Fabio Marroni, Davide Scaglione
Contacts: firstname.lastname@example.org, email@example.com
Date of creation: 10-02-14
The software is aimed to the detection of mosaic structural variants (>100Kbp) from NGS data. Mosaic structural variants (SVs) cause unbalances in allele frequencies in the mutated sample sample compared to the wild-type sample. χ-scan uses SNP genotypes obtained by Next Generation Sequencing (NGS) to detect unbalanced presence of one allele in one sample compared to the other, and thus presence of structural variations. It therefore requires a test sample and a control sample originated from the same meiotic event. (e.g. tumor vs normal or a plant clone vs a mutant clone)
χ-scan implements several methods to determine significantly unbalanced allele frequencies over windows of SNP sites, either relying on chi-square test (chi_reads method) or A meta-analysis of Fisher's exact tests (phased_fisher method). The first assumes that the reference genome sequence used during SNP calling has extensive haplotype sharing with the sample in the order of window size (typically 200-400 SNP sites). If this is not the case, the phased_fisher methods relies on pre-computed haplotype blocks. χ-scan accepts phase blocks generated from HAPCUT software [https://sites.google.com/site/vibansal/software/hapcut] which assemble them by sequence information. Therefore the method is only powerful when heterozygous SNP density is lower than average fragment size (single reads) or pair distance (paired-ends). Future release will try to deal with phases computed genetically on population linkage-disequilibrium.
Compatible with Python 2.7
To run χ-scan the following Python modules must be installed:
- numpy >= 1.7.0
- scipy >= 0.11.0
- pyvcf >= 0.6.4
- matplotlib >= 1.3.1
A VCF file (version 4.0 or greater) containing at least two samples (a control sample and a test sample).
If SNP calling was performed using GATK Unified Genotyper, additional fields will be taken into account during filtration of SNP site (i.e.
SB from the INFO field of the VCF).
sample_data folder contains a small VCFfile which simulates a deletion at dilution of 10%.
<Optional> User can provide information of reconstructed haplotype blocks from HAPCUT software. This will trigger a different behavior of the software (see Options). Instead of simply testing alternative alleles (ALT) versus reference alleles (REF) as provided from VCF file in any given window by a chi-square test (chi_reads method), the phased_fisher method is invoked by performing a Fisher's exact test on each known (i.e. reconstructed ) set of phased nucleotides within a window. Several Fisher's tests are therefore performed within a window and they are meta-analyzed using the weighted Stouffer's method, using as weights the length of each individual phase block. For more detail please refer to the paper. These allow to analyzed sample where the reference genome does not carry any haplotype sharing with sample under analyses. However, phase blocks are reconstructed by HAPCUT using sequence information and it works as long as SNP density is higher then the average span that such physical information can provide (fragment length or insert size). Such approach will fail in inbred lines or SNP-poor genomes like human. We are working to develop a version of the code which permit the utilization of phase reconstructed at population scale by linkage-disequilibrium analysis.
A very quick example command:
Xscan.py --vcf PB05_PN04_delPN01__PN05_PB04_delPB01_0.10dilution.vcf --sample1 PB05_PN04_delPN01 --sample2 PN05_PB04_delPB01 --out-folder sampledata_output --method chi_reads phased_fisher --hapcut sample_data.hapcut
NOTE: For every test each window p-value is adjusted for multiple testing with Benjamini-Hochberg method.
The input multi-sample VCF file. It can contain many sample, at least the two you want to compare.
--sample1 The name of one sample as reported in the VCF header (sample order does not matter)
--sample2 The name of one sample as reported in the VCF header (sample order does not matter)
The minimum per-sample coverage a SNP site must have to be used while populating SNPs window (default 10).
--method A list (separated by space) of methods the user want to invoke (available: chi_reads, chi_count, wmw_alt; default: chi_reads)
--out-folder A destination folder will be created. For each invoked method a sub folder will contain result files.
--SNPwindow The number of SNP sites collected per window (default 200, 100 in permutation mode)
--overlap Number of overlapping SNPs between windows (default 80, 30 in permutation mode)
--cov-ratio The main script will inspect some SNP positions to estimated the average coverage of each sample. This coefficient will then return the maximum coverage threshold for each sample (default 2.5).
--min-one-het At least one sample should reach this minor allele frequency to retain the position in the informative SNP sites catalog (a.k.a. the SNPs window, default 0.30)
--min-wins Minimum number of continuous positive windows to trigger a positive region. As default it is calculated as
(SNPwindow /SNPwindow - overlap) + 1
--min-adj-pvalue Minimum Benjamini-Hochberg-corrected pvalue to flag a window as positive (default 0.001, 0.0001 in permutation mode).
--refine Perform refinement of ranges by inspection of pvalue. A routine will be invoked to look at slopes of pvalues in the termini of candidate regions and where possible shave off positive windows that are considerably inflating the span of the region, i.e. rejection of null hypothesis by half of SNP sites in the window (Default: False).
--max-gap Gapfill phase 1: Gap between positive windows will be filled if the gap is not longer than max-gap (default: floor(min-wins/2)
--gap-ratio Gapfill phase 2: In a second stage of gap filling the algorithm will try to merge any two positive regions if a gap which is not longer than gap-ratio * the-shorter-region (default 0.50, 0.75 in permutation mode)
--no-gapfill Skip any gap filling routine.
--select-refs Defined set of reference sequences to process (space-separated)
--min-ref-length Minimum length to process reference sequence; override by --select-refs (default: 1000000)
PHASE-BLOCK special options
--hapcut HAPCUT phase output. Please refer to its own documentation. It should be generated from alignment and single-sample SNP calling of the same individual under analysis (either just by using the control sample or by merging reads of several samples coming from the same genotype, thus sharing phases, to increase coverage and gain in phase reconstruction)
--min-phased-collect Min contiguous phase information to consider while collecting HAPCUT data (deafult: 10); this will defined the set of sampled SNPs for windows generation
--blocks-thresholds These values will drive the phased_fisher meta-analyses. Three space-separated integers. [min_block_size(SNPs): Only execute Fisher's exact test on block with this minimum amount of SNPs ; min_block_num(#): Perform meta-analysis on windows with these many independent blocks; min_SNP_perwin(SNPs): execute meta-analysis if these many SNP sites are considered within the window] (deafult: 10 1 30)
dbSNP file In the main output folder a dbSNP file containing all SNP sites used for statistical testing will be dumped.
<prefix>_all_windows.txt In each method sub folder, a file containing results for each overlapping window.
<prefix>_sig_ranges.txt Final positive ranges after filtering, gap filling, and refinement.
<prefix>.svg Plot of allele frequencies with highlighting of final positive ranges. Please consider that using permutation, alternative alleles are scrambled across windows and allele frequencies lose their correct phase to that regard.
We thank Vittorio Zamboni and Alessandro Gervaso for precious help in code optimization. We thank Prof. Alberto Policriti for useful discussions and suggestions.
- Defined the deployment methods: chi_reads and phased_fisher
- Removed multi-thread to increase compatibility and memory efficiency
- Included rules for processing haplotype blocks
- Able to parse and collect phase block info from HAPCUT
- Permutation engine implemented
- Chi_reads method implemented
- Implemented window analysis
- Bug fix in chi_count method (excluded tie scenario from counting)
- dbSNP format dump, removed old SNP table output with window pvalue
- min heterozygosity for at least one sample set to default 0.30
- coefficient to calculate max coverage from median changed to 2.5
- refinement of boundaries routine based on pvalue analysis (--refine)
- color coding of ranges fixed on the basis of absolute value of delta AF