Wiki
Clone wikiATLAS / Variant Discovery Tools: allelePresence
Overview
allelePresence is a bayesian method for calling haploid genotypes, i.e. determining which allele is the most likely to be present. As a prior it uses Felsenstein's substitution model (1981), which describes the genotype probabilities as a function of the \(\theta\) (heterozygosity) and \(\boldsymbol{\pi}\) (base frequencies). The estimates for θ are either estimates for each genomic window on-the-fly or a fixed value can be provided. If you have a file with known alleles at specific sites, you can provide it and the calls will only be made at those sites and only for the specified alleles.
Disclaimer: This caller treats the genomes as haploid, which may create problems with downstream programs. Concretely, if an individual has the true genotype "C/T" at a given position and both alleles are equally represented in the sequencing reads, both alleles will have the same posterior probability and a random generator will decide which one will be called. Therefore, two runs of allelePresence for the same individual will not give identical results, unless the same seed is used. In addition, the posterior probabilities of the alleles do not sum to 1, which may cause problems with downstream programs that combine allelePresence calls by taking calling quality into account.
Input
BAM file
for output in vcf format: FASTA file
Base quality score recalibration files
PMD pattern file
optional: sites file with known alleles with the following tab-separated format
chr position ref_allele alt_allele
Output
The output is a gzipped file containing the calls either in .txt or .vcf format.
In .txt format, each line represents a position in the genome. The information given by the columns is the following:
- Chromosome
- Position on chromosome
- Coverage
- Posterior probability of all alleles
- Allele with highest posterior probability (MAP)
- Quality of the call (posterior probability of MAP)
The format of the output using vcf corresponds to an adaptation of the haploid VCFv4.2 format. Since the genomes called are in reality diploid, caution is advised when merging these files. For example, the quality of the calls is calculated as -10 x log(1.0 - posterior probability of the MAP)), or in other words as the phred-scaled probability of making an erroneous call. Since more than two alleles are correct at each position, these probabilities do not add up to one in contrast to what would be expected in a true haploid.
If the parameter sites was used, a conflicts.txt file may be produced. This file contains the positions, where the reference is not equal to either of the provided alleles. These conflicting positions are not called.
Usage Example
The required syntax to launch ATLAS as the allelePresence caller is:
./atlas task=allelePresence bam=example.bam
Example taking post-mortem damage and the recalibration parameters into account and outputting as vcf:
./atlas task=allelePresence bam=example.bam fasta=example.fa window=1000000 theta=0.001 pmdFile=example_PMD_input_Empiric.txt BQSRQuality=example_BQSR_ReadGroup_Quality_Table.txt BQSRPosition=example_BQSR_ReadGroup_Position_Table.txt BQSRPositionReverse=example_BQSR_ReadGroup_Position_Reverse_Table.txt BQSRContext=example_BQSR_ReadGroup_Context_Table.txt vcf verbose
or:
./atlas task=allelePresence bam=example.bam window=1000000 pmdFile=example_PMD_input_Empiric.txt recal=example_recalibrationEM.txt vcf verbose
Specific Arguments
- printAll : Print all sites, also the ones with coverage 0.
- sites : Specify input sites file that specifies the sites for which variants should be called and the known SNPs at those sites.
- theta : Set theta to a certain value. If you do not specify it the window-specific theta will be estimated on-the-fly.
- vcf : Print output in vcf format. Default = a flat .txt file
- noAltIfHomoRef : Do not print the most likely alternative allele when the haplotype is equal to the reference (0). This option was implemented because some downstream programs seem to be confused by the alternative alleles when the genotype is 0.
Engine Parameters
Engine parameters that are common to all tasks can be found [here](https://bitbucket.org/WegmannLab/atlas/wiki/Engine%20Parameters).
Method
allelePresence uses the same prior as our diploid bayesian caller callBayes. The methods are therefore identical except that the posterior probability of an allele is the sum of the posterior probabilities of all genotypes that contain the allele in question.
Updated