Wiki

Clone wiki

ATLAS / Variant Discovery Tools: allelePresence

Overview

allelePresence is a bayesian method for calling haploid genotypes, i.e. determining which allele is the most likely to be present. As a prior it uses Felsenstein's substitution model (1981), which describes the genotype probabilities as a function of the \(\theta\) (heterozygosity) and \(\boldsymbol{\pi}\) (base frequencies). The estimates for θ are either estimates for each genomic window on-the-fly or a fixed value can be provided. If you have a file with known alleles at specific sites, you can provide it and the calls will only be made at those sites and only for the specified alleles.

Disclaimer: This caller treats the genomes as haploid, which may create problems with downstream programs. Concretely, if an individual has the true genotype "C/T" at a given position and both alleles are equally represented in the sequencing reads, both alleles will have the same posterior probability and a random generator will decide which one will be called. Therefore, two runs of allelePresence for the same individual will not give identical results, unless the same seed is used. In addition, the posterior probabilities of the alleles do not sum to 1, which may cause problems with downstream programs that combine allelePresence calls by taking calling quality into account.

Input

  • BAM file

  • for output in vcf format: FASTA file

  • Base quality score recalibration files

  • PMD pattern file

  • optional: sites file with known alleles with the following tab-separated format

    chr position ref_allele alt_allele

Output

The output is a gzipped file containing the calls either in .txt or .vcf format.

In .txt format, each line represents a position in the genome. The information given by the columns is the following:

  • Chromosome
  • Position on chromosome
  • Coverage
  • Posterior probability of all alleles
  • Allele with highest posterior probability (MAP)
  • Quality of the call (posterior probability of MAP)

The format of the output using vcf corresponds to an adaptation of the haploid VCFv4.2 format. Since the genomes called are in reality diploid, caution is advised when merging these files. For example, the quality of the calls is calculated as -10 x log(1.0 - posterior probability of the MAP)), or in other words as the phred-scaled probability of making an erroneous call. Since more than two alleles are correct at each position, these probabilities do not add up to one in contrast to what would be expected in a true haploid.

If the parameter sites was used, a conflicts.txt file may be produced. This file contains the positions, where the reference is not equal to either of the provided alleles. These conflicting positions are not called.

Usage Example

The required syntax to launch ATLAS as the allelePresence caller is:

./atlas task=allelePresence bam=example.bam

Example taking post-mortem damage and the recalibration parameters into account and outputting as vcf:

./atlas task=allelePresence bam=example.bam fasta=example.fa window=1000000 theta=0.001 pmdFile=example_PMD_input_Empiric.txt BQSRQuality=example_BQSR_ReadGroup_Quality_Table.txt BQSRPosition=example_BQSR_ReadGroup_Position_Table.txt BQSRPositionReverse=example_BQSR_ReadGroup_Position_Reverse_Table.txt BQSRContext=example_BQSR_ReadGroup_Context_Table.txt vcf verbose

or:

./atlas task=allelePresence bam=example.bam window=1000000 pmdFile=example_PMD_input_Empiric.txt recal=example_recalibrationEM.txt vcf verbose

Specific Arguments

  • printAll : Print all sites, also the ones with coverage 0.
  • sites : Specify input sites file that specifies the sites for which variants should be called and the known SNPs at those sites.
  • theta : Set theta to a certain value. If you do not specify it the window-specific theta will be estimated on-the-fly.
  • vcf : Print output in vcf format. Default = a flat .txt file
  • noAltIfHomoRef : Do not print the most likely alternative allele when the haplotype is equal to the reference (0). This option was implemented because some downstream programs seem to be confused by the alternative alleles when the genotype is 0.

Engine Parameters

Engine parameters that are common to all tasks can be found [here](https://bitbucket.org/WegmannLab/atlas/wiki/Engine%20Parameters).

Method

allelePresence uses the same prior as our diploid bayesian caller callBayes. The methods are therefore identical except that the posterior probability of an allele is the sum of the posterior probabilities of all genotypes that contain the allele in question.

Updated