Wiki

Overview

This caller is a bayesian genotype caller. As a prior it uses the Felsenstein substitution model (1981), which describes the genotype probabilities as a function of \(\theta\) (heterozygosity) and \(\boldsymbol{\pi}\) (the base frequencies). The estimates for \(\theta\) are either estimates for each genomic window on-the-fly or a fixed value can be provided. If you have a file with known alleles at specific sites, you can provide it and the calls will only be made at those sites and only for the specified alleles.

Input

BAM file
for output in vcf format: FASTA file
Base quality score recalibration files
PMD pattern file
optional: sites file with known alleles with the following tab-separated format

chr position ref_allele alt_allele

Output

The output is a gzipped table either in .txt or .vcf format.

In .txt format, each line represents a position in the genome. The information given by the columns is the following:

Chromosome
Position on chromosome
Coverage
Posterior probability of all genotypes
Genotype with highest posterior probability (MAP)
Quality of the call (posterior probability of MAP)

The .vcf format corresponds to the standard VCFv4.2 format, where the genotype quality (FORMAT field GQ) is equal to (1.0 - posterior probability of the most probable genotype) in phred format. By default, the genotype probabilities of the three most probable genotypes (FORMAT field GP) are in the same format.

If the parameter sites was used, a conflicts.txt file may be produced. This file contains the positions, where the reference is not equal to either of the provided alleles. These positions are not called.

Usage Example

The required syntax to launch ATLAS as the bayesian genotype caller is:

./atlas task=callBayes bam=example.bam

Example taking post-mortem damage and the recalibration parameters into account and outputting as vcf:

./atlas task=callBayes bam=example.bam fasta=example.fa theta=0.001 pmdFile=example_PMD_input_Empiric.txt BQSRQuality=example_BQSR_ReadGroup_Quality_Table.txt BQSRPosition=example_BQSR_ReadGroup_Position_Table.txt BQSRPositionReverse=example_BQSR_ReadGroup_Position_Reverse_Table.txt BQSRContext=example_BQSR_ReadGroup_Context_Table.txt vcf verbose

or:

./atlas task=callBayes bam=example.bam fasta=example.fa pmdFile=example_PMD_input_Empiric.txt recal=example_recalibrationEM.txt vcf verbose

Specific Arguments

printAll : Print all sites, also the ones with coverage 0.
sites : Specify input sites file that specifies the sites for which variants should be called and the known SNPs at those sites.
theta : Set theta to a certain value. If you do not specify it the window-specific theta will be estimated on-the-fly.
vcf : Print output in vcf format. Default = a flat .txt file
noAltIfHomoRef : Do not print the most likely alternative allele when the haplotype is equal to the reference (0/0).
printPP : Output the genotype probabilities of all 10 genotypes in the INFO field.
onlyPhredGP : Output the genotype probabilities of the three most probable genotypes (Format field GP) in simple phred format (will create larger files)

Engine Parameters

Engine parameters that are common to all tasks can be found here.

Method

Let \(l\) and \(k\) be the alleles of the diploid genotype g and \(\pi_l\) and \(\pi_k\) their respective base frequencies within a genomic window. We infer the base frequencies and theta for each genomic window as described in estimating heterozygosity.

Following Felsenstein's substitution model (1981) the genotype's prior probability is defined as:

\begin{equation*} \mathbb{P}(g_i = kl | \theta, \boldsymbol{\pi}) = \begin{cases} \pi_k (e^{-\theta} + \pi_k (1-e^{-\theta})) &\mbox{if } k=l,\\ \pi_k \pi_l (1-e^{-\theta}) &\mbox{if } k \neq l, \end{cases}, \end{equation*}

where \(e^{-\theta}\) is the probability that there was no mutation on the phylogenetic tree connecting the two alleles l and k and \(g_i\) is the genotype at that site and \(g_i\in\{AA,AC,...,TT\}\) is one of all possible 10 diploid genotypes.

The posterior probability of the genotype at one site is thus defined as:

\begin{equation*} \mathbb{P}(g_i = kl | \boldsymbol{d_i}, \theta, \boldsymbol{\pi}) = \frac{\mathbb{P}(\boldsymbol{d_i} | g_i) \mathbb{P}(g_i | \theta, \boldsymbol{\pi})}{\sum_g \left[ \mathbb{P}(\boldsymbol{d_i} | g_i) \mathbb{P}(\theta, \boldsymbol{\pi}) \right]}. \end{equation*}

The posterior is calculated for all possible genotypes at each site and the genotype with the highest posterior is called.