Wiki

Clone wiki

ATLAS / Variant Discovery Tools: call

Overview

We have different callers which all produce a VCF file. Some are maximum likelihood callers, like the "MLE" caller, and others are bayesian callers, like "Bayesian" and "allelePresence". For the bayesian callers we define the prior based on \(\theta\) and the allele frequencies (see below for more details). The parameters of the priors can either be estimated in a first step and then in a second step assumed to be known when making the genotype call, or they can be fixed to a value.

Input

  • BAM file

  • reference FASTA file

  • optional: recalibration and PMD input file

  • optional: file with known alleles (1-based!) with the following tab-separated format

    chr position ref_allele alt_allele

Output

  • A VCF file

Usage example

Use MLE caller and print all possible info and format tags

./atlas task=call method=MLE bam=example.bam fasta=example.fasta infoFields=DP formatFields=GT,AD,AB,AI,DP,GQ,PL recal=example_recalibrationEM.txt pmdFile=example_PMD_input_Empiric.txt

Use bayesian caller and print all possible info and format tags. Use fixed prior of a fixed theta=0.001 and all base freq = 0.25

./atlas task=call method=Bayesian prior=theta fixedTheta=0.001  bam=example.bam fasta=example.fasta  formatFields=GT,AD,AB,AI,DP,GQ,GP infoFields=DP equalBaseFreq recal=example_recalibrationEM.txt pmdFile=example_PMD_input_Empiric.txt

Task specific parameters

method:

  • MLE
  • Bayesian
  • allelePresence
  • randomBase
  • majorityBase

infoFields:

  • DP: sequencing depth

formatFields:

  • GT: genotype string
  • DP: sequencing depth
  • GQ: genotype quality
  • AD: allelic depths for all alleles in call in order listed
  • AP: Phred-scaled allelic posterior probabilities for the four alleles A, C, G and T
  • GL: normalized genotype likelihoods
  • PL: phred-scaled normalized genotype likelihoods
  • GP: Genotype posterior probabilities (phred-scaled)
  • AB: alleleic imbalance
  • AI: Binomial probability of allelic imbalance if Hz site

prior:

For the callers allelePresence and Bayesian, the default behavior is that \(theta\) and the base frequencies are estimated in the same way as in task estimateTheta.

  • fixedTheta: provide fixed theta value instead of estimating it for every window
  • equalBaseFreq: assume all base frequencies to be 0.25 instead of estimating them for every window
  • defaultTheta: provide a fixed theta value to be used for all windows for which \(theta\) can not be estimated due to lack of data (algorithm does not converge)

other:

  • printAll: print all sites, also invariant ones
  • noAltIfHomoRef: do not print alternative alleles if call is homozygous reference, but still use them to calculate call quality
  • noTriallelic: only allow one alternative allele
  • sampleName : Define a sample name for the header of the vcf file. Default = prefix specified with the out parameter.
  • alleles: Provide file that specifies the sites for which variants should be called and the known alleles at those sites (see input section)

Methods

GQ

MLE: Let gMLE the genotype with the highest likelihood given the sequencing data and g2MLE the genotype with the second highest likelihood. In method=MLE the GQ is calculated as round(phred(likelihood of g2MLE / gMLE)).

Bayesian: Let gPP be the genotype with the highest posterior probability. The GQ is calculated as the round(phred(1.0 - posterior probability of gPP)).

allelePresence: Let aPP be the allele with the highest posterior probability. The GQ is calculated as the round(phred(1.0 - posterior probability of aPP)).

AB

Let ref, alt, alt1, alt2 be the amount of reference alleles, first alternative alleles, second alternative alleles, respectively

only one alt allele present: ref / (ref + alt1)

Otherwise: alt1 / (alt1 + alt2)

method=MLE

Consider that at each site \(i\) there are \(n_i\) reads. We denote by \(d_{ij}\), \(j=1,\ldots, n_i\) the base of read \(j\) covering site \(i\).

The likelihood function for one site is:

\begin{equation*} \mathbb{P}(\boldsymbol{d_i} | g_i) = \displaystyle \prod\limits_{j=1}^{n_i} \mathbb{P}(d_{ij}|g_i). \end{equation*}

This likelihood is calculated for each possible genotype \(g_i\) and the genotype with the highest likelihood is called.

The probabilities \(\mathbb{P}(\boldsymbol{d_i}|g = g_i)\) are given by our genotyping model. They take into account the recalibrated error rates and the post-mortem damage patterns if the necessary parameter values are provided.

method=Bayesian

Let \(l\) and \(k\) be the alleles of the diploid genotype g and \(\pi_l\) and \(\pi_k\) their respective base frequencies within a genomic window. We infer the base frequencies and theta for each genomic window as described in estimating heterozygosity.

Following Felsenstein's substitution model (1981) the genotype's prior probability is defined as:

\begin{equation*} \mathbb{P}(g_i = kl | \theta, \boldsymbol{\pi}) = \begin{cases} \pi_k (e^{-\theta} + \pi_k (1-e^{-\theta})) &\mbox{if } k=l,\\ \pi_k \pi_l (1-e^{-\theta}) &\mbox{if } k \neq l, \end{cases}, \end{equation*}

where \(e^{-\theta}\) is the probability that there was no mutation on the phylogenetic tree connecting the two alleles l and k and \(g_i\) is the genotype at that site and \(g_i\in\{AA,AC,...,TT\}\) is one of all possible 10 diploid genotypes.

The posterior probability of the genotype at one site is thus defined as:

\begin{equation*} \mathbb{P}(g_i = kl | \boldsymbol{d_i}, \theta, \boldsymbol{\pi}) = \frac{\mathbb{P}(\boldsymbol{d_i} | g_i) \mathbb{P}(g_i | \theta, \boldsymbol{\pi})}{\sum_g \left[ \mathbb{P}(\boldsymbol{d_i} | g_i) \mathbb{P}(\theta, \boldsymbol{\pi}) \right]}. \end{equation*}

The posterior is calculated for all possible genotypes at each site and the genotype with the highest posterior is called.

method=allelePresence

allelePresence is a bayesian method for calling haploid genotypes, i.e. determining which allele is the most likely to be present. As a prior it uses Felsenstein's substitution model (1981), which describes the genotype probabilities as a function of the \(\theta\) (heterozygosity) and \(\boldsymbol{\pi}\) (base frequencies). The estimates for θ are either estimates for each genomic window on-the-fly or a fixed value can be provided. If you have a file with known alleles at specific sites, you can provide it and the calls will only be made at those sites and only for the specified alleles. The posterior probability of an allele is the sum of the posterior probabilities of all genotypes that contain the allele in question.

Disclaimer: This caller treats the genomes as haploid, which may create problems with downstream programs. Concretely, if an individual has the true genotype "C/T" at a given position and both alleles are equally represented in the sequencing reads, both alleles will have the same posterior probability and a random generator will decide which one will be called. Therefore, two runs of allelePresence for the same individual will not give identical results, unless the same seed is used. In addition, the posterior probabilities of the alleles do not sum to 1, which may cause problems with downstream programs that combine allelePresence calls by taking calling quality into account.

method=majorityBase

This caller chooses the base with the most occurences out of all sequenced bases covering given site of the genome. If two bases are equally current, one of them is chosen at random. This caller produces haploid calls, which can subsequently be turned into "pseudo-diploid" calls by duplicating the random base. This type of calls is usually used to reduce reference bias. ATLAS is not affected by reference bias and we suggest to use our more sophisticated allelePresence caller if the sequencing data is too scarce for diploid calls.

method=randomBase

This caller chooses one base at random out of all sequenced bases covering given site of the genome. This caller produces haploid calls, which can subsequently be turned into "pseudo-diploid" calls by duplicating the random base. This type of calls is usually used to reduce reference bias. ATLAS is not affected by reference bias and we suggest to use our more sophisticated allelePresence caller if the sequencing data is too scarce for diploid calls.

method=gVCF

This caller is identical to the MLE caller expect that the format of the output using gVCF corresponds to the gVCF format as defined by GATK. Output this format if you later want to use GATK's genotypeGVCFs to perform joint genotyping.

Updated