Wiki

Clone wiki

ATLAS / Variant Discovery Tools: callMLE

Overview

callMLE is a maximum likelihood approach to determine the diploid genotype likelihoods while taking post-mortem damage into account and to call the most likely genotype. If you have a file with known alleles at specific sites, you can provide it and the calls will only be made at those sites and only for the specified alleles.

Input

  • BAM file

  • FASTA file

  • optional: sites file with known alleles with the following tab-separated format

    chr position ref_allele alt_allele

Output

The output is a gzipped table either in .txt or .vcf format.

In .txt format, each line represents a position in the genome and the columns provide the following information:

  • Chromosome
  • Position on chromosome
  • Reference base (if no reference file was passed the reference base will always be 'N')
  • Coverage at position
  • Likelihoods of all possible genotypes (in phred format and normalized)
  • The genotype with the maximum likelihood
  • Quality of the call (difference between the normalized phred-scaled likelihoods of the first and the second most likely genotypes)

The format of the likelihood and the call quality columns follows the logic of GATK's haplotype caller. A detailed description can be found here.

The format of the output using vcf corresponds to the standard VCFv4.2 format.

The format of the output using gVCF corresponds to the gVCF format as defined by GATK. Output this format if you later want to use GATK's genotypeGVCFs to perform joint genotyping.

If the parameter sites was used, a conflicts.txt file may be produced. This file contains the positions, where the reference is not equal to either of the provided alleles. No genotype is called at these positions.

Usage Example

The required syntax to launch ATLAS as the MLE caller is:

./atlas task=callMLE bam=example.bam

Example taking post-mortem damage and the recalibration parameters into account and outputting as a vcf:

./atlas task=callMLE bam=example.bam fasta=example.fa pmdFile=example_PMD_input_Empiric.txt BQSRQuality=example_BQSR_ReadGroup_Quality_Table.txt BQSRPosition=example_BQSR_ReadGroup_Position_Table.txt BQSRPositionReverse=example_BQSR_ReadGroup_Position_Reverse_Table.txt BQSRContext=example_BQSR_ReadGroup_Context_Table.txt vcf verbose

or:

./atlas task=callMLE bam=example.bam fasta=example.fa pmdFile=example_PMD_input_Empiric.txt recal=example_recalibrationEM.txt vcf verbose

It is also possible to create the calls in BEAGLE format:

./atlas task=callMLE bam=example.bam fasta=example.fa pmdFile=example_PMD_input_Empiric.txt recal=example_recalibrationEM.txt vcf beagle sites=known_alleles.sites verbose

Specific Arguments

  • fasta : The reference file is necessary for producing vcf or gVCF. It is optional for producing flat files. If it is passed in this case the reference bases will be printed instead of the placeholder 'N'.
  • printAll : Print all sites, also the ones with coverage 0.
  • sites : Specify input sites file that specifies the sites for which variants should be called and the known SNPs at those sites.
  • vcf : Print output in vcf format
  • gVCF : Print output in gVCF format as defined by GATK
  • noAltIfHomoRef : Do not print the most likely alternative allele when genotype is homozygous reference (0/0). This option was implemented because some downstream programs seem to be confused by the alternative alleles when the genotype is 0/0.
  • sampleName : Define a sample name for the header of the vcf file. Default = prefix specified with the out parameter.
  • beagle : Output the genotype likelihoods in BEAGLE format

Engine Parameters

Engine parameters that are common to all tasks can be found here.

Method

Consider that at each site \(i\) there are \(n_i\) reads. We denote by \(d_{ij}\), \(j=1,\ldots, n_i\) the base of read \(j\) covering site \(i\).

The likelihood function for one site is:

\begin{equation*} \mathbb{P}(\boldsymbol{d_i} | g_i) = \displaystyle \prod\limits_{j=1}^{n_i} \mathbb{P}(d_{ij}|g_i). \end{equation*}

This likelihood is calculated for each possible genotype \(g_i\) and the genotype with the highest likelihood is called.

The probabilities \(\mathbb{P}(\boldsymbol{d_i}|g = g_i)\) are given by our genotyping model. They take into account the recalibrated error rates and the post-mortem damage patterns if the necessary parameter values are provided.

Updated