# ATLAS / majorMinor

## Overview

This task infers the major and minor alleles from a population sample and writes the genotype likelihoods of the corresponding genotypes to a vcf file. It requires the sample-specific genotype likelihoods in glf format, which can be created with the ATLAS task glf. The resulting vcf file can be used as input to ANGSD.

The major and minor alleles can be estimated either with the method described in Skotte et al. (2012) or with the MLE method. The MLE method estimates the genotype frequencies simultaneously with the two alleles present at a site and is thus slower.

The variant quality is the likelihood ratio of a model with variants and a model without variants.

## Input

• glf files: one for every sample of the population.

The names of the input glf files can be provided as a comma-separated list or on separate lines in a user-created text file.

## Output

• vcf file: multi-sample, containing the likelihoods of the genotypes consisting of the major and minor allele

## Usage Example

Provide a list of input glf files:

./atlas task=majorMinor glf=Sample1.glf.gz,Sample2.glf.gz,Sample3.glf.gz


Provide input glf names in a file:

./atlas task=majorMinor glf=glf_list.txt
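The list file is a plain text file with one glf file name per line. As a minimal sketch, such a file could be generated as follows; the helper name `write_glf_list` is hypothetical and not part of ATLAS, and the sample names are the ones used in the example above.

```python
from pathlib import Path


def write_glf_list(glf_paths, list_path="glf_list.txt"):
    """Write one glf file name per line and return the path of the list file.

    Hypothetical helper: ATLAS only needs the resulting text file,
    whose name should contain ".txt".
    """
    list_path = Path(list_path)
    list_path.write_text("\n".join(str(p) for p in glf_paths) + "\n")
    return list_path


if __name__ == "__main__":
    write_glf_list(["Sample1.glf.gz", "Sample2.glf.gz", "Sample3.glf.gz"])
```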


## Specific Arguments

• method: use either MLE or Skotte
• glf: a comma-separated list of glf files or a text file (with .txt in the name) containing the names of the glf files on separate lines
• maxF: default = 0.00001
• phredLik: write genotype likelihoods in phred format. This will save space but lead to loss of precision and thus power. default = false
• minSamplesWithData: do not write sites with lower number of samples with data to file. default = 0
• minVariantQual: do not write sites with lower variant quality to file. default = 0
• limitSites: write up to a certain input position

## Methods

Finding MLE allelic combinations

Let $$i$$ index the individuals, and let $$m$$ and $$M$$ be the two alleles of the diploid genotype $$g$$ at one site. The goal is to find the allelic combination $$c \in \{AC,AG,AT,CG,CT,GT\}$$ that maximizes

\begin{equation*} P(\boldsymbol{d}|c) = \prod_i \sum_{g \in \{mm,mM,MM\}} P(\boldsymbol{d}_i | g) P(g|c) \end{equation*}
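The likelihood above can be evaluated in log space, which avoids numerical underflow when the product runs over many individuals. The following is a minimal sketch, not the ATLAS implementation: the data layout (one dictionary per individual, mapping a genotype string such as "AC" to $$P(\boldsymbol{d}_i | g)$$) and the function name are assumptions made for illustration.

```python
import math


def site_log_likelihood(liks, c, weights):
    """Return log P(d|c) = sum_i log sum_g P(d_i|g) P(g|c).

    liks    : list of dicts, one per individual, mapping a genotype
              string such as "AC" to P(d_i | g) (assumed layout)
    c       : two-letter allelic combination, e.g. "AC"
    weights : (P(mm|c), P(mM|c), P(MM|c)), expected to sum to 1
    """
    m, M = c[0], c[1]
    genotypes = (m + m, m + M, M + M)
    total = 0.0
    for lik in liks:
        # Inner sum over the three genotypes that c allows.
        total += math.log(sum(lik[g] * w for g, w in zip(genotypes, weights)))
    return total
```

The same function serves both methods described below: only the way the weights $$P(g|c)$$ are obtained differs.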

MLE method

The genotype weights $$P(g|c)$$ are estimated with an EM algorithm, i.e. one EM is run per allelic combination. The allelic combination $$c$$ with the highest likelihood, calculated with the equation above, is chosen.
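A standard EM for the three genotype weights of one allelic combination can be sketched as follows. This is an illustration under assumptions, not the ATLAS code: the uniform starting point, the iteration limit, the convergence tolerance, and the input layout (one dictionary of genotype likelihoods per individual) are all choices made here.

```python
def em_genotype_weights(liks, c, n_iter=100, tol=1e-9):
    """Estimate (P(mm|c), P(mM|c), P(MM|c)) for allelic combination c.

    liks : list of dicts, one per individual, mapping a genotype string
           such as "AC" to P(d_i | g) (assumed layout). Every individual
           needs at least one non-zero likelihood among the three genotypes.
    """
    m, M = c[0], c[1]
    genotypes = (m + m, m + M, M + M)
    w = [1.0 / 3.0] * 3  # assumed starting point
    for _ in range(n_iter):
        counts = [0.0, 0.0, 0.0]
        for lik in liks:
            # E-step: posterior genotype probabilities of this individual.
            post = [lik[g] * wk for g, wk in zip(genotypes, w)]
            norm = sum(post)
            for k in range(3):
                counts[k] += post[k] / norm
        # M-step: new weights are the average posteriors.
        new_w = [x / len(liks) for x in counts]
        converged = max(abs(a - b) for a, b in zip(new_w, w)) < tol
        w = new_w
        if converged:
            break
    return w
```

Running this once for each of the six allelic combinations and comparing the resulting likelihoods gives the combination selected by the MLE method.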

Skotte method

Assuming an allele frequency of 0.5, the genotype weights are fixed at 0.25 for $$g=mm$$ and $$g=MM$$ and at 0.5 for $$g=mM$$. Using these weights, the allelic combination $$c$$ with the highest likelihood, calculated with the equation above, is chosen. For the chosen $$c$$, the genotype frequencies are then estimated with the same EM used in the MLE method.

This last step (the EM) is a modification of Skotte et al. (2012). In the original method, the genotype frequencies for the best $$c$$ are found by determining the MLE genotype of each individual and then simply counting.
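The selection step with fixed weights can be sketched as a loop over the six allelic combinations. The function name and the input layout (one dictionary per individual holding the likelihoods of all ten diploid genotypes, keyed by alphabetically sorted strings such as "AG") are assumptions for illustration only.

```python
import math
from itertools import combinations

# Fixed genotype weights (mm, mM, MM) for an allele frequency of 0.5.
FIXED_WEIGHTS = (0.25, 0.5, 0.25)


def best_combination_fixed_weights(liks):
    """Return (c, log-likelihood) of the best allelic combination.

    liks : list of dicts, one per individual, mapping each of the ten
           diploid genotypes ("AA", "AC", ..., "TT") to P(d_i | g)
           (assumed layout).
    """
    best_c, best_ll = None, -math.inf
    for m, M in combinations("ACGT", 2):  # AC, AG, AT, CG, CT, GT
        genotypes = (m + m, m + M, M + M)
        ll = sum(
            math.log(sum(lik[g] * w for g, w in zip(genotypes, FIXED_WEIGHTS)))
            for lik in liks
        )
        if ll > best_ll:
            best_c, best_ll = m + M, ll
    return best_c, best_ll
```

Because the weights are fixed, no EM is needed during the search; the EM is run only once, for the selected combination.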

Finding the major and minor alleles

The major allele of a site is then defined as $$m$$ if the genotype frequency of $$mm$$ is higher than the frequency of $$MM$$ and vice versa.
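As a small sketch of this rule, the function below assigns the two alleles from the estimated genotype frequencies. The function name is hypothetical, and the handling of ties (equal frequencies of $$mm$$ and $$MM$$) is an assumption, since the rule above does not specify it.

```python
def major_minor(c, weights):
    """Return (major, minor) for allelic combination c.

    c       : two-letter allelic combination, e.g. "AC" (m = c[0], M = c[1])
    weights : estimated genotype frequencies (f_mm, f_mM, f_MM)
    """
    m, M = c[0], c[1]
    f_mm, _, f_MM = weights
    # Assumption: on a tie, M is reported as the major allele.
    return (m, M) if f_mm > f_MM else (M, m)
```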

Finding variant quality

Let $$M$$ correspond to the major allele, $$m$$ to the minor allele, and $$f$$ to the frequency of $$m$$. The variant quality is defined as: phred(likelihood of model where $$f=0$$) - phred(likelihood of model where $$f\geq0$$)

Likelihood of the model where the frequency of $$m$$ is equal to 0:

\begin{equation*} P(\boldsymbol{d}|f=0) = \prod_i P(\boldsymbol{d}_i | g=MM) \end{equation*}

Likelihood of the model where the frequency of $$m$$ is equal to or larger than 0:

\begin{equation*} P(\boldsymbol{d}|f \geq 0,c) = \prod_i \sum_{g \in \{mm,mM,MM\}} P(\boldsymbol{d}_i | g) P(g|c) \end{equation*}
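The two likelihoods can be combined into the variant quality as sketched below. The sketch assumes the usual phred scaling, phred(x) = -10 log10(x), so that the quality equals ten times the log10 likelihood ratio of the two models; whether ATLAS rounds or caps the value is not stated here. The function name and input layout are hypothetical.

```python
import math


def variant_quality(liks, major, minor, weights):
    """Return phred(P(d|f=0)) - phred(P(d|f>=0, c)).

    liks    : list of dicts, one per individual, mapping a genotype
              string such as "AC" to P(d_i | g) (assumed layout)
    weights : (P(mm|c), P(mM|c), P(MM|c)) with m = minor, M = major
    """
    hom_major = major + major
    hom_minor = minor + minor
    het = "".join(sorted(major + minor))  # keys are assumed sorted, e.g. "AC"
    w_mm, w_mM, w_MM = weights
    # log10 likelihood of the model without variants (everyone is MM).
    log10_null = sum(math.log10(lik[hom_major]) for lik in liks)
    # log10 likelihood of the model with variants.
    log10_alt = sum(
        math.log10(w_mm * lik[hom_minor] + w_mM * lik[het] + w_MM * lik[hom_major])
        for lik in liks
    )
    # phred(x) = -10 * log10(x), assumed scaling.
    return (-10.0 * log10_null) - (-10.0 * log10_alt)
```

With weights of (0, 0, 1) the two models coincide and the quality is 0; the stronger the support for genotypes carrying $$m$$, the larger the value.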
