# ATLAS / Population Genetic Parameters: theta

## Overview

This task infers the stationary base frequencies $$\boldsymbol{\pi} = \{\pi_A, \pi_C, \pi_G, \pi_T\}$$, along with the rate of substitutions $$\theta=2T\mu$$ along the genealogy connecting the two alleles of an individual within a genomic window (default: non-overlapping windows of 1 Mbp). Here, $$T$$ corresponds to the time to the most recent common ancestor of the two lineages and $$\mu$$ to the mutation rate per base pair per generation. It is not possible to infer $$T$$ and $$\mu$$ independently, and we therefore only estimate the compound substitution rate $$\theta$$ from the data. To estimate $$\theta$$, we use Felsenstein's 1981 model of substitutions:

\begin{equation*} \mathbb{P}(g_i = kl | \theta, \boldsymbol{\pi}) = \begin{cases} \pi_k (e^{-\theta} + \pi_k (1-e^{-\theta})) &\mbox{if } k=l,\\ \pi_k \pi_l (1-e^{-\theta}) &\mbox{if } k \neq l, \end{cases} \end{equation*}

where $$k$$ and $$l$$ are the two alleles of a diploid individual and $$e^{-\theta}$$ is the probability of no mutation having happened in the time to their coalescence. See the Method section below for details on how we extend Felsenstein's model to account for the uncertainty in the local genotypes. Base-specific rates of sequencing errors and post-mortem damage (PMD) are taken into account as known constants.
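The two cases of the model can be evaluated directly. The following is a minimal sketch, not ATLAS code: the function name `genotype_probs` and the example values for $$\theta$$ and $$\boldsymbol{\pi}$$ are illustrative assumptions. It treats $$kl$$ as an ordered allele pair, under which the probabilities of all 16 pairs sum to one; an unordered heterozygous genotype then combines the two orderings $$kl$$ and $$lk$$.

```python
import math
from itertools import product

BASES = "ACGT"

def genotype_probs(theta, pi):
    """Return P(g = kl | theta, pi) for all 16 ordered allele pairs (k, l)."""
    no_mut = math.exp(-theta)  # probability of no mutation since coalescence
    probs = {}
    for k, l in product(BASES, repeat=2):
        if k == l:
            probs[k + l] = pi[k] * (no_mut + pi[k] * (1.0 - no_mut))
        else:
            probs[k + l] = pi[k] * pi[l] * (1.0 - no_mut)
    return probs

# Illustrative values only (assumptions, not defaults of the tool)
pi = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
probs = genotype_probs(0.001, pi)

total = sum(probs.values())
# The unordered heterozygote A/C combines the ordered pairs AC and CA
p_het_AC = probs["AC"] + probs["CA"]

print(f"sum over ordered pairs = {total:.12f}")
print(f"P(unordered A/C)       = {p_het_AC:.3e}")
```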

Please note that our method results in slightly different estimates for $$\theta$$ than the $$\theta$$ statistics calculated based on the infinite sites model (implemented e.g. in ANGSD), as Felsenstein's model allows for back mutations. You can transform our $$\theta$$ estimate into e.g. Watterson's $$\theta_W$$ statistic with

\begin{equation*} \theta_W = 1-\sum_k{\pi_k}(e^{-\theta} + {\pi_k}(1-e^{-\theta})) \end{equation*}
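As a small worked example of this transformation, the sketch below evaluates the formula for given $$\theta$$ and $$\boldsymbol{\pi}$$. The function name and the input values are illustrative assumptions. Algebraically the expression equals $$(1-e^{-\theta})(1-\sum_k \pi_k^2)$$, i.e. one minus the probability of a homozygous genotype under the model above, which the last lines use as a cross-check.

```python
import math

def theta_to_theta_w(theta, pi):
    """Apply theta_W = 1 - sum_k pi_k * (exp(-theta) + pi_k * (1 - exp(-theta)))."""
    no_mut = math.exp(-theta)
    p_homozygous = sum(p * (no_mut + p * (1.0 - no_mut)) for p in pi)
    return 1.0 - p_homozygous

# Illustrative inputs (assumptions): uniform base frequencies, theta = 0.001
pi_uniform = [0.25, 0.25, 0.25, 0.25]
theta = 0.001
theta_w = theta_to_theta_w(theta, pi_uniform)

# Cross-check against the equivalent closed form (1 - e^-theta) * (1 - sum pi_k^2)
closed_form = (1.0 - math.exp(-theta)) * (1.0 - sum(p * p for p in pi_uniform))
print(theta_w, closed_form)
```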

For samples with very low sequencing depth, we implemented a way to consider all sites with data to produce a single genome-wide estimate for $$\theta$$. Pass the argument thetaGenomeWide to use this functionality. We generally use this in combination with minDepth=2, i.e. we only use sites that contain information about $$\theta$$, which speeds up the process and limits the resource requirements. To provide some information about the variability of the genome-wide estimate, we have implemented bootstrapping: first we sample from a binomial distribution how many sites there are with data, and then we randomly select as many sites as had data originally to provide bootstrapped genome-wide estimates.

## Input

• A BAM file

• optional: Base quality score recalibration files (produced by recal)

• optional: PMD file

• optional: a 0-based BED file with positions to mask, use parameter mask

• optional: a 0-based BED file with the coordinates of user-defined windows for which theta should be calculated

• optional: a 0-based BED file with positions based on which one single $$\theta$$ will be estimated. If the number of sites to include is high, this task is memory-intensive and you should use the user-defined windows option instead.

Example:

 chromosome start_position end_position

The end_position is not taken into account. Additional columns are ignored.

## Output

A text file with the following information:

• chromosome
• window start and end
• coverage in window
• proportion of missing data
• proportion of sites covered at least twice
• the nucleotide frequencies
• the maximum likelihood estimate for $$\theta$$
• the Fisher confidence intervals for $$\theta$$
• the likelihood of the MLE $$\theta$$ estimate

Some information about the Fisher CI: the confidence intervals are calculated by using the second derivatives of the log-likelihood function around the MLE to fit a quadratic function to the peak. The CI bounds are then the $$\theta$$ values reached when descending by 2 log-likelihood units on either side of the MLE.
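To make the construction concrete, here is a minimal sketch under stated assumptions: the descent is taken along the fitted quadratic, only the second derivative with respect to $$\theta$$ is used, and the function name `fisher_ci` and the numeric inputs are hypothetical. With the quadratic $$\ln L(\theta) \approx \ln L(\hat\theta) + \tfrac{1}{2} \ell'' (\theta-\hat\theta)^2$$ and $$\ell'' < 0$$, a drop of 2 units is reached at a distance of $$\sqrt{4/(-\ell'')}$$ from the MLE, which is two standard errors.

```python
import math

def fisher_ci(theta_mle, second_derivative, drop=2.0):
    """Confidence bounds from a quadratic fit to the log-likelihood peak.

    The quadratic lnL(mle) + 0.5 * d2 * (theta - mle)^2 (with d2 < 0) falls
    by `drop` log-likelihood units at a distance sqrt(2 * drop / -d2).
    """
    if second_derivative >= 0:
        raise ValueError("second derivative must be negative at a maximum")
    half_width = math.sqrt(2.0 * drop / -second_derivative)
    return theta_mle - half_width, theta_mle + half_width

# Hypothetical numbers for illustration only
lower, upper = fisher_ci(theta_mle=1.0e-3, second_derivative=-4.0e8)
print(lower, upper)
```

A flatter peak (second derivative closer to zero) gives a wider interval, as expected for windows with little data.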

## Usage Example

./atlas task=theta bam=example.bam recal=example_recalibrationEM.txt


or

./atlas task=theta bam=example.bam pmdFile=example_PMD_input_Empiric.txt recal=example_recalibrationEM.txt


## Specific Arguments

• regions: specify 0-based BED file with custom regions for which one single $$\theta$$ should be estimated
• thetaGenomeWide: only produce one single $$\theta$$ estimate for all sites
• bootstraps: specify the number of bootstrap replicates to generate (only considered when inferring theta genome-wide). Bootstraps are generated by sampling sites from the genome with replacement.
• mask: specify 0-based BED file with positions that should not be considered when estimating the $$\theta$$ for each genomic window
• window: specify a 0-based BED file with the coordinates of user-defined windows for which theta should be calculated
• minSitesWithData: change the minimum number of sites with data required to run theta estimation
• extraVerbose: print the current $$\theta$$ estimates for each iteration of the EM algorithm.
• iterations: number of full updating iterations. Each full iteration contains one iteration where theta and the base frequencies are updated, and several iterations where only theta is updated. Default = 100
• numThetaOnlyUpdates: It's faster to update only theta and not the base frequencies. In addition, it is easy to initialize the base frequencies to values close to the truth, while theta is harder to estimate. Each full iteration therefore contains numThetaOnlyUpdates iterations where we only update theta. Default = 10
• NRiterations: Specify the maximum number of iterations of the Newton-Raphson algorithm. The Newton-Raphson algorithm is used in the Maximization step of the EM algorithm. Default = 10

## Engine Parameters

Engine parameters that are common to all tasks can be found here.

## Plot theta figure

This script takes at least two arguments. The first defines the chromosomes you want to plot; use e.g. 1:22 for all autosomes. Each further argument identifies the theta estimate output file of one individual that you want to plot. The plot is written to a PDF named after the second argument, with the suffix _theta_plot.pdf.

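args <- commandArgs(TRUE)
tot <- length(args)
legendvec <- vector(length=tot-1)
# First argument: chromosomes to plot, evaluated as an R expression (e.g. 1:22)
chr <- eval(parse(text=args[1]))

# One panel per chromosome, stacked vertically in a single PDF
pdf(paste(args[2], "_theta_plot.pdf", sep=""), height=50, width=15)
par(mfrow=c(22,1))
for(c in chr){
  # One color per individual: black for the first, rainbow colors for the rest
  colors <- c("#000000", rainbow(tot-2))
  # Empty panel with a log-scaled y axis; the axis labels are drawn by hand below
  plot('', type="n", xlim=c(0, 2.5e+8), ylim=c(1e-5,0.025), log='y', xlab="Chromosome position", main=paste("Chr=",c,sep=''), yaxt='n') #ylab=expression("Estimated "*theta)
  labelsY1 <- parse(text=paste(c(1,1,1,1),"%*%","10^",c(-5,-4,-3,-2), sep=""))
  axis(2, at=c(10^-5, 10^-4, 10^-3, 10^-2), labels=labelsY1, las=2)

  for (i in (2:tot)){
    # ASSUMPTION: the read step is missing from the original listing. Here args[i]
    # is taken to be the path of a theta output file with a header row that
    # contains the columns Chr, start and theta_MLE used below.
    a0 <- read.table(args[i], header=TRUE)
    a0 <- na.omit(a0)
    # Keep the windows of this chromosome, named either "c" or "chrc"
    a0.1 <- subset(a0, Chr==c | Chr==paste("chr",c,sep=''))
    lines(a0.1$start, a0.1$theta_MLE, type='l', col=colors[i-1])
    # Dashed line: median of the positive theta estimates on this chromosome
    abline(h=median(subset(a0.1, theta_MLE>0)$theta_MLE), lty=2, col=colors[i-1])
    # Legend label: file name up to the first "_" or "."
    base <- basename(args[i])
    legendvec[i-1] <- strsplit(base, split="[_.]+")[[1]][1]
  }

  legend("bottomleft", legend=legendvec, col=colors, lty=rep(1,tot-1), lwd=rep(1.5,tot-1), horiz=T)
}
dev.off()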


## Method

Consider that at each site $$i$$ there are $$n_i$$ reads. We denote by $$d_{ij}$$, $$j=1,\ldots, n_i$$ the base of read $$j$$ covering site $$i$$.

The likelihood function for estimating $$\theta$$ in a genomic window of length $$I$$ is the following:

\begin{equation*} \mathbb{P}(\boldsymbol{d}|\theta, \boldsymbol{\pi}) = \prod_{i=1}^I \sum_g \prod\limits_{j=1}^{n_i} \mathbb{P}(d_{ij}|g_i=g)\mathbb{P}(g_i=g|\theta,\boldsymbol{\pi}), \end{equation*}

where $$g_i\in\{AA,AC,\ldots,TT\}$$ is the genotype at site $$i$$, one of the 10 possible diploid genotypes.

$$\mathbb{P}(g_i|\theta,\boldsymbol{\pi})$$ is given by Felsenstein's 1981 substitution model (see the Overview section), and the probabilities $$\mathbb{P}(d_{ij}|g_i=g)$$ are given by our Genotyping Model, which takes into account the recalibrated error rates and the post-mortem damage patterns (if the necessary parameter values are provided).
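The structure of this likelihood can be illustrated with a short sketch. It is not the ATLAS implementation: the emission model `read_prob` is a hypothetical stand-in for the Genotyping Model (each read samples one of the two alleles with probability 1/2 and is miscalled with a constant error rate `err`, with no PMD), and all names and data are made up for illustration. The sum runs over the 16 ordered allele pairs with the Overview probabilities, which is equivalent to summing over the 10 unordered genotypes with the two orderings of each heterozygote combined.

```python
import math
from itertools import product

BASES = "ACGT"

def genotype_prior(k, l, theta, pi):
    """P(g = kl | theta, pi) for an ordered allele pair (Felsenstein 1981)."""
    no_mut = math.exp(-theta)
    if k == l:
        return pi[k] * (no_mut + pi[k] * (1.0 - no_mut))
    return pi[k] * pi[l] * (1.0 - no_mut)

def read_prob(base, k, l, err):
    """HYPOTHETICAL emission model: symmetric error rate, no PMD."""
    def given_allele(allele):
        return 1.0 - err if base == allele else err / 3.0
    return 0.5 * given_allele(k) + 0.5 * given_allele(l)

def window_log_likelihood(sites, theta, pi, err=0.01):
    """log P(d | theta, pi); `sites` holds one string of read bases per site."""
    log_lik = 0.0
    for reads in sites:
        site_lik = 0.0
        for k, l in product(BASES, repeat=2):
            emission = 1.0
            for base in reads:  # product over the reads covering this site
                emission *= read_prob(base, k, l, err)
            site_lik += emission * genotype_prior(k, l, theta, pi)
        log_lik += math.log(site_lik)
    return log_lik

# Toy data (made up): five covered sites and one site without reads
pi = {b: 0.25 for b in BASES}
sites = ["AAAA", "CCC", "ACAC", "GG", "TTTTT", ""]

# Crude grid scan over theta, for illustration only
grid = [10.0 ** e for e in (-4, -3, -2, -1, 0)]
best_theta = max(grid, key=lambda t: window_log_likelihood(sites, t, pi))
print(best_theta)
```

A site without reads contributes a factor of one, since the genotype probabilities sum to one, so it carries no information about $$\theta$$. The grid scan is only a placeholder for a proper maximization.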

$$\theta$$ and $$\boldsymbol{\pi}$$ are inferred using a standard EM-algorithm, described in detail in Kousathanas, A. et al. (2017). Inferring Heterozygosity from Ancient and Low Coverage Genomes. Genetics, 205(1), 317–332 (page 3: Inference using Expectation-Maximization).
