Wiki

Overview

This task infers the stationary base frequencies \(\boldsymbol{\pi} = \{\pi_A, \pi_C, \pi_G, \pi_T\}\), along with the rate of substitutions \(\theta=2T\mu\) along the genealogy connecting the two alleles of an individual within a genomic window (default=non overlapping windows of 1Mbp). Here, T corresponds to the time to the most recent common ancestor of the two lineages and \(\mu\) to the mutation rate per base pair per generation. It is not possible to infer \(T\) and \(\mu\) independently, and we therefore only estimate the compound substitution rate \(\theta\) from the data. To estimate \(\theta\), we use Felsenstein’s 1981 model of substitutions:

\begin{equation*} \mathbb{P}(g_i = kl | \theta, \boldsymbol{\pi}) = \begin{cases} \pi_k (e^{-\theta} + \pi_k (1-e^{-\theta})) &\mbox{if } k=l,\\ \pi_k \pi_l (1-e^{-\theta}) &\mbox{if } k \neq l, \end{cases} \end{equation*}

where \(k\) and \(l\) are the two alleles of a diploid individual and e^{-theta} is the probability of no mutation having happened in the time to their coalescence. See the methods section below for details on how we extend Felsensteins model to account for the uncertainty in the local genotypes. Base-specific rates of sequencing errors and PMD are taken into account as known constants.

Please note that our method results in slightly different estimates for \(\theta\) than the \(\theta\) statistics calculated based on the infinite sites model (implemented e.g. in ANGSD), as Felsenstein's model allows for back mutations. You can transform our \(\theta\) estimate into e.g. Watterson's \(\theta_W\) statistic with

\begin{equation*} \theta_W = 1-\sum_k{\pi_k}(e^{-\theta} + {\pi_k}(1-e^{-\theta})) \end{equation*}

For samples with very low sequencing depth, we implemented a way to consider all sites with data to produce a single genome-wide estimate for \(\theta\). Pass the argument thetaGenomeWide to use this functionality. We generally use this in combination with minDepth=2, i.e. only use sites that contain information about \(\theta\) to speed up the process and limit the resource requirements. In order to provide some information about the variability of the genome-wide estimate we have implemented bootstrapping. First we sample from a binomial distribution how many sites there are with data and then we randomly select as many sites as had data originally to provide bootstrapped genome-wide estimates.

Input

A BAM file
optional: Base quality score recalibration files (produced by recal)
optional: PMD file
optional: a 0-based BED file with positions to mask, use parameter mask
optional: a 0-based BED file with the coordinates of user-defined windows for which theta should be calculated
optional: a 0-based BED file with positions based which one single \(\theta\) will be estimated. If the number of sites to include is high, this task is memory-intensive and you should use the user-defined windows option instead.

Example:

chromosome start_position end_position

The end_position is not taken into account. Additional columns are ignored.

Output

A text file with following information:

chromosome
window start and end
coverage in window
proportion of missing data
proportion of sites covered at least twice
the nucleotide frequencies
the maximum likelihood estimate for \(\theta\)
the Fisher confidence intervals for \(\theta\)
the likelihood of the MLE \(\theta\) estimate

Some information about the Fisher CI: The confidence intervals are calculated by using the second derivatives of the log-likelihood function around the MLE to fit a quadratic function to the peak. The CI are then the theta values found when descending by 2-log-likelihood units on either side of the MLE.

Usage Example

./atlas task=theta bam=example.bam recal=example_recalibrationEM.txt

or

./atlas task=theta bam=example.bam pmdFile=example_PMD_input_Empiric.txt recal=example_recalibrationEM.txt

Specific Arguments

regions: specify 0-based BED file with custom regions for which one single \(\theta\) should be estimated
thetaGenomeWide: only produce one single θ estimate for all sites
bootstraps: specify the number of bootstrap replicates to generate (only considered when inferring theta genome-wide). Bootstraps are generated by sampling sites from the genome with replacement.
mask: specify 0-based BED file with positions that should not be considered when estimating the \(\theta\) for each genomic window
window: specify a 0-based BED file with the coordinates of user-defined windows for which theta should be calculated
minSitesWithData: change the minimum amount of sites with data required to run theta estimation
extraVerbose: print current \(\theta\) estimates for each iteration of EM-algorithm.
iterations: amount of full updating iterations. Each full iteration contains one iteration where theta and the base frequencies are updated, and several iterations where only theta is updated. Default = 100
numThetaOnlyUpdates: It's faster to update only theta and not the base frequencies. In addition, it is easy to initialize the base frequencies to values close to the truth, while theta is harder to estimate. Each full iteration therefore contains numThetaOnlyUpdates iterations where we only update theta. Default = 10
NRiterations: Specify the maximum number of iterations of the Newton-Raphson algorithm. The Newton-Raphson algorithm is used in the Maximization step of the EM algorithm. Default = 10

Engine Parameters

Engine parameters that are common to all tasks can be found here.

Plot theta figure

This script takes two arguments: the first defines the chromosomes you want to plot. Use e.g. 1:22 for all autosomes. The second argument is the prefix of the theta estimate output files of the individuals that you want to plot.

args <- commandArgs(TRUE)
tot=length(args)
legendvec=vector(length=tot-1)
chr=eval(parse(text=args[1]))

pdf(paste(args[2], "_theta_plot.pdf", sep=""),height=50,width=15)
par(mfrow=c(22,1))
for(c in chr){
    colors=c("#000000",rainbow(tot-2))
    plot('', type="n", xlim=c(0, 2.5e+8), ylim=c(1e-5,0.025),log='y',xlab="Chromosome position",main=paste("Chr=",c,sep=''), yaxt='n') #ylab=expression("Estimated "*theta)
    labelsY1=parse(text=paste(c(1,1,1,1),"%*%","10^",c(-5,-4,-3,-2), sep=""))
    axis(2, at=c(10^-5, 10^-4, 10^-3, 10^-2), labels=labelsY1, las=2)

    for (i in (2:tot)){
            a0<-read.delim(paste(args[i], "_theta_estimates.txt", sep=''))
            a0<-na.omit(a0)
            a0.1<-subset(a0,Chr==c|Chr==paste("chr",c,sep=''))
            lines(a0.1$start,a0.1$theta_MLE,type='l',col=colors[i-1])
            abline(h=median(subset(a0.1,theta_MLE>0)$theta_MLE),lty=2,col=colors[i-1])
            base=basename(args[i])
            legendvec[i-1]=strsplit(base,split="[_.]+")[[1]][1]
    }

    legend("bottomleft",legend=legendvec,col=colors,lty=rep(1,tot-1),lwd=rep(1.5,tot-1),horiz=T)
}
dev.off()

Method

Consider that at each site \(i\) there are \(n_i\) reads. We denote by \(d_{ij}\), \(j=1,\ldots, n_i\) the base of read \(j\) covering site \(i\).

The likelihood function for estimating \(\theta\) in a genomic window of length \(I\) is the following:

\begin{equation*} \mathbb{P}(\boldsymbol{d}|\theta, \boldsymbol{\pi}) = \prod_{i=1}^I \sum_g \prod\limits_{j=1}^{n_i} \mathbb{P}(d_{ij}|g_i=g)\mathbb{P}(g_i=g|\theta,\boldsymbol{\pi}), \end{equation*}

where \(g_i\) is the genotype at that site and \(g_i\in\{AA,AC,...,TT\}\) is one of all possible 10 diploid genotypes.

\(\mathbb{P}(g_i|\theta,\boldsymbol{\pi})\) is given by Felsenstein's substitution model 1981 (see Overview section) and the probabilities \(\mathbb{P}(d_i|g_i=g)\) are given by our Genotyping Model, which takes into account the recalibrated error rates and the post-mortem damage patterns (if the necessary parameter values are provided).

\(\theta\) and \(\boldsymbol{\pi}\) are inferred using a standard EM-algorithm, described in detail in Kousathanas, A. et al. (2017). Inferring Heterozygosity from Ancient and Low Coverage Genomes. Genetics, 205(1), 317–332 (page 3: Inference using Expectation-Maximization).