# ATLAS / Sequence Data Processing Tools: recal

## Overview

Monomorphic sites

Recal is a method for base quality score recalibration, as BQSR. The difference between the two is that recal does not rely on a reference sequence. Instead, it relies on genomic regions for which the individual shows no polymorphism. Our current implementation is made to use the sex chromosomes in heterogametic individuals (e.g. the X chromosomes in most mammalian males), or for diploid regions that are known to be monomorphic, such as positions that are highly conserved among species or positions retained after filtering out those with high minor allele counts. In order to be consistent between males and females we use highly conserved sites defined by high GERP scores. GERP scores for your species in binary BED format can be downloaded from ensembl. Recal uses haploid/invariant sites as a training set to infer the recalibration parameters that are then applied across all sequencing data.

(In the past we have used the ultra-conserved sites defined by UCNE base for recal or the conserved regions defined by CEGA (bed files for CEGA can be downloaded here but we have found that these regions are overall less conserved than when we use individual sites defined by high GERP scores)

Pooling data from different read groups

Unless read groups are merged (see the parameter poolReadGroups), recal treats each readgroup independently, and outputs separate recalibration parameters. Since the sequencing machine accuracy depends on the lane and run, the recalibration parameters should be estimated separately for each lane-run combination. We recommend to pool readgroups sequenced in the same lane and run to maximize sequencing depth. In general, we see that the higher the depth the better the recalibration accuracy.

Data requirement

The minimum amount of data required for recal to work is ca. 1Mbp of haploid/invariant sites that are covered by at least 0.5X. Lack of sequencing depth can be made up for with more invariant sites and vice-versa, with the limitation that all contexts need to be represented.

Memory requirement

The memory requirement for this task is high: For a mean sequencing depth of about 2X on the X-chromosome, around 55 GB are required. If you have memory problems, you can try reducing the window size, so that less new data is read at a time. Since each window needs to be read to extract the invariant sites, but is then forgotten, this can help. However, all invariant sites need to be in memory at the same time to estimate the recal parameters. As a last resort, you can limit the amount of input variant sites by using the argument limitWindows.

In order to save memory and speed up the task we restrict our analyses to sites covered at least twice (minDepth=2), since sites covered only once contain no information about which is the correct base.

If the allele's at the monomorphic sites are known, for example if you are using sites that are conserved across many individuals or species, you can provide these alleles to recal with the parameter alleles. The benefit is that you can run recal separately and in parallel for the different read groups in one sample, which reduces the memory requirement per run.

Other

We recommend assuming the base frequencies to all be fixed to 0.25 instead of estimating them (default behavior, use estimateBaseFrequencies to turn off). The reason is that the base frequencies in your invariant sites set might be biased towards A/T or G/C. This may cause the algorithm to prefer solutions where the less common context is always considered to be a sequencing error, leading to a dichotomy in the context parameter estimates (e.g. all contexts ending with A or T will have very low values while all contexts ending with a C or G will have very high values).

## Input

• A BAM file

• post-mortem damage pattern (optional, for ancient DNA)

• txt file created by user (optional): e.g. mergeTheseRGs.txt. Provide the names of read groups should be merged for recalibration. This is the case if they were sequenced on the same sequencing lane and in the same sequencing run, or if they have a sequencing depth of less than 2x and thus not enough data for ATLAS to estimate the recalibration parameters separately for them. All read groups that should be merged should be listed on the same line, separated by any white space. The read groups that are not mentioned in this file will be recalibrated separately.

Example:

• A bed file with invariant site coordinates created by user (optional). Instead of running recal on haploid regions of the genome, you can run on with knowns invariant sites. You can pass a list of these sites with a 0-based BED file (following format):

 chromosome start_position end_position

The end_position is not taken into account. Additional columns are ignored. This bed file can either be passed to ATLAS with the parameter regions, which is more efficient if there are many individual sites uniformly distributed along the genome, or with the parameter window, which is more efficient if the sites are located in clusters.

• A alleles.bed file with known alleles created by user (optional). If the allele's at the monomorphic sites are known, for example if you are using sites that are conserved across many individuals or species, you can provide these alleles to recal with the parameter "alleles". This allows parallelizing by read group, since the underlying genotype no longer needs to be inferred, and should increase statistical power to estimate the sequencing error. You can provide the known alleles with a 0-based BED file in the following format.

 chromosome start_position end_position allele_1 allele_2

The end_position is not taken into account. Note that allele_1 must equal allele_2, otherwise the site is not monomorphic.

## Output

• A table with suffix "_recalibrationEM.txt" specifying the recalibration parameters for all covariates and readgroups

## Usage Example

For a human male using the haploid X-chromosome as input data:

./atlas task=recal bam=example.bam pmdFile=example_PMD_input_Empiric.txt chr=X minDepth=2 verbose


If you do not have a male or not enough positions on the X chromosome, you can run recal based on conserved sites that show no polymorphism (see Overview)

./atlas task=recal bam=example.bam pmdFile=example_PMD_input_Empiric.txt window=invariant_regions.bed verbose #if your bed file contains regions that are smaller than 1000 replace window by regions


If you have a file of intermediate estimates from a recalibration loop of an unfinished recal run, you can continue the refinement of the parameter estimates by passing this intermediate file as an argument, and by adding the argument "estimateRecal":

./atlas task=recal bam=example.bam pmdFile=example_PMD_input_Empiric.txt chr=X minDepth=2 recal=example_recalibrationEM_Loop5.txt estimateRecal verbose


## Specific Arguments

• recal : Use this to pass a file of intermediate recalibration parameter estimates produced by this tool
• estimateRecal : Add this parameter to the command line if you provide a recal file containing intermediate estimates, which are not the values obtained from a completed run of recal but are values you want to use as starting values for the continuation of recal.
• maxEps : An EM algorithm is considered to have converged when the likelihood of the data given the parameters does not change anymore. The difference between the likelihood of the parameter estimates of the previous iteration and the likelihood of the current iteration is called Epsilon. You can specify the epsilon, which you consider to be close enough to zero with this argument. The analogous argument in BQSR is maxF. Default = 0.000001
• iterations : Maximum number of EM-iterations to be performed. After this number is reached the current parameter estimates are written to file nevertheless. Default=100
• NRiterations : Maximum number of Newton-Raphson iterations per EM-iteration. Default=20
• maxF : Maximum value for F in Newton-Raphson. Default=0.0001
• regions or window: Input a BED file containing invariant site positions (see Input for format). regions is more efficient if there are many individual sites uniformly distributed along the genome, and window is more efficient if the sites are located in clusters.
• alleles: Input a BED file containing invariant site positions and the true alleles (see Input for format). If alleles is given, there is no need to limit the input data with regions or window.
• writeTmpTables : Output files with intermediate results of EM and Newtop-Raphson.
• estimateBaseFrequencies: Estimate base frequencies for each window instead of setting them to 0.25.
• model: Define the recal model to be used (see below for definition of models). Accepted models are: qualFuncPosFuncContext, qualFuncPosFunc, qualFuncPosSpecificContext and qualFuncPosSpecific

## Engine Parameters

Engine parameters that are common to all tasks can be found here.

## Method

Model = qualFuncPosFuncContext

Consider that at each site $$i$$ there are $$n_i$$ reads. We denote by $$d_{ij}$$, $$j=1,\ldots, n_i$$ the base of read $$j$$ covering site $$i$$. A sequencing error occurs with probability $$\epsilon_{ij}$$. These probabilities are given by

\begin{equation*} \epsilon_{ij} = \epsilon(\boldsymbol{q_{ij}}, \boldsymbol{\beta}), \end{equation*}

where $$\boldsymbol{q_{ij}}=(e_{ij}, p_{ij}, c_{ij})$$ is a given external vector of information and $$\boldsymbol{\beta} = (\beta_e, \beta_{e^2}, \beta_p, \beta_{p^2}, \beta_c)$$ are the parameters of the model that have to be estimated. While our approach is flexible regarding the choice of included covariates $$\boldsymbol{q_{ij}}$$, we for now have implemented the machine-given error estimation $$e$$, the position within the read $$p$$, the squares of these to account for a non-linear relationships, and all two-base contexts $$c$$ consisting of the bases of the read at positions $$i-1$$ and $$i$$.

In order for the error rates to be confined to [0,1], we impose the logit model:

\begin{equation*} \epsilon_{ij}(\boldsymbol{q_{ij}}, \boldsymbol{\beta}) = \frac{\exp(\eta_{ij}(\boldsymbol{\beta}))}{1+\exp(\eta_{ij}(\boldsymbol{\beta}))} \end{equation*}

where

\begin{equation*} \eta_{ik}(\boldsymbol{\beta}) = \beta_c + \beta_e \ln(\frac{e}{1-e}) + \beta_{e^2} e + \beta_p p + \beta_{p^2} p^2. \end{equation*}

We transformed $$e$$ in the second term with the inverse of the logit function so that when the machine-given error rate equals the true error rate, $$\beta_e=1$$ and all other $$\boldsymbol{\beta} = 0$$.

The likelihood for one site is

\begin{equation*} \mathbb{P}(\boldsymbol{d_i} | \boldsymbol{\beta}) = \displaystyle \sum_g \prod\limits_{j=1}^{n_i} \mathbb{P}(d_{ij}|g=g_i) \mathbb{P}(g|\boldsymbol{\pi}), \end{equation*}

where the hidden (haploid or homozygous) genotype at site $$i$$ is $$g_i$$ and $$g_i\in\{AA,CC,GG,TT\}$$ is one of the possible diploid homozygous genotypes. The probabilities $$\mathbb{P}(d_i|g = g_i)$$ are given by our genotyping model detailed here, which take into account the recalibrated error rates and the post-mortem damage patterns (if the necessary parameter values are provided). The base frequencies $$\pi$$ are derived from counting. We find the maximum likelihood solution for the full model using the EM algorithm (with the Newton-Raphson algorithm in the M-step), described in detail in: Kousathanas, A. et al. (2017). Inferring Heterozygosity from Ancient and Low Coverage Genomes. Genetics, 205(1), 317–332.

Model = qualFuncPosFunc

This is the same except we do not differentiate between contexts and estimate a single intercept for all.

Model = qualFuncPosSpecificContext

We estimate a separate $$\beta_{p_p}$$ for all $$p$$:

\begin{equation*} \eta_{ik}(\boldsymbol{\beta}) = \beta_c + \beta_e \ln(\frac{e}{1-e}) + \beta_{e^2} e + \beta_{p_p} \end{equation*}

Model = qualFuncPosSpecific

\begin{equation*} \eta_{ik}(\boldsymbol{\beta}) = \beta_e \ln(\frac{e}{1-e}) + \beta_{e^2} e + \beta_{p_p} \end{equation*}

Updated