Wiki

Overview

A characteristic feature of ancient DNA is post-mortem damage (PMD). The most common form of PMD is C-deamination, which leads to a C-->T transition on the affected strand and a G-->A transition on the complimentary strand. These deaminations do not occur randomly along the whole read, but instead are observed much more frequently at the beginning of a read. This is due to fragment ends being more often single-stranded and thus subject to a much higher rate of deamination.

The extent of PMD can either be estimated in a position-specific way or it can be modeled with an exponential decay function, in which the rate of PMD decays with the position in the read. The reference is used to count C-->T and G-->A changes but in both models we take into account the probability that these changes might be true variants.

The advantage of modeling PMD with a function compared to the position-specific estimation is that the rate of PMD is generally low far away from the read ends, and the estimates may become noisy for these positions, particularly if data is limited. However, at the ends of the reads where the rate of PMD is high, the position-specific estimation is more accurate.

Input

BAM file
FASTA file
txt file created by user (optional): e.g. mergeTheseRGs.txt. Provide the names of read groups should be merged for PMD estimation. All read groups that should be pooled should be listed on the same line, separated by any white space. The read groups that are not mentioned in this file will be recalibrated separately. Read groups should only be pooled if they come from the same sample and you have strong reasons to believe that they have the same PMD pattern. PMD estimation does not require a lot of data so it is usually advisable not to pool any read groups, i.e. not provide this pooling file.

Example:

readgroup1 readgroup2

readgroup4 readgroup6 readgroup8

Output

example_PMD_input_Empiric.txt:

A list of pmd patterns estimated with the empirical method for each readgroup
example_PMD_input_Exponential.txt:

A list of pmd patterns estimated with the exponential method for each readgroup
example_PMD_Table_counts.txt:

The counts of all possible transitions for each read position (or up to a certain position, see specific command length)
example_PMD_Table.txt:

For all possible transitions the ratio of the transition counts, which are taken from the _counts.txt table, over the total amount of the base that was mutated, for each position and readgroup

Usage Example

Example for the estimation of the PMD pattern using the first 50 bases of each read on chromosome 1:

./atlas task=PMD bam=example.bam fasta=reference.fa length=50

Specific Arguments

length : Specify the amount of positions, counted from the ends of the reads, for which the the amount of C-->T and G-->A transition shall be counted and taken into account in the estimation of pmd. This value must be smaller or equal to the maximum read length. Increase this value if the final numbers on each line in the output file example_PMD_input_Empiric.txt are larger than 0.01. Default = 50
numNRIterations : Specify the number of iterations to be performed when estimating the exponential model. Default=100, does not need to be changed for standard procedure.
eps : This option applies to exponential model. It specifies the error between old and new likelihood that is accepted for convergence in the EM algorithm. Default = 0.001
poolReadGroups : Provide a txt file with read group names that should be pooled for PMD estimation. See Input for more information.

Engine Parameters

Engine parameters that are common to all tasks can be found here.

Plot PMD patterns with R

The empiric and exponential PMD patterns can be plotted for each read group with the R script below called plot_PMD.R. It requires the "stringr" package and the prefix of the output tables as an argument.

Example command for running the script:

Rscript plot_pmd.R example

plot_pmd.R:

require(stringr)

args <- commandArgs(TRUE)
file <- args[1]

#open files
t_emp <- read.table(paste(file,"_PMD_input_Empiric.txt", sep=''))
t_emp <- t_emp[which(t_emp$V2 == 'CT' | t_emp$V2 == 'GA'),]
t_exp <- read.table(paste(file,"_PMD_input_Exponential.txt", sep=''))


#functions
extractDoubles <- function(s) { return(as.numeric(unlist(str_extract_all(s, '-?[0-9.]+e?[-+]?[0-9]*')))) }
pmd <- function(x,a,b,c){ return(a*exp(-x*b)+c) }

num_RG <- length(levels(t_emp$V1))

#fill lists
l_emp_fwd <- list()
l_emp_rev <- list()
counter = 0
readgroup_names = character(num_RG)
for(i in seq(1, 2*num_RG, by=2)){
counter = counter + 1
readgroup_names[counter] <- as.character(t_emp$V1[i])
emp_fwd <- extractDoubles(t_emp$V3[i])
emp_rev <- extractDoubles(t_emp$V3[i+1])
l_emp_fwd[[counter]] <- emp_fwd
l_emp_rev[[counter]] <- emp_rev
}

num_pos <- length(l_emp_fwd[[1]])

#define plotting parameters
n_row <- ceiling(num_RG / 5)
pdf(paste(file, "_pmdPlot.pdf", sep=''), width=20, height=n_row*4)
par(mfrow=c(n_row, 5))
CEX <- 1.5

#plot
for(i in 1:num_RG){
plot(1:num_pos, rep(0, num_pos), type="l", xlim=c(1,num_pos), ylim=c(0, 0.7), lwd=0.1, xlab="", ylab='', yaxs='i', main=readgroup_names[i], cex.axis=CEX)
mtext("Mismatch rate", side=2, line=2.5)
mtext("Position in read", side=1, line=2.5)

#empiric
lines(1:num_pos, l_emp_fwd[[i]], col="red", lty=2, pch=20, type="b")
lines(num_pos:1, l_emp_rev[[i]], col="blue", lty=2, pch=20, type="b")

#exponential
exp_fwd <- extractDoubles(t_exp$V2[i])
exp_rev <- extractDoubles(t_exp$V3[i])
lines(x=1:num_pos, y=pmd(1:num_pos, exp_fwd[1], exp_fwd[2], exp_fwd[3]), col="red", lwd=2)
lines(x=1:num_pos, y=pmd(num_pos:1, exp_rev[1], exp_rev[2], exp_rev[3]), col="blue", lwd=2)
}
legend("topleft", c("C->T", "G->A", "emp", "exp"), col=c("red", "blue", "black", "black"), text.col=c("red",     "blue", "black", "black"), text.width=c(8,8,9,9), cex=CEX, seg.len=1, x.intersp = 0.3, pch=c(20,20,NA,NA),     lty=c(NA,NA,2,1), bty="n", horiz=T)

dev.off()

Methods

The extent of PMD can either be estimated in a position-specific way by calculating for each position:

\begin{equation*} \frac{\frac{C \rightarrow T}{C_{ref}} - \frac{T \rightarrow C}{T_{ref}}}{1 - \frac{T \rightarrow C}{T_{ref}}}, \end{equation*}

or it can be modeled with an exponential decay function, in which the rate of PMD decays with the position in the read:

\begin{equation*} P(\boldsymbol{d}|p) = \mu + (1-\mu) ( a e^{-b p} + c) \end{equation*}

where \(p\) is the position in the read, \(\mu\) is the probability of a true difference between the sequenced individual and the reference and is estimated based on the T to C or A to G mutations. a, b and c are the parameters of a exponential distribution. These parameters are inferred with a Newton-Raphson Algorithm. For more mathematical details see Kousathanas, A. et al. (2017). Inferring Heterozygosity from Ancient and Low Coverage Genomes. Genetics, 205(1), 317–332 (page 13: Inference using Expectation-Maximization).