ATLAS / Sequence Data Processing Tools: BQSR

Overview

How BQSR works

This functionality is a direct extension of BQSR (Base Quality Score Recalibration) implemented in GATK (DePristo et al. (2011), McKenna et al. (2010)), the difference being that our version of BQSR is able to correct for post-mortem damage (PMD).

The simplified explanation of the BQSR method is the following: sequencing reads are mapped to a reference genome and the known variant sites are masked. The sequencing error rate is then defined as the number of mismatches over the number of aligned positions. However, the error rate is known to differ between bases called in a sequencing run. BQSR therefore infers new quality scores by binning the sequencing data into groups based on covariates; all bases within such a bin are assumed to share the same error rate, i.e. the same recalibration coefficients. The covariates currently implemented in ATLAS are the raw machine-estimated error rate, the position in the read counted from the 5'-end, the reverse position in the read counted from the 3'-end, and the sequence context, with corresponding recalibration coefficients \(\boldsymbol{\beta_q},\boldsymbol{\beta_p}, \boldsymbol{\beta_r}, \boldsymbol{\beta_c}\), respectively. The recalibration coefficients are inferred from the reference sequence after the positions of common variants have been masked, using the Newton-Raphson method to find the most likely estimates for all \(\boldsymbol{\beta}\). The final error rate for each called base is computed as follows:

\begin{equation*} \boldsymbol{\epsilon} = \boldsymbol{\beta_q}\boldsymbol{\beta_p} \boldsymbol{\beta_r} \boldsymbol{\beta_c}. \end{equation*}
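As a hedged numeric illustration of this product (the coefficient values below are invented for the example and are not ATLAS output), the final error rate can also be converted to a Phred-scaled quality:

```shell
# Hypothetical recalibration coefficients for one base, one per covariate:
out=$(awk 'BEGIN {
  bq = 0.001; bp = 1.2; br = 0.9; bc = 1.1    # beta_q, beta_p, beta_r, beta_c (made up)
  eps = bq * bp * br * bc                     # final per-base error rate
  q   = -10 * log(eps) / log(10)              # Phred-scaled quality
  printf "eps=%.6f Q=%.1f", eps, q
}')
echo "$out"
# prints: eps=0.001188 Q=29.3
```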

For each covariate one BQSR table is produced, containing the recalibration coefficients for each bin of that covariate. These tables are produced sequentially, in the order of the equation above. Separate recalibration coefficients are estimated for each read group.

IMPORTANT NOTE: BQSR assumes that all variant sites in the genome are known and provided in the masking bed file. See recal for another recalibration method.

Different ways of using BQSR

The reading of the BAM file is the most time-consuming part. Since the bins are defined separately for each readgroup, it is possible to parallelize BQSR by readgroup. This can be done by splitting the BAM file according to readgroup, running BQSR for each of these BAM files separately, and in the end concatenating the covariate tables.
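A possible sketch of this workflow (the samtools and ATLAS invocations are illustrative; check the exact options of your samtools version, and note the mock tables below only stand in for real ATLAS output, whose column layout may differ):

```shell
#!/bin/sh
# Parallelizing BQSR by read group, conceptually:
#
#   samtools split -f '%!.bam' merged.bam      # one BAM per @RG
#   for bam in readgroup*.bam; do
#       ./atlas task=BQSR bam="$bam" fasta=reference.fa mask=variable_regions.bed &
#   done
#   wait
#
# Afterwards, concatenate the per-read-group covariate tables while
# keeping a single header line. Mock tables with invented columns:
printf 'readGroup\tbin\tcoefficient\nrg1\t1\t0.98\n' > rg1_Quality_Table.txt
printf 'readGroup\tbin\tcoefficient\nrg2\t1\t1.03\n' > rg2_Quality_Table.txt

# Skip the first line of every file except the first one:
awk 'FNR == 1 && NR != 1 { next } { print }' \
    rg1_Quality_Table.txt rg2_Quality_Table.txt > merged_Quality_Table.txt
```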

For the estimation of each covariate, the BAM file is read and stored in memory. The amount of memory needed is approximately equal to the size of the BED file + 5 x the size of the BAM file. If this memory requirement cannot be met, even when parallelizing by readgroup, it is possible, although tedious and slow, to run BQSR by reading and storing only one window at a time. See example below.
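The rule of thumb above can be checked before launching a run; a minimal sketch (file names and sizes are placeholders standing in for your real BAM and BED files):

```shell
#!/bin/sh
# Rough memory check for storeInMemory:
#   required RAM ~ size(BED) + 5 * size(BAM)
head -c 1048576 /dev/zero > example.bam     # 1 MiB stand-in for the BAM
head -c 524288  /dev/zero > mask.bed        # 0.5 MiB stand-in for the BED

bam_bytes=$(wc -c < example.bam)
bed_bytes=$(wc -c < mask.bed)
needed=$((bed_bytes + 5 * bam_bytes))
echo "approx. RAM needed: $needed bytes"
```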

BQSR requires a minimum of around 6 million reads per read group. If some read groups are too small, you should merge them using the input option poolReadGroups (see Input for more details). Read groups should be merged according to library, sequencing run and sequencing lane, in that order. The more the read groups differ from each other, the less advisable merging becomes, since BQSR estimates parameters that are highly specific to each read group.

The data available to BQSR can be increased by using the engine parameter keepDuplicates, which disables the filter that normally removes PCR duplicates. PCR duplicates can be treated as independent data points when estimating sequencing accuracy.

In order to mask the variant sites, you can either pass a BED file containing the invariant sites (use the parameter "alleles") or a BED file containing the variant sites (use the parameter "mask").

Input

  • BAM file

  • FASTA file

  • BED file : specifying known variants, centromeres, and telomeres (the positions should be 0-based)

  • PMD file : for ancient DNA

  • txt file (optional): e.g. mergeTheseRGs.txt. This file can be used if some read groups have fewer than ~6 million reads and should be pooled during BQSR. All read groups that should be merged are listed on the same line, separated by any white space. Read groups not mentioned in this file will be recalibrated separately.

    Example:

    readgroup1 readgroup2

    readgroup4 readgroup6 readgroup8

  • An alleles.bed file : This file contains all known invariant sites of the genome in the following format:

    chromosome position allele1 allele2

    If at a given position the reference differs from both alleles in the BED file, the position is written to a conflicts.txt file and is not taken into account. Allele1 and allele2 have to be identical in the BED file, otherwise the site is not invariant. If the alleles differ, no error is thrown, but allele1 is taken to be the reference.

  • OR: A mask.bed file : This file contains all known variant sites of the genome in the following format:

    chromosome positionStart positionEnd
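One way to build such a 0-based mask.bed is to convert a list of known variant positions; a sketch with a minimal mock VCF (a stand-in for a real variant resource such as dbSNP). VCF coordinates are 1-based, while BED intervals are 0-based and half-open:

```shell
#!/bin/sh
# Mock VCF-like input; a real pipeline would start from an actual VCF.
cat > known_variants.vcf <<'EOF'
#CHROM	POS	ID	REF	ALT
chr1	101	.	A	G
chr1	250	.	C	T
EOF

# Shift the 1-based position to a 0-based [start, end) BED interval:
awk 'BEGIN { OFS = "\t" } !/^#/ { print $1, $2 - 1, $2 }' \
    known_variants.vcf > mask.bed
cat mask.bed
```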

Output

Recalibration Tables : For each covariate analyzed one table is produced. These tables specify the recalibration parameters and will be used as input in subsequent tasks, such as variant calling. The following information is provided:

  • The read group the estimates correspond to. If read groups were merged for BQSR, the read group names will differ but all other entries will be identical
  • The bins of the covariates
  • The event type. The only event type implemented in ATLAS is "M" for mutation
  • The recalibration coefficient for each bin
  • The number of observations in each bin in log10 scale
  • The first derivative of the loglikelihood function at the current estimation of the error rate. This is needed for the Newton-Raphson method and should converge to zero. You can specify the threshold after which you consider the algorithm to have reached convergence (see specific argument maxF below)
  • The second derivative. This is needed for the Newton-Raphson method
  • F: first derivative / number of observations
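Since the tables report F, they can be used to sanity-check convergence after a run. A sketch with a mock table (the column layout below is invented for illustration; adapt the awk field index to your actual table header):

```shell
#!/bin/sh
# Flag bins whose F value has not converged close enough to zero.
cat > quality_table.txt <<'EOF'
readGroup bin coefficient F
rg1 1 0.98 0.00000001
rg1 2 1.10 0.0003
EOF

maxF=0.0000001
awk -v maxF="$maxF" 'NR > 1 && ($4 > maxF || $4 < -maxF) {
    print "not converged:", $1, "bin", $2, "F =", $4
}' quality_table.txt
```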

Usage Examples

The standard way to produce tables for all covariates is the following:

./atlas task=BQSR bam=example_readgroup1.bam fasta=reference.fa pmdFile=example_PMD_input_Empiric.txt mask=variable_regions.bed estimateBQSRPosition estimateBQSRPositionReverse estimateBQSRContext maxPos=100 window=1000000 minEpsQuality=0.001 minEpsFactors=0.001 storeInMemory verbose

With merged read groups:

./atlas task=BQSR bam=example_readgroup1.bam fasta=reference.fa pmdFile=example_PMD_input_Empiric.txt mask=variable_regions.bed  estimateBQSRPosition estimateBQSRPositionReverse estimateBQSRContext maxPos=100 window=1000000 minEpsQuality=0.001 minEpsFactors=0.001 storeInMemory poolReadGroups=mergeTheseReadGroups.txt verbose #the recalibration table for the quality covariate is produced by default

It is possible to estimate only some of the covariates at a time, but previously produced covariate tables should then be passed with the appropriate argument in order to get correct estimates for the subsequent covariates. Example for when the quality table was previously created:

./atlas task=BQSR bam=example_readgroup1.bam fasta=reference.fa pmdFile=example_PMD_input_Empiric.txt mask=variable_regions.bed BQSRQuality=example_readgroup1_BQSR_ReadGroup_Quality_Table.txt estimateBQSRPosition estimateBQSRPositionReverse estimateBQSRContext maxPos=100 window=1000000 minEpsQuality=0.001 minEpsFactors=0.001 storeInMemory verbose

It is also possible to re-estimate covariates for which tables have already been produced. In this case, the estimates of the input table are used as starting values. If the starting values are close to the true values, this can save time:

./atlas task=BQSR bam=example_readgroup1.bam fasta=reference.fa pmdFile=example_PMD_input_Empiric.txt mask=variable_regions.bed BQSRQuality=example_readgroup1_BQSR_ReadGroup_Quality_Table.txt estimateBQSRQuality estimateBQSRPosition estimateBQSRPositionReverse estimateBQSRContext maxPos=100 window=1000000 minEpsQuality=0.001 minEpsFactors=0.001 storeInMemory verbose

When running BQSR by reading and storing only one window at a time, adding a "preconvergence" step before estimating each covariate can speed up the process. The idea behind the preconvergence step is to estimate the given covariate for only part of the genome, e.g. only the first 100 windows of chromosome 1, and then pass the resulting table as an argument when running BQSR on the whole genome. This improves the starting values of the algorithm and thus decreases the number of loops needed to obtain the final estimates:

./atlas task=BQSR bam=example_readgroup1.bam fasta=reference.fa pmdFile=example_PMD_input_Empiric.txt  mask=variable_regions.bed preconverge=50 numWindows=100 verbose #the recalibration table for the quality covariate is produced by default
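The full two-step workflow can be sketched as follows. This is a dry run: "atlas" below is a stand-in shell function that only echoes the command it would execute, and the table file name in step 2 is hypothetical (it depends on your BAM and read group names); replace both with the real binary and output file.

```shell
#!/bin/sh
# Stand-in for the real ./atlas binary (dry run, just echoes the command):
atlas() { echo "./atlas $*"; }

# Step 1: preconverge on the first 100 windows of chromosome 1 only.
atlas task=BQSR bam=example_readgroup1.bam fasta=reference.fa \
    mask=variable_regions.bed preconverge=50 numWindows=100 > step1.log

# Step 2: run on the whole genome, seeded with the table from step 1
# (hypothetical file name):
atlas task=BQSR bam=example_readgroup1.bam fasta=reference.fa \
    mask=variable_regions.bed \
    BQSRQuality=example_readgroup1_BQSR_ReadGroup_Quality_Table.txt > step2.log
```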

Specific Arguments

General

  • estimateBQSRQuality : Create a readgroup x quality recalibration table
  • estimateBQSRPosition : Create a readgroup x position recalibration table
  • estimateBQSRPositionReverse : Create a readgroup x position reverse recalibration table
  • estimateBQSRContext : Create a readgroup x context recalibration table
  • minQ : Minimum raw quality score to be considered for recalibration. Bases with lower quality scores will be added to the minQ bin. Default = 1
  • maxQ : Maximum raw quality score to be considered. Bases with higher quality scores will be added to the maxQ bin. Default = 100
  • maxPos : This should be set equal to or larger than the maximum read length in the case of single-end data, and equal to or larger than the maximum fragment length in the case of paired-end data. It lets BQSR know how many variables it will have to estimate for the position covariate. If a read/fragment longer than maxPos is found, an error is thrown. Default = 100
  • storeInMemory : Read the BAM file and store it in memory for the estimation of each cofactor as opposed to reading it window by window. Use this option unless the BAM file requires too much memory for your machine and cannot be split according to read groups.
  • writeTmpTables : Print table for each loop of the Newton-Raphson algorithm, which finds the maximum likelihood recalibration coefficient for each bin.

Newton-Raphson

  • maxF : Fine-tunes the Newton-Raphson algorithm, which is used to estimate the recalibration parameter for each bin and each covariate. maxF specifies how close to 0.0 the first derivative must be for the algorithm to be considered converged. Default = 0.0000001
  • minEpsQuality : Alternative convergence criterion for the Newton-Raphson algorithm: the algorithm is considered converged when the recalibration parameter being estimated remains constant up to the specified decimal position. minEpsQuality sets this threshold for the Quality covariate. Default = 10^-6; we used 10^-3.
  • minEpsFactors : Same as minEpsQuality, but for the Position, Position Reverse and Context covariates. Default = 10^-4; we used 10^-3
  • minObservations : Minimum number of values in a cell required for ATLAS to consider it. Default = 32'000
  • maxNumLoops : Specify the max number of loops for Newton-Raphson, default = 500
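To make the convergence criteria concrete, here is a generic one-dimensional Newton-Raphson sketch (toy code, not ATLAS internals): it maximizes a binomial log-likelihood for an error rate and stops once either |F| (first derivative divided by the number of observations) drops below maxF, or the estimate changes by less than minEps:

```shell
# Toy data: 3 mismatches in 10 aligned bases; the maximum-likelihood
# estimate of the error rate is 3/10 = 0.3.
out=$(awk 'BEGIN {
  k = 3; n = 10; p = 0.5                  # data and starting value
  maxF = 0.0000001; minEps = 0.000001; maxNumLoops = 500
  for (i = 1; i <= maxNumLoops; i++) {
    d1 = k / p - (n - k) / (1 - p)        # first derivative of the log-likelihood
    d2 = -k / (p * p) - (n - k) / ((1 - p) * (1 - p))   # second derivative
    pNew = p - d1 / d2                    # Newton-Raphson update
    F = d1 / n                            # first derivative / number of observations
    step = pNew - p; if (step < 0) step = -step
    p = pNew
    if ((F < maxF && F > -maxF) || step < minEps) break
  }
  printf "loops=%d p=%.6f", i, p
}')
echo "$out"
# prints: loops=2 p=0.300000
```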

Input

  • BQSRQuality : File with error estimates based on the quality score (to be used as initial values for further recalibration steps)
  • BQSRPosition : File with alpha estimates based on the position (to be used as initial values for further recalibration steps)
  • BQSRPositionReverse : File with alpha estimates based on the reverse position (to be used as initial values for further recalibration steps)
  • BQSRContext : File with error estimates based on the context score (to be used as initial values for further recalibration steps)
  • poolReadGroups : Input a .txt file that specifies which read groups should be merged for the BQSR. See Input for more information.
  • alleles : Input a 0-based BED file containing invariant sites positions and the corresponding base (see Input for format)

preConverge

  • preConverge : Loop over chr1 for at most the specified number of loops and use the values in the resulting table as initial error estimates for the run over the whole genome. This reduces the total running time.
  • limitChrPreConv : Specify up to which chromosome preconvergence should run. Default = 1
  • limitWindowsPreConv : Specify up to which window on each chromosome preconvergence should run. Default = 100
  • LLSurface : print log likelihood surface, for debugging

Engine Parameters

Engine parameters that are common to all tasks can be found here.
