Wiki

Clone wiki

ATLAS / Auxiliary Tools: simulate

Overview

Simulate BAM files. The quality scores can reflect either the true error rates or they can be distorted with recalibration parameters as in recal. Possibility of adding post-mortem damage and contamination.

Output

  • BAM file containing simulated reads with suffix _simulations.bam
  • reference FASTA file

If the parameter writeTrueGenotypes is given:

  • true genotypes in VCF format

If the parameter writeVariantBED is given:

  • BED file with variant site positions and genotypes
  • BED file with invariant site positions and genotypes

Usage Example

./atlas task=simulate out=example depth=2 theta=0.001

Specific Arguments

Genome

The following arguments allow the simulated genomic architecture to be specified:

  • baseFreq: Frequencies of A, C, G and T to be simulated. Default: baseFreq=0.25,0.25,0.25,0.25, which implies equal base frequencies.
  • chrLength: List of the lengths of all chromosomes to be simulated. Default: a single chromosome of length 1000000.
  • ploidy: List of the ploidy for each chromosome. Currently only accepts 1 (haploid) and 2 (diploid). Default: a single chromosome of ploidy 2.

Both lists may contain multipliers indicated by curly braces. Example: chrLength=1000000{2},5000000 implies three chromosomes, of which the first two are of 1Mb in length, and the third half of that.

If all chromosomes are to be simulated identically, then you may use the argument:

  • numChr: Allows to specify the number of identical chromosomes to be simulated. Is ignored if more than one length or ploidy is specified.

Samples

  • sampleSize: Number of samples to be simulated. Default: a single individual
  • depth: Average read depths per individual. Default: 10

ReadLength

There are different ways to simulate the read lengths and the read length parameter is also used to define if you simulate single-end or paired-end data. To specify sequencing type, prefix the read length distribution parameters with either "single:" or "paired:".

By default all reads are considered to be of equal length and also equal to the underlying fragments. In this case:

  • readLength: Length of simulated reads/fragments. Default: "single:fixed(100)"

However, fragment lengths can also be sampled from a gamma distribution. The gamma distribution can be parameterized either with the shape parameters alpha and beta or with the mode and the variance. In both cases a minimum and maximum read length have to be provided. The simulated fragments that are shorter than the minimum are be discarded while those that are longer than the maximum are kept but truncated. The amount of reads that need to be simulated given the required sequencing depth is then estimated based on the read length distribution's mean, i.e. the a truncated gamma distribution's mean.

  • gamma(alpha,beta)[min,max]: Use this syntax if you know the shape parameters of your gamma function of interest. For example gamma(10,0.2)[30,101].
  • gammaMode(mode,var)[min,max]: Use this syntax if you know the mode and variance of your gamma function of interest

Example to simulate single-end reads from fragments following a gamma distribution:

readLength=single:gamma(10,0.2)[30,101]

Diversity

Different way to specify genetic diversity are available, depending on the number of samples to be simulated.

type = one

  • theta: For a single sample, diversity is specified as \(\theta = 2 T \mu\). Default: 0.001. Note that theta has no effect on haploid chromosomes. A different theta can be specified for each chromosome.

Population sample (two or more samples):

For population samples, ATLAS offers different ways to specify diversity. Specify the prefered model with method = one of the below

type = SFS

  • theta: This option will simulate samples from a single population with a site frequency spectrum (SFS) as expected with a \(\theta=4N\mu\). Note that you can specify a separate theta value for each chromosome (accepts multipliers). For instance, theta=0.001{2},0.0005 would implied a theta of 0.001 for the first two and a theta of 0.0005 for the third chromosome.
  • sfs: You may also specify the name of a file that contains a specific site frequency spectrum (SFS) directly with this command (also accepts one SFS filename for each chromosome). Here an SFS file simply contains a single row with the frequencies (or counts) associated with each SFS bin. Note that SFS for haploid chromosomes must contain reflect the smaller number of samples for that chromosome.

You may also add the argument folded to indicate that the SFS files provided contained folded spectra.

type = pair

  • phi: ATLAS also allows to simulate two individuals with a particular genetic distance. To do so, use sampleSize=2 and provide the frequencies of the nine genotype configurations aa/aa, aa/ab, ab/aa, aa/bb, ab/ab, ab/ac, aa/bc, ab/cc and ab/cd as for instance phi=0.4,0.1{4},0.05{4}. The distribution of the actual genotypes is then determined based on these phi and the base frequencies (see above).

More information on this model can be found under the task estimateDist.

type = HW

You can simulate according to HW, with or without inbreeding

  • F : Inbreeding coefficient. Default = 0
  • alpha: Parameter of the beta distribution from which allele frequencies are drawn for polymorphic sites. Default=0.5
  • beta: Parameter of the beta distribution from which allele frequencies are drawn for polymorphic sites. Default=0.5
  • fracPoly: Fraction of sites that are polymorphic, i.e. with allele freq that is not zero or one. Default=0.1
  • writeTrueAlleleFreq: the true allele frequencies will be written to a file.

Reference Genome

ATLAS will also simulate a reference genome, for which the following arguments are available:

  • refDiv: The probability that the reference carries a different base at a site monomorphic in the sample. At polymoprhic sites the reference will carry one of the two segregating alleles at random. Default: 0.01
  • writeTrueGenotypes: write true genotypes and bed files with variant and invariant sites to file

Contamination

  • contamination : With this probability a read will be simulated based on the reference genome instead of one of the sample's haplotypes. Default = 0

Base qualities

There are different distributions of base quality scores according to which the sequencing errors will be simulated. By default the qualities are sampled from a normal distribution. In this case:

  • qualityDist=normal(30,10)[0,93]

where the syntax is normal(mean,sd)[min,max]. The other quality simulation options are:

  • fixed(quality_1): simulate a single quality score
  • binned(quality_1,quality_2,...,quality_n): sample quality scores uniformly from predefined bins

Base quality transformation

Base qualities can also be transformed (distorted) such that they need to be recalibrated. This distortion can be simulated according to either of our two recalibration methods

recal

This implements the inverse of the recalibration function learned by the task recal. To simulate according to the recal model you have to provide the following parameters: quality, quality^2, position, position^2 and one each for all 20 contexts. This can be done either via command line as:

  • recal=recal[beta_q,beta_q2,beta_p,beta_p2,...(beta for all 20 context)...]

or you can provide the example_recalibrationEM.txt file produced with ATLAS' recal.

BQSR

Simulating according to the BQSR model is less straightforward because the recalibration parameters are estimated sequentially, meaning that different combinations of the recalibration parameters \(\boldsymbol{\beta}\) can result in the same \(\epsilon\). Our BQSR simulation model therefore only contains the recalibration parameters \(\boldsymbol{\beta_q}\) and \(\boldsymbol{\beta_p}\). The quality scores written into the BAM file correspond to \(\boldsymbol{\beta_q}\). They are related to the simulated error rates \(\boldsymbol{\epsilon_q}\) with the following equation:

\begin{equation*} \boldsymbol{\epsilon_q} = 10^{-\frac{1}{10} \phi_2 \boldsymbol{\beta_{q}}} + 10^{-\phi_1/10}. \end{equation*}

The \(\boldsymbol{\beta_p}\) are simulated to have a linear relationship with \(p\), to equal revIntercept at the maximum read length, and to fulfill the following condition:

\begin{equation*} \sum_{p=1}^{P} \beta_{p} f_{p} = 1 \end{equation*}

To simulate according to the BQSR model use the command line to provide the following:

  • BQSRTransformation=BQSR[phi1,phi2,revIntercept]

where

  • phi1: quality score at which the transformation function should plateau, use e.g. 40
  • phi2: scaling factor that determines how fast the plateau should be reached, use e.g. 1.2
  • revIntercept: maximum position-effect scaling factor. This is the \(\beta_p\) at the maximum read length. Use e.g. 1.5. If the position should have no effect on the quality distortion use 1.

Post-Mortem Damage

The simulation tool accepts the arguments pmd, pmdCT and pmdGA as detailed on the engine parameter page. If neither of those arguments are given, no PMD is simulated.

Method

Reference sequences for each chromosome are simulated as a first step, Then, one or two haplotypes are simulated per chromosome, depending on the ploidy. These haplotypes differ from the reference with a probability given by refDiv. Finally, the reads are created based on the chromosome's haplotypes (or the reference if there is contamination). The haplotype used for each read is chosen at random and errors are simulated according to the true quality distribution. Post-mortem damage events can be added to the reads and the quality scores can be transformed according to the provided distributions.

Engine Parameters

Note that the simulation tool does not accept most of the engine parameters except the random generator options:

  • fixedSeed : set the seed of the random generator
  • addToSeed : this command is useful if you launch several jobs at the same time and you do not want them to use the same seed. As a default, the random generator obtains its seed based on the time of day. With addToSeed you can add something to this seed based on the time, such as job-ID

Updated