SimLoRD - Simulation of Long Read Data
SimLoRD is a Python command line program for the simulation of long reads from third generation sequencing technologies.
Currently only the error model of Pacific Biosciences SMRT-sequencing is supported. The identified error model is based on the analysis of two freely available datasets from Pacific Biosciences (D1, D2) and was then validated with two different datasets (D3, D4).
|ID||type||organism||#CCSs||CCS bases||#subreads||subreads bases||chemistry|
|D1||DNA||Neurospora Crassa||12 566||103 Mbp||176 075||982 Mbp||P4-C3|
|D2||RNA||Homo sapiens - brain||161 948||481 Mbp||2 334 092||6 Gbp||P5-C3|
|D3||RNA||Homo sapiens - MCF7||1 134 517||1.9 Gbp||10 029 360||15 Gbp||P4-C2|
|D4||DNA||Caenorhabditis elegans||21 426||350 Mbp||461 333||5 Gbp||P6-C4|
Description of data characteristics and error model
There are two different types of reads resulting from SMRT sequencing: subreads and circular consensus sequence reads (CCSs). Because of the circular form of the DNA fragments with adapter sequences between forward and backward strand, a DNA fragment may be sequenced multiple times in a single run. For a single pass through the sequencer, the error rate is high (subreads), but it is possible to calculate a consensus after multiple passes (CCSs).
We found that DNA reads (subreads as well as CCSs) show a lognormal distribution of the read length, while RNA reads are usually size selected and show peaks in the read length distribution.
Examples for real datasets:
- Read length distribution of the subreads from N. crassa (D1)
- Read length distribution of the CCSs from N. crassa (D1)
- Size selected read length distribution of the CCSs from HG brain (D2)
- Read length distribution of the size selected CCSs (15-20 KBp) from C. elegans (D4)
Results with SimLoRD:
- Calling SimLoRD with (0.2001, -10075, 17923) as lognorm-params (according to the N. crassa dataset, parameter set S1)
Number of passes for one molecule
Within one sequencer run the DNA molecules may be sequenced multiple times depending on their length. Shorter molecules have a tendency for more passes than longer molecules. The following plots show that the different datasets follow a similar distribution.
- Correlation between read length and number of passes for N. crassa CCS (D1)
- Correlation between read length and number of passes for HG brain CCS (D2)
- Correlation between read length and number of passes for HG MCF 7 (D3)
- Correlation between read length and number of passes for C. elegans (D4)
Using the paramters obtained from (D2) SimLoRD leads to the following results for the lengths passes correlation:
If you take a look at the profile of the plots, you can see that the number of passes is chi-square distributed with the parameter depending on the read length.
- chi^2 distribution for CCSs of HG brain with read length from 2050 to 2100 (D2)
- chi^2 distribution for CCSs of HG brain with read length from 3050 to 3100 (D2)
After obtaining the parameters of the chi² distribution for all read length ranges it is possible to fit functions for the parameter calculation:
Correlation of quality increase and number of passes
A FASTQ quality Q describes the error probability p of that base on the logarithmic PHRED scale.
Q = -10 * log10(p) p = 10**(-Q/10)
Subreads have a mean quality of 5 to 10 which means an error probability between 31% and 10%:
- Mean quality of a read vs.read length - subreads N. crassa (D1)
- Mean quality of a read vs.read length - subreads HG brain (D2)
- Mean quality of a read vs.read length - subreads C. elegans (D4)
For CCS reads the quality increases depending on the number of passes for each read. A quality of 25 for example represents a error probability of 0.31%:
If you look at the increase of quality as a function of the number of passes, you see a distribution following a noisy square root function:
Simulation of base qualities
Putting these results together:
- The read length is drawn from a lognormal distribution (or specified otherwise).
- Depending on the read length, the parameters for the chi-square distribution are calculated.
- The number of passes is drawn from the chi square distribution.
- The increase i of quality is drawn from a noisy square root.
- If initial error probabilities p(0) for reads (or basepairs) with one pass are given, we adjust them as follows to p(i) for quality increase i:
p(i) = p(0)**i
- With the adjusted error probabilities, the read can be simulated.
Parameters of SimLoRD
SimLoRD needs a path to an output file for the simulated reads.
Optionally, different names than the defaults may be set for the SAM alignment output, or to save a generated reference.
There are four possibilities to specify the read length:
- give three parameter for lognormal distribution to draw read length from (-ln)
- set fixed read length for all reads (-fl)
- give a FASTQ-file with reads to draw the read length from (-sf)
- give a file containing only read lengths (one int per line) to draw from (-st)
The parameters n and s for the chi² distribution are calculated with the following functions, where x is the read length and the parameters m, b, z and m, b, z, c, a are defined with
- n(x) =
- = m*x + b if x < z,
- = m*z + b if x >=z;
- s(x) =
- = m*x + b if x <= z,
- = c * x**(-a) if x > z.
The quality increase is calculated as sqrt(p + a) - b + N with number of passes p, parameter a and b (-sq) and normally distributed noise N with additional parameters.
Overview over all parameters
The table shows all parameters of SimLoRD as well as the used parameter sets S1 and S2.
|parameter / option (long)||short||description [default value]||parameter set S1||parameter set S2|
|--num-reads||-n||number of reads to simulate ||12566||-|
|--coverage||-c||desired read coverage, used to calculate number of reads||-||-|
|--read-reference PATH||-rr||read reference from FASTA file||neurospora_crassa.fa||-|
|--generate-reference GC LEN||-gr||generate reference with given GC content and length||-||-|
|--save-reference PATH||-sr||path for generated reference||-||-|
|--lognorm-readlength PARAMS||-ln||draw read length from log-normal distribution [0.2001, -10075.4364, 17922.611]||0.2001, -10075.4364, 17922.611||-|
|--min-readlength LEN||-mr||minimum log-normal read length ||50||-|
|--fixed-readlength LEN||-fl||constant read length for all reads||-||-|
|--sample-readlength-from-fastq PATH||-sf||draw read length from given FASTQ file||-||-|
|--sample-readlength-from-text PATH||-st||draw read length from given numbers (one per line)||-||-|
|--max-passes N||-mp||maximum number of passes over a read ||40||-|
|--chi2-params-n PARAMS||-xn||parameters to calculate n (df) for chi^2 distribution [1.8923e-03, 2.5394e+00, 5500]||1.8923e-03, 2.5394e+00, 5500||-|
|--chi2-params-s PARAMS||-xs||parameters to calculate s for chi^2 distribution [0.0121, -5.12, 675, 48303.073, 1.469]||0.0121, -5.12, 675, 48303.073, 1.469||-|
|--sqrt-params PARAMS||-sq||parameters for quality increase square root function [0.5, 0.2247]||0.5, 0.2247||-|
|--prob-ins P||-pi||insertion prob. for one-pass reads [0.11]||0.15||-|
|--prob-del P||-pd||deletion prob. for one-pass reads [0.04]||0.09||-|
|--prob-sub P||-ps||substitution prob. for one-pass reads [0.01]||0.4||-|
|--sam-output PATH||-so||custom path for SAM file||-||-|
|--no-sam||do not create SAM file||-||-|
|--without-ns||skip regions containing Ns and sample reads only from parts completly without Ns||-||-|
|--uniform-chromosome-probability||sample chromosomes for reads equally distributed instead of weighted by their length. (Was default behaviour up to version 1.0.1)||-||-|
Warning: Using --without-ns may lead to biased read coverage depending on the size of contigs without Ns and the expected readlength.
Comparison of SimLoRD and other simulators
A comparison of the simulation results of SimLoRD, PBSIM and FASTQSim can be found here.