SimLoRD - Simulation of Long Read Data

SimLoRD is a Python command line program for the simulation of long reads from third generation sequencing technologies.

Currently only the error model of Pacific Biosciences SMRT-sequencing is supported. The identified error model is based on the analysis of two freely available datasets from Pacific Biosciences (D1, D2) and was then validated with two different datasets (D3, D4).

ID	type	organism	#CCSs	CCS bases	#subreads	subreads bases	chemistry
D1	DNA	Neurospora Crassa	12 566	103 Mbp	176 075	982 Mbp	P4-C3
D2	RNA	Homo sapiens - brain	161 948	481 Mbp	2 334 092	6 Gbp	P5-C3
D3	RNA	Homo sapiens - MCF7	1 134 517	1.9 Gbp	10 029 360	15 Gbp	P4-C2
D4	DNA	Caenorhabditis elegans	21 426	350 Mbp	461 333	5 Gbp	P6-C4

Description of data characteristics and error model

There are two different types of reads resulting from SMRT sequencing: subreads and circular consensus sequence reads (CCSs). Because of the circular form of the DNA fragments with adapter sequences between forward and backward strand, a DNA fragment may be sequenced multiple times in a single run. For a single pass through the sequencer, the error rate is high (subreads), but it is possible to calculate a consensus after multiple passes (CCSs).

Read lengths

We found that DNA reads (subreads as well as CCSs) show a lognormal distribution of the read length, while RNA reads are usually size selected and show peaks in the read length distribution.

Examples for real datasets:

Results with SimLoRD:

Calling SimLoRD with (0.2001, -10075, 17923) as lognorm-params (according to the N. crassa dataset, parameter set S1)

Number of passes for one molecule

Within one sequencer run the DNA molecules may be sequenced multiple times depending on their length. Shorter molecules have a tendency for more passes than longer molecules. The following plots show that the different datasets follow a similar distribution.

Using the paramters obtained from (D2) SimLoRD leads to the following results for the lengths passes correlation:

SimLoRD results for parameter set S1

If you take a look at the profile of the plots, you can see that the number of passes is chi-square distributed with the parameter depending on the read length.

After obtaining the parameters of the chi² distribution for all read length ranges it is possible to fit functions for the parameter calculation:

Correlation of quality increase and number of passes

A FASTQ quality Q describes the error probability p of that base on the logarithmic PHRED scale.

Q = -10 * log10(p)
p = 10**(-Q/10)

Subreads have a mean quality of 5 to 10 which means an error probability between 31% and 10%:

For CCS reads the quality increases depending on the number of passes for each read. A quality of 25 for example represents a error probability of 0.31%:

If you look at the increase of quality as a function of the number of passes, you see a distribution following a noisy square root function:

Increase vs. passes HG brain (D2))

Simulation of base qualities

Putting these results together:

The read length is drawn from a lognormal distribution (or specified otherwise).
Depending on the read length, the parameters for the chi-square distribution are calculated.
The number of passes is drawn from the chi square distribution.
The increase i of quality is drawn from a noisy square root.
If initial error probabilities p(0) for reads (or basepairs) with one pass are given, we adjust them as follows to p(i) for quality increase i:
```
p(i) = p(0)**i
```
With the adjusted error probabilities, the read can be simulated.

Parameters of SimLoRD

SimLoRD needs a path to an output file for the simulated reads.

Optionally, different names than the defaults may be set for the SAM alignment output, or to save a generated reference.

There are four possibilities to specify the read length:

give three parameter for lognormal distribution to draw read length from (-ln)
set fixed read length for all reads (-fl)
give a FASTQ-file with reads to draw the read length from (-sf)
give a file containing only read lengths (one int per line) to draw from (-st)

The parameters n and s for the chi² distribution are calculated with the following functions, where x is the read length and the parameters m, b, z and m, b, z, c, a are defined with -xn and -xs, respectively:

n(x) =
- = m*x + b if x < z,
- = m*z + b if x >=z;
s(x) =
- = m*x + b if x <= z,
- = c * x**(-a) if x > z.

The quality increase is calculated as sqrt(p + a) - b + N with number of passes p, parameter a and b (-sq) and normally distributed noise N with additional parameters.

Overview over all parameters

The table shows all parameters of SimLoRD as well as the used parameter sets S1 and S2.

parameter / option (long)	short	description [default value]	parameter set S1	parameter set S2
--num-reads	-n	number of reads to simulate [1000]	12566	-
--coverage	-c	desired read coverage, used to calculate number of reads	-	-
--read-reference PATH	-rr	read reference from FASTA file	neurospora_crassa.fa	-
--generate-reference GC LEN	-gr	generate reference with given GC content and length	-	-
--save-reference PATH	-sr	path for generated reference	-	-
--lognorm-readlength PARAMS	-ln	draw read length from log-normal distribution [0.2001, -10075.4364, 17922.611]	0.2001, -10075.4364, 17922.611	-
--min-readlength LEN	-mr	minimum log-normal read length [50]	50	-
--fixed-readlength LEN	-fl	constant read length for all reads	-	-
--sample-readlength-from-fastq PATH	-sf	draw read length from given FASTQ file	-	-
--sample-readlength-from-text PATH	-st	draw read length from given numbers (one per line)	-	-
--max-passes N	-mp	maximum number of passes over a read [40]	40	-
--chi2-params-n PARAMS	-xn	parameters to calculate n (df) for chi^2 distribution [1.8923e-03, 2.5394e+00, 5500]	1.8923e-03, 2.5394e+00, 5500	-
--chi2-params-s PARAMS	-xs	parameters to calculate s for chi^2 distribution [0.0121, -5.12, 675, 48303.073, 1.469]	0.0121, -5.12, 675, 48303.073, 1.469	-
--sqrt-params PARAMS	-sq	parameters for quality increase square root function [0.5, 0.2247]	0.5, 0.2247	-
--prob-ins P	-pi	insertion prob. for one-pass reads [0.11]	0.15	-
--prob-del P	-pd	deletion prob. for one-pass reads [0.04]	0.09	-
--prob-sub P	-ps	substitution prob. for one-pass reads [0.01]	0.4	-
--sam-output PATH	-so	custom path for SAM file	-	-
--no-sam		do not create SAM file	-	-
--without-ns		skip regions containing Ns and sample reads only from parts completly without Ns	-	-
--uniform-chromosome-probability		sample chromosomes for reads equally distributed instead of weighted by their length. (Was default behaviour up to version 1.0.1)	-	-

Warning: Using --without-ns may lead to biased read coverage depending on the size of contigs without Ns and the expected readlength.

Comparison of SimLoRD and other simulators

A comparison of the simulation results of SimLoRD, PBSIM and FASTQSim can be found here.

Wiki

simlord / Home