Error Model

We want to infer the per-allele error rate in our genotype calls produced with RAD-sequencing data.

The probability of the called genotype \(g\) given the true genotype \(\gamma\), \(\mathbb{P}(g|\gamma, \epsilon_0, \epsilon_1)\), is given in the following table. Rows correspond to the true genotypes and columns to the called genotypes; each genotype is denoted by its number of alternative alleles:

	0	1	2
0	\((1-\epsilon_0)^2\)	\(2\epsilon_0(1-\epsilon_0)\)	\(\epsilon_0^2\)
1	\(\epsilon_1(1-\epsilon_1)\)	\((1-\epsilon_1)^2 +\epsilon_1^2\)	\(\epsilon_1(1-\epsilon_1)\)
2	\(\epsilon_0^2\)	\(2\epsilon_0(1-\epsilon_0)\)	\((1-\epsilon_0)^2\)
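
As a minimal sketch (not TIGER's own code), the table can be written as a 3x3 matrix and checked numerically. The function name `error_matrix` and the example rates are assumptions made for illustration. The entries used here are \(2\epsilon_0(1-\epsilon_0)\) for a homozygote called as heterozygote and \((1-\epsilon_1)^2+\epsilon_1^2\) for a heterozygote called correctly, which are the forms for which every row sums to one.

```python
import numpy as np

def error_matrix(eps0: float, eps1: float) -> np.ndarray:
    """P(g | gamma): rows are true genotypes 0/1/2, columns are called genotypes 0/1/2."""
    return np.array([
        [(1 - eps0) ** 2, 2 * eps0 * (1 - eps0), eps0 ** 2],
        [eps1 * (1 - eps1), (1 - eps1) ** 2 + eps1 ** 2, eps1 * (1 - eps1)],
        [eps0 ** 2, 2 * eps0 * (1 - eps0), (1 - eps0) ** 2],
    ])

# Illustrative (made-up) error rates for homozygous and heterozygous sites.
P = error_matrix(0.01, 0.05)

# Each row is a probability distribution over called genotypes.
assert np.allclose(P.sum(axis=1), 1.0)
print(P)
```

With both rates set to zero the matrix reduces to the identity, i.e. every genotype is called correctly.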

Error models

The error rates \(\epsilon_0\) and \(\epsilon_1\) are the per-allele genotyping error rates for homozygous and heterozygous sites, respectively. TIGER also estimates a single error rate for all sites combined.

Sets

Error rates can vary with sequencing batch and sequencing depth. We therefore categorize the data into sets and estimate a separate error rate for each set. A set is defined by its combination of sequencing batch and sequencing depth.

	sequencing batch 1	sequencing batch 2
1x	set 1	set 2
2x	set 3	set 4
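
The grouping in the table can be sketched as follows. This is an illustration only: the sample records, the depth labels `"1x"` and `"2x"` as class names, and the function `assign_sets` are hypothetical and do not reflect TIGER's input format.

```python
from collections import defaultdict

# Hypothetical records: (sample name, sequencing batch, depth class).
samples = [
    ("ind_A", 1, "1x"),
    ("ind_B", 2, "1x"),
    ("ind_C", 1, "2x"),
    ("ind_D", 2, "2x"),
    ("ind_E", 1, "1x"),
]

def assign_sets(samples, batches=(1, 2), depths=("1x", "2x")):
    """Group samples by (batch, depth), numbering sets row by row as in the table."""
    set_index = {
        (batch, depth): i * len(batches) + j + 1
        for i, depth in enumerate(depths)
        for j, batch in enumerate(batches)
    }
    sets = defaultdict(list)
    for name, batch, depth in samples:
        sets[set_index[(batch, depth)]].append(name)
    return dict(sets)

sets = assign_sets(samples)
for set_id in sorted(sets):
    print(f"set {set_id}: {sets[set_id]}")
```

Each resulting set would then receive its own error rate estimate.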

Inference Models

The error rates can be estimated from sequencing data with different strategies based on different external information or data.

Individual Replicates

This method assumes that you have replicate groups, consisting of different sequencing runs of the same individual.

Hardy-Weinberg

This method assumes that you have populations that are in Hardy-Weinberg equilibrium, i.e. the genotype frequencies are determined solely by the allele frequencies.
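
To illustrate the Hardy-Weinberg assumption, the sketch below computes the expected true genotype frequencies from an alternative allele frequency `f` and combines them with the error model to get the expected distribution of called genotypes. The function names and the numeric values are assumptions for illustration; how TIGER uses these quantities during inference is not shown here.

```python
import numpy as np

def hwe_frequencies(f: float) -> np.ndarray:
    """Frequencies of genotypes 0/1/2 under Hardy-Weinberg, for alt allele frequency f."""
    return np.array([(1 - f) ** 2, 2 * f * (1 - f), f ** 2])

def error_matrix(eps0: float, eps1: float) -> np.ndarray:
    """P(g | gamma): rows are true genotypes, columns are called genotypes."""
    return np.array([
        [(1 - eps0) ** 2, 2 * eps0 * (1 - eps0), eps0 ** 2],
        [eps1 * (1 - eps1), (1 - eps1) ** 2 + eps1 ** 2, eps1 * (1 - eps1)],
        [eps0 ** 2, 2 * eps0 * (1 - eps0), (1 - eps0) ** 2],
    ])

def called_genotype_probs(f: float, eps0: float, eps1: float) -> np.ndarray:
    """P(g) = sum over gamma of P(gamma | f) * P(g | gamma, eps0, eps1)."""
    return hwe_frequencies(f) @ error_matrix(eps0, eps1)

# Illustrative (made-up) values.
print(hwe_frequencies(0.3))
print(called_genotype_probs(0.3, 0.01, 0.05))
```

Deviations of the called genotype frequencies from the Hardy-Weinberg expectation are what carry information about the error rates in this setting.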

Truth Set

This method assumes that you have pairs of sequencing runs of the same individual, where one run of each pair is assumed to yield the true genotypes and the other run contains genotyping errors.
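
As a rough sketch of how a truth set constrains the error rates, suppose the pairs are summarized in a 3x3 table of counts (rows: genotype in the truth run, columns: genotype in the error-prone run). The log-likelihood of those counts under the error model can then be evaluated directly. The grid search over a single shared rate below is a deliberately simple stand-in chosen for illustration, not TIGER's estimation procedure, and the names `log_likelihood` and `grid_estimate` are hypothetical.

```python
import numpy as np

def error_matrix(eps0: float, eps1: float) -> np.ndarray:
    """P(g | gamma): rows are true genotypes, columns are called genotypes."""
    return np.array([
        [(1 - eps0) ** 2, 2 * eps0 * (1 - eps0), eps0 ** 2],
        [eps1 * (1 - eps1), (1 - eps1) ** 2 + eps1 ** 2, eps1 * (1 - eps1)],
        [eps0 ** 2, 2 * eps0 * (1 - eps0), (1 - eps0) ** 2],
    ])

def log_likelihood(counts: np.ndarray, eps0: float, eps1: float) -> float:
    """Log-likelihood of the count table; requires 0 < eps0, eps1 < 1."""
    return float(np.sum(counts * np.log(error_matrix(eps0, eps1))))

def grid_estimate(counts: np.ndarray, grid: np.ndarray) -> float:
    """Single-rate model (eps0 = eps1): return the grid value with the highest likelihood."""
    lls = [log_likelihood(counts, eps, eps) for eps in grid]
    return float(grid[int(np.argmax(lls))])

# Made-up counts: the expected table for 1000 sites per true genotype at a rate of 0.05.
counts = 1000 * error_matrix(0.05, 0.05)
grid = np.linspace(0.001, 0.2, 200)
print(grid_estimate(counts, grid))
```

Estimating \(\epsilon_0\) and \(\epsilon_1\) separately would replace the one-dimensional grid with a search over both rates.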