Wiki

Overview

We want to infer the per-allele error rate in our genotype calls produced with RAD-sequencing data. This can be using one of the following models, which are based on different external information or data. The only thing that differs between the models is the likelihood function. The tasks for the different models are called: estimateHardyWeinberg, estimateIndReps, estimateTruthSet

For all likelihood models two different error rate estimations are performed. 1. Estimating a single error rate for all sites 2. Estimating a separate error rate for truly homozygous sites and truly heterozygous sites.

Basic parameters

vcf: The minimum input is a VCF file
outname: By default the output uses the vcf filename as a prefix but can be changed with this parameter
limitLines: Read only a given amount of lines from VCF

Sample groups

By default, all samples are considered to belong to the same group. The definition of the group depends on the model used. If the Hardy Weinberg model is used, the groups are the populations. If the replicates model is used, the groups are the replicate groups and each group would only contain data from one individual.

If the truth set model is a bit different because the number of samples per group must always be two. For this model you must always also provide another file containing the information (see parameter samplePairs)

Parameters:

groups: Specify the samples that should be in the same group as [Sample1,Sample2],[Sample3,Sample4], where all samples must be contained in one group. You can also provide a text file where the first column contains the sample names and one of the following columns is contains a group assignment. This text file must contain a header.
groupCol: Indicate which column to read group assignments from when using groups to specify an assignment file. Default = 2
samplePairs: Provide a text file with two columns: The first contains the "observed" sample names and the second the corresponding "true" sample names.

Sequencing batches

By default, all samples are considered to have the same error rates per depth. However, you may have samples that were sequenced in separate runs and thus have different error rates. want to create error classes, e.g. if some samples were sequenced in different runs and you have reason to believe that this affects the error rate.

Parameters:

batches: Specify the samples that should be in the same sequencing batch as [Sample1,Sample2],[Sample3,Sample4], where all samples must be contained in one class. You can also provide a text file where the first column is the sample names and one of the following columns is contains a error class assignment. This text file must contain a header.
batchesCol: Indicate which column to read batch assignments from, when using batches to specify an assignment file. Default = 3

Inference algorithm

For all models you can choose if you want to infer the parameters with an EM-algorithm and obtain the maximum likelihood estimates or if you want to use an MCMC and obtain the posterior densities of the parameters. The parameter for this is algo and you can set it to either MLE or Bayes.

Parameters:

algo: Can be set to either "MLE" or "Bayes". Default for the Hardy-Weinberg model is Bayes and for the rest MLE.
numIter: Control the maximum number of iterations of the EM algorithm or MCMC algorithm. Default for EM = 1000 default for MCMC = 100000.
minDelta: Assume EM-algorithm to have converged when the parameter value does not change by more than minDelta. default = 0.000001
minEpsIter: Minimum amount of EM iterations to be run, even if algorithm has converged
maxEpsIter: Minimum amount of EM iterations to be run, even if algorithm has not converged
burnin: Length of burnin in MCMC. Default = 5000
numBurnins: Number of burnin rounds in MCMC (used to adjust proposal range for optimal acceptance rate). Default = 5
thinning: After how many MCMC iterations the sampled values should be printed to file. Default = 10.
epsLambda: The parameter used for the exponential prior on the error rates

estimateIndReps

This model assumes that the same individual was sequenced in different sequencing runs.

Input

VCF file
optional: replicate group association for each sample
optional: batch association

Output

example_errorRates.txt: estimates of error rates for all sets (as defined by sequencing depth and batch)
example_individualGenotypeFrequencies.txt: estimates of genotype frequencies for all replicate groups

Usage example

Different error rates for depth=1 and depth=2, only one sequencing batch, 3 replicates

./tiger task=estimateIndReps outname=example vcf=example.vcf.gz groups=example_sampleGroups.txt

Different error rates for depth=1 and depth=2, two sequencing batches, 3 replicates

./tiger task=estimateIndReps vcf=multipleSets.vcf.gz groups=multipleSets_sampleGroups.txt groupCol=2 batches=multipleSets_sampleGroups.txt batchesCol=3

For more examples see our Individual Replicates Tutorial

estimateHardyWeinberg

This model assumes that all individuals sequenced come from at least one population in Hardy-Weinberg equilibrium. Their genotype frequencies thus are determined solely by the allele frequencies.

Input

VCF file
optional: population association for each sample
optional: batch association

Output

example_errorRates.txt: error rates for all sets (as defined by sequencing depth and batch)
example_alphaBeta.txt: posterior mean of alpha and beta parameters for all populations
example_populationAlleleFrequencies.txt: posterior mean of allele frequencies for all populations
example_twoError_MCMC_chain.txt.gz: MCMC samples for error rates estimated from heterozygous and homozygous sites separately
example_oneError_MCMC_chain.txt.gz: MCMC samples for error rates estimated from all sites combined

Usage example

./tiger task=estimateHardyWeinberg vcf=example.vcf.gz batches=example_sampleGroups.txt batchesCol=3  groups=example_sampleGroups.txt outname=example

For more examples see our Hardy-Weinberg Tutorial

estimateTruthSet

This model assumes that for each individual you have true genotypes and observed genotypes, both represented by samples in your VCF.

Parameters specific to truth set

trueMinDepth: filter out sites that have smaller depth in "true" sample
depthBins: comma separated list of intervals, in format "min-max". Not individual depths but these bins will then define the sets
maxDepth: all sites larger or equal to this depth will be put in the same bin
printTables: print a table for every set that is similar to the table in the models page, but contains the counts of sites in each category

Input

VCF file
sample pairs
optional: batch association

Output

example_errorRates.txt: error rate estimates for all sets (as defined by sequencing depth and batch)
example_errorRates_perIndividual.txt: error rate estimates per individual (ignoring depth and batch)

Usage example

./tiger task=estimateTruthSet vcf=multipleBatches.vcf.gz samplePairs=multipleBatches_samplePairs.txt batches=multipleBatches_sampleGroups.txt batchesCol=3

For more examples see our Truth Set Tutorial