Wiki

Clone wiki

Tiger / infer

Overview

We want to infer the per-allele error rate in our genotype calls produced with RAD-sequencing data. This can be using one of the following models, which are based on different external information or data. The only thing that differs between the models is the likelihood function. The tasks for the different models are called: estimateHardyWeinberg, estimateIndReps, estimateTruthSet

For all likelihood models two different error rate estimations are performed. 1. Estimating a single error rate for all sites 2. Estimating a separate error rate for truly homozygous sites and truly heterozygous sites.

Basic parameters

  • vcf: The minimum input is a VCF file
  • outname: By default the output uses the vcf filename as a prefix but can be changed with this parameter
  • limitLines: Read only a given amount of lines from VCF

Sample groups

By default, all samples are considered to belong to the same group. The definition of the group depends on the model used. If the Hardy Weinberg model is used, the groups are the populations. If the replicates model is used, the groups are the replicate groups and each group would only contain data from one individual.

If the truth set model is a bit different because the number of samples per group must always be two. For this model you must always also provide another file containing the information (see parameter samplePairs)

Parameters:

  • groups: Specify the samples that should be in the same group as [Sample1,Sample2],[Sample3,Sample4], where all samples must be contained in one group. You can also provide a text file where the first column contains the sample names and one of the following columns is contains a group assignment. This text file must contain a header.
  • groupCol: Indicate which column to read group assignments from when using groups to specify an assignment file. Default = 2
  • samplePairs: Provide a text file with two columns: The first contains the "observed" sample names and the second the corresponding "true" sample names.

Sequencing batches

By default, all samples are considered to have the same error rates per depth. However, you may have samples that were sequenced in separate runs and thus have different error rates. want to create error classes, e.g. if some samples were sequenced in different runs and you have reason to believe that this affects the error rate.

Parameters:

  • batches: Specify the samples that should be in the same sequencing batch as [Sample1,Sample2],[Sample3,Sample4], where all samples must be contained in one class. You can also provide a text file where the first column is the sample names and one of the following columns is contains a error class assignment. This text file must contain a header.
  • batchesCol: Indicate which column to read batch assignments from, when using batches to specify an assignment file. Default = 3

Inference algorithm

For all models you can choose if you want to infer the parameters with an EM-algorithm and obtain the maximum likelihood estimates or if you want to use an MCMC and obtain the posterior densities of the parameters. The parameter for this is algo and you can set it to either MLE or Bayes.

Parameters:

  • algo: Can be set to either "MLE" or "Bayes". Default for the Hardy-Weinberg model is Bayes and for the rest MLE.
  • numIter: Control the maximum number of iterations of the EM algorithm or MCMC algorithm. Default for EM = 1000 default for MCMC = 100000.
  • minDelta: Assume EM-algorithm to have converged when the parameter value does not change by more than minDelta. default = 0.000001
  • minEpsIter: Minimum amount of EM iterations to be run, even if algorithm has converged
  • maxEpsIter: Minimum amount of EM iterations to be run, even if algorithm has not converged
  • burnin: Length of burnin in MCMC. Default = 5000
  • numBurnins: Number of burnin rounds in MCMC (used to adjust proposal range for optimal acceptance rate). Default = 5
  • thinning: After how many MCMC iterations the sampled values should be printed to file. Default = 10.
  • epsLambda: The parameter used for the exponential prior on the error rates

estimateIndReps

This model assumes that the same individual was sequenced in different sequencing runs.

Input

  • VCF file
  • optional: replicate group association for each sample
  • optional: batch association

Output

  • example_errorRates.txt: estimates of error rates for all sets (as defined by sequencing depth and batch)
  • example_individualGenotypeFrequencies.txt: estimates of genotype frequencies for all replicate groups

Usage example

Different error rates for depth=1 and depth=2, only one sequencing batch, 3 replicates

./tiger task=estimateIndReps outname=example vcf=example.vcf.gz groups=example_sampleGroups.txt

Different error rates for depth=1 and depth=2, two sequencing batches, 3 replicates

./tiger task=estimateIndReps vcf=multipleSets.vcf.gz groups=multipleSets_sampleGroups.txt groupCol=2 batches=multipleSets_sampleGroups.txt batchesCol=3

For more examples see our Individual Replicates Tutorial

estimateHardyWeinberg

This model assumes that all individuals sequenced come from at least one population in Hardy-Weinberg equilibrium. Their genotype frequencies thus are determined solely by the allele frequencies.

Input

  • VCF file
  • optional: population association for each sample
  • optional: batch association

Output

  • example_errorRates.txt: error rates for all sets (as defined by sequencing depth and batch)
  • example_alphaBeta.txt: posterior mean of alpha and beta parameters for all populations
  • example_populationAlleleFrequencies.txt: posterior mean of allele frequencies for all populations
  • example_twoError_MCMC_chain.txt.gz: MCMC samples for error rates estimated from heterozygous and homozygous sites separately
  • example_oneError_MCMC_chain.txt.gz: MCMC samples for error rates estimated from all sites combined

Usage example

./tiger task=estimateHardyWeinberg vcf=example.vcf.gz batches=example_sampleGroups.txt batchesCol=3  groups=example_sampleGroups.txt outname=example

For more examples see our Hardy-Weinberg Tutorial

estimateTruthSet

This model assumes that for each individual you have true genotypes and observed genotypes, both represented by samples in your VCF.

Parameters specific to truth set

  • trueMinDepth: filter out sites that have smaller depth in "true" sample
  • depthBins: comma separated list of intervals, in format "min-max". Not individual depths but these bins will then define the sets
  • maxDepth: all sites larger or equal to this depth will be put in the same bin
  • printTables: print a table for every set that is similar to the table in the models page, but contains the counts of sites in each category

Input

  • VCF file
  • sample pairs
  • optional: batch association

Output

  • example_errorRates.txt: error rate estimates for all sets (as defined by sequencing depth and batch)
  • example_errorRates_perIndividual.txt: error rate estimates per individual (ignoring depth and batch)

Usage example

./tiger task=estimateTruthSet vcf=multipleBatches.vcf.gz samplePairs=multipleBatches_samplePairs.txt batches=multipleBatches_sampleGroups.txt batchesCol=3

For more examples see our Truth Set Tutorial

Updated