Tiger / Tutorial Truth Set Model

Simple, 2 sets

Simulate

This command will simulate two samples with 1000 loci, of which one percent are heterozygous. Both samples correspond to sequencing runs from the same individual, but one has perfect genotype calls (called "true") and the other has errors (called "observed"). The error parameter specifies that we want to simulate an error rate of 0.1 for sites where depth=1 and an error rate of 0.2 for sites where depth=2. It also means that only loci with these depths are simulated. To simulate additional depths, provide one error value per depth.

./tiger task=simulate model=truthSet samples=1 sites=1000 error=0.1,0.2 outname=simple
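To make the mapping between error values and depths explicit, the following sketch prints one line per depth for a comma-separated error list. It assumes, as described above, that the i-th value applies to depth i. The helper name error_by_depth is hypothetical and not part of TIGER.

```shell
# Hypothetical helper (not part of TIGER): print "depth<TAB>rate" for each
# value in a comma-separated error list.
# Assumption: the i-th value is the error rate for depth i.
error_by_depth() {
    printf '%s\n' "$1" | tr ',' '\n' | awk '{ printf "%d\t%s\n", NR, $0 }'
}

# The list used in the simulate command above.
error_by_depth "0.1,0.2"
```

Passing a longer list, for example "0.1,0.2,0.3", prints a third line for depth 3.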

This command will produce three files: simple.vcf.gz, the VCF file containing the genotype calls for all samples; simple_samplePairs.txt, which lists each pair of true and observed sample names on one line; and simple_sampleGroups.txt, which provides the error set association (here we only have one observed sample, so it is alone in one error set).
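Before moving on, it can be useful to confirm that the three files listed above exist. The sketch below only uses standard shell tools. The function name check_sim_outputs is hypothetical, and the file names are the ones given in this tutorial for a given outname.

```shell
# Hypothetical helper (not part of TIGER): report which of the three
# simulation outputs exist for a given outname prefix.
# Returns 0 if all are present, 1 otherwise.
check_sim_outputs() {
    prefix=$1
    status=0
    for f in "${prefix}.vcf.gz" "${prefix}_samplePairs.txt" "${prefix}_sampleGroups.txt"; do
        if [ -f "$f" ]; then
            printf 'found    %s\n' "$f"
        else
            printf 'missing  %s\n' "$f"
            status=1
        fi
    done
    return "$status"
}

# Check the outputs of the simulation above (outname=simple).
check_sim_outputs simple || echo "some simulation outputs are missing"
```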

Infer

This command will infer the probability distributions of all parameters, i.e. the error rates. The genotype frequencies do not have to be inferred because they can be counted from the true sample. We don't need to provide the sampleGroups file because we only simulated one pair.

./tiger task=estimateTruthSet vcf=simple.vcf.gz samplePairs=simple_samplePairs.txt

This command estimates the error rates for the different depths and produces the files simple_errorRates.txt, which lists the error rates per depth and error set estimated from all sample pairs, and simple_errorRates_perIndividual.txt, which lists the estimated error rates per individual averaged over all error sets.

Adjust PL

Since we know that the depth has a real influence on the error rate, we use the depth-specific error rates in file simple_errorRates.txt to adjust the genotype likelihoods.

./tiger task=adjustPL vcf=simple.vcf.gz errorRates=simple_errorRates.txt

This command produces the file simple_adjustedPL.vcf.gz, where the PL values have been corrected to reflect the genotyping error. This is the file that should be used in subsequent analyses.
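The three steps of this section can be chained in a small script. The sketch below repeats the commands exactly as given above. The run wrapper and the DRY_RUN variable are hypothetical conveniences, not TIGER features: by default the commands are only printed, and they are executed when DRY_RUN=0.

```shell
# Hypothetical wrapper: print each command, and execute it only if DRY_RUN=0.
DRY_RUN=${DRY_RUN:-1}
run() {
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# Simulate, infer error rates, then adjust the PL values.
# The && chain stops at the first step that fails.
simple_workflow() {
    run ./tiger task=simulate model=truthSet samples=1 sites=1000 error=0.1,0.2 outname=simple &&
    run ./tiger task=estimateTruthSet vcf=simple.vcf.gz samplePairs=simple_samplePairs.txt &&
    run ./tiger task=adjustPL vcf=simple.vcf.gz errorRates=simple_errorRates.txt
}

simple_workflow
```

Each step reads files written by the previous one, which is why the chain stops on the first failure.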

Multiple Pairs, 2 sets

Simulate

This command simulates data in the same way as above except that it produces genotypes for 5 pairs of observed and true samples, leading to a total of 10 samples.

./tiger task=simulate model=truthSet samples=5 sites=1000 error=0.1,0.2 outname=multiplePairs
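A quick way to confirm the expected total of 10 samples is to count the sample columns in the VCF header. The sketch relies only on the VCF format itself, where the first nine columns of the #CHROM line are fixed and sample names start at column 10. The function name count_vcf_samples is hypothetical.

```shell
# Hypothetical helper (not part of TIGER): count the sample columns of a
# gzip-compressed VCF. Columns 1-9 of the #CHROM header line are fixed by
# the VCF format, so everything after them is a sample.
count_vcf_samples() {
    gzip -dc "$1" | awk -F'\t' '/^#CHROM/ { print NF - 9; exit }'
}

# For the simulation above, this should print 10 (5 true + 5 observed).
if [ -f multiplePairs.vcf.gz ]; then
    count_vcf_samples multiplePairs.vcf.gz
fi
```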

Infer

This command infers the error rates per sample pair and per depth, same as above. We don't need to provide the sampleGroups file because we didn't simulate any sequencing batches, i.e. groups of samples that were sequenced separately, which can lead to different error rates.

./tiger task=estimateTruthSet vcf=multiplePairs.vcf.gz samplePairs=multiplePairs_samplePairs.txt

Adjust PL

Since we know that the depth has a real influence on the error rate, we use the depth-specific error rates in file multiplePairs_errorRates.txt to adjust the genotype likelihoods.

./tiger task=adjustPL vcf=multiplePairs.vcf.gz errorRates=multiplePairs_errorRates.txt

This command produces the file multiplePairs_adjustedPL.vcf.gz, where the PL values have been corrected to reflect the genotyping error. This is the file that should be used in subsequent analyses.

Multiple Batches, 4 sets

Simulate

This command simulates data in the same way as above except that it simulates two sequencing batches, i.e. two groups of samples that were sequenced together, leading to different error rates (either 0.1 and 0.2 or 0.004 and 0.003 for depths 1x and 2x). The sequencing batches are assigned to each sample randomly.

./tiger task=simulate model=truthSet samples=5 sites=1000 error=[0.1,0.2],[0.004,0.003] outname=multipleBatches

Infer

This command infers the error rates per set, of which there are four (2 depths x 2 sequencing batches). By default, each set is the intersection of a batch and a depth. The depth of each site is given in the VCF file, so the only additional information TIGER needs to define the sets is the batch association of each sample, which we provide with the sampleGroups file.

./tiger task=estimateTruthSet vcf=multipleBatches.vcf.gz samplePairs=multipleBatches_samplePairs.txt batches=multipleBatches_sampleGroups.txt batchesCol=3
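To see why there are four sets, the sketch below expands the bracketed error specification used in the simulate command into one line per batch and depth. The function name list_sets and the labels batch1, batch2, depth1, depth2 are hypothetical. The ordering (first bracket = first batch, i-th value = depth i) is an assumption based on the description above.

```shell
# Hypothetical helper (not part of TIGER): expand "[a,b],[c,d]" into one
# line per set, formatted as "batch<TAB>depth<TAB>rate".
# Assumption: bracket k is batch k, and the i-th value inside is depth i.
list_sets() {
    printf '%s\n' "$1" \
        | sed -e 's/],\[/;/g' -e 's/^\[//' -e 's/]$//' \
        | tr ';' '\n' \
        | awk -F',' '{ for (d = 1; d <= NF; d++) printf "batch%d\tdepth%d\t%s\n", NR, d, $d }'
}

# The specification used in the simulation above: 2 batches x 2 depths.
list_sets "[0.1,0.2],[0.004,0.003]"
```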

Adjust PL

We use the file multipleBatches_errorRates.txt to adjust the genotype likelihoods because it contains the estimates separately for all sets. To define the sets, TIGER only needs the batch information, because the depth is in the VCF.

./tiger task=adjustPL vcf=multipleBatches.vcf.gz errorRates=multipleBatches_errorRates.txt batches=multipleBatches_sampleGroups.txt batchesCol=3 errorModel=1

This command produces the file multipleBatches_adjustedPL.vcf.gz, where the PL values have been corrected to reflect the genotyping error. This is the file that should be used in subsequent analyses.