Tiger / Tutorial Hardy-Weinberg Model

Simple, 2 sets

Simulate

This command simulates a VCF file containing 30 polymorphic loci from a population of 50 samples in Hardy-Weinberg equilibrium. The allele frequencies are drawn from a symmetric beta distribution B(0.5,0.5), which is U-shaped, as it is assumed that most allele frequencies are extreme (close to 0 or 1). The error parameter specifies an error rate of 0.1 for sites where depth=1 and an error rate of 0.2 for sites where depth=2. It also means that only loci with these depths are simulated. To simulate additional depths, provide additional error values.

./tiger task=simulate model=hardyWeinberg populations=1 samples=50 sites=30 alpha=0.5 beta=0.5 error=0.1,0.2 outname=simple
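To make the simulated model concrete, the following is a minimal sketch of drawing genotypes under Hardy-Weinberg equilibrium with beta-distributed allele frequencies. It is not TIGER's implementation: the function name and seed are hypothetical, and the sketch does not simulate sequencing depth or genotyping error, nor does it discard monomorphic loci.

```python
import random

def simulate_genotypes(n_samples, n_sites, alpha=0.5, beta=0.5, seed=1):
    """Return (frequencies, genotypes); genotypes[site][sample] is the alt-allele count."""
    rng = random.Random(seed)
    frequencies, genotypes = [], []
    for _ in range(n_sites):
        f = rng.betavariate(alpha, beta)  # alternative-allele frequency at this locus
        # Hardy-Weinberg genotype probabilities for 0, 1 and 2 alternative alleles
        weights = [(1 - f) ** 2, 2 * f * (1 - f), f ** 2]
        genotypes.append(rng.choices([0, 1, 2], weights=weights, k=n_samples))
        frequencies.append(f)
    return frequencies, genotypes

# Same dimensions as the command above: 50 samples, 30 sites
freqs, genos = simulate_genotypes(n_samples=50, n_sites=30)
```

With alpha=beta=0.5, many of the drawn frequencies fall near 0 or 1, so a large share of the simulated genotypes are homozygous.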

This command produces four files: simple.vcf.gz, the VCF of the simulated population; simple_trueAlleleFrequencies.txt, which contains the true allele frequency of each locus; simple_sampleGroups.txt, which contains the population and set association of each sample (in this case all samples are simulated to belong to the same population and to have been sequenced as one set); and simple_R_input.txt, which contains the genotype calls for all samples and loci, encoded as 1 for homozygous reference, 2 for heterozygous and 3 for homozygous alternative.
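As an illustration of the 1/2/3 genotype encoding, the sketch below turns the codes for one locus into an estimate of the alternative-allele frequency. The function name and the example calls are hypothetical, the layout of the R input file is not assumed here (the codes are passed as a plain list), and missing genotypes are not handled.

```python
def alt_allele_frequency(codes):
    """Estimate the alternative-allele frequency from genotype codes 1, 2, 3."""
    alt_counts = [c - 1 for c in codes]  # 1 -> 0, 2 -> 1, 3 -> 2 alternative alleles
    return sum(alt_counts) / (2 * len(codes))  # two alleles per diploid sample

locus = [1, 1, 2, 3, 1, 2]  # hypothetical calls for six samples at one locus
frequency = alt_allele_frequency(locus)
```

Such an estimate can be compared against the corresponding entry in the true allele frequency file as a quick sanity check of a simulation.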

Infer

This command will infer the probability distributions of all parameters, i.e. error rates, alpha, beta and the allele frequencies.

./tiger task=estimateHardyWeinberg vcf=simple.vcf.gz groups=simple_sampleGroups.txt outname=simple

The means of the posterior distributions of alpha and beta are in the file simple_alphaBeta.txt. The error rates for the two simulated depths are in the file simple_errorRates.txt.

Adjust PL

This command adjusts the PL values in the VCF file using the error rate estimated from both homozygous and heterozygous sites.

./tiger task=adjustPL vcf=simple.vcf.gz errorRates=simple_errorRates.txt errorModel=1

This command produces the file test_adjustedPL.vcf.gz, where the PL values have been corrected to reflect the genotyping error. This is the file that should be used in subsequent analyses.
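For readers unfamiliar with the PL field: in VCF, PL values are phred-scaled genotype likelihoods, normalized so that the most likely genotype has PL 0. The sketch below only shows the conversion between PL values and relative likelihoods; it does not reproduce how TIGER applies the estimated error rates, and the function names are hypothetical.

```python
import math

def pl_to_likelihoods(pl):
    """Convert phred-scaled PL values to relative genotype likelihoods."""
    return [10 ** (-p / 10) for p in pl]

def likelihoods_to_pl(likelihoods):
    """Convert positive likelihoods to PL, rescaled so the best genotype has PL 0."""
    raw = [-10 * math.log10(l) for l in likelihoods]
    best = min(raw)
    return [round(r - best) for r in raw]

# PL for genotypes 0/0, 0/1, 1/1: the first genotype is the most likely one
likelihoods = pl_to_likelihoods([0, 30, 60])
```

Any correction for genotyping error changes the relative likelihoods of the three genotypes, which is why the adjusted PL values differ from the original ones.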

Multiple populations, 2 sets

Simulate

This command simulates data in the same way as above, except that two populations are simulated, each of which is assumed to be in Hardy-Weinberg equilibrium.

./tiger task=simulate model=hardyWeinberg populations=2 samples=50 sites=30 alpha=0.5 beta=0.5 error=0.1,0.2 outname=multiplePops

Infer

This command will infer the probability distributions of all parameters, i.e. error rates, alpha, beta and the allele frequencies. Separate alpha and beta values and allele frequencies are estimated for the two populations. Note the additional parameter "groupCol", as compared to the simple example. This parameter tells TIGER which column of the "groups" file contains the population association of each sample.

./tiger task=estimateHardyWeinberg vcf=multiplePops.vcf.gz groups=multiplePops_sampleGroups.txt groupCol=2 outname=multiplePops
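The role of a column selector such as groupCol can be sketched as follows. This is an illustration only: it assumes a whitespace-delimited file without a header line and 1-based column numbering, neither of which is confirmed by this tutorial, and the function name and sample lines are hypothetical.

```python
def read_column(lines, col):
    """Return the values of the 1-based column `col` from whitespace-delimited lines."""
    values = []
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            values.append(fields[col - 1])
    return values

# Hypothetical layout: sample name, population, set
example = ["sample1 1 1", "sample2 1 1", "sample3 2 1"]
populations = read_column(example, col=2)
```

Under these assumptions, groupCol=2 would pick out the population column of the sample groups file.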

Adjust PL

This command adjusts the PL values of homozygous individuals in the VCF file using the error rate estimated from the homozygous sites, and the PL values of heterozygous individuals using the error rate estimated from the heterozygous sites.

./tiger task=adjustPL vcf=multiplePops.vcf.gz errorRates=multiplePops_errorRates.txt errorModel=2

This command produces the file test_adjustedPL.vcf.gz, where the PL values have been corrected to reflect the genotyping error. This is the file that should be used in subsequent analyses.

Multiple batches, 4 sets

Simulate

This command will simulate different error sets, where half of the samples will be simulated with error rate=0.1 for depth=1 and error rate=0.2 for depth=2, and the other half of the samples will be simulated with error rate=0.004 for depth=1 and error rate=0.003 for depth=2. This type of data could be observed in real life if you sequenced some of your samples in different runs, leading to different sequencing error rates for different "sequencing batches".

./tiger task=simulate model=hardyWeinberg populations=2 samples=50 sites=30 alpha=0.5 beta=0.5 error=[0.1,0.2],[0.004,0.003] outname=multipleBatches
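The error specification above defines four error sets, one per combination of batch and depth. The sketch below spells out that mapping. It is not TIGER's parser: the function name is hypothetical, and numbering batches and depths from 1 in the order given is an assumption.

```python
import re

def parse_error_sets(spec):
    """Map (batch, depth) -> error rate for a spec like "[0.1,0.2],[0.004,0.003]"."""
    # A spec without brackets, e.g. "0.1,0.2", is treated as a single batch
    groups = re.findall(r"\[([^\]]*)\]", spec) or [spec]
    table = {}
    for batch, group in enumerate(groups, start=1):
        for depth, rate in enumerate(group.split(","), start=1):
            table[(batch, depth)] = float(rate)
    return table

table = parse_error_sets("[0.1,0.2],[0.004,0.003]")
# table[(1, 1)] is the rate for batch 1 at depth=1, table[(2, 2)] for batch 2 at depth=2
```

Two batches with two depths each give the four sets referred to in the heading of this section.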

Infer

This command will infer the probability distributions of all parameters, i.e. error rates, alpha, beta. Separate error rates are estimated for the different sets. Note the additional parameters "batches" and "batchesCol", as compared to the example above. These parameters tell TIGER where to find the set that each sample is associated with, namely in the file provided with "batches" and in the column provided with "batchesCol".

./tiger task=estimateHardyWeinberg vcf=multipleBatches.vcf.gz groups=multipleBatches_sampleGroups.txt groupCol=2 batches=multipleBatches_sampleGroups.txt batchesCol=3 outname=multipleBatches

Adjust PL

This command adjusts the PL values of homozygous individuals in the VCF file using the error rate estimated from the homozygous sites, and the PL values of heterozygous individuals using the error rate estimated from the heterozygous sites.

./tiger task=adjustPL vcf=multipleBatches.vcf.gz errorRates=multipleBatches_errorRates.txt errorModel=2

This command produces the file test_adjustedPL.vcf.gz, where the PL values have been corrected to reflect the genotyping error. This is the file that should be used in subsequent analyses.
