Allele Frequency Likelihoods (AFL)
If sample sizes are known, the recommended way to account for genotype uncertainty when analyzing poolSeq data is to use poolStat to calculate Allele Frequency Likelihoods (AFL) and to then pass them to the program ANGSD for downstream analysis.
poolStat calculates the AFLs as follows:
- The AFLs are calculated in each population for each alleleic combination (AC, AG, At, CG, CT and GT)
- These are then used to find the alleleic combination with the highest joint likelihood across all populations.
- The AFLs for this MLE alleleic combination is the printed for each site to population specific Site Allele Frequency (saf) files that can be used as input to ANGSD.
To generate these saf files, poolStat has to be run in the
AFL mode as follows:
./poolStat task=AFL bamList=pop1.bam,pop2.bam,pop3.bam sampleSize=20,30,15 verbose
bamList: this is a default argument listing the names of bam files (one per population) to be used. Individual files are separated by commas.
sampleSize: a mandatory argument specifying the same sizes for each population. NOTE: the same size is in number of chromosomes (NOT individuals). So 10 diploid individuals correspond to a sample size of 20. Or three tetraploid individuals to a sample size of 12.
window: poolStat advances through the bam files in windows and this argument specifies the size of these windows. Small windows will lead to a computational overhead when parsing bam files, but improve memory handling. The default window size of 10Kb (
window=10000) was found to work well in most cases.
out: the prefix used for general output files (such as the sites file, see below). If not given, the base name of the first bam file is used.
minVariantQ: if given, the output will be restricted to all sites considered polymorphic with quality >= minVariantQ, where the variant quality is calculated as -10 * log10(p-value) with a likelihood-ration test against a monomorphic model.
truncateAFL: log AFL values below which values are truncated to this value. For instance, if truncateAFL=-30, then all log(AFL) < -30 will be set to -30. This is to avoid issues with downstream programs not properly dealing with very small likelihoods such as ANGSD, which requires all log(AFL) > -100. The default value is -100.
chr: optional argument providing a list of chromosomes (separated by commas) on which the analysis will be run. If not given, poolStat will run across the full genome.
limitChr: an optional argument limiting the analysis to the first few n chromosomes.
limitWindows: an optional argument limiting the analysis to the first n windows on each chromosome.
verbose: optional argument asking poolStat to print detailed progress information.
When run this way, poolStat will write the following output files:
A saf file for each population. Saf files are actually collections of three files: the gzipped saf file itself (ending on saf.gz), an index file (ending on saf.idx) and a position file (ending on saf.pos.gz). Seehere for more details on saf files.
A sites file listing all sites included in the saf files.
The sites file will contain the following columns:
- The name of the chromosomes
- The position on that chromosome
- The first allele
- The second allele
- The quality of the variant (the phred-scaled p-value)
- and following: the MLE allele frequency in each population