Wiki
Clone wikiApproxWF / Estimating parameters (task estimate)
Estimating parameters (task estimate)
To estimate parameters with ApproxWF, specify task=estimate
.
Data file
The data has to be provided as a file, the name of which is given with the argument loci
.
The data file is organized in columns, with the first indicating the name of the locus, and the following columns indicating the observed allele frequencies observed at different time points (one column per time point). The first row must be a header giving the generation number of the time points. Note that only differences between time points are used, and hence the two time points 1 and 11 can also be givesn as, for instance, 51 and 61.
An example of such a file example.loci may look like this:
time 10 50 200 240 L1 64/100 24/100 55/100 47/100 L2 64/100 26/100 56/100 63/100 L3 59/100 - 57/100 75/100 L4 49/100 34/100 0/0 55/100
As shown in this example, observed allele frequencies are specified in the format allele counts / sample size. There is no requirement to polarize mutations, as long as the counts of the same allele is given for every time point.
Missing data can be indicated in two ways. First, any string not containing a '/' will be treated as missing data. Second, any sample of size 0 will be ignored too.
Transposed files
Transposed files can be provided using the argument lociTransposed
. These files are also organized in columns, but the first now indicates the time (in generations) at which the sample was taken, and the following columns indicating the observed allele frequencies observed at different loci (one column per locus). While the first row is reserved for a mandatory header containing unique names for each locus, each of the following lines contains the observed allele frequencies of another time point.
An example of such a file example_transposed.loci may look like this:
time L1 L2 L3 10 64/100 24/100 55/100 50 64/100 26/100 56/100 120 59/100 - 57/100 200 49/100 34/100 0/0
Filters
ApproxWF allows to filter the input data in two ways:
- Using the argument
minNumTimePointFilter
, only loci for which both alleles are observed atminNumTimePointFilter
time points or more will be kept. - Using the argument
maxReadLoci
, the analysis can be limited to the first set of loci in the files that pass the above filter. Default value is one million.
These filters work on regular files only (but not on transposed files).
Multiple Replicates
ApproxWF can infer parameters from data of multiple replicates, in which case ApproxWF assumes the parameters to be the same for each data set (e.g. only one selection coefficient will be estimated for each locus).
For regular files, simply add all time points to the same file, and then indicate the beginning of replicates using the argument replicates
:
loci=example.loci replicates=10,200
In the above example, the first two time points (10 and 50) will be used as the first replicate, and the second pair (200 and 240) as the second replicate. Note that the times indicated in the locus file MUST be increasing and can not be repeated. But since only time differences matter, this example would result in two replicates, each with two time points 40 generations apart.
for transposed files, simply provide multiple data file names separated by a comma using the argument lociTransposed
:
lociTransposed=example_transposed.loci,example_transposed2.loci,example_transposed3.loci
If multiple data sets are provided, ApproxWF will read the full set of Loci from the first data set, and then add additional data for those from the additional data set. Loci not present in the first data set, however, will be ignored.
Launching an estimation
The minimal command line to run an estimation is
ApproxWF task=estimate loci=example.loci
In most cases, however, it will be desired to modify the default parameters to suit the particular inference planned. All arguments affecting the estimations are detailed in the following.
Estimation output
The estimation output is a file listing the parameter values samples during the MCMC chain. Each row of this file corresponds to one multi-dimensional parameter vector sampled from the posterior distribution. The first two columns print the index of the sample (the iteration number during the MCMC) and the log-likelihood of that parameter vector. The remaining columns list the parameter values sampled. An example of such a file looks like this:
index LL log10(N) N s_0 s_1 s_2 0 -188.069 4.92553 84243 -0.0141649 0.15224 -0.117764 10 -167.745 5.00259 100597 -0.00650263 0.159497 -0.0580638 20 -151.913 5.26158 182633 0.00982114 0.13461 -0.0230714 30 -141.208 5.12178 132366 -0.00873241 0.125176 0.00177488
Note that the population size is listed both on the log10 as well as on the natural scale. Also, the file only contains the parameters that are estimated (see below on how to fix some parameters).
The following arguments modify the out:
- The argument
outName
specifies the name of the file to be written. The default name is ApproxWF_MCMC_output.txt. - By default, the parameter samples of each iteration are written, resulting in very large files. For posterior estimation, only a subsample of these is required. To tell ApproxWF to only print a subset, use the argument
sampling
. This argument takes an integer value n and result in only every nth parameter vector to be printed.
Check out here how to estimate and plot posterior distributions from this file.
Prior distributions
While the shape of the prior distributions is hard-coded, it can be tuned. Also, each of the model parameter can be fixed to a known value.
Population size N
By default, the population size is estimated using a log-uniform prior is assumed such that \(\log_{10}(N) \sim U(logN_{min}, logN_{max})\). The min and max of that distribution is set with the arguments logN_min
and logN_max
and the default is 1.0 and 8.0, respectively.
To fix the population size to a specific value, use the argument N
.
Selection coefficients s
By default, locus specific selection coefficients are estimates using a normal prior such that \(s_l ~ N(\mu_s, \sigma_s)\), but truncated at -1.0 and 1.0. The mean \(\mu_s\) and standard deviation \(\sigma_s\) of this distribution are set with the arguments s_mean
and s_sd
and the default values are 0.0 and 0.1.
To fix all selection coefficients to a specific value, use the argument s
. There is currently no way to fix values for only a subset of loci.
Dominance coefficients h
By default, locus specific dominance coefficients are estimates using a normal prior such that \(h_l ~ N(\mu_h, \sigma_h)\), but truncated at -1.0 and 1.0. The mean \(\mu_h\) and standard deviation \(\sigma_h\) of this distribution are set with the arguments h_mean
and h_sd
and the default values are 0.5 and 0.1.
To fix all dominance coefficients to a specific value, use the argument h
. There is currently no way to fix values for only a subset of loci.
Sequencing error rate \(\epsilon\)
By default, the sequencing error rate \(\epsilon\) is assumed to be 0.0 and not estimated. To fix it to a different value, use the argument errorRate
.
To estimate the sequencing error rate \(\epsilon\), add the argument estimateErrorRate
, or any of the arguments affecting its prior distribution. This distribution is log-uniform such that \(log_{10}(\epsilon) \sim U(log\epsilon_{min}, log\epsilon_{max})\). The min and max of that distribution is set with the arguments logErrorRate_min
and logErrorRate_max
and the default values are -4.0 and -0.3, respectively.
Mutation rate \(\mu\)
By default, the mutation rate \(\mu\) is assumed to be 0.0 and not estimated. To fix it to a different value, use the argument mutRate
.
To estimate the mutation rate \(\epsilon\), add the argument estimateMutRate
, or any of the arguments affecting its prior distribution. This distribution is log-uniform such that \(log_{10}(\mu) \sim U(log\mu_{min}, log\mu_{max})\). The min and max of that distribution is set with the arguments logMutRate_min
and logMutRate_max
and the default values are -6.0 and -1.0, respectively.
MCMC tuning: proposal distributions
The performance of the MCMC depends critically on choice of proposal kernels. The following kernels are implemented and can be tunes as described below:
- We use a normal kernel for the population size \(\log_{10}(N)\) with standard deviation specified by argument
logN_step
. The default value is 0.05, which we found to work well in most cases. - We use a normal kernel for selection coefficients s with standard deviation specified by argument
s_step
. The default value is 0.1 * the standard deviation of the prior \(\sigma_s\) (see above), which we found to work well in most cases. - We use a normal kernel for dominance coefficients h with standard deviation specified by argument
h_step
. The default value is 0.1 * the standard deviation of the prior \(\sigma_h\) (see above), which we found to work well in most cases. - We use a mixture of two different kernels for sequencing error rate \(\epsilon\). In 80% of the cases, :math;`log_{10}(epsilon)` is updated using a normal kernel with standard deviation specified by argument
logErrorRate_step
. The default value is 0.01, which we found to work well in most cases. But since we found that the error rate can get stuck at very small values, we also use a uniform kernel over the whole range \(U(log\epsilon_{min}, log\epsilon_{max})\) in 20% of the updates. - We use a normal kernel for the mutation rate \(\log_{10}(\mu)\) with standard deviation specified by argument
logMutRate_step
. The default value is 0.01, which we found to work well in most cases.
Further MCMC tuning
The following arguments are used to tune the MCMC alogirthm:
- The length of MCMC chain is specified with the argument
MCMCLength
. The default value is 10,000 iterations, which is likely to short. - The number of frequency states to be used is specified with the argument
nStates
. The default is 51, which we found to be suitable for most cases. Using more states will increase accuracy at the cost of computation time (Note: computation time grows quadratically with the number of states!). - By default, states are distributed using a quadradic grid. To use a uniform grid over the states use the argument
uniformStates
. - The number of CPU cores to be used is specified with the argument
nCores
. While the default is one core, run-time is reduced almost proportionally to the number of cores used. Using 5 cores, for instance, reduces the computation time about 5 fold. Note, however, that the parallelization is by locus, so using 2 or 3 cores on 4 loci will result in the same run-time. - The argument
gammaThreshold
defines the threshold of \(\gamma_s = 2Ns_k(u_2 - u_1)\) above which elements of the transition matrix will be calculated with the approximate formulae for large \(\gamma_s\) rather than through numerical integration. The default value is 10, which we found to work fine in most cases.
Updated