
Estimating parameters (task estimate)

To estimate parameters with ApproxWF, specify task=estimate.

Data file

The data has to be provided as a file, the name of which is given with the argument loci.

The data file is organized in columns: the first indicates the name of the locus, and each following column gives the allele frequencies observed at a different time point (one column per time point). The first row must be a header giving the generation number of each time point. Note that only differences between time points are used, and hence the two time points 1 and 11 can also be given as, for instance, 51 and 61.

An example of such a file example.loci may look like this:

time 10      50      200     240
L1   64/100  24/100  55/100  47/100
L2   64/100  26/100  56/100  63/100
L3   59/100  -       57/100  75/100
L4   49/100  34/100  0/0     55/100

As shown in this example, observed allele frequencies are specified in the format allele counts / sample size. There is no requirement to polarize mutations, as long as the counts of the same allele are given at every time point.

Missing data can be indicated in two ways. First, any string not containing a '/' will be treated as missing data. Second, any sample of size 0 will be ignored too.
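The format and missing-data conventions above can be sketched with a small parser (illustrative only; this is not ApproxWF's own code):

```python
# Minimal sketch of a parser for the loci file format described above.
# Both missing-data conventions are handled: any field without a '/'
# and any sample of size 0 are treated as missing.

def parse_loci_file(lines):
    """Parse a loci file into (times, {locus: list of (count, size) or None})."""
    header = lines[0].split()
    times = [int(t) for t in header[1:]]  # first column holds locus names
    loci = {}
    for line in lines[1:]:
        fields = line.split()
        name, observations = fields[0], []
        for field in fields[1:]:
            if '/' not in field:          # any string without '/' is missing data
                observations.append(None)
                continue
            count, size = (int(x) for x in field.split('/'))
            observations.append(None if size == 0 else (count, size))
    # sample of size 0 is ignored too ^
        loci[name] = observations
    return times, loci

example = [
    "time 10      50      200     240",
    "L1   64/100  24/100  55/100  47/100",
    "L3   59/100  -       57/100  75/100",
    "L4   49/100  34/100  0/0     55/100",
]
times, loci = parse_loci_file(example)
print(times)          # [10, 50, 200, 240]
print(loci["L3"][1])  # None (missing data)
print(loci["L4"][2])  # None (sample of size 0)
```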

Transposed files

Transposed files can be provided using the argument lociTransposed. These files are also organized in columns, but the first now indicates the time (in generations) at which the sample was taken, and each following column gives the allele frequencies observed at a different locus (one column per locus). While the first row is reserved for a mandatory header containing unique names for each locus, each of the following lines contains the observed allele frequencies at another time point.

An example of such a file example_transposed.loci may look like this:

time L1      L2      L3
10   64/100  24/100  55/100
50   64/100  26/100  56/100
120  59/100  -       57/100
200  49/100  34/100  0/0

Filters

ApproxWF allows the input data to be filtered in two ways:

  • Using the argument minNumTimePointFilter, only loci for which both alleles are observed at minNumTimePointFilter time points or more will be kept.
  • Using the argument maxReadLoci, the analysis can be limited to the first maxReadLoci loci in the file that pass the above filter. The default value is one million.

These filters work on regular files only, not on transposed files.
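The logic of minNumTimePointFilter can be sketched as follows. A reasonable reading of "both alleles are observed" is that the allele count at that time point is strictly between 0 and the sample size (this interpretation is an assumption, not taken from ApproxWF's source):

```python
# Sketch of the minNumTimePointFilter logic: a time point counts only if
# both alleles are observed there, i.e. 0 < count < sample size.
# Illustrative reimplementation, not ApproxWF's actual code.

def passes_filter(observations, min_num_time_points):
    """observations: list of (count, size) tuples or None for missing data."""
    polymorphic = sum(
        1 for obs in observations
        if obs is not None and 0 < obs[0] < obs[1]
    )
    return polymorphic >= min_num_time_points

# L4 from the example: 0/0 is missing, the other three time points are polymorphic
print(passes_filter([(49, 100), (34, 100), None, (55, 100)], 3))  # True
# A locus fixed at two of three time points has only one informative time point
print(passes_filter([(100, 100), (0, 100), (12, 100)], 2))        # False
```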

Multiple Replicates

ApproxWF can infer parameters from data of multiple replicates, in which case it assumes the parameters to be the same across data sets (e.g. only one selection coefficient will be estimated per locus).

For regular files, simply add all time points to the same file, and then indicate the beginning of replicates using the argument replicates:

loci=example.loci replicates=10,200

In the above example, the first two time points (10 and 50) will be used as the first replicate, and the second pair (200 and 240) as the second replicate. Note that the times indicated in the locus file MUST be strictly increasing and cannot be repeated. But since only time differences matter, this example results in two replicates, each with two time points 40 generations apart.
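The partitioning into replicates can be sketched as follows (illustrative only; it assumes each value passed via replicates appears in the header of the loci file):

```python
# Sketch of how the header time points are partitioned into replicates,
# given the start times passed via the replicates argument.
# Assumes each start time appears among the header time points.

def split_replicates(times, starts):
    boundaries = [times.index(s) for s in starts] + [len(times)]
    return [times[boundaries[i]:boundaries[i + 1]] for i in range(len(starts))]

print(split_replicates([10, 50, 200, 240], [10, 200]))  # [[10, 50], [200, 240]]
```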

For transposed files, simply provide multiple data file names separated by commas using the argument lociTransposed:

lociTransposed=example_transposed.loci,example_transposed2.loci,example_transposed3.loci

If multiple data sets are provided, ApproxWF will read the full set of loci from the first data set and then add data for those loci from the additional data sets. Loci not present in the first data set, however, will be ignored.

Launching an estimation

The minimal command line to run an estimation is

ApproxWF task=estimate loci=example.loci

In most cases, however, you will want to modify the default parameters to suit the particular inference planned. All arguments affecting the estimation are detailed in the following.

Estimation output

The estimation output is a file listing the parameter values sampled during the MCMC chain. Each row of this file corresponds to one multi-dimensional parameter vector sampled from the posterior distribution. The first two columns give the index of the sample (the iteration number during the MCMC) and the log-likelihood of that parameter vector. The remaining columns list the sampled parameter values. An example of such a file looks like this:

index   LL          log10(N)   N         s_0            s_1         s_2
0       -188.069    4.92553    84243     -0.0141649     0.15224     -0.117764
10      -167.745    5.00259    100597    -0.00650263    0.159497    -0.0580638
20      -151.913    5.26158    182633     0.00982114    0.13461     -0.0230714
30      -141.208    5.12178    132366    -0.00873241    0.125176     0.00177488

Note that the population size is listed both on the log10 as well as on the natural scale. Also, the file only contains the parameters that are estimated (see below on how to fix some parameters).

The following arguments modify the output:

  • The argument outName specifies the name of the file to be written. The default name is ApproxWF_MCMC_output.txt.
  • By default, the parameter samples of each iteration are written, resulting in very large files. For posterior estimation, only a subsample of these is required. To tell ApproxWF to only print a subset, use the argument sampling. This argument takes an integer value n and results in only every nth parameter vector being printed.

See here for how to estimate and plot posterior distributions from this file.
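As a minimal example of such post-processing, a column of the output file (say log10(N)) can be summarized by discarding a burn-in fraction and reporting the posterior mean and an equal-tailed 95% credible interval. This is an illustrative sketch, not an ApproxWF feature:

```python
# Sketch: discard burn-in and summarize one parameter column of the
# MCMC output file with the posterior mean and a 95% credible interval.

import random

def summarize(samples, burn_in_fraction=0.1):
    kept = sorted(samples[int(len(samples) * burn_in_fraction):])
    n = len(kept)
    mean = sum(kept) / n
    lower = kept[int(0.025 * (n - 1))]   # 2.5% quantile (nearest-rank)
    upper = kept[int(0.975 * (n - 1))]   # 97.5% quantile
    return mean, (lower, upper)

# Toy chain for log10(N): 1000 draws centered on 5.0
random.seed(0)
chain = [random.gauss(5.0, 0.1) for _ in range(1000)]
mean, (lo, hi) = summarize(chain)
print(round(mean, 1))  # 5.0
```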

Prior distributions

While the shape of the prior distributions is hard-coded, they can be tuned. Also, each of the model parameters can be fixed to a known value.

Population size N

By default, the population size is estimated assuming a log-uniform prior such that \(\log_{10}(N) \sim U(logN_{min}, logN_{max})\). The min and max of that distribution are set with the arguments logN_min and logN_max; the defaults are 1.0 and 8.0, respectively.

To fix the population size to a specific value, use the argument N.
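Drawing from this log-uniform prior can be sketched as follows (illustrative only, using the default bounds from the text):

```python
# Sketch of the log-uniform prior on the population size:
# log10(N) ~ U(logN_min, logN_max), with defaults logN_min=1.0, logN_max=8.0.

import random

def draw_N(log_n_min=1.0, log_n_max=8.0):
    return 10 ** random.uniform(log_n_min, log_n_max)

random.seed(1)
draws = [draw_N() for _ in range(1000)]
print(all(10 <= n <= 1e8 for n in draws))  # True
```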

Selection coefficients s

By default, locus-specific selection coefficients are estimated using a normal prior such that \(s_l \sim N(\mu_s, \sigma_s)\), but truncated at -1.0 and 1.0. The mean \(\mu_s\) and standard deviation \(\sigma_s\) of this distribution are set with the arguments s_mean and s_sd; the default values are 0.0 and 0.1, respectively.

To fix all selection coefficients to a specific value, use the argument s. There is currently no way to fix values for only a subset of loci.

Dominance coefficients h

By default, locus-specific dominance coefficients are estimated using a normal prior such that \(h_l \sim N(\mu_h, \sigma_h)\), but truncated at -1.0 and 1.0. The mean \(\mu_h\) and standard deviation \(\sigma_h\) of this distribution are set with the arguments h_mean and h_sd; the default values are 0.5 and 0.1, respectively.

To fix all dominance coefficients to a specific value, use the argument h. There is currently no way to fix values for only a subset of loci.
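The truncated normal priors on s and h can be sketched with simple rejection sampling (illustrative only; ApproxWF's internal sampling scheme is not documented here):

```python
# Sketch of the truncated normal priors on s_l and h_l: draw from
# N(mean, sd) and reject values outside [-1, 1].
# Defaults taken from the text: s ~ N(0.0, 0.1), h ~ N(0.5, 0.1).

import random

def draw_truncated_normal(mean, sd, low=-1.0, high=1.0):
    while True:
        x = random.gauss(mean, sd)
        if low <= x <= high:
            return x

random.seed(2)
s = [draw_truncated_normal(0.0, 0.1) for _ in range(1000)]  # prior on s_l
h = [draw_truncated_normal(0.5, 0.1) for _ in range(1000)]  # prior on h_l
print(all(-1.0 <= x <= 1.0 for x in s + h))  # True
```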

Sequencing error rate \(\epsilon\)

By default, the sequencing error rate \(\epsilon\) is assumed to be 0.0 and not estimated. To fix it to a different value, use the argument errorRate.

To estimate the sequencing error rate \(\epsilon\), add the argument estimateErrorRate, or any of the arguments affecting its prior distribution. This distribution is log-uniform such that \(\log_{10}(\epsilon) \sim U(log\epsilon_{min}, log\epsilon_{max})\). The min and max of that distribution are set with the arguments logErrorRate_min and logErrorRate_max; the default values are -4.0 and -0.3, respectively.

Mutation rate \(\mu\)

By default, the mutation rate \(\mu\) is assumed to be 0.0 and not estimated. To fix it to a different value, use the argument mutRate.

To estimate the mutation rate \(\mu\), add the argument estimateMutRate, or any of the arguments affecting its prior distribution. This distribution is log-uniform such that \(\log_{10}(\mu) \sim U(log\mu_{min}, log\mu_{max})\). The min and max of that distribution are set with the arguments logMutRate_min and logMutRate_max; the default values are -6.0 and -1.0, respectively.

MCMC tuning: proposal distributions

The performance of the MCMC depends critically on the choice of proposal kernels. The following kernels are implemented and can be tuned as described below:

  • We use a normal kernel for the population size \(\log_{10}(N)\) with standard deviation specified by argument logN_step. The default value is 0.05, which we found to work well in most cases.
  • We use a normal kernel for selection coefficients s with standard deviation specified by argument s_step. The default value is 0.1 * the standard deviation of the prior \(\sigma_s\) (see above), which we found to work well in most cases.
  • We use a normal kernel for dominance coefficients h with standard deviation specified by argument h_step. The default value is 0.1 * the standard deviation of the prior \(\sigma_h\) (see above), which we found to work well in most cases.
  • We use a mixture of two different kernels for the sequencing error rate \(\epsilon\). In 80% of the updates, \(\log_{10}(\epsilon)\) is updated using a normal kernel with standard deviation specified by argument logErrorRate_step. The default value is 0.01, which we found to work well in most cases. But since we found that the error rate can get stuck at very small values, the remaining 20% of the updates use a uniform kernel over the whole range \(U(log\epsilon_{min}, log\epsilon_{max})\).
  • We use a normal kernel for the mutation rate \(\log_{10}(\mu)\) with standard deviation specified by argument logMutRate_step. The default value is 0.01, which we found to work well in most cases.
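The proposal kernels above can be sketched as follows. This is an illustrative reimplementation of the two kernel types described (a plain normal kernel and the 80/20 normal/uniform mixture), using the default step sizes and bounds from the text:

```python
# Sketch of the proposal kernels: a normal kernel for log10(N), and for
# log10(epsilon) a mixture that takes a small normal step in 80% of the
# updates and a uniform draw over the whole prior range in the other 20%.

import random

def propose_logN(current, step=0.05):
    return random.gauss(current, step)       # normal kernel, default step 0.05

def propose_log_error_rate(current, step=0.01, lo=-4.0, hi=-0.3):
    if random.random() < 0.8:
        return random.gauss(current, step)   # local normal move (80%)
    return random.uniform(lo, hi)            # global uniform jump (20%)

random.seed(3)
print(propose_logN(5.0))  # a value close to 5.0
```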

Further MCMC tuning

The following arguments are used to tune the MCMC algorithm:

  • The length of the MCMC chain is specified with the argument MCMCLength. The default value is 10,000 iterations, which is likely too short.
  • The number of frequency states to be used is specified with the argument nStates. The default is 51, which we found to be suitable for most cases. Using more states will increase accuracy at the cost of computation time (Note: computation time grows quadratically with the number of states!).
  • By default, states are distributed on a quadratic grid. To use a uniform grid over the states, use the argument uniformStates.
  • The number of CPU cores to be used is specified with the argument nCores. While the default is one core, run-time is reduced almost proportionally to the number of cores used. Using 5 cores, for instance, reduces the computation time about 5-fold. Note, however, that the parallelization is by locus, so using 2 or 3 cores on 4 loci will result in the same run-time (one core still has to process 2 loci in both cases).
  • The argument gammaThreshold defines the threshold of \(\gamma_s = 2Ns_k(u_2 - u_1)\) above which elements of the transition matrix will be calculated with the approximate formulae for large \(\gamma_s\) rather than through numerical integration. The default value is 10, which we found to work fine in most cases.
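To illustrate the difference between the two state grids, here is one plausible construction of a quadratic grid that places states more densely near the boundary frequencies 0 and 1. Note that this particular mapping is an assumption for illustration; ApproxWF's exact spacing is not documented here:

```python
# Sketch of the two frequency grids on [0, 1]: a uniform grid and a
# symmetric quadratic grid that concentrates states near the boundaries.
# The quadratic mapping used here is illustrative, not ApproxWF's exact one.

def uniform_grid(n_states=51):
    return [i / (n_states - 1) for i in range(n_states)]

def quadratic_grid(n_states=51):
    grid = []
    for i in range(n_states):
        u = i / (n_states - 1)
        # piecewise quadratic: denser near 0 and 1 than a uniform grid
        x = 2 * u * u if u <= 0.5 else 1 - 2 * (1 - u) ** 2
        grid.append(x)
    return grid

print(uniform_grid(5))    # [0.0, 0.25, 0.5, 0.75, 1.0]
print(quadratic_grid(5))  # [0.0, 0.125, 0.5, 0.875, 1.0]
```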
