Wiki

Clone wiki

Flink / Parameter Estimation

Parameter Estimation (task estimate)

To infer the values of the parameters with Flink, use the task “estimate”. The data has to be provided as a file, the name of which is given with the argument data (to see how an input file has to look, read the paragraph “Input File” in the “Launching Flink” chapter).

One group

You can infer the parameter of only one group if you give in input populations belonging to the same group. You can fix or infer the following parameters (to see how to fix and infer them, see the paragraphes “Fixing a parameter” and “Inferring a parameter” of the section Launch Flink):

Maximum coefficient of selection of a group (argument alpha_max)

The maximum coefficient of selection αmax is estimated using an exponential prior with mean μα and the parameter of the distribution λα , often called the rate parameter, such that αmax ∼ Exp(λα).

Coefficient of drift of a population (argument beta)

The coefficient of drift β is estimated using a normal prior with mean μβ and the standard deviation σβ, such that β ∼ N(μβ,σβ). The range of the values that β can assume belongs to the interval [-10.0, 0.0].

Logarithm of group positive scaling parameter (argument lnkappa)

The logarithm of group positive scaling parameter ln(kappa) is estimated using a uniform prior such that ln(kappa)∼U(ln(kappamin),ln(kappamax)). You can decide the range of the values that can be inferred otherwise the program will keep the default value of -2.0.

example of the parameter estimation for a group:

./Flink task=estimate data=Flink_simulations.txt beta="(-2.0,0.1)" alpha_max="(3.0,0.1)" lnkappa="(-20,-0.1)"

Higher hierarchy

Inferring the parameters of more than one group, you have to run an higher hierarchical model. To get an higher hierarchical simulation, you have also to fix or infer the parameters B, A_max and lnK, adding them to the parameters that have been shown in the one group model (to see how to fix and infer them, see the paragraphes “Fixing a parameter” and “Inferring a parameter” of the section Launch Flink):

Maximum coefficient of selection of the world (argument A_max)

The maximum coefficient of selection Amax is estimated using an exponential prior with mean μA and the parameter of the distribution λA , often called the rate parameter, such that Amax ∼ Exp(λA).

Coefficient of drift of a group (argument B)

The coefficient of drift B is estimated using a normal prior with mean μB and the standard deviation σB, such that B ∼ N(μB,σB). The range of the values that B can assume belongs to the interval [-10.0, 0.0].

Logarithm of world positive scaling parameter (argument lnK)

The logarithm of world positive scaling parameter ln(K) is estimated using a uniform prior such that ln(K)∼U(ln(Kmin),ln(Kmax)). You can decide the range of the values that can be inferred otherwise the program will keep the default value of -2.0.

example of the parameter estimation for the higher hierarchy:

./Flink task=estimate data=Flink_simulations.txt beta="(-2.0,0.1)" alpha_max="(3.0,0.1)" lnkappa="(-20,-0.1)" A_max=1.0,0.1 B="(-1.0,0.1)" lnK=-10,-0.1

Estimation output

The estimation output is a file listing the parameter values samples during the MCMC chain. Each row of this file corresponds to one multi-dimensional parameter vector sampled from the posterior distribution. To see an example of the output file, check it out in estimating and plotting posterior distributions.
The following arguments modify the output:

  • The argument outName specifies the initial tag of the file to be written. The default name is ‘tagOfTheInputFile’_MCMC_hierarchicalParameters.txt.
  • By default, the parameter samples of each iteration are written, resulting in very large files. For posterior estimation, only a subsample of these is required. To tell Flink to only print a subset, use the argument thinning. This argument takes an integer value n and result in only every nth parameter vector to be printed.

Restart

Use the option restart if you want to add some values from which the program will start. After the option restart, provide the starting values for the following arguments that you want to infer:

  • cur_log_a
  • cur_log_b
  • cur_lnNu
  • cur_lnMu
  • cur_lnK
  • cur_lnNu_g"group_position"
  • cur_lnMu_g"group_position"
  • cur_B_"group_position"
  • cur_beta_"group_position"_"population_position"
  • cur_lnkappa_"group_position"

Please note that the group and the population position start from 0. Add also the options S_fromfile and Sg_fromfile to provide the starting values of the states. You can find the last estimates values of the S of the groups in the file "Sg_values.txt", whereas the file "S_values.txt" contains the last estimates values of the S for the higher hierarchy.

Other arguments

  • numIterations: number of iterations to perform the MCMC algorithm.
  • burnin: number of initial iterations that are not included to calculate the posterior probabilities.
  • thinning: integer value n that determines the printing of every nth parameter vector.
  • s_max: Maximum state of the Markov model. The default value is 2.
  • lnMu: Probability involved in the generating matrix to go to a different state for the higher hierarchy. The default value is -2.0.
  • lnNu: Probability involved in the generating matrix to go to a selection state from the neutral state for the higher hierarchy. The default value is -1.0.
  • lnMu_g: Probability involved in the generating matrix to go to a different state at the group level. The default value is -2.0.
  • lnNu_g: Probability involved in the generating matrix to go to a selection state from the neutral state at the group level. The default value is -1.0.
  • log_a: Describes the shape of the beta distribution of allele frequencies in the ancestral population (the peak around 1.0). The default value is 0.7. It's estimated using a lognormal positive prior, for whom we suggest to give an interval between 0 and 1: log_a="0.0,1.0".
  • log_b: Describes the shape of the beta distribution of allele frequencies in the ancestral population (the peak around 0.0). The default value is 0.7. It's estimated using a lognormal positive prior, for whom we suggest to give an interval between 0 and 1: log_a="0.0,1.0".
  • maxDist: Maximum distance to group the sites of a chromosome for the linkage. The default value is 1000000.
  • P: fix all the world ancestral allele frequencies to a value between (0, 1)
  • p: fix all the ancestral allele frequencies to a value between (0, 1)
  • input_freq: give in input a file containing the ancestral allele frequencies. This file has to be similar to the generated output file “freq_p.txt” of a simulation.
  • input_P: give in input a file containing the ancestral allele frequencies for the higher hierarchy. This file has to be similar to the generated output file “freq_ancP.txt” of a simulation.
  • sigmaProp_alphaMax: Value to determine the range of the proposal value of alpha_max. The same value is kept for the higher hierarchical parameter. The default value is 0.15.
  • sigmaProp_beta: Value to determine the range of the proposal value of beta. The same value is kept for the higher hierarchical parameter. The default value is 0.05.
  • sigmaProp_kappa: Value to determine the range of the proposal value of lnkappa. The same value is kept for the higher hierarchical parameter. The default value is 0.12.
  • sigma2_prior: value of the variance of the prior of alpha_max. It is involved in finding good starting values of the parameters. The same value is kept for the higher hierarchical parameter. The default value is 1.
  • sigmaProp_mu: Value to determine the range of the proposal value of mu. The same value is kept for the higher hierarchical parameter. The default value is 0.09.
  • sigmaProp_nu: Value to determine the range of the proposal value of nu. The same value is kept for the higher hierarchical parameter. The default value is 0.4.
  • sigmaProp_a: Value to determine the range of the proposal value of a. The same value is kept for the higher hierarchical parameter. The default value is 0.04.
  • S_fromfile: give in input a file containing the values of the ancestral states as starting values. This file has to be similar to the generated output file “S_simulated.txt” of a simulation. If it is followed by "no_S_estimation", the values of the ancestral states will not be inferred.
  • Sg"groupnumber"_fromfile: give in input a file containing the values of the states of the particular group as starting values. This file has to be similar to the generated output file “Sg_simulated.txt” of a simulation. If it is followed by "no_Sg"groupnumber"_estimation", the values of the states of the group will not be inferred.
  • iteration_jumpallS
  • lim_jumpallS
  • iteration_B

Other options

  • printPosteriorP: the program will generate an output file that is called “Posterior_P.txt” in the case of higher hierarchy and some “Posterior_Pg’GroupNumber’.txt” files as many as the number of groups. These files contain five columns showing the site index in the first one, and in the others some statistics concerning the S states of each site, precisely the mean, the mode, the 10% quantile and the 90% quantile respectively. The files will be regenerated every 10000 iterations and at the end of the parameter estimation.
  • numThreads: number of used cores. As default the program uses only one core. Please don't use it for the higher hierarchy.

Examples of useful command lines

In the case of having one group:

./Flink task=estimate log_a="0.0,1.0" log_b="0.0,1.0" numThreads=2 lnMu_g="(-10.0,-0.01)" lnNu_up_g="(-10.0,-0.01)" lnNu_down_g="(-10.0,-0.01)" s_max=14 beta="(-2.0,1.8)" alpha_max=4.0 numIterations=500000 burnin=200000 thinning=100 lnkappa="(-15,-0.1)" verbose printPosteriorS printPosteriorP logFile=logfile data=filtered_data.txt

In the case of higher hierarchy:

./Flink task=estimate numThreads=3 log_a="0.0,1.0" log_b="0.0,1.0" B="(-2.0,1.8)" A_max=3.0 lnK=-15,-0.1 lnMu="-10.0,0.0" lnNu="-10.0,0.0" lnMu_g="(-10.0,0.0)" lnNu_g="(-10.0,0.0)" s_max=10 beta="(-2.0,1.8)" alpha_max=3.0 numIterations=500000 burnin=200000 thinning=100 lnkappa="(-15,-0.1)" verbose printPosteriorS logFile=logfile data=Flink_simulations.txt

Updated