Wiki

Clone wiki

levolution / Inferring jump parameters (task infer)

Inferring jump parameters (task infer)

To infer jump parameters with levolution, specify task=infer. Then one needs to specify the mandatory parameters: tree, traits, and alpha. An example command line to infer jump parameters from a tree file myTree.tree and a traits file myTraits.traits might look like this:

./levolution task=infer tree=myTree.tree traits=myTraits.traits alpha=5

The tree file has to be in newick format. The traits file consist of two columns with no header. The first column has the tree tips names (the same as in the tree file), and the second column has the trait values (look at input formats).

EM settings

By default, the EM algorithm is started at the MLE estimates of the null model for the root state and the Brownian variance, and at a lambda corresponding to 10 jumps on the total tree. These starting values, however, can also be set manually using the argument tags rootState, brownVar and lambda. This may be interesting if knowledge on these values is available a priori, or to test the convergence of the algorithm.

In addition, the parameters governing the convergence assessment can be tweaked. For standard EM algorithms, convergence is assesses of estimates no longer chance. However, since levolution implements a stochastic variant of the EM algorithm, estimates will continue to fluctuate even after convergence. levolution there implements two different methods for assessing the convergence are currently implemented:

  • After each iteration, it is tested if the last XXX values of each parameter show a trend. If a model without trend can no longer be rejected using a T test for all parameters, the EM algorithm is considered converged. By default, the applied threshold for the T statistics is T=2. That is, if the observed T < 2, a model without trend will not be rejected. A different threshold can be specified with the argument maxT.
  • In addition levolution also checks if values increase and decrease randomly. This is expected if values are fluctuating randomly, but not if they follow a trend. We test for this using the statistics N, which measures the number of sign switches in the change between consecutive values, of a out of all possible sign switches. A value of N=1.0, for instance, would indicate that values alternate in increasing and decreasing, while a value of N=0.0 would indicate that they either always increase or always decrease. By default, the EM is considered converged if N>=0.5, but a different minimal N can be specified with the argument minN.

Note that levolution only stops the EM algorithm if convergence is assesses using both tests. Also, the two tests are always run over the last few values. By default, the last 16 values are used, but a different number of values to use can be specified with the argument numValCheckTrend. Larger values will result in a more accurate assessment, but will require longer chains. Indeed, even if a chain converged, the EM has to continue until no trend is found among the last values.

numValCheckTrend

MCMC settings

If one desires to change the MCMC arguments one can do so by adding numNVecs, burnin, and thinning. For example:

./levolution task=infer tree=myTree.tree traits=myTraits.traits alpha=5 numNVecs=5000 burnin=100 thinning=10

In this case, the initial burn-in of the MCMC will discard the first 100 samples, and then it will sample 5000 jump configurations (numNVecs) every 10 steps (thinning).

The arguments burnin, thinning, and numNVecs apply to all MCMC runs. If the user is interested in having different settings for different parts of the analysis then change them as follows.

MCMC for jump location. If one desires to change the MCMC settings only for the jump-location estimation (check section Inferring jumps) use the arguments numNVecsEB, thinningEB, and burninEB.

MCMC for estimating the log-likelihood of the model. If one desires to change the MCMC settings only for the log-likelihood estimation use the arguments numNVecsLL, thinningLL, and burninLL.

When the MCMC is finished there will be an output file with the suffix oneLineSummary.txt with the inference results of all model parameters.

Estimating alpha

Notice that the estimation step described above uses a fixed alpha value. If the user is also interested in estimating alpha there are two ways to achieve this: a grid search and using the peak finder function.

Estimating alpha using a grid search. Simply add the range and number of values to test to the argument alpha. For example, alpha=1-128:50 will sample 50 values from the range 1 to 128. The resulting output file will give the parameter model estimates corresponding to the most likely alpha. We recommend to add the argument logAlpha so that the sampled alpha values are in the log scale.

Estimating alpha with peak finder. This algorithm will search dynamically along the likelihood surface for alpha until it finds it maximum. By default it only works in the log scale. Example use:

logAlphaStart=0.5 logAlphaStep=0.5 numOptimizations=5

Here the initial value of alpha in the log scale will be 0.5, with a step size of 0.5 and a total of 5 optimizations. This option cannot be paralelized (as a main difference with the grid-search algorithm) and thus works faster when dealing with small trees, in the order of 100 leaves.

Finally, the user can also add the argument inferJumps to have an output file in newick format with the posterior probabilities of jumps per branch. This will perform the same analysis as task=inferJumps (check section Inferring jumps)

Output files

When using task=infer there will be several output files which we describe below.

The main output file has the extension oneLineSummary.txt. This file contains several columns including the input files used, the parameter estimates, the tree length, the likelihoods of the null and Lévy models, the preferred model, etc.

Additionally, the user will get another output file with the extension _EM_alpha_<alpha>.txt, where <alpha> corresponds to the alpha value (or alpha values in case of a grid search) tested. In this file the user will find a detailed output of the EM run with the values of the parameters per iteration and the convergence statistics (T and N) per parameter and per iteration. The columns are: Iteration, RootState, BrownVar, Lambda, T_RootState, T_BrownVar, T_lambda, maxT, N_RootState, N_BrownVar, N_lambda, and minN. The rows are simply given by the interation number until convergence.

Updated