Wiki

Clone wiki

ApproxWF / Estimating and plotting posterior distributions

Estimating and plotting posterior distributions

Estimation output

The estimation output is a file listing the parameter values samples during the MCMC chain. Each row of this file corresponds to one multi-dimensional parameter vector sampled from the posterior distribution. The first two columns print the index of the sample (the iteration number during the MCMC) and the log-likelihood of that parameter vector. The remaining columns list the parameter values sampled. An example of such a file looks like this:

index   LL          log10(N)   N         s_0            s_1         s_2
0       -188.069    4.92553    84243     -0.0141649     0.15224     -0.117764
10      -167.745    5.00259    100597    -0.00650263    0.159497    -0.0580638
20      -151.913    5.26158    182633     0.00982114    0.13461     -0.0230714
30      -141.208    5.12178    132366    -0.00873241    0.125176     0.00177488

Note that the population size is listed both on the log10 as well as on the natural scale. Also, the file only contains the parameters that are estimated (see Estimating parameters (task estimate)).

Plotting posterior distributions

We recommend using R to plot posterior distributions from the output of ApproxWF.

Step 1) Load output file into R

To load an output file names "ApproxWF_MCMC_output.txt" into R, use a command such as this one:

mcmc <- read.table("ApproxWF_MCMC_output.txt", header=T);

Note that this may take a long time if the file is larger. But since only a few thousand values re required to plot posterior distributions, it is recommended to subsample the MCMC output using the argument sampling (see Estimating parameters (task estimate)).

Step 2) Remove burnin

As with every MCMC analysis, the first several thousand iterations should be discarded as burnin. To remove the first 10,000 iterations use

mcmc <- mcmc[10001:length(mcmc[,1],];

Note that if you use subsampling, you will need to remove less lines from the mcmc object. When using a subsampling of 10 (printing every 10th iteration only), removing 1,000 lines from the mcmc object implies that the first 10,000 iterations are ignored.

Step 3) Plot posterior distributions

Hierarchical parameters

To plot the posterior distribution on the hierarchical parameters population size, sequencing error rate and mutation rate, simply use the function density in R. For the population size, for instance:

plot(density(mcmc$log10.N.), xlab="Log10(N)", ylab="Posterior Density", main="")

The density function will determine the range of parameter values from the sample. If you prefer to show the posterior distribution across the full prior range, you have to add the limits manually. (Note that R converts special characters such as '(' and ')' into '.').

logN_min <- 1;
logN_max <- 6;
plot(density(mcmc$log10.N.), xlab="Log10(N)", ylab="Posterior Density", main="", xlim=c(logN_min,logN_max));

You can also easily add the prior distribution for comparison. Note that all hierarchical parameters have log-uniform prior distributions, and hence the prior distribution is simply a line with density \(\frac{1}{max - min}\), where max and min correspond to the prior range. To add the prior for N to an existing plot, use

lines(c(logN_min,logN_max), y=rep(1/(logN_max - logN_min),2), lty=2);

Locus specific parameters

Posterior distributions for selection or dominance coefficients of individual loci can be plotted just as hierarchical parameters. However, note that the prior distributions is normal for both of these parameters.

s_mean <- 0.0;
s_sd <- 0.1;
plot(density(mcmc$s_0), xlab="s", ylab="Posterior Density", main="", xlim=c(-1,1));
#add prior
x <- seq(-1,1,length.out=1000);
lines(x, dnorm(x, mean=s_mean, sd=s_sd);

In case you want to plot the posterior distribution of many loci, it may be better to just plot the posterior median and the 99% (or 90%) quantiles. This can be done with the following R code (note that this applies to the above mentioned example file and has to be changed if additional parameters are estimated):

numLoci <- length(mcmc[1,]) - 4;
plot(0, type='n', ylim=c(-1,2), xlim=c(1,numLoci));
for(s in 1:numLoci){
   #plot quantile
   lines(rep(s,2), quantile(mcmc[,s+4], probs=c(0.005,0.995)));
   #add median
   points(s, median(mcmc[,s+4]), pch=19, cex=0.5);
}

Assessing convergence

Markov Chain Monte Carlo is a technique to generate samples from a particular distribution - in our case the multidimensional posterior distribution. It is important that the MCMC chain has converged.

Convergence can be assessed by looking at the trace of the MCMC for log-likelihood as well as different parameters.

mcmc <- read.table("ApproxWF_MCMC_output.txt", header=T);
par(mfrow=c(1,3))
plot(mcmc$LL, type='l', ylab="LL");
plot(mcmc$log10.N., type='l', ylab="llog10(N)", ylim=c(1,8));
plot(mcmc$s_0, type='l', ylab="s_0");

This will result in picture like this one:

convergence.pdf.png

The log-likelihood is expected to initially rise, and then to fluctuate at the top value, just as seen in the first panel.

The parameters are expected to also initially search the space (burnin), and then to fluctuate around the best value to generate samples from the posterior distribution. This behaviour is seen for s_0, but not for \(log_{10}(N)\), suggesting that 1) the burnin for N is relatively long and 2) the MCMC chain used here is much too short.

Updated