No convergence diagnostics for MCMC, examples do not converge

Issue #20 resolved
Martin Modrák created an issue

When running MCMC, it is very easy to get wrong results because the chains have not converged to the stationary distribution (due to model misspecification, non-identifiability, too few warmup/burn-in iterations, too few total samples, …).

Currently, the MCMC part of MUQ appears to have no built-in way to check this; such checks would be very useful. In fact, at least some of the MCMC examples appear not to have actually converged to the stationary distribution.

Background:

In Stan, it is considered best practice to check convergence of any MCMC run by:

  • Running multiple (usually at least 4) chains
  • Computing split-Rhat (or a similar diagnostic) across the chains; a recent update uses rank-normalized and folded values (see the reference below; minimal sketches of both diagnostics follow this list)
  • Computing the effective sample size for the bulk (ESS of rank-normalized values using split chains) and the tails (minimum of the ESS for the 5% and 95% quantiles) of each parameter across all chains
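
For concreteness, here is a minimal sketch of the classic split-Rhat computation in plain C++. This is only my own illustration (not MUQ or Stan code), it assumes equal-length chains of at least two draws, and it omits the rank normalization and folding that the reference below adds on top:

#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Classic split-Rhat: split each chain in half, treat the halves as
// separate chains, and compare between-chain to within-chain variance.
// Values noticeably above 1.01 indicate non-convergence.
double SplitRhat(std::vector<std::vector<double>> const& chains) {
  std::vector<std::vector<double>> halves;
  for (auto const& c : chains) {
    std::size_t mid = c.size() / 2;
    halves.emplace_back(c.begin(), c.begin() + mid);
    halves.emplace_back(c.begin() + mid, c.begin() + 2 * mid);
  }

  std::size_t m = halves.size();     // number of split chains
  std::size_t n = halves[0].size();  // draws per split chain

  // Per-chain means and the grand mean.
  std::vector<double> means(m);
  for (std::size_t j = 0; j < m; ++j)
    means[j] = std::accumulate(halves[j].begin(), halves[j].end(), 0.0) / n;
  double grand = std::accumulate(means.begin(), means.end(), 0.0) / m;

  // Between-chain variance B and mean within-chain variance W.
  double B = 0.0, W = 0.0;
  for (std::size_t j = 0; j < m; ++j) {
    B += (means[j] - grand) * (means[j] - grand);
    double s2 = 0.0;
    for (double x : halves[j]) s2 += (x - means[j]) * (x - means[j]);
    W += s2 / (n - 1);
  }
  B *= double(n) / (m - 1);
  W /= m;

  // Pooled variance estimate and Rhat.
  double varPlus = (double(n - 1) / n) * W + B / n;
  return std::sqrt(varPlus / W);
}

And a similarly minimal single-chain ESS estimate, truncating the autocorrelation sum at the first negative pair of lags (Geyer's initial positive sequence); Stan's production versions, linked under References, additionally pool information across chains and handle the bulk/tail variants:

// Basic ESS for one chain: N / (1 + 2 * sum of autocorrelations).
double BasicESS(std::vector<double> const& x) {
  std::size_t N = x.size();
  double mean = std::accumulate(x.begin(), x.end(), 0.0) / N;

  // Biased autocovariance at lag t.
  auto autocov = [&](std::size_t t) {
    double s = 0.0;
    for (std::size_t i = 0; i + t < N; ++i)
      s += (x[i] - mean) * (x[i + t] - mean);
    return s / N;
  };

  double var = autocov(0);
  if (var <= 0.0) return double(N);  // constant chain; avoid dividing by zero

  double rhoSum = 0.0;
  for (std::size_t t = 1; t + 1 < N; t += 2) {
    double pair = (autocov(t) + autocov(t + 1)) / var;
    if (pair < 0.0) break;  // truncate once the paired sum turns negative
    rhoSum += pair;
  }
  return N / (1.0 + 2.0 * rhoSum);
}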

I think this is basically the state of the art, but I may be biased by my affiliation with Stan. I do not insist on the particulars, only that some form of convergence check should be easily available to the user and shown in the examples. That is, I think, the least one can do to help users avoid wrong inferences.

Non-convergence of examples

To show that this is not just a theoretical concern, I checked the output of some of the MCMC examples. Although the target densities are not very difficult, I encountered (small) problems. I mostly exported the samples and used the posterior R package for the analysis.

Let’s start with SamplingAlgorithms/MCMC/Example2_GaussianInverseGamma. I exported the samples and noticed that Rhat is very high. To show this more directly, I added code to print the means of the first 30K and the last 30K post-warmup samples (BTW, is there an easier way to subset/split/thin samples?):

// Compare the means of the first and last 30,000 post-warmup draws;
// after convergence the two should agree up to Monte Carlo error.
SampleCollection startSamples;
SampleCollection endSamples;
for(size_t i = 0; i < 30000; ++i) {
  startSamples.Add(samps->at(i));
  endSamples.Add(samps->at(i + samps->size() - 30000));
}

std::cout << "Start Mean = \n" << startSamples.Mean().transpose() << std::endl;
std::cout << "End Mean = \n" << endSamples.Mean().transpose() << std::endl;

Here is the output from a typical run:

Sample Mean =
 1.02402  2.01825 0.656083
Start Mean =
 1.03567  2.11923 0.635161
End Mean =
 1.11773  1.89382 0.705868

So we see that at least the start of the chain has not yet reached the stationary distribution: the differences between the two means are almost certainly larger than the Monte Carlo error expected from 30K samples.

In SamplingAlgorithms/MCMC/Example3_MultilevelGaussianSampling/MultilevelGaussianSampling.cpp I observe similar failures to pass the diagnostics. It is a bit unclear whether the returned samples are post-burn-in or include burn-in (see https://bitbucket.org/mituq/muq2/issues/18/inconsistent-dimensions-of-output), but even assuming the first 1000 samples should be discarded (which the example does not do), the mean of the first 4500 samples and the mean of the last 4500 samples differ noticeably. Here is a summary of the QOIs for 4 chains, computed in R after exporting to file (Rhat should be < 1.01):

QOI  1 : rhat : 1.114922 , ESS bulk:  22.26576 , ESS tails:  233.6245 
    Chain  1 : Mean first:  1.125966  Mean second:  2.015556 
    Chain  2 : Mean first:  1.12977  Mean second:  1.953836 
    Chain  3 : Mean first:  1.151658  Mean second:  2.04343 
    Chain  4 : Mean first:  1.118716  Mean second:  2.028392 
QOI  2 : rhat : 1.118254 , ESS bulk:  21.71397 , ESS tails:  245.1589 
    Chain  1 : Mean first:  1.129428  Mean second:  2.022696 
    Chain  2 : Mean first:  1.118074  Mean second:  1.966311 
    Chain  3 : Mean first:  1.14056  Mean second:  2.045057 
    Chain  4 : Mean first:  1.108941  Mean second:  2.036765 

So we see that the second halves of the chains are in a very different place than the first halves.

When I throw away the first 7000 samples, the stats look great, so this is probably just a matter of too few warmup/burn-in iterations.
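
The discard itself can be done with the same manual SampleCollection pattern as in the snippet above (samps is again assumed to hold the full chain; I did not find a dedicated subsetting call):

// Treat the first 7000 draws as warmup and keep only the rest.
SampleCollection postWarmup;
for(size_t i = 7000; i < samps->size(); ++i)
  postWarmup.Add(samps->at(i));

With the first 7000 draws discarded, the diagnostics look like this: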

QOI  1 : rhat : 1.001316 , ESS bulk:  3029.344 , ESS tails:  3896.477 
    Chain  1 : Mean first:  1.993972  Mean second:  2.009912 
    Chain  2 : Mean first:  1.963403  Mean second:  1.950324 
    Chain  3 : Mean first:  2.110226  Mean second:  1.969987 
    Chain  4 : Mean first:  2.05629  Mean second:  2.065964 
QOI  2 : rhat : 1.001135 , ESS bulk:  3111.017 , ESS tails:  4083.119 
    Chain  1 : Mean first:  1.997199  Mean second:  2.015147 
    Chain  2 : Mean first:  1.976256  Mean second:  1.956664 
    Chain  3 : Mean first:  2.110937  Mean second:  1.969425 
    Chain  4 : Mean first:  2.040694  Mean second:  2.07161

References:

Aki Vehtari, Andrew Gelman, Daniel Simpson, Bob Carpenter, and Paul-Christian Bürkner (2019). Rank-normalization, folding, and localization: An improved R-hat for assessing convergence of MCMC. arXiv preprint arXiv:1903.08008.

Stan C++ implementation:

https://github.com/stan-dev/stan/blob/develop/src/stan/analyze/mcmc/compute_potential_scale_reduction.hpp

https://github.com/stan-dev/stan/blob/develop/src/stan/analyze/mcmc/compute_effective_sample_size.hpp

Stan R implementation:

https://github.com/stan-dev/posterior/blob/master/R/convergence.R

Comments (4)

  1. Matthew Parno

    Thanks @Martin Modrák. I’ll dig into this. I agree that ensuring (or at least testing for) convergence is important, and I will think more about how we can incorporate better diagnostics, like those you mentioned, into MUQ.

  2. Martin Modrák reporter

    I just retried with the newest version. It appears that Example2_GaussianInverseGamma now converges. However, Example3_MultilevelGaussianSampling is still not in equilibrium: I get large Rhats for the SLMCMC chain.

    I am not sure how to assess convergence of GreedyMLMCMC. GetSamples() does not work, and there is a GetBox(index) method, but no way to get the total number of boxes. When I compute Rhat from the first two boxes, I get high values, but I am not sure whether that is sensible; maybe each chain computes something somewhat different? (I admit I don't understand the algorithm at all.)

    I similarly have no idea how to assess the convergence of StaticLoadBalancingMIMCMC in Example4.

    All of the above holds for the C++ version; I did not try Python, although I noticed that the Python version of Example2 does not include the diagnostics.

  3. Matthew Parno

    Hi @Martin Modrák. We have significantly refactored how information from MIMCMC and MLMCMC can be accessed. The GetSamples() function now returns a MultiindexEstimator class with many of the same features as the SampleCollection class for standard MCMC algorithms. Means, higher-order moments, ESS, and Rhat can now all be computed from the MIMCMC results in both C++ and Python.
