No convergence diagnostics for MCMC, examples do not converge
When running MCMC, it is very easy to get wrong results because the chains have not converged to the stationary distribution (due to model misspecification, non-identifiability, too few warmup/burn-in samples, too few total samples, …).
Currently, it appears that the MCMC part of MUQ has no built-in way to check this, which would be very useful. In fact, it appears that at least some of the MCMC examples have not actually converged to the stationary distribution.
Background:
In Stan, it is considered best practice to check convergence of any MCMC run by:
- Running multiple (usually at least 4) chains
- Computing split Rhat (or similar diagnostics) across the chains; a recent update uses rank-normalized and folded values, among others (see reference below)
- Computing the effective sample size (ESS) of each parameter across all chains, both for the bulk (ESS of rank-normalized values using split chains) and for the tail (minimum of the ESS of the 5% and 95% quantiles)
I think this is basically state of the art, though I may be biased by my affiliation with Stan. I do not insist on these particulars, just that some form of convergence check should be easily available to the user and shown in the examples. That is, I think, the least one can do to help users avoid wrong inferences.
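For concreteness, here is a minimal sketch of the classic (non rank-normalized) split Rhat from the bullets above, written in plain C++ with no MUQ types; the chain layout (a vector of chains, each a vector of draws) is my assumption for illustration, not MUQ's API:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Classic split-Rhat: split every chain in half (so within-chain trends are
// also detected), then compare between-chain and within-chain variance.
double SplitRhat(const std::vector<std::vector<double>>& chains) {
  // Split each chain into two halves.
  std::vector<std::vector<double>> halves;
  for (const auto& c : chains) {
    std::size_t half = c.size() / 2;
    halves.emplace_back(c.begin(), c.begin() + half);
    halves.emplace_back(c.begin() + half, c.begin() + 2 * half);
  }
  std::size_t m = halves.size();         // number of split chains
  std::size_t n = halves.front().size(); // draws per split chain

  // Per-(split-)chain means and sample variances.
  std::vector<double> means(m), vars(m);
  for (std::size_t j = 0; j < m; ++j) {
    double mean = 0.0;
    for (double x : halves[j]) mean += x;
    mean /= n;
    double var = 0.0;
    for (double x : halves[j]) var += (x - mean) * (x - mean);
    means[j] = mean;
    vars[j] = var / (n - 1);
  }
  double grand = 0.0;
  for (double mu : means) grand += mu;
  grand /= m;

  double B = 0.0;  // between-chain variance estimate (scaled by n)
  for (double mu : means) B += (mu - grand) * (mu - grand);
  B *= static_cast<double>(n) / (m - 1);

  double W = 0.0;  // mean within-chain variance
  for (double v : vars) W += v;
  W /= m;

  // Pooled variance estimate; Rhat -> 1 as the chains mix.
  double var_hat = (n - 1.0) / n * W + B / n;
  return std::sqrt(var_hat / W);
}
```

Chains that mix around the same mean give values just below 1, while chains stuck in different places give values far above the usual < 1.01 threshold.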
Non-convergence of examples
To show that this is not a theoretical concern, I checked the output of some of the MCMC examples. Despite the target densities not being very difficult, I encountered (small) problems. I mostly exported the samples and used the posterior R package for the analysis.
Let’s start with SamplingAlgorithms/MCMC/Example2_GaussianInverseGamma
- I exported the samples and noticed that rhat is very high. To show this more directly, I added code to print the means of the first 30K and last 30K post-warmup samples (BTW, is there an easier way to subset/split/thin samples?):
SampleCollection startSamples;
SampleCollection endSamples;
for(size_t i = 0; i < 30000; ++i) {
    startSamples.Add(samps->at(i));
    endSamples.Add(samps->at(i + samps->size() - 30000));
}
std::cout << "Start Mean = \n" << startSamples.Mean().transpose() << std::endl;
std::cout << "End Mean = \n" << endSamples.Mean().transpose() << std::endl;
Here is the output from a typical run:
Sample Mean =
1.02402 2.01825 0.656083
Start Mean =
1.03567 2.11923 0.635161
End Mean =
1.11773 1.89382 0.705868
So we see that at least the start of the chain has not yet reached the stationary distribution; the differences are almost certainly larger than the Monte Carlo error from 30K samples.
In SamplingAlgorithms/MCMC/Example3_MultilevelGaussianSampling/MultilevelGaussianSampling.cpp I observe similar failures to pass the diagnostics. It is a bit unclear whether the samples are after burn-in or include burn-in (see https://bitbucket.org/mituq/muq2/issues/18/inconsistent-dimensions-of-output). But even assuming the first 1000 samples should be discarded (which the example does not do), the means of the first 4500 and the last 4500 samples can differ noticeably. Here is a summary for the QOIs in 4 chains (computed in R after exporting to file; rhat should be < 1.01):
QOI 1 : rhat : 1.114922 , ESS bulk: 22.26576 , ESS tails: 233.6245
Chain 1 : Mean first: 1.125966 Mean second: 2.015556
Chain 2 : Mean first: 1.12977 Mean second: 1.953836
Chain 3 : Mean first: 1.151658 Mean second: 2.04343
Chain 4 : Mean first: 1.118716 Mean second: 2.028392
QOI 2 : rhat : 1.118254 , ESS bulk: 21.71397 , ESS tails: 245.1589
Chain 1 : Mean first: 1.129428 Mean second: 2.022696
Chain 2 : Mean first: 1.118074 Mean second: 1.966311
Chain 3 : Mean first: 1.14056 Mean second: 2.045057
Chain 4 : Mean first: 1.108941 Mean second: 2.036765
So we see that the second halves are in a very different place than the first halves.
When I discard the first 7000 samples, the statistics look fine, so this is probably just a case of too few warmup/burn-in iterations:
QOI 1 : rhat : 1.001316 , ESS bulk: 3029.344 , ESS tails: 3896.477
Chain 1 : Mean first: 1.993972 Mean second: 2.009912
Chain 2 : Mean first: 1.963403 Mean second: 1.950324
Chain 3 : Mean first: 2.110226 Mean second: 1.969987
Chain 4 : Mean first: 2.05629 Mean second: 2.065964
QOI 2 : rhat : 1.001135 , ESS bulk: 3111.017 , ESS tails: 4083.119
Chain 1 : Mean first: 1.997199 Mean second: 2.015147
Chain 2 : Mean first: 1.976256 Mean second: 1.956664
Chain 3 : Mean first: 2.110937 Mean second: 1.969425
Chain 4 : Mean first: 2.040694 Mean second: 2.07161
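To make the warmup effect concrete, here is a toy, self-contained C++ sketch of the first-half/second-half mean comparison used above. The fake chain (stationary mean plus a decaying transient) and the 7000-draw burn-in are made up for illustration; this is not MUQ code:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Mean of chain[begin, end).
double MeanOf(const std::vector<double>& x, std::size_t begin, std::size_t end) {
  double sum = 0.0;
  for (std::size_t i = begin; i < end; ++i) sum += x[i];
  return sum / (end - begin);
}

// Absolute difference between the means of the first and second halves of the
// draws retained after discarding `burnin` draws; a large value signals that
// the chain is still drifting towards the stationary distribution.
double HalfMeanGap(const std::vector<double>& chain, std::size_t burnin) {
  std::size_t n = chain.size() - burnin;
  double first  = MeanOf(chain, burnin, burnin + n / 2);
  double second = MeanOf(chain, burnin + n / 2, burnin + n);
  return std::abs(first - second);
}

// A fake "chain": stationary mean 2.0 plus a slowly decaying transient,
// mimicking a sampler started far from equilibrium.
std::vector<double> FakeChain(std::size_t length) {
  std::vector<double> chain(length);
  for (std::size_t i = 0; i < length; ++i)
    chain[i] = 2.0 + 5.0 * std::exp(-static_cast<double>(i) / 500.0);
  return chain;
}
```

With no burn-in the gap is on the order of the transient itself; after discarding the transient, the two halves agree closely, matching the behavior observed above.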
References:
Aki Vehtari, Andrew Gelman, Daniel Simpson, Bob Carpenter, and Paul-Christian Bürkner (2019). Rank-normalization, folding, and localization: An improved R-hat for assessing convergence of MCMC. arXiv preprint arXiv:1903.08008.
Stan implementation C++:
Stan implementation R:
https://github.com/stan-dev/posterior/blob/master/R/convergence.R
Comments (4)
(reporter) I just retried with the newest version. It appears that Example2_GaussianInverseGamma now converges. However, Example3_MultilevelGaussianSampling is still not in equilibrium: I get large Rhats for the SLMCMC chain. I am not sure how to assess convergence of GreedyMLMCMC: GetSamples() does not work, and there is a GetBox(index) method but no way to get the total number of boxes. When I compute Rhat with the first two boxes, I get high values, but I am not sure whether that is sensible; maybe each chain computes something somewhat different? (I admit I don't understand the algorithm at all.) I similarly have no idea how to assess the convergence of StaticLoadBalancingMIMCMC in Example4. All of the above holds for the C++ version; I did not try Python, although I noticed that the Python version of Example2 does not include diagnostics.
Hi @Martin Modrák. We have significantly refactored how information from MIMCMC and MLMCMC can be accessed. The GetSamples() function now returns a MultiindexEstimator class with many of the same features as the SampleCollection class for standard MCMC algorithms. Means, higher-order moments, ESS, and Rhat can now all be computed from the MIMCMC results in both C++ and Python.
- changed status to resolved
Thanks @Martin Modrák. I’ll dig into this. I agree that ensuring (or at least testing for) convergence is important and will think more about how we can incorporate better diagnostics, like those you mentioned, into MUQ.