Cache Flush/Close out of order messages
Issue #33
new
On Eiger (at least), cache FLUSH messages are occasionally received after the MPI_Ibarrier has completed, causing strange behaviour. For example:
pkdgrav3/mdl2/mpi/mdl.cxx:1180: void mdl::mpiClass::MessageCacheClose(mdl::mdlMessageCacheClose*): Assertion `std::all_of(countCacheInflight.begin(), countCacheInflight.end(), [](int i) { return i==0; })' failed.
This means that the MPI_Ibarrier completed, but a cache MPI_Isend (a FLUSH) had not yet completed.
Or:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000000000578279 in mdl::mdlClass::MessageFlushToCore (this=0xa10410, pFlush=0xf5a580) at /users/tmeier/codePkdgrav/pkdgrav3/mdl2/mpi/mdl.cxx:1063
1063 auto c = cache[ca->cid].get();
This means that the MPI_Ibarrier completed and the cache was closed, resetting the “cache helper”; a FLUSH message then arrived and was processed against the already torn-down cache.
This out-of-order behaviour is allowed by the MPI specification: completion of a non-blocking barrier does not imply that previously posted point-to-point sends have been delivered. It has so far only been seen on Eiger.
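One possible remedy, sketched here under assumptions (this is not code from mdl.cxx, and `service_incoming` is a hypothetical helper), is the NBX-style termination-detection pattern: post the FLUSHes with MPI_Issend, whose completion implies the matching receive has started on the target rank, enter MPI_Ibarrier only once all local synchronous sends have completed, and keep servicing incoming messages until the barrier itself completes. The barrier then cannot complete while any FLUSH is still unmatched.

```cpp
// Sketch of NBX-style termination detection; illustrative only.
#include <mpi.h>
#include <vector>

// Hypothetical helper: probe for and receive any pending FLUSH messages.
void service_incoming(MPI_Comm comm);

void drain_flushes(MPI_Comm comm, std::vector<MPI_Request> &sendReqs) {
    // sendReqs holds requests from FLUSHes posted with MPI_Issend (not
    // MPI_Isend): completion of an Issend implies the matching receive
    // has been started on the destination rank.

    // Wait until every local synchronous send has completed, while still
    // receiving FLUSHes from other ranks to avoid deadlock.
    int allSent = 0;
    while (!allSent) {
        service_incoming(comm);
        MPI_Testall(static_cast<int>(sendReqs.size()), sendReqs.data(),
                    &allSent, MPI_STATUSES_IGNORE);
    }

    // Only now enter the non-blocking barrier. It completes only after
    // every rank has entered it, i.e. after every rank's Issends have
    // completed, i.e. after every FLUSH has been matched — so no FLUSH
    // can arrive after the barrier completes.
    MPI_Request barrier;
    MPI_Ibarrier(comm, &barrier);
    int done = 0;
    while (!done) {
        service_incoming(comm);
        MPI_Test(&barrier, &done, MPI_STATUS_IGNORE);
    }
}
```

With plain MPI_Isend, local completion only means the buffer is reusable, so a rank can enter (and pass) the barrier while its FLUSH is still in flight, which matches both failure modes above.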