Cache Flush/Close out of order messages

Issue #33 new
Douglas Potter repo owner created an issue

On Eiger (at least), cache FLUSH messages are occasionally received after the MPI_Ibarrier has completed, which causes strange behaviour. For example:

pkdgrav3/mdl2/mpi/mdl.cxx:1180: void mdl::mpiClass::MessageCacheClose(mdl::mdlMessageCacheClose*): Assertion `std::all_of(countCacheInflight.begin(), countCacheInflight.end(), [](int i) { return i==0; })' failed.

This means that the MPI_Ibarrier completed, but a cache “Isend” (a FLUSH) had not yet completed.

Or:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000578279 in mdl::mdlClass::MessageFlushToCore (this=0xa10410, pFlush=0xf5a580) at /users/tmeier/codePkdgrav/pkdgrav3/mdl2/mpi/mdl.cxx:1063
1063            auto c = cache[ca->cid].get();

This means that the MPI_Ibarrier completed and the cache was closed, which reset the “cache helper”; a FLUSH message was received afterwards.

This out-of-order behaviour is allowed by the MPI specification. So far it has only been seen on Eiger.
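
To make the two failure modes easier to reason about, here is a minimal sketch in plain C++ (no MPI) that models the bookkeeping involved. It is not the mdl.cxx implementation: the class CacheCloseGuard and all of its member names are hypothetical, and it only models when a close is safe and when a FLUSH arrives late. It does not show how the receiver would learn that every FLUSH has arrived.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of the cache close bookkeeping (not the pkdgrav3 classes).
class CacheCloseGuard {
public:
    explicit CacheCloseGuard(std::size_t nCaches)
        : inflight(nCaches, 0), open(nCaches, 1) {}

    // Sender side: a FLUSH send was posted for cache `cid`.
    void flushPosted(std::size_t cid) { ++inflight.at(cid); }

    // Sender side: the send for cache `cid` has completed.
    void flushCompleted(std::size_t cid) {
        assert(inflight.at(cid) > 0);
        --inflight.at(cid);
    }

    // The non-blocking barrier has completed. As the issue shows, this alone
    // does not imply that the FLUSH sends have completed.
    void barrierCompleted() { barrierDone = true; }

    // Closing is only safe once the barrier is done AND nothing is in flight.
    bool canClose() const {
        return barrierDone &&
               std::all_of(inflight.begin(), inflight.end(),
                           [](int n) { return n == 0; });
    }

    // Close all caches if it is safe; otherwise leave state untouched so the
    // caller can retry later instead of hitting the assertion.
    bool tryClose() {
        if (!canClose()) return false;
        std::fill(open.begin(), open.end(), 0);
        return true;
    }

    // Receiver side: false means the FLUSH targets a closed or unknown cache
    // and must not be dereferenced (the SIGSEGV case).
    bool acceptFlush(std::size_t cid) const {
        return cid < open.size() && open[cid] != 0;
    }

private:
    std::vector<int> inflight;   // outstanding FLUSH sends per cache
    std::vector<char> open;      // 1 while the cache is open
    bool barrierDone = false;
};
```

In this model, the assertion failure corresponds to closing while canClose() is still false, and the segmentation fault corresponds to handling a FLUSH for which acceptFlush() would return false.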
