kokkos_3dhalo crashes for some job layouts due to device segment limit

Issue #579 resolved
Dan Bonachea created an issue

Running the kokkos_3dhalo example on develop @ 795a2934 (and all previous releases) for some job layouts leads to device segment allocation failures that produce null pointers and subsequent crashes. The problem mostly arises for layouts with 64 ranks or more (where some processes are fully "interior" in the process grid), but can also arise for smaller layouts with certain non-default input parameters.

In particular, running with 64 ranks and the default 100^3 problem generates an obscure crash during the solve (details vary based on system configuration).

There are two separate problems here:

  1. The example is not checking the result of calls to device_allocator::allocate(), which returns a null pointer on failure. Lacking a check, the null pointers propagate forward into cascading failures later in the code.

  2. The allocation failures occur because the example creates a device segment that is exactly large enough to hold the elements for a worst-case allocation of ghost zone buffers, and assumes the device_allocator does not add any padding around these objects. In reality, the device segment allocator rounds up object sizes to maintain good object alignment. In the worst case, it can waste up to 4095 bytes per allocation on padding, and the example code does not account for this at all.
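To illustrate the second problem, here is a minimal sketch using a toy bump allocator. It is not the UPC++ device_allocator; the 4096-byte granularity, the 80000-byte buffer size, and the names `ToySegment` and `count_successes` are assumptions chosen only to show how an exactly-sized segment runs out of space once each request is rounded up.

```cpp
#include <cstddef>

// Toy stand-in for a device segment allocator; NOT the UPC++ implementation.
// Assumption: every request is rounded up to a multiple of `align` bytes.
struct ToySegment {
    std::size_t capacity;  // total bytes in the segment
    std::size_t align;     // hypothetical allocation granularity
    std::size_t used;      // bytes consumed so far, including padding

    ToySegment(std::size_t cap, std::size_t al)
        : capacity(cap), align(al), used(0) {}

    // Returns true on success, false when the segment is exhausted
    // (a real allocator would signal this with a null pointer).
    bool allocate(std::size_t bytes) {
        std::size_t rounded = (bytes + align - 1) / align * align;
        if (rounded > capacity - used) return false;
        used += rounded;
        return true;
    }
};

// Counts how many of `count` equal-sized requests succeed on a fresh segment.
inline int count_successes(ToySegment seg, std::size_t bytes, int count) {
    int ok = 0;
    for (int i = 0; i < count; ++i) {
        if (seg.allocate(bytes)) ++ok;
    }
    return ok;
}
```

With a segment of exactly 12 * 80000 bytes, each 80000-byte request is rounded up to 81920 bytes in this toy model, so only 11 of the 12 requests fit and the last one fails.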

The solution to the first problem for this best-practice example is to add assertions to check that device allocations did not fail.
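One possible shape for such a check is sketched below. `check_alloc` is a hypothetical helper, not part of UPC++ or the example; it assumes only that the returned pointer type can be tested with `!p`, as raw pointers can. Unlike a plain `assert`, it stays active in builds that define `NDEBUG`.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: aborts with a message if an allocation returned null,
// otherwise passes the pointer through unchanged.
// Assumption: Ptr supports `!p` (true for raw pointers).
template <typename Ptr>
Ptr check_alloc(Ptr p, const char* what) {
    if (!p) {
        std::fprintf(stderr, "device allocation failed: %s\n", what);
        std::abort();
    }
    return p;
}
```

Wrapping each allocation call in a helper like this turns a silent null pointer into an immediate, labeled failure at the allocation site.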

The solution to the second problem is probably to add 12*4KiB = 48KiB of padding to the device segment size, which is more than necessary for most layouts but should always be sufficient for this use case.
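The arithmetic behind that figure can be written out as a small sketch. The constant and function names are hypothetical; the count of 12 allocations and the 4 KiB per-allocation bound are taken from the numbers above.

```cpp
#include <cstddef>

// Worst-case padding per allocation, rounded up from 4095 bytes to 4 KiB.
constexpr std::size_t kMaxPadPerAlloc = 4096;
// Number of allocations covered, as implied by the 12*4KiB figure.
constexpr std::size_t kNumAllocs = 12;

// Segment size that covers the payload plus worst-case padding
// for every allocation.
constexpr std::size_t padded_segment_size(std::size_t payload_bytes) {
    return payload_bytes + kNumAllocs * kMaxPadPerAlloc;
}

static_assert(padded_segment_size(0) == 48 * 1024,
              "12 allocations * 4 KiB = 48 KiB of padding");
```

Over-provisioning by a fixed 48 KiB is a deliberately simple choice: it is independent of the individual buffer sizes, so it remains valid if the problem dimensions change.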
