kokkos_3dhalo example fails on multi-GPU nodes due to device mismatch

Issue #578 resolved
Dan Bonachea created an issue

Running the kokkos_3dhalo example on develop @ 795a2934 on crusher (whose nodes have 8 GPUs) with multiple processes per node (PPN) and no GPU visibility restrictions generates failures like the following:

$ srun -N 1 -n 4 -l upcxx_heat_conduction 
0: NumRanks: 4 Me: 0 Grid: 1 2 2 MyPos: 0 0 0
0: Me: 0 MyNeighs: -1 -1 -1 1 -1 2
0: My Domain: 0 (0 0 0) (100 50 50)
1: NumRanks: 4 Me: 1 Grid: 1 2 2 MyPos: 0 1 0
1: Me: 1 MyNeighs: -1 -1 0 -1 -1 3
1: My Domain: 1 (0 50 0) (100 100 50)
2: NumRanks: 4 Me: 2 Grid: 1 2 2 MyPos: 0 0 1
2: Me: 2 MyNeighs: -1 -1 -1 3 0 -1
2: My Domain: 2 (0 0 50) (100 50 100)
3: NumRanks: 4 Me: 3 Grid: 1 2 2 MyPos: 0 1 1
3: Me: 3 MyNeighs: -1 -1 2 -1 1 -1
3: My Domain: 3 (0 50 50) (100 100 100)
3: :0:rocdevice.cpp            :2614: 1558771782034 us: 95644: [tid:0x7fffe32df700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_MEMORY_FAULT: Agent attempted to access an inaccessible address. code: 0x2b
3: *** Caught a fatal signal (proc 3): SIGABRT(6)
[crash output truncated]

The same failure occurs with many other multi-PPN layouts (both single- and multi-node). In every case I've seen, appending the srun option --gpus-per-task=1 resolves the crash (and this is the recommended workaround). Similarly, the problem is resolved by setting ROCR_VISIBLE_DEVICES=X inside the srun command, where X is any single value in [0,7].

This behavior is unintuitive at first glance, because the workarounds above work by reducing the number of GPUs visible to each application process from the default of all 8 GPUs to a single GPU per process. I believe the root cause is:

There is no explicit handshake in the code or run instructions to ensure Kokkos and UPC++ use the same GPU. Moreover, when each process can see more than one GPU and PPN > 1, the algorithms used by Kokkos and UPC++ to choose a default GPU arrive at different answers!

One can observe this effect by running with the application command-line option --kokkos-print-configuration (which dumps configuration info, including the GPU device ID selected by Kokkos) and adding a debugging line like:

upcxx::experimental::say() << "UPC++ using GPU " << gpu_alloc.device_id();

For many multi-PPN job layouts where more than one GPU is visible to each process, these outputs reveal the two libraries arriving at different default GPU device IDs. Unsurprisingly, these are exactly the cases where the program crashes: the program allocates device buffers in the UPC++ device segment on one GPU, then passes those device pointers to Kokkos, which packs data and launches kernels expecting those buffers to live on a different GPU.
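
For example, the failing layout above can be rerun with the Kokkos flag appended to the application arguments (the placement shown here is illustrative):

$ srun -N 1 -n 4 -l upcxx_heat_conduction --kokkos-print-configuration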

IMO we need to explore either upgrading the code to ensure both libraries "agree" on the GPU in use at each process, or carefully document how to avoid this pitfall.

Comments (9)

  1. Dan Bonachea reporter

    Some initial thoughts/observations:

    UPC++ provides programmatic flexibility in device ID:

    1. The UPC++ code currently relies on automatic device selection during gpu_alloc = upcxx::make_gpu_allocator(segsize).
      • However, we can trivially demand any specific GPU when constructing this device_allocator by passing the device ID as the optional second argument.
    2. Regardless of how the device is chosen, gpu_alloc.device_id() can be used to query the GPU device ID in use by UPC++ at any time after device_allocator construction. (Both pieces are sketched below.)
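
    A minimal sketch of both pieces (the helper name, segment size, and device choice are placeholders; upcxx::init() is assumed to have already run):

    #include <cstddef>
    #include <upcxx/upcxx.hpp>

    void demo_upcxx_gpu_selection(std::size_t segsize, int desired_gpu) {
      // Explicit selection: pass the device ID as the optional second argument
      // instead of relying on automatic selection.
      auto gpu_alloc = upcxx::make_gpu_allocator(segsize, desired_gpu);

      // Query: device_id() reports the GPU actually in use, whether it was
      // chosen automatically or explicitly.
      upcxx::experimental::say() << "UPC++ using GPU " << gpu_alloc.device_id();

      // The real example keeps its allocator alive; this sketch destroys it
      // only because it would otherwise go out of scope here.
      gpu_alloc.destroy();
    }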

    Kokkos unfortunately seems to have far fewer options for programmatic control of GPU device ID:

    1. --kokkos-device-id=X (docs) can be passed on the command-line to force a particular GPU ID.
      • Unfortunately that alone is actually worse than setting ROCR_VISIBLE_DEVICES (or CUDA_VISIBLE_DEVICES), because it only affects the behavior of one library.
    2. The only mechanism I can find for programmatically setting the Kokkos device ID is via Kokkos::InitializationSettings::set_device_id() (added in Kokkos 3.7; see the sketch at the end of this comment). However,
      • IIUC the Kokkos device ID can only be set during initial library initialization, and
      • IIUC using this overload of Kokkos::initialize() also disables all the other Kokkos command-line argument parsing.
    3. Regardless of how the Kokkos device selection is made, as far as I can tell there is no programmatic query in the public API to report which device ID Kokkos is currently using.
      • Lacking such a query, we have no reliable way to let Kokkos make its own device selection and programmatically set UPC++ to a matching device.

    So I'm not seeing a good programmatic solution at the moment.
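
    For concreteness, a minimal sketch of the one wiring these APIs do seem to allow (UPC++ selects the device, then Kokkos is forced to match), assuming Kokkos >= 3.7; the segment size is a placeholder, and note the caveat that this Kokkos::initialize() overload skips --kokkos-* command-line parsing:

    #include <Kokkos_Core.hpp>
    #include <upcxx/upcxx.hpp>

    int main() {
      upcxx::init();

      // Let UPC++ pick a GPU (or pass an explicit ID), then query its choice.
      auto gpu_alloc = upcxx::make_gpu_allocator(4*1024*1024 /* placeholder segsize */);
      int dev = gpu_alloc.device_id();

      // Force Kokkos onto the same device. Caveat: this overload does not
      // parse Kokkos command-line options such as --kokkos-device-id.
      Kokkos::initialize(Kokkos::InitializationSettings().set_device_id(dev));

      // ... application work using gpu_alloc and Kokkos kernels ...

      Kokkos::finalize();
      gpu_alloc.destroy();
      upcxx::finalize();
      return 0;
    }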

  2. Dan Bonachea reporter

    For future reference, the Kokkos code that decides which GPU to use based on many potential inputs lives in their internal function Kokkos::Impl::get_gpu().

    This function does not appear in a Kokkos public header. We could potentially #include <impl/Kokkos_DeviceManagement.hpp> to get it, but that approach would almost certainly carry Kokkos version dependencies, since the Kokkos initialization path was overhauled recently in 3.7.

    Internally, Kokkos squirrels the device ID away in its Kokkos::Impl::{CUDA,HIP}Internal classes, which provide no documented query function. However, both Kokkos backends execute a cudaSetDevice()/hipSetDevice() with that value at initialization and never manipulate it again (at least in Kokkos v3.7.01, and apparently back at least as far as 3.0.0). That seems like a dubious implementation strategy on Kokkos' part, but it is one we can exploit by simply retrieving the Kokkos device ID from hipGetDevice() / cudaGetDevice() after Kokkos::initialize(), as sketched below. So far this sounds like the most robust option.
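
    A sketch of that query (the helper name is ours; assumes UPC++ and Kokkos are built for the same GPU backend and that Kokkos::initialize() has already been called):

    #include <Kokkos_Core.hpp>
    #if defined(KOKKOS_ENABLE_CUDA)
    #include <cuda_runtime.h>
    #elif defined(KOKKOS_ENABLE_HIP)
    #include <hip/hip_runtime.h>
    #endif

    // Return the device ordinal that Kokkos selected via cudaSetDevice() /
    // hipSetDevice() during Kokkos::initialize().
    int kokkos_device_id() {
      int dev = -1;
    #if defined(KOKKOS_ENABLE_CUDA)
      cudaGetDevice(&dev);
    #elif defined(KOKKOS_ENABLE_HIP)
      hipGetDevice(&dev);
    #endif
      return dev;
    }

    // e.g., after Kokkos::initialize(argc, argv):
    //   gpu_alloc = upcxx::make_gpu_allocator(segsize, kokkos_device_id());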

  3. Paul Hargrove

    one that we can exploit by simply retrieving the Kokkos device ID from hipGetDevice() / cudaGetDevice() after Kokkos::initialize().

    Makes me wonder if we want a "spelling" of make_gpu_allocator for the current device? Something like make_gpu_allocator<current_cuda_device>(segsize) and even make_gpu_allocator<current_device>(segsize).

  4. Daniel Waters

    The device used by the Kokkos runtime can be obtained and used for UPC++ device allocation as shown below:

    // After Kokkos::initialize(), query the device in use by the Kokkos
    // default execution space for the enabled GPU backend:
    #ifdef KOKKOS_ENABLE_CUDA
      int device = Kokkos::Cuda().cuda_device();
    #else // KOKKOS_ENABLE_HIP
      int device = Kokkos::Experimental::HIP().hip_device();
    #endif
    size_t segsize = ...;  // segment size chosen by the application
    // Build the UPC++ device segment on the same GPU Kokkos is using:
    gpu_alloc = upcxx::make_gpu_allocator(segsize, device);
    

  5. Dan Bonachea reporter

    @Daniel Waters: Kokkos::Cuda().cuda_device() and Kokkos::Experimental::HIP().hip_device() don't seem to appear in the Kokkos public API documentation, but they do appear to be stable across recent Kokkos versions.

    Did you receive any promise of future stability or pointer to hidden documentation from the maintainers for these queries? Or are we just guessing those are more stable than their practice of cudaSetDevice()/hipSetDevice()?

  6. Paul Hargrove

    I know past stability is not a guarantee for the future (and I'd much rather have their team comment on this than just guessing), but at least on the Cuda side the interfaces have been there since late 2018.

    I find that Kokkos::Cuda().cuda_device() (and Kokkos::Cuda().cuda_stream()) were added in Nov 2018: https://github.com/kokkos/kokkos/commit/dacdffa74

    At least Trilinos (the parent project of Kokkos, IIUC) has been using Kokkos::Cuda().cuda_device() since 2020 to query which GPU Kokkos is using: https://github.com/trilinos/Trilinos/pull/6840/commits/25d209a0

  7. Dan Bonachea reporter

    @Daniel Waters: given the evidence Paul uncovered, it's probably safe to rely on these in our example.

    The actual change should include a comment clarifying why this is needed, with a link to this issue.
