kokkos_3dhalo example fails on multi-GPU nodes due to device mismatch
Running the kokkos_3dhalo example on develop @ 795a2934 on crusher (whose nodes have 8 GPUs) with multiple PPN and no GPU restrictions generates failures like the following:
$ srun -N 1 -n 4 -l upcxx_heat_conduction
0: NumRanks: 4 Me: 0 Grid: 1 2 2 MyPos: 0 0 0
0: Me: 0 MyNeighs: -1 -1 -1 1 -1 2
0: My Domain: 0 (0 0 0) (100 50 50)
1: NumRanks: 4 Me: 1 Grid: 1 2 2 MyPos: 0 1 0
1: Me: 1 MyNeighs: -1 -1 0 -1 -1 3
1: My Domain: 1 (0 50 0) (100 100 50)
2: NumRanks: 4 Me: 2 Grid: 1 2 2 MyPos: 0 0 1
2: Me: 2 MyNeighs: -1 -1 -1 3 0 -1
2: My Domain: 2 (0 0 50) (100 50 100)
3: NumRanks: 4 Me: 3 Grid: 1 2 2 MyPos: 0 1 1
3: Me: 3 MyNeighs: -1 -1 2 -1 1 -1
3: My Domain: 3 (0 50 50) (100 100 100)
3: :0:rocdevice.cpp :2614: 1558771782034 us: 95644: [tid:0x7fffe32df700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_MEMORY_FAULT: Agent attempted to access an inaccessible address. code: 0x2b
3: *** Caught a fatal signal (proc 3): SIGABRT(6)
[crash output truncated]
The same is true for many other multi-PPN layouts (including single- or multi-node). In all cases that I've seen, appending srun option --gpus-per-task=1 resolves the crash (and this is the recommended workaround). Similarly, the problem is resolved by setting ROCR_VISIBLE_DEVICES=X inside the srun command, where X is any value in [0,7].
This behavior is unintuitive at first glance, because the workarounds mentioned above reduce the number of GPUs visible to each application process from the default of all 8 GPUs to just a single GPU per process. I believe the root cause here is:
There is no explicit handshake in the code or run instructions to ensure Kokkos and UPC++ are using the same GPU. Moreover, when each process has visibility to more than 1 GPU and PPN > 1, the algorithms used by Kokkos and UPC++ to choose a default GPU arrive at different answers!
One can observe this effect by running with app command line option --kokkos-print-configuration (which dumps info including the GPU device ID selected by Kokkos), and adding a debugging line like:
upcxx::experimental::say() << "UPC++ using GPU " << gpu_alloc.device_id();
For many multi-PPN job layouts where more than 1 GPU is visible to each process, these outputs reveal the two libraries arriving at a different selection of default GPU device ID. Not surprisingly, this is exactly the case where the program crashes: the program allocates device buffers from the UPC++ device segment on one GPU and passes those device pointers to Kokkos, which attempts to perform data packing or launch kernels accessing those buffers, expecting them to live on a different GPU.
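To make the failure mode concrete, here is a toy model of two libraries that each pick a default GPU independently. The two policies (pick_first and pick_round_robin) are hypothetical stand-ins chosen for illustration; they are not claimed to be the actual selection algorithms used by Kokkos or UPC++. The point is only that any two differing policies can disagree once more than one device is visible, and must agree when exactly one device is visible.

```cpp
#include <cassert>

// Hypothetical policy A (illustration only): always take the first visible device.
int pick_first(int /*local_rank*/, int visible_devices) {
  assert(visible_devices > 0);
  return 0;
}

// Hypothetical policy B (illustration only): spread local ranks round-robin
// across the visible devices.
int pick_round_robin(int local_rank, int visible_devices) {
  assert(visible_devices > 0);
  return local_rank % visible_devices;
}

// Count the local ranks on which the two policies select different devices.
int count_mismatches(int ppn, int visible_devices) {
  int mismatches = 0;
  for (int rank = 0; rank < ppn; ++rank) {
    if (pick_first(rank, visible_devices) !=
        pick_round_robin(rank, visible_devices)) {
      ++mismatches;
    }
  }
  return mismatches;
}
```

In this model, count_mismatches(4, 8) reports disagreement on every local rank except rank 0, while count_mismatches(4, 1) reports none, which mirrors why restricting each process to a single visible GPU hides the problem.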
IMO we need to explore either upgrading the code to ensure both libraries "agree" on the GPU in use at each process, or carefully document how to avoid this pitfall.
Comments (9)
-
reporter -
reporter For future reference, the Kokkos code that decides which GPU to use based on many potential inputs lives in their internal function Kokkos::Impl::get_gpu().
This function does not appear in a Kokkos public header. We could potentially #include <impl/Kokkos_DeviceManagement.hpp> to get it, but there would almost certainly be Kokkos version dependencies to that approach since the Kokkos initialization path got a recent overhaul in 3.7.
Internally it appears that Kokkos squirrels the device ID away into their Kokkos::Impl::{CUDA,HIP}Internal classes, which provide no documented query function. However both Kokkos backends execute a cudaSetDevice() / hipSetDevice() to the value at initialization and never manipulate it again (at least in Kokkos v3.7.01, and apparently back at least as far as 3.0.0); which seems like a dubious implementation strategy for the Kokkos library, but one that we can exploit by simply retrieving the Kokkos device ID from hipGetDevice() / cudaGetDevice() after Kokkos::initialize(). So far this sounds like the most robust option.
-
one that we can exploit by simply retrieving the Kokkos device ID from hipGetDevice() / cudaGetDevice() after Kokkos::initialize().
Makes me wonder if we want a "spelling" for make_gpu_allocator for the current device? Something like make_gpu_allocator<current_cuda_device>(segsize) and even make_gpu_allocator<current_device>(segsize)
-
The device used by the Kokkos runtime can be obtained and used for UPC++ device allocation as shown below:
#ifdef KOKKOS_ENABLE_CUDA
int device = Kokkos::Cuda().cuda_device();
#else // KOKKOS_ENABLE_HIP
int device = Kokkos::Experimental::HIP().hip_device();
#endif
size_t segsize = ...
gpu_alloc = upcxx::make_gpu_allocator(segsize, device);
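As a defensive complement, one could also verify at startup that both libraries ended up on the same GPU. The helper below is a minimal sketch; require_same_gpu is a hypothetical name introduced here and is not part of Kokkos or UPC++. It takes the two device IDs as plain integers so it does not depend on either library's headers.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Kokkos or UPC++): throws if the two
// libraries report different GPU device IDs on this process.
inline void require_same_gpu(int kokkos_device, int upcxx_device) {
  if (kokkos_device != upcxx_device) {
    throw std::runtime_error(
        "GPU mismatch: Kokkos uses device " + std::to_string(kokkos_device) +
        " but UPC++ uses device " + std::to_string(upcxx_device));
  }
}
```

Assuming the queries discussed in this thread, a call such as require_same_gpu(device, gpu_alloc.device_id()) placed right after the allocator is constructed would turn a later memory fault into an immediate, readable error.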
-
reporter @Daniel Waters : Kokkos::Cuda().cuda_device() and Kokkos::Experimental::HIP().hip_device() don't seem to appear in the Kokkos public API documentation, but they do appear to be stable for recent Kokkos versions.
Did you receive any promise of future stability or pointer to hidden documentation from the maintainers for these queries? Or are we just guessing those are more stable than their practice of cudaSetDevice() / hipSetDevice()?
-
I know past stability is not a guarantee for the future (and I'd much rather have their team comment on this than just guessing), but at least on the Cuda side the interfaces have been there since late 2018.
I find Kokkos::Cuda().cuda_device() (and Kokkos::Cuda().cuda_stream()) were added in Nov 2018: https://github.com/kokkos/kokkos/commit/dacdffa74
At least Trilinos (parent project of Kokkos, iiuc) has been using Kokkos::Cuda().cuda_device() since 2020 to query which GPU Kokkos is using: https://github.com/trilinos/Trilinos/pull/6840/commits/25d209a0
-
@Dan Bonachea I received no guarantee about those queries
-
reporter @Daniel Waters given the evidence Paul uncovered it's probably safe to rely on these in our example.
The actual change should include a comment clarifying why this is needed with a link to this issue.
-
reporter - changed status to resolved
Resolved in extras PR 48
Some initial thoughts/observations:
UPC++ provides programmatic flexibility in device ID:
gpu_alloc = upcxx::make_gpu_allocator(segsize)
.device_allocator by passing the device ID as the optional second argument.
gpu_alloc.device_id() can be used to query the GPU device ID in-use by UPC++ any time after device_allocator construction
Kokkos unfortunately seems to have far fewer options for programmatic control of GPU device ID:
--kokkos-device-id=X (docs) can be passed on the command-line to force a particular GPU ID.
ROCR_VISIBLE_DEVICES (or CUDA_VISIBLE_DEVICES), because it only affects the behavior of one library.
Kokkos::initialize() also disables all the other Kokkos command-line argument parsing.
So I'm not seeing a good programmatic solution at the moment.