Implicit ~device_allocator may make CUDA calls after CUDA is "deinitialized"
Issue #633
new
I recently attempted to run sympack2D_cuda on an Ubuntu 24.04 system with the distro-provided GCC (13.2) and CUDA toolchain (12.0). The result is a crash after return from main() with the following message from every rank:
UPC++ CUDA call failed:
on process 0 (cgpu-1)
at
[prefix]/upcxx.assert1.optlev0.dbgsym1.gasnet_seq.ibv/include/upcxx/cuda_internal.hpp:56
cuCtxPushCurrent(ctx)
error=:
The attached backtrace shows the following sequence:
__run_exit_handlers() is called sometime after return from main()
~device_allocator()
~device_allocator_core()
release()
context()
cuCtxPushCurrent() returns CUDA_ERROR_DEINITIALIZED
fatalerror()
In lieu of symPACK, the following is sufficient to reproduce on this system:
#include <upcxx/upcxx.hpp>

upcxx::device_allocator<upcxx::cuda_device> gpu_allocator; // global variable is prereq

int main(void) {
    upcxx::init();
    gpu_allocator = upcxx::make_gpu_allocator<upcxx::gpu_default_device>(32*1024*1024);
    upcxx::finalize();
    return 0;
}
Comments (1)
reporter
The UPC++ specification permits an application to elide explicit destruction of device_allocator objects. So, this is not a flaw in symPACK or the reproducer.
The problem is that the code that releases the allocated device memory was written without allowing for the possibility that CUDA had already been finalized/deinitialized. Since there is no API call to deinitialize CUDA explicitly, this was a reasonable assumption, which (for whatever reason) has now been shown not to always hold.
This has not been observed or reproduced on an HPE Cray EX system with newer CUDA (but older GCC), so the preconditions for this issue remain unclear.
The long-term fix I recommend is to revise upcxx::detail::device_allocator_core<upcxx::cuda_device>::release() to tolerate the CUDA_ERROR_DEINITIALIZED error return.
Until this is fixed in UPC++, the recommended work-around is to make explicit .destroy() calls for upcxx::device_allocator<upcxx::cuda_device> objects. Though this is not required by the UPC++ specification, it is sufficient to avoid this issue.
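To illustrate the shape of the long-term fix, here is a minimal, self-contained sketch of a release() that tolerates a "deinitialized" error. It is not the UPC++ implementation: MockResult, mock_driver_alive, the mock_* functions, and MockAllocatorCore are all hypothetical stand-ins so the sketch builds without CUDA or UPC++. It assumes that once the driver reports it is deinitialized there is nothing left to free, so the safe response is to skip the remaining calls.

// Hypothetical stand-ins throughout; none of these names come from UPC++ or CUDA.
#include <cstdio>
#include <cstdlib>

// Mock result codes (not the real CUresult values).
enum class MockResult { Success, Deinitialized, OtherError };

// Set to false to mimic the driver being torn down before static destructors run.
static bool mock_driver_alive = true;

// Mock of a context push such as cuCtxPushCurrent(): fails once the driver is gone.
MockResult mock_ctx_push() {
    return mock_driver_alive ? MockResult::Success : MockResult::Deinitialized;
}

// Mocks of the device-memory free and the matching context pop.
MockResult mock_mem_free() {
    return mock_driver_alive ? MockResult::Success : MockResult::Deinitialized;
}
MockResult mock_ctx_pop() {
    return mock_driver_alive ? MockResult::Success : MockResult::Deinitialized;
}

// Any other failure stays fatal, mirroring the existing behavior.
[[noreturn]] void mock_fatal(const char* what) {
    std::fprintf(stderr, "mock CUDA call failed: %s\n", what);
    std::abort();
}

class MockAllocatorCore {
public:
    explicit MockAllocatorCore(bool active) : active_(active) {}
    MockAllocatorCore(const MockAllocatorCore&) = delete;
    MockAllocatorCore& operator=(const MockAllocatorCore&) = delete;
    ~MockAllocatorCore() { release(); }

    // Returns true if the segment was freed here, false if there was nothing
    // to do (already released, or the driver was already deinitialized).
    bool release() {
        if (!active_) return false;
        active_ = false;

        MockResult r = mock_ctx_push();
        if (r == MockResult::Deinitialized) {
            // Driver already shut down: skip the remaining calls instead of aborting.
            return false;
        }
        if (r != MockResult::Success) mock_fatal("context push");

        if (mock_mem_free() != MockResult::Success) mock_fatal("memory free");
        if (mock_ctx_pop() != MockResult::Success) mock_fatal("context pop");
        return true;
    }

private:
    bool active_;  // true while this object still owns a device segment
};

In this sketch, a MockAllocatorCore destroyed while mock_driver_alive is false returns quietly from release(), whereas any other error still aborts. Whether the real release() should also tolerate the same error from later calls in the sequence is a design question this sketch does not settle.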