Implicit ~device_allocator may make CUDA calls after CUDA is "deinitialized"

Issue #633 new
Paul Hargrove created an issue

I recently attempted to run sympack2D_cuda on an Ubuntu 24.04 system with the distro-provided GCC (13.2) and CUDA toolchain (12.0). The result is a crash after return from main(), with the following message from every rank:

UPC++ CUDA call failed:
on process 0 (cgpu-1)
at
[prefix]/upcxx.assert1.optlev0.dbgsym1.gasnet_seq.ibv/include/upcxx/cuda_internal.hpp:56

cuCtxPushCurrent(ctx)
error=:

The attached backtrace shows the following sequence:

__run_exit_handlers() is called sometime after return from main()
    ~device_allocator()
        ~device_allocator_core()
            release()
                context()
                    cuCtxPushCurrent()  returns CUDA_ERROR_DEINITIALIZED
                    fatalerror()

In lieu of symPACK, the following is sufficient to reproduce on this system:

#include <upcxx/upcxx.hpp>
upcxx::device_allocator<upcxx::cuda_device> gpu_allocator;  // global variable is prereq
int main(void) {
  upcxx::init();
  gpu_allocator = upcxx::make_gpu_allocator<upcxx::gpu_default_device>(32*1024*1024);
  upcxx::finalize();
  return 0;
}

Comments (1)

  1. Paul Hargrove reporter

    The UPC++ specification permits an application to elide explicit destruction of device_allocator objects.
    So, this is not a flaw in symPACK or the reproducer.

    The problem is that the code to release the allocated device memory was written without allowing for the possibility that CUDA had already been finalized/deinitialized. Because there is no API call to do that, this was a reasonable assumption, but (for whatever reason) it has now been shown not to always hold.

    This has not been observed or reproduced on an HPE Cray EX system with newer CUDA (but older GCC), so the preconditions for this issue to arise remain unclear.

    The long-term fix I recommend is to revise upcxx::detail::device_allocator_core<upcxx::cuda_device>::release() to tolerate the CUDA_ERROR_DEINITIALIZED error return.
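
    As a rough illustration of what "tolerate" could mean here, the sketch below treats a deinitialized driver as "nothing left to free" while still reporting any other failure. It is not the UPC++ implementation: MockCUresult, ReleaseOutcome, and release_tolerant() are hypothetical names, and the enum is a stand-in for CUresult from <cuda.h> so that the logic compiles without CUDA installed. The push and free steps are passed in as callables for the same reason.

```cpp
#include <cassert>

// Stand-in for the CUDA driver API's CUresult (hypothetical mock so this
// sketch builds without <cuda.h>). The numeric values match the real enum.
enum MockCUresult {
  MOCK_CUDA_SUCCESS = 0,
  MOCK_CUDA_ERROR_DEINITIALIZED = 4,
  MOCK_CUDA_ERROR_INVALID_CONTEXT = 201
};

// Hypothetical result type for a release attempt.
enum class ReleaseOutcome { released, skipped_driver_gone, failed };

// Hypothetical tolerant release: push_ctx stands in for cuCtxPushCurrent(ctx),
// free_segment stands in for whatever frees the device segment.
template <typename PushFn, typename FreeFn>
ReleaseOutcome release_tolerant(PushFn push_ctx, FreeFn free_segment) {
  MockCUresult r = push_ctx();
  if (r == MOCK_CUDA_ERROR_DEINITIALIZED) {
    // The driver was torn down before this destructor ran (for example from
    // an exit handler), so skip the free instead of aborting.
    return ReleaseOutcome::skipped_driver_gone;
  }
  if (r != MOCK_CUDA_SUCCESS) {
    // Any other error is still a real failure; real code would report it.
    return ReleaseOutcome::failed;
  }
  assert(r == MOCK_CUDA_SUCCESS);  // context is current; safe to free
  free_segment();
  return ReleaseOutcome::released;
}
```

    The key design choice is that only the one specific error code is swallowed, so genuine context errors during normal operation would still be surfaced.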

    Until this is fixed in UPC++, the recommended work-around is to make explicit .destroy() calls for upcxx::device_allocator<upcxx::cuda_device> objects. Though this is not required by the UPC++ specification, it is sufficient to avoid this issue.
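
    Applied to the reproducer above, the work-around might look like the following sketch. The only change is the destroy() call placed before upcxx::finalize(); the placeholder comment marks where application work is assumed to go. It needs a CUDA-enabled UPC++ install to build and run.

```cpp
#include <upcxx/upcxx.hpp>

// Global allocator, as in the reproducer.
upcxx::device_allocator<upcxx::cuda_device> gpu_allocator;

int main(void) {
  upcxx::init();
  gpu_allocator = upcxx::make_gpu_allocator<upcxx::gpu_default_device>(32*1024*1024);

  // ... application work using gpu_allocator ...

  // Explicitly release the device segment while UPC++ and CUDA are still
  // live, so the implicit destructor at exit has nothing left to release.
  gpu_allocator.destroy();

  upcxx::finalize();
  return 0;
}
```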
