`cuda_device::destroy()` incorrectly perturbs CUDA Driver context stack

UPC++ performs all of its CUDA-relevant calls through the CUDA Driver API, which includes a stack abstraction for CUcontext device contexts. The runtime retains a reference to the CUDA device's primary context on each cuda_device open via cuDevicePrimaryCtxRetain(), and it carefully pushes and pops that context around every CUDA-relevant call for the given device.

Unfortunately, the cleanup sequence in cuda_device::destroy() for an active cuda_device is currently flawed and incorrectly perturbs the CUDA Driver API context stack. Specifically, calling cuda_device::destroy() on an active device incorrectly pops the top context off the context stack, a context which is not owned by the UPC++ runtime. Consequently, user codes written using the CUDA Driver API may observe a discrepancy in the context stack after calling cuda_device::destroy().
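To illustrate the effect described above, here is a minimal toy model in plain C++. It is only a sketch: `ContextStack`, `balanced_call`, `unbalanced_cleanup`, and `current_after` are hypothetical names invented for this illustration, the context names are placeholder strings, and nothing here is taken from the UPC++ implementation or calls the real CUDA Driver API. It only models the observable behavior: a balanced push/pop leaves the user's context on top, while one pop without a matching push removes a context the library never pushed.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for the CUDA Driver context stack (NOT the real API).
struct ContextStack {
    std::vector<std::string> stack;

    // Models pushing a context (the role cuCtxPushCurrent plays in the driver API).
    void push(const std::string& ctx) { stack.push_back(ctx); }

    // Models popping the top context (the role cuCtxPopCurrent plays).
    std::string pop() {
        assert(!stack.empty());
        std::string top = stack.back();
        stack.pop_back();
        return top;
    }

    // Returns the context on top of the stack, or "" if the stack is empty.
    std::string current() const { return stack.empty() ? std::string() : stack.back(); }
};

// Balanced discipline: push the library's context, do the work, pop it again.
void balanced_call(ContextStack& s) {
    s.push("library_ctx");
    // ... device work would happen here ...
    s.pop();
}

// Hypothetical model of the defect's effect: a pop with no matching push,
// which removes whatever context the caller had on top of the stack.
void unbalanced_cleanup(ContextStack& s) {
    s.pop();
}

// Returns the context the user sees as current after the modeled sequence.
std::string current_after(bool include_unbalanced_cleanup) {
    ContextStack s;
    s.push("user_ctx");          // context owned by the user's driver-API code
    balanced_call(s);            // leaves the stack as it found it
    if (include_unbalanced_cleanup) {
        unbalanced_cleanup(s);   // pops the user's context by mistake
    }
    return s.current();
}
```

In this model, `current_after(false)` still reports the user's context, whereas `current_after(true)` reports none. That is the kind of discrepancy a driver-API user could see after the flawed cleanup.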

This defect should NOT affect any code executing before the first call to cuda_device::destroy() (which is commonly performed near job termination). This defect is NOT believed to affect codes that only invoke the (more commonly used) CUDA Runtime API, which maintains its own separate notion of active device.

I've confirmed via source inspection that this defect dates back to the 2020.11.0 prototype release (which notably rewrote the cuda_device implementation to support multiple opens of the same device). The 2020.10.0 release and earlier versions were not affected.
