V100 Exclusive Mode Issue on Summit

Caused by commit 151b406

https://bitbucket.org/icl/papi/commits/151b40663e479ac8f140739b8f96ddd5c4dd1006

I ran into a problem while taking Kokkos measurements on Summit and using PAPI’s nvmlcap_plot tool to read power usage during the Kokkos run.

This is my setup:

./nvmlcap_plot 0 &
PAPI_PID=$!
./<kokkos_executable>
kill -20 $PAPI_PID

Since commit 151b406, I get the following error:
terminate called after throwing an instance of 'std::runtime_error'
what(): cudaDeviceSynchronize() error( cudaErrorDevicesUnavailable): all CUDA-capable devices are busy or unavailable

We think the cause is that linux-cuda.c now creates the default CUDA context in the nvmlcap_plot process. With the GPU in exclusive mode, the Kokkos executable then cannot get a context of its own.
I also tried destroying the context by calling cuCtxPopCurrent followed by cuCtxDestroy, but that did not help.
Through trial and error, I found that the cudaFree call is what triggers the error:

linux-cuda.c:576 cudaErr = (*cudaFreePtr) (NULL);

Even if I keep only the line above and comment out the rest of commit 151b406, I still get the “all CUDA-capable devices are busy or unavailable” error.

Unfortunately, I cannot reproduce this error with a simple CUDA program. It only happens when nvmlcap_plot runs as a separate process during the Kokkos run.

By the way, if you submit your job with the bsub flag “alloc_flags gpudefault”, it works fine. This flag lets multiple processes and their threads share the GPU and submit work to it simultaneously. Before commit 151b406, however, this setup also worked in the exclusive GPU mode on Summit.
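
To see which of the two cases a node is in before starting nvmlcap_plot next to the Kokkos run, the GPU compute mode can be queried with nvidia-smi. The following is only a minimal sketch and not part of PAPI: the helper names gpu_is_shareable and check_gpu_modes are made up for illustration, and it assumes that nvidia-smi is on the PATH of the compute node and reports the mode as one of the strings Default, Exclusive_Process, or Prohibited.

```shell
#!/bin/sh
# Sketch: report GPUs whose compute mode is not the shared "Default" mode.
# Helper names are hypothetical; they are not part of PAPI or CUDA.

# gpu_is_shareable MODE
# Returns 0 only for the compute-mode string that allows several
# processes to use the GPU at the same time.
gpu_is_shareable() {
    case "$1" in
        Default) return 0 ;;
        *)       return 1 ;;   # Exclusive_Process, Prohibited, or unknown
    esac
}

# check_gpu_modes
# Reads one compute-mode string per line on stdin (in GPU index order)
# and prints a line for every GPU that is not in Default mode.
# Returns 1 if at least one such GPU was found, 0 otherwise.
check_gpu_modes() {
    idx=0
    status=0
    while IFS= read -r mode; do
        # Strip stray whitespace around the mode string.
        mode=$(printf '%s' "$mode" | tr -d '[:space:]')
        if ! gpu_is_shareable "$mode"; then
            echo "GPU $idx: compute mode is '$mode', not 'Default'"
            status=1
        fi
        idx=$((idx + 1))
    done
    return $status
}

# On a compute node: query all GPUs and warn if any is not shareable.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=compute_mode --format=csv,noheader | check_gpu_modes \
        || echo "A second CUDA process may fail on the GPUs listed above."
fi
```

In the exclusive mode described above I would expect this to print one line per GPU. In a job submitted with “alloc_flags gpudefault” it should print nothing.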

