RFE: Add device UUID to `Device::kind_info()`

Issue #608 resolved
Paul Hargrove created an issue

I often find myself concerned with GPU binding, such as in our CI testing. With the number of variations on batch systems and each center's configuration thereof, this can be challenging.

I have found the inclusion of variables like `CUDA_VISIBLE_DEVICES` and `ROCR_VISIBLE_DEVICES` in `upcxx::kind_info()` to be helpful. However, these environment variables are not the only means by which a resource manager such as Slurm can implement GPU binding. For instance, I believe that on Perlmutter cgroups are used to limit device visibility, and neither `CUDA_VISIBLE_DEVICES` nor `NVIDIA_VISIBLE_DEVICES` is set.

This is a feature request to include a unique identifier in the result of `upcxx::kind_info()`. At a minimum, validating disjoint GPU binding requires something unique with node scope, such as the device ordinal or even a PCI bus address. However, I believe all three GPU APIs we support provide a "uuid" of one sort or another, along with command-line tools (such as `nvidia-smi -L`, `rocm-smi --showuniqueid`, and `clinfo -a | grep UUID`) that could be used to map that identifier to a device ordinal, which helps verify binding to the expected GPU(s). So I think the id reported by those tools would be the preferred option.
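To illustrate the kind of check this would enable, here is a minimal sketch in C++. It assumes the usual `nvidia-smi -L` line format of `GPU <ordinal>: <name> (UUID: <id>)`. The function names `parse_gpu_uuids` and `shared_uuids` are hypothetical helpers invented for this example and are not part of UPC++. The same disjointness test could be applied to UUIDs gathered from `kind_info()` output if this request were implemented.

```cpp
#include <map>
#include <regex>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: parse text in the format printed by `nvidia-smi -L`,
// e.g. "GPU 0: <device name> (UUID: GPU-....)", into ordinal -> UUID.
// Lines that do not match (such as indented MIG entries) are skipped.
std::map<int, std::string> parse_gpu_uuids(const std::string& listing) {
    static const std::regex line_re(R"(GPU (\d+): .*\(UUID: ([^)]+)\)\s*)");
    std::map<int, std::string> result;
    std::istringstream in(listing);
    std::string line;
    while (std::getline(in, line)) {
        std::smatch m;
        if (std::regex_match(line, m, line_re)) {
            result[std::stoi(m[1].str())] = m[2].str();
        }
    }
    return result;
}

// Hypothetical helper: given the set of UUIDs visible to each process on one
// node, return every UUID that is visible to more than one process.
// An empty result means the GPU binding is disjoint.
std::set<std::string> shared_uuids(
        const std::vector<std::set<std::string>>& per_process) {
    std::map<std::string, int> owners;
    for (const auto& uuids : per_process) {
        for (const auto& uuid : uuids) {
            ++owners[uuid];
        }
    }
    std::set<std::string> shared;
    for (const auto& kv : owners) {
        if (kv.second > 1) {
            shared.insert(kv.first);
        }
    }
    return shared;
}
```

In a CI test, each process would contribute the UUIDs it can see, and the test would fail if `shared_uuids` returns a non-empty set for the processes on any one node. How the UUIDs are gathered across processes is left out of this sketch.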
