RFE: Add device UUID to `Device::kind_info()`
I often find myself concerned with GPU binding, for example in our CI testing. Given the number of variations on batch systems, and each center's configuration thereof, validating binding can be challenging.
I have found the inclusion of variables like `CUDA_VISIBLE_DEVICES` and `ROCR_VISIBLE_DEVICES` in `upcxx::kind_info()` to be helpful. However, these environment variables are not the only means by which a resource manager such as Slurm can implement GPU binding. For instance, I believe that on Perlmutter, cgroups are used to limit device visibility, and `CUDA_VISIBLE_DEVICES` and `NVIDIA_VISIBLE_DEVICES` are not set.
This is a feature request to include a unique identifier in the result of `upcxx::kind_info()`. At a minimum, the task of validating disjoint GPU binding demands something unique with node scope (such as the device ordinal or even a PCI bus address). However, I believe all three GPU APIs we support provide a "UUID" of one sort or another, as well as command-line tools (such as `nvidia-smi -L`, `rocm-smi --showuniqueid`, and `clinfo -a | grep UUID`) that could be used to map that identifier to a device ordinal (to help verify binding to the expected GPU(s)). So I think the same ID used by those tools would be the preferred option.
Comments (2)
- changed status to resolved
Resolved in Spec PR 103 and Impl PR 495 merged at 0ebd4d5
Proposed resolution now appears in Spec PR 103 and Impl PR 495