Update gpu_microbenchmark to use `upcxx::make_gpu_allocator()`

To the best of my knowledge, all our examples of GPU use are unconditionally passing 0 as the device number.

On a system with NVIDIA's default settings, multiple processes per node can open and share a GPU. This means that somebody running our cuda_microbenchmark on a pair of dual-GPU nodes with two processes per node may expect one process per GPU, but both processes on each node will instead silently share a single GPU, with correspondingly reduced performance.

On a system where the GPUs have been placed in a process-exclusive mode (Summit in particular), the scenario above can instead result in a failure to open the device when running with more than one process and more than one GPU per "resource set".

On Summit the resolution is to request that jsrun construct resource sets with one process and one GPU in each. Similarly, on other systems there is usually some way to have the job launcher set CUDA_VISIBLE_DEVICES for each process. However, this is not always convenient, nor is it strictly necessary, since the application should have sufficient information to spread processes over GPUs.

This issue is a request for (at least) an example of logic which performs "coordination" within the local_team to "do the right thing" when CUDA_VISIBLE_DEVICES contains multiple values and the local_team has more than one member. While "the right thing" is debatable, the simplest thing is probably to use the rank in local_team to select a value from CUDA_VISIBLE_DEVICES, and either "wrap" or fail if there are too few entries. Something like "blocked" assignment would also be possible.
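As a starting point for discussion, here is a minimal sketch of the selection arithmetic only, written without any UPC++ or CUDA dependency. The names `Assignment`, `split_visible_devices`, and `select_device_index` are hypothetical and exist only for illustration. The sketch returns a position in the CUDA_VISIBLE_DEVICES list rather than the entry itself, on the assumption that the CUDA runtime renumbers visible devices from 0 in list order. It also treats "fail" as a team-wide decision (too few entries for the whole team), so that every rank reaches the same outcome.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper names for illustration; not part of UPC++ or CUDA.
enum class Assignment { Cyclic, Blocked };

// Split a CUDA_VISIBLE_DEVICES-style string such as "0,1,3" into its entries.
// Empty fields (e.g. from a trailing comma) are skipped.
std::vector<std::string> split_visible_devices(const std::string& value) {
    std::vector<std::string> entries;
    std::istringstream in(value);
    std::string item;
    while (std::getline(in, item, ',')) {
        if (!item.empty()) entries.push_back(item);
    }
    return entries;
}

// Choose a position in the visible-device list for one process.
//   local_rank, local_size: rank and size within the node-local team
//   device_count:           number of visible devices
//   allow_wrap:             if false, report failure when there are fewer
//                           devices than processes in the team
// Returns an index in [0, device_count), or -1 on failure.
int select_device_index(int local_rank, int local_size, int device_count,
                        Assignment policy, bool allow_wrap) {
    assert(local_rank >= 0 && local_rank < local_size);
    if (device_count <= 0) return -1;
    if (!allow_wrap && device_count < local_size) return -1;

    if (policy == Assignment::Blocked) {
        // Consecutive ranks share a device: ceil(local_size / device_count)
        // ranks per device.
        int per_device = (local_size + device_count - 1) / device_count;
        return local_rank / per_device;
    }
    // Cyclic: rank r uses device r modulo device_count.
    return local_rank % device_count;
}
```

In a UPC++ program, `local_rank` and `local_size` would presumably come from `upcxx::local_team().rank_me()` and `upcxx::local_team().rank_n()`, and the result would replace the hard-coded 0 wherever the device is opened. The device count could be taken from the parsed environment variable or from `cudaGetDeviceCount()`; the latter also covers the case where CUDA_VISIBLE_DEVICES is unset.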

At a minimum, it would be nice to have a simple example, but ultimately my observation regarding cuda_microbenchmark (and our other tests) should probably be addressed by including some improved logic in them.
