Update gpu_microbenchmark to use `upcxx::make_gpu_allocator()`

Issue #475 resolved
Paul Hargrove created an issue

To the best of my knowledge, all our examples of GPU use are unconditionally passing 0 as the device number.

On a system with NVIDIA's default settings, multiple processes per node can open and share a GPU. This means that somebody running our cuda_microbenchmark on a pair of dual-GPU nodes with two processes per node may expect one process per GPU, but will instead silently use only one GPU per node and see correspondingly reduced performance.

On a system (like Summit in particular) where the GPUs have been placed in a process-exclusive mode, the scenario above can result in a failure to open the device if running with more than one GPU and process per "resource set".

On Summit the resolution is to request that jsrun construct resource sets with one process and one GPU in each. Similarly, on other systems there is usually some way to have the job launcher set CUDA_VISIBLE_DEVICES. However, this is not always convenient, nor is it strictly necessary, since the application should have sufficient information to spread processes over GPUs.

This issue is a request for (at least) an example of logic which performs "coordination" within the local_team to "do the right thing" when CUDA_VISIBLE_DEVICES contains multiple values and the team has more than one member. While "the right thing" is debatable, the simplest thing is probably to use the rank in local_team to select a value from CUDA_VISIBLE_DEVICES, and either "wrap" or fail if there are too few entries. Something like "blocked" assignment would also be possible.
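
For concreteness, here is a minimal sketch of the selection arithmetic only, not a proposed implementation. The function names (parse_visible_devices, pick_device_cyclic, pick_device_blocked) and the -1 failure convention are hypothetical. In a UPC++ program the rank and size would presumably come from upcxx::local_team().rank_me() and upcxx::local_team().rank_n(). The result is an index into the CUDA_VISIBLE_DEVICES list, which matches the device ordinal the CUDA runtime exposes, since visible devices are renumbered from 0.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a CUDA_VISIBLE_DEVICES-style value ("0,1,3")
// into its entries. Entries stay as strings because the variable may also
// hold GPU UUIDs rather than integers. Empty entries are skipped.
std::vector<std::string> parse_visible_devices(const std::string& value) {
  std::vector<std::string> entries;
  std::istringstream in(value);
  std::string item;
  while (std::getline(in, item, ',')) {
    if (!item.empty()) entries.push_back(item);
  }
  return entries;
}

// Hypothetical helper: cyclic assignment, where rank r uses list index
// r % num_devices. If allow_wrap is false and there are more ranks than
// devices, return -1 so the caller can fail instead of sharing a GPU.
int pick_device_cyclic(int local_rank, int local_size, int num_devices,
                       bool allow_wrap) {
  assert(local_rank >= 0 && local_rank < local_size);
  if (num_devices <= 0) return -1;
  if (!allow_wrap && local_size > num_devices) return -1;
  return local_rank % num_devices;
}

// Hypothetical helper: blocked assignment, where consecutive ranks share a
// device. Each device serves at most ceil(local_size / num_devices) ranks.
int pick_device_blocked(int local_rank, int local_size, int num_devices) {
  assert(local_rank >= 0 && local_rank < local_size);
  if (num_devices <= 0) return -1;
  int ranks_per_device = (local_size + num_devices - 1) / num_devices;
  return local_rank / ranks_per_device;
}
```

With four processes in the local_team and two visible devices, the cyclic rule maps ranks 0..3 to indices 0, 1, 0, 1, while the blocked rule maps them to 0, 0, 1, 1. Which of the two is preferable likely depends on how the launcher places ranks relative to the GPUs.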

At a minimum, it would be nice to have a simple example, but ultimately my observation regarding cuda_microbenchmark (and our other tests) should probably be addressed by including some improved logic in them.

Comments (3)

  1. Dan Bonachea

    The deployment of upcxx::make_gpu_allocator() in 2022.3.0 makes this very easy, and its use in multi-GPU systems is covered by the updated PG examples and example/gpu_vecadd.

    I think the only remaining task is to update gpu_microbenchmark to use upcxx::make_gpu_allocator()

  2. Paul Hargrove reporter

    In response to comment including

    The deployment of upcxx::make_gpu_allocator() in 2022.3.0 makes this very easy

    I feel that some context was lost with the simultaneous change to the issue's title. The previous title was "RFE: example for use of multi-GPU system". So, the phrase "the only remaining task" refers to that original goal, for which the 2022.3.0 release covered the relevant example codes except gpu_microbenchmark. Additionally, it should be noted that 2022.3.0 renamed the cuda_microbenchmark of the original description to gpu_microbenchmark, the name used in the preceding comments.
