- changed milestone to 2022.9.0 release
- assigned issue to
- changed component to Memory Kinds
- changed version to 2022.3.0 release
- changed title to Update gpu_microbenchmark to use `upcxx::make_gpu_allocator()`
Update gpu_microbenchmark to use `upcxx::make_gpu_allocator()`
To the best of my knowledge, all our examples of GPU use unconditionally pass 0 as the device number.
On a system with NVIDIA's default settings, multiple processes per node can open and share a GPU. This means that somebody running our `cuda_microbenchmark` on a pair of dual-GPU nodes with two processes per node may expect one process per GPU, but will instead silently use only one GPU per node and see correspondingly reduced performance.
On a system (like Summit in particular) where the GPUs have been placed in a process-exclusive mode, the scenario above can result in a failure to open the device if running with more than one GPU and process per "resource set".
On Summit the resolution is to request that `jsrun` construct resource sets with one process and one GPU in each. Similarly, on other systems there is usually some way to have the job launcher set the `CUDA_VISIBLE_DEVICES` environment variable. However, this is not always convenient, nor is it strictly necessary, since the application should have sufficient information to spread processes over GPUs.
This issue is a request for (at least) an example of logic which performs "coordination" within the `local_team` to "do the right thing" when `CUDA_VISIBLE_DEVICES` contains multiple values and the team has more than one member. While "the right thing" is debatable, the simplest thing is probably to use the rank in `local_team` to select a value from `CUDA_VISIBLE_DEVICES`, and either "wrap" or fail if there are too few entries. Something like "blocked" assignment would also be possible.
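As a rough illustration of the selection logic described above, here is a minimal sketch in plain C++ with no UPC++ or CUDA dependency. The names `parse_visible_devices`, `pick_cyclic`, `pick_blocked`, and `OnShortage` are hypothetical and not part of any library. It assumes the caller obtains the local rank and team size from `upcxx::local_team().rank_me()` and `upcxx::local_team().rank_n()`. The functions return a position in the visible-device list rather than the listed value, on the assumption that the CUDA runtime numbers the visible devices from 0 in list order.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Split a CUDA_VISIBLE_DEVICES-style value such as "0,1,3" into its entries.
// Entries are kept as strings, since the variable may hold non-numeric IDs.
std::vector<std::string> parse_visible_devices(const std::string &value) {
  std::vector<std::string> entries;
  std::stringstream ss(value);
  std::string item;
  while (std::getline(ss, item, ',')) {
    if (!item.empty()) entries.push_back(item);
  }
  return entries;
}

// Hypothetical policy for the case of more local ranks than visible devices.
enum class OnShortage { Wrap, Fail };

// Cyclic assignment: local rank r takes list position r.
// With too few entries, either wrap around or report failure (-1).
int pick_cyclic(int local_rank, int num_devices, OnShortage policy) {
  assert(local_rank >= 0);
  if (num_devices <= 0) return -1;
  if (local_rank < num_devices) return local_rank;
  if (policy == OnShortage::Fail) return -1;
  return local_rank % num_devices;
}

// Blocked assignment: consecutive local ranks share a list position.
// Each device serves at most ceil(local_size / num_devices) ranks.
int pick_blocked(int local_rank, int local_size, int num_devices) {
  assert(local_rank >= 0 && local_rank < local_size);
  if (num_devices <= 0) return -1;
  int per_device = (local_size + num_devices - 1) / num_devices;
  return local_rank / per_device;
}
```

In a real program, the string passed to `parse_visible_devices` would presumably come from `std::getenv("CUDA_VISIBLE_DEVICES")`, with a fallback to querying the device count when the variable is unset. The cyclic policy spreads neighboring ranks across GPUs, while the blocked policy keeps neighboring ranks on the same GPU; which is preferable depends on the application's communication pattern.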
At a minimum, it would be nice to have a simple example, but ultimately my observation regarding `cuda_microbenchmark` (and our other tests) should probably be addressed by including some improved logic in them.
Comments (3)
- changed status to resolved
issue #475: Update gpu_microbenchmark to use `upcxx::make_gpu_allocator()`
Resolves issue #475. → <<cset c6b71eebbf98>>
-
reporter In response to comment including
The deployment of upcxx::make_gpu_allocator() in 2022.3.0 makes this very easy
I feel that some context was lost with the simultaneous change to the issue's title. The previous title was "RFE: example for use of multi-GPU system". So, the phrase "the only remaining task" refers to that original goal, for which the 2020.3.0 release covered the relevant example codes except `gpu_microbenchmark`. Additionally, it should be noted that 2022.3.0 renamed the `cuda_microbenchmark` in the original description to the `gpu_microbenchmark` name used in the preceding comments.
The deployment of `upcxx::make_gpu_allocator()` in 2022.3.0 makes this very easy, and its use in multi-GPU systems is covered by the updated PG examples and example/gpu_vecadd. I think the only remaining task is to update gpu_microbenchmark to use `upcxx::make_gpu_allocator()`.