slow upcxx::init when hugepages are enabled on Perlmutter GPU nodes
In contrast to Issue #580, this issue explicitly looks at the startup time of upcxx::init when hugepages are enabled.
This is the environment in an 8-GPU-node Perlmutter job:
Currently Loaded Modules:
1) craype-x86-milan 12) cray-dsmml/0.2.2
2) libfabric/1.15.2.0 13) cray-libsci/22.11.1.2
3) craype-network-ofi 14) craype/2.7.19
4) xpmem/2.5.2-2.4_3.20__gd0f7936.shasta 15) perftools-base/22.09.0
5) cpe/22.11 16) cpe-cuda/22.11
6) xalt/2.10.2 17) Nsight-Compute/2022.1.1
7) craype-accel-nvidia80 18) Nsight-Systems/2022.2.1
8) gpu/1.0 19) cudatoolkit/11.7
9) craype-hugepages256M 20) gcc/11.2.0
10) cmake/3.24.3 21) cray-mpich/8.1.22
11) PrgEnv-gnu/8.3.3 22) upcxx/bleeding-edge
GASNET_OFI_RECEIVE_BUFF_SIZE=recv
GASNET_SPAWN_CONTROL=pmi
for i in 1 2 3 ; do
  echo without hugepages
  GASNET_USE_HUGEPAGES=0 timeout 900 /pscratch/sd/r/regan/mhm2-builds/BleedingEdgeUpcxx-${build}/install/bin/mhm2.py -r arctic_sample_*.fq
  date
  echo with hugepages
  GASNET_USE_HUGEPAGES=1 timeout 900 /pscratch/sd/r/regan/mhm2-builds/BleedingEdgeUpcxx-${build}/install/bin/mhm2.py -r arctic_sample_*.fq
  date
done
I have confirmed that without craype-hugepages loaded, upcxx::init completes rapidly, and also when GASNET_USE_HUGEPAGES=0 is set.
As expected with transparent hugepages, when craype-hugepages256M is loaded and GASNET_USE_HUGEPAGES=0 is set, mhm2 performs modestly better, with an overall 16-25% improvement in speed from start to stop on the 10G test on 8 GPU nodes than if craype-hugepages256M is not loaded (60-67s vs 80s, including slurm's srun startup and teardown).
When the craype-hugepages256M module is loaded AND GASNET_USE_HUGEPAGES is either unset or set to 1, upcxx::init does not complete within the 900 seconds that the above test allows.
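The three observed configurations can be summarized as a small decision function. This is a hypothetical helper (`init_behavior` is not part of GASNet or mhm2), written only to encode the behavior matrix reported above:

```shell
# Hypothetical summary of the reported behavior matrix.
#   arg1: 1 if a craype-hugepages module is loaded, else 0
#   arg2: value of GASNET_USE_HUGEPAGES ("" when unset)
init_behavior() {
  loaded=$1
  use_hp=$2
  if [ "$loaded" = "1" ] && [ "$use_hp" != "0" ]; then
    echo "hangs"   # hugepage segment enabled: init exceeded the 900 s timeout
  else
    echo "fast"    # no hugepages module, or GASNET_USE_HUGEPAGES=0: init is quick
  fi
}

init_behavior 0 ""   # fast
init_behavior 1 0    # fast
init_behavior 1 ""   # hangs
init_behavior 1 1    # hangs
```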
Comments (5)
reporter: So for the GPU nodes on Perlmutter, we currently use srun. (I will be looking at using upcxx-srun in the very near future.)
Our mhm2.py script handles the launching and sets UPCXX_SHARED_HEAP_SIZE='450 MB', which should be about 10%:

Found 128 cpus and 2 hyperthreads from lscpu
Using default cores of 64
Ignoring tasks per node 128 from SLURM_TASKS_PER_NODE=128(x8)
This is Perlmutter GPU partition - executing srun directly and overriding UPCXX_SHARED_HEAP_SIZE=450 MB: ['srun', '-n', '512', '-N', '8', '--gpus-per-node=4', '/pscratch/sd/r/regan/mhm2-builds/BleedingEdgeUpcxx-Release-gnu/install/bin/mhm2-mps-wrapper-perlmutter.sh']
Executing mhm2 with job 5160593 (wrap) on 8 nodes.
Executing as: /pscratch/sd/r/regan/mhm2-builds/BleedingEdgeUpcxx-Release-gnu/install/bin/mhm2.py -r arctic_sample_0.fq arctic_sample_10.fq arctic_sample_11.fq arctic_sample_1.fq arctic_sample_2.fq arctic_sample_3.fq arctic_sample_4.fq arctic_sample_5.fq arctic_sample_6.fq arctic_sample_7.fq arctic_sample_8.fq arctic_sample_9.fq
2023-01-31 17:11:14.017754 executing: srun -n 512 -N 8 --gpus-per-node=4 /pscratch/sd/r/regan/mhm2-builds/BleedingEdgeUpcxx-Release-gnu/install/bin/mhm2-mps-wrapper-perlmutter.sh -- /pscratch/sd/r/regan/mhm2-builds/BleedingEdgeUpcxx-Release-gnu/install/bin/mhm2 -r arctic_sample_0.fq arctic_sample_10.fq arctic_sample_11.fq arctic_sample_1.fq arctic_sample_2.fq arctic_sample_3.fq arctic_sample_4.fq arctic_sample_5.fq arctic_sample_6.fq arctic_sample_7.fq arctic_sample_8.fq arctic_sample_9.fq
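The "about 10%" sizing can be sanity-checked with a little arithmetic. The following is a hypothetical sketch, not mhm2.py's actual code: it assumes the heap is sized as roughly 10% of node RAM split across the ranks on a node (Perlmutter GPU nodes have 256 GB of host RAM, and the run above used 64 ranks per node):

```python
# Hypothetical sketch (not mhm2.py's actual logic): per-rank shared heap
# as ~10% of node RAM divided across the ranks on that node.
def shared_heap_mb(node_ram_gb: int, ranks_per_node: int, fraction: float = 0.10) -> int:
    """Return a per-rank shared heap size in MB."""
    return int(node_ram_gb * 1024 * fraction / ranks_per_node)

# 256 GB nodes, 64 ranks/node as in the log above:
print(shared_heap_mb(256, 64))  # 409, in the same ballpark as the 450 MB used
```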
When Perlmutter is actually scheduling jobs again, I'll play around with GASNET_PHYSMEM_MAX (1/10, 1/5, 1/3, etc.) and also GASNET_PHYSMEM_PROBE=0 to see if that changes the behavior.
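The planned sweep can be sketched as a loop. This is only an illustration: `RUN_CMD` is a placeholder for the full srun invocation shown earlier, and it defaults to the no-op `true` so the loop itself can be dry-run:

```shell
# Sketch of the planned GASNET_PHYSMEM_MAX sweep; "$RUN_CMD" stands in
# for the real srun command line (defaults to "true" for a dry run).
RUN_CMD=${RUN_CMD:-true}

sweep() {
  for frac in 1/10 1/5 1/3; do
    echo "testing GASNET_PHYSMEM_MAX=$frac"
    env GASNET_PHYSMEM_MAX="$frac" GASNET_PHYSMEM_PROBE=0 $RUN_CMD
  done
}

sweep
```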
reporter: Testing with GASNET_MAX_SEGSIZE == UPCXX_SHARED_HEAP_SIZE == '450 MB' and the module craype-hugepages2M, using srun to spawn: it looks like upcxx::init starts in under 9 seconds, and the two runs were even a little faster at 57s and 58s. As expected, with those two environment variables set, GASNET_PHYSMEM_PROBE=0 had no effect. I'll also try using upcxx-srun, but I think this is the fix we need.
This supports the hypothesis that the default of initially requesting up to 85% of the RAM during the probe caused the long, or possibly indefinite, hang in upcxx::init when hugepages were being used.
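For the record, the combination that avoided the hang in the test above can be written out as a configuration fragment (module name and sizes taken from the comment above; this is a sketch of that setup, not an official recommendation):

```shell
# Configuration that avoided the hang in the test above.
# The two sizes must match so the GASNet segment is capped at the
# shared-heap size and the startup probe never tries ~85% of RAM.
module load craype-hugepages2M
export UPCXX_SHARED_HEAP_SIZE='450 MB'
export GASNET_MAX_SEGSIZE='450 MB'
```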
- changed status to invalid
Resolving this as "invalid", since bypassing upcxx-run in favor of the underlying spawner (which we DO support) while setting only UPCXX_SHARED_HEAP_SIZE without a matching GASNET_MAX_SEGSIZE is technically pilot error (see documentation).
bypassing upcxx-run in favor of the underlying spawner (which we DO support) but setting only UPCXX_SHARED_HEAP_SIZE without a matching GASNET_MAX_SEGSIZE is technically pilot error
AI PAUL: add this important wisdom to site-docs.md (done)
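The resolution suggests two launch patterns. The program name and arguments below are illustrative placeholders, not the exact mhm2 command line:

```shell
# (a) Let upcxx-run size the GASNet segment to match the shared heap:
upcxx-run -n 512 -N 8 -shared-heap '450 MB' -- ./mhm2 -r reads.fq

# (b) Bypass upcxx-run and call srun directly, but then set BOTH
#     variables yourself so the segment matches the shared heap:
export UPCXX_SHARED_HEAP_SIZE='450 MB'
export GASNET_MAX_SEGSIZE='450 MB'
srun -n 512 -N 8 ./mhm2 -r reads.fq
```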
We suspect this issue is strongly dependent on the shared segment probe at startup.
@Rob Egan Please share the details of exactly how you are specifying the UPC++/GASNet shared segment, and what size you are asking for. I'm also very interested to know whether the startup-time behavior changes with a much smaller GASNET_MAX_SEGSIZE.