slow upcxx::init when hugepages are enabled on Perlmutter GPU nodes

Issue #581 invalid
Rob Egan created an issue

In contrast to Issue #580, this issue explicitly looks at the startup time of upcxx::init when hugepages are enabled.

This is the environment for an 8-GPU-node Perlmutter job:

Currently Loaded Modules:
  1) craype-x86-milan                       12) cray-dsmml/0.2.2
  2) libfabric/1.15.2.0                     13) cray-libsci/22.11.1.2
  3) craype-network-ofi                     14) craype/2.7.19
  4) xpmem/2.5.2-2.4_3.20__gd0f7936.shasta  15) perftools-base/22.09.0
  5) cpe/22.11                              16) cpe-cuda/22.11
  6) xalt/2.10.2                            17) Nsight-Compute/2022.1.1
  7) craype-accel-nvidia80                  18) Nsight-Systems/2022.2.1
  8) gpu/1.0                                19) cudatoolkit/11.7
  9) craype-hugepages256M                   20) gcc/11.2.0
 10) cmake/3.24.3                           21) cray-mpich/8.1.22
 11) PrgEnv-gnu/8.3.3                       22) upcxx/bleeding-edge


GASNET_OFI_RECEIVE_BUFF_SIZE=recv
GASNET_SPAWN_CONTROL=pmi

 for i in 1 2 3 ; do
   echo without hugepages
   GASNET_USE_HUGEPAGES=0 timeout 900 /pscratch/sd/r/regan/mhm2-builds/BleedingEdgeUpcxx-${build}/install/bin/mhm2.py -r arctic_sample_*.fq
   date
   echo with hugepages
   GASNET_USE_HUGEPAGES=1 timeout 900 /pscratch/sd/r/regan/mhm2-builds/BleedingEdgeUpcxx-${build}/install/bin/mhm2.py -r arctic_sample_*.fq
   date
 done

I have confirmed that without any craype-hugepages* module loaded, upcxx::init completes rapidly, and likewise when GASNET_USE_HUGEPAGES=0 is set.

As expected with transparent hugepages, when craype-hugepages256M is loaded and GASNET_USE_HUGEPAGES=0 is set, mhm2 performs modestly better than without craype-hugepages256M: a 16-25% overall speedup from start to finish on the 10G test on 8 GPU nodes (60-67 s vs 80 s, including Slurm’s srun startup & teardown).

When the craype-hugepages256M module is loaded AND GASNET_USE_HUGEPAGES is either unset or set to 1, upcxx::init does not complete within the 900 seconds that the above test allows.
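The workaround, as a minimal shell sketch (the variable and its observed effect are taken from the report above; everything else is illustrative):

```shell
# With a craype-hugepages* module loaded, explicitly disabling GASNet's
# use of hugepages for the shared segment restores fast startup;
# leaving GASNET_USE_HUGEPAGES unset (or =1) reproduces the hang.
export GASNET_USE_HUGEPAGES=0
echo "GASNET_USE_HUGEPAGES=${GASNET_USE_HUGEPAGES}"
```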

Comments (5)

  1. Dan Bonachea

    We suspect this issue is strongly dependent on the shared segment probe at startup.

    @Rob Egan Please share the details of exactly how you are specifying the UPC++/GASNet shared segment, and what size you are asking for. I'm also very interested to know if the startup time behavior changes with much smaller GASNET_MAX_SEGSIZE.

  2. Rob Egan reporter

    So for the GPU nodes on Perlmutter, we currently use srun. (I will be looking at using upcxx-srun in the very near future.)

    Our mhm2.py script handles the launching and sets UPCXX_SHARED_HEAP_SIZE='450 MB', which should be about 10%.

    Found 128 cpus and 2 hyperthreads from lscpu
    Using default cores of  64 . Ignoring tasks per node  128  from SLURM_TASKS_PER_NODE= 128(x8)
    This is Perlmutter GPU partition - executing srun directly and overriding UPCXX_SHARED_HEAP_SIZE= 450 MB : ['srun', '-n', '512', '-N', '8', '--gpus-per-node=4', '/pscratch/sd/r/regan/mhm2-builds/BleedingEdgeUpcxx-Release-gnu/install/bin/mhm2-mps-wrapper-perlmutter.sh']
    Executing mhm2 with job 5160593 (wrap) on 8 nodes.
    Executing as: /pscratch/sd/r/regan/mhm2-builds/BleedingEdgeUpcxx-Release-gnu/install/bin/mhm2.py -r arctic_sample_0.fq arctic_sample_10.fq arctic_sample_11.fq arctic_sample_1.fq arctic_sample_2.fq arctic_sample_3.fq arctic_sample_4.fq arctic_sample_5.fq arctic_sample_6.fq arctic_sample_7.fq arctic_sample_8.fq arctic_sample_9.fq
    Using default cores of  64 . Ignoring tasks per node  128  from SLURM_TASKS_PER_NODE= 128(x8)
    2023-01-31 17:11:14.017754 executing:
     srun -n 512 -N 8 --gpus-per-node=4 /pscratch/sd/r/regan/mhm2-builds/BleedingEdgeUpcxx-Release-gnu/install/bin/mhm2-mps-wrapper-perlmutter.sh -- /pscratch/sd/r/regan/mhm2-builds/BleedingEdgeUpcxx-Release-gnu/install/bin/mhm2 -r arctic_sample_0.fq arctic_sample_10.fq arctic_sample_11.fq arctic_sample_1.fq arctic_sample_2.fq arctic_sample_3.fq arctic_sample_4.fq arctic_sample_5.fq arctic_sample_6.fq arctic_sample_7.fq arctic_sample_8.fq arctic_sample_9.fq
    

    When Perlmutter is actually scheduling jobs again, I’ll play around with GASNET_PHYSMEM_MAX (1/10, 1/5, 1/3, etc.) and also GASNET_PHYSMEM_PROBE=0 to see if that changes the behavior.
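    A sketch of that planned sweep (the fraction values come from the sentence above; the loop structure and the commented-out launch line are illustrative assumptions):

```shell
# Hypothetical sweep over the physical-memory probe limit.
# GASNET_PHYSMEM_MAX accepts fractions of physical memory such as 1/10.
fracs="1/10 1/5 1/3"
for frac in $fracs; do
  echo "testing GASNET_PHYSMEM_MAX=${frac}"
  # GASNET_PHYSMEM_MAX="${frac}" GASNET_PHYSMEM_PROBE=0 timeout 900 \
  #   mhm2.py -r arctic_sample_*.fq
done
```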

  3. Rob Egan reporter

    Testing GASNET_MAX_SEGSIZE == UPCXX_SHARED_HEAP_SIZE == '450 MB'
    and the module craype-hugepages2M.
    Using srun to spawn, upcxx::init starts in under 9 seconds, and the two runs were even a little faster at 57 s and 58 s. As expected, with those two environment variables set, GASNET_PHYSMEM_PROBE=0 had no effect.

    I’ll also try with using upcxx-srun, but I think this is the fix we need.

    This supports the hypothesis that the default behavior of initially requesting up to 85% of RAM during the segment probe caused the long, possibly indefinite, hang in upcxx::init when hugepages were being used.
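    The fix described in this comment, as a sketch (the '450 MB' value is verbatim from the report; the export form is an assumption about how one might set it before invoking srun):

```shell
# Setting a GASNET_MAX_SEGSIZE that matches UPCXX_SHARED_HEAP_SIZE
# avoids the default large physical-memory probe at startup.
export UPCXX_SHARED_HEAP_SIZE='450 MB'
export GASNET_MAX_SEGSIZE='450 MB'
[ "$GASNET_MAX_SEGSIZE" = "$UPCXX_SHARED_HEAP_SIZE" ] && echo "segment sizes match"
```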

  4. Dan Bonachea

    Resolving this as "invalid", since bypassing upcxx-run in favor of the underlying spawner (which we DO support) but setting only UPCXX_SHARED_HEAP_SIZE without a matching GASNET_MAX_SEGSIZE is technically pilot error (see documentation).

  5. Paul Hargrove

    bypassing upcxx-run in favor of the underlying spawner (which we DO support) but setting only UPCXX_SHARED_HEAP_SIZE without a matching GASNET_MAX_SEGSIZE is technically pilot error

    AI PAUL: add this important wisdom to site-docs.md (done)
