Unreasonably long upcxx::init() on Perlmutter depending on GASNET_OFI_NUM_RECEIVE_BUFFS

Issue #580 closed
Rob Egan created an issue

This issue was first documented in Slack (Aug 2022) as a stall within upcxx::init() that takes a variable amount of time, but has lasted >8 hours in large jobs using 128 ppn (i.e. Perlmutter CPU nodes) and the “single” workaround.
Setting GASNET_OFI_NUM_RECEIVE_BUFFS=400 seems to be a good workaround and works in most cases, but sometimes 300 is needed, sometimes 200, and if the value is too small the stall also occurs (though possibly in a different code path).
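
For reference, a minimal sketch of how this workaround gets applied at launch time (the launch line and application path are placeholders rather than from an actual run; the env vars and values are the ones discussed in this issue):

    # illustrative only: set the workarounds before launching; the app path is a placeholder
    export GASNET_OFI_RECEIVE_BUFF_SIZE=single    # the "single" workaround mentioned above
    export GASNET_OFI_NUM_RECEIVE_BUFFS=400       # 300 or 200 has sometimes been needed instead
    upcxx-run -n 512 -N 4 -shared-heap 10% -- ./your_app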

From Slack and testing in October:

"It looks like using the default GASNET_OFI_NUM_RECEIVE_BUFFS always fails for as low as 16 cpu nodes at 128 ppn, but always succeeds at 112 ppn, confirmed this outcome at 32, 64, 128 and 256 nodes. at 384 nodes, default env at both 128 and 112 rpn stalled in upcxx::init.”

“With GASNET_OFI_NUM_RECEIVE_BUFFS=400, 128 ppn failed but 112 succeeded. With GASNET_OFI_NUM_RECEIVE_BUFFS=300 and 200, both 128 ppn and 112 succeeded. At 512 nodes, the default env at 128 stalled but 112 ppn worked, and with GASNET_OFI_NUM_RECEIVE_BUFFS <= 400 all other tests at 128 and 112 ppn worked.”

A possibly different issue was detected later on the GPU nodes: running 64 ppn with GASNET_OFI_NUM_RECEIVE_BUFFS=400, upcxx::init() is also very slow. This may be related to huge-page allocation and might be a significant slowdown rather than a stall; critically, I have only seen it on the GPU nodes, not the CPU ones (e.g. 2 GPU nodes took between 187 s and 341 s to complete upcxx::init() in repeated runs within the same job).

From Slack:

“So a mixed bag on repeated runs within the same job on 2 GPU nodes:

regan@login01:/pscratch/sd/r/regan/arctic_data> grep 'init After' slurm-3940976.out
upcxx::init After=341.08/341.32/341.17/341.32 s, 1.00
upcxx::init After=307.75/307.88/307.78/307.88 s, 1.00
upcxx::init After=228.15/228.49/228.34/228.49 s, 1.00
upcxx::init After=187.43/187.58/187.48/187.58 s, 1.00
upcxx::init After=213.15/213.85/213.47/213.85 s, 1.00

I’ll post again with new experiments using the bleeding-edge build on Perlmutter (1/31/2023).

Comments (8)

  1. Rob Egan reporter

    4 CPU nodes using the “recv” slingshot workaround:

     1) craype-x86-milan                        5) cpe/22.11               9) cmake/3.24.3           13) craype/2.7.19            17) Nsight-Systems/2022.2.1  21) upcxx/bleeding-edge
      2) libfabric/1.15.2.0                      6) xalt/2.10.2            10) PrgEnv-gnu/8.3.3       14) perftools-base/22.09.0   18) cudatoolkit/11.7
      3) craype-network-ofi                      7) craype-accel-nvidia80  11) cray-dsmml/0.2.2       15) cpe-cuda/22.11           19) gcc/11.2.0
      4) xpmem/2.5.2-2.4_3.20__gd0f7936.shasta   8) gpu/1.0                12) cray-libsci/22.11.1.2  16) Nsight-Compute/2022.1.1  20) cray-mpich/8.1.22
    
    regan@nid004230:/pscratch/sd/r/regan/arctic_data> env|grep GAS
    GASNET_OFI_RECEIVE_BUFF_SIZE=recv
    GASNET_SPAWN_CONTROL=pmi
    
    upcxx-run -n 512 -N 4 -shared-heap 10% -- ../mhm2-builds/BleedingEdgeUpcxx-Release-gnu-cpuonly/install/bin/mhm2 -r arctic_sample_0.fq -v --progress
    

    GASNET_OFI_NUM_RECEIVE_BUFFS= Timeout in upcxx::init() after 2 min.

    (no change) Timeout in upcxx::init() after 2 min.

    GASNET_OFI_NUM_RECEIVE_BUFFS=450 Timeout in upcxx::init() after 2 min.

    GASNET_OFI_NUM_RECEIVE_BUFFS=442 Timeout in upcxx::init() after 2 min.

    GASNET_OFI_NUM_RECEIVE_BUFFS=439 Timeout in upcxx::init() after 2 min.

    GASNET_OFI_NUM_RECEIVE_BUFFS=438 Timeout in upcxx::init() after 2 min.

    GASNET_OFI_NUM_RECEIVE_BUFFS=437 upcxx::init() in <7s

    GASNET_OFI_NUM_RECEIVE_BUFFS=435 upcxx::init() in <7s

    GASNET_OFI_NUM_RECEIVE_BUFFS=425 upcxx::init() in <7s

    GASNET_OFI_NUM_RECEIVE_BUFFS=400 upcxx::init() in <7s

    GASNET_OFI_NUM_RECEIVE_BUFFS=300 upcxx::init() in <7s

    GASNET_OFI_NUM_RECEIVE_BUFFS=200 upcxx::init() in <7s

    GASNET_OFI_NUM_RECEIVE_BUFFS=100 upcxx::init() in <7s

    GASNET_OFI_NUM_RECEIVE_BUFFS=50 upcxx::init() in <7s

    GASNET_OFI_NUM_RECEIVE_BUFFS=20 upcxx::init() in <7s

    GASNET_OFI_NUM_RECEIVE_BUFFS=10 upcxx::init() in <7s

    In all the tests that did not time out, mhm2 completed in about 50 s.
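
    For reproducibility, a rough sketch of the sweep methodology above (assumptions: the launch line is a placeholder, and timeout 120 stands in for the 2-minute cutoff):

        # hypothetical sweep over GASNET_OFI_NUM_RECEIVE_BUFFS values, flagging runs
        # whose upcxx::init() (and hence the whole run) exceeds the 2-minute cutoff
        for n in 450 442 439 438 437 435 425 400 300 200 100 50 20 10; do
            export GASNET_OFI_NUM_RECEIVE_BUFFS=$n
            if timeout 120 upcxx-run -n 512 -N 4 -shared-heap 10% -- ./your_app; then
                echo "NUM_RECEIVE_BUFFS=$n: completed"
            else
                echo "NUM_RECEIVE_BUFFS=$n: timed out or failed"
            fi
        done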

    4 GPU nodes using the “recv” slingshot workaround:

     env|grep GASNET
    GASNET_OFI_RECEIVE_BUFF_SIZE=recv
    GASNET_SPAWN_CONTROL=pmi
    
    
    timeout 90 ../mhm2-builds/BleedingEdgeUpcxx-Release-gnu/install/bin/mhm2.py -r arctic_sample_0.fq
    ...
    Using default cores of  64 . Ignoring tasks per node  128  from SLURM_TASKS_PER_NODE= 128(x4)
    This is Perlmutter GPU partition - executing srun directly and overriding UPCXX_SHARED_HEAP_SIZE= 450 MB 
    
     srun -n 256 -N 4 --gpus-per-node=4 ../mhm2-builds/BleedingEdgeUpcxx-Release-gnu/install/bin/mhm2-mps-wrapper-perlmutter.sh -- ../mhm2-builds/BleedingEdgeUpcxx-Release-gnu/install/bin/mhm2 -r arctic_sample_0.fq
    

    GASNET_OFI_NUM_RECEIVE_BUFFS= FATAL ERROR (proc 1): in gasnetc_ofi_read_env_vars() at cxx-develop/bld/GASNet-develop/ofi-conduit/gasnet_ofi.c:536: GASNET_OFI_NUM_RECEIVE_BUFFS must be at least 2

    (no change). upcxx::init() in <9s

    GASNET_OFI_NUM_RECEIVE_BUFFS=1000 Timeout upcxx::init() after 90s

    GASNET_OFI_NUM_RECEIVE_BUFFS=937 Timeout upcxx::init() after 90s

    GASNET_OFI_NUM_RECEIVE_BUFFS=929 Timeout upcxx::init() after 90s

    GASNET_OFI_NUM_RECEIVE_BUFFS=928 upcxx::init() in <9s

    GASNET_OFI_NUM_RECEIVE_BUFFS=927 upcxx::init() in <9s

    GASNET_OFI_NUM_RECEIVE_BUFFS=925 upcxx::init() in <9s

    GASNET_OFI_NUM_RECEIVE_BUFFS=921 upcxx::init() in <9s

    GASNET_OFI_NUM_RECEIVE_BUFFS=906 upcxx::init() in <9s

    GASNET_OFI_NUM_RECEIVE_BUFFS=875 upcxx::init() in <9s

    GASNET_OFI_NUM_RECEIVE_BUFFS=750 upcxx::init() in <9s

    GASNET_OFI_NUM_RECEIVE_BUFFS=450 upcxx::init() in <9s

    In all the runs that did not time out, mhm2 completed the test in about 32 s.

    I think this indicates we have two separate issues: 1) upcxx::init() tends to stall indefinitely with a high value of GASNET_OFI_NUM_RECEIVE_BUFFS, and 2) there may be a performance problem allocating huge pages on the GPU nodes during upcxx::init(). Let’s have this issue track the former; when I have more data, I’ll open a separate issue on the latter.

  2. Paul Hargrove

    @Rob Egan The "former issue" in which too large a buffer count leads to hangs is, in my mind, qualitatively identical to the (closed) GASNet-EX bug:
    Bug 4478 "ofi: hangs with cxi provider with 128ppn and GASNET_OFI_RECEIVE_BUFF_SIZE=single"

    Based on the data here, it appears the original fix for bug 4478 was too aggressive with its default of 450 buffers at 128 ppn. That alone can be resolved with a 1-line IPR to ofi-conduit changing the default to 437 or less.

    However, I believe the results above for 128 ppn (CPU node) vs 64 ppn (GPU node) motivate the scaling with ppn that was omitted from the initial fix for bug 4478, though I suspect the "proper" solution requires scaling with procs-per-nic.

    So, I am inclined to open a new ofi-conduit bug in which we deploy some scaling and lower the current 450 to "400ish" for the 128 procs-per-nic case.
    @Dan Bonachea What are your thoughts?
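
    For concreteness, a purely illustrative sketch of what such a scaling rule could look like (this is not the actual ofi-conduit change; the roughly 56k per-NIC budget is simply back-solved from the 437-at-128 ppn threshold measured above):

        # hypothetical only: cap total posted receive buffers per NIC at roughly 56k,
        # a budget back-solved from the 437-at-128ppn threshold measured in this issue
        procs_per_nic=128
        bufs=$(( 56000 / procs_per_nic ))        # -> 437 for 128 procs per NIC
        export GASNET_OFI_NUM_RECEIVE_BUFFS=$bufs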

  3. Dan Bonachea

    Paul said:

    I suspect the "proper" solution requires scaling with procs-per-nic.

    @Rob Egan It sounds like it would be interesting to know whether your findings change on the GPU nodes when the 64 procs/node are spread across the 4 NICs (i.e. 16 procs per NIC), as opposed to the runs above, which appear to be using only one NIC (64 procs per NIC).

  4. Rob Egan reporter

    Checking maximum buffer counts on the GPU nodes after proper binding of CPU/NIC/NUMA.

    Running on 4 GPU nodes, using our Issue178 branch to ensure proper locality, upcxx-srun, and craype-hugepages2M:

    Using default cores of  64 . Ignoring tasks per node  128  from SLURM_TASKS_PER_NODE= 128(x4)
    This is Perlmutter GPU partition - executing upcxx-srun directly and setting UPCXX_SHARED_HEAP_SIZE= 450 MB : ['upcxx-srun', '-n', '256', '-N', '4', '--gpus-per-node=4', '--', '/pscratch/sd/r/regan/mhm2-builds/Issue178-Release-gnu/install/bin/mhm2-mps-wrapper-perlmutter.sh']
    Setting GASNET_MAX_SEGSIZE == UPCXX_SHARED_HEAP_SIZE ==  450 MB  to avoid gasnet memory probe
    Executing mhm2 with job 5186335 (interactive) on 4 nodes.
    Executing as: /pscratch/sd/r/regan/mhm2-builds/Issue178-Release-gnu/install/bin/mhm2.py -r arctic_sample_0.fq
    Using default cores of  64 . Ignoring tasks per node  128  from SLURM_TASKS_PER_NODE= 128(x4)
    2023-02-01 20:39:55.706620 executing:
     upcxx-srun -n 256 -N 4 --gpus-per-node=4 -- /pscratch/sd/r/regan/mhm2-builds/Issue178-Release-gnu/install/bin/mhm2-mps-wrapper-perlmutter.sh -- /pscratch/sd/r/regan/mhm2-builds/Issue178-Release-gnu/install/bin/mhm2 -r arctic_sample_0.fq
     ...
    2023-02-01 20:40:23 [0] <utils.cpp:395> CPU Pinnings - local rank(s) 48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63: 48-63,112-127
    2023-02-01 20:40:23 [0] <utils.cpp:395> CPU Pinnings - local rank(s) 32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47: 32-47,96-111
    2023-02-01 20:40:23 [0] <utils.cpp:395> CPU Pinnings - local rank(s) 16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31: 16-31,80-95
    2023-02-01 20:40:23 [0] <utils.cpp:395> CPU Pinnings - local rank(s) 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15: 0-15,64-79
    2023-02-01 20:40:23 [0] <utils.cpp:395> GASNET/UPCXX Environment - local rank(s) 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63: UPCXX_NETWORK=ofi, GASNET_OFI_DEVICE_1_2_3=cxi1, GASNET_OFI_RECEIVE_BUFF_SIZE=recv, NERSC_FAMILY_UPCXX_VERSION=nightly, GASNET_OFI_DEVICE=cxi0, UPCXX_SHARED_HEAP_SIZE=450 MB, MOD_UPCXX_VERSION=nightly, LMOD_FAMILY_UPCXX_VERSION=nightly, GASNET_SPAWN_CONTROL=pmi, LMOD_FAMILY_UPCXX=upcxx, UPCXX_INSTALL=/global/common/software/m2878/perlmutter/upcxx/generic, GASNET_OFI_DEVICE_TYPE=Node, GASNET_OFI_DEVICE_1_3=cxi1, GASNET_OFI_DEVICE_1_2=cxi1, GASNET_OFI_DEVICE_3=cxi3, GASNET_OFI_DEVICE_2=cxi2, GASNET_OFI_DEVICE_1=cxi1, GASNET_OFI_NUM_RECEIVE_BUFFS=2000, NERSC_FAMILY_UPCXX=upcxx, MOD_UPCXX_SUFFIX=, GASNET_MAX_SEGSIZE=450 MB, GASNET_OFI_DEVICE_2_3=cxi2,
    ...
    2023-02-01 20:40:24 [0] <utils.cpp:395> GPU UUID - local rank(s) 32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47: 85c4993a2593c6b10d560726a7bf5423 device0of1
    2023-02-01 20:40:24 [0] <utils.cpp:395> GPU UUID - local rank(s) 16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31: a86f33da4b3200240148b7a7e810f03f device0of1
    2023-02-01 20:40:24 [0] <utils.cpp:395> GPU UUID - local rank(s) 48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63: 3c229a1e06fd47559e52db769b9e1a14 device0of1
    2023-02-01 20:40:24 [0] <utils.cpp:395> GPU UUID - local rank(s) 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15: 3b082b5642ec7efed7c9c755ade1a0d4 device0of1
    
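    For readability, here are the NIC-related settings buried in the environment dump above, pulled out as they would be set (values copied verbatim from the dump; I am not asserting the exact semantics of the suffixed variants, only that these are the settings under which the 64 procs/node were spread across the four cxi NICs):

        # per-NIC device selection as captured in the environment dump above
        export GASNET_OFI_DEVICE_TYPE=Node
        export GASNET_OFI_DEVICE=cxi0
        export GASNET_OFI_DEVICE_1=cxi1
        export GASNET_OFI_DEVICE_2=cxi2
        export GASNET_OFI_DEVICE_3=cxi3
        export GASNET_OFI_DEVICE_1_2=cxi1
        export GASNET_OFI_DEVICE_1_3=cxi1
        export GASNET_OFI_DEVICE_2_3=cxi2
        export GASNET_OFI_DEVICE_1_2_3=cxi1
        export GASNET_OFI_RECEIVE_BUFF_SIZE=recv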

    Varying GASNET_OFI_NUM_RECEIVE_BUFFS:

    =1000 upcxx::init After=9.76/9.84/9.90/10.04 s, 0.99

    =2000 upcxx::init After=9.50/9.50/9.58/9.80 s, 0.98

    =3000 upcxx::init After=9.79/9.96/9.94/10.14 s, 0.98

    =3500 upcxx::init After=9.37/9.37/9.52/9.67 s, 0.98

    =3750 upcxx::init After=9.70/9.74/9.78/9.99 s, 0.98

    =3765 upcxx::init After=9.60/9.61/9.74/9.95 s, 0.98

    =3773 upcxx::init After=9.74/9.75/9.80/9.99 s, 0.98

    =3777 upcxx::init After=9.88/9.89/9.94/10.14 s, 0.98

    =3779 upcxx::init After=9.51/9.60/9.58/9.74 s, 0.98

    =3780 upcxx::init After=9.68/9.89/9.82/10.06 s, 0.98

    =3781 *** FATAL ERROR (proc 4): in gasnetc_ep_bindsegment() at pcxx-develop/bld/GASNet-stable/ofi-conduit/gasnet_ofi.c:1724: fi_mr_enable failed: -28(No space left on device)

    =3782 >50s

    =3812 >50s

    =3875 >50s

    =4000 >90s

    … so 3780 works, 3781 is the magic number that triggers an OFI error, and >3781 stalls indefinitely.

  5. Paul Hargrove

    Thanks, @Rob Egan

    I will note that 3781 / 4 = 945, which is close to the 928 limit you found on the GPU nodes when using only 1 NIC. So this is consistent with our guess that (allowing for some noise) it is some per-NIC limit (not per-host) that we are running up against.
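
    Spelling that out with the last-working thresholds reported in this issue (rough arithmetic only, and assuming GASNET_OFI_NUM_RECEIVE_BUFFS counts buffers posted per process):

         437 buffs/proc x 128 procs/NIC = 55,936 buffs/NIC   (CPU node, 1 NIC, comment 1)
         928 buffs/proc x  64 procs/NIC = 59,392 buffs/NIC   (GPU node, 1 NIC, comment 1)
        3780 buffs/proc x  16 procs/NIC = 60,480 buffs/NIC   (GPU node, 4 NICs, comment 4)

    All three land in the same roughly 56-60k range, which is what one would expect from a per-NIC (rather than per-host or per-process) resource limit.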

  6. Paul Hargrove

    This issue is being dealt with as a GASNet-EX bug:
    Bug 4573 - Improve GASNET_OFI_NUM_RECEIVE_BUFFS special-case defaults

    Any additional comments should be directed there.
