Unreasonably long upcxx::init() on Perlmutter depending on GASNET_OFI_NUM_RECEIVE_BUFFS
This issue was first documented in Slack (Aug 2022) as a stall within upcxx::init() that takes a variable amount of time, but has lasted more than 8 hours in large jobs running 128 ppn (i.e. Perlmutter CPU nodes) with the "single" workaround.
Setting GASNET_OFI_NUM_RECEIVE_BUFFS=400 seems to be a good workaround and works in most cases, but sometimes 300 is needed, sometimes 200; and if the value is too small the stall also occurs (though possibly in a different code path).
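As a sketch, the workaround above can be applied in a job script like this. The launch line and binary name are placeholders (not from this issue); only the two environment variables and the value 400 come from the findings here:

```shell
#!/bin/bash
# Sketch of the workaround described above.
# 400 worked in most cases reported here; 300 or 200 were sometimes needed.
export GASNET_OFI_RECEIVE_BUFF_SIZE=single   # the "single" workaround in use at the time
export GASNET_OFI_NUM_RECEIVE_BUFFS=400      # cap the receive buffer count
echo "GASNET_OFI_NUM_RECEIVE_BUFFS=${GASNET_OFI_NUM_RECEIVE_BUFFS}"
# srun -N 16 --ntasks-per-node=128 ./my_upcxx_app   # placeholder launch line
```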
From Slack & testing in Oct:
"It looks like using the default GASNET_OFI_NUM_RECEIVE_BUFFS always fails for as few as 16 CPU nodes at 128 ppn, but always succeeds at 112 ppn; I confirmed this outcome at 32, 64, 128 and 256 nodes. At 384 nodes, the default env at both 128 and 112 ppn stalled in upcxx::init."
"With GASNET_OFI_NUM_RECEIVE_BUFFS=400, 128 ppn failed but 112 succeeded. With GASNET_OFI_NUM_RECEIVE_BUFFS=300 and 200, both 128 ppn and 112 succeeded. At 512 nodes, the default env at 128 stalled but 112 ppn worked, and with GASNET_OFI_NUM_RECEIVE_BUFFS <= 400 all other tests at 128 and 112 ppn worked."
A possibly different issue was then detected later on the GPU nodes running 64 ppn with GASNET_OFI_NUM_RECEIVE_BUFFS=400:
upcxx::init() is also very slow there. This may be related to huge page allocation and might be a significant slowdown rather than a stall; critically, I have only seen it on the GPU nodes, not the CPU ones (e.g. on 2 GPU nodes upcxx::init took between 187s and 341s in repeated runs within the same job).
From Slack:
“So a mixed bag on repeated runs within the same job on 2 gpu nodes:
regan@login01:/pscratch/sd/r/regan/arctic_data> grep 'init After' slurm-3940976.out
upcxx::init After=341.08/341.32/341.17/341.32 s, 1.00
upcxx::init After=307.75/307.88/307.78/307.88 s, 1.00
upcxx::init After=228.15/228.49/228.34/228.49 s, 1.00
upcxx::init After=187.43/187.58/187.48/187.58 s, 1.00
upcxx::init After=213.15/213.85/213.47/213.85 s, 1.00
“
I'll post again with new experiments using the bleeding-edge build on Perlmutter (1/31/2023).
Comments (8)
-
reporter -
@Rob Egan The "former issue" in which too large a buffer count leads to hangs is, in my mind, qualitatively identical to the (closed) GASNet-EX bug:
Bug 4478 "ofi: hangs with cxi provider with 128ppn and GASNET_OFI_RECEIVE_BUFF_SIZE=single"
Based on the data here, it appears the original fix to bug 4478 was too aggressive with its default of 450 buffers at 128 ppn. That alone can be resolved with a 1-line IPR to ofi-conduit changing it to 437 or fewer.
However, I believe the results for 128ppn (CPU node) vs 64ppn (GPU node) above motivate the scaling with ppn which was omitted in the initial fix for bug 4478. Though I suspect the "proper" solution requires scaling with procs-per-nic.
So, I am inclined to open a new ofi-conduit bug in which we deploy some scaling and lower the current 450 to "400ish" for the 128 procs-per-nic case.
@Dan Bonachea What are your thoughts? -
Paul said:
I suspect the "proper" solution requires scaling with procs-per-nic.
@Rob Egan sounds like it would be interesting to know if your findings change on the GPU nodes when the 64 procs/node are spread across the 4 NICs (i.e. 16 procs per NIC) instead of the runs above that appear to be using only one NIC (64 procs per NIC)
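A hedged sketch of one way to run that experiment: derive a per-rank NIC from the srun-provided SLURM_LOCALID so that 64 ranks/node are split 16 per NIC across cxi0..cxi3. This is an illustration only (the later logs in this issue instead use GASNET_OFI_DEVICE_TYPE with per-group GASNET_OFI_DEVICE_N variables); SLURM_LOCALID is defaulted to 0 so the sketch runs standalone:

```shell
#!/bin/bash
# Spread ranks across the 4 Slingshot NICs: 64 ranks/node -> 16 ranks per NIC.
localid=${SLURM_LOCALID:-0}      # set by srun per-task; default 0 for standalone runs
nic=$(( localid / 16 ))          # ranks 0-15 -> cxi0, 16-31 -> cxi1, etc.
export GASNET_OFI_DEVICE="cxi${nic}"
echo "rank ${localid} -> ${GASNET_OFI_DEVICE}"
```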
-
Adjust title to match outstanding issue
-
reporter Checking maximum buffer counts on GPU after proper binding of CPU/NIC/NUMA.
Running 4 GPU nodes, using our Issue178 branch to ensure proper locality, upcxx-srun, and craype-hugepages2M:
Using default cores of 64. Ignoring tasks per node 128 from SLURM_TASKS_PER_NODE=128(x4)
This is Perlmutter GPU partition - executing upcxx-srun directly and setting UPCXX_SHARED_HEAP_SIZE=450 MB:
['upcxx-srun', '-n', '256', '-N', '4', '--gpus-per-node=4', '--', '/pscratch/sd/r/regan/mhm2-builds/Issue178-Release-gnu/install/bin/mhm2-mps-wrapper-perlmutter.sh']
Setting GASNET_MAX_SEGSIZE == UPCXX_SHARED_HEAP_SIZE == 450 MB to avoid gasnet memory probe
Executing mhm2 with job 5186335 (interactive) on 4 nodes.
Executing as: /pscratch/sd/r/regan/mhm2-builds/Issue178-Release-gnu/install/bin/mhm2.py -r arctic_sample_0.fq
Using default cores of 64. Ignoring tasks per node 128 from SLURM_TASKS_PER_NODE=128(x4)
2023-02-01 20:39:55.706620 executing: upcxx-srun -n 256 -N 4 --gpus-per-node=4 -- /pscratch/sd/r/regan/mhm2-builds/Issue178-Release-gnu/install/bin/mhm2-mps-wrapper-perlmutter.sh -- /pscratch/sd/r/regan/mhm2-builds/Issue178-Release-gnu/install/bin/mhm2 -r arctic_sample_0.fq
...
2023-02-01 20:40:23 [0] <utils.cpp:395> CPU Pinnings - local rank(s) 48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63: 48-63,112-127
2023-02-01 20:40:23 [0] <utils.cpp:395> CPU Pinnings - local rank(s) 32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47: 32-47,96-111
2023-02-01 20:40:23 [0] <utils.cpp:395> CPU Pinnings - local rank(s) 16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31: 16-31,80-95
2023-02-01 20:40:23 [0] <utils.cpp:395> CPU Pinnings - local rank(s) 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15: 0-15,64-79
2023-02-01 20:40:23 [0] <utils.cpp:395> GASNET/UPCXX Environment - local rank(s) 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63: UPCXX_NETWORK=ofi, GASNET_OFI_DEVICE_1_2_3=cxi1, GASNET_OFI_RECEIVE_BUFF_SIZE=recv, NERSC_FAMILY_UPCXX_VERSION=nightly, GASNET_OFI_DEVICE=cxi0, UPCXX_SHARED_HEAP_SIZE=450 MB, MOD_UPCXX_VERSION=nightly, LMOD_FAMILY_UPCXX_VERSION=nightly, GASNET_SPAWN_CONTROL=pmi, LMOD_FAMILY_UPCXX=upcxx, UPCXX_INSTALL=/global/common/software/m2878/perlmutter/upcxx/generic, GASNET_OFI_DEVICE_TYPE=Node, GASNET_OFI_DEVICE_1_3=cxi1, GASNET_OFI_DEVICE_1_2=cxi1, GASNET_OFI_DEVICE_3=cxi3, GASNET_OFI_DEVICE_2=cxi2, GASNET_OFI_DEVICE_1=cxi1, GASNET_OFI_NUM_RECEIVE_BUFFS=2000, NERSC_FAMILY_UPCXX=upcxx, MOD_UPCXX_SUFFIX=, GASNET_MAX_SEGSIZE=450 MB, GASNET_OFI_DEVICE_2_3=cxi2, ...
2023-02-01 20:40:24 [0] <utils.cpp:395> GPU UUID - local rank(s) 32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47: 85c4993a2593c6b10d560726a7bf5423 device0of1
2023-02-01 20:40:24 [0] <utils.cpp:395> GPU UUID - local rank(s) 16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31: a86f33da4b3200240148b7a7e810f03f device0of1
2023-02-01 20:40:24 [0] <utils.cpp:395> GPU UUID - local rank(s) 48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63: 3c229a1e06fd47559e52db769b9e1a14 device0of1
2023-02-01 20:40:24 [0] <utils.cpp:395> GPU UUID - local rank(s) 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15: 3b082b5642ec7efed7c9c755ade1a0d4 device0of1
Varying GASNET_OFI_NUM_RECEIVE_BUFFS:
=1000 upcxx::init After=9.76/9.84/9.90/10.04 s, 0.99
=2000 upcxx::init After=9.50/9.50/9.58/9.80 s, 0.98
=3000 upcxx::init After=9.79/9.96/9.94/10.14 s, 0.98
=3500 upcxx::init After=9.37/9.37/9.52/9.67 s, 0.98
=3750 upcxx::init After=9.70/9.74/9.78/9.99 s, 0.98
=3765 upcxx::init After=9.60/9.61/9.74/9.95 s, 0.98
=3773 upcxx::init After=9.74/9.75/9.80/9.99 s, 0.98
=3777 upcxx::init After=9.88/9.89/9.94/10.14 s, 0.98
=3779 upcxx::init After=9.51/9.60/9.58/9.74 s, 0.98
=3780 upcxx::init After=9.68/9.89/9.82/10.06 s, 0.98
=3781 *** FATAL ERROR (proc 4): in gasnetc_ep_bindsegment() at pcxx-develop/bld/GASNet-stable/ofi-conduit/gasnet_ofi.c:1724: fi_mr_enable failed: -28(No space left on device)
=3782 >50s
=3812 >50s
=3875 >50s
=4000 >90s
… so 3780 works, 3781 is the magic number that triggers an OFI error, and >3781 stalls indefinitely.
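The search above can be automated; a minimal sketch of the bisection, where try_count is a stand-in for actually launching a job with GASNET_OFI_NUM_RECEIVE_BUFFS set and checking that upcxx::init() completes promptly (the 3780/3781 boundary is hard-coded here purely to illustrate the loop):

```shell
#!/bin/bash
# Bisect for the largest working receive-buffer count.
try_count() {                      # placeholder predicate: in reality, run the job
  [ "$1" -le 3780 ]                # and return success if upcxx::init() finished quickly
}
lo=2; hi=4000                      # known-good and known-bad endpoints
while [ $(( hi - lo )) -gt 1 ]; do
  mid=$(( (lo + hi) / 2 ))
  if try_count "$mid"; then lo=$mid; else hi=$mid; fi
done
echo "largest working count: $lo (first failing: $hi)"
```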
-
Thanks, @Rob Egan
I will note that 3781 / 4 = 945, which is close to the 928 limit you found on the GPU nodes when using only 1 NIC. So this is consistent with our guess that (allowing for some noise) it is some per-NIC limit (not per-host) that we are running up against.
-
This issue is being dealt with as a GASNet-EX bug:
Bug 4573 - Improve GASNET_OFI_NUM_RECEIVE_BUFFS special-case defaults
Any additional comments should be directed there.
-
- changed status to closed
4 CPU nodes using the "recv" slingshot workaround:
GASNET_OFI_NUM_RECEIVE_BUFFS= (empty): Timeout in upcxx::init() after 2 min.
(no change): Timeout in upcxx::init() after 2 min.
GASNET_OFI_NUM_RECEIVE_BUFFS=450: Timeout in upcxx::init() after 2 min.
GASNET_OFI_NUM_RECEIVE_BUFFS=442: Timeout in upcxx::init() after 2 min.
GASNET_OFI_NUM_RECEIVE_BUFFS=439: Timeout in upcxx::init() after 2 min.
GASNET_OFI_NUM_RECEIVE_BUFFS=438: Timeout in upcxx::init() after 2 min.
GASNET_OFI_NUM_RECEIVE_BUFFS=437: upcxx::init() in <7s
GASNET_OFI_NUM_RECEIVE_BUFFS=435: upcxx::init() in <7s
GASNET_OFI_NUM_RECEIVE_BUFFS=425: upcxx::init() in <7s
GASNET_OFI_NUM_RECEIVE_BUFFS=400: upcxx::init() in <7s
GASNET_OFI_NUM_RECEIVE_BUFFS=300: upcxx::init() in <7s
GASNET_OFI_NUM_RECEIVE_BUFFS=200: upcxx::init() in <7s
GASNET_OFI_NUM_RECEIVE_BUFFS=100: upcxx::init() in <7s
GASNET_OFI_NUM_RECEIVE_BUFFS=50: upcxx::init() in <7s
GASNET_OFI_NUM_RECEIVE_BUFFS=20: upcxx::init() in <7s
GASNET_OFI_NUM_RECEIVE_BUFFS=10: upcxx::init() in <7s
In all the tests that did not time out, mhm2 completed in about 50s.
4 GPU nodes using the "recv" slingshot workaround:
GASNET_OFI_NUM_RECEIVE_BUFFS= (empty): FATAL ERROR (proc 1): in gasnetc_ofi_read_env_vars() at cxx-develop/bld/GASNet-develop/ofi-conduit/gasnet_ofi.c:536: GASNET_OFI_NUM_RECEIVE_BUFFS must be at least 2
(no change): upcxx::init() in <9s
GASNET_OFI_NUM_RECEIVE_BUFFS=1000: Timeout in upcxx::init() after 90s
GASNET_OFI_NUM_RECEIVE_BUFFS=937: Timeout in upcxx::init() after 90s
GASNET_OFI_NUM_RECEIVE_BUFFS=929: Timeout in upcxx::init() after 90s
GASNET_OFI_NUM_RECEIVE_BUFFS=928: upcxx::init() in <9s
GASNET_OFI_NUM_RECEIVE_BUFFS=927: upcxx::init() in <9s
GASNET_OFI_NUM_RECEIVE_BUFFS=925: upcxx::init() in <9s
GASNET_OFI_NUM_RECEIVE_BUFFS=921: upcxx::init() in <9s
GASNET_OFI_NUM_RECEIVE_BUFFS=906: upcxx::init() in <9s
GASNET_OFI_NUM_RECEIVE_BUFFS=875: upcxx::init() in <9s
GASNET_OFI_NUM_RECEIVE_BUFFS=750: upcxx::init() in <9s
GASNET_OFI_NUM_RECEIVE_BUFFS=450: upcxx::init() in <9s
In all the ones that did not time out, mhm2 completed the test in about 32s.
I think this indicates we have 2 separate issues: 1) upcxx::init tends to stall indefinitely with a high value of GASNET_OFI_NUM_RECEIVE_BUFFS, and 2) there may be a performance problem allocating huge pages on the GPU machines during upcxx::init. Let's have this Issue track the former; when I have more data, I'll open a separate issue on the latter.