UPC++ crashes at startup on Perlmutter ofi with libfabric/1.15.2.0
Issue #576
resolved
During recent maintenance, the NERSC Perlmutter system software was upgraded to libfabric/1.15.2.0, which revealed a shortcoming in the GASNet-EX 2022.9.0 release that is embedded in the UPC++ 2022.9.0 release. That defect produces the following failure message at startup of any program using the high-speed ofi network with the default cxi libfabric provider:
*** FATAL ERROR: Assertion failure (proc 0): in gasnetc_ofi_init() at
[...]/ofi-conduit/gasnet_ofi.c:905: has_mr_scalable ==
!(info->domain_attr->mr_mode & FI_MR_ALLOCATED)
op1 : 1 (0x00000001) == has_mr_scalable
op2 : 0 (0x00000000) == !(info->domain_attr->mr_mode & FI_MR_ALLOCATED)
More details are available in Bug 4553.
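The failing check is a consistency test between two values. The first is GASNet's has_mr_scalable flag. The second is whether the provider's reported mr_mode bitmask lacks the FI_MR_ALLOCATED bit. The short C sketch below mirrors only that comparison, so you can see why the log shows op1 = 1 and op2 = 0. A provider reporting FI_MR_ALLOCATED makes op2 evaluate to 0, and this contradicts has_mr_scalable = 1. The helper name mr_mode_consistent and the constant SKETCH_FI_MR_ALLOCATED are hypothetical. The constant's numeric value is an illustrative assumption and is not the real definition from libfabric's <rdma/fabric.h>. This is not GASNet's actual code.

```c
#include <assert.h>   /* available for the usage checks */
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for libfabric's FI_MR_ALLOCATED bit.
 * The real constant comes from <rdma/fabric.h>; this value is an assumption. */
#define SKETCH_FI_MR_ALLOCATED (UINT64_C(1) << 5)

/* Hypothetical helper mirroring the invariant asserted in gasnetc_ofi_init():
 *   has_mr_scalable == !(info->domain_attr->mr_mode & FI_MR_ALLOCATED)
 * Returns 1 when the two sides agree, 0 when the assertion would fire. */
static int mr_mode_consistent(int has_mr_scalable, uint64_t mr_mode)
{
    int op1 = has_mr_scalable;
    int op2 = !(mr_mode & SKETCH_FI_MR_ALLOCATED); /* 1 if the bit is clear */

    if (op1 != op2) {
        /* Print the two operands in the same shape as the GASNet log. */
        fprintf(stderr, "op1 : %d == has_mr_scalable\n", op1);
        fprintf(stderr, "op2 : %d == !(mr_mode & FI_MR_ALLOCATED)\n", op2);
    }
    return op1 == op2;
}
```

Calling mr_mode_consistent(1, SKETCH_FI_MR_ALLOCATED) reproduces the operand values from the failure above (op1 = 1, op2 = 0) and returns 0.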
We expect the same problem to affect other HPE Cray EX systems with Slingshot-11 hardware as they are updated to libfabric/1.15.2 or later.
Comments (1)
There is now a new GASNet-EX 2022.9.2 release that addresses this problem, and the fix will be embedded in future releases of UPC++.
The Perlmutter system installs provided by the UPC++ maintainers have already been patched to resolve this problem, and UPC++ users on Perlmutter are highly encouraged to use those installs.
Users installing their own UPC++ 2022.9.0 or earlier on an affected system can pull in the fixed GASNet release using the following configure option: