UPC++ crashes at startup on Perlmutter ofi with libfabric/1.15.2.0

Issue #576 resolved
Dan Bonachea created an issue

During a recent maintenance window, the system software on NERSC Perlmutter was upgraded to libfabric/1.15.2.0, which exposed a defect in the GASNet-EX 2022.9.0 release that is embedded in the UPC++ 2022.9.0 release. The defect causes the following failure message at startup of any program using the high-speed ofi network with the default cxi libfabric provider:

*** FATAL ERROR: Assertion failure (proc 0): in gasnetc_ofi_init() at
[...]/ofi-conduit/gasnet_ofi.c:905: has_mr_scalable ==
!(info->domain_attr->mr_mode & FI_MR_ALLOCATED)
  op1 : 1 (0x00000001) == has_mr_scalable
  op2 : 0 (0x00000000) == !(info->domain_attr->mr_mode & FI_MR_ALLOCATED)

More details are available in Bug 4553.

We expect the same problem to affect other HPE Cray EX systems with Slingshot-11 hardware as they are updated to libfabric/1.15.2 or later.
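
For admins who want to check whether a given libfabric version falls in the range described above (1.15.2 or later), here is a minimal sketch of a version comparison in shell. The function name `version_at_least` is a hypothetical helper and not part of UPC++, GASNet-EX, or libfabric. It assumes a `sort` that supports `-V` (as in GNU coreutils), and it assumes you supply the installed libfabric version string yourself. A version in this range only means the system may be affected. The failure also depends on using a GASNet-EX release without the fix.

```shell
#!/bin/sh
# Hypothetical helper (not part of any of the packages discussed here):
# succeeds if version $1 is greater than or equal to version $2.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_at_least() {
    # Version-sort the two strings and take the lowest one.
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    # $1 >= $2 exactly when the lowest of the pair is $2.
    [ "$lowest" = "$2" ]
}

# Example: the version string here is supplied by hand (an assumption);
# substitute whatever your system reports for its libfabric install.
installed="1.15.2.0"
if version_at_least "$installed" "1.15.2"; then
    echo "libfabric $installed is in the range described in this issue"
else
    echo "libfabric $installed predates the range described in this issue"
fi
```

With the version string shown, the first branch is taken.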

Comments (1)

  1. Dan Bonachea reporter

    There is now a new GASNet-EX 2022.9.2 release that addresses this problem, and the fix will be embedded in future releases of UPC++.

    The Perlmutter system installs provided by the UPC++ maintainers have already been patched to resolve this problem, and UPC++ users on Perlmutter are highly encouraged to use those installs.

    Users installing their own UPC++ 2022.9.0 or earlier on an affected system can pull in the fixed GASNet release using the following configure option:

    $ configure --with-gasnet=https://gasnet.lbl.gov/EX/GASNet-2022.9.2.tar.gz [... other normal options ...]
    