Some tests of native CUDA memory kinds are crashing Perlmutter GPU nodes

Issue #588 resolved
Paul Hargrove created an issue

In the first nightly regression test runs after Perlmutter's return from maintenance last night, the reported results just end suddenly and there are output files which have non-zero length but contain only \0 bytes. Additionally, Dan observed that in his Perlmutter GitLab CI runs:

Looks like test-memory_kinds-seq-debug-ofi is repeatedly running for 6+ minutes, which is way too long and soaking up the entire job time

I have since determined that test/memory_kinds crashes nodes of Perlmutter when compiled with native CUDA memory kinds enabled. Specifically, ofi conduit builds crash nodes (even in single-node runs), while smp and mpi conduit builds do not. This is based on testing with the upcxx-cuda/2022.9.0 environment module and the 2022.9.0 version of the memory_kinds test itself. So, recent changes to our code are ruled out as the cause.

Dan and I have discussed this in Slack, which led to runs of cuda-context with and without the test-specific work-around for two known issues originally observed on the SS-10 network:

Bug 4396 - ucx-conduit crashes w/native CUDA memory kinds and device segment free
Bug 4504 - ofi/verbs crashes w/native CUDA memory kinds and device segment free

Since the work-around (-DSKIP_DEVICE_FREE) is effective, we believe this to be a more serious form of the same problem: bad behavior in the presence of a device segment free.
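
For anyone reproducing this, the work-around amounts to defining a preprocessor macro when building the affected test. The following is only a sketch: it assumes the `upcxx` compiler wrapper is on PATH, and the source path and output name are placeholders, not taken from this issue. Judging by the macro name and the bug titles above, the define makes the test skip the device segment free.

```shell
# Placeholders (assumptions): substitute the real location of the test source.
SRC="path/to/cuda-context.cpp"
OUT="cuda-context-skipfree"

# -DSKIP_DEVICE_FREE is the test-specific work-around named in this issue.
BUILD_CMD="upcxx -g -DSKIP_DEVICE_FREE $SRC -o $OUT"

# Print the command instead of running it, so the sketch is harmless
# on a machine without UPC++ installed.
echo "$BUILD_CMD"
```

Building once with and once without the define gives the pair of binaries needed for the with/without comparison described above.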

In addition to memory_kinds and cuda-context, it is suspected (but unverified) that spec-issue189 has the same problem.

As a result of my repeated crashing of their nodes, NERSC created ticket INC0200272 for this issue.

Comments (10)

  1. Paul Hargrove reporter

    GitLab CI for Perlmutter has been tweaked to remove the "bad" tests between compile and run stages.

  2. Paul Hargrove reporter

    Nightly regression testing (aka "pushbuild") has been tweaked to add the bad tests to a ban-list. Unlike the GitLab CI case, this also eliminates the build of these tests, but that probably is all that is practical at the moment.

  3. Paul Hargrove reporter

    It has been determined that memberof and h-d-remote crash nodes only when UPC++ is configured using --with-hip (and the necessary module load hip prerequisite).

  4. Dan Bonachea

    NERSC consultants report that FI_MR_CUDA_CACHE_MONITOR_ENABLED=0 might be an effective workaround.

    Assuming we can confirm this workaround, it should be added to site-docs.
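
A minimal sketch of applying the suggested setting by hand, for users who are not relying on an environment module to do it. The launch line is a hypothetical example (the binary name and rank count are assumptions) and is left commented out; the setting itself is unconfirmed as a workaround at this point in the thread.

```shell
# Workaround suggested by NERSC consultants: disable the CUDA
# memory-registration cache monitor before launching the job.
export FI_MR_CUDA_CACHE_MONITOR_ENABLED=0

# Hypothetical launch; adjust the rank count and binary to your setup.
# upcxx-run -n 2 ./memory_kinds

# Confirm the value that child processes will inherit.
echo "FI_MR_CUDA_CACHE_MONITOR_ENABLED=$FI_MR_CUDA_CACHE_MONITOR_ENABLED"
```

Exporting the variable in the job script, rather than on the launch line, keeps it in effect for every run in that allocation.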

  5. Paul Hargrove reporter

    I am testing now with the hope of making the following updates if things check out:

    • Add the setting to our upcxx-cuda environment module on Perlmutter
    • Add the setting to site-docs.md as something one must do if using upcxx-cuda and not using the env module
    • Update GitLab CI on Perlmutter to use the setting instead of skipping the tests
    • Update nightly regression testing to use the setting instead of skipping the tests

  6. Paul Hargrove reporter

    Since the last update, NERSC and HPE have deployed an initial fix for the underlying problem on Perlmutter and promised a final fix at the next system maintenance.

    Therefore, the changes for the first two of the four items I listed two comments back have been reverted, the work-arounds for the latter two items have been removed, and I am lowering this issue's priority from "blocker" to "minor".

    I am deferring closing this issue as "worksforme" until after Perlmutter's next maintenance and my verification that the "final fix" is as effective as the current temporary one.

    It is not known which version(s) of which vendor software components are the source of the problem and/or its fix. Given the urgency with which NERSC and HPE addressed the issue, I think it unlikely that many (if any) other HPE Cray EX systems will have the impacted versions in use for very long. So, hopefully our inability to document version information won't be a problem.

  7. Paul Hargrove reporter

    I am deferring closing this issue as "worksforme" until after Perlmutter's next maintenance and my verification that the "final fix" is as effective as the current temporary one.

    I am several months late doing so, but I am closing this issue since the "final fix" has been working fine on Perlmutter. Note that there is no "worksforme" option. So, I am just closing this as "resolved".
