Failure of test-spec-issue160-seq-debug-ucx on SS-10 with "vanilla" UCX

Issue #590 wontfix
Paul Hargrove created an issue

When evaluating recent UCX releases, I attempted to build and test my own builds of UCX on an HPE Cray EX system with the Slingshot-10 network.

This issue is NOT present with UCX on Summit. This issue is NOT present with HPE's UCX build on the Slingshot-10 system.

In this problematic (and NOT recommended) configuration I saw the following assertion failure from test-spec-issue160-seq-debug-ucx:

Starting...
test-spec-issue160-seq-debug-ucx: /home/hargrove/upcxx/test/regression/spec-issue160.cpp:37: int main(): Assertion `*gp.local() == rank_n()' failed.
*** Caught a fatal signal (proc 0): SIGABRT(6)
*** NOTICE (proc 0): Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace.

The backtrace is uninteresting:

[0] #12 0x000014c2795058d2 in __assert_fail () from /lib64/libc.so.6
[0] #13 0x0000000000407a53 in main () at /home/hargrove/upcxx/test/regression/spec-issue160.cpp:37

Dan and I worked in Slack to diagnose the problem. While we identified some undefined behavior (UB) in the test, a version adjusted to eliminate it still showed anomalous behavior.

We determined that the UCX native atomics in this build were either (a) ignoring the memory hierarchy (and caches in particular) in such a way as to be essentially unusable, or (b) potentially performing correct atomics on the incorrect memory location. We did not follow up beyond identifying these two likely causes.
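
For readers unfamiliar with the invariant at stake, here is a minimal sketch in standard C++ (not UPC++, and not the code of spec-issue160.cpp). It assumes the failing assertion checks that a counter incremented atomically once per rank ends up equal to rank_n(); the report shows only the assertion, not the test body. The helper name count_with_atomics is hypothetical, and threads stand in for ranks.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper, not taken from the UPC++ test: each worker thread
// plays the role of one rank and atomically adds 1 to a shared counter.
int count_with_atomics(int n_workers) {
    assert(n_workers >= 0);
    std::atomic<int> counter{0};
    std::vector<std::thread> workers;
    workers.reserve(static_cast<std::size_t>(n_workers));
    for (int i = 0; i < n_workers; ++i) {
        workers.emplace_back([&counter] {
            // The increment itself only needs atomicity, not ordering.
            counter.fetch_add(1, std::memory_order_relaxed);
        });
    }
    // Joining makes every worker's increment visible to this thread.
    for (auto& w : workers) w.join();
    return counter.load();
}
```

With working atomics, count_with_atomics(n) returns n for any n >= 0. Either of the two suspected causes would break the analogous check: an update that never becomes visible to the reader, or an update applied to a different location, both leave the counter short of the expected total.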

Comments (2)

  1. Paul Hargrove reporter

    This issue exists to document the failure in the hopes of avoiding duplicate efforts to triage the same in the future.

    We do not currently recommend use of UCX conduit on the HPE Cray EX platform, and certainly do not encourage replacement of the vendor-provided UCX install with a self-built one. Therefore, we do not have any plans to pursue fixing this issue unless/until it appears in a supported configuration.
