- changed status to wontfix
Failure of test-spec-issue160-seq-debug-ucx on SS-10 with "vanilla" UCX
When evaluating recent UCX releases, I attempted to build and test my own builds of UCX on an HPE Cray EX system with the Slingshot-10 network.
This issue is NOT present with UCX on Summit. This issue is NOT present with HPE's UCX build on the Slingshot-10 system.
In this problematic (and NOT recommended) configuration, I saw the following assertion failure from test-spec-issue160-seq-debug-ucx:
Starting...
test-spec-issue160-seq-debug-ucx: /home/hargrove/upcxx/test/regression/spec-issue160.cpp:37: int main(): Assertion `*gp.local() == rank_n()' failed.
*** Caught a fatal signal (proc 0): SIGABRT(6)
*** NOTICE (proc 0): Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace.
The backtrace is uninteresting:
[0] #12 0x000014c2795058d2 in __assert_fail () from /lib64/libc.so.6
[0] #13 0x0000000000407a53 in main () at /home/hargrove/upcxx/test/regression/spec-issue160.cpp:37
Dan and I worked in Slack to diagnose the problem. While we identified some UB in the test, a version adjusted to eliminate that still showed anomalous behavior.
We determined that the UCX native atomics in this build were either (a) ignoring the memory hierarchy (and caches in particular) in such a way as to be essentially unusable, or (b) potentially performing correct atomics on the incorrect memory location. We did not follow up beyond identifying these two likely causes.
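For anyone revisiting this later, the invariant behind the failed assertion can be sketched roughly as follows. This is a hypothetical shared-memory analogue built on `std::thread` and `std::atomic`, not the actual spec-issue160.cpp and not UPC++ or UCX code. It assumes the test has every rank atomically increment a counter that the owning rank then reads through a local pointer; the name `count_arrivals` is invented for illustration.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical stand-in for the pattern (assumption: NOT the real test).
// Each "rank" is modeled as a thread that performs one atomic increment
// on a shared counter. After all ranks finish, the owner reads the counter
// directly, which mirrors the check `*gp.local() == rank_n()`.
int count_arrivals(int n_ranks) {
    std::atomic<int> counter{0};
    std::vector<std::thread> ranks;
    ranks.reserve(n_ranks);
    for (int r = 0; r < n_ranks; ++r) {
        ranks.emplace_back([&counter] {
            counter.fetch_add(1, std::memory_order_relaxed);
        });
    }
    // join() plays the role of the barrier: it orders every increment
    // before the read below.
    for (auto &t : ranks) t.join();
    return counter.load(std::memory_order_relaxed);
}
```

With working atomics, `count_arrivals(n)` returns `n`. In terms of this sketch, hypothesis (a) would correspond to the final read not observing increments that did complete, and hypothesis (b) to the increments landing on some address other than the counter.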
Comments (2)
reporter In case it is useful in the future, the Slack discussion mentioned in the initial report is here
This issue exists to document the failure, in the hope of avoiding duplicate efforts to triage the same problem in the future.
We do not currently recommend use of UCX conduit on the HPE Cray EX platform, and certainly do not encourage replacement of the vendor-provided UCX install with a self-built one. Therefore, we do not have any plans to pursue fixing this issue unless/until it appears in a supported configuration.