intermittent lpc-stress/opt failures on ARM64
GitLab testing has shown intermittent hangs of the lpc-stress test on wombat, so far only for seq-opt-ibv. The timeout failures look like this (complete output from the logs):
Tue May 25 23:58:38 PDT 2021
++ eval echo
+++ echo
+ /autofs/nccs-svm1_envoy_od/phargrov/.gitlab-runner/builds/Xnp8CS4f/0/anl/upcpp/bld/upcxx.assert1.optlev0.dbgsym1.gasnet_seq.ibv/bin/upcxx-run -np 4 -- timeout --foreground -k 420s 300s ./test-lpc-stress-seq-opt-ibv
Test: lpc-stress.cpp
Ranks: 4
iters = 10000 threads = 10 WORDS = 64
*** Caught a signal (proc 3): SIGTERM(15)
*** Caught a signal (proc 2): SIGTERM(15)
*** Caught a signal (proc 0): SIGTERM(15)
*** Caught a signal (proc 1): SIGTERM(15)
Wed May 26 00:03:40 PDT 2021
I've seen this intermittently in GitLab runs, including the following:
- https://gitlab-ci.alcf.anl.gov/anl/upcpp/-/jobs/45848
- https://gitlab-ci.alcf.anl.gov/anl/upcpp/-/jobs/46081
- https://gitlab-ci.alcf.anl.gov/anl/upcpp/-/jobs/46082
- https://gitlab-ci.alcf.anl.gov/anl/upcpp/-/jobs/46087
- https://gitlab-ci.alcf.anl.gov/anl/upcpp/-/jobs/46096
- https://gitlab-ci.alcf.anl.gov/anl/upcpp/-/jobs/46185
- clang-12.0.0 w/ develop @ a6c99df5
- https://gitlab-ci.alcf.anl.gov/anl/upcpp/-/jobs/46243
- clang/4.0.0-gcc640 w/ develop @ a6c99df5
- notably shows a failure of both lpc-stress-seq-opt-{smp,udp}
- https://gitlab-ci.alcf.anl.gov/anl/upcpp/-/jobs/46244
- gcc-6.4.0 w/ develop @ a6c99df5
What we know:
- These failures span compilers: clang/12.0.0-gcc1020, clang/4.0.0-gcc640, gcc6.4.0. These are all using compilers built by Paul, but (especially given the similar behavior across compilers) I have no reason to suspect the compiler.
- Failures include both with and without cuda support built into UPC++
- I think I've only seen it fail on seq-opt, despite the fact every GitLab job runs all four combinations of {seq,par}-{opt,debug}-ibv.
- The failures above are on PR branches doing work on upcxx::copy() because that's what I've been running lately, but the test itself makes no copy calls, so the PR branch changes are almost certainly irrelevant.
This test is almost entirely LPC-based: it spawns threads that talk to each other via LPC. The only inter-process communication is a barrier at startup and before exit, so the test should be almost entirely independent of conduit. The test is written to be independent of seq/par mode, and given that the "meat" of the test doesn't touch the backend or GASNet, I don't have a theory regarding why a failure would only show up in seq mode.
The test is designed to identify memory races in the LPC queues, which we have speculated may be present on non-x86_64 architectures (including wombat's ARM processors). The hangs might indicate that such a failure has been found, but there's no way to tell for sure without a lot more information, such as reproducing the problem in a debugger.
So far I have NOT been able to reproduce this manually at all on wombat, even when using the same compiler and configure line used by GitLab CI and performing thousands of trials. These configs notably disable ODP and run dual-rail. I've also tried varying those dimensions, as well as the conduit (smp, udp, ibv), to no avail.
Comments (8)
-
reporter -
reporter - edited description
-
I have direct evidence that this is NOT a matter of oversubscribing cores. Below is a snapshot of a test-lpc-stress-seq-opt-ibv running in a Wombat GitLab CI tester.
This shows 1100% CPU, consistent w/ 11 concurrent threads and TIME advancing much faster than wallclock, and hwloc shows the procs are NOT bound.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
161997 phargrov 20 0 3973952 164288 147072 R 1100 0.1 10:09.95 test-lpc-stress
161998 phargrov 20 0 3973952 164224 147008 R 1100 0.1 10:09.95 test-lpc-stress
{phargrov@wombat8 ~}$ hwloc-bind --get --pid 161998
0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff
{phargrov@wombat8 ~}$ hwloc-bind --get --pid 161997
0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff
-
reporter I've now run over 100 GitLab trials of this test using ibv-conduit and all four build configs across all four compilers, where the only difference from the GitLab runs demonstrating the problem is the addition of configure option --with-mpsc-queue=biglock: https://gitlab-ci.alcf.anl.gov/anl/upcpp/-/pipelines/5906/builds
Ignoring one job that was inadvertently cancelled, ALL of the remaining jobs passed without hang or error, whereas the same setup with the default MPSC queues has been observed to hang a large fraction of the time.
This appears to be strong evidence that the MPSC queues, already suspected as the root cause, are at least a contributing factor in the observed failures.
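For readers unfamiliar with the distinction, a "big lock" queue in general serializes every enqueue and dequeue behind a single mutex, trading concurrency for simplicity. The following is a minimal sketch of that general technique only; the class name BigLockQueue and its methods are hypothetical and are not the UPC++ implementation selected by --with-mpsc-queue=biglock.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical illustration: every operation takes the same mutex, so
// producers and the consumer never touch the queue state concurrently.
template <typename T>
class BigLockQueue {
 public:
  // Producer side: append one item under the lock.
  void enqueue(T value) {
    std::lock_guard<std::mutex> guard(lock_);
    items_.push_back(std::move(value));
  }

  // Consumer side: remove up to max_n items in FIFO order under the lock.
  std::vector<T> burst(std::size_t max_n) {
    std::vector<T> out;
    std::lock_guard<std::mutex> guard(lock_);
    while (!items_.empty() && out.size() < max_n) {
      out.push_back(std::move(items_.front()));
      items_.pop_front();
    }
    return out;
  }

 private:
  std::mutex lock_;
  std::deque<T> items_;
};
```

Because the mutex provides all the required ordering, a design like this sidesteps questions about relaxed memory ordering in the lock-free path, which is why swapping it in is a useful way to isolate the queue as a suspect.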
-
reporter - changed title to intermittent lpc-stress/opt failures on ARM64
-
assigned issue to
I now understand this problem; a solution is on the way soon.
-
reporter Proposed solution in PR 363
-
reporter - changed status to resolved
Fix issue 479: intermittent lpc-stress/opt failures on ARM64
Turns out this was not actually a hang, but rather a pathological slow-down caused by thrashing the memory system.
After reproducing the slow-down manually on a wombat login node (Cavium ThunderX2), I used gdb to confirm the master thread was spending most of its time trying to enqueue LPC acknowledgements to producer threads who are spinning on burst (dequeue).
Inserting padding to separate the head and tail pointers in the atomic MPSC queues seems to reliably solve the problem.
Even on architectures where this wasn't causing a critical slow-down, it's likely this defect was degrading the performance of our multi-threaded LPC queues any time they exceeded one element. The queue algorithm was designed to allow the head and tail to proceed independently, but with both pointers on the same cache line, false sharing between the producers and the consumer was likely leading to lots of unnecessary coherence cache misses.
→ <<cset e6dc7ff19986>>
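The padding idea in the commit message above can be illustrated with a minimal C++ sketch. The type names (Node, QueueUnpadded, QueuePadded), the helper same_cache_line, and the 64-byte line size are assumptions chosen for illustration; this is not the actual intru_queue code.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size for this sketch; the real value is platform-specific.
constexpr std::size_t kCacheLine = 64;

// Hypothetical intrusive queue node.
struct Node {
  std::atomic<Node*> next{nullptr};
};

// Unpadded layout: head and tail are adjacent in memory, so they will
// usually land on the same cache line and be subject to false sharing.
struct QueueUnpadded {
  std::atomic<Node*> head{nullptr};
  std::atomic<Node*> tail{nullptr};
};

// Padded layout: each pointer is aligned to its own cache line, so the
// consumer updating head does not invalidate the line producers use for tail.
struct QueuePadded {
  alignas(kCacheLine) std::atomic<Node*> head{nullptr};
  alignas(kCacheLine) std::atomic<Node*> tail{nullptr};
};

// True when both addresses fall within the same kCacheLine-sized block.
inline bool same_cache_line(const void* a, const void* b) {
  return reinterpret_cast<std::uintptr_t>(a) / kCacheLine ==
         reinterpret_cast<std::uintptr_t>(b) / kCacheLine;
}

static_assert(sizeof(QueuePadded) >= 2 * kCacheLine,
              "padded queue should span at least two cache lines");
```

For any QueuePadded object q, same_cache_line(&q.head, &q.tail) is false. The cost is a larger queue object, which is usually an acceptable trade for avoiding coherence traffic between producers and the consumer.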
-
reporter Merge pull request #363 into develop
- lpc-queues: Update ChangeLog
- Fix issue 479: intermittent lpc-stress/opt failures on ARM64
- intru_queue: Fix multiple relaxed memory defects in MPSC queues
- intru_queue: Cosmetic updates
→ <<cset d2eb43398c87>>
Description edited to add new failures seen on smp and udp conduits.
Example output from https://gitlab-ci.alcf.anl.gov/anl/upcpp/-/jobs/46243 :