Intermittent validation failures in test/copy.cpp on dirac

Issue #241 closed
Paul Hargrove created an issue

This month our CI testing has seen at least two failures of test/copy.cpp in O3/PAR configuration.
The output from one case is shown at the end of this posting, while the other differs only in the expected value (178 vs 176).

The first was on July 1, 2019 and occurred on two nodes of Dirac using g++-6.4.0 and ancient Tesla C2050 GPUs (CUDA 9.0).

The second was on July 25, 2019. While this was also on a pair of Dirac nodes, this case was using clang-8.0.0 (w/ libstdc++ from g++ 8.3.0) and was running on a pair of Tesla M4 GPUs (CUDA 10.1).

Even the model of InfiniBand NIC differs between the node pairs which produced those two failures.

Since the toolchain, NIC, and GPU models all differ between those two cases, this is almost certainly our bug.

Running with devices=1
done round=0 initiator=0
done round=1 initiator=0
done round=2 initiator=0
done round=3 initiator=0
done round=0 initiator=0
done round=1 initiator=0
done round=2 initiator=0
done round=3 initiator=0
done round=0 initiator=1
done round=1 initiator=1
done round=2 initiator=1
done round=3 initiator=1
done round=0 initiator=1
done round=1 initiator=1
done round=2 initiator=1
done round=3 initiator=1
done round=0 initiator=2
done round=1 initiator=2
done round=2 initiator=2
done round=3 initiator=2
done round=0 initiator=2
done round=1 initiator=2
done round=2 initiator=2
done round=3 initiator=2
done round=0 initiator=3
done round=1 initiator=3
done round=2 initiator=3
done round=3 initiator=3
done round=0 initiator=3
done round=1 initiator=3
done round=2 initiator=3
done round=3 initiator=3
done round=0 initiator=4
done round=1 initiator=4
done round=2 initiator=4
done round=3 initiator=4
done round=0 initiator=4
done round=1 initiator=4
done round=2 initiator=4
done round=3 initiator=4
done round=0 initiator=5
done round=1 initiator=5
done round=2 initiator=5
done round=3 initiator=5
done round=0 initiator=5
done round=1 initiator=5
done round=2 initiator=5
done round=3 initiator=5
done round=0 initiator=6
done round=1 initiator=6
done round=2 initiator=6
done round=3 initiator=6
done round=0 initiator=6
done round=1 initiator=6
done round=2 initiator=6
done round=3 initiator=6
done round=0 initiator=7
done round=1 initiator=7
done round=2 initiator=7
done round=3 initiator=7
done round=0 initiator=7
done round=1 initiator=7
done round=2 initiator=7
done round=3 initiator=7
*** FATAL ERROR (proc 0): 
//////////////////////////////////////////////////
UPC++ assertion failure:
 rank=0
 file=/home/data2/upcnightly/dirac/EX-dirac-ibv-clang_CX5/work/opt/upcxx/test/copy.cpp:191

Expected 176 got 0

To have UPC++ freeze during these errors so you can attach a debugger, rerun the program with GASNET_FREEZE_ON_ERROR=1 in the environment.
//////////////////////////////////////////////////

NOTICE: Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace. 
NOTICE: We recommend linking the debug version of GASNet to assist you in resolving this application issue.
*** Caught a fatal signal (proc 0): SIGABRT(6)

Comments (13)

  1. Paul Hargrove reporter

    I am going to spend some time today to reproduce manually and hopefully at least determine if the problem was present in our March release or not. As time allows, I am also hoping to determine if this is specific to PAR or not.
    So, look for those updates as they become available.

  2. Paul Hargrove reporter

    Approximately 8 hours of continuous running has yet to reproduce the problem in my manual testing.
    I am going to "capture" the executable from the CI run tonight to eliminate any possibility that I've failed to reproduce the compilation parameters correctly.

  3. Dan Bonachea

    CI captured a stack trace of this last night in debug mode:

    [0]   Id   Target Id         Frame 
    [0] * 1    Thread 0x7f4a004d1b00 (LWP 32755) "copy-par" 0x00007f49fda15a89 in waitpid () from /usr/lib64/libc.so.6
    [0]   2    Thread 0x7f49fba78700 (LWP 32759) "copy-par" 0x00007f49fda4420d in poll () from /usr/lib64/libc.so.6
    [0]   3    Thread 0x7f49f8cc2700 (LWP 32763) "copy-par" 0x00007f49fda4f483 in epoll_wait () from /usr/lib64/libc.so.6
    [0]   4    Thread 0x7f49cab50700 (LWP 306) "copy-par" 0x00007f49fda5033f in accept4 () from /usr/lib64/libc.so.6
    [0]   5    Thread 0x7f49ca34f700 (LWP 307) "copy-par" 0x00007f49fda4420d in poll () from /usr/lib64/libc.so.6
    [0] 
    [0] Thread 5 (Thread 0x7f49ca34f700 (LWP 307)):
    [0] #0  0x00007f49fda4420d in poll () from /usr/lib64/libc.so.6
    [0] #1  0x00007f49fe8343e3 in ?? () from /usr/lib64/libcuda.so.1
    [0] #2  0x00007f49fe8c230d in ?? () from /usr/lib64/libcuda.so.1
    [0] #3  0x00007f49fe836a58 in ?? () from /usr/lib64/libcuda.so.1
    [0] #4  0x00007f49ffea6dd5 in start_thread () from /usr/lib64/libpthread.so.0
    [0] #5  0x00007f49fda4eead in clone () from /usr/lib64/libc.so.6
    [0] 
    [0] Thread 4 (Thread 0x7f49cab50700 (LWP 306)):
    [0] #0  0x00007f49fda5033f in accept4 () from /usr/lib64/libc.so.6
    [0] #1  0x00007f49fe83538a in ?? () from /usr/lib64/libcuda.so.1
    [0] #2  0x00007f49fe82768d in ?? () from /usr/lib64/libcuda.so.1
    [0] #3  0x00007f49fe836a58 in ?? () from /usr/lib64/libcuda.so.1
    [0] #4  0x00007f49ffea6dd5 in start_thread () from /usr/lib64/libpthread.so.0
    [0] #5  0x00007f49fda4eead in clone () from /usr/lib64/libc.so.6
    [0] 
    [0] Thread 3 (Thread 0x7f49f8cc2700 (LWP 32763)):
    [0] #0  0x00007f49fda4f483 in epoll_wait () from /usr/lib64/libc.so.6
    [0] #1  0x00007f49fcd33f1d in epoll_dispatch (base=0xd83ce0, tv=<optimized out>) at ../../../../../../openmpi-4.0.0/opal/mca/event/libevent2022/libevent/epoll.c:407
    [0] #2  0x00007f49fcd36f21 in opal_libevent2022_event_base_loop (base=0xd83ce0, flags=1) at ../../../../../../openmpi-4.0.0/opal/mca/event/libevent2022/libevent/event.c:1630
    [0] #3  0x00007f49fae0fdfe in progress_engine () from /usr/local/pkg/openmpi-4.0.0/clang-8.0.1/lib/openmpi/mca_pmix_pmix3x.so
    [0] #4  0x00007f49ffea6dd5 in start_thread () from /usr/lib64/libpthread.so.0
    [0] #5  0x00007f49fda4eead in clone () from /usr/lib64/libc.so.6
    [0] 
    [0] Thread 2 (Thread 0x7f49fba78700 (LWP 32759)):
    [0] #0  0x00007f49fda4420d in poll () from /usr/lib64/libc.so.6
    [0] #1  0x00007f49fcd3f09b in poll_dispatch (base=0xd41440, tv=<optimized out>) at ../../../../../../openmpi-4.0.0/opal/mca/event/libevent2022/libevent/poll.c:165
    [0] #2  0x00007f49fcd36f21 in opal_libevent2022_event_base_loop (base=0xd41440, flags=1) at ../../../../../../openmpi-4.0.0/opal/mca/event/libevent2022/libevent/event.c:1630
    [0] #3  0x00007f49fccf202e in progress_engine () from /usr/local/pkg/openmpi-4.0.0/clang-8.0.1/lib/libopen-pal.so.40
    [0] #4  0x00007f49ffea6dd5 in start_thread () from /usr/lib64/libpthread.so.0
    [0] #5  0x00007f49fda4eead in clone () from /usr/lib64/libc.so.6
    [0] 
    [0] Thread 1 (Thread 0x7f4a004d1b00 (LWP 32755)):
    [0] #0  0x00007f49fda15a89 in waitpid () from /usr/lib64/libc.so.6
    [0] #1  0x00007f49fd993de2 in do_system () from /usr/lib64/libc.so.6
    [0] #2  0x00007f49fd994191 in system () from /usr/lib64/libc.so.6
    [0] #3  0x00000000004c9e45 in gasneti_system_redirected (cmd=0x9322f0 <gasneti_bt_gdb.cmd> "/usr/local/pkg/gdb/newest/bin/gdb -nx -batch -x /tmp/gasnet_HlOfmr '/home/data2/upcnightly/dirac/EX-dirac-ibv-clang_CX5/work/dbg/gasnet/tests/upcr-harness/external-upcxx/./copy-par' 32755", stdout_fd=43) at /scratch/upcnightly/EX-dirac-ibv-clang_CX5/runtime/src/gasnet/gasnet_tools.c:1275
    [0] #4  0x00000000004c95b9 in gasneti_bt_gdb (fd=43) at /scratch/upcnightly/EX-dirac-ibv-clang_CX5/runtime/src/gasnet/gasnet_tools.c:1531
    [0] #5  0x00000000004c0b3a in gasneti_print_backtrace (fd=2) at /scratch/upcnightly/EX-dirac-ibv-clang_CX5/runtime/src/gasnet/gasnet_tools.c:1806
    [0] #6  0x00000000004c1721 in _gasneti_print_backtrace_ifenabled (fd=2) at /scratch/upcnightly/EX-dirac-ibv-clang_CX5/runtime/src/gasnet/gasnet_tools.c:1938
    [0] #7  0x00000000004bf449 in gasneti_error_abort () at /scratch/upcnightly/EX-dirac-ibv-clang_CX5/runtime/src/gasnet/gasnet_tools.c:743
    [0] #8  0x00000000004be673 in gasneti_fatalerror (msg=0x6a0839 "\n%s") at /scratch/upcnightly/EX-dirac-ibv-clang_CX5/runtime/src/gasnet/gasnet_tools.c:772
    [0] #9  0x0000000000431a01 in upcxx::assert_failed (file=0x6a083d "/home/data2/upcnightly/dirac/EX-dirac-ibv-clang_CX5/work/dbg/upcxx/test/copy.cpp", line=191, msg=0x13dee50 "Expected 170 got 0") at src/diagnostic.cpp:42
    [0] #10 0x0000000000433218 in main () at test/copy.cpp:191
    
  4. Paul Hargrove reporter

    I propose to "punt" this issue to the 2020.9.0 release, since between now and then the Memory Kinds support in the UPC++ runtime needs to be rewritten to leverage GASNet-EX anyway.

  5. Dan Bonachea

    I've now seen this intermittent failure on dirac-CX5 with manual builds for smp/seq/{opt,debug} and 8 ranks, using the nightly cuda-enabled build of develop @ ef9843a with gcc 8.4.0:

    *** FATAL ERROR (proc 0): 
    //////////////////////////////////////////////////////////////////////
    UPC++ assertion failure:
     on process 0 (pcp-d-10)
     at copy.cpp:191
     in function: int main()
    
    Expected 174 got 0
    
    To have UPC++ freeze during these errors so you can attach a debugger,
    rerun the program with GASNET_FREEZE_ON_ERROR=1 in the environment.
    //////////////////////////////////////////////////////////////////////
    

    although seemingly at a lower frequency than with ibv-conduit, and the failures seem more likely with opt than debug.

    This evidence suggests more strongly that the bug is in the test (or possibly upcxx::copy), rather than something ibv-specific.

  6. Dan Bonachea

    Nightly CI reports this failure mode also persists in the 2020.11/memory_kinds (@ 99c7dc9 ) using the reference cuda_device and ibv-conduit with gcc 6.4.0.

    *** FATAL ERROR (proc 0): 
    //////////////////////////////////////////////////////////////////////
    UPC++ assertion failure:
     on process 0 (pcp-d-9)
     at /home/data2/upcnightly/dirac/EX-dirac-ibv-gcc_old/work/opt/upcxx/test/copy.cpp:191
     in function: int main()
    
    Expected 172 got 0
    
    To have UPC++ freeze during these errors so you can attach a debugger,
    rerun the program with GASNET_FREEZE_ON_ERROR=1 in the environment.
    //////////////////////////////////////////////////////////////////////
    
  7. Dan Bonachea

    Proposed fix in PR 336, which adds a missing device synchronization after data initialization.

    The test doesn't validate results until the very end after all the copy shuffles, so my theory is that a race during initialization resulted in propagation of garbage values through the entire test. The race relies on the use of cuda_device memory, which I believe is consistent with all reported occurrences that all used a CUDA-enabled build of UPC++ with a real GPU (the test also functions without cuda support using only host memory). The details of when the race would actually become visible are likely to depend on all of the driver, GPU/CPU hardware and timing.

    With the fix in place I've now run over 100 successful trials with dirac-CX5 with various combinations of smp, ibv-{1,2}ppn, {opt,debug} and seen no validation failures.

  8. Dan Bonachea

    test/copy: issue #241: Add missing CUDA device synchronization

    The test initializes device memory using cuMemcpyHtoD(), but fails to synchronize the transfer from pageable host memory to device using the default stream. Although this call is "synchronous", as described in: https://docs.nvidia.com/cuda/cuda-driver-api/api-sync-behavior.html this means the device memory might not be fully updated before the copy call returns. That device memory is subsequently accessed by copy() which accesses the device memory directly (via GDR) and/or via reference kinds using cuMemcpyDtoHAsync() on an independent (unordered) stream.

    This missing synchronization means there was a race on device memory that might explain the intermittent failures reported in issue #241.

    Fixes issue #241 (I hope).

    → <<cset 812b13d5dac7>>

  9. Dan Bonachea

    Completed over 70k successful runs overnight on dirac with seq-opt-smp and ibv-par-opt and zero failures.

    I'm declaring this issue closed, resolved by fixing a CUDA race in the test code leading to intermittent validation failures.

    In the event of any new validation failures for test/copy.cpp, please open a new issue.

  10. Log in to comment