Deadlock

Issue #228 resolved
Qinglei Cao created an issue

Command: mpirun -np 2 -npernode 1 ./testing_stencil_1D (tests/stencil)

43e61ed: OK

9448f00: segment fault

55d1354: segment fault

a3f8fad: deadlock

Comments (16)

  1. George Bosilca

    I an not able to reproduce the successful run with 43e61ed. Executing

    mpirun -np 2 ./testing_stencil_1D -M 1000 -t 100 -T 100 -N 1000
    

    consistently leads to a segfault in data_lookup_of_stencil_1D_task, even with 43e61ed1c7

  2. George Bosilca

    As an additional data point, 43e61ed has 4 ctest failures

     16 - unit_dtd_pingpong_mpi (Timeout)
     30 - unit_haar_tree_shm (SEGFAULT)
     31 - unit_haar_tree_mpi (Failed)
     33 - unit_merge_sort_mpi (Failed)
    

  3. Qinglei Cao reporter

    For 43e61ed, if initialized MPI_THREAD_SERIALIZED instead of MPI_THREAD_MULTIPLE, it’s OK with George’s command: mpirun -np 2 ./testing_stencil_1D -M 1000 -t 100 -T 100 -N 1000. I rolled back the commits and found it seems PR #249 caused the problem. But for a3f8fad (first commit causing deadlock as described before), there is deadlock even if initialized MPI_THREAD_SERIALIZED.

  4. Aurelien Bouteiller

    srun -N2 tests/testing_dpotrf -N 200 -t 100 -v -x -- -mca comm_verbose 100

    This results in the sender side issuing the correct ACTIVATE messages, however the receiver side shuts down the MPI engine, as-if the number of tasks was counted wrong (it will then deadlock). Notable: we go through the path in which the taskpool is not yet known at the receiver (i.e. we receive the message before the taskpool register).

    d@00000  TOPO   read_B(0, 0)[0, 0]<0>{6}        root=0  0 (d0) -? 1 (dna) @parsec_remote_dep_activate:501
    d@00000  TOPO   read_B(1, 0)[1, 0]<0>{6}        root=0  0 (d0) -? 1 (dna) @parsec_remote_dep_activate:501
    d@00000 [0:0] task read_B(0, 0)[0, 0]<0>{6} my_idx 0 idx 1 rank 1 -- send (1) @parsec_remote_dep_activate:513
    d@00000 [0:0] task read_B(1, 0)[1, 0]<0>{6} my_idx 0 idx 1 rank 1 -- send (1) @parsec_remote_dep_activate:513
    d@00000  TOPO   read_B(0, 0)[0, 0]<0>{6} flow B root=0  0 (d0) -> 1 (d1) @parsec_remote_dep_activate:521
    d@00001 remote_deps_allocate: 0x7fffd80126b0 @remote_deps_allocate:167
    d@00001 MPI:    FROM    0       Activate NoTPool        UNKNOWN_of_TASKPOOL_6   k=2     with datakey a0beb0       params 1 @remote_dep_mpi_save_activate_c:2109
    d@00000  TOPO   read_B(1, 0)[1, 0]<0>{6} flow B root=0  0 (d0) -> 1 (d1) @parsec_remote_dep_activate:521
    d@00000 fw test REMOTE rank 1 (value=0) @remote_dep_is_forwarded:77
    d@00000 fw test REMOTE rank 1 (value=0) @remote_dep_is_forwarded:77
    d@00001 remote_deps_allocate: 0x7fffd8013350 @remote_deps_allocate:167
    d@00001 MPI:    FROM    0       Activate NoTPool        UNKNOWN_of_TASKPOOL_6   k=3     with datakey 7fffb4001410 params 1 @remote_dep_mpi_save_activate_c:2109
    d@00000 DATA    read_B(0, 0)[0, 0]<-1>{6}       param 0 deps 0xa0beb0 send on demand (increase deps counter by 1 [2]) @remote_dep_mpi_pack_dep:1589
    d@00000 fw mark REMOTE rank 1 @remote_dep_mark_forwarded:59
    D@00000 MPI:    TO      1       Activate        read_B(0, 0)[0, 0]<-1>{6}
                            with datakey a0beb0     mask 1  (tag=5) short mask 0 length 0 @remote_dep_mpi_pack_dep:1600
    d@00000 Complete 1 (1 left) outputs of dep 0x7fffb4001410 @remote_dep_complete_and_cleanu:107
    d@00000 fw mark REMOTE rank 1 @remote_dep_mark_forwarded:59
    d@00000 PUT_EAGER no data for item 0x9fb2c0, freeing @remote_dep_mpi_put_eager:1788
    d@00000 Complete 1 (2 left) outputs of dep 0xa0beb0 @remote_dep_complete_and_cleanu:107
    d@00000 Complete 1 (1 left) outputs of dep 0xa0beb0 @remote_dep_complete_and_cleanu:107
    d@00000 DATA    read_B(1, 0)[1, 0]<-1>{6}       param 0 deps 0x7fffb4001410 send on demand (increase deps counter by 1 [1]) @remote_dep_mpi_pack_dep:1589
    D@00000 MPI:    TO      1       Activate        read_B(1, 0)[1, 0]<-1>{6}
                            with datakey 7fffb4001410       mask 1  (tag=6) short mask 0 length 0 @remote_dep_mpi_pack_dep:1600
    d@00000 PUT_EAGER no data for item 0x7fffb40020b0, freeing @remote_dep_mpi_put_eager:1788
    d@00000 Complete 1 (1 left) outputs of dep 0x7fffb4001410 @remote_dep_complete_and_cleanu:107
    d@00001 MPI: comm engine signalled OFF on process 1/2 @remote_dep_dequeue_off:481
    d@00001 rank 1 DISABLE MPI communication engine @remote_dep_dequeue_nothread_pr:1034
    d@00001 MPI: comm engine OFF on process 1/2 @remote_dep_dequeue_main:525
    d@00001 MPI: comm engine signalled ON on process 1/2 @remote_dep_dequeue_on:449
    d@00001 MPI: comm engine ON on process 1/2 @remote_dep_dequeue_main:517
    

  5. Qinglei Cao reporter

    Maybe two commits causing deadlock:

    1. For 43e61ed, tested with George’s command: mpirun -np 2 ./testing_stencil_1D -M 1000 -t 100 -T 100 -N 1000: if initialized MPI_THREAD_MULTIPLE, there is a deadlock; while if MPI_THREAD_SERIALIZED, it’s OK. Rolled back commits and found it seems PR #249 caused the problem.
    2. For a3f8fad (the first commit causing deadlock described above), there is a deadlock even if initialized MPI_THREAD_SERIALIZED.

  6. Aurelien Bouteiller

    can reproduce looking into it

    Looks like we UNREF the individual data too early in the MT case

  7. Qinglei Cao reporter

    The second deadlock is solved by PR #344, but it seems the first deadlock caused by PR #249 is still there: applying PR #346 to the lastest PaRSEC, deadlock if initialized MPI_THREAD_MULTIPLE (mpirun -np 2 ./testing_stencil_1D -M 1000 -t 100 -T 100 -N 1000 -m 1); while it’s OK if MPI_THREAD_SERIALIZED (-m 0).

  8. Aurelien Bouteiller

    I also have a bug when using commit #21175431 with the following command line srun -n 2 ./parsec/tests/stencil/testing_stencil_1D -M 1000 -t 100 -T 100 -N 1000 -m 0 -c 1 -- -mca comm_verbose 100 -mca debug_verbose 100 -mca runtime_comm_thread_multiple 0 -mca runtime_comm_eager_limit 0 -mca runtime_comm_short_limit 0 -mca runtime_comm_aggregate 0

    Note how multithreaded is disabled here.

    d@00001 Complete 1 (2 left) outputs of dep 0x7fffe4010160 @remote_dep_complete_and_cleanu:107
    d@00001 Release data copy 0x968340 at /home/bouteill/parsec/dplasma.debug/parsec/tests/stencil/stencil_1D.c:1368 @release_deps_of_stencil_1D_tas:1368
    d@00001 Found item 0x7fffe4019960/task(0, 6) into hash table 0x951b60 in bucket 18 @parsec_hash_table_nolock_find:458
    d@00001 entry 0x7fffe4019960/task(0, 6) of HT task_repo has 1/2 usage count and is not retained: not freeing it at /home/bouteill/parsec/dplasma.debug/parsec/tests/stencil/stencil_1D.c:1385 @__data_repo_entry_used_once:117
    d@00001 Release data copy 0x7fffe4010e00 at /home/bouteill/parsec/dplasma.debug/parsec/tests/stencil/stencil_1D.c:1389 @release_deps_of_stencil_1D_tas:1389
    d@00001 Destruct data copy 0x7fffe4010e00 (attached to 0x7fffe4013cb0) @parsec_data_copy_destruct:38
    d@00001 Release data 0x7fffe4013cb0 @parsec_data_destruct:74
    d@00001 Arena:  push a data of size 800 from arena 0x8b9d10, aligned by 16, base ptr 0x7fffe40158a0, data ptr 0x7fffe4015900, sizeof prefix 96(96) @parsec_arena_release_chunk:160
    d@00001 Found item 0x7fffe40181a0/task(0, 8) into hash table 0x951b60 in bucket 14 @parsec_hash_table_nolock_find:458
    d@00001 entry 0x7fffe40181a0/task(0, 8) of hash table task_repo has a usage count of 2/2 and is not retained: freeing it at /home/bouteill/parsec/dplasma.debug/parsec/tests/stencil/stencil_1D.c:1392 @__data_repo_entry_used_once:110
    d@00001 Removed item 0x7fffe40181a0/task(0, 8) from hash table 0x951b60 in bucket 14 @parsec_hash_table_nolock_remov:498
    d@00001 Release data copy 0x7fffe4013c20 at /home/bouteill/parsec/dplasma.debug/parsec/tests/stencil/stencil_1D.c:1396 @release_deps_of_stencil_1D_tas:1396
    testing_stencil_1D: /home/bouteill/parsec/dplasma.debug/parsec/tests/stencil/stencil_1D.c:1396: release_deps_of_stencil_1D_task: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((parsec_object_t *) ((this_task->data._f_AR.data_in)))->obj_magic_id' failed.
    [c00:21999] *** Process received signal ***
    [c00:21999] Signal: Aborted (6)
    [c00:21999] Signal code:  (-6)
    

  9. Qinglei Cao reporter

    I think this is another bug. If run without “runtime_comm_short_limit“ , it’s OK. And the deadlock is still there.

  10. Aurelien Bouteiller

    THis defect is still present in current master. The ASAN report may help pinpoint with more details what’s going on.

     salloc -N2 mpiexec --tag-output -np 2 -npernode 1 tests/stencil/testing_stencil_1D -M 1000 -t 100 -T 100 -N 1000 -m 0 2>&1 | grep 1,1
    [1,1]<stderr>:=================================================================
    [1,1]<stderr>:==16945==ERROR: AddressSanitizer: heap-use-after-free on address 0x60d0000201c8 at pc 0x000000411bfa bp 0x7fffe0dff2b0 sp 0x7fffe0dff2a8
    [1,1]<stderr>:==16945==AddressSanitizer: while reporting a bug found another one. Ignoring.
    [1,1]<stderr>:READ of size 8 at 0x60d0000201c8 thread T5
    [1,1]<stderr>:==16945==AddressSanitizer: while reporting a bug found another one. Ignoring.
    [1,1]<stderr>:==16945==AddressSanitizer: while reporting a bug found another one. Ignoring.
    [1,1]<stderr>:==16945==AddressSanitizer: while reporting a bug found another one. Ignoring.
    [1,1]<stderr>:==16945==AddressSanitizer: while reporting a bug found another one. Ignoring.
    [1,1]<stderr>:==16945==AddressSanitizer: while reporting a bug found another one. Ignoring.
    [1,1]<stderr>:    #0 0x411bf9 in release_deps_of_stencil_1D_task tests/stencil/stencil_1D.c:1628
    [1,1]<stderr>:    #1 0x416707 in complete_hook_of_stencil_1D_task tests/stencil/stencil_1D.c:2076
    [1,1]<stderr>:    #2 0x7ffff6989cd1 in __parsec_complete_execution /home/bouteill/parsec/master/parsec/parsec/scheduling.c:391
    [1,1]<stderr>:    #3 0x7ffff6989fdb in __parsec_task_progress /home/bouteill/parsec/master/parsec/parsec/scheduling.c:442
    [1,1]<stderr>:    #4 0x7ffff698ac53 in __parsec_context_wait /home/bouteill/parsec/master/parsec/parsec/scheduling.c:565
    [1,1]<stderr>:    #5 0x7ffff694010d in __parsec_thread_init /home/bouteill/parsec/master/parsec/parsec/parsec.c:287
    [1,1]<stderr>:    #6 0x7ffff51b6dd4 in start_thread (/usr/lib64/libpthread.so.0+0x7dd4)
    [1,1]<stderr>:    #7 0x7ffff4edfeac in __clone (/usr/lib64/libc.so.6+0xfdeac)
    [1,1]<stderr>:
    [1,1]<stderr>:0x60d0000201c8 is located 8 bytes inside of 136-byte region [0x60d0000201c0,0x60d000020248)
    [1,1]<stderr>:freed by thread T9 here:
    [1,1]<stderr>:    #0 0x7ffff6f01258 in __interceptor_free /spack/var/spack/stage/gcc-7.3.0-ox22fqdausgn5up7wb5d3mli5ngcbgri/gcc-7.3.0/libsanitizer/asan/asan_malloc_linux.cc:45
    [1,1]<stderr>:    #1 0x4118af in release_deps_of_stencil_1D_task tests/stencil/stencil_1D.c:1622
    [1,1]<stderr>:    #2 0x416707 in complete_hook_of_stencil_1D_task tests/stencil/stencil_1D.c:2076
    [1,1]<stderr>:    #3 0x7ffff6989cd1 in __parsec_complete_execution /home/bouteill/parsec/master/parsec/parsec/scheduling.c:391
    [1,1]<stderr>:    #4 0x7ffff6989fdb in __parsec_task_progress /home/bouteill/parsec/master/parsec/parsec/scheduling.c:442
    [1,1]<stderr>:    #5 0x7ffff698ac53 in __parsec_context_wait /home/bouteill/parsec/master/parsec/parsec/scheduling.c:565
    [1,1]<stderr>:    #6 0x7ffff694010d in __parsec_thread_init /home/bouteill/parsec/master/parsec/parsec/parsec.c:287
    [1,1]<stderr>:    #7 0x7ffff51b6dd4 in start_thread (/usr/lib64/libpthread.so.0+0x7dd4)
    [1,1]<stderr>:
    [1,1]<stderr>:previously allocated by thread T4 here:
    [1,1]<stderr>:    #0 0x7ffff6f015b0 in __interceptor_malloc /spack/var/spack/stage/gcc-7.3.0-ox22fqdausgn5up7wb5d3mli5ngcbgri/gcc-7.3.0/libsanitizer/asan/asan_malloc_linux.cc:62
    [1,1]<stderr>:    #1 0x7ffff6952cb3 in parsec_obj_new /home/bouteill/parsec/master/parsec/parsec/class/parsec_object.h:466
    [1,1]<stderr>:    #2 0x7ffff6952931 in parsec_obj_new_debug /home/bouteill/parsec/master/parsec/parsec/class/parsec_object.h:249
    [1,1]<stderr>:    #3 0x7ffff6955db7 in parsec_data_copy_new /home/bouteill/parsec/master/parsec/parsec/data.c:209
    [1,1]<stderr>:    #4 0x7ffff693c179 in parsec_arena_get_copy /home/bouteill/parsec/master/parsec/parsec/arena.c:229
    [1,1]<stderr>:    #5 0x7ffff697015c in remote_dep_copy_allocate /home/bouteill/parsec/master/parsec/parsec/remote_dep_mpi.c:650
    [1,1]<stderr>:    #6 0x7ffff6981774 in remote_dep_mpi_get_start /home/bouteill/parsec/master/parsec/parsec/remote_dep_mpi.c:2265
    [1,1]<stderr>:    #7 0x7ffff697f78a in remote_dep_mpi_recv_activate /home/bouteill/parsec/master/parsec/parsec/remote_dep_mpi.c:2097
    [1,1]<stderr>:    #8 0x7ffff6980437 in remote_dep_mpi_save_activate_cb /home/bouteill/parsec/master/parsec/parsec/remote_dep_mpi.c:2150
    [1,1]<stderr>:    #9 0x7ffff697a940 in remote_dep_mpi_progress /home/bouteill/parsec/master/parsec/parsec/remote_dep_mpi.c:1734
    [1,1]<stderr>:    #10 0x7ffff69747bc in remote_dep_dequeue_nothread_progress /home/bouteill/parsec/master/parsec/parsec/remote_dep_mpi.c:1036
    [1,1]<stderr>:    #11 0x7ffff696e762 in remote_dep_dequeue_main /home/bouteill/parsec/master/parsec/parsec/remote_dep_mpi.c:535
    [1,1]<stderr>:    #12 0x7ffff51b6dd4 in start_thread (/usr/lib64/libpthread.so.0+0x7dd4)
    [1,1]<stderr>:
    [1,1]<stderr>:Thread T5 created by T0 here:
    [1,1]<stderr>:    #0 0x7ffff6e5ed20 in __interceptor_pthread_create /spack/var/spack/stage/gcc-7.3.0-ox22fqdausgn5up7wb5d3mli5ngcbgri/gcc-7.3.0/libsanitizer/asan/asan_interceptors.cc:243
    [1,1]<stderr>:    #1 0x7ffff6943589 in parsec_init /home/bouteill/parsec/master/parsec/parsec/parsec.c:786
    [1,1]<stderr>:    #2 0x405093 in main /home/bouteill/parsec/master/parsec/tests/stencil/testing_stencil_1D.c:95
    [1,1]<stderr>:    #3 0x7ffff4e043d4 in __libc_start_main (/usr/lib64/libc.so.6+0x223d4)
    [1,1]<stderr>:
    [1,1]<stderr>:Thread T9 created by T0 here:
    [1,1]<stderr>:    #0 0x7ffff6e5ed20 in __interceptor_pthread_create /spack/var/spack/stage/gcc-7.3.0-ox22fqdausgn5up7wb5d3mli5ngcbgri/gcc-7.3.0/libsanitizer/asan/asan_interceptors.cc:243
    [1,1]<stderr>:    #1 0x7ffff6943589 in parsec_init /home/bouteill/parsec/master/parsec/parsec/parsec.c:786
    [1,1]<stderr>:    #2 0x405093 in main /home/bouteill/parsec/master/parsec/tests/stencil/testing_stencil_1D.c:95
    [1,1]<stderr>:    #3 0x7ffff4e043d4 in __libc_start_main (/usr/lib64/libc.so.6+0x223d4)
    [1,1]<stderr>:
    [1,1]<stderr>:Thread T4 created by T0 here:
    [1,1]<stderr>:    #0 0x7ffff6e5ed20 in __interceptor_pthread_create /spack/var/spack/stage/gcc-7.3.0-ox22fqdausgn5up7wb5d3mli5ngcbgri/gcc-7.3.0/libsanitizer/asan/asan_interceptors.cc:243
    [1,1]<stderr>:    #1 0x7ffff696d5e5 in remote_dep_dequeue_init /home/bouteill/parsec/master/parsec/parsec/remote_dep_mpi.c:390
    [1,1]<stderr>:    #2 0x7ffff69832fa in parsec_remote_dep_init /home/bouteill/parsec/master/parsec/parsec/remote_dep.c:252
    [1,1]<stderr>:    #3 0x7ffff6942a8f in parsec_init /home/bouteill/parsec/master/parsec/parsec/parsec.c:714
    [1,1]<stderr>:    #4 0x405093 in main /home/bouteill/parsec/master/parsec/tests/stencil/testing_stencil_1D.c:95
    [1,1]<stderr>:    #5 0x7ffff4e043d4 in __libc_start_main (/usr/lib64/libc.so.6+0x223d4)
    [1,1]<stderr>:
    [1,1]<stderr>:SUMMARY: AddressSanitizer: heap-use-after-free tests/stencil/stencil_1D.c:1628 in release_deps_of_stencil_1D_task
    [1,1]<stderr>:Shadow bytes around the buggy address:
    [1,1]<stderr>:  0x0c1a7fffbfe0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
    [1,1]<stderr>:  0x0c1a7fffbff0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
    [1,1]<stderr>:  0x0c1a7fffc000: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 00
    [1,1]<stderr>:  0x0c1a7fffc010: 00 00 00 00 00 00 fa fa fa fa fa fa fa fa 00 00
    [1,1]<stderr>:  0x0c1a7fffc020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 fa
    [1,1]<stderr>:=>0x0c1a7fffc030: fa fa fa fa fa fa fa fa fd[fd]fd fd fd fd fd fd
    [1,1]<stderr>:  0x0c1a7fffc040: fd fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa
    [1,1]<stderr>:  0x0c1a7fffc050: fa fa fd fd fd fd fd fd fd fd fd fd fd fd fd fd
    [1,1]<stderr>:  0x0c1a7fffc060: fd fd fd fa fa fa fa fa fa fa fa fa fd fd fd fd
    [1,1]<stderr>:  0x0c1a7fffc070: fd fd fd fd fd fd fd fd fd fd fd fd fd fa fa fa
    [1,1]<stderr>:  0x0c1a7fffc080: fa fa fa fa fa fa fd fd fd fd fd fd fd fd fd fd
    [1,1]<stderr>:Shadow byte legend (one shadow byte represents 8 application bytes):
    [1,1]<stderr>:  Addressable:           00
    [1,1]<stderr>:  Partially addressable: 01 02 03 04 05 06 07
    [1,1]<stderr>:  Heap left redzone:       fa
    [1,1]<stderr>:  Freed heap region:       fd
    [1,1]<stderr>:  Stack left redzone:      f1
    [1,1]<stderr>:  Stack mid redzone:       f2
    [1,1]<stderr>:  Stack right redzone:     f3
    [1,1]<stderr>:  Stack after return:      f5
    [1,1]<stderr>:  Stack use after scope:   f8
    [1,1]<stderr>:  Global redzone:          f9
    [1,1]<stderr>:  Global init order:       f6
    [1,1]<stderr>:  Poisoned by user:        f7
    [1,1]<stderr>:  Container overflow:      fc
    [1,1]<stderr>:  Array cookie:            ac
    [1,1]<stderr>:  Intra object redzone:    bb
    [1,1]<stderr>:  ASan internal:           fe
    [1,1]<stderr>:  Left alloca redzone:     ca
    [1,1]<stderr>:  Right alloca redzone:    cb
    [1,1]<stderr>:==16945==ABORTING
    

  11. Log in to comment