Deadlock
-
-
reporter No, on Saturn with OpenMPI 4.0.0.
-
I am not able to reproduce the successful run with 43e61ed. Executing
mpirun -np 2 ./testing_stencil_1D -M 1000 -t 100 -T 100 -N 1000
consistently leads to a segfault in data_lookup_of_stencil_1D_task, even with 43e61ed1c7.
-
As an additional data point, 43e61ed has 4 ctest failures:
16 - unit_dtd_pingpong_mpi (Timeout)
30 - unit_haar_tree_shm (SEGFAULT)
31 - unit_haar_tree_mpi (Failed)
33 - unit_merge_sort_mpi (Failed)
-
reporter For 43e61ed, if MPI is initialized with MPI_THREAD_SERIALIZED instead of MPI_THREAD_MULTIPLE, it's OK with George's command:
mpirun -np 2 ./testing_stencil_1D -M 1000 -t 100 -T 100 -N 1000
I rolled back the commits and it seems PR #249 caused the problem. But for a3f8fad (the first commit causing the deadlock described above), there is a deadlock even if MPI is initialized with MPI_THREAD_SERIALIZED.
-
srun -N2 tests/testing_dpotrf -N 200 -t 100 -v -x -- -mca comm_verbose 100
This results in the sender side issuing the correct ACTIVATE messages; however, the receiver side shuts down the MPI engine, as if the number of tasks was counted wrong (it then deadlocks). Notable: we go through the path in which the taskpool is not yet known at the receiver (i.e., we receive the message before the taskpool registers).
d@00000 TOPO read_B(0, 0)[0, 0]<0>{6} root=0 0 (d0) -? 1 (dna) @parsec_remote_dep_activate:501
d@00000 TOPO read_B(1, 0)[1, 0]<0>{6} root=0 0 (d0) -? 1 (dna) @parsec_remote_dep_activate:501
d@00000 [0:0] task read_B(0, 0)[0, 0]<0>{6} my_idx 0 idx 1 rank 1 -- send (1) @parsec_remote_dep_activate:513
d@00000 [0:0] task read_B(1, 0)[1, 0]<0>{6} my_idx 0 idx 1 rank 1 -- send (1) @parsec_remote_dep_activate:513
d@00000 TOPO read_B(0, 0)[0, 0]<0>{6} flow B root=0 0 (d0) -> 1 (d1) @parsec_remote_dep_activate:521
d@00001 remote_deps_allocate: 0x7fffd80126b0 @remote_deps_allocate:167
d@00001 MPI: FROM 0 Activate NoTPool UNKNOWN_of_TASKPOOL_6 k=2 with datakey a0beb0 params 1 @remote_dep_mpi_save_activate_c:2109
d@00000 TOPO read_B(1, 0)[1, 0]<0>{6} flow B root=0 0 (d0) -> 1 (d1) @parsec_remote_dep_activate:521
d@00000 fw test REMOTE rank 1 (value=0) @remote_dep_is_forwarded:77
d@00000 fw test REMOTE rank 1 (value=0) @remote_dep_is_forwarded:77
d@00001 remote_deps_allocate: 0x7fffd8013350 @remote_deps_allocate:167
d@00001 MPI: FROM 0 Activate NoTPool UNKNOWN_of_TASKPOOL_6 k=3 with datakey 7fffb4001410 params 1 @remote_dep_mpi_save_activate_c:2109
d@00000 DATA read_B(0, 0)[0, 0]<-1>{6} param 0 deps 0xa0beb0 send on demand (increase deps counter by 1 [2]) @remote_dep_mpi_pack_dep:1589
d@00000 fw mark REMOTE rank 1 @remote_dep_mark_forwarded:59
D@00000 MPI: TO 1 Activate read_B(0, 0)[0, 0]<-1>{6} with datakey a0beb0 mask 1 (tag=5) short mask 0 length 0 @remote_dep_mpi_pack_dep:1600
d@00000 Complete 1 (1 left) outputs of dep 0x7fffb4001410 @remote_dep_complete_and_cleanu:107
d@00000 fw mark REMOTE rank 1 @remote_dep_mark_forwarded:59
d@00000 PUT_EAGER no data for item 0x9fb2c0, freeing @remote_dep_mpi_put_eager:1788
d@00000 Complete 1 (2 left) outputs of dep 0xa0beb0 @remote_dep_complete_and_cleanu:107
d@00000 Complete 1 (1 left) outputs of dep 0xa0beb0 @remote_dep_complete_and_cleanu:107
d@00000 DATA read_B(1, 0)[1, 0]<-1>{6} param 0 deps 0x7fffb4001410 send on demand (increase deps counter by 1 [1]) @remote_dep_mpi_pack_dep:1589
D@00000 MPI: TO 1 Activate read_B(1, 0)[1, 0]<-1>{6} with datakey 7fffb4001410 mask 1 (tag=6) short mask 0 length 0 @remote_dep_mpi_pack_dep:1600
d@00000 PUT_EAGER no data for item 0x7fffb40020b0, freeing @remote_dep_mpi_put_eager:1788
d@00000 Complete 1 (1 left) outputs of dep 0x7fffb4001410 @remote_dep_complete_and_cleanu:107
d@00001 MPI: comm engine signalled OFF on process 1/2 @remote_dep_dequeue_off:481
d@00001 rank 1 DISABLE MPI communication engine @remote_dep_dequeue_nothread_pr:1034
d@00001 MPI: comm engine OFF on process 1/2 @remote_dep_dequeue_main:525
d@00001 MPI: comm engine signalled ON on process 1/2 @remote_dep_dequeue_on:449
d@00001 MPI: comm engine ON on process 1/2 @remote_dep_dequeue_main:517
-
-
reporter Maybe two commits are causing the deadlock:
- For 43e61ed, tested with George's command mpirun -np 2 ./testing_stencil_1D -M 1000 -t 100 -T 100 -N 1000: if MPI is initialized with MPI_THREAD_MULTIPLE, there is a deadlock, while with MPI_THREAD_SERIALIZED it's OK. Rolled back commits; it seems PR #249 caused the problem.
- For a3f8fad (the first commit causing the deadlock described above), there is a deadlock even if MPI is initialized with MPI_THREAD_SERIALIZED.
-
Can reproduce, looking into it.
Looks like we UNREF the individual data too early in the MT case
-
See PR #346
-
reporter The second deadlock is solved by PR #344, but it seems the first deadlock, caused by PR #249, is still there: applying PR #346 to the latest PaRSEC, it deadlocks if MPI is initialized with MPI_THREAD_MULTIPLE (mpirun -np 2 ./testing_stencil_1D -M 1000 -t 100 -T 100 -N 1000 -m 1), while it's OK with MPI_THREAD_SERIALIZED (-m 0).
-
I also have a bug when using commit #21175431 with the following command line:
srun -n 2 ./parsec/tests/stencil/testing_stencil_1D -M 1000 -t 100 -T 100 -N 1000 -m 0 -c 1 -- -mca comm_verbose 100 -mca debug_verbose 100 -mca runtime_comm_thread_multiple 0 -mca runtime_comm_eager_limit 0 -mca runtime_comm_short_limit 0 -mca runtime_comm_aggregate 0
Note how multithreading is disabled here.
d@00001 Complete 1 (2 left) outputs of dep 0x7fffe4010160 @remote_dep_complete_and_cleanu:107
d@00001 Release data copy 0x968340 at /home/bouteill/parsec/dplasma.debug/parsec/tests/stencil/stencil_1D.c:1368 @release_deps_of_stencil_1D_tas:1368
d@00001 Found item 0x7fffe4019960/task(0, 6) into hash table 0x951b60 in bucket 18 @parsec_hash_table_nolock_find:458
d@00001 entry 0x7fffe4019960/task(0, 6) of HT task_repo has 1/2 usage count and is not retained: not freeing it at /home/bouteill/parsec/dplasma.debug/parsec/tests/stencil/stencil_1D.c:1385 @__data_repo_entry_used_once:117
d@00001 Release data copy 0x7fffe4010e00 at /home/bouteill/parsec/dplasma.debug/parsec/tests/stencil/stencil_1D.c:1389 @release_deps_of_stencil_1D_tas:1389
d@00001 Destruct data copy 0x7fffe4010e00 (attached to 0x7fffe4013cb0) @parsec_data_copy_destruct:38
d@00001 Release data 0x7fffe4013cb0 @parsec_data_destruct:74
d@00001 Arena: push a data of size 800 from arena 0x8b9d10, aligned by 16, base ptr 0x7fffe40158a0, data ptr 0x7fffe4015900, sizeof prefix 96(96) @parsec_arena_release_chunk:160
d@00001 Found item 0x7fffe40181a0/task(0, 8) into hash table 0x951b60 in bucket 14 @parsec_hash_table_nolock_find:458
d@00001 entry 0x7fffe40181a0/task(0, 8) of hash table task_repo has a usage count of 2/2 and is not retained: freeing it at /home/bouteill/parsec/dplasma.debug/parsec/tests/stencil/stencil_1D.c:1392 @__data_repo_entry_used_once:110
d@00001 Removed item 0x7fffe40181a0/task(0, 8) from hash table 0x951b60 in bucket 14 @parsec_hash_table_nolock_remov:498
d@00001 Release data copy 0x7fffe4013c20 at /home/bouteill/parsec/dplasma.debug/parsec/tests/stencil/stencil_1D.c:1396 @release_deps_of_stencil_1D_tas:1396
testing_stencil_1D: /home/bouteill/parsec/dplasma.debug/parsec/tests/stencil/stencil_1D.c:1396: release_deps_of_stencil_1D_task: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((parsec_object_t *) ((this_task->data._f_AR.data_in)))->obj_magic_id' failed.
[c00:21999] *** Process received signal ***
[c00:21999] Signal: Aborted (6)
[c00:21999] Signal code: (-6)
-
reporter I think this is another bug. If run without runtime_comm_short_limit, it's OK. And the deadlock is still there.
-
This defect is still present in current master. The ASAN report may help pinpoint in more detail what's going on.
salloc -N2 mpiexec --tag-output -np 2 -npernode 1 tests/stencil/testing_stencil_1D -M 1000 -t 100 -T 100 -N 1000 -m 0 2>&1 | grep 1,1
[1,1]<stderr>:=================================================================
[1,1]<stderr>:==16945==ERROR: AddressSanitizer: heap-use-after-free on address 0x60d0000201c8 at pc 0x000000411bfa bp 0x7fffe0dff2b0 sp 0x7fffe0dff2a8
[1,1]<stderr>:==16945==AddressSanitizer: while reporting a bug found another one. Ignoring.
[1,1]<stderr>:READ of size 8 at 0x60d0000201c8 thread T5
[1,1]<stderr>:==16945==AddressSanitizer: while reporting a bug found another one. Ignoring.
[1,1]<stderr>:==16945==AddressSanitizer: while reporting a bug found another one. Ignoring.
[1,1]<stderr>:==16945==AddressSanitizer: while reporting a bug found another one. Ignoring.
[1,1]<stderr>:==16945==AddressSanitizer: while reporting a bug found another one. Ignoring.
[1,1]<stderr>:==16945==AddressSanitizer: while reporting a bug found another one. Ignoring.
[1,1]<stderr>:    #0 0x411bf9 in release_deps_of_stencil_1D_task tests/stencil/stencil_1D.c:1628
[1,1]<stderr>:    #1 0x416707 in complete_hook_of_stencil_1D_task tests/stencil/stencil_1D.c:2076
[1,1]<stderr>:    #2 0x7ffff6989cd1 in __parsec_complete_execution /home/bouteill/parsec/master/parsec/parsec/scheduling.c:391
[1,1]<stderr>:    #3 0x7ffff6989fdb in __parsec_task_progress /home/bouteill/parsec/master/parsec/parsec/scheduling.c:442
[1,1]<stderr>:    #4 0x7ffff698ac53 in __parsec_context_wait /home/bouteill/parsec/master/parsec/parsec/scheduling.c:565
[1,1]<stderr>:    #5 0x7ffff694010d in __parsec_thread_init /home/bouteill/parsec/master/parsec/parsec/parsec.c:287
[1,1]<stderr>:    #6 0x7ffff51b6dd4 in start_thread (/usr/lib64/libpthread.so.0+0x7dd4)
[1,1]<stderr>:    #7 0x7ffff4edfeac in __clone (/usr/lib64/libc.so.6+0xfdeac)
[1,1]<stderr>:
[1,1]<stderr>:0x60d0000201c8 is located 8 bytes inside of 136-byte region [0x60d0000201c0,0x60d000020248)
[1,1]<stderr>:freed by thread T9 here:
[1,1]<stderr>:    #0 0x7ffff6f01258 in __interceptor_free /spack/var/spack/stage/gcc-7.3.0-ox22fqdausgn5up7wb5d3mli5ngcbgri/gcc-7.3.0/libsanitizer/asan/asan_malloc_linux.cc:45
[1,1]<stderr>:    #1 0x4118af in release_deps_of_stencil_1D_task tests/stencil/stencil_1D.c:1622
[1,1]<stderr>:    #2 0x416707 in complete_hook_of_stencil_1D_task tests/stencil/stencil_1D.c:2076
[1,1]<stderr>:    #3 0x7ffff6989cd1 in __parsec_complete_execution /home/bouteill/parsec/master/parsec/parsec/scheduling.c:391
[1,1]<stderr>:    #4 0x7ffff6989fdb in __parsec_task_progress /home/bouteill/parsec/master/parsec/parsec/scheduling.c:442
[1,1]<stderr>:    #5 0x7ffff698ac53 in __parsec_context_wait /home/bouteill/parsec/master/parsec/parsec/scheduling.c:565
[1,1]<stderr>:    #6 0x7ffff694010d in __parsec_thread_init /home/bouteill/parsec/master/parsec/parsec/parsec.c:287
[1,1]<stderr>:    #7 0x7ffff51b6dd4 in start_thread (/usr/lib64/libpthread.so.0+0x7dd4)
[1,1]<stderr>:
[1,1]<stderr>:previously allocated by thread T4 here:
[1,1]<stderr>:    #0 0x7ffff6f015b0 in __interceptor_malloc /spack/var/spack/stage/gcc-7.3.0-ox22fqdausgn5up7wb5d3mli5ngcbgri/gcc-7.3.0/libsanitizer/asan/asan_malloc_linux.cc:62
[1,1]<stderr>:    #1 0x7ffff6952cb3 in parsec_obj_new /home/bouteill/parsec/master/parsec/parsec/class/parsec_object.h:466
[1,1]<stderr>:    #2 0x7ffff6952931 in parsec_obj_new_debug /home/bouteill/parsec/master/parsec/parsec/class/parsec_object.h:249
[1,1]<stderr>:    #3 0x7ffff6955db7 in parsec_data_copy_new /home/bouteill/parsec/master/parsec/parsec/data.c:209
[1,1]<stderr>:    #4 0x7ffff693c179 in parsec_arena_get_copy /home/bouteill/parsec/master/parsec/parsec/arena.c:229
[1,1]<stderr>:    #5 0x7ffff697015c in remote_dep_copy_allocate /home/bouteill/parsec/master/parsec/parsec/remote_dep_mpi.c:650
[1,1]<stderr>:    #6 0x7ffff6981774 in remote_dep_mpi_get_start /home/bouteill/parsec/master/parsec/parsec/remote_dep_mpi.c:2265
[1,1]<stderr>:    #7 0x7ffff697f78a in remote_dep_mpi_recv_activate /home/bouteill/parsec/master/parsec/parsec/remote_dep_mpi.c:2097
[1,1]<stderr>:    #8 0x7ffff6980437 in remote_dep_mpi_save_activate_cb /home/bouteill/parsec/master/parsec/parsec/remote_dep_mpi.c:2150
[1,1]<stderr>:    #9 0x7ffff697a940 in remote_dep_mpi_progress /home/bouteill/parsec/master/parsec/parsec/remote_dep_mpi.c:1734
[1,1]<stderr>:    #10 0x7ffff69747bc in remote_dep_dequeue_nothread_progress /home/bouteill/parsec/master/parsec/parsec/remote_dep_mpi.c:1036
[1,1]<stderr>:    #11 0x7ffff696e762 in remote_dep_dequeue_main /home/bouteill/parsec/master/parsec/parsec/remote_dep_mpi.c:535
[1,1]<stderr>:    #12 0x7ffff51b6dd4 in start_thread (/usr/lib64/libpthread.so.0+0x7dd4)
[1,1]<stderr>:
[1,1]<stderr>:Thread T5 created by T0 here:
[1,1]<stderr>:    #0 0x7ffff6e5ed20 in __interceptor_pthread_create /spack/var/spack/stage/gcc-7.3.0-ox22fqdausgn5up7wb5d3mli5ngcbgri/gcc-7.3.0/libsanitizer/asan/asan_interceptors.cc:243
[1,1]<stderr>:    #1 0x7ffff6943589 in parsec_init /home/bouteill/parsec/master/parsec/parsec/parsec.c:786
[1,1]<stderr>:    #2 0x405093 in main /home/bouteill/parsec/master/parsec/tests/stencil/testing_stencil_1D.c:95
[1,1]<stderr>:    #3 0x7ffff4e043d4 in __libc_start_main (/usr/lib64/libc.so.6+0x223d4)
[1,1]<stderr>:
[1,1]<stderr>:Thread T9 created by T0 here:
[1,1]<stderr>:    #0 0x7ffff6e5ed20 in __interceptor_pthread_create /spack/var/spack/stage/gcc-7.3.0-ox22fqdausgn5up7wb5d3mli5ngcbgri/gcc-7.3.0/libsanitizer/asan/asan_interceptors.cc:243
[1,1]<stderr>:    #1 0x7ffff6943589 in parsec_init /home/bouteill/parsec/master/parsec/parsec/parsec.c:786
[1,1]<stderr>:    #2 0x405093 in main /home/bouteill/parsec/master/parsec/tests/stencil/testing_stencil_1D.c:95
[1,1]<stderr>:    #3 0x7ffff4e043d4 in __libc_start_main (/usr/lib64/libc.so.6+0x223d4)
[1,1]<stderr>:
[1,1]<stderr>:Thread T4 created by T0 here:
[1,1]<stderr>:    #0 0x7ffff6e5ed20 in __interceptor_pthread_create /spack/var/spack/stage/gcc-7.3.0-ox22fqdausgn5up7wb5d3mli5ngcbgri/gcc-7.3.0/libsanitizer/asan/asan_interceptors.cc:243
[1,1]<stderr>:    #1 0x7ffff696d5e5 in remote_dep_dequeue_init /home/bouteill/parsec/master/parsec/parsec/remote_dep_mpi.c:390
[1,1]<stderr>:    #2 0x7ffff69832fa in parsec_remote_dep_init /home/bouteill/parsec/master/parsec/parsec/remote_dep.c:252
[1,1]<stderr>:    #3 0x7ffff6942a8f in parsec_init /home/bouteill/parsec/master/parsec/parsec/parsec.c:714
[1,1]<stderr>:    #4 0x405093 in main /home/bouteill/parsec/master/parsec/tests/stencil/testing_stencil_1D.c:95
[1,1]<stderr>:    #5 0x7ffff4e043d4 in __libc_start_main (/usr/lib64/libc.so.6+0x223d4)
[1,1]<stderr>:
[1,1]<stderr>:SUMMARY: AddressSanitizer: heap-use-after-free tests/stencil/stencil_1D.c:1628 in release_deps_of_stencil_1D_task
[1,1]<stderr>:Shadow bytes around the buggy address:
[1,1]<stderr>:  0x0c1a7fffbfe0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
[1,1]<stderr>:  0x0c1a7fffbff0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
[1,1]<stderr>:  0x0c1a7fffc000: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 00
[1,1]<stderr>:  0x0c1a7fffc010: 00 00 00 00 00 00 fa fa fa fa fa fa fa fa 00 00
[1,1]<stderr>:  0x0c1a7fffc020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 fa
[1,1]<stderr>:=>0x0c1a7fffc030: fa fa fa fa fa fa fa fa fd[fd]fd fd fd fd fd fd
[1,1]<stderr>:  0x0c1a7fffc040: fd fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa
[1,1]<stderr>:  0x0c1a7fffc050: fa fa fd fd fd fd fd fd fd fd fd fd fd fd fd fd
[1,1]<stderr>:  0x0c1a7fffc060: fd fd fd fa fa fa fa fa fa fa fa fa fd fd fd fd
[1,1]<stderr>:  0x0c1a7fffc070: fd fd fd fd fd fd fd fd fd fd fd fd fd fa fa fa
[1,1]<stderr>:  0x0c1a7fffc080: fa fa fa fa fa fa fd fd fd fd fd fd fd fd fd fd
[1,1]<stderr>:Shadow byte legend (one shadow byte represents 8 application bytes):
[1,1]<stderr>:  Addressable:           00
[1,1]<stderr>:  Partially addressable: 01 02 03 04 05 06 07
[1,1]<stderr>:  Heap left redzone:       fa
[1,1]<stderr>:  Freed heap region:       fd
[1,1]<stderr>:  Stack left redzone:      f1
[1,1]<stderr>:  Stack mid redzone:       f2
[1,1]<stderr>:  Stack right redzone:     f3
[1,1]<stderr>:  Stack after return:      f5
[1,1]<stderr>:  Stack use after scope:   f8
[1,1]<stderr>:  Global redzone:          f9
[1,1]<stderr>:  Global init order:       f6
[1,1]<stderr>:  Poisoned by user:        f7
[1,1]<stderr>:  Container overflow:      fc
[1,1]<stderr>:  Array cookie:            ac
[1,1]<stderr>:  Intra object redzone:    bb
[1,1]<stderr>:  ASan internal:           fe
[1,1]<stderr>:  Left alloca redzone:     ca
[1,1]<stderr>:  Right alloca redzone:    cb
[1,1]<stderr>:==16945==ABORTING
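For anyone wanting to reproduce this report: the binary was presumably built with AddressSanitizer instrumentation, roughly as below. The exact cache variables used for the original build were not posted in this ticket, so treat these flags as an assumption; -fsanitize=address and -fno-omit-frame-pointer are the standard GCC/Clang options.

```shell
# Hypothetical ASAN-instrumented CMake configuration (flags are assumptions,
# not taken from this ticket); adjust the source path to your tree.
cmake -DCMAKE_BUILD_TYPE=Debug \
      -DCMAKE_C_FLAGS="-fsanitize=address -fno-omit-frame-pointer" \
      -DCMAKE_EXE_LINKER_FLAGS="-fsanitize=address" \
      /path/to/parsec
make -j
```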
-
reporter It seems PR #456 solves this issue. Could you please confirm, @Aurelien Bouteiller?
-
- changed status to resolved
Are you using MPICH?