Deadlock in test_matrix test_ident_zeros, test of PETSc error handling

Issue #396 resolved
Martin Sandve Alnæs created an issue

When running unit tests with mpirun -n 4, the test

la/test_matrix.py:246: TestMatrixForAnyBackend.test_ident_zeros[False-any_backend0] 

deadlocked. This test checks that an expected PETSc error is propagated as an exception, but this doesn't seem to work robustly. In the deadlock, the processes were in these states:

Process 1:

la/test_matrix.py:246: TestMatrixForAnyBackend.test_ident_zeros[False-any_backend0] [1]PETSC ERROR: --------------------- Error Message ------------------------------------
[1]PETSC ERROR: Object is in wrong state!
[1]PETSC ERROR: Matrix is missing diagonal entry in row 6!
[1]PETSC ERROR: ------------------------------------------------------------------------
[1]PETSC ERROR: Petsc Release Version 3.4.2, Jul, 02, 2013 
[1]PETSC ERROR: See docs/changes/index.html for recent updates.
[1]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
[1]PETSC ERROR: See docs/index.html for manual pages.
[1]PETSC ERROR: ------------------------------------------------------------------------
[1]PETSC ERROR: Unknown Name on a linux-gnu-cxx-opt named martinal-mc by martinal Tue Oct 21 13:28:30 2014
[1]PETSC ERROR: Libraries linked from /home/martinal/opt/fenics/dorsal-dev-1410/lib
[1]PETSC ERROR: Configure run at Tue Oct 21 10:51:41 2014
[1]PETSC ERROR: Configure options --prefix=/home/martinal/opt/fenics/dorsal-dev-1410 COPTFLAGS=-O2 --with-debugging=0 --with-shared-libraries=1 --with-clanguage=cxx --with-c-support=1 --download-umfpack=1 --download-hypre=1 --download-mumps=1 --download-scalapack=1 --download-blacs=1 --download-ptscotch=1 --download-scotch=1 --download-metis=1 --download-parmetis=1 --with-ml=1 --with-ml-lib=/home/martinal/opt/fenics/dorsal-dev-1410/lib/libml.so --with-ml-include=/home/martinal/opt/fenics/dorsal-dev-1410/include/trilinos
[1]PETSC ERROR: ------------------------------------------------------------------------
[1]PETSC ERROR: MatZeroRows_SeqAIJ() line 1680 in ../src/mat/impls/aij/seq/aij.c
[1]PETSC ERROR: MatZeroRows() line 5386 in ../src/mat/interface/matrix.c
[1]PETSC ERROR: MatZeroRows_MPIAIJ() line 878 in ../src/mat/impls/aij/mpi/mpiaij.c
[1]PETSC ERROR: MatZeroRows() line 5386 in ../src/mat/interface/matrix.c
[1]PETSC ERROR: --------------------- Error Message ------------------------------------
[1]PETSC ERROR: Argument out of range!
[1]PETSC ERROR: New nonzero at (8,513) caused a malloc!
[1]PETSC ERROR: ------------------------------------------------------------------------
[1]PETSC ERROR: Petsc Release Version 3.4.2, Jul, 02, 2013 
[1]PETSC ERROR: See docs/changes/index.html for recent updates.
[1]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
[1]PETSC ERROR: See docs/index.html for manual pages.
[1]PETSC ERROR: ------------------------------------------------------------------------
[1]PETSC ERROR: Unknown Name on a linux-gnu-cxx-opt named martinal-mc by martinal Tue Oct 21 13:28:30 2014
[1]PETSC ERROR: Libraries linked from /home/martinal/opt/fenics/dorsal-dev-1410/lib
[1]PETSC ERROR: Configure run at Tue Oct 21 10:51:41 2014
[1]PETSC ERROR: Configure options --prefix=/home/martinal/opt/fenics/dorsal-dev-1410 COPTFLAGS=-O2 --with-debugging=0 --with-shared-libraries=1 --with-clanguage=cxx --with-c-support=1 --download-umfpack=1 --download-hypre=1 --download-mumps=1 --download-scalapack=1 --download-blacs=1 --download-ptscotch=1 --download-scotch=1 --download-metis=1 --download-parmetis=1 --with-ml=1 --with-ml-lib=/home/martinal/opt/fenics/dorsal-dev-1410/lib/libml.so --with-ml-include=/home/martinal/opt/fenics/dorsal-dev-1410/include/trilinos
[1]PETSC ERROR: ------------------------------------------------------------------------
[1]PETSC ERROR: MatSetValues_MPIAIJ() line 572 in ../src/mat/impls/aij/mpi/mpiaij.c
[1]PETSC ERROR: MatSetValues() line 1106 in ../src/mat/interface/matrix.c
FAILED
la/test_matrix.py:246: TestMatrixForAnyBackend.test_ident_zeros[True-any_backend0] ^C
Program received signal SIGINT, Interrupt.
0x00007fffee87b99f in opal_progress () from /usr/lib/libmpi.so.1
A debugging session is active.

        Inferior 1 [process 1312] will be killed.

Quit anyway? (y or n) n
Not confirmed.
(gdb) where
#0  0x00007fffee87b99f in opal_progress () from /usr/lib/libmpi.so.1
#1  0x00007fffee7c91f5 in ompi_request_default_wait_all ()
   from /usr/lib/libmpi.so.1
#2  0x00007fffcab3d302 in ompi_coll_tuned_sendrecv_actual ()
   from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
#3  0x00007fffcab4506e in ompi_coll_tuned_barrier_intra_recursivedoubling ()
   from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
#4  0x00007fffee7d656b in PMPI_Barrier () from /usr/lib/libmpi.so.1
#5  0x00007ffff0959671 in _wrap_MPI_barrier (
    args=0x7fffeeaec610 <ompi_request_lock+16>) at modulePYTHON_wrap.cxx:7160
#6  0x000000000052f936 in PyEval_EvalFrameEx ()

Process 0 (omitting 2 and 3 which are basically in the same state):

la/test_matrix.py:246: TestMatrixForAnyBackend.test_ident_zeros[False-any_backend0] Number of global vertices: 528
Number of global cells: 966
[0]PETSC ERROR: --------------------- Error Message ------------------------------------
[0]PETSC ERROR: Object is in wrong state!
[0]PETSC ERROR: Matrix is missing diagonal entry in row 5!
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Petsc Release Version 3.4.2, Jul, 02, 2013 
[0]PETSC ERROR: See docs/changes/index.html for recent updates.
[0]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
[0]PETSC ERROR: See docs/index.html for manual pages.
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Unknown Name on a linux-gnu-cxx-opt named martinal-mc by martinal Tue Oct 21 13:28:30 2014
[0]PETSC ERROR: Libraries linked from /home/martinal/opt/fenics/dorsal-dev-1410/lib
[0]PETSC ERROR: Configure run at Tue Oct 21 10:51:41 2014
[0]PETSC ERROR: Configure options --prefix=/home/martinal/opt/fenics/dorsal-dev-1410 COPTFLAGS=-O2 --with-debugging=0 --with-shared-libraries=1 --with-clanguage=cxx --with-c-support=1 --download-umfpack=1 --download-hypre=1 --download-mumps=1 --download-scalapack=1 --download-blacs=1 --download-ptscotch=1 --download-scotch=1 --download-metis=1 --download-parmetis=1 --with-ml=1 --with-ml-lib=/home/martinal/opt/fenics/dorsal-dev-1410/lib/libml.so --with-ml-include=/home/martinal/opt/fenics/dorsal-dev-1410/include/trilinos
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: MatZeroRows_SeqAIJ() line 1680 in ../src/mat/impls/aij/seq/aij.c
[0]PETSC ERROR: MatZeroRows() line 5386 in ../src/mat/interface/matrix.c
[0]PETSC ERROR: MatZeroRows_MPIAIJ() line 878 in ../src/mat/impls/aij/mpi/mpiaij.c
[0]PETSC ERROR: MatZeroRows() line 5386 in ../src/mat/interface/matrix.c
Number of global vertices: 528
Number of global cells: 966
^C
Program received signal SIGINT, Interrupt.
0x00007fffee87b9a6 in opal_progress () from /usr/lib/libmpi.so.1
A debugging session is active.

        Inferior 1 [process 1320] will be killed.

Quit anyway? (y or n) n
Not confirmed.
(gdb) w
Ambiguous command "w": watch, wh, whatis, where, while, while-stepping, winheight, ws.
(gdb) 
Ambiguous command "w": watch, wh, whatis, where, while, while-stepping, winheight, ws.
(gdb) where
#0  0x00007fffee87b9a6 in opal_progress () from /usr/lib/libmpi.so.1
#1  0x00007fffee7c91f5 in ompi_request_default_wait_all ()
   from /usr/lib/libmpi.so.1
#2  0x00007fffcab3f8bf in ompi_coll_tuned_allreduce_intra_recursivedoubling ()
   from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
#3  0x00007fffee7d5775 in PMPI_Allreduce () from /usr/lib/libmpi.so.1
#4  0x00007fffeef36204 in MatAssemblyBegin_MPIAIJ(_p_Mat*, MatAssemblyType) ()
   from /home/martinal/opt/fenics/dorsal-dev-1410/lib/libpetsc.so
#5  0x00007fffeef98f04 in MatAssemblyBegin ()
   from /home/martinal/opt/fenics/dorsal-dev-1410/lib/libpetsc.so
#6  0x00007ffff035ff9d in dolfin::PETScMatrix::apply (
    this=this@entry=0x2ca12b0, mode=...) at ../../dolfin/la/PETScMatrix.cpp:624
#7  0x00007ffff036a91c in dolfin::Matrix::apply (this=this@entry=0x28a68b0, 
    mode=...) at ../../dolfin/la/Matrix.h:87
#8  0x00007ffff05e4fef in dolfin::AssemblerBase::init_global_tensor (
    this=this@entry=0x1bc6850, A=..., a=...)
    at ../../dolfin/fem/AssemblerBase.cpp:148
#9  0x00007ffff05998b8 in dolfin::Assembler::assemble (
    this=this@entry=0x1bc6850, A=..., a=...)
    at ../../dolfin/fem/Assembler.cpp:96
#10 0x00007fffd067e153 in _wrap_Assembler_assemble (args=<optimized out>)
    at modulePYTHON_wrap.cxx:27305
#11 0x0000000000530825 in PyEval_EvalFrameEx ()

The installation was built today from the latest dorsal/fenics master, which uses PETSc 3.4.2.

Comments (7)

  1. Martin Sandve Alnæs reporter

    Consistently reproducible:

    mpirun -n 4 python -B -m pytest -sv la/test_matrix.py

    I can try the pull request.

  2. Martin Sandve Alnæs reporter

    I've merged master into mliertzer/fix-dirichletbc-zero and fixed up the new test and a couple of warnings, and pushed the branch to dolfin.

    This issue is unaffected.

  3. Martin Sandve Alnæs reporter

    Running

    mpirun -n 3 python -B -m pytest -sv la/test_matrix.py

    can also trigger this error. In an automated test sweep it took 8 runs before it happened for me.
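
A sweep like the one mentioned above could be automated roughly as follows. This is only a sketch: the helper name `run_until_hang`, the run limit, and the timeout are assumptions for illustration, not part of the DOLFIN test suite. It treats a run that exceeds the timeout as a hang; a run that merely fails its tests is not counted.

```python
import subprocess
import sys


def run_until_hang(cmd, max_runs=20, timeout_s=600):
    """Run cmd repeatedly; return the 1-based index of the first run
    that exceeds timeout_s, or None if no run timed out."""
    for run in range(1, max_runs + 1):
        try:
            # A nonzero exit status (failed tests) is not treated as a hang.
            subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            # subprocess.run kills the direct child before re-raising.
            print(f"run {run} timed out after {timeout_s}s", file=sys.stderr)
            return run
    return None
```

For the command from this comment, the call would be `run_until_hang(["mpirun", "-n", "3", "python", "-B", "-m", "pytest", "-sv", "la/test_matrix.py"])`. Only the direct child (`mpirun`) is killed on timeout, so stray rank processes may need to be cleaned up by hand.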

    The problem seems to be that somehow one process gets past the collective matrix apply() call at the end of assemble.
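
The general failure mode described here, where one rank leaves a collective code path after a local error while the others keep waiting, can be illustrated with a toy model. The sketch below uses Python threads in place of MPI ranks and a `threading.Barrier` with a timeout in place of a collective call; all names (`run_ranks`, `local_step`, `naive`, `agreed`) are hypothetical, and this is neither DOLFIN nor PETSc code nor the fix applied for this issue. With real MPI there is no timeout, so the "hung" ranks would block indefinitely.

```python
import threading


def run_ranks(body, n_ranks=4, timeout_s=1.0):
    """Run body(rank, barrier, flags) on threads that stand in for MPI ranks."""
    barrier = threading.Barrier(n_ranks, timeout=timeout_s)
    flags = [False] * n_ranks          # one "local error" flag per rank
    outcomes = [None] * n_ranks

    def worker(rank):
        try:
            body(rank, barrier, flags)
            outcomes[rank] = "ok"
        except threading.BrokenBarrierError:
            outcomes[rank] = "hung"    # timed out waiting for a missing rank
        except RuntimeError:
            outcomes[rank] = "raised"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes


def local_step(rank):
    """Hypothetical local operation that fails on rank 1 only."""
    if rank == 1:
        raise RuntimeError("local error on rank 1")


def naive(rank, barrier, flags):
    local_step(rank)       # rank 1 raises here and never reaches the collective
    barrier.wait()         # the other ranks wait for it


def agreed(rank, barrier, flags):
    try:
        local_step(rank)
    except RuntimeError:
        flags[rank] = True
    barrier.wait()         # every rank takes part, failed or not
    if any(flags):         # stand-in for an allreduce over the error flags
        raise RuntimeError("error on at least one rank")
```

In this model `run_ranks(naive)` leaves rank 1 with an exception and the other ranks stuck at the barrier, while `run_ranks(agreed)` lets every rank raise together. The second pattern trades one extra collective per checked operation for consistent behaviour across ranks.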
