Deadlock in test_matrix test_ident_zeros, test of PETSc error handling
When running unit tests with mpirun -n 4, the test
la/test_matrix.py:246: TestMatrixForAnyBackend.test_ident_zeros[False-any_backend0]
deadlocked. This test checks that an expected PETSc error is propagated as an exception, but this doesn't seem to work robustly. In the deadlock, the processes were in these states:
Process 1:
la/test_matrix.py:246: TestMatrixForAnyBackend.test_ident_zeros[False-any_backend0] [1]PETSC ERROR: --------------------- Error Message ------------------------------------
[1]PETSC ERROR: Object is in wrong state!
[1]PETSC ERROR: Matrix is missing diagonal entry in row 6!
[1]PETSC ERROR: ------------------------------------------------------------------------
[1]PETSC ERROR: Petsc Release Version 3.4.2, Jul, 02, 2013
[1]PETSC ERROR: See docs/changes/index.html for recent updates.
[1]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
[1]PETSC ERROR: See docs/index.html for manual pages.
[1]PETSC ERROR: ------------------------------------------------------------------------
[1]PETSC ERROR: Unknown Name on a linux-gnu-cxx-opt named martinal-mc by martinal Tue Oct 21 13:28:30 2014
[1]PETSC ERROR: Libraries linked from /home/martinal/opt/fenics/dorsal-dev-1410/lib
[1]PETSC ERROR: Configure run at Tue Oct 21 10:51:41 2014
[1]PETSC ERROR: Configure options --prefix=/home/martinal/opt/fenics/dorsal-dev-1410 COPTFLAGS=-O2 --with-debugging=0 --with-shared-libraries=1 --with-clanguage=cxx --with-c-support=1 --download-umfpack=1 --download-hypre=1 --download-mumps=1 --download-scalapack=1 --download-blacs=1 --download-ptscotch=1 --download-scotch=1 --download-metis=1 --download-parmetis=1 --with-ml=1 --with-ml-lib=/home/martinal/opt/fenics/dorsal-dev-1410/lib/libml.so --with-ml-include=/home/martinal/opt/fenics/dorsal-dev-1410/include/trilinos
[1]PETSC ERROR: ------------------------------------------------------------------------
[1]PETSC ERROR: MatZeroRows_SeqAIJ() line 1680 in ../src/mat/impls/aij/seq/aij.c
[1]PETSC ERROR: MatZeroRows() line 5386 in ../src/mat/interface/matrix.c
[1]PETSC ERROR: MatZeroRows_MPIAIJ() line 878 in ../src/mat/impls/aij/mpi/mpiaij.c
[1]PETSC ERROR: MatZeroRows() line 5386 in ../src/mat/interface/matrix.c
[1]PETSC ERROR: --------------------- Error Message ------------------------------------
[1]PETSC ERROR: Argument out of range!
[1]PETSC ERROR: New nonzero at (8,513) caused a malloc!
[1]PETSC ERROR: ------------------------------------------------------------------------
[1]PETSC ERROR: Petsc Release Version 3.4.2, Jul, 02, 2013
[1]PETSC ERROR: See docs/changes/index.html for recent updates.
[1]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
[1]PETSC ERROR: See docs/index.html for manual pages.
[1]PETSC ERROR: ------------------------------------------------------------------------
[1]PETSC ERROR: Unknown Name on a linux-gnu-cxx-opt named martinal-mc by martinal Tue Oct 21 13:28:30 2014
[1]PETSC ERROR: Libraries linked from /home/martinal/opt/fenics/dorsal-dev-1410/lib
[1]PETSC ERROR: Configure run at Tue Oct 21 10:51:41 2014
[1]PETSC ERROR: Configure options --prefix=/home/martinal/opt/fenics/dorsal-dev-1410 COPTFLAGS=-O2 --with-debugging=0 --with-shared-libraries=1 --with-clanguage=cxx --with-c-support=1 --download-umfpack=1 --download-hypre=1 --download-mumps=1 --download-scalapack=1 --download-blacs=1 --download-ptscotch=1 --download-scotch=1 --download-metis=1 --download-parmetis=1 --with-ml=1 --with-ml-lib=/home/martinal/opt/fenics/dorsal-dev-1410/lib/libml.so --with-ml-include=/home/martinal/opt/fenics/dorsal-dev-1410/include/trilinos
[1]PETSC ERROR: ------------------------------------------------------------------------
[1]PETSC ERROR: MatSetValues_MPIAIJ() line 572 in ../src/mat/impls/aij/mpi/mpiaij.c
[1]PETSC ERROR: MatSetValues() line 1106 in ../src/mat/interface/matrix.c
FAILED
la/test_matrix.py:246: TestMatrixForAnyBackend.test_ident_zeros[True-any_backend0] ^C
Program received signal SIGINT, Interrupt.
0x00007fffee87b99f in opal_progress () from /usr/lib/libmpi.so.1
A debugging session is active.
Inferior 1 [process 1312] will be killed.
Quit anyway? (y or n) n
Not confirmed.
(gdb) where
#0 0x00007fffee87b99f in opal_progress () from /usr/lib/libmpi.so.1
#1 0x00007fffee7c91f5 in ompi_request_default_wait_all ()
from /usr/lib/libmpi.so.1
#2 0x00007fffcab3d302 in ompi_coll_tuned_sendrecv_actual ()
from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
#3 0x00007fffcab4506e in ompi_coll_tuned_barrier_intra_recursivedoubling ()
from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
#4 0x00007fffee7d656b in PMPI_Barrier () from /usr/lib/libmpi.so.1
#5 0x00007ffff0959671 in _wrap_MPI_barrier (
args=0x7fffeeaec610 <ompi_request_lock+16>) at modulePYTHON_wrap.cxx:7160
#6 0x000000000052f936 in PyEval_EvalFrameEx ()
Process 0 (omitting 2 and 3 which are basically in the same state):
la/test_matrix.py:246: TestMatrixForAnyBackend.test_ident_zeros[False-any_backend0] Number of global vertices: 528
Number of global cells: 966
[0]PETSC ERROR: --------------------- Error Message ------------------------------------
[0]PETSC ERROR: Object is in wrong state!
[0]PETSC ERROR: Matrix is missing diagonal entry in row 5!
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Petsc Release Version 3.4.2, Jul, 02, 2013
[0]PETSC ERROR: See docs/changes/index.html for recent updates.
[0]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
[0]PETSC ERROR: See docs/index.html for manual pages.
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Unknown Name on a linux-gnu-cxx-opt named martinal-mc by martinal Tue Oct 21 13:28:30 2014
[0]PETSC ERROR: Libraries linked from /home/martinal/opt/fenics/dorsal-dev-1410/lib
[0]PETSC ERROR: Configure run at Tue Oct 21 10:51:41 2014
[0]PETSC ERROR: Configure options --prefix=/home/martinal/opt/fenics/dorsal-dev-1410 COPTFLAGS=-O2 --with-debugging=0 --with-shared-libraries=1 --with-clanguage=cxx --with-c-support=1 --download-umfpack=1 --download-hypre=1 --download-mumps=1 --download-scalapack=1 --download-blacs=1 --download-ptscotch=1 --download-scotch=1 --download-metis=1 --download-parmetis=1 --with-ml=1 --with-ml-lib=/home/martinal/opt/fenics/dorsal-dev-1410/lib/libml.so --with-ml-include=/home/martinal/opt/fenics/dorsal-dev-1410/include/trilinos
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: MatZeroRows_SeqAIJ() line 1680 in ../src/mat/impls/aij/seq/aij.c
[0]PETSC ERROR: MatZeroRows() line 5386 in ../src/mat/interface/matrix.c
[0]PETSC ERROR: MatZeroRows_MPIAIJ() line 878 in ../src/mat/impls/aij/mpi/mpiaij.c
[0]PETSC ERROR: MatZeroRows() line 5386 in ../src/mat/interface/matrix.c
Number of global vertices: 528
Number of global cells: 966
^C
Program received signal SIGINT, Interrupt.
0x00007fffee87b9a6 in opal_progress () from /usr/lib/libmpi.so.1
A debugging session is active.
Inferior 1 [process 1320] will be killed.
Quit anyway? (y or n) n
Not confirmed.
(gdb) w
Ambiguous command "w": watch, wh, whatis, where, while, while-stepping, winheight, ws.
(gdb)
Ambiguous command "w": watch, wh, whatis, where, while, while-stepping, winheight, ws.
(gdb) where
#0 0x00007fffee87b9a6 in opal_progress () from /usr/lib/libmpi.so.1
#1 0x00007fffee7c91f5 in ompi_request_default_wait_all ()
from /usr/lib/libmpi.so.1
#2 0x00007fffcab3f8bf in ompi_coll_tuned_allreduce_intra_recursivedoubling ()
from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
#3 0x00007fffee7d5775 in PMPI_Allreduce () from /usr/lib/libmpi.so.1
#4 0x00007fffeef36204 in MatAssemblyBegin_MPIAIJ(_p_Mat*, MatAssemblyType) ()
from /home/martinal/opt/fenics/dorsal-dev-1410/lib/libpetsc.so
#5 0x00007fffeef98f04 in MatAssemblyBegin ()
from /home/martinal/opt/fenics/dorsal-dev-1410/lib/libpetsc.so
#6 0x00007ffff035ff9d in dolfin::PETScMatrix::apply (
this=this@entry=0x2ca12b0, mode=...) at ../../dolfin/la/PETScMatrix.cpp:624
#7 0x00007ffff036a91c in dolfin::Matrix::apply (this=this@entry=0x28a68b0,
mode=...) at ../../dolfin/la/Matrix.h:87
#8 0x00007ffff05e4fef in dolfin::AssemblerBase::init_global_tensor (
this=this@entry=0x1bc6850, A=..., a=...)
at ../../dolfin/fem/AssemblerBase.cpp:148
#9 0x00007ffff05998b8 in dolfin::Assembler::assemble (
this=this@entry=0x1bc6850, A=..., a=...)
at ../../dolfin/fem/Assembler.cpp:96
#10 0x00007fffd067e153 in _wrap_Assembler_assemble (args=<optimized out>)
at modulePYTHON_wrap.cxx:27305
#11 0x0000000000530825 in PyEval_EvalFrameEx ()
The installation was built today with the latest dorsal/FEniCS master, which uses PETSc 3.4.2.
Comments (7)
-
reporter Consistently reproducible:
mpirun -n 4 python -B -m pytest -sv la/test_matrix.py
I can try the pull request.
-
reporter I've merged master into mliertzer/fix-dirichletbc-zero and fixed up the new test and a couple of warnings, and pushed the branch to dolfin.
This issue is unaffected.
-
reporter Running
mpirun -n 3 python -B -m pytest -sv la/test_matrix.py
can also trigger this error. In an automated test sweep it took 8 runs before it happened for me.
The problem seems to be that somehow one process gets past the collective matrix apply() call at the end of assemble.
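To illustrate the failure mode described here, a minimal sketch follows. It is not the DOLFIN code and not the fix from any pull request: threads stand in for MPI ranks, threading.Barrier stands in for a collective call, and all names (run_ranks, BackendError, agree_on_error) are hypothetical. It shows that when one rank raises before a collective, the remaining ranks wait in it, and that agreeing on the error status collectively before raising keeps the ranks in step.

```python
import threading


class BackendError(Exception):
    """Hypothetical stand-in for a linear algebra backend error."""


def run_ranks(n_ranks, failing_rank, agree_on_error, timeout=0.5):
    """Simulate n_ranks MPI processes with threads; return each rank's outcome."""
    flag_exchange = threading.Barrier(n_ranks)  # stands in for an allreduce of error flags
    apply_barrier = threading.Barrier(n_ranks)  # stands in for a collective apply()
    flags = [False] * n_ranks
    outcome = [None] * n_ranks

    def rank_main(rank):
        local_error = (rank == failing_rank)  # error seen on one rank only
        try:
            if agree_on_error:
                flags[rank] = local_error
                flag_exchange.wait(timeout)  # every rank takes part, failing or not
                if any(flags):
                    raise BackendError("error on some rank")
            elif local_error:
                raise BackendError("error on this rank")  # skips the collective
            apply_barrier.wait(timeout)  # blocks until all ranks arrive
            outcome[rank] = "ok"
        except BackendError:
            outcome[rank] = "raised"
        except threading.BrokenBarrierError:
            outcome[rank] = "hung"  # real MPI would block forever here

    threads = [threading.Thread(target=rank_main, args=(r,)) for r in range(n_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome


if __name__ == "__main__":
    print(run_ranks(4, failing_rank=1, agree_on_error=False))
    print(run_ranks(4, failing_rank=1, agree_on_error=True))
```

In the first call only rank 1 raises and the other three time out in the simulated collective; in the second call every rank raises the same exception, so no rank is left waiting. The timeout exists only so the simulation terminates; an actual MPI collective has no such escape.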
-
reporter Looks like this is fixed by https://bitbucket.org/fenics-project/dolfin/pull-request/174/fix-issue-392/diff
-
reporter - changed status to resolved
-
- removed milestone
Removing milestone: 1.5 (automated comment)
The pull request https://bitbucket.org/fenics-project/dolfin/pull-request/173 might fix this.