Unit tests observed to segfault or hang on several occasions
Running unittest_py_mpi with python 2:
...
fem/test_form.py Calling FFC just-in-time (JIT) compiler, this may take some time.
Calling FFC just-in-time (JIT) compiler, this may take some time.
Calling FFC just-in-time (JIT) compiler, this may take some time.
Calling FFC just-in-time (JIT) compiler, this may take some time.
Calling FFC just-in-time (JIT) compiler, this may take some time.
.Segmentation fault (core dumped)
Built target run_unittests_py_mpi
Running unittest_py_mpi with python 3:
...
fem/test_form.py Calling FFC just-in-time (JIT) compiler, this may take some time.
Calling FFC just-in-time (JIT) compiler, this may take some time.
Calling FFC just-in-time (JIT) compiler, this may take some time.
Calling FFC just-in-time (JIT) compiler, this may take some time.
Calling FFC just-in-time (JIT) compiler, this may take some time.
.Calling FFC just-in-time (JIT) compiler, this may take some time.
Calling FFC just-in-time (JIT) compiler, this may take some time.
Calling FFC just-in-time (JIT) compiler, this may take some time.
Calling FFC just-in-time (JIT) compiler, this may take some time.
.sssssssssssSegmentation fault (core dumped)
Built target run_unittests_py_mpi
This was with master + some minor changes, but a segfault was observed on the buildbots a short time ago (as well as some tests hanging).
Investigation is needed; please try to reproduce on different setups. This is what I did:
cd <dolfin-build-dir>
export INSTANT_CACHE_DIR=`pwd`/instant-cache
export DIJITSO_CACHE_DIR=`pwd`/dijitso-cache
make run_unittests_py_mpi
Comments (21)
-
reporter The tests now segfaulted here for the py3 build:
--- Instant: compiling ---
.ssCalling FFC just-in-time (JIT) compiler, this may take some time.
.
function/test_function.py Segmentation fault (core dumped)
Built target run_unittests_py_mpi
and here for the py2 build:
fem/test_system_assembler.py .Calling FFC just-in-time (JIT) compiler, this may take some time.
..Calling FFC just-in-time (JIT) compiler, this may take some time.
Calling FFC just-in-time (JIT) compiler, this may take some time.
.--- Instant: compiling ---
Segmentation fault (core dumped)
Built target run_unittests_py_mpi
Seems fairly reproducible (although a bit long-winded), so hopefully it can be pinpointed with a debugger session.
Anyone else seen this?
-
reporter @logg @johannes_ring this is important and should be fixed before the release! (@chris_richardson and @garth-wells - looks like I originally pinged some other Chris, Garth, and Jan here...)
-
reporter Some debugging info:
fem/test_form.py::test_coefficient_derivatives
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fffdd4e0545 in dolfin::PETScVector::_init (this=0x234db50, range=...,
    local_to_global_map=std::vector of length 13, capacity 13 = {...},
    ghost_indices=std::vector of length 2, capacity 2 = {...})
    at ../../dolfin/la/PETScVector.cpp:884
884     ../../dolfin/la/PETScVector.cpp: No such file or directory.
A debugging session is active.
(gdb) where
#0  0x00007fffdd4e0545 in dolfin::PETScVector::_init (this=0x234db50, range=...,
    local_to_global_map=std::vector of length 13, capacity 13 = {...},
    ghost_indices=std::vector of length 2, capacity 2 = {...})
    at ../../dolfin/la/PETScVector.cpp:884
#1  0x00007fffdd21f930 in dolfin::GenericVector::init (this=0x234db50, tensor_layout=...)
    at ../../dolfin/la/GenericVector.h:87
#2  0x00007fffdd5726d6 in dolfin::Function::init_vector (this=this@entry=0x235eff0)
    at ../../dolfin/function/Function.cpp:594
#3  0x00007fffdd575c65 in dolfin::Function::Function (this=0x235eff0, V=...)
    at ../../dolfin/function/Function.cpp:65
#4  0x00007fffd5dad1d3 in _wrap_new_Function__SWIG_0 (swig_obj=0x7fffffff52b0, nobjs=1)
    at modulePYTHON_wrap.cxx:11171
-
@martinal, could you push the commit, so that we can pull the build from quay and try to reproduce?
-
reporter PETScVector.cpp:884 is the strcmp line here:
// Add ghost points if Vec type is MPI (throw an error if Vec is not
// VECMPI and ghost entry vector is not empty)
if (strcmp(vec_type, VECMPI) == 0)
git log shows this file hasn't been touched for half a year.
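For context, a hedged sketch of the suspected failure mode, not the actual DOLFIN source: it assumes vec_type comes from VecGetType and that the returned error code is ignored; vec_is_mpi is a hypothetical helper used only for illustration.
// Hedged sketch, not the DOLFIN code: if the error code from VecGetType is
// ignored, a failed call can leave vec_type invalid, and the strcmp then
// dereferences a bad pointer (SIGSEGV) instead of reporting the PETSc error.
#include <petscvec.h>
#include <cstdio>
#include <cstring>

bool vec_is_mpi(Vec x)  // hypothetical helper
{
  VecType vec_type = NULL;
  PetscErrorCode ierr = VecGetType(x, &vec_type);
  if (ierr != 0 || vec_type == NULL)
  {
    // Surfacing the error here avoids the later crash at the strcmp
    std::fprintf(stderr, "VecGetType failed (PETSc error %d)\n", (int)ierr);
    return false;
  }
  return std::strcmp(vec_type, VECMPI) == 0;
}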
@chris_richardson want to take a look?
-
reporter @blechta it's current master (52ffb5dc87474b539a2ef9095281b2763c8de0bd) with https://bitbucket.org/fenics-project/dolfin/pull-requests/312/generate-equal-cmakefiles-and/diff and https://bitbucket.org/fenics-project/dolfin/pull-requests/313/make-more-robust-string-from/diff merged in, neither of which is likely relevant, so try master.
As mentioned there was a segfault on the buildbots recently, probably the same.
To reproduce with debugger try this:
cd <dolfin-build-dir>/test/unit/python
export INSTANT_CACHE_DIR=$(pwd)/instant-cache
export DIJITSO_CACHE_DIR=$(pwd)/dijitso-cache
mpirun -n 3 xterm -e gdb -ex r -ex q -args python -B -m pytest -sv .
-
Why the exports?
-
I'm getting deadlocks in fem/test_form.py::test_coefficient_derivatives instead on my local build. It seems to happen on the first line of the test, f = Function(V). Two processes are in VecCreate(comm, &_x) while one is in VecSetSizes(_x, local_size, PETSC_DECIDE).
EDIT: the processes in VecCreate try to PetscCommDuplicate / ompi_comm_dup, while the one in VecSetSizes is possibly waiting for an MPI reduction to compute the global size, so the collectives never match up (see the sketch below).
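A minimal illustration of that kind of mismatched-collective hang, in plain MPI rather than DOLFIN or PETSc code; the program is deliberately erroneous and will typically hang when run on 3 ranks:
// Deliberately erroneous example: rank 0 enters one collective while the
// other ranks enter a different one on the same communicator, so the
// collectives never match and the program typically hangs - analogous to
// one rank reducing inside VecSetSizes while the others duplicate the
// communicator inside VecCreate.
#include <mpi.h>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0)
  {
    // Rank 0 starts a reduction over MPI_COMM_WORLD ...
    int local = 1, global = 0;
    MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
  }
  else
  {
    // ... while the remaining ranks duplicate the communicator: deadlock.
    MPI_Comm dup;
    MPI_Comm_dup(MPI_COMM_WORLD, &dup);
    MPI_Comm_free(&dup);
  }

  MPI_Finalize();
  return 0;
}
-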
reporter The cache dir exports are there to ensure a clean slate for reproducing, in case it was JIT related. Maybe they're not necessary.
-
I could reproduce this in quay.io/fenicsproject_dev/dolfin:master using:
cd ${FENICS_SRC_DIR}/dolfin/build/test/unit/python/fem
mpirun -n 3 bash -c '${FENICS_PYTHON} -B -m pytest -svl test_form.py'
The error does not trigger if the cache is generated first by a serial run. However, running dijitso clean after the serial run will trigger the error when running in parallel. Running only instant-clean after a serial run and then running in parallel works fine.
-
I cannot reproduce this in either quay.io/fenicsproject_dev/dolfin:master or quay.io/fenicsproject_dev/dolfin:py3-master anymore.
EDIT: I cannot reproduce it using the commands in my comment above, but running the full Python unit test suite seems to always lead to a segmentation fault or a hang:
cd ${FENICS_SRC_DIR}/dolfin/build/test/unit/python
mpirun -n 3 bash -c '${FENICS_PYTHON} -B -m pytest -svl .'
-
Why the bash -c wrapping of the command?
-
Not sure actually. It is what I have been using on bamboo. I see the same problem without it.
-
reporter @garth-wells care to take a look?
-
@martinal Unlikely that I'll have time for a few weeks.
-
Is this a blocker for the release of 2016.2?
-
I have tried many different FEniCS installations today, hoping to find out where this was first introduced. All of them fail with a segmentation fault - also 2016.1.0. FEniCS-dev crashes on fem/test_form.py, while 2016.1.0 crashes on function/test_expression.py. FEniCS-dev also fails on the latter if continuing after the first segfault. Even 1.6.0 fails (no segfault, but it hangs on fem/test_local_solver.py). All tests were performed in Docker, using one of quay.io/fenicsproject_dev/dolfin:master, quay.io/fenicsproject/dev-env, quay.io/fenicsproject/stable (tried all the different tags for 2016.1.0), or quay.io/fenicsproject/stable:1.6.0.
-
reporter With this PR better diagnostics will be available by checking more error codes:
https://bitbucket.org/fenics-project/dolfin/pull-requests/336/check-petscvector-errorcodes/diff
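A generic sketch of that kind of checking; check_petsc and init_vec are hypothetical names for illustration, not the actual code in the PR:
// Hedged sketch of checking every PETSc return code so failures are
// reported at the call site instead of surfacing later as a segfault.
// check_petsc and init_vec are hypothetical, not DOLFIN or PETSc API.
#include <petscvec.h>
#include <stdexcept>
#include <string>

inline void check_petsc(PetscErrorCode ierr, const std::string& call)
{
  if (ierr != 0)
    throw std::runtime_error("PETSc call '" + call + "' failed with error code "
                             + std::to_string(static_cast<int>(ierr)));
}

void init_vec(MPI_Comm comm, PetscInt local_size, Vec& x)
{
  // Every PETSc call is checked rather than having its return value ignored
  check_petsc(VecCreate(comm, &x), "VecCreate");
  check_petsc(VecSetSizes(x, local_size, PETSC_DECIDE), "VecSetSizes");
  check_petsc(VecSetFromOptions(x), "VecSetFromOptions");
}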
-
reporter - changed status to resolved
I consider this to be resolved now, and next time we get similar problems, at least PETScVector checks the PETSc error codes more carefully.
-
Running make run_unittests_py_mpi again seems to continue past the previous segfault location.