Unit tests observed to segfault or hang on several occasions

Issue #775 resolved
Martin Sandve Alnæs created an issue

Running run_unittests_py_mpi with Python 2:

...
fem/test_form.py Calling FFC just-in-time (JIT) compiler, this may take some time.
Calling FFC just-in-time (JIT) compiler, this may take some time.
Calling FFC just-in-time (JIT) compiler, this may take some time.
Calling FFC just-in-time (JIT) compiler, this may take some time.
Calling FFC just-in-time (JIT) compiler, this may take some time.
.Segmentation fault (core dumped)
Built target run_unittests_py_mpi

Running run_unittests_py_mpi with Python 3:

...
fem/test_form.py Calling FFC just-in-time (JIT) compiler, this may take some time.
Calling FFC just-in-time (JIT) compiler, this may take some time.
Calling FFC just-in-time (JIT) compiler, this may take some time.
Calling FFC just-in-time (JIT) compiler, this may take some time.
Calling FFC just-in-time (JIT) compiler, this may take some time.
.Calling FFC just-in-time (JIT) compiler, this may take some time.
Calling FFC just-in-time (JIT) compiler, this may take some time.
Calling FFC just-in-time (JIT) compiler, this may take some time.
Calling FFC just-in-time (JIT) compiler, this may take some time.
.sssssssssssSegmentation fault (core dumped)
Built target run_unittests_py_mpi

This was with master + some minor changes, but a segfault was observed on the buildbots a short time ago (as well as some tests hanging).

Investigation is needed; please try to reproduce on different setups. This is what I did:

cd <dolfin-build-dir>
export INSTANT_CACHE_DIR=`pwd`/instant-cache
export DIJITSO_CACHE_DIR=`pwd`/dijitso-cache
make run_unittests_py_mpi

Comments (21)

  1. Martin Sandve Alnæs reporter

    Running make run_unittests_py_mpi again seems to continue past the previous segfault location.

  2. Martin Sandve Alnæs reporter

    The tests now segfaulted here for the py3 build:

    --- Instant: compiling ---
    .ssCalling FFC just-in-time (JIT) compiler, this may take some time.
    .
    function/test_function.py Segmentation fault (core dumped)
    Built target run_unittests_py_mpi
    

    and here for the py2 build:

    fem/test_system_assembler.py .Calling FFC just-in-time (JIT) compiler, this may take some time.
    ..Calling FFC just-in-time (JIT) compiler, this may take some time.
    Calling FFC just-in-time (JIT) compiler, this may take some time.
    .--- Instant: compiling ---
    Segmentation fault (core dumped)
    Built target run_unittests_py_mpi
    

    Seems fairly reproducible (although a bit long-winded), so hopefully it can be pinpointed with a debugger session.

    Anyone else seen this?

  3. Martin Sandve Alnæs reporter

    @logg @johannes_ring this is important and should be fixed before the release! (@chris_richardson and @garth-wells: it looks like I originally pinged some other Chris, Garth, and Jan here...)

  4. Martin Sandve Alnæs reporter

    Some debugging info:

    fem/test_form.py::test_coefficient_derivatives 
    Thread 1 "python" received signal SIGSEGV, Segmentation fault.
    0x00007fffdd4e0545 in dolfin::PETScVector::_init (this=0x234db50, range=..., 
        local_to_global_map=std::vector of length 13, capacity 13 = {...}, 
        ghost_indices=std::vector of length 2, capacity 2 = {...})
        at ../../dolfin/la/PETScVector.cpp:884
    884     ../../dolfin/la/PETScVector.cpp: No such file or directory.
    A debugging session is active.
    
    (gdb) where
    #0  0x00007fffdd4e0545 in dolfin::PETScVector::_init (this=0x234db50, 
        range=..., 
        local_to_global_map=std::vector of length 13, capacity 13 = {...}, 
        ghost_indices=std::vector of length 2, capacity 2 = {...})
        at ../../dolfin/la/PETScVector.cpp:884
    #1  0x00007fffdd21f930 in dolfin::GenericVector::init (this=0x234db50, 
        tensor_layout=...) at ../../dolfin/la/GenericVector.h:87
    #2  0x00007fffdd5726d6 in dolfin::Function::init_vector (
        this=this@entry=0x235eff0) at ../../dolfin/function/Function.cpp:594
    #3  0x00007fffdd575c65 in dolfin::Function::Function (this=0x235eff0, V=...)
        at ../../dolfin/function/Function.cpp:65
    #4  0x00007fffd5dad1d3 in _wrap_new_Function__SWIG_0 (swig_obj=0x7fffffff52b0, 
        nobjs=1) at modulePYTHON_wrap.cxx:11171
    
  5. Jan Blechta

    @martinal, could you push the commit so that we can pull the build from quay and try to reproduce?

  6. Martin Sandve Alnæs reporter

    PETScVector.cpp:884 is the strcmp line here:

      // Add ghost points if Vec type is MPI (throw an error if Vec is not
      // VECMPI and ghost entry vector is not empty)
      if (strcmp(vec_type, VECMPI) == 0)
    

    git log shows this file hasn't been touched for half a year.
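
    For reference, here is a minimal, hypothetical sketch (not the actual DOLFIN code; function name and error handling are made up) of the pattern around that line, assuming vec_type comes from VecGetType and ghosts are attached with VecMPISetGhost. If an earlier PETSc call failed and its error code was never checked, vec_type could stay null and the strcmp would then crash exactly like this:

    #include <petscvec.h>
    #include <cstring>
    #include <stdexcept>
    #include <vector>

    void add_ghosts_if_mpi(Vec x, const std::vector<PetscInt>& ghost_indices)
    {
      // VecType is a const char*; if VecGetType failed and the error code
      // were ignored, vec_type would stay null and the strcmp below would
      // dereference a null pointer.
      VecType vec_type = nullptr;
      PetscErrorCode ierr = VecGetType(x, &vec_type);
      if (ierr != 0 || !vec_type)
        throw std::runtime_error("VecGetType failed");

      // Add ghost points if Vec type is MPI (throw an error if Vec is not
      // VECMPI and the ghost entry vector is not empty)
      if (strcmp(vec_type, VECMPI) == 0)
      {
        ierr = VecMPISetGhost(x, static_cast<PetscInt>(ghost_indices.size()),
                              ghost_indices.data());
        if (ierr != 0)
          throw std::runtime_error("VecMPISetGhost failed");
      }
      else if (!ghost_indices.empty())
        throw std::runtime_error("Cannot attach ghosts to a non-MPI Vec");
    }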

    @chris_richardson want to take a look?

  7. Martin Sandve Alnæs reporter

    @blechta it's current master (52ffb5dc87474b539a2ef9095281b2763c8de0bd) with https://bitbucket.org/fenics-project/dolfin/pull-requests/312/generate-equal-cmakefiles-and/diff and https://bitbucket.org/fenics-project/dolfin/pull-requests/313/make-more-robust-string-from/diff merged in, neither of which are likely relevant so try master.

    As mentioned, there was a segfault on the buildbots recently; probably the same issue.

    To reproduce with debugger try this:

    cd <dolfin-build-dir>/test/unit/python
    export INSTANT_CACHE_DIR=$(pwd)/instant-cache
    export DIJITSO_CACHE_DIR=$(pwd)/dijitso-cache
    mpirun -n 3 xterm -e gdb -ex r -ex q -args python -B -m pytest -sv .
    
  8. Jan Blechta

    I'm getting deadlocks in fem/test_form.py:test_coefficient_derivatives instead on my local build. It seems to be the first line of the test: f = Function(V). Two processes are in VecCreate(comm, &_x) while one is in VecSetSizes(_x, local_size, PETSC_DECIDE).

    EDIT: The processes in VecCreate are trying to PetscCommDuplicate/ompi_comm_dup, while the one in VecSetSizes is possibly waiting on an MPI reduction to compute the global size.
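
    To illustrate the hang (a minimal standalone MPI sketch, not DOLFIN code): when some ranks enter one collective operation while the others enter a different one, each side blocks waiting for ranks that will never arrive, which matches the VecCreate/PetscCommDuplicate vs. VecSetSizes picture above.

    #include <mpi.h>

    // Hypothetical demonstration of mismatched collectives: rank 0 enters an
    // allreduce while the other ranks duplicate the communicator. Both
    // operations are collective, so neither can complete and all ranks hang.
    int main(int argc, char** argv)
    {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0)
      {
        long local = 1, global = 0;
        MPI_Allreduce(&local, &global, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
      }
      else
      {
        MPI_Comm dup;
        MPI_Comm_dup(MPI_COMM_WORLD, &dup);
        MPI_Comm_free(&dup);
      }

      MPI_Finalize();  // never reached when run on more than one rank
      return 0;
    }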

  9. Martin Sandve Alnæs reporter

    The cache dir exports are there to ensure a clean slate for reproducing, in case the problem is JIT related. Maybe they're not necessary.

  10. Johannes Ring

    I could reproduce this in quay.io/fenicsproject_dev/dolfin:master using:

    cd ${FENICS_SRC_DIR}/dolfin/build/test/unit/python/fem
    mpirun -n 3 bash -c '${FENICS_PYTHON} -B -m pytest -svl test_form.py'
    

    The error does not trigger if the cache is generated first by a serial run. However, running dijitso clean after the serial run will trigger the error when running in parallel. Running only instant-clean after a serial run and then running in parallel works fine.

  11. Johannes Ring

    I cannot reproduce this in either quay.io/fenicsproject_dev/dolfin:master or quay.io/fenicsproject_dev/dolfin:py3-master anymore.

    EDIT: I cannot reproduce using the commands in my comment above, but running the full Python unit test suite seems to always lead to a segmentation fault or a hang:

    cd ${FENICS_SRC_DIR}/dolfin/build/test/unit/python
    mpirun -n 3 bash -c '${FENICS_PYTHON} -B -m pytest -svl .'
    
  12. Johannes Ring

    Not sure actually. It is what I have been using on bamboo. I see the same problem without it.

  13. Johannes Ring

    I have tried many different FEniCS installations today, hoping to find out where this was first introduced. All of them fail with a segmentation fault - also 2016.1.0. FEniCS-dev crashes on fem/test_form.py, while 2016.1.0 crashes on function/test_expression.py. FEniCS-dev also fails on the latter if continuing after the first segfault. Even 1.6.0 fails (no segfault, but it hangs on fem/test_local_solver.py). All tests were performed in Docker using one of quay.io/fenicsproject_dev/dolfin:master, quay.io/fenicsproject/dev-env, quay.io/fenicsproject/stable (tried all the different tags for 2016.1.0), or quay.io/fenicsproject/stable:1.6.0.

  14. Martin Sandve Alnæs reporter

    I consider this to be resolved now, and next time we get similar problems, at least PETScVector checks the PETSc errors more carefully.
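
    As a rough sketch of what "checking the PETSc errors more carefully" means (hypothetical code, not the actual PETScVector change): check every PetscErrorCode instead of silently continuing after a failed call.

    #include <petscvec.h>
    #include <stdexcept>
    #include <string>

    // Hypothetical helper: turn a non-zero PetscErrorCode into an exception
    // so that a failed PETSc call cannot be silently ignored.
    static void check_petsc(PetscErrorCode ierr, const char* name)
    {
      if (ierr != 0)
        throw std::runtime_error(std::string("PETSc call failed: ") + name);
    }

    void create_vector_checked(MPI_Comm comm, PetscInt local_size, Vec* x)
    {
      check_petsc(VecCreate(comm, x), "VecCreate");
      check_petsc(VecSetSizes(*x, local_size, PETSC_DECIDE), "VecSetSizes");
      check_petsc(VecSetFromOptions(*x), "VecSetFromOptions");
    }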
