Not all processes exit when creation of small mesh fails in parallel

Issue #476 resolved
Martin Sandve Alnæs created an issue

Running

from dolfin import *
mesh = UnitSquareMesh(1,1)

with 3 MPI processes, I get the error:

Number of global vertices: 4
Number of global cells: 2
Traceback (most recent call last):
  File "parallel_bug.py", line 2, in <module>
    mesh = UnitSquareMesh(1,1)
  File "/home/martinal/opt/fenics-dev/ze6zfklaqao4/dev-2015-02-20/lib/python2.7/site-packages/dolfin/cpp/mesh.py", line 8783, in __init__
Traceback (most recent call last):
  File "parallel_bug.py", line 2, in <module>
    mesh = UnitSquareMesh(1,1)
  File "/home/martinal/opt/fenics-dev/ze6zfklaqao4/dev-2015-02-20/lib/python2.7/site-packages/dolfin/cpp/mesh.py", line 8783, in __init__
    _mesh.UnitSquareMesh_swiginit(self,_mesh.new_UnitSquareMesh(*args))
RuntimeError: 

*** -------------------------------------------------------------------------
*** DOLFIN encountered an error. If you are not able to resolve this issue
*** using the information listed below, you can ask for help at
***
***     fenics@fenicsproject.org
***
*** Remember to include the error message listed below and, if possible,
*** include a *minimal* running example to reproduce the error.
***
*** -------------------------------------------------------------------------
*** Error:   Unable to complete call to function compute_vertex_mapping().
*** Reason:  Assertion cell_vertices.size() != 0 failed.
*** Where:   This error was encountered inside /home/martinal/dev/fenics-dev/dolfin/dolfin/mesh/MeshPartitioning.cpp (line 793).
*** Process: unknown
*** 
*** DOLFIN version: 1.5.0+
*** Git changeset:  ab134c7f5d94478bebecb04cd25da64dc95aa0c4
*** -------------------------------------------------------------------------

    _mesh.UnitSquareMesh_swiginit(self,_mesh.new_UnitSquareMesh(*args))
RuntimeError: 

*** -------------------------------------------------------------------------
*** DOLFIN encountered an error. If you are not able to resolve this issue
*** using the information listed below, you can ask for help at
***
***     fenics@fenicsproject.org
***
*** Remember to include the error message listed below and, if possible,
*** include a *minimal* running example to reproduce the error.
***
*** -------------------------------------------------------------------------
*** Error:   Unable to complete call to function compute_vertex_mapping().
*** Reason:  Assertion cell_vertices.size() != 0 failed.
*** Where:   This error was encountered inside /home/martinal/dev/fenics-dev/dolfin/dolfin/mesh/MeshPartitioning.cpp (line 793).
*** Process: unknown
*** 
*** DOLFIN version: 1.5.0+
*** Git changeset:  ab134c7f5d94478bebecb04cd25da64dc95aa0c4
*** -------------------------------------------------------------------------

Note how only two of the three processes raise exceptions. The mpirun command never exits. I believe this is what happens on some buildbots.

Comments (25)

  1. Prof Garth Wells

    One probable cause is that ParMETIS cannot handle distributed graphs where one process has no data. This was at least the case until recently.

  2. Martin Sandve Alnæs reporter

    Dolfin should at least crash properly. This is part of a wider issue of how to make dolfin abort all MPI processes in case of failure on a subset of the processes.

  3. Chris Richardson

    This is not related to ParMETIS in this case. Maybe that assert should be removed...

    It now crashes without deadlocking.

  4. Lawrence Mitchell

    When running under MPI you should abort with MPI_Abort(MPI_COMM_WORLD), which makes a best-effort attempt to abort all processes. I guess the dolfin_error code could do this.
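
    A minimal sketch of that pattern from user code (using mpi4py; run_simulation is a hypothetical stand-in for work that may raise on only some ranks):

    from mpi4py import MPI
    import traceback

    try:
        run_simulation()          # hypothetical user code; may fail on one rank only
    except Exception:
        traceback.print_exc()
        # Best-effort request to the MPI runtime to terminate every process,
        # instead of leaving the other ranks waiting in a collective call.
        MPI.COMM_WORLD.Abort(1)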

  5. Johannes Ring
    • changed status to open

    The error on the precise-i386 buildbot seems to be related to this. When running test_p13_box_2 in test/unit/python/book in parallel I get the following error:

    buildbot@prceise32-bbot:~/buildslave/dolfin-master-full-precise-i386/build/test/unit/python/book$ mpirun -np 3 py.test -vs -k test_p13_box_2
    ============================= test session starts ==============================
    platform linux2 -- Python 2.7.3 -- py-1.4.23 -- pytest-2.6.1 -- /usr/bin/python
    ============================= test session starts ==============================
    platform linux2 -- Python 2.7.3 -- py-1.4.23 -- pytest-2.6.1 -- /usr/bin/python
    ============================= test session starts ==============================
    platform linux2 -- Python 2.7.3 -- py-1.4.23 -- pytest-2.6.1 -- /usr/bin/python
    collected 114 items 
    collected 114 items 
    collected 114 items
    
    test_chapter_10.py::test_p13_box_2 
    test_chapter_10.py::test_p13_box_2 
    test_chapter_10.py::test_p13_box_2 Number of global vertices: 9
    Number of global cells: 8
    *** Warning: Mesh is empty, unable to create entities of dimension 1.
    *** Warning: Mesh is empty, unable to create connectivity 1 --> 2.
    *** Warning: Mesh is empty, unable to create entities of dimension 1.
    *** Warning: Mesh is empty, unable to create entities of dimension 1.
    PASSEDPASSEDFAILED
    
    
    
    =================================== FAILURES ===================================
    ________________________________ test_p13_box_2 ________________________________
    
        @use_gc_barrier
        def test_p13_box_2():
    >       mesh = Mesh(os.path.join(os.path.dirname(__file__), "mesh.xml"))
    
    test_chapter_10.py:242: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    /home/buildbot/fenicsbbot/master/dolfin-full/lib/python2.7/site-packages/dolfin/mesh/meshes.py:59: in __init__
        cpp.Mesh.__cppinit__(self, *args, **kwargs)
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    
    self = <dolfin.cpp.mesh.Mesh;  >
    args = ('/home/buildbot/buildslave/dolfin-master-full-precise-i386/build/test/unit/python/book/mesh.xml',)
    
        def __init__(self, *args):
            """
                **Overloaded versions**
    
                * Mesh\ ()
    
                  Create empty mesh
    
                * Mesh\ (comm)
    
    
    ================== 113 tests deselected by '-ktest_p13_box_2' ==================
    =================== 1 passed, 113 deselected in 0.78 seconds ===================
    
    ================== 113 tests deselected by '-ktest_p13_box_2' ==================
                  Create empty mesh
    
                * Mesh\ (mesh)
    
                  Copy constructor.
    
                  *Arguments*
                      mesh (:py:class:`Mesh`)
                          Object to be copied.
    
                * Mesh\ (filename)
    
                  Create mesh from data file.
    
                  *Arguments*
                      filename (str)
                          Name of file to load.
    
                * Mesh\ (comm, filename)
    
                  Create mesh from data file.
    
                  *Arguments*
                      comm (:py:class:`MPI`)
                          The MPI communicator
                      filename (str)
                          Name of file to load.
    
                * Mesh\ (comm, local_mesh_data)
    
                  Create a distributed mesh from local (per process) data.
    
                  *Arguments*
                      comm (:py:class:`MPI`)
                          MPI communicator for the mesh.
                      local_mesh_data (:py:class:`LocalMeshData`)
                          Data from which to build the mesh.
    
                """
    >       _mesh.Mesh_swiginit(self,_mesh.new_Mesh(*args))
    E       RuntimeError: 
    E       
    E       *** -------------------------------------------------------------------------
    E       *** DOLFIN encountered an error. If you are not able to resolve this issue
    E       *** using the information listed below, you can ask for help at
    E       ***
    E       ***     fenics@fenicsproject.org
    E       ***
    E       *** Remember to include the error message listed below and, if possible,
    E       *** include a *minimal* running example to reproduce the error.
    E       ***
    E       *** -------------------------------------------------------------------------
    E       *** Error:   Unable to create mesh entity.
    E       *** Reason:  Mesh entity index 0 out of range [0, 0] for entity of dimension 1.
    E       *** Where:   This error was encountered inside MeshEntity.cpp.
    E       *** Process: unknown
    E       *** 
    E       *** DOLFIN version: 1.6.0dev
    E       *** Git changeset:  a94191e563f2ac0a7536ec46bd2163d4516c4e4e
    E       *** -------------------------------------------------------------------------
    
  6. Jan Blechta

    @wence, calling MPI_Abort(MPI_COMM_WORLD) within dolfin_error has one significant drawback - exceptions can't be caught anymore. There could be a parameter for this, and in Python it could be done elegantly with a context manager like:

    with uninstall_mpiabort_error_handler:
        try:
            raising_routine()
        except RuntimeError:
            handle_exception()
    

    If everybody agrees I can try implementing this.

  7. Martin Sandve Alnæs reporter

    When is this a problem? We usually don't strive to clean up safely when exceptions occur in dolfin anyway.

    The fundamental problem is that exceptions are not safe in parallel because they usually do not occur on all processes and thus mess up the synchronous execution that MPI programs typically depend on.

    I think we should still allow exceptions to propagate in serial for debugging.

  8. Jan Blechta

    I thought that there's a lot of exception catching in tests (even in parallel). Another application is:

    try:
        solver.solve(args)
    except RuntimeError:
        handle_nonconvergence()
    

    You could imagine that a user also wants to catch non-collective exceptions in parallel, e.g. when fiddling with dofs, entity iterators, etc.

    This could be partially improved by having more exception types and not calling MPI_Abort for exceptions which are thrown collectively (like the solver in the first example). But this does not solve the second use case I mentioned. A user (and we in tests?) might still like to be able to uninstall MPI_Abort.

  9. Martin Sandve Alnæs reporter

    I'm confused. Does dolfin_error call MPI_Abort, or is it installed as a system-wide exception handler, or what is happening? Surely dolfin_error shouldn't be used to signal return results anywhere.

  10. Jan Blechta

    No, MPI_Abort is not installed anywhere in DOLFIN. But we are discussing a way to

    1. shut down whole application in a case that dolfin_error is called non-collectively
    2. keep the possibility of catching the errors

    Now we have 2 and not 1. If we blindly add MPI_Abort to dolfin_error we will lose 2. That's why some mechanism is needed for installing/uninstalling MPI_Abort. I don't think that

    if MPI.size(MPI_COMM_WORLD) > 1:
        MPI_Abort(MPI_COMM_WORLD)
    

    is sufficient. There must be a mechanism to control this:

    if user_wants_shutdown_on_err():
        MPI_Abort(MPI_COMM_WORLD)
    

    (I suggested implementing user_wants_shutdown_on_err() via a parameter, and on the Python side with a context manager; a rough sketch follows.)
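
    Something like this, purely as an illustration (the names _abort_on_error, user_wants_shutdown_on_err and uninstall_mpiabort_error_handler are hypothetical, not existing DOLFIN API):

    _abort_on_error = True          # default: dolfin_error would call MPI_Abort

    def user_wants_shutdown_on_err():
        return _abort_on_error

    class _UninstallMPIAbortErrorHandler:
        # Context manager that temporarily disables abort-on-error, so that
        # exceptions raised inside the with-block stay catchable.
        def __enter__(self):
            global _abort_on_error
            self._saved = _abort_on_error
            _abort_on_error = False
        def __exit__(self, *exc_info):
            global _abort_on_error
            _abort_on_error = self._saved
            return False            # do not swallow the exception

    uninstall_mpiabort_error_handler = _UninstallMPIAbortErrorHandler()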

  11. Martin Sandve Alnæs reporter

    What does it matter if the user doesn't want shutdown, if that wish results in race conditions or deadlocks?

    If specific places in the dolfin code need to throw user-catchable exceptions, they

    1) shouldn't use dolfin_error but rather a more communicative and documented exception type

    2) should do the necessary extra work to guarantee safe collective behaviour without deadlocks and race conditions

  12. Jan Blechta

    I don't agree with the assertion that exceptions should always be collective. It depends on the circumstances. One example:

    class E(Expression):
        def eval(self, y, x):
            try:
                self.f.eval(y, x)
            except RuntimeError:
                self._handle_extrapolation(self.f, y, x)
    

    I agree that we could have more exception types than just the RuntimeError thrown by dolfin_error. But this does not solve the dilemma that sometimes you want MPI_Abort and sometimes you want to be able to catch (even non-collectively).

  13. Martin Sandve Alnæs reporter

    I agree that the assertion "exceptions should always be collective" is too strict.

    However, we cannot allow exceptions to be thrown from anywhere without knowing that either:

    1) It is done collectively and handled safely across all processes.

    2) Someone (user or library) will catch it and make sure it is handled safely either locally or across all processes.

    3) It aborts the program safely across all processes.

    If we throw an exception on a subset of the processes, and that exception is not caught on all those processes before the next collective operation, we have a deadlock. Thus allowing exceptions to leave the library boundary means we not merely allow the user to catch exceptions; we also require that the user catch them for parallel-safe behaviour.

  14. Martin Sandve Alnæs reporter

    Your example Expression eval implementation is parallel safe if self.f.eval() and self._handle_extrapolation() are local operations, giving clear boundaries, inside the control of your eval(), beyond which exceptions should not propagate.

    (I don't like this way of using exceptions for an expected situation, though, but that's a separate issue.)

  15. Martin Sandve Alnæs reporter

    Another example of what I meant by

    "2) should do the necessary extra work to guarantee safe collective behaviour without deadlocks and race conditions"

    would be if the solver should use exceptions to communicate failure to converge. That is a collective situation: either the solver converged globally or it didn't. The solver code in dolfin must therefore make sure exceptions are thrown collectively if failure to converge occurs.
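
    A sketch of that pattern, using mpi4py for illustration (solve_locally is a hypothetical local solve): each rank records its local status, all ranks agree via an allreduce, and only then do all ranks raise together.

    from mpi4py import MPI

    def collective_solve(comm, solve_locally):
        try:
            result = solve_locally()
            local_failed = 0
        except RuntimeError:
            result = None
            local_failed = 1
        # Every rank reaches this collective call, so no rank is left waiting.
        if comm.allreduce(local_failed, op=MPI.MAX):
            raise RuntimeError("solver failed to converge on at least one process")
        return result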

  16. Martin Sandve Alnæs reporter

    If, on a subset of the processes, an exception leaves the exterior boundary of dolfin (or any self-contained part of dolfin):

    A) dolfin must take care not to have any collective calls waiting that didn't finish, or we are in a deadlock.

    B) the user must take care to catch the exception and get back to synchronous operation, or we will soon be in a deadlock.

    Assuming deadlocks are not acceptable, (A) means we must be careful how we allow exceptions to be thrown, and (B) means we must document a requirement that the user catches non-collective exceptions where this can occur.

    Myself I'd much prefer the "don't throw exceptions" approach, which means finding better ways to communicate eval and solver failure as return values.

  17. Jan Blechta

    We missed one possible mechanism: installing an error handler (calling MPI_Abort) via set_terminate. Then an exception is either caught or the handler is called. This seems to work on the C++ side but not with Python yet; see 30fbfb2.

  18. Jan Blechta

    This now also works on the Python side in the branch jan/mpi-abort. MPI_Abort is now called (if Python is non-interactive) on an uncaught exception. Catching exceptions is still possible.
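
    The general shape of such a mechanism (a sketch only, not the actual implementation in jan/mpi-abort): install an excepthook that prints the traceback as usual and then calls MPI_Abort, so an exception that escapes to the top level on one rank brings down the whole job instead of leaving the other ranks deadlocked.

    import sys
    from mpi4py import MPI

    def _mpi_abort_excepthook(exc_type, exc_value, exc_tb):
        sys.__excepthook__(exc_type, exc_value, exc_tb)   # report as usual
        if MPI.COMM_WORLD.size > 1:
            MPI.COMM_WORLD.Abort(1)

    # Only install when Python is running non-interactively, as described above.
    if not hasattr(sys, "ps1"):
        sys.excepthook = _mpi_abort_excepthook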
