Not all processes exit when creation of small mesh fails in parallel
Running
from dolfin import *
mesh = UnitSquareMesh(1,1)
with 3 mpi processes I get the error:
Number of global vertices: 4
Number of global cells: 2
Traceback (most recent call last):
  File "parallel_bug.py", line 2, in <module>
    mesh = UnitSquareMesh(1,1)
  File "/home/martinal/opt/fenics-dev/ze6zfklaqao4/dev-2015-02-20/lib/python2.7/site-packages/dolfin/cpp/mesh.py", line 8783, in __init__
    _mesh.UnitSquareMesh_swiginit(self,_mesh.new_UnitSquareMesh(*args))
RuntimeError:
*** -------------------------------------------------------------------------
*** DOLFIN encountered an error. If you are not able to resolve this issue
*** using the information listed below, you can ask for help at
***
*** fenics@fenicsproject.org
***
*** Remember to include the error message listed below and, if possible,
*** include a *minimal* running example to reproduce the error.
***
*** -------------------------------------------------------------------------
*** Error: Unable to complete call to function compute_vertex_mapping().
*** Reason: Assertion cell_vertices.size() != 0 failed.
*** Where: This error was encountered inside /home/martinal/dev/fenics-dev/dolfin/dolfin/mesh/MeshPartitioning.cpp (line 793).
*** Process: unknown
***
*** DOLFIN version: 1.5.0+
*** Git changeset: ab134c7f5d94478bebecb04cd25da64dc95aa0c4
*** -------------------------------------------------------------------------
(An identical traceback and DOLFIN error message is printed, interleaved, by a second process.)
Note how only two of the three processes raise exceptions. The mpirun command never exits. I believe this is what happens on some buildbots.
Comments (25)
-
-
reporter Dolfin should at least crash properly. This is part of a wider issue of how to make dolfin abort all MPI processes in case of failure on a subset of the processes.
-
This is not related to ParMETIS, in this case. Maybe that
assert
should be removed... It now crashes without deadlocking.
-
When running under MPI you should abort with MPI_Abort(MPI_COMM_WORLD), which makes a best-effort attempt to abort all processes. I guess the dolfin_error code could do this.
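For illustration, a minimal sketch of the "abort the whole job on an uncaught exception" idea on the Python side, using sys.excepthook. This is not DOLFIN's actual mechanism; the real abort would be MPI_Abort (e.g. mpi4py's MPI.COMM_WORLD.Abort), stubbed here so the example runs without MPI:

```python
import sys

abort_calls = []  # records "abort" invocations, standing in for real MPI_Abort


def mpi_abort(errorcode=1):
    # Stand-in for mpi4py's MPI.COMM_WORLD.Abort(errorcode), which makes a
    # best-effort attempt to terminate every rank, not just the failing one.
    abort_calls.append(errorcode)


def abort_on_uncaught(exc_type, exc_value, exc_tb):
    # Print the usual traceback first, then tear down all ranks.
    sys.__excepthook__(exc_type, exc_value, exc_tb)
    mpi_abort(1)


# Installing the hook makes any uncaught exception bring down the whole job;
# exceptions caught before they reach the top level are unaffected.
sys.excepthook = abort_on_uncaught
```

Note that the hook only fires for exceptions that escape to the interpreter's top level, which is exactly the trade-off debated below.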
-
- changed status to resolved
Fixed in 1f9f814
-
- changed status to open
The error on the precise-i386 buildbot seems to be related to this. When running
test_p13_box_2
in test/unit/python/book
in parallel I get the following error:

buildbot@prceise32-bbot:~/buildslave/dolfin-master-full-precise-i386/build/test/unit/python/book$ mpirun -np 3 py.test -vs -k test_p13_box_2
============================= test session starts ==============================
platform linux2 -- Python 2.7.3 -- py-1.4.23 -- pytest-2.6.1 -- /usr/bin/python
collected 114 items
(the session header, "collected 114 items" and the test name are printed once per process)
test_chapter_10.py::test_p13_box_2
Number of global vertices: 9
Number of global cells: 8
*** Warning: Mesh is empty, unable to create entities of dimension 1.
*** Warning: Mesh is empty, unable to create connectivity 1 --> 2.
*** Warning: Mesh is empty, unable to create entities of dimension 1.
*** Warning: Mesh is empty, unable to create entities of dimension 1.
PASSED PASSED FAILED
=================================== FAILURES ===================================
________________________________ test_p13_box_2 ________________________________
    @use_gc_barrier
    def test_p13_box_2():
>       mesh = Mesh(os.path.join(os.path.dirname(__file__), "mesh.xml"))
test_chapter_10.py:242:
/home/buildbot/fenicsbbot/master/dolfin-full/lib/python2.7/site-packages/dolfin/mesh/meshes.py:59: in __init__
    cpp.Mesh.__cppinit__(self, *args, **kwargs)
self = <dolfin.cpp.mesh.Mesh; >
args = ('/home/buildbot/buildslave/dolfin-master-full-precise-i386/build/test/unit/python/book/mesh.xml',)
    def __init__(self, *args):
        """ [Mesh constructor docstring, listing the overloaded versions Mesh(), Mesh(comm), Mesh(mesh), Mesh(filename), Mesh(comm, filename), Mesh(comm, local_mesh_data)] """
>       _mesh.Mesh_swiginit(self,_mesh.new_Mesh(*args))
E       RuntimeError:
E
E       *** -------------------------------------------------------------------------
E       *** DOLFIN encountered an error. If you are not able to resolve this issue
E       *** using the information listed below, you can ask for help at
E       ***
E       *** fenics@fenicsproject.org
E       ***
E       *** Remember to include the error message listed below and, if possible,
E       *** include a *minimal* running example to reproduce the error.
E       ***
E       *** -------------------------------------------------------------------------
E       *** Error: Unable to create mesh entity.
E       *** Reason: Mesh entity index 0 out of range [0, 0] for entity of dimension 1.
E       *** Where: This error was encountered inside MeshEntity.cpp.
E       *** Process: unknown
E       ***
E       *** DOLFIN version: 1.6.0dev
E       *** Git changeset: a94191e563f2ac0a7536ec46bd2163d4516c4e4e
E       *** -------------------------------------------------------------------------
================== 113 tests deselected by '-ktest_p13_box_2' ==================
=================== 1 passed, 113 deselected in 0.78 seconds ===================
-
Issue
#477 was marked as a duplicate of this issue.
-
@wence, calling
MPI_Abort(MPI_COMM_WORLD)
within dolfin_error
has one significant drawback - exceptions can't be caught anymore. There could be a parameter for this, and in Python it could be done elegantly with a context manager like:

with uninstall_mpiabort_error_handler:
    try:
        raising_routine()
    except RuntimeError:
        handle_exception()
If everybody agrees I can try implementing this.
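One way the proposed context manager could look, as a hedged sketch. The parameter name mpi_abort_on_error and the plain dict standing in for DOLFIN's parameter system are made up for illustration:

```python
from contextlib import contextmanager

# Hypothetical dolfin-style parameter controlling the abort behaviour;
# the name is invented for this sketch.
parameters = {"mpi_abort_on_error": True}


@contextmanager
def uninstall_mpiabort_error_handler():
    # Temporarily let exceptions propagate (and be catchable) instead of
    # aborting all ranks; restore the previous setting on exit.
    saved = parameters["mpi_abort_on_error"]
    parameters["mpi_abort_on_error"] = False
    try:
        yield
    finally:
        parameters["mpi_abort_on_error"] = saved
```

With this spelling the usage would be `with uninstall_mpiabort_error_handler():` (with parentheses); the snippet in the comment above omits the call, which would instead require a pre-built context manager instance.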
-
reporter When is this a problem? We usually don't strive to clean up safely when exceptions occur in dolfin anyway.
The fundamental problem is that exceptions are not safe in parallel because they usually do not occur on all processes and thus mess up the synchronous execution that MPI programs typically depend on.
I think we should still allow exceptions to propagate in serial for debugging.
-
I thought that there's a lot of exception catching in tests (even in parallel). Another application is:

try:
    solver.solve(args)
except RuntimeError:
    handle_nonconvergence()
You could imagine that user also wants to catch non-collective exceptions in parallel, like fiddling with dofs, entity iterators, etc.
This could be partially improved by having more exception types and not calling MPI_Abort for exceptions which are thrown collectively (like solvers, the first example). But this does not solve the second use case I mentioned. The user (and we in tests?) might still want to be able to uninstall MPI_Abort.
-
reporter I'm confused. Does dolfin_error call MPI_Abort, or is it installed as a system-wide exception handler, or what is happening? Surely dolfin_error shouldn't be used to signal return results anywhere.
-
No, MPI_Abort is not installed anywhere in DOLFIN. But we are discussing a way to
1. shut down the whole application in case dolfin_error is called non-collectively, and
2. keep the possibility of catching the errors.
Now we have 2. and not 1. If we blindly add MPI_Abort to dolfin_error we will lose 2. That's why some mechanism is needed for installing/uninstalling MPI_Abort. I don't think that

if MPI.size(MPI_COMM_WORLD) > 1:
    MPI_Abort(MPI_COMM_WORLD)

is sufficient. There must be a mechanism to control this:

if user_wants_shutdown_on_err():
    MPI_Abort(MPI_COMM_WORLD)

(I suggested implementing user_wants_shutdown_on_err() by a parameter and a context manager on the Python side.)
-
reporter What does it matter if the user doesn't want shutdown, if that wish results in race conditions or deadlocks?
If specific places in the dolfin code need to throw user-catchable exceptions, they
1) shouldn't use dolfin_error but rather a more communicative and documented exception type
2) should do the necessary extra work to guarantee safe collective behaviour without deadlocks and race conditions
-
I don't agree with the assertion that exceptions should always be collective. It depends on circumstances. One example:

class E(Expression):
    def eval(self, y, x):
        try:
            self.f.eval(y, x)
        except RuntimeError:
            self._handle_extrapolation(self.f, y, x)
I agree that we could have more exception types than just the RuntimeError thrown by dolfin_error. But this does not solve the dilemma that sometimes you want MPI_Abort and sometimes you want to be able to catch (even non-collectively).
-
reporter I agree that the assertion "exceptions should always be collective" is too strict.
However we cannot allow exceptions to be thrown from anywhere without knowing that either
1) It is done collectively and handled safely across all processes.
2) Someone (user or library) will catch it and make sure it is handled safely either locally or across all processes.
3) It aborts the program safely across all processes.
If we throw an exception on a subset of the processes, and that exception is not caught on all those processes before the next collective operation, we have a deadlock. Thus allowing exceptions to leave the library boundary means we do not merely allow the user to catch exceptions, but also require that the user catch exceptions for parallel-safe behaviour.
-
reporter Your example Expression eval implementation is parallel safe if self.f.eval() and self._handle_extrapolation() are local operations; that gives clear boundaries, within the control of your eval(), beyond which exceptions should not propagate.
(I don't like this way of using exceptions for an expected situation though but that's a separate issue)
-
reporter Another example of what I meant by
"2) should do the necessary extra work to guarantee safe collective behaviour without deadlocks and race conditions"
would be if the solver should use exceptions to communicate failure to converge. That is a collective situation: either the solver converged globally or it didn't. The solver code in dolfin must therefore make sure exceptions are thrown collectively if failure to converge occurs.
-
reporter If, on a subset of the processes, an exception leaves the exterior boundary of dolfin (or any self-contained part of dolfin):
A) dolfin must take care not have any collective calls waiting that didn't finish, or we are in deadlock.
B) the user must take care to catch the exception and get back to synchronous operation, or we will soon be in a deadlock.
Assuming deadlocks are not acceptable, (A) means we must be careful how we allow exceptions to be thrown, and (B) means we must document a requirement that the user catches non-collective exceptions where this can occur.
Myself I'd much prefer the "don't throw exceptions" approach, which means finding better ways to communicate eval and solver failure as return values.
-
We missed one possible mechanism: installing an error handler (MPI_Abort) via
set_terminate
. Then exceptions are either caught or the handler is called. This seems to work on the C++ side but not with Python yet, see 30fbfb2.
-
Now working also on Python side in branch
jan/mpi-abort
. Now MPI_Abort is called (if Python is non-interactive) on an uncaught exception. Catching exceptions is still possible.
-
- changed component to common
-
assigned issue to
- marked as minor
-
Fix in pull request #219.
-
- changed milestone to 1.7
Will be merged after 1.6 release.
-
- changed status to resolved
Solved by 7f84a07
-
- removed milestone
Removing milestone: 1.7 (automated comment)
One probable cause is that ParMETIS cannot handle distributed graphs where one process has no data. This was at least the case until recently.