Handling exceptions in mpi4py programs

Issue #97 resolved
Former user created an issue

Hi,

I am building a system that can run mpi4py scripts, and I would like those scripts to fail quickly if any MPI process fails. In some cases, however, execution hangs.

In the code below, the mpi_comm.gather step hangs. The code is meant to simulate some failure in an mpi4py script.

mpirun -mca orte_abort_on_non_zero_status 1 -n 2 python mpi4py_repro_with_timeout.py

def main():
    import mpi4py.MPI
    mpi_comm = mpi4py.MPI.COMM_WORLD
    if mpi_comm.rank == 1:
        raise ValueError('failure!')
    processor = mpi4py.MPI.Get_processor_name()
    mpi_comm.gather(processor)


if __name__ == '__main__':
    main()

I tried to set a timeout that catches a SIGALRM signal, but the handler is never run:

import signal
from contextlib import contextmanager


class TimeoutError(Exception):
    pass


@contextmanager
def timeout(seconds=0, minutes=0, hours=0):
    """
    Add a signal-based timeout to any block of code.
    If multiple time units are specified, they will be added together to determine time limit.
    Usage:
    with timeout(seconds=5):
        my_slow_function(...)
    Args:
        - seconds: The time limit, in seconds.
        - minutes: The time limit, in minutes.
        - hours: The time limit, in hours.
    """

    limit = seconds + 60 * minutes + 3600 * hours

    def handler(signum, frame):
        raise TimeoutError('timed out after {} seconds'.format(limit))

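    # Install the handler and arm the timer; undo both on exit.
    previous = signal.signal(signal.SIGALRM, handler)
    signal.setitimer(signal.ITIMER_REAL, limit)
    try:
        yield
    finally:
        # Disarm with the same API that armed the timer, then restore the old handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)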

def main():
    import mpi4py.MPI
    mpi_comm = mpi4py.MPI.COMM_WORLD
    if mpi_comm.rank == 1:
        raise ValueError('failure!')
    with timeout(seconds=3):
        processor = mpi4py.MPI.Get_processor_name()
        mpi_comm.gather(processor)


if __name__ == '__main__':
    main()

I am using OpenMPI 2.1.2 (though I have also tried 1.10 and 3.0.0) and mpi4py 3.0.0, on Ubuntu 16.04.

How can I get the behavior I want (ideally failing quickly, or at least failing after a timeout)? What is going wrong?

Thank you!

Comments (4)

  1. Lisandro Dalcin

    There is no solution for this that is elegant, performant, and standard all at once. You basically have to wrap the outermost call you care about in try/except, then do an allreduce to check for errors on any process and re-raise in a coordinated way.

    The MPI API is not designed to handle timeouts. Your signal trick is cute, but the underlying MPI implementation has to be prepared to honor signals the way you expect. You should ask the corresponding MPI implementors about it.
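
The try/except plus allreduce approach described in the comment above could look roughly like the sketch below. This is only an illustration, not code from the thread: the helper name run_collectively and the functions step and main are hypothetical, and it assumes that the failure happens inside the wrapped function, so that every rank still reaches the allreduce. It relies on the default reduction of comm.allreduce (a sum) to count failed ranks.

```python
def run_collectively(comm, func):
    """Run func() on every rank; raise on all ranks if any rank failed.

    Hypothetical helper: comm is any object with an mpi4py-style
    allreduce(value) method whose default reduction is a sum.
    """
    error = None
    result = None
    try:
        result = func()
    except Exception as exc:
        error = exc
    # Every rank reaches this collective call, whether or not func() raised.
    n_failed = comm.allreduce(1 if error is not None else 0)
    if n_failed:
        if error is not None:
            raise error  # re-raise the local failure
        raise RuntimeError('{} other MPI process(es) failed'.format(n_failed))
    return result


def main():
    # Imported here so the helper above does not depend on mpi4py.
    from mpi4py import MPI
    comm = MPI.COMM_WORLD

    def step():
        # Simulated failure on one rank, as in the original report.
        if comm.rank == 1:
            raise ValueError('failure!')
        return MPI.Get_processor_name()

    # All ranks raise here, so no rank enters gather alone.
    processor = run_collectively(comm, step)
    comm.gather(processor)
```

To try it, call main() at the end of the script and launch it with the mpirun command from the report. Both ranks should then exit with an exception instead of rank 0 blocking in gather. An exception raised outside the wrapped function would still leave the other ranks waiting.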
