Disconnect() hangs with Open MPI (MPI 3.1) / Python 3.8

Issue #176 resolved
bchareyre created an issue

Hello,

I keep failing to disconnect spawned processes in a program that otherwise seems to work. The OS is Canonical Ubuntu 20.04.
I figured out that the deb-packaged mpi4py actually fails runtests.py, so I uninstalled it and compiled the latest master (70333ef76db05).

The compiled version passes the tests, but it still hangs on disconnect. As a test script I use the code Lisandro posted in another thread; it is reproduced at the end of this message.

Any hint on how to work around this?

Best regards

Bruno

export OMPI_MCA_rmaps_base_oversubscribe=yes
python3 test/runtests.py
Python 3.8 (/usr/bin/python3)
MPI 3.1 (Open MPI 4.0.3)
mpi4py 3.1.0a0 (build/lib.linux-x86_64-3.8/mpi4py)
--------------------------------------------------------------------------
The OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this release.
Workarounds are to run on a single node, or to use a system with an RDMA
capable network such as Infiniband.
...

Ran 1239 tests in 53.792s

OK (skipped=186)

python3 testDisonnect.py
Hello from: 0 over population of 3
Hello from: 2 over population of 3
Hello from: 1 over population of 3

_________

#!/usr/bin/python3
import sys
from mpi4py import MPI

def execute(nproc=6, **kwargs):
    # Parent: spawn nproc-1 workers running this same file with the "slave" argument.
    comm_slave = MPI.COMM_SELF.Spawn(sys.executable, args=[__file__, "slave"], maxprocs=nproc - 1)
    # Merge the parent/worker intercommunicator into a single intracommunicator.
    comm_world = comm_slave.Merge()
    common_job(comm_world)
    comm_world.Disconnect()
    comm_slave.Disconnect()

def slave_job():
    # Worker: get the intercommunicator to the parent and merge it as well.
    comm_slave = MPI.Comm.Get_parent()
    comm_world = comm_slave.Merge()
    common_job(comm_world)
    comm_world.Disconnect()
    comm_slave.Disconnect()

def common_job(comm):
    comm.Barrier()
    print("Hello from: {} over population of {}".format(comm.rank, comm.size))
    comm.Barrier()

if __name__ == "__main__":
    if "slave" in sys.argv:
        slave_job()
    else:
        execute()

Comments (4)

  1. Lisandro Dalcin

    Can you try commenting out the barriers? Next, can you replace comm_world.Disconnect() with comm_world.Free()? Does any of that work? Sorry, but this is not an mpi4py issue, nor an issue with the code; it looks like MPI is simply not doing its job. Did you try MPICH? Did you try things in a conda environment with conda-forge packages?
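
    For reference, a rough sketch of what those two changes look like in the repro script's teardown (untested, reusing the names from the script above; the same replacement would also go in slave_job(), since freeing the merged communicator is a collective call):

        comm_world = comm_slave.Merge()
        common_job(comm_world)        # possibly with the Barrier() calls commented out
        comm_world.Free()             # instead of comm_world.Disconnect()
        comm_slave.Disconnect()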

  2. bchareyre reporter

    Something that seems to work is not disconnecting comm_world; disconnecting comm_slave alone is OK. I guess comm_world is left badly broken after that, but at least it returns to the Python prompt and exits smoothly…
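
    In code terms (sketch only), that variant just drops one call from the teardown in both execute() and slave_job():

        common_job(comm_world)
        # comm_world.Disconnect()   # skipped: this is the call that hangs
        comm_slave.Disconnect()     # disconnecting the intercommunicator alone returns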

  3. bchareyre reporter

    Commenting out the barriers: same failure.

    comm_world.Free(): solves the problem.

    No surprise if it is not mpi4py related. Your experience with such issues is appreciated anyway.
    Thanks for the quick feedback!

    And no, I didn't try MPICH. Would you advise doing so?

  4. Lisandro Dalcin

    Well, I have been an MPICH user since 2003, so I obviously have a bias. But let’s be practical: if you have issues on some system with Open MPI, and you do not really care about using one implementation or the other, then switching to MPICH can save your day.
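
    If you do switch, one quick way to double-check which MPI implementation a given mpi4py build is linked against (assuming a reasonably recent mpi4py and an MPI-3 library) is:

        from mpi4py import MPI
        print(MPI.get_vendor())            # e.g. ('Open MPI', (4, 0, 3)) or ('MPICH', ...)
        print(MPI.Get_library_version())   # full version string reported by the library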
