Disconnect() hangs with Open MPI (MPI 3.1) / Python 3.8
Hello,
I keep failing to disconnect spawned processes in a program that otherwise seems to work. The OS is Canonical Ubuntu 20.04.
I figured out that the deb-packaged mpi4py actually fails runtests.py, so I uninstalled it and compiled the latest master (70333ef76db05).
The compiled version passes the tests, but it still hangs on Disconnect(). As a test script I use the code Lisandro posted in another thread; it is reproduced at the end of this message.
Any hint on how to work around this?
Best regards
Bruno
export OMPI_MCA_rmaps_base_oversubscribe=yes
python3 test/runtests.py
Python 3.8 (/usr/bin/python3)
MPI 3.1 (Open MPI 4.0.3)
mpi4py 3.1.0a0 (build/lib.linux-x86_64-3.8/mpi4py)
--------------------------------------------------------------------------
The OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this release.
Workarounds are to run on a single node, or to use a system with an RDMA
capable network such as Infiniband.
...
Ran 1239 tests in 53.792s
OK (skipped=186)
python3 testDisonnect.py
Hello from: 0 over population of 3
Hello from: 2 over population of 3
Hello from: 1 over population of 3
_________
#!/usr/bin/python3
import sys
from mpi4py import MPI

def execute(nproc=6, **kwargs):
    comm_slave = MPI.COMM_SELF.Spawn(sys.executable, args=[__file__, "slave"],
                                     maxprocs=nproc - 1)
    comm_world = comm_slave.Merge()
    common_job(comm_world)
    comm_world.Disconnect()
    comm_slave.Disconnect()

def slave_job():
    comm_slave = MPI.Comm.Get_parent()
    comm_world = comm_slave.Merge()
    common_job(comm_world)
    comm_world.Disconnect()
    comm_slave.Disconnect()

def common_job(comm):
    comm.Barrier()
    print("Hello from: {} over population of {}".format(comm.rank, comm.size))
    comm.Barrier()

if __name__ == "__main__":
    if "slave" in sys.argv:
        slave_job()
    else:
        execute()
Comments (5)
reporter Something which seems to work is to not disconnect comm_world; disconnecting comm_slave alone is OK. I guess comm_world is badly broken after that, but at least the program returns to the Python prompt and exits smoothly…
reporter Commenting out the barriers: same failure.
comm_world.Free() solves the problem.
No surprise if it is not mpi4py related. Your experience with such issues is appreciated anyway.
Thanks for the quick feedback! And no, I didn't try MPICH. Would you advise doing so?
Well, I've been an MPICH user since 2003, so I obviously have a bias. But let's be practical: if you have issues on some system with Open MPI, and you do not really care about using one implementation or the other, then switching to MPICH can save your day.
- changed status to resolved
Can you try commenting out the barriers? Next, can you replace comm_world.Disconnect() with comm_world.Free()? Does either of those work? Sorry, but this is not an mpi4py issue, nor an issue with the code; it looks like MPI is just not doing its job. Did you try MPICH? Did you try things in a conda environment with conda-forge packages?