MPI.COMM_SELF.Spawn fails when mpirun is launched from a script running in the background (might be a bug)

Issue #66 invalid
Shanbo created an issue

Hi,

I was running the mpi4py unit tests in mpi4py/test and found what might be a bug.

The test I ran is test_spawn.py. I tried the following three cases:

  1. It works fine when run directly: python test_spawn.py
  2. It also works when launched as mpirun --oversubscribe -np 2 -H host1,host2 python test_spawn.py, or when the same command is run in the background: mpirun --oversubscribe -np 2 -H host1,host2 python test_spawn.py &. host1 and host2 are set up with the same environment. (There might be some temporary-file mismatches, but it still passes a few tests.)
  3. Here comes what I think is the bug: I use two scripts, script.sh and run.sh.

script.sh contains:

mpirun --oversubscribe -np 2 -H host1,host2 python test_spawn.py

run.sh is just:

sh script.sh &

In this case, a timeout error occurs:

[@nmyjs_104_22 test]$ [nmyjs_104_37:34159] OPAL ERROR: Timeout in file base/pmix_base_fns.c at line 193
E[nmyjs_104_22:23226] OPAL ERROR: Timeout in file base/pmix_base_fns.c at line 193
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_dpm_dyn_init() failed
  --> Returned "Timeout" (-15) instead of "Success" (0)
--------------------------------------------------------------------------
[warn] Epoll ADD(4) on fd 35 failed.  Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 51 failed.  Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 48 failed.  Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 30 failed.  Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor

I think this is a bug because the only difference between case 2 and case 3 is that case 3 invokes mpirun ... in the background from a shell script.

I saw this error in my own project too, so I suspect it is a bug in mpi4py or Open MPI.
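
For anyone trying to reproduce the report, the three cases can be driven from a small Python helper. This is only a sketch under assumptions: the function names (build_mpirun_command, write_script, launch_in_background) are hypothetical, host1 and host2 are the placeholder host names from the report, and the helper does nothing beyond assembling the command line quoted above and starting sh without waiting for it.

```python
import shlex
import subprocess


def build_mpirun_command(hosts, script="test_spawn.py", nprocs=2):
    """Return the mpirun argv quoted in the report as a list of strings."""
    return [
        "mpirun", "--oversubscribe",
        "-np", str(nprocs),
        "-H", ",".join(hosts),
        "python", script,
    ]


def write_script(path, hosts):
    """Write the one-line script.sh from case 3 and return that line."""
    line = shlex.join(build_mpirun_command(hosts))
    with open(path, "w") as fh:
        fh.write(line + "\n")
    return line


def launch_in_background(script_path):
    """Rough equivalent of `sh script.sh &`: start sh and do not wait."""
    return subprocess.Popen(["sh", script_path])


# Hypothetical usage for case 3 (needs a working MPI installation):
#   write_script("script.sh", ["host1", "host2"])
#   proc = launch_in_background("script.sh")
```

Case 2 corresponds to passing the list from build_mpirun_command straight to subprocess.Popen, without the intermediate sh process, which makes it easier to compare the two launch paths side by side.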

Comments (5)

  1. Lisandro Dalcin

    I'm almost sure this issue is not mpi4py's fault but that of the backend MPI implementation. You seem to be using Open MPI, but you have not stated its version. I guess you will have to ask the Open MPI folks about it; IIRC, Open MPI 2.x releases had issues with spawning.

  2. Lisandro Dalcin

    I'm marking this issue as invalid. If you can provide actual evidence that this is indeed a bug in mpi4py, then I'll reopen it and work on any required fixes.

  3. Shanbo reporter

    Hi Lisandro, I'm using Open MPI 2.0.2. I'll try other versions to see whether that works. Thank you.
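
Since the thread turns on which Open MPI release is in use, a report can include the version up front. The sketch below is an illustration, not part of mpi4py: the function names are hypothetical, and it assumes the launcher prints a first line of the form "mpirun (Open MPI) X.Y.Z" when called with --version.

```python
import re
import subprocess


def parse_open_mpi_version(text):
    """Extract (major, minor, patch) from `mpirun --version` output.

    Assumes a line like "mpirun (Open MPI) 2.0.2"; returns None otherwise.
    """
    match = re.search(r"\(Open MPI\)\s+(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def installed_open_mpi_version(launcher="mpirun"):
    """Run `<launcher> --version` and parse it; None if the launcher is missing."""
    try:
        result = subprocess.run(
            [launcher, "--version"],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        return None
    return parse_open_mpi_version(result.stdout)
```

From inside Python, mpi4py's MPI.Get_library_version() reports the version string of the MPI library that mpi4py is actually linked against, which is worth quoting alongside the launcher version in case the two differ.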
