unit tests fail with MPICH-master

Issue #19 closed
Rob Latham created an issue

Guess who just started playing with mpi4py today...

If I execute the unit tests, with a single process:

$ ~/work/soft/mpich/bin/mpiexec -np 1 python test/runtests.py -v
[...] omit many lines [...]
...
testGet (test_info.TestInfo) ... ok
testGetNKeys (test_info.TestInfo) ... ok
testGetSetDelete (test_info.TestInfo) ... ok
testPyMethods (test_info.TestInfo) ... ok
testTruth (test_info.TestInfo) ... ok
testPyMethods (test_info.TestInfoEnv) ... ok
testTruth (test_info.TestInfoEnv) ... ok
testPyMethods (test_info.TestInfoNull) ... ok
testTruth (test_info.TestInfoNull) ... ok
testIReadIWrite (test_io.TestIOSelf) ... overflow in finalize stack!
internal ABORT - process 0

If I run the I/O tests by themselves, things are OK:

$  ~/work/soft/mpich/bin/mpiexec -np 1 python test/runtests.py -v --include io
[... omit many lines ... ]
Ran 69 tests in 12.018s

Comments (18)

  1. Lisandro Dalcin

    @roblatham Sorry for the late answer. For some unknown reason, Bitbucket unsubscribed me from issue notifications from all my repos, so I didn't catch this one until now.

    This is what happens when a test suite exercises almost all of the MPI API in a single run. Some static array used to register finalizer callbacks is likely not large enough; you get that error only when enough finalizers are registered to exhaust it. Does that make sense? IIRC, I reported a similar (or the same?) issue to the devel list.
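
    A conceptual sketch of that failure mode, in plain Python rather than MPICH's C (the callback limit below is an assumption, not MPICH's actual constant):

    # Conceptual sketch only, not MPICH source: a fixed-capacity stack of
    # finalizer callbacks.  Registering one callback too many produces the
    # kind of "overflow in finalize stack!" abort shown above.
    MAX_FINALIZE_CALLBACKS = 256          # assumed compile-time limit

    _finalize_stack = []

    def register_finalizer(callback):
        if len(_finalize_stack) >= MAX_FINALIZE_CALLBACKS:
            raise RuntimeError("overflow in finalize stack!")
        _finalize_stack.append(callback)

    def run_finalizers():
        # Callbacks run in LIFO order at MPI_Finalize time.
        while _finalize_stack:
            _finalize_stack.pop()()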

  2. Rob Latham reporter

    I see that devel message. It sure seems related. Sorry you got no response. I'll open a ticket at least... that's not a part of the code I know a whole lot about.

  3. Rob Latham reporter

    OK, with MPICH commit http://git.mpich.org/mpich.git/commit/a4aa06759 there is no longer an overflow in the finalize stack.

    However, I'm getting another failure. Have you seen this one?

    % ~/work/soft/mpich/bin/mpiexec -n 4 -l python test/runtests.py -v
    ...  /* many tests ... */
    [1] testPutProcNull (test_rma_nb.TestRMAWorld) ... [2] testPutProcNull (test_rma_nb.TestRMAWorld) ... [3] testPutProcNull (test_rma_nb.TestRMAWorld)[3]  ... [0]
    [0] testPutProcNull (test_rma_nb.TestRMAWorld)[0]  ... [0] ok[1] ok
    [2] ok
    [3] ok
    [2] testArgsOnlyAtRoot (test_spawn.TestSpawnSelf)[3] testArgsOnlyAtRoot (test_spawn.TestSpawnSelf)[0]
    [1] testArgsOnlyAtRoot (test_spawn.TestSpawnSelf) ... [3]  ... [0] testArgsOnlyAtRoot (test_spawn.TestSpawnSelf)[0]  ... [2]  ... [mpiexec@cobb] handle_pmi_cmd (/home/robl/work/mpich/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:52): Unrecognized PMI command:  | cleaning up processes
    [mpiexec@cobb] control_cb (/home/robl/work/mpich/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:280): unable to process PMI command
    [mpiexec@cobb] HYDT_dmxu_poll_wait_for_event (/home/robl/work/mpich/src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
    [mpiexec@cobb] HYD_pmci_wait_for_completion (/home/robl/work/mpich/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
    [mpiexec@cobb] main (/home/robl/work/mpich/src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error waiting for completion
    

    Hydra is segfaulting, which should not happen and suggests poor, if any, error checking on MPICH's part. But what is it about mpi4py that is feeding MPICH this unusual data?

  4. Lisandro Dalcin

    This is a test that tries to follow what the MPI standard says about spawning processes with arguments that should be relevant only at the root process. Perhaps these particular tests are actually bad code on my side. I guess I should write a C version of them for you to take a look at.
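
    In mpi4py the pattern looks roughly like this (a minimal sketch, not the literal test code; the child script name is made up):

    # Sketch of the "arguments significant only at the root" rule of
    # MPI_COMM_SPAWN.  'child.py' is hypothetical and is assumed to call
    # MPI.Comm.Get_parent().Disconnect() and exit.
    import sys
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    root = 0
    if comm.Get_rank() == root:
        child = comm.Spawn(sys.executable, args=['child.py'],
                           maxprocs=2, root=root)
    else:
        # Per the standard, these values are ignored at non-root ranks.
        child = comm.Spawn(None, args=None, maxprocs=-1, root=root)
    child.Disconnect()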

  5. Lisandro Dalcin

    Wrote a C version. Tested with MPICH 3.1.4 from Homebrew on a Mac. I cannot reproduce the error with either the C version or the mpi4py test suite.

  6. Lisandro Dalcin

    FYI, this error shows up from time to time on Linux, but I could never figure out what's going on. Maybe some kind of race condition in Hydra?

  7. Rob Latham reporter

    It's consistent for me... but only if I run the entire test suite. ~/work/soft/mpich/bin/mpiexec -n 4 python test/runtests.py -v --include spawn --include io and ~/work/soft/mpich/bin/mpiexec -n 4 python test/runtests.py -v --include spawn don't trigger it. Darn it.

  8. Lisandro Dalcin

    I can reproduce it on a Linux box with an old MPICH 3.0.4. Increase the number of processes as I did, and run just the specific test TestSpawnSelf.testArgsOnlyAtRoot as shown below. It smells like Hydra cannot cope with many spawns originating from COMM_SELF at nearly the same time. If I ask for TestSpawnWorld, I don't get the nasty errors (even with as many as 30 processes on an old 4-core desktop machine).

    mpiexec -n 10 /usr/bin/python /home/dalcinl/Devel/mpi4py-dev/test/runtests.py -q -i spawn TestSpawnSelf.testArgsOnlyAtRoot
    
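    In other words, the suspected trigger is roughly this pattern (sketch only; 'child.py' is a made-up child script assumed to disconnect from its parent and exit):

    import sys
    from mpi4py import MPI

    # Every rank spawns from its private MPI_COMM_SELF at nearly the same
    # time, so Hydra must service many simultaneous spawn requests.
    MPI.COMM_WORLD.Barrier()               # line the ranks up first
    child = MPI.COMM_SELF.Spawn(sys.executable, args=['child.py'],
                                maxprocs=1, root=0)
    child.Disconnect()
    MPI.COMM_WORLD.Barrier()
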
  9. Ken Raffenetti

    Just pushed a fix for the MPICH ticket Rob opened. I'd appreciate it if you could confirm the fix on your end.

    Thanks, Ken

  10. Lisandro Dalcin

    @raffenet I could not manage to make it break on my desktop, so I think you got it right :-) I should test it on my low-core-count home laptop, though.

  11. Rob Latham reporter

    On my low-core laptop:

    %  ~/work/soft/mpich/bin/mpiexec -n 4 -l python test/runtests.py -v
    ....  
    Ran 1110 tests in 41.802s
    

    Are there any remaining tests mpi4py disables for MPICH? The test/ directory has some 'if name == MPICH' blocks, but the few I looked at were checking for older versions.

  12. Lisandro Dalcin
    • test/test_pack.py: Datatype.{Pack|Unpack}_external() fails for MPI_DOUBLE and MPI_LONG. I think you guys never fully implemented the external32 format?

    • test/test_{dynproc|environ|spawn}.py: If the MPI_APPNUM attribute is not set on MPI_COMM_WORLD, the tests assume the Python process was not launched with mpiexec and some tests are disabled (see the sketch below). They used to fail in previous MPICH versions; I would need to check whether they still fail with 3.2. All of this is related to the singleton-init feature. Does Hydra support singleton init these days?
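
    The check those tests perform boils down to the following (a sketch, not the literal test code):

    from mpi4py import MPI

    # If MPI_APPNUM is unset on MPI_COMM_WORLD, assume singleton init
    # (no mpiexec) and skip the dynamic-process tests.
    appnum = MPI.COMM_WORLD.Get_attr(MPI.APPNUM)
    if appnum is None:
        print("MPI_APPNUM not set: assuming singleton init, skipping")
    else:
        print("launched by mpiexec, MPI_APPNUM =", appnum)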

  13. Lisandro Dalcin

    @roblatham BTW, why does Datatype.{Pack|Unpack}_external() fail for MPI_DOUBLE? Please apply the following one-line patch and try it yourself:

    $ git diff
    diff --git a/test/test_pack.py b/test/test_pack.py
    index ae00216..4fa8eb1 100644
    --- a/test/test_pack.py
    +++ b/test/test_pack.py
    @@ -104,7 +104,7 @@ class TestPackExternal(BaseTestPackExternal, unittest.TestCase):
     name, version = MPI.get_vendor()
     if name =='MPICH' or name == 'MPICH2' or name == 'DeinoMPI':
         BaseTestPackExternal.skipdtype += ['l']
    -    BaseTestPackExternal.skipdtype += ['d']
    +    #BaseTestPackExternal.skipdtype += ['d']
     elif name == 'Intel MPI':
         BaseTestPackExternal.skipdtype += ['l']
         BaseTestPackExternal.skipdtype += ['d']
    
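    With that skip removed, the 'd' case is essentially this round trip (a sketch, assuming the usual mpi4py Pack_external/Unpack_external signatures):

    from array import array
    from mpi4py import MPI

    # Pack a buffer of doubles into the portable "external32" representation
    # and unpack it again; this is what the test exercises for MPI_DOUBLE.
    src = array('d', [0.0, 1.5, 2.5, 3.5])
    size = MPI.DOUBLE.Pack_external_size('external32', len(src))
    packed = bytearray(size)

    MPI.DOUBLE.Pack_external('external32', src, packed, 0)
    dst = array('d', [0.0] * len(src))
    MPI.DOUBLE.Unpack_external('external32', packed, 0, dst)
    assert dst == src
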
  14. Rob Latham reporter

    http://trac.mpich.org/projects/mpich/ticket/1754 : I don't know how well we test MPI_LONG and external32; if the native size differs from the external32 size there could be problems, but it's been a few years since I looked closely.

    Excellent: MPICH reports "Conversion of types whose size is not the same as the size in external32 is not supported", which is better than a silent error.

    Double does seem to work today. I bet this recent commit fixed that (http://git.mpich.org/mpich.git/commit/62f750cca300118d22a4ba2d77968343b0acfd53)

    Hydra and singleton init is tracked at https://trac.mpich.org/projects/mpich/ticket/1074 (and me complaining about the error message: https://trac.mpich.org/projects/mpich/ticket/2175).
