unit tests fail with MPICH-master
Guess who just started playing with mpi4py today...
If I execute the unit tests, with a single process:
$ ~/work/soft/mpich/bin/mpiexec -np 1 python test/runtests.py -v
[...] omit many lines [...]
...
testGet (test_info.TestInfo) ... ok
testGetNKeys (test_info.TestInfo) ... ok
testGetSetDelete (test_info.TestInfo) ... ok
testPyMethods (test_info.TestInfo) ... ok
testTruth (test_info.TestInfo) ... ok
testPyMethods (test_info.TestInfoEnv) ... ok
testTruth (test_info.TestInfoEnv) ... ok
testPyMethods (test_info.TestInfoNull) ... ok
testTruth (test_info.TestInfoNull) ... ok
testIReadIWrite (test_io.TestIOSelf) ... overflow in finalize stack!
internal ABORT - process 0
If I run the I/O tests by themselves, things are OK:
$ ~/work/soft/mpich/bin/mpiexec -np 1 python test/runtests.py -v --include io
[... omit many lines ... ]
Ran 69 tests in 12.018s
Comments (18)
-
reporter -
@roblatham Sorry for the late answer. For some unknown reason, Bitbucket unsubscribed me from issue notifications from all my repos, so I didn't catch this one until now.
This is what happens when a test suite exercises almost all of the MPI API in a single run. Some static array used to register finalizer callbacks is likely not large enough; you only get that error when enough finalizers are registered to exhaust it. Does that make sense? IIRC, I reported a similar (or the same?) issue to the devel list.
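The explanation above can be modeled with a toy fixed-capacity callback stack. This is purely an illustrative Python sketch (the class name, capacity, and error handling are invented for the demo; MPICH's real finalize stack is a static C array), but it shows how a long run that registers many cleanup callbacks can exhaust a fixed-size table and hit a hard "overflow in finalize stack!" abort:

```python
# Illustrative model only -- not MPICH source code. MPICH registers
# cleanup callbacks to run at MPI_Finalize in a fixed-size internal
# array; the capacity below is a made-up number chosen for the demo.
class FinalizeStack:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.callbacks = []

    def register(self, cb):
        if len(self.callbacks) >= self.capacity:
            # Mirrors the symptom in the report: a hard abort once the
            # static array is exhausted.
            raise RuntimeError("overflow in finalize stack!")
        self.callbacks.append(cb)

    def finalize(self):
        # Finalizer callbacks run in reverse (LIFO) order.
        while self.callbacks:
            self.callbacks.pop()()

stack = FinalizeStack(capacity=4)
for i in range(4):
    stack.register(lambda i=i: print("cleanup", i))

try:
    stack.register(lambda: None)   # fifth registration overflows
except RuntimeError as exc:
    print(exc)                     # -> overflow in finalize stack!

stack.finalize()                   # runs cleanup 3, 2, 1, 0
```

Running only the I/O tests stays under the limit; running the whole suite pushes past it, which matches the observed behavior.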
-
reporter I see that devel message; it sure seems related. Sorry you got no response. I'll open a ticket at least... that's not a part of the code I know a whole lot about.
-
reporter OK, with MPICH commit http://git.mpich.org/mpich.git/commit/a4aa06759 there is no longer an overflow in the finalize stack.
However, I'm getting another failure. Have you seen this one?
% ~/work/soft/mpich/bin/mpiexec -n 4 -l python test/runtests.py -v
[... many tests ...]
[1] testPutProcNull (test_rma_nb.TestRMAWorld) ...
[2] testPutProcNull (test_rma_nb.TestRMAWorld) ...
[3] testPutProcNull (test_rma_nb.TestRMAWorld)[3] ...
[0] [0] testPutProcNull (test_rma_nb.TestRMAWorld)[0] ... [0] ok[1] ok [2] ok [3] ok
[2] testArgsOnlyAtRoot (test_spawn.TestSpawnSelf)[3] testArgsOnlyAtRoot (test_spawn.TestSpawnSelf)[0]
[1] testArgsOnlyAtRoot (test_spawn.TestSpawnSelf) ... [3] ...
[0] testArgsOnlyAtRoot (test_spawn.TestSpawnSelf)[0] ... [2] ...
[mpiexec@cobb] handle_pmi_cmd (/home/robl/work/mpich/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:52): Unrecognized PMI command: | cleaning up processes
[mpiexec@cobb] control_cb (/home/robl/work/mpich/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:280): unable to process PMI command
[mpiexec@cobb] HYDT_dmxu_poll_wait_for_event (/home/robl/work/mpich/src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@cobb] HYD_pmci_wait_for_completion (/home/robl/work/mpich/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
[mpiexec@cobb] main (/home/robl/work/mpich/src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error waiting for completion
Hydra is segfaulting, which should never happen; it suggests poor (if any) error checking on MPICH's part. But what is it about mpi4py that is feeding MPICH this unusual data?
-
This is a test that tries to follow what the MPI standard says about spawning processes with arguments that should be relevant only at the root process. Perhaps these particular tests are actually bad code on my side. I guess I should write a C version of them for you to take a look at.
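For reference, the semantics the test exercises can be sketched without MPI at all. Per the MPI standard, the command, args, maxprocs, and info arguments of MPI_Comm_spawn are significant only at the root rank; all other ranks may pass arbitrary values, which the implementation must ignore. The helper below is a hypothetical sketch (the names and placeholder values are invented, not taken from test_spawn.py):

```python
# Sketch of what "args only at root" means for MPI_Comm_spawn. Per the
# MPI standard, command/args/maxprocs/info are significant only at the
# root rank of the spawning communicator; other ranks may pass anything.
# The dict layout and bogus placeholder values are illustrative only.
def spawn_arguments(rank, root=0):
    if rank == root:
        return {"command": "./child", "args": ["--from-root"], "maxprocs": 2}
    # Non-root ranks: deliberately meaningless values, which a correct
    # MPI implementation must ignore -- this is what the test checks.
    return {"command": None, "args": None, "maxprocs": -1}

# With mpi4py this would be driven roughly as (not executed here, since
# it needs mpiexec and a child executable):
#   from mpi4py import MPI
#   comm = MPI.COMM_WORLD
#   kw = spawn_arguments(comm.Get_rank(), root=0)
#   child = comm.Spawn(kw["command"], kw["args"], kw["maxprocs"], root=0)
print(spawn_arguments(0))
print(spawn_arguments(3))
```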
-
reporter I'm working on a C version now.
-
Wrote a C version. Tested with 3.1.4 from Homebrew on a Mac. Cannot reproduce the error with either the C version or the mpi4py test suite.
-
FYI, this error shows up from time to time on Linux, but I never could figure out what's going on. Maybe some kind of race condition in Hydra?
-
reporter It's consistent for me... but only if I run the entire test suite.
~/work/soft/mpich/bin/mpiexec -n 4 python test/runtests.py -v --include spawn --include io
and
~/work/soft/mpich/bin/mpiexec -n 4 python test/runtests.py -v --include spawn
don't trigger it. Darn it.
-
I can reproduce on a Linux box with an old MPICH 3.0.4. Increase the number of processes as I did, and just run the specific test TestSpawnSelf.testArgsOnlyAtRoot, as shown below. It smells like Hydra cannot cope with many spawns originating from COMM_SELF at nearly the same time. If I ask for TestSpawnWorld, I don't get the nasty errors (even with as many as 30 processes on an old 4-core desktop machine).
mpiexec -n 10 /usr/bin/python /home/dalcinl/Devel/mpi4py-dev/test/runtests.py -q -i spawn TestSpawnSelf.testArgsOnlyAtRoot
-
reporter Thanks. Can confirm that command causes recent MPICH to segfault, too. I've opened http://trac.mpich.org/projects/mpich/ticket/2282 (but it probably won't get fixed before the next release).
-
- changed status to closed
Closing as the issue is not related to mpi4py.
-
Just pushed a fix for the MPICH ticket Rob opened. Would appreciate it if you could confirm the fix on your end.
Thanks, Ken
-
@raffenet I could not manage to make it break on my desktop, so I think you got it right :-) I should test it on my low-core-count home laptop, though.
-
reporter On my low-core laptop:
% ~/work/soft/mpich/bin/mpiexec -n 4 -l python test/runtests.py -v
....
Ran 1110 tests in 41.802s
Are there any remaining tests mpi4py disables for MPICH? The test/ directory has some 'if name == MPICH' blocks, but the few I looked at were checking for older versions.
-
- test/test_pack.py: Datatype.{Pack|Unpack}_external() fails for MPI_DOUBLE and MPI_LONG. I think you guys never fully implemented the external32 format?
- test/test_{dynproc|environ|spawn}.py: If the MPI_APPNUM attribute is not set in MPI_COMM_WORLD, the tests assume the Python process was not launched with mpiexec, and some tests are disabled. They used to fail in previous MPICH versions; I would need to check whether they still fail with 3.2. All this is related to the singleton-init feature. Does Hydra support singleton-init these days?
-
-
@roblatham BTW, why does Datatype.{Pack|Unpack}_external() fail for MPI_DOUBLE? Please apply the following one-line patch and try yourself:
$ git diff
diff --git a/test/test_pack.py b/test/test_pack.py
index ae00216..4fa8eb1 100644
--- a/test/test_pack.py
+++ b/test/test_pack.py
@@ -104,7 +104,7 @@ class TestPackExternal(BaseTestPackExternal, unittest.TestCase):
     name, version = MPI.get_vendor()
     if name == 'MPICH' or name == 'MPICH2' or name == 'DeinoMPI':
         BaseTestPackExternal.skipdtype += ['l']
-        BaseTestPackExternal.skipdtype += ['d']
+        #BaseTestPackExternal.skipdtype += ['d']
     elif name == 'Intel MPI':
         BaseTestPackExternal.skipdtype += ['l']
         BaseTestPackExternal.skipdtype += ['d']
-
reporter http://trac.mpich.org/projects/mpich/ticket/1754 : I don't know how well we test MPI_LONG and external32: if the native size differs from external32 there could be problems, but it's been a few years since I looked closely.
Excellent: MPICH reports "Conversion of types whose size is not the same as the size in external32 is not supported", which is better than a silent error.
Double does seem to work today. I bet this recent commit fixed that (http://git.mpich.org/mpich.git/commit/62f750cca300118d22a4ba2d77968343b0acfd53)
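To make the external32 discussion concrete: external32 is MPI's portable data representation, defined as big-endian with fixed per-type sizes (8 bytes for MPI_DOUBLE, 4 bytes for MPI_LONG). The sketch below mimics the conversion with Python's struct and ctypes modules; it is an illustration of the format, not mpi4py or MPICH code. It also shows why MPI_LONG is the hard case on LP64 systems: the native size differs from the external32 size, which is exactly the conversion MPICH declines to do.

```python
import struct
import ctypes

# external32 says MPI_DOUBLE is an 8-byte big-endian IEEE 754 value.
def pack_external32_double(x):
    return struct.pack(">d", x)

buf = pack_external32_double(1.5)
print(len(buf), buf.hex())   # 8 bytes; only byte order may need swapping

# MPI_LONG is the troublesome one: external32 fixes it at 4 bytes, but
# the native C long is 8 bytes on LP64 Linux/macOS.
native_long = ctypes.sizeof(ctypes.c_long)
external32_long = 4
if native_long != external32_long:
    # This size mismatch is what MPICH refuses to convert, reporting
    # "Conversion of types whose size is not the same as the size in
    # external32 is not supported".
    print("MPI_LONG needs a size conversion, not just a byte swap")
```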
hydra and singleton init is https://trac.mpich.org/projects/mpich/ticket/1074 (and me complaining about the error message: https://trac.mpich.org/projects/mpich/ticket/2175 )