unit tests fail with MPICH-master

Issue #19 closed
Rob Latham created an issue

Guess who just started playing with mpi4py today...

If I execute the unit tests, with a single process:

$ ~/work/soft/mpich/bin/mpiexec -np 1 python test/runtests.py -v
[...] omit many lines [...]
...
testGet (test_info.TestInfo) ... ok
testGetNKeys (test_info.TestInfo) ... ok
testGetSetDelete (test_info.TestInfo) ... ok
testPyMethods (test_info.TestInfo) ... ok
testTruth (test_info.TestInfo) ... ok
testPyMethods (test_info.TestInfoEnv) ... ok
testTruth (test_info.TestInfoEnv) ... ok
testPyMethods (test_info.TestInfoNull) ... ok
testTruth (test_info.TestInfoNull) ... ok
testIReadIWrite (test_io.TestIOSelf) ... overflow in finalize stack!
internal ABORT - process 0

If I run the I/O tests by themselves, things are OK:

$  ~/work/soft/mpich/bin/mpiexec -np 1 python test/runtests.py -v --include io
[... omit many lines ... ]
Ran 69 tests in 12.018s

Comments (18)

  1. Lisandro Dalcin

    @roblatham Sorry for the late answer. For some unknown reason, Bitbucket unsubscribed me from issue notifications from all my repos, so I didn't catch this one until now.

    This is what happens when a test suite exercises almost all of the MPI API in a single run. Some static array used to register finalizer callbacks is likely not large enough; you get that error only when enough finalizers are registered to exhaust it. Does that make sense? IIRC, I reported a similar (or the same?) issue to the devel list.
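
    A conceptual sketch of that failure mode, in plain Python rather than MPICH's C (the callback limit below is an assumption, not MPICH's actual constant):

    # Conceptual sketch only, not MPICH source: a fixed-capacity stack of
    # finalizer callbacks.  Registering one callback too many produces the
    # kind of "overflow in finalize stack!" abort shown above.
    MAX_FINALIZE_CALLBACKS = 256          # assumed compile-time limit

    _finalize_stack = []

    def register_finalizer(callback):
        if len(_finalize_stack) >= MAX_FINALIZE_CALLBACKS:
            raise RuntimeError("overflow in finalize stack!")
        _finalize_stack.append(callback)

    def run_finalizers():
        # Callbacks run in LIFO order at MPI_Finalize time.
        while _finalize_stack:
            _finalize_stack.pop()()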

  2. Rob Latham reporter

    I see that devel message. It sure seems related. Sorry you got no response. I'll open a ticket at least... that's not a part of the code I know a whole lot about.

  3. Rob Latham reporter

    OK, with MPICH commit http://git.mpich.org/mpich.git/commit/a4aa06759 there is no longer an overflow in the finalize stack.

    However, I'm getting another failure. Have you seen this one?

    % ~/work/soft/mpich/bin/mpiexec -n 4 -l python test/runtests.py -v
    ...  /* many tests ... */
    [1] testPutProcNull (test_rma_nb.TestRMAWorld) ... [2] testPutProcNull (test_rma_nb.TestRMAWorld) ... [3] testPutProcNull (test_rma_nb.TestRMAWorld)[3]  ... [0]
    [0] testPutProcNull (test_rma_nb.TestRMAWorld)[0]  ... [0] ok[1] ok
    [2] ok
    [3] ok
    [2] testArgsOnlyAtRoot (test_spawn.TestSpawnSelf)[3] testArgsOnlyAtRoot (test_spawn.TestSpawnSelf)[0]
    [1] testArgsOnlyAtRoot (test_spawn.TestSpawnSelf) ... [3]  ... [0] testArgsOnlyAtRoot (test_spawn.TestSpawnSelf)[0]  ... [2]  ... [mpiexec@cobb] handle_pmi_cmd (/home/robl/work/mpich/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:52): Unrecognized PMI command:  | cleaning up processes
    [mpiexec@cobb] control_cb (/home/robl/work/mpich/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:280): unable to process PMI command
    [mpiexec@cobb] HYDT_dmxu_poll_wait_for_event (/home/robl/work/mpich/src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
    [mpiexec@cobb] HYD_pmci_wait_for_completion (/home/robl/work/mpich/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
    [mpiexec@cobb] main (/home/robl/work/mpich/src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error waiting for completion
    

    Hydra is segfaulting, which should not happen and suggests poor, if any, error checking on MPICH's part. But what is it about mpi4py that is feeding MPICH this unusual data?

  4. Lisandro Dalcin

    This is a test that tries to follow what the MPI standard says about spawning processes with arguments that should be relevant only at the root process. Perhaps these particular tests are actually bad code on my side. I guess I should write a C version of them for you to take a look at.
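
    In mpi4py the pattern looks roughly like this (a minimal sketch, not the literal test code; the child script name is made up):

    # Sketch of the "arguments significant only at the root" rule of
    # MPI_COMM_SPAWN.  'child.py' is hypothetical and is assumed to call
    # MPI.Comm.Get_parent().Disconnect() and exit.
    import sys
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    root = 0
    if comm.Get_rank() == root:
        child = comm.Spawn(sys.executable, args=['child.py'],
                           maxprocs=2, root=root)
    else:
        # Per the standard, these values are ignored at non-root ranks.
        child = comm.Spawn(None, args=None, maxprocs=-1, root=root)
    child.Disconnect()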

  5. Lisandro Dalcin

    Wrote a C version. Tested with MPICH 3.1.4 from Homebrew on a Mac. I cannot reproduce the error with either the C version or the mpi4py test suite.

  6. Lisandro Dalcin

    FYI, this error shows up from time to time on Linux, but I could never figure out what's going on. Maybe some kind of race condition in Hydra?

  7. Rob Latham reporter

    It's consistent for me... but only if I run the entire test suite. ~/work/soft/mpich/bin/mpiexec -n 4 python test/runtests.py -v --include spawn --include io and ~/work/soft/mpich/bin/mpiexec -n 4 python test/runtests.py -v --include spawn don't trigger it. Darn it.

  8. Lisandro Dalcin

    I can reproduce it on a Linux box with an old MPICH 3.0.4. Increase the number of processes as I did, and run just the specific test TestSpawnSelf.testArgsOnlyAtRoot as shown below. It smells like Hydra cannot cope with many spawns originating from COMM_SELF at nearly the same time. If I ask for TestSpawnWorld, I don't get the nasty errors (even with as many as 30 processes on an old 4-core desktop machine).

    mpiexec -n 10 /usr/bin/python /home/dalcinl/Devel/mpi4py-dev/test/runtests.py -q -i spawn TestSpawnSelf.testArgsOnlyAtRoot
    
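    In other words, the suspected trigger is roughly this pattern (sketch only; 'child.py' is a made-up child script assumed to disconnect from its parent and exit):

    import sys
    from mpi4py import MPI

    # Every rank spawns from its private MPI_COMM_SELF at nearly the same
    # time, so Hydra must service many simultaneous spawn requests.
    MPI.COMM_WORLD.Barrier()               # line the ranks up first
    child = MPI.COMM_SELF.Spawn(sys.executable, args=['child.py'],
                                maxprocs=1, root=0)
    child.Disconnect()
    MPI.COMM_WORLD.Barrier()
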
  9. Ken Raffenetti

    Just pushed a fix for the MPICH ticket Rob opened. I'd appreciate it if you could confirm the fix on your end.

    Thanks, Ken

  10. Lisandro Dalcin

    @raffenet I could not manage to make it break on my desktop, so I think you got it right :-) I should test it on my low-core-count home laptop, though.

  11. Rob Latham reporter

    On my low-core laptop:

    %  ~/work/soft/mpich/bin/mpiexec -n 4 -l python test/runtests.py -v
    ....  
    Ran 1110 tests in 41.802s
    

    Are there any remaining tests mpi4py disables for MPICH? The test/ directory has some 'if name == MPICH' blocks, but the few I looked at were checking for older versions.

  12. Lisandro Dalcin
    • test/test_pack.py: Datatype.{Pack|Unpack}_external() fails for MPI_DOUBLE and MPI_LONG. I think you guys never fully implemented the external32 format?

    • test/test_{dynproc|environ|spawn}.py: If the MPI_APPNUM attribute is not set on MPI_COMM_WORLD, the tests assume the Python process was not launched with mpiexec and some tests are disabled (see the sketch below). They used to fail in previous MPICH versions; I would need to check whether they still fail with 3.2. All of this is related to the singleton-init feature. Does Hydra support singleton init these days?
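
    The check those tests perform boils down to the following (a sketch, not the literal test code):

    from mpi4py import MPI

    # If MPI_APPNUM is unset on MPI_COMM_WORLD, assume singleton init
    # (no mpiexec) and skip the dynamic-process tests.
    appnum = MPI.COMM_WORLD.Get_attr(MPI.APPNUM)
    if appnum is None:
        print("MPI_APPNUM not set: assuming singleton init, skipping")
    else:
        print("launched by mpiexec, MPI_APPNUM =", appnum)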

  13. Lisandro Dalcin

    @roblatham BTW, why does Datatype.{Pack|Unpack}_external() fail for MPI_DOUBLE? Please apply the following one-line patch and try it yourself:

    $ git diff
    diff --git a/test/test_pack.py b/test/test_pack.py
    index ae00216..4fa8eb1 100644
    --- a/test/test_pack.py
    +++ b/test/test_pack.py
    @@ -104,7 +104,7 @@ class TestPackExternal(BaseTestPackExternal, unittest.TestCase):
     name, version = MPI.get_vendor()
     if name =='MPICH' or name == 'MPICH2' or name == 'DeinoMPI':
         BaseTestPackExternal.skipdtype += ['l']
    -    BaseTestPackExternal.skipdtype += ['d']
    +    #BaseTestPackExternal.skipdtype += ['d']
     elif name == 'Intel MPI':
         BaseTestPackExternal.skipdtype += ['l']
         BaseTestPackExternal.skipdtype += ['d']
    
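    With that skip removed, the 'd' case is essentially this round trip (a sketch, assuming the usual mpi4py Pack_external/Unpack_external signatures):

    from array import array
    from mpi4py import MPI

    # Pack a buffer of doubles into the portable "external32" representation
    # and unpack it again; this is what the test exercises for MPI_DOUBLE.
    src = array('d', [0.0, 1.5, 2.5, 3.5])
    size = MPI.DOUBLE.Pack_external_size('external32', len(src))
    packed = bytearray(size)

    MPI.DOUBLE.Pack_external('external32', src, packed, 0)
    dst = array('d', [0.0] * len(src))
    MPI.DOUBLE.Unpack_external('external32', packed, 0, dst)
    assert dst == src
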
  14. Rob Latham reporter

    http://trac.mpich.org/projects/mpich/ticket/1754 : I don't know how well we test MPI_LONG and external32; if the native size differs from the external32 size there could be problems, but it's been a few years since I looked closely.

    Excellent: MPICH reports "Conversion of types whose size is not the same as the size in external32 is not supported", which is better than a silent error.

    Double does seem to work today. I bet this recent commit fixed that (http://git.mpich.org/mpich.git/commit/62f750cca300118d22a4ba2d77968343b0acfd53)

    Hydra and singleton init is tracked at https://trac.mpich.org/projects/mpich/ticket/1074 (and me complaining about the error message: https://trac.mpich.org/projects/mpich/ticket/2175).
