test_rma.py failing tests with OpenMPI 4.0.4 + UCX
Some TestRMASelf tests in test_rma.py are failing for Debian builds with OpenMPI 4.0.4 built with UCX support.
Debian versions are
- python3-mpi4py 3.0.3-5
- libopenmpi3 4.0.4-2
- libucx0 1.8.1-2
The problem seems to have started with OpenMPI build 4.0.4-2, when UCX support was included. The error message doesn’t directly implicate UCX, however.
Test logs can be found at https://ci.debian.net/packages/m/mpi4py/unstable/amd64/
e.g. https://ci.debian.net/data/autopkgtest/unstable/amd64/m/mpi4py/6531458/log.gz
test/test_rma.py::TestRMASelf::testGetAccumulate
--------------------------------------------------------------------------
mpiexec noticed that process rank 2 with PID 0 on node ci-217-b50fcca2 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
The error can be reproduced manually from the command line, which also indicates an error in TestRMASelf::testStartComplete:
$ mpirun -n 2 python3 -m pytest test/test_rma.py -vv
============================= test session starts ==============================
platform linux -- Python 3.8.5, pytest-4.6.11, py-1.8.1, pluggy-0.13.0 -- /usr/bin/python3
...
test/test_rma.py::TestRMASelf::testPostWait PASSED [ 33%]
test/test_rma.py::TestRMASelf::testPutGet PASSED [ 36%]
test/test_rma.py::TestRMASelf::testPutProcNull PASSED [ 38%]
test/test_rma.py::TestRMASelf::testStartComplete FAILED [ 41%]
test/test_rma.py::TestRMASelf::testStartCompletePostTest PASSED [ 44%]
test/test_rma.py::TestRMASelf::testStartCompletePostWait PASSED [ 47%]
test/test_rma.py::TestRMASelf::testSync PASSED [ 50%]
test/test_rma.py::TestRMAWorld::testAccumulate
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node monte exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Comments (11)
-
reporter -
It could very well be that UCX does not support, or does not properly implement, (atomic?) operations on some datatypes. As usual, this is most likely not mpi4py’s fault, there is nothing I can do about it, and it should be reported upstream to the backend MPI implementors. Too bad that MPI implementations do not add mpi4py to their own testing chains; that way they would save everyone’s time by catching issues early, as their own testsuites are obviously not comprehensive enough.
For the time being, I guess you could just patch failing tests adding the following decorator line:
@unittest.skipMPI('openmpi(==4.0.4)')
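For context, a decorator like that matches the MPI vendor name and version reported at run time. A simplified sketch of how such a version predicate could be evaluated (this is illustrative, not mpi4py’s actual implementation; the function name is my own):

```python
import re

def matches_skip_spec(spec, vendor, version):
    """Evaluate a skipMPI-style predicate such as 'openmpi(==4.0.4)'.

    Returns True when the running MPI's vendor/version matches the spec,
    i.e. when the decorated test should be skipped.
    """
    m = re.fullmatch(r'(\w+)\(([=<>]+)([\d.]+)\)', spec)
    if m is None:
        return spec == vendor  # bare vendor name: skip on any version
    name, op, ver = m.groups()
    if name != vendor:
        return False
    actual = tuple(int(p) for p in version.split('.'))
    wanted = tuple(int(p) for p in ver.split('.'))
    return {
        '==': actual == wanted,
        '<':  actual <  wanted,
        '<=': actual <= wanted,
        '>':  actual >  wanted,
        '>=': actual >= wanted,
    }[op]

# The failing Debian build reports Open MPI 4.0.4, so the test is skipped:
print(matches_skip_spec('openmpi(==4.0.4)', 'openmpi', '4.0.4'))  # True
print(matches_skip_spec('openmpi(==4.0.4)', 'openmpi', '4.0.3'))  # False
```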
-
reporter Thanks Lisandro. I’ll apply that workaround and pass on your feedback to UCX. I wonder: if we ask, will Open MPI add mpi4py to their CI testing?
-
I asked a few times and got mixed responses; one of them was “we are not interested in mpi4py bugs”. I’m not interested in Open MPI bugs either, and yet I cannot escape being involved in them and people letting me know about them. At some point Jeff Squyres seemed interested; I guess he simply did not have time to do it.
-
reporter I’ll ask Alistair to ask them. Maybe they’ll give it more interest if the request comes from the Linux distributions.
-
reporter The tests which need to be skipped to avoid the segfault are
- testAccumulate
- testGetAccumulateProcNullReplace
- testAccumulateProcNullSum
- testCompareAndSwap
- testFetchAndOp
- testGetAccumulate
- testPutGet
Once they’re skipped, we get the following error from testStartComplete (for both TestRMASelf and TestRMAWorld):
________________________ TestRMASelf.testStartComplete _________________________

self = <test_rma.TestRMASelf testMethod=testStartComplete>

    @unittest.skipMPI('openmpi(==1.8.6)')
    def testStartComplete(self):
        self.WIN.Start(MPI.GROUP_EMPTY)
>       self.WIN.Complete()

test/test_rma.py:324:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   mpi4py.MPI.Exception: MPI_ERR_RMA_SYNC: error executing rma sync

mpi4py/MPI/Win.pyx:514: Exception
Should I just skip testStartComplete as well, or is this a different problem?
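Outside the testsuite, that failure can be boiled down to a one-window reproducer. A minimal sketch, assuming mpi4py is importable and an MPI runtime is available (the function name and the skip/failed reporting are my own additions, not part of the original report):

```python
# Minimal sketch of the failing Start()/Complete() sequence.
# Run e.g. with: mpirun -n 1 python3 repro.py
try:
    from mpi4py import MPI
except ImportError:
    MPI = None  # mpi4py not installed; nothing to reproduce

def start_complete_roundtrip():
    """Open and immediately close an access epoch with an empty group."""
    if MPI is None:
        return "skipped: mpi4py not installed"
    win = MPI.Win.Allocate(8, comm=MPI.COMM_SELF)
    try:
        win.Start(MPI.GROUP_EMPTY)  # begin access epoch targeting no one
        win.Complete()              # on the buggy stack this raises MPI_ERR_RMA_SYNC
    except MPI.Exception as exc:
        return "failed: %s" % exc
    finally:
        win.Free()
    return "ok"

if __name__ == "__main__":
    print(start_complete_roundtrip())
```

On a correct MPI implementation this prints “ok”; on the affected OpenMPI 4.0.4 + UCX build the Complete() call raises instead.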
-
I have no idea why that Start()/Complete() test fails! Do you want my honest advice? Just go on patching to disable all the tests that fail, and move on. That’s simply what I did elsewhere in the testsuite. If you have the time, let the Open MPI folks know that things are broken. -
reporter Thanks again, I’ll patch it out also.
-
reporter Oh god, with test_rma.py done, now I can see test_rma_nb.py is also affected. Patching in more skips.
edit: just testPutGet and testAccumulate for this one
-
reporter And there are errors from testAttachDetach in test_win.py (both TestWinCreateDynamicSelf and TestWinCreateDynamicWorld):
__________________ TestWinCreateDynamicSelf.testAttachDetach ___________________

self = <test_win.TestWinCreateDynamicSelf testMethod=testAttachDetach>

    @unittest.skipMPI('msmpi(<9.1.0)')
    def testAttachDetach(self):
        mem1 = MPI.Alloc_mem(8)
        mem2 = MPI.Alloc_mem(16)
        mem3 = MPI.Alloc_mem(32)
        for mem in (mem1, mem2, mem3):
            self.WIN.Attach(mem)
            self.testMemory()
>           self.WIN.Detach(mem)

test/test_win.py:202:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   mpi4py.MPI.Exception: MPI_ERR_UNKNOWN: unknown error
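For reference, the sequence that testAttachDetach exercises reduces to attaching and detaching buffers on a dynamically created window. A hedged standalone sketch, assuming mpi4py is available (the function name and skip/failed reporting are my own; the buffer sizes mirror the test):

```python
# Sketch of the Attach/Detach sequence that fails on the UCX build.
# Run e.g. with: mpirun -n 1 python3 repro_attach.py
try:
    from mpi4py import MPI
except ImportError:
    MPI = None  # mpi4py not installed; nothing to reproduce

def attach_detach_cycle():
    """Attach three buffers to a dynamic window, then detach each one."""
    if MPI is None:
        return "skipped: mpi4py not installed"
    win = MPI.Win.Create_dynamic(comm=MPI.COMM_SELF)
    mems = [MPI.Alloc_mem(n) for n in (8, 16, 32)]
    try:
        for mem in mems:
            win.Attach(mem)
            win.Detach(mem)  # the failing call in the report: MPI_ERR_UNKNOWN
    except MPI.Exception as exc:
        return "failed: %s" % exc
    finally:
        for mem in mems:
            MPI.Free_mem(mem)
        win.Free()
    return "ok"

if __name__ == "__main__":
    print(attach_detach_cycle())
```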
-
- changed status to resolved
The bug was reported on the Debian Bug System at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=965352 and discussed with UCX upstream at https://github.com/openucx/ucx/issues/5443 (there had been some explicit UCX bugs in other packages, which had evidently been fixed by libucx0 1.8.1-2).