test_rma.py failing tests with OpenMPI 4.0.4 + UCX

Issue #171 resolved
Drew Parsons created an issue

Some TestRMASelf tests in test_rma.py are failing for Debian builds with OpenMPI 4.0.4 built with UCX support.

The relevant Debian package versions are:

  • python3-mpi4py 3.0.3-5
  • libopenmpi3 4.0.4-2
  • libucx0 1.8.1-2

The problem seems to have started with the openmpi 4.0.4-2 build, when UCX support was included. The error message does not directly implicate UCX, however.

Test logs can be found at https://ci.debian.net/packages/m/mpi4py/unstable/amd64/
e.g. https://ci.debian.net/data/autopkgtest/unstable/amd64/m/mpi4py/6531458/log.gz

test/test_rma.py::TestRMASelf::testGetAccumulate --------------------------------------------------------------------------
mpiexec noticed that process rank 2 with PID 0 on node ci-217-b50fcca2 exited on signal 11 (Segmentation fault).

The error can be reproduced manually from the command line, which also reveals a failure in TestRMASelf::testStartComplete:

$ mpirun -n 2 python3 -m pytest test/test_rma.py -vv
============================= test session starts ==============================
platform linux -- Python 3.8.5, pytest-4.6.11, py-1.8.1, pluggy-0.13.0 -- /usr/bin/python3
...
test/test_rma.py::TestRMASelf::testPostWait PASSED                       [ 33%]
test/test_rma.py::TestRMASelf::testPutGet PASSED                         [ 36%]
test/test_rma.py::TestRMASelf::testPutProcNull PASSED                    [ 38%]
test/test_rma.py::TestRMASelf::testStartComplete FAILED                  [ 41%]
test/test_rma.py::TestRMASelf::testStartCompletePostTest PASSED          [ 44%]
test/test_rma.py::TestRMASelf::testStartCompletePostWait PASSED          [ 47%]
test/test_rma.py::TestRMASelf::testSync PASSED                           [ 50%]
test/test_rma.py::TestRMAWorld::testAccumulate --------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node monte exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
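
For reference, the Get_accumulate path can also be exercised outside the test suite with a small standalone script. The following is only a sketch (it assumes numpy and is not the mpi4py test code itself), and it may or may not trigger the segfault in a given environment:

# getacc_repro.py -- minimal sketch exercising MPI_Get_accumulate via mpi4py;
# run with e.g. "mpirun -n 2 python3 getacc_repro.py"
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_SELF              # TestRMASelf creates its window over COMM_SELF
obuf = np.arange(4, dtype='i')    # origin buffer
rbuf = np.zeros(4, dtype='i')     # result buffer (window memory is left uninitialised here)

win = MPI.Win.Allocate(obuf.nbytes, comm=comm)
win.Lock(0)                       # exclusive lock on the local target
win.Get_accumulate([obuf, MPI.INT], [rbuf, MPI.INT], 0, op=MPI.SUM)
win.Unlock(0)
win.Free()
print("Get_accumulate completed without crashing")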

Comments (11)

  1. Lisandro Dalcin

    It could very well be that UCX does not support, or does not properly implement, (atomic?) operations on some datatypes. As usual, this is most likely not mpi4py’s fault; there is nothing I can do about it, and it should be reported upstream to the backend MPI implementors. Too bad that MPI implementations do not add mpi4py to their own testing chains; that way they would save everyone’s time by catching issues early, as their own test suites are obviously not comprehensive enough.

    For the time being, I guess you could just patch failing tests adding the following decorator line:

    @unittest.skipMPI('openmpi(==4.0.4)')
    

  2. Drew Parsons reporter

    Thanks Lisandro. I’ll apply that workaround and pass on your feedback to UCX. I wonder, if we ask, will OpenMPI add mpi4py to their CI testing?

  3. Lisandro Dalcin

    I asked a few times and got mixed responses; one of them was “we are not interested in mpi4py bugs”. I’m not interested in Open MPI bugs either, and yet I cannot escape being involved in them and having people let me know about them 😅. At some point, Jeff Squyres seemed interested; I guess he simply did not have time to do it.

  4. Drew Parsons reporter

    I’ll ask Alistair to ask them. Maybe they’ll take more interest if the request comes from the Linux distributions.

  5. Drew Parsons reporter

    The tests which need to be skipped to avoid the segfault are

    • testAccumulate
    • testGetAccumulateProcNullReplace
    • testAccumulateProcNullSum
    • testCompareAndSwap
    • testFetchAndOp
    • testGetAccumulate
    • testPutGet
    Once they’re skipped, we get the following error for testStartComplete (for both TestRMASelf and TestRMAWorld):

    ________________________ TestRMASelf.testStartComplete _________________________
    
    self = <test_rma.TestRMASelf testMethod=testStartComplete>
    
        @unittest.skipMPI('openmpi(==1.8.6)')
        def testStartComplete(self):
            self.WIN.Start(MPI.GROUP_EMPTY)
    >       self.WIN.Complete()
    
    test/test_rma.py:324: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    >   ???
    E   mpi4py.MPI.Exception: MPI_ERR_RMA_SYNC: error executing rma sync
    
    mpi4py/MPI/Win.pyx:514: Exception
    
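    Outside pytest, that failure seems to reduce to just this pair of calls (a standalone sketch, not the actual test code, which uses its own window setup):

    from mpi4py import MPI

    win = MPI.Win.Allocate(16, comm=MPI.COMM_SELF)
    win.Start(MPI.GROUP_EMPTY)   # open an access epoch to an empty group
    win.Complete()               # fails here (MPI_ERR_RMA_SYNC) on the affected builds
    win.Free()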

    Should I just skip testStartComplete as well, or is this a different problem?

  6. Lisandro Dalcin

    I have no idea why that Start()/Complete() test fails! Do you want my honest advice? Just go on patching to disable all the tests that fail, and move on. That’s simply what I did elsewhere in the test suite. If you have the time, let the Open MPI folks know that things are broken.

  7. Drew Parsons reporter

    Oh god, with test_rma.py done, I can now see that test_rma_nb.py is also affected. Patching in more skips.

    edit: just testPutGet and testAccumulate for this one
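
    For reference, the non-blocking variants exercise the request-based RMA calls (Rput/Rget/Raccumulate). A rough standalone sketch of that path, again assuming numpy and not taken from the actual tests:

    from mpi4py import MPI
    import numpy as np

    sbuf = np.arange(4, dtype='i')
    rbuf = np.zeros(4, dtype='i')
    win = MPI.Win.Allocate(sbuf.nbytes, comm=MPI.COMM_SELF)

    win.Lock(0)
    win.Rput([sbuf, MPI.INT], 0).Wait()                     # request-based put
    win.Rget([rbuf, MPI.INT], 0).Wait()                     # request-based get
    win.Raccumulate([sbuf, MPI.INT], 0, op=MPI.SUM).Wait()  # request-based accumulate
    win.Unlock(0)
    win.Free()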

  8. Drew Parsons reporter

    There are also errors from testAttachDetach in test_win.py (TestWinCreateDynamicSelf and TestWinCreateDynamicWorld):

    __________________ TestWinCreateDynamicSelf.testAttachDetach ___________________
    
    self = <test_win.TestWinCreateDynamicSelf testMethod=testAttachDetach>
    
        @unittest.skipMPI('msmpi(<9.1.0)')
        def testAttachDetach(self):
            mem1 = MPI.Alloc_mem(8)
            mem2 = MPI.Alloc_mem(16)
            mem3 = MPI.Alloc_mem(32)
            for mem in (mem1, mem2, mem3):
                self.WIN.Attach(mem)
                self.testMemory()
    >           self.WIN.Detach(mem)
    
    test/test_win.py:202: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    >   ???
    E   mpi4py.MPI.Exception: MPI_ERR_UNKNOWN: unknown error
    
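    A rough standalone sketch of the dynamic-window attach/detach cycle that test exercises (not the actual test code; run under mpirun):

    from mpi4py import MPI

    win = MPI.Win.Create_dynamic(comm=MPI.COMM_SELF)
    mem = MPI.Alloc_mem(8)
    win.Attach(mem)
    win.Detach(mem)    # fails here (MPI_ERR_UNKNOWN) on the affected builds
    MPI.Free_mem(mem)
    win.Free()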
