test failure with openmpi-4.0.1

Issue #124 resolved
Zbigniew Jędrzejewski-Szmek created an issue

With current master at f2c5c4be0057a4a76af65cae0aa5cd2a4be620f1, tests fail under openmpi-4.0.1-1.fc31.x86_64 (Fedora package for openmpi). They pass fine under openmpi-3.1.4-1.fc31.x86_64, and pass fine under mpich.

$ python3 setup.py build && PYTHONPATH=build/lib.linux-x86_64-3.8/ mpiexec -np 1 python3 test/runtests.py -v --no-builddir --thread-level=serialized -e spawn
...
======================================================================
FAIL: testCompareAndSwap (test_rma.TestRMASelf)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/test_rma.py", line 228, in testCompareAndSwap
    self.assertEqual(rbuf[1], -1)
AssertionError: 0 != -1

======================================================================
FAIL: testFetchAndOp (test_rma.TestRMASelf)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/test_rma.py", line 190, in testFetchAndOp
    self.assertEqual(rbuf[1], -1)
AssertionError: 37 != -1

======================================================================
FAIL: testCompareAndSwap (test_rma.TestRMAWorld)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/test_rma.py", line 228, in testCompareAndSwap
    self.assertEqual(rbuf[1], -1)
AssertionError: 0 != -1

======================================================================
FAIL: testFetchAndOp (test_rma.TestRMAWorld)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/test_rma.py", line 190, in testFetchAndOp
    self.assertEqual(rbuf[1], -1)
AssertionError: 92 != -1

----------------------------------------------------------------------
Ran 1102 tests in 4.081s

FAILED (failures=4, skipped=60)

The failure is consistent in that the same 4 tests always fail, but the numbers in the AssertionErrors vary, which suggests uninitialized memory.

The Fedora bug is https://bugzilla.redhat.com/show_bug.cgi?id=1705301.

Comments (8)

  1. Lisandro Dalcin

    Can you manually edit the test and print the typecode being used at the point of failure? I cannot reproduce with my own debug build of Open MPI 4.0.1 using mpi4py/master on Fedora 30 (though IIRC I built my Open MPI with gcc 8 from Fedora 29).

    PS: You should not set PYTHONPATH or pass the --no-builddir option; mpi4py always tests from the build directory by default.

  2. Zbigniew Jędrzejewski-Szmek reporter

    In testCompareAndSwap, typecodes b and h fail; i, l, and q don’t.

    That is when I call the tests as mpiexec -np 1 python3 test/runtests.py -v --thread-level=serialized -i test_rma. When I call them as mpiexec -np 1 python3 test/runtests.py -v --thread-level=serialized -i testCompareAndSwap, they don’t fail at all. So it seems there’s some interaction with another test.

  3. Lisandro Dalcin

    How did you get the openmpi-4.0.1-1.fc31.x86_64 package? Are you using Fedora rawhide?

    I cannot reproduce on Fedora 30 with Open MPI 4.0.1 built from sources.

    Build Open MPI 4.0.1 from sources:

    ./configure --prefix=/home/devel/mpi/openmpi/4.0.1 --enable-debug --enable-mem-debug && make -j 16 && make install
    

    Build and test mpi4py branch maint (released today as mpi4py-3.0.2):

    $ which mpicc
    /home/devel/mpi/openmpi/4.0.1/bin/mpicc
    
    $ python3 setup.py build
    ...
    
    $  ldd build/lib.linux-x86_64-3.7/mpi4py/MPI.cpython-37m-x86_64-linux-gnu.so | grep mpi
        libmpi.so.40 => /home/devel/mpi/openmpi/4.0.1/lib/libmpi.so.40 (0x00007ff52353e000)
        libopen-rte.so.40 => /home/devel/mpi/openmpi/4.0.1/lib/libopen-rte.so.40 (0x00007ff5230e3000)
        libopen-pal.so.40 => /home/devel/mpi/openmpi/4.0.1/lib/libopen-pal.so.40 (0x00007ff522f84000)
    
    $ mpiexec -n 1 python3 test/runtests.py --thread-level=serialized
    [0@kw60439] Python 3.7 (/usr/bin/python3)
    [0@kw60439] MPI 3.1 (Open MPI 4.0.1)
    [0@kw60439] mpi4py 3.0.2 (build/lib.linux-x86_64-3.7/mpi4py)
    ...
    ----------------------------------------------------------------------
    Ran 1142 tests in 6.561s
    
    OK (skipped=46)
    
  4. Zbigniew Jędrzejewski-Szmek reporter

    How did you get the openmpi-4.0.1-1.fc31.x86_64 package? Are you using Fedora rawhide?

    Yes. This is in rawhide.

    I tested this again, and openmpi has been updated in the meantime. With openmpi-4.0.1-5.fc31.x86_64 the tests pass without issue.

  5. Orion Poplawski

    These errors are back now with openmpi-4.0.1-6. What I had done with -5 was to disable UCX support due to it causing some segfaults on x86_64. This is supposedly fixed with UCX 1.5.2 so I have re-enabled it in openmpi -6. However, this does appear to be having some effect on mpi4py as well.

  6. Lisandro Dalcin

    @Orion Poplawski I guess that the only thing you can do is to report the issue upstream to the Open MPI devs, then disable the failing tests with a decorator @unittest.skipMPI('openmpi==4.0.1'). These failing MPI calls are rarely used.

    @Jeff Squyres Is there any chance that the way Python loads the MPI libraries (RTLD_LOCAL) may interact badly with the UCX support? FYI, these tests are failing because the corresponding MPI calls are writing past the buffer end for datatypes MPI_SIGNED_CHAR and MPI_SHORT. Maybe that’s the problem: the type size is too short (less than 4 bytes), and Open MPI (through UCX) is not handling things the right way?
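
The hypothesis in the comment above (a write that is at least 4 bytes wide, regardless of the element size) can be illustrated without MPI. The following is only a simulation sketch, not Open MPI or UCX code: the function name `simulate_wide_write` and the 4-byte minimum width are assumptions made for illustration. It fills a receive buffer with -1 guard values, as the failing assertions suggest the tests do, and then overwrites the first element with a write that is never narrower than 4 bytes.

```python
import sys
from array import array


def simulate_wide_write(typecode, value=0, min_width=4):
    """Simulate a result write that is at least `min_width` bytes wide.

    Hypothetical helper: it mimics the suspected bug only; it does not
    call MPI. Returns the guard-padded buffer after the write.
    """
    # Receive buffer of 8 elements, all set to the guard value -1.
    rbuf = array(typecode, [-1] * 8)
    # The write covers the element itself, but never fewer than min_width bytes.
    width = max(min_width, rbuf.itemsize)
    raw = value.to_bytes(width, sys.byteorder, signed=True)
    # Overwrite the first `width` bytes of the buffer in place.
    memoryview(rbuf).cast('B')[:width] = raw
    return rbuf


if __name__ == "__main__":
    for typecode in "bhilq":
        rbuf = simulate_wide_write(typecode)
        status = "guard intact" if rbuf[1] == -1 else "guard clobbered"
        print(typecode, rbuf.itemsize, status)
```

Under this assumption the guard at index 1 is clobbered for typecodes b and h (element size below 4 bytes) and survives for i, l, and q, which matches the pattern reported in comment 2. It does not explain the varying values (37, 92) seen in testFetchAndOp, so it should be read as a plausibility check only.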
