test failure with openmpi-4.0.1
With current master at f2c5c4be0057a4a76af65cae0aa5cd2a4be620f1, tests fail under openmpi-4.0.1-1.fc31.x86_64 (Fedora package for openmpi). They pass fine under openmpi-3.1.4-1.fc31.x86_64, and pass fine under mpich.
$ python3 setup.py build && PYTHONPATH=build/lib.linux-x86_64-3.8/ mpiexec -np 1 python3 test/runtests.py -v --no-builddir --thread-level=serialized -e spawn
...
======================================================================
FAIL: testCompareAndSwap (test_rma.TestRMASelf)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test/test_rma.py", line 228, in testCompareAndSwap
self.assertEqual(rbuf[1], -1)
AssertionError: 0 != -1
======================================================================
FAIL: testFetchAndOp (test_rma.TestRMASelf)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test/test_rma.py", line 190, in testFetchAndOp
self.assertEqual(rbuf[1], -1)
AssertionError: 37 != -1
======================================================================
FAIL: testCompareAndSwap (test_rma.TestRMAWorld)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test/test_rma.py", line 228, in testCompareAndSwap
self.assertEqual(rbuf[1], -1)
AssertionError: 0 != -1
======================================================================
FAIL: testFetchAndOp (test_rma.TestRMAWorld)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test/test_rma.py", line 190, in testFetchAndOp
self.assertEqual(rbuf[1], -1)
AssertionError: 92 != -1
----------------------------------------------------------------------
Ran 1102 tests in 4.081s
FAILED (failures=4, skipped=60)
The failure is consistent in that those 4 tests always fail, but the numbers in the AssertionErrors vary from run to run, suggesting some uninitialized memory.
The Fedora bug is https://bugzilla.redhat.com/show_bug.cgi?id=1705301.
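The assertions above suggest the tests use a guard-element pattern: the result buffer has one extra slot pre-filled with -1, and the test checks that the MPI call did not write past the requested element. A minimal pure-Python sketch of that idiom, with no MPI involved (`fetch_and_op` here is a hypothetical stand-in for the one-sided MPI call, not the mpi4py API):

```python
# Sketch of the guard-element idiom behind the failing assertions.
# The real tests call MPI_Fetch_and_op / MPI_Compare_and_swap; this
# stand-in writes exactly one element, as the MPI call is supposed to.

def fetch_and_op(origin, result):
    # Must touch result[0] only; writing result[1] is the reported bug.
    result[0] = origin

rbuf = [-1, -1]          # one payload slot plus a -1 guard element
fetch_and_op(42, rbuf)

assert rbuf[0] == 42     # the operation's result landed in slot 0
assert rbuf[1] == -1     # guard untouched; the failing runs saw it change
```

When the guard changes, as in the tracebacks above, something wrote past the end of the one-element target buffer.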
Comments (8)
-
reporter
b fails, h fails; i, l, q don't (in testCompareAndSwap). When I call the tests as
mpiexec -np 1 python3 test/runtests.py -v --thread-level=serialized -i test_rma
they fail. When I call them as
mpiexec -np 1 python3 test/runtests.py -v --thread-level=serialized -i testCompareAndSwap
they don't fail at all. So it seems there's some interaction with another test.
-
How did you get the openmpi-4.0.1-1.fc31.x86_64 package? Are you using Fedora rawhide?
I cannot reproduce on Fedora 30 with Open MPI 4.0.1 built from sources.
Build Open MPI 4.0.1 from sources:
./configure --prefix=/home/devel/mpi/openmpi/4.0.1 --enable-debug --enable-mem-debug && make -j 16 && make install
Build and test mpi4py branch maint (released today as mpi4py-3.0.2):
$ which mpicc
/home/devel/mpi/openmpi/4.0.1/bin/mpicc
$ python3 setup.py build
...
$ ldd build/lib.linux-x86_64-3.7/mpi4py/MPI.cpython-37m-x86_64-linux-gnu.so | grep mpi
libmpi.so.40 => /home/devel/mpi/openmpi/4.0.1/lib/libmpi.so.40 (0x00007ff52353e000)
libopen-rte.so.40 => /home/devel/mpi/openmpi/4.0.1/lib/libopen-rte.so.40 (0x00007ff5230e3000)
libopen-pal.so.40 => /home/devel/mpi/openmpi/4.0.1/lib/libopen-pal.so.40 (0x00007ff522f84000)
$ mpiexec -n 1 python3 test/runtests.py --thread-level=serialized
[0@kw60439] Python 3.7 (/usr/bin/python3)
[0@kw60439] MPI 3.1 (Open MPI 4.0.1)
[0@kw60439] mpi4py 3.0.2 (build/lib.linux-x86_64-3.7/mpi4py)
...
----------------------------------------------------------------------
Ran 1142 tests in 6.561s
OK (skipped=46)
-
reporter
How did you get the openmpi-4.0.1-1.fc31.x86_64 package? Are you using Fedora rawhide?
Yes. This is in rawhide.
I tested this again, and openmpi has been updated in the meantime. With openmpi-4.0.1-5.fc31.x86_64 the tests pass without issue.
-
reporter - changed status to resolved
-
These errors are back now with openmpi-4.0.1-6. What I had done in -5 was to disable UCX support, due to it causing some segfaults on x86_64. That is supposedly fixed with UCX 1.5.2, so I have re-enabled UCX in -6. However, this does appear to be affecting mpi4py as well.
-
@Orion Poplawski I guess that the only thing you can do is to report the issue upstream to the Open MPI devs, then disable the failing tests with a decorator: @unittest.skipMPI('openmpi==4.0.1'). These failing MPI calls are rarely used.
@Jeff Squyres Is there any chance that the way Python loads the MPI libraries (RTLD_LOCAL) may interact badly with the UCX support? FYI, these tests are failing because the corresponding MPI calls are writing past the buffer end for datatypes MPI_SIGNED_CHAR and MPI_SHORT. Maybe that's the problem: the type size is too short (less than 4 bytes), and Open MPI (through UCX) is not handling things the right way?
-
I’m unfortunately not involved in UCX development, so I can’t say. I’m not clear on whether the issue is in Open MPI or UCX itself. To start the ball rolling, I’ve filed https://github.com/open-mpi/ompi/issues/6777 – this should definitely be investigated on our end (probably by Mellanox).
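As background for the RTLD_LOCAL question: by default CPython dlopen()s extension modules without RTLD_GLOBAL, so libmpi's symbols are not made globally visible to components that libmpi itself dlopen()s later (such as Open MPI's UCX plugin). The flags can be inspected and widened from Python; this is a generic stdlib sketch, not anything mpi4py exposes:

```python
import os
import sys

# CPython loads extension modules with RTLD_LOCAL semantics by default,
# i.e. RTLD_GLOBAL is not set in the interpreter's dlopen flags.
flags = sys.getdlopenflags()
print("RTLD_GLOBAL set by default:", bool(flags & os.RTLD_GLOBAL))

# A workaround sometimes used: request RTLD_GLOBAL before importing the
# MPI extension, so libmpi's symbols become visible to later dlopen()s.
sys.setdlopenflags(flags | os.RTLD_GLOBAL)
# import mpi4py.MPI  # would now load libmpi with global symbol visibility
sys.setdlopenflags(flags)  # restore the defaults
```

Whether this interacts with the UCX failures here is exactly the open question in the comment above.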
-
Can you manually edit the test and print the typecode being used at the point of failure? I cannot reproduce with my own debug build of Open MPI 4.0.1 using mpi4py/master on Fedora 30 (though IIRC I built my Open MPI with gcc 8 from Fedora 29).
PS: You should not set PYTHONPATH, and you should not pass the --no-builddir option; mpi4py always tests from the build directory by default.