Performance issue with InfiniBand using Open MPI 4.0.5
I’d like to use a new InfiniBand network with mpi4py. I wrote a simple benchmark and the results are bad, i.e. they are the same with and without the MPI library built to use the InfiniBand network.
With the lib (Open MPI v4.0.5) that should use InfiniBand:
Benchmark MPI with mpi4py
Open MPI v4.0.5, package: Open MPI krialforzh@cl2n001 Distribution, ident: 4.0.5, repo rev: v4.0.5, Aug 26, 2020
MPI Version (3, 1)
6.004e-04 s for 102400 floats (10.915 Gb/s)
1.235e-03 s for 204800 floats (10.611 Gb/s)
2.391e-03 s for 409600 floats (10.963 Gb/s)
4.614e-03 s for 819200 floats (11.363 Gb/s)
1.034e-02 s for 1638400 floats (10.141 Gb/s)
2.050e-02 s for 3276800 floats (10.229 Gb/s)
4.198e-02 s for 6553600 floats (9.991 Gb/s)
8.403e-02 s for 13107200 floats (9.983 Gb/s)
1.691e-01 s for 26214400 floats (9.919 Gb/s)
3.390e-01 s for 52428800 floats (9.899 Gb/s)
6.779e-01 s for 104857600 floats (9.900 Gb/s)
With the lib (Open MPI v3.1.3) that uses the standard network:
Benchmark MPI with mpi4py
Open MPI v3.1.3, package: Debian OpenMPI, ident: 3.1.3, repo rev: v3.1.3, Oct 29, 2018
MPI Version (3, 1)
6.474e-04 s for 102400 floats (10.123 Gb/s)
1.219e-03 s for 204800 floats (10.756 Gb/s)
2.347e-03 s for 409600 floats (11.168 Gb/s)
4.685e-03 s for 819200 floats (11.191 Gb/s)
1.161e-02 s for 1638400 floats (9.035 Gb/s)
2.057e-02 s for 3276800 floats (10.193 Gb/s)
4.212e-02 s for 6553600 floats (9.958 Gb/s)
8.424e-02 s for 13107200 floats (9.958 Gb/s)
1.694e-01 s for 26214400 floats (9.904 Gb/s)
3.382e-01 s for 52428800 floats (9.921 Gb/s)
6.771e-01 s for 104857600 floats (9.912 Gb/s)
I wrote the same benchmark in C++ and the results are correct: it’s much faster with Open MPI v4.0.5 (and InfiniBand):
Open MPI v4.0.5, package: Open MPI krialforzh@cl2n001 Distribution, ident: 4.0.5, repo rev: v4.0.5, Aug 26, 2020
9.156e-05 s for 102400 floats (71.578 Gb/s)
1.537e-04 s for 204800 floats (85.286 Gb/s)
2.929e-04 s for 409600 floats (89.503 Gb/s)
5.775e-04 s for 819200 floats (90.782 Gb/s)
1.155e-03 s for 1638400 floats (90.821 Gb/s)
2.313e-03 s for 3276800 floats (90.679 Gb/s)
4.629e-03 s for 6553600 floats (90.609 Gb/s)
9.276e-03 s for 13107200 floats (90.431 Gb/s)
1.865e-02 s for 26214400 floats (89.957 Gb/s)
3.754e-02 s for 52428800 floats (89.378 Gb/s)
7.579e-02 s for 104857600 floats (88.546 Gb/s)
The code of the benchmark is here: https://foss.heptapod.net/fluiddyn/fluidsim/-/merge_requests/231
I tried with the latest Git version of mpi4py and got the same results.
Do you have any suggestions to be able to use the InfiniBand network from Python?
Comments (8)
-
Next time, please submit your issue on GitHub; we are slowly migrating mpi4py there.
Please try adding the following two lines (at the very beginning) to your Python code:
import mpi4py
mpi4py.rc.threads = False
Alternatively, if you are using the master branch from git, you can set the following environment variable before running your Python code:
export MPI4PY_RC_THREADS=0
-
reporter Thanks! It works fine with mpi4py.rc.threads = False.
I guess you could add a few words as an issue template on Bitbucket to tell people who want to create an issue about the migration. Moreover, the link in the documentation still points to https://bitbucket.org/mpi4py/mpi4py
I didn’t find anything on mpi4py.rc in the documentation. Did I miss that part?
-
About adding an issue template: yes, I know, I just didn’t have time to do it.
The link to the docs will be fixed with the next release; I’m just waiting to implement support for DLPack.
Like many other aspects of mpi4py, mpi4py.rc is not documented. There is a GitHub issue open to document mpi4py.rc and the MPI4PY_RC_* environment variables. Open source is great until no one wants to scratch the boring itches (source).
People should not have to care about mpi4py.rc; it is for special cases. What’s going on here is that your MPI implementation does not use InfiniBand if you initialize MPI with MPI_THREAD_MULTIPLE support. Update your C++ code to use MPI_Init_thread() and ask for MPI_THREAD_MULTIPLE, and you should see the exact same problem. You should ask the Open MPI folks for clarification about what’s going on.
-
reporter “Update your C++ code to use MPI_Init_thread() and ask for MPI_THREAD_MULTIPLE, and you should see the exact same problem.”
Yes, I tried that and I confirm. I wonder if it is common behavior or just a very particular problem of our setup.
Someone told me that MPI_THREAD_MULTIPLE was known to give less efficient results in cases where one doesn’t need it. If that is true, then mpi4py.rc can be important for some mpi4py users. There are many programs that do not rely on MPI_THREAD_MULTIPLE.
-
Of course it is important, that’s the reason it is there! It is not properly documented, but it is there.
The default has to be MPI_THREAD_MULTIPLE: it is the most general thread-support level, and it should work correctly in all scenarios for all users with various levels of expertise, and across third-party modules using mpi4py. I did not make this choice without good advice. This is a reply from William Gropp to my request for advice (September 2008):
I recommend that you initialize with MPI_Init_thread and MPI_THREAD_MULTIPLE. There is some overhead, but it is mainly an added latency and is thus most important for short messages. You can give users that want to optimize the option to select a lower level of thread support. At 5k entries, on a cluster, the added latency should not be too serious.
-
reporter Thank you for your advice, your interesting replies and the time you spend on mpi4py. I’m closing this issue since there is nothing to be done.
-
reporter - changed status to closed
mpi4py.rc is the solution to tweak the MPI initialization and get good performance.
One clarification: it’s not exactly an InfiniBand network but actually an Omni-Path network.
I don’t understand how it is possible to get such large differences between mpi4py and C++ calls using the same MPI library under the hood. It seems that mpi4py uses the right library but the wrong hardware.