Performance issue with InfiniBand using Open MPI 4.0.5

Issue #180 closed
Pierre Augier created an issue

I’d like to use a new InfiniBand network with mpi4py. I wrote a simple benchmark, and the results are bad, i.e. they are the same with and without the MPI library built to use the InfiniBand network.

With the library (Open MPI v4.0.5) that should use InfiniBand:

Benchmark MPI with mpi4py
Open MPI v4.0.5, package: Open MPI krialforzh@cl2n001 Distribution, ident: 4.0.5, repo rev: v4.0.5, Aug 26, 2020
MPI Version (3, 1)
6.004e-04 s for    102400 floats (10.915 Gb/s)
1.235e-03 s for    204800 floats (10.611 Gb/s)
2.391e-03 s for    409600 floats (10.963 Gb/s)
4.614e-03 s for    819200 floats (11.363 Gb/s)
1.034e-02 s for   1638400 floats (10.141 Gb/s)
2.050e-02 s for   3276800 floats (10.229 Gb/s)
4.198e-02 s for   6553600 floats (9.991 Gb/s)
8.403e-02 s for  13107200 floats (9.983 Gb/s)
1.691e-01 s for  26214400 floats (9.919 Gb/s)
3.390e-01 s for  52428800 floats (9.899 Gb/s)
6.779e-01 s for 104857600 floats (9.900 Gb/s)

With the library (Open MPI v3.1.3) that uses the standard network:

Benchmark MPI with mpi4py
Open MPI v3.1.3, package: Debian OpenMPI, ident: 3.1.3, repo rev: v3.1.3, Oct 29, 2018
MPI Version (3, 1)
6.474e-04 s for    102400 floats (10.123 Gb/s)
1.219e-03 s for    204800 floats (10.756 Gb/s)
2.347e-03 s for    409600 floats (11.168 Gb/s)
4.685e-03 s for    819200 floats (11.191 Gb/s)
1.161e-02 s for   1638400 floats (9.035 Gb/s)
2.057e-02 s for   3276800 floats (10.193 Gb/s)
4.212e-02 s for   6553600 floats (9.958 Gb/s)
8.424e-02 s for  13107200 floats (9.958 Gb/s)
1.694e-01 s for  26214400 floats (9.904 Gb/s)
3.382e-01 s for  52428800 floats (9.921 Gb/s)
6.771e-01 s for 104857600 floats (9.912 Gb/s)

I wrote the same benchmark in C++ and the results are as expected: it is much faster with Open MPI v4.0.5 (and InfiniBand):

Open MPI v4.0.5, package: Open MPI krialforzh@cl2n001 Distribution, ident: 4.0.5, repo rev: v4.0.5, Aug 26, 2020
9.156e-05 s for     102400 floats (71.578 Gb/s)
1.537e-04 s for     204800 floats (85.286 Gb/s)
2.929e-04 s for     409600 floats (89.503 Gb/s)
5.775e-04 s for     819200 floats (90.782 Gb/s)
1.155e-03 s for    1638400 floats (90.821 Gb/s)
2.313e-03 s for    3276800 floats (90.679 Gb/s)
4.629e-03 s for    6553600 floats (90.609 Gb/s)
9.276e-03 s for   13107200 floats (90.431 Gb/s)
1.865e-02 s for   26214400 floats (89.957 Gb/s)
3.754e-02 s for   52428800 floats (89.378 Gb/s)
7.579e-02 s for  104857600 floats (88.546 Gb/s)

The code of the benchmark is here https://foss.heptapod.net/fluiddyn/fluidsim/-/merge_requests/231
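
For reference, the core of such a point-to-point bandwidth benchmark can be sketched in a few lines of Python (only a sketch, assuming two ranks and NumPy float64 arrays; the actual code is in the merge request linked above):

import numpy as np
from time import perf_counter
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

for n in (102400 * 2**i for i in range(11)):
    data = np.ones(n, dtype=np.float64)
    comm.Barrier()
    t0 = perf_counter()
    if rank == 0:
        comm.Send(data, dest=1)   # buffer interface, no pickling overhead
    elif rank == 1:
        comm.Recv(data, source=0)
    elapsed = perf_counter() - t0
    if rank == 0:
        # 8 bytes per float64, reported in Gb/s as in the logs above
        print(f"{elapsed:.3e} s for {n:>9d} floats "
              f"({8 * 8 * n / elapsed / 1e9:.3f} Gb/s)")

It can be run with something like mpirun -np 2 python bench.py (the file name is arbitrary).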

I tried with the latest Git version of mpi4py and got the same results.

Do you have any suggestions for how to use the InfiniBand network from Python?

Comments (8)

  1. Pierre Augier reporter

    A clarification: it’s not exactly an InfiniBand network but actually an Omni-Path network.

    I don’t understand how such large differences between mpi4py and C++ are possible when the same MPI library is used under the hood. It seems that mpi4py uses the right library but the wrong hardware.

  2. Lisandro Dalcin

    Next time, please submit your issue on GitHub; we are slowly migrating mpi4py there.

    Please try adding the following two lines (at the very beginning) to your Python code:

    import mpi4py
    mpi4py.rc.threads = False
    

    Alternatively, if you are using the master branch from git, you can set the following environment variable before running your Python code:

    export MPI4PY_RC_THREADS=0
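
    For concreteness, a minimal sketch of how the first variant fits into a script (assuming nothing beyond a trivial print on each rank): the mpi4py.rc attributes must be set before the first from mpi4py import MPI, because MPI is initialized when that import is executed.

    import mpi4py
    mpi4py.rc.threads = False  # must be set before importing the MPI submodule

    from mpi4py import MPI  # MPI is initialized here, honoring the rc settings above
    print("rank", MPI.COMM_WORLD.Get_rank(), "of", MPI.COMM_WORLD.Get_size())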
    

  3. Pierre Augier reporter

    Thanks! It works fine with mpi4py.rc.threads = False.

    I guess you could add an issue template on Bitbucket to tell people who want to create an issue about the migration. Moreover, the link in the documentation still points to https://bitbucket.org/mpi4py/mpi4py

    I didn’t find anything about mpi4py.rc in the documentation. Did I miss that part?

  4. Lisandro Dalcin

    About adding an issue template: yes, I know, I just haven’t had time to do it.

    The link in the docs will be fixed with the next release; I’m just waiting to implement support for DLPack.

    Like many other aspects of mpi4py, mpi4py.rc is not documented. There is an open GitHub issue to document mpi4py.rc and the MPI4PY_RC_* environment variables. Open source is great until no one wants to scratch the boring itches (source).

    People should not have to care about mpi4py.rc; it is for special cases. What’s going on here is that your MPI implementation does not use InfiniBand if you initialize MPI with MPI_THREAD_MULTIPLE support. Update your C++ code to use MPI_Init_thread() and ask for MPI_THREAD_MULTIPLE, and you should see the exact same problem. You should ask the Open MPI folks for clarification about what’s going on.
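
    As a quick check of what mpi4py requests by default, one can query the granted thread support level after import (a small sketch using only standard mpi4py calls):

    # With mpi4py's default rc settings, importing MPI calls MPI_Init_thread()
    # and requests MPI_THREAD_MULTIPLE.
    from mpi4py import MPI

    if MPI.COMM_WORLD.Get_rank() == 0:
        # Query_thread() returns the thread support level actually granted
        print("granted thread level:", MPI.Query_thread(),
              "| THREAD_MULTIPLE =", MPI.THREAD_MULTIPLE)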

  5. Pierre Augier reporter

    Update your C++ code to use MPI_Init_thread() and ask for MPI_THREAD_MULTIPLE, and you should see the exact same problem.

    Yes, I tried that and I confirm it. I wonder whether this is common behavior or just a peculiarity of our setup.

    Someone told me that MPI_THREAD_MULTIPLE is known to be less efficient in cases where one doesn’t need it. If that is true, then mpi4py.rc can be important for some mpi4py users; there are many programs that do not rely on MPI_THREAD_MULTIPLE.

  6. Lisandro Dalcin

    Of course it is important; that’s the reason it is there! It is not properly documented, but it is there.

    The default has to be MPI_THREAD_MULTIPLE: it is the most general thread-support level, and it should work correctly in all scenarios, for users with various levels of expertise, and across third-party modules using mpi4py. I did not make this choice without good advice. This is a reply from William Gropp to my request for advice (September 2008):

    I recommend that you initialize with MPI_Init_thread and MPI_THREAD_MULTIPLE. There is some overhead, but it is mainly an added latency and is thus most important for short messages. You can give users that want to optimize the option to select a lower level of thread support. At 5k entries, on a cluster, the added latency should not be too serious.

  7. Pierre Augier reporter

    Thank you for your advice, your interesting replies and the time you spend on mpi4py. I’m closing this issue since there is nothing to be done.
