serialization of ibv_poll_cq()

Merged
#468 · Created  · Last updated

Merged pull request

Merge pull request #468 into develop

76cde8a·Author: ·Closed by: ·2021-08-11

Description

Summary

This pull request addresses sometimes-significant degradation in ibv-conduit performance which has been observed at higher thread counts, and was initial reported by Elliot R.

Elliot traced this degradation to false sharing involving the mutex internal to the implementation of completion queues in libibverbs. That lead to the largest change in this PR, which serializes the calls to ibv_poll_cq() (doing so per-Cq, not globally).

The use of a "try lock" operation to serialize polling of completion queues, rather than allowing them to queue up on the internal mutex, allows threads to potentially do more useful work (if/when any exists). This is a significant change (improvement) relative to the prior behavior even if a future libibverbs were to eliminate the false sharing. However, a configure option has been added to restore the old behavior if desired.

Use of the "try lock" approach led to exposing additional instances of false sharing, which became the new bottlenecks. One instance was identified in the rdma dissemination barrier implementations (ibv, aries and reference). Another was found in the counters used in ibv-conduit to round-robin over HCAs in a multi-rail build. These are each addressed in a distinct commit.

Additionally, the interaction between polling by the AM receive thread (when active) with polling by other threads can now be controlled. By default the thread's polls of the receive Cq participate in the same serialization as the other threads. However, there are new options to make this thread the exclusive poller of the receive Cq, or to exclude it from the serialization scheme. The former mode (exclusive access) has a perceived disadvantage of serializing the execution of AM handlers (though for some clients that might have cache advantages). The latter has no perceived benefit, but is included for completeness.

Finally, though committed first, this pull request make changes to testcontend itself to remove instances of false sharing which were interfering with efforts to find ones in the library.

Status

READY FOR REVIEW.

This has been very well tested for performance and correctness on seven distinct systems as shown in the comments.

Testing with the ibv-conduti receive thread active requires fixes for bug 4565 and bug 4266. These have been cherry-picked from pull request #474.

Testing has covered the entire range of thread-related ibv-conduit configure options (all four combinations of --{en,dis}able-ibv-{rcv,conn}-thread * seq/par/parsync * GASNET_RCV_THREAD=y/n if configured-in).

NOTE: there is a single "NO MERGE" commit which should be excluded from review. It contains the testcontend modifications made for the purposes of collecting the tests results in this pull request's comments. This is not the only commit modifying tests/testcontend.c. So, I would recommend review of the "real" changes made to that file by examining only the first commit in this pull request.

All changes shown in the diff other than portions of tests/testcontend.c are subject to review without any need to "filter".

Commits:

  • testcontend: reduce false sharing

    This commit modifies various declarations to isolate certain shared variables to their own cache lines. This reduces false sharing in the test to make this a more meaningful test of contention within GASNet.

  • ibv: serialize CQ poll calls

    This commit adds (when multiple threads are enabled, including the recv thread) logic to serialize calls to ibv_poll_cq(). This work is based on observations by Elliot R. of the Chapel team that the memory allocations internal to libibverb's CQ data structure are susceptible to false sharing that can have a significant impact on performance as thread count increases.

    A new configure option --disable-ibv-serialize-poll-cq is provided to opt-out of this new behavior.

  • ibv: factor and improve "weak counter" abstraction

    There are multiple places in ibv-conduit where we wish to round-robin over multiple HCAs or QPs. These have been written with a "benign" thread race in which read-modify-write is non-atomic. This is expected to be more scalable than an atomic counter with little or no negative impact from having multiple threads selecting the same value concurrently due to the race.

    This commit factors this abstraction and improves it by adding cache padding around the counter in multi-threaded builds (including SEQ builds with the AM receive thread enabled). This padding eliminates false sharing and yields measurable improvements in testcontend.

  • barrier: cache padding in rdma-dissem barriers

    This commit reorganizes fields in the struct used by the reference, aries- and ibv-conduit implementations of an RDMA dissemination barrier. The read/write field are separated from the read-only ones and appropriate cache padding is added.

    Of particular importance is the addition of padding between the lock and the data it protects. Without this, failing "try" traffic on the lock was prone to repeated thrash the data.

    Additionally, unnecessary duplication in aries-conduit has been eliminated.

  • ibv: enable poll serialization in progress threads

    This commit revises the implementation of ibv-conduit's progress threads to allow for two forms of participation in the polling serialization observed by client threads, in addition to the previous non-participation. One is participation as an equal peer with the client threads, while the other is an exclusive mode in which the progress thread acquires the serialization semaphore when started and releases it when terminated.

    This provides new capabilities for a given progress thread, but does not enable it. That will occur in a subsequent commit.

  • ibv: poll serialization by the AM receive thread

    This commit implements a new GASNET_RCV_THREAD_POLL_MODE environment variable, documented in the ibv-conduit README. This setting selects the polling mode of the AM receive thread.

    The connection progress thread has not been modified to opt-in to serialization, because when that thread is active all connection-related completions occur on a distinct Cq that is never polled by any other thread.

  • ChangeLog: entry for serialization of ibv_poll_cq

0 attachments

0 comments

Loading commits...