Bug 4265 and bug 4266 - collective thread-safety

Merged
#474 · Created  · Last updated

Merged pull request

Merge pull request #474 into develop

d901e0f·Author: ·Closed by: ·2021-08-09

Description

Summary:

Resolves the thread safety and defective test issues identified in bugs 4265 and 4266.
Harness coverage of testreduce has also been increased.

Additionally fixes one minor (unreported, nearly harmless) race in the geometry cache, discovered along the way.

Status:

Bug 4265 fix well tested on Dirac and Wombat.
Bug 4266 fix well tested on Wombat.

Testing of bug 4265 fix includes via the repaired testcoll -p N ... and via the ibv-conduit AM receive thread. The latter includes testing both with develop and in the context of work in progress for pull request 468 to provide a mode in which only the AM receive thread polls the recv Cq.

Testing of bug 4266 fix includes added testreduce -p N ... support and via the ibv-conduit AM receive thread (but not using the work in progress for pull request 468).

Commits:

  • testcoll: repair the -p option

    This commit repairs the -p option to testcoll, which was previously ignored as reported in Bug 4265.

  • coll: fix bug 4265

    This commit resolves Bug 4265 ("Collective scratch management is not thread safe") by replacing storage management via a simple freelist with use of a (thread safe) lock-free LIFO queue.

  • coll: fix a minor race in tree_geom_fetch

    This commit relocates a previously unlocked read of a linked list head to take place within the mutex-protected critical section. Prior to this change, a race in which two (or more) callers observed an empty geom cache on entry would result in each caller initializing the cache rather than appending to it. The result would have been a silent memory leak of the entires created by these threads other than the last one to obtain the lock. However, there was no correctness issue.

    Additionally, availability of the list head has been used to make a minor simplification to the list initialization code.

  • testreduce: add -p option for pollers

    This commit extends testreduce with -p N option, similar to that in testam and testcoll. In a PAR build this spawns one or more polling threads to run concurrently with the main thread.

    This change is capable of reproducing bug 4266 on all conduits, where previously it could only be reproduced using ibv-conduit's AM receive thread.

  • harness: increase testreduce coverage

    This adds coverage additional harness coverage for testreduce:

    • Single-element reductions, which have been seen to exercise different algorithms than the default (for at least some conduits and process counts).

    • Runs in PAR mode with a polling thread, such as to reproduce the behaviors seen in bug 4266.

  • coll/reduce: minor cleanups

    This commit removes a redundant declaration and a redundant statement, both in the implementation of reductions. These "code changes" are not "logic changes".

    The removed declaration shadowed an identical one in an enclosing scope.

    The removed statement was an assignment right before a goto which is followed by an identical assignment.

  • coll: fix logic error in AM medium handlers

    The collectives support uses two Medium AM handlers which provide a significant amount of generality. This commit corrects a logic error related to non-empty message size and zero "element size", which had not previously been exercised (but will be soon).

  • coll/reduce: generalize "eager" reductions

    This commit generalizes the reduce_BinomialEager and reduce_BinomialEagerSeg algorithms for ReduceToOne to optionally perform a broadcast (over the same binomial tree as the reduction). This is done if and only if root == GEX_RANK_INVALID, which will soon be used internally as a means to request that these ReduceToOne algorithms perform a ReduceToAll.

  • coll/reduce: new ReduceToAll dispatcher

    This commit resolves bug 4266: "gex_Coll_ReduceToAllNB is not thread-safe" by discarding the flawed implementation built on subordinate collectives with use of the newly generalizes Binomial tree implementations of ReduceToOne (now capable of implementing either ToOne or ToAll).

  • coll/reduce: remove unused reduce_all_Bcast

    This commit removes the old, broken (as reported in bug 4266) implementation of ReduceToAll, which became unused in a previous commit. Additionally, the corresponding SUBORDINATE support in the ReduceToOne implementation has been removed.

0 attachments

0 comments

Loading commits...