d901e0f·Author: Paul Hargrove·Closed by: Paul Hargrove·2021-08-09
Description
Summary:
Resolves the thread safety and defective test issues identified in bugs 4265 and 4266. Harness coverage of testreduce has also been increased.
Additionally fixes one minor (unreported, nearly harmless) race in the geometry cache, discovered along the way.
Status:
Bug 4265 fix well tested on Dirac and Wombat. Bug 4266 fix well tested on Wombat.
Testing of bug 4265 fix includes via the repaired testcoll -p N ... and via the ibv-conduit AM receive thread. The latter includes testing both with develop and in the context of work in progress for pull request 468 to provide a mode in which only the AM receive thread polls the recv Cq.
Testing of bug 4266 fix includes added testreduce -p N ... support and via the ibv-conduit AM receive thread (but not using the work in progress for pull request 468).
Commits:
testcoll: repair the -p option
This commit repairs the -p option to testcoll, which was previously ignored as reported in Bug 4265.
This commit resolves Bug 4265 ("Collective scratch management is not thread safe") by replacing storage management via a simple freelist with use of a (thread safe) lock-free LIFO queue.
coll: fix a minor race in tree_geom_fetch
This commit relocates a previously unlocked read of a linked list head to take place within the mutex-protected critical section. Prior to this change, a race in which two (or more) callers observed an empty geom cache on entry would result in each caller initializing the cache rather than appending to it. The result would have been a silent memory leak of the entires created by these threads other than the last one to obtain the lock. However, there was no correctness issue.
Additionally, availability of the list head has been used to make a minor simplification to the list initialization code.
testreduce: add -p option for pollers
This commit extends testreduce with -p N option, similar to that in testam and testcoll. In a PAR build this spawns one or more polling threads to run concurrently with the main thread.
This change is capable of reproducing bug 4266 on all conduits, where previously it could only be reproduced using ibv-conduit's AM receive thread.
harness: increase testreduce coverage
This adds coverage additional harness coverage for testreduce:
Single-element reductions, which have been seen to exercise different algorithms than the default (for at least some conduits and process counts).
Runs in PAR mode with a polling thread, such as to reproduce the behaviors seen in bug 4266.
coll/reduce: minor cleanups
This commit removes a redundant declaration and a redundant statement, both in the implementation of reductions. These "code changes" are not "logic changes".
The removed declaration shadowed an identical one in an enclosing scope.
The removed statement was an assignment right before a goto which is followed by an identical assignment.
coll: fix logic error in AM medium handlers
The collectives support uses two Medium AM handlers which provide a significant amount of generality. This commit corrects a logic error related to non-empty message size and zero "element size", which had not previously been exercised (but will be soon).
coll/reduce: generalize "eager" reductions
This commit generalizes the reduce_BinomialEager and reduce_BinomialEagerSeg algorithms for ReduceToOne to optionally perform a broadcast (over the same binomial tree as the reduction). This is done if and only if root == GEX_RANK_INVALID, which will soon be used internally as a means to request that these ReduceToOne algorithms perform a ReduceToAll.
coll/reduce: new ReduceToAll dispatcher
This commit resolves bug 4266: "gex_Coll_ReduceToAllNB is not thread-safe" by discarding the flawed implementation built on subordinate collectives with use of the newly generalizes Binomial tree implementations of ReduceToOne (now capable of implementing either ToOne or ToAll).
coll/reduce: remove unused reduce_all_Bcast
This commit removes the old, broken (as reported in bug 4266) implementation of ReduceToAll, which became unused in a previous commit. Additionally, the corresponding SUBORDINATE support in the ReduceToOne implementation has been removed.
Summary:
Resolves the thread safety and defective test issues identified in bugs 4265 and 4266.
Harness coverage of
testreduce
has also been increased.Additionally fixes one minor (unreported, nearly harmless) race in the geometry cache, discovered along the way.
Status:
Bug 4265 fix well tested on Dirac and Wombat.
Bug 4266 fix well tested on Wombat.
Testing of bug 4265 fix includes via the repaired
testcoll -p N ...
and via the ibv-conduit AM receive thread. The latter includes testing both withdevelop
and in the context of work in progress for pull request 468 to provide a mode in which only the AM receive thread polls the recv Cq.Testing of bug 4266 fix includes added
testreduce -p N ...
support and via the ibv-conduit AM receive thread (but not using the work in progress for pull request 468).Commits:
testcoll: repair the -p option
This commit repairs the
-p
option totestcoll
, which was previously ignored as reported in Bug 4265.coll: fix bug 4265
This commit resolves Bug 4265 ("Collective scratch management is not thread safe") by replacing storage management via a simple freelist with use of a (thread safe) lock-free LIFO queue.
coll: fix a minor race in tree_geom_fetch
This commit relocates a previously unlocked read of a linked list head to take place within the mutex-protected critical section. Prior to this change, a race in which two (or more) callers observed an empty geom cache on entry would result in each caller initializing the cache rather than appending to it. The result would have been a silent memory leak of the entires created by these threads other than the last one to obtain the lock. However, there was no correctness issue.
Additionally, availability of the list head has been used to make a minor simplification to the list initialization code.
testreduce: add -p option for pollers
This commit extends
testreduce
with-p N
option, similar to that intestam
andtestcoll
. In a PAR build this spawns one or more polling threads to run concurrently with the main thread.This change is capable of reproducing bug 4266 on all conduits, where previously it could only be reproduced using ibv-conduit's AM receive thread.
harness: increase testreduce coverage
This adds coverage additional harness coverage for
testreduce
:Single-element reductions, which have been seen to exercise different algorithms than the default (for at least some conduits and process counts).
Runs in PAR mode with a polling thread, such as to reproduce the behaviors seen in bug 4266.
coll/reduce: minor cleanups
This commit removes a redundant declaration and a redundant statement, both in the implementation of reductions. These "code changes" are not "logic changes".
The removed declaration shadowed an identical one in an enclosing scope.
The removed statement was an assignment right before a
goto
which is followed by an identical assignment.coll: fix logic error in AM medium handlers
The collectives support uses two Medium AM handlers which provide a significant amount of generality. This commit corrects a logic error related to non-empty message size and zero "element size", which had not previously been exercised (but will be soon).
coll/reduce: generalize "eager" reductions
This commit generalizes the
reduce_BinomialEager
andreduce_BinomialEagerSeg
algorithms for ReduceToOne to optionally perform a broadcast (over the same binomial tree as the reduction). This is done if and only ifroot == GEX_RANK_INVALID
, which will soon be used internally as a means to request that these ReduceToOne algorithms perform a ReduceToAll.coll/reduce: new ReduceToAll dispatcher
This commit resolves bug 4266: "gex_Coll_ReduceToAllNB is not thread-safe" by discarding the flawed implementation built on subordinate collectives with use of the newly generalizes Binomial tree implementations of ReduceToOne (now capable of implementing either ToOne or ToAll).
coll/reduce: remove unused reduce_all_Bcast
This commit removes the old, broken (as reported in bug 4266) implementation of ReduceToAll, which became unused in a previous commit. Additionally, the corresponding SUBORDINATE support in the ReduceToOne implementation has been removed.