Blocking RMA+atomic calls

Issue #28 wontfix
Former user created an issue

Steve would like to see blocking versions of rput, rget, fetch_add, etc just as a convenience. I'm on board. What do you think?

Comments (24)

  1. Steven Hofmeyr

    A suggestion: we could do this, but make non-blocking the default, and then make it explicit when a blocking call is being used, e.g. rget_blocking.

  2. BrianS

    Blocking rget is just auto f = rget(...); f.wait(); but we can make blocking variants. This implies we are calling into progress, so I don't think that will cause deadlocks. Is blocking rput waiting for remote completion? I guess the blocking versions would return values instead of futures, so people can ease into futures?
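
    For concreteness, here is a minimal sketch of the two idioms being contrasted; rget_blocking is a hypothetical name used only for illustration, not an existing UPC++ call:

      upcxx::global_ptr<int> gptr = /* ... */;
      int v1 = upcxx::rget(gptr).wait();       // today: non-blocking rget, then wait on the returned future
      // int v2 = upcxx::rget_blocking(gptr);  // hypothetical blocking variant returning the value directly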

  3. Former user (Account Deleted), reporter

    Blocking calls would probably make internal progress only, so users don't have to worry about callbacks. This means they wouldn't be implemented on top of wait().

  4. Steven Hofmeyr

    Yes, we want blocking calls to return values. So I can write something like:

    int64_t x = atomic_fetchadd(p, n);

  5. Scott Baden

    I hear what you are saying, Steve: we want to look as much as possible like conventional C++. We are also getting bitten by memberOf(). I suspect there will be others. At this stage of the spec development, what I'd like to do is collect UPC++ idiomatic expressions, and later (at a time to be determined) look at whether including them as first-class citizens (i.e. with a named primitive like a blocking op) is the right way to go. Since we said that we wanted nearly all ops to be non-blocking, that implies we are going to have to live with futures one way or the other. If we decide we want to change the way we do business because there is mounting evidence that we are in error, so be it. But I'm concerned about holding up the spec release, and these changes can be settled after we've had a chance to look at the entire document. Fair enough? We will return to this.

  6. BrianS

    There can be blocking versions of all our functions, but I would put those in an appendix. I don't know if there are progress consequences of programming in a blocking model. My first view is that UPC++ is free to do whatever we like behind a blocking call, so progress should be fine. I know that in MPI you can deadlock with an all-blocking design, but it is for users to determine whether that applies to them. The specific case Steven is discussing does not have that problem.

    The function would need to have a different name; we can't overload based on return type.

  7. Scott Baden

    An appendix would be fine, but until we know about the progress issue, best not to put it in the API.

  8. Former user (Account Deleted), reporter

    A blocking call that only made internal progress (as opposed to user progress) would be easier for users, since they wouldn't have to worry about callbacks firing during an innocuous blocking rput/rget. These semantics would make them unimplementable on top of the asynchronous ones, thus they have true core value and are worthy of first-class citizenship in the spec. There would be no deadlock issue; the current UPC++ implementation proves this.

  9. Scott Baden

    John, this is a definitive answer. Long live certain core blocking operations as first-class citizens. So are you saying put, get, and atomic fetchNadd?

  10. Dan Bonachea

    This issue was discussed in the 1/10/18 meeting, and Steve agreed to look into this (ideally as a low priority before March release, but more likely delayed to Sept).

    We discussed the possibility that code which "knows" it's performing an atomic access on local_team() shared memory might use a completion argument to express blocking atomics:

    atomic_fetch_add(gptr, val, std::memory_order_acq_rel, operation_cx::as_blocking())  // avoid the overheads of futures, etc for a CPU fetch-add instruction
    

    These are particularly well-motivated for atomics because, unlike RMA (where one can downcast a global_ptr that is_local() and use a C++ load/store), there is (intentionally) no equivalent coherent localization transformation for atomics.

    Note that closely related issue #107 suggests a different alternative, operation_cx::as_maybeready_future(), for a somewhat different use case where the target is less than 100% local (so we don't want guaranteed blocking for occasional long-latency remote operations), but we still want to avoid the dispatcher overheads for the common case of local memory.

  11. Dan Bonachea

    Strawman Proposal:

    Overview:

    Optimizing for hierarchical memory systems in UPC++ often means leveraging local_team() to understand node-memory boundaries and bypassing unnecessary overheads when accessing node-local memory that is physically load/store addressable by the CPU. For the use case of RMA, "bypass" often means localization, where the global_ptr<T> is downcast to a raw T* and accessed using normal C++ load/store operations, entirely bypassing UPC++ machinery for operations that are known to be most efficiently satisfiable via "synchronous" operations on the cache-coherent memory system (which are also amenable to static and architectural reordering optimizations).
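
    As a minimal sketch of that localization idiom (assuming gptr refers to shared memory allocated by a local_team() peer):

      upcxx::global_ptr<double> gptr = /* ... */;
      if (gptr.is_local()) {
        double *lptr = gptr.local();       // downcast to a raw pointer
        *lptr = 3.14;                      // ordinary CPU store, bypassing UPC++ machinery
      } else {
        upcxx::rput(3.14, gptr).wait();    // fall back to (asynchronous) RMA
      }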

    However, for atomic memory operations, UPC++ (as of 2020.3.0) provides no equivalent localization transformation. Moreover, the atomic_domain API (by design) only guarantees coherence when all accesses to the memory locations use the appropriate atomic_domain function calls. Unfortunately this means that atomic_domain operations on memory locations that are known to reside locally (with affinity to this process or one in local_team()) still incur the cost of delayed completion (usually via futures) and the UPC++ progress engine (hundreds to thousands of cycles) in order to perform an access that frequently requires only a single atomic memory instruction.
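
    For reference, a sketch of the status-quo idiom whose overheads are measured below (the atomic_domain setup is illustrative):

      upcxx::atomic_domain<int64_t> ad({upcxx::atomic_op::fetch_add});
      upcxx::global_ptr<int64_t> gptr = /* ... */;   // has affinity to a local_team() peer
      // even for node-local memory, this allocates a future and enters user-level progress:
      int64_t old = ad.fetch_add(gptr, 1, std::memory_order_relaxed).wait();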

    As quantitative motivation, here are some microbenchmark measurements from our 2.4GHz Xeon E5530 testbed: (Linux/gcc-10/smp/opt develop@9f5b52b0c)

    • atomic_domain<int64>::fetch_add(relaxed) loopback latency : 0.165 us / operation (measured by bench/misc_perf)
    • gasnett_atomic64_add latency: 0.00225 us / operation (measured by gasnet/tests/testmisc)
    • gex_AD_OpNB_U64(GEX_OP_FADD) latency: 0.016 us / operation (measured by gasnet/tests/testfaddperf)

    This means on an otherwise-idle core/process, the CPU overheads associated with using the current atomic_domain API are about 70x higher than the cost of the underlying atomic memory instruction on physically shared memory (cached in L1), and are roughly 10x higher than the latency of the call to the GASNet ratomic interface used internally to implement the synchronous AMO. The majority of this overhead cost is in the completion management (eg dynamic allocation of a future), and in entering the user-level progress engine to accept deferred completion (for the AMO that was actually completed synchronously).

    The difference is qualitatively similar on KNL/smp, but the relative costs of the overhead penalties are even higher when internal progress incurs the cost of polling a NIC on the I/O bus: (Linux/intel-19.0.3/aries/opt develop@9f5b52b0c)

    • atomic_domain<int64>::fetch_add(relaxed) loopback latency : 0.800 us / operation (measured by bench/misc_perf)
    • gasnett_atomic64_add latency: 0.00436 us / operation (measured by gasnet/tests/testmisc)
    • gex_AD_OpNB_U64(GEX_OP_FADD) latency: 0.056 us / operation (measured by gasnet/tests/testfaddperf)

    That comes out to about a 180x penalty over the raw atomic memory instruction, and about a 14x penalty over the latency of the call to the GASNet ratomic interface used internally to implement the synchronous AMO.

    To address this gap, I propose the following operation_cx::as_blocking() extension, which provides a mechanism to request synchronous execution of an atomic memory operation, without the overheads of deferred completion or invoking user-level progress.

    Proposed Specification:

    [static] CType operation_cx::as_blocking();
    

    Constructs a completion object representing blocking completion: the communication call does not return until the operation-completion event has occurred.

    This CType is only valid for use in atomic_domain::<op>(global_ptr<T> p, ...) operations where p.is_local().

    For value-producing operations (ie load, compare_exchange and fetching read-modify-write operations), the value corresponding to this operation completion is returned by value as an unboxed (ie non-future) type (or type component) in RType.

    UPC++ progress level: none

    Example usage:

      global_ptr<int64_t> gptr = ...; 
      assert(gptr.is_local()); // known to point to a location in local_team()
      // use "as_blocking" to avoid the overheads of futures+progress for a CPU fetch-add instruction
      int64_t result = my_ad.fetch_inc(gptr, std::memory_order_acq_rel, operation_cx::as_blocking());
    

    The hope is that this code can be statically expanded down to something close to just the gex_AD_OpNB_U64(GEX_OP_FADD) call measured above, avoiding the unnecessary completion/progress overheads.

    Discussion:

    • IMO the best-motivated use case is for global_ptr's that are is_local() (ie locations local to the calling node), so I've initially proposed that as a restriction on use.

      • We could relax this precondition later and allow its use for remote communication, but I believe that contradicts our design principle of encouraging asynchrony instead. The completion/progress overheads are far less noticeable for operations that include a round-trip network latency, and that's time we'd really prefer to see the core used for overlapped work rather than idling.
      • It's worth noting that atomics are often used for inter-process synchronization, in algorithms where there may be no locally-available opportunities for overlap. However, it's less typing to write .wait() on the returned future to request blocking behavior, and that also preserves user-level attentiveness during the latency stall (think thousands of cycles or more).
    • The draft above is restricted to atomics, but there's nothing fundamental that would prevent its use in RMA (although NOT for RPC, which could easily deadlock). I haven't proposed that for the following reasons:

      • If we keep the restriction to local() pointers, then there's probably zero motivation for extending operation_cx::as_blocking() to RMA, as *(gptr.local()) is more succinct, self-documenting and probably equivalent post-optimization (see the sketch after this list).
      • If we allow use of operation_cx::as_blocking() with remote global_ptr's, then this would provide a means to stall the caller for operation completion of remote RMA without advancing user progress. This is a novel capability (and should be deadlock-free), but I don't consider it a "good thing": it delays available local callbacks that probably should have been overlapped with the communication latency, and decreases attentiveness to incoming RPCs (when invoked on the master persona). I have yet to see a compelling use case where this capability would provide an expected win.
      • It's easy to relax the restriction later if we decide it has a well-motivated use case, but harder to "take back" a capability that turns out to be a mistake.
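
      A sketch of that comparison; the as_blocking() form of rget shown here is hypothetical and not part of this proposal:

        upcxx::global_ptr<int> gptr = /* ... */;
        assert(gptr.is_local());
        int v1 = *(gptr.local());                                          // existing localization idiom: a plain CPU load
        // int v2 = upcxx::rget(gptr, upcxx::operation_cx::as_blocking()); // hypothetical blocking-RMA extension
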
  12. Amir Kamil

    If we are restricting this to local pointers, why not just add AD overloads that take T* rather than global_ptr<T>?

  13. Dan Bonachea

    If we are restricting this to local pointers, why not just add AD overloads that take T* rather than global_ptr<T>?

    This is a good question and worth considering as a competing proposal.

    Offhand the downsides I see:

    1. It doubles the width of the AD interface, although these are just overloads so maybe we don't care.
      1. if we ever found a motivation to expand this beyond AMOs, we'd be potentially doubling other interfaces as well
    2. The type signature would give the impression that we support atomics to any T* and not just T's in the shared segment which is all we actually support.
    3. In order to call gex_AD_OpNB to perform the AMO given only a T*, the implementation would need to start by effectively performing an upcast (a potentially expensive operation relative to an atomic instruction) to look up that T* in the local segment table and discover the rank of the PSHM peer segment where the target memory appears.
      1. Knowing the rank of the target process is necessary for cases like aries-conduit with GEX_FLAG_AD_FAVOR_REMOTE (offload enabled) where the "local" atomic access actually still needs to be processed by the NIC offload hardware in loopback mode to maintain coherence, and this requires naming the correct endpoint with affinity to the target memory.
      2. Our upcast logic currently lives in backend::globalize_memory and calls std::upper_bound on a sorted table of segment base pointers (with a complexity of O(log2(num_pshm_peers))), followed by a lookup in a second table to retrieve the corresponding peer id; this is probably close to the best we can do for that operation (see the sketch after this list). It is not super-expensive in an absolute sense, but on a slow KNL running with 272 PSHM peers, it still sounds expensive relative to the underlying atomic instruction (which is all you need for the cases of smp-conduit or aries-conduit with GEX_FLAG_AD_FAVOR_LOCAL).
      3. The target object probably has an associated global_ptr<T> in the application code containing the information we need, so the caller should just give it to us rather than forcing us to reconstruct that information on every call via on-the-fly upcasts.
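
      A minimal sketch of that kind of lookup (not the actual UPC++ internals; all names here are illustrative):

        #include <algorithm>
        #include <cstdint>
        #include <vector>

        struct segment_table {
          std::vector<std::uintptr_t> bases;    // segment base addresses, sorted ascending
          std::vector<int>            peer_id;  // local_team() rank owning each segment

          // Map a raw pointer to the owning peer via binary search: O(log2(num_pshm_peers)).
          // Returns -1 if p precedes every known segment (segment-end checks omitted for brevity).
          int owner_of(const void *p) const {
            std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
            auto it = std::upper_bound(bases.begin(), bases.end(), addr);
            if (it == bases.begin()) return -1;
            return peer_id[(it - bases.begin()) - 1];
          }
        };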

    Given this entire proposal is motivated by performance, I'm currently opposed to the expected additional cost associated with the counter-proposal of AD overloads taking only a T* argument.

  14. Dan Bonachea

    I've updated/edited the proposal above based on our discussion in the 2020-06-17 meeting, where it was observed that additional machinery is required to accommodate the value-producing atomic operations.

    I've proposed that the value produced by an operation_cx::as_blocking() would be returned unboxed in RType, ie the return type of atomic_domain<int>::fetch_add(...,operation_cx::as_blocking()) would be a simple by-value int. This expands our definition of RTypes somewhat (as they can now include T in addition to future<T>), but I think it solves the problem cleanly and avoids the efficiency questions associated with returning a trivially-ready future that pointlessly boxes the produced value.

    An alternative formulation would be to introduce new explicit variants of the atomic_domain operations (either with a name suffix or overloaded names with an artificial "tag" argument) that statically demand synchronous completion (not accepting any completion argument and requiring is_local pointer arg) and return-by-value for produced values. This formulation makes it clear the feature is syntactically specific to atomic_domain, rather than giving the impression that it's a general completion variant we'd support elsewhere. It would also prevent combination of synchronous completion with other forms of completion for the same operation, but I cannot think of a well-motivated reason for wanting that capability (for a synchronous operation, it's equally efficient to synchronously invoke any additional desired completions directly in the caller).

    Example: (alternative formulations)

      global_ptr<int64_t> gptr = ...; 
      assert(gptr.is_local()); // known to point to a location in local_team()
    
      //  variant of fetch_inc with "_local" suffix is synchronous and returns the value produced by-value
      int64_t result = my_ad.fetch_inc_local(gptr, std::memory_order_acq_rel);
    
      // another possible API, via an overloaded tag argument in place of the completion argument (not my favorite)
      // here upcxx::synchronous_local would be a constant of type upcxx::synchronous_local_tag_t or similar
      int64_t result = my_ad.fetch_inc(gptr, std::memory_order_acq_rel, synchronous_local);
    

    In the same meeting we resolved to defer generation of a working group draft for this proposal, in favor of related issue #107, which solves a different problem but could yield a subset of the performance improvements we are hoping to obtain here.

  15. Amir Kamil

    Performance numbers collected as of the merge of impl PR 345. Systems:

    • iMac (2019): Intel Core i5-8500 (8th gen) 3.0 GHz
    • Dirac (pcp-d-6): Intel Xeon E5530 2.4 GHz

    GASNet numbers from gasnet/tests/testfaddperf, UPC++ numbers from a modified version of bench/misc_perf. All numbers are in microseconds.

    Add

    version           iMac Clang 12.0.5   iMac GCC 10.2.0   Dirac Clang 12.0.0   Dirac GCC 11.1.0   Dirac Intel 2021.1.2   Dirac PGI 20.4
    gex_AD_OpNB_U64   0.010               0.009             0.014                0.014              0.013                  0.015
    defer promise     0.014               0.011             0.019                0.017              0.018                  0.018
    eager promise     0.010               0.011             0.015                0.017              0.014                  0.015
    defer future      0.14                0.10              0.14                 0.12               0.12                   0.16
    eager future      0.014               0.010             0.018                0.016              0.018                  0.025
    blocking          0.010               0.009             0.014                0.015              0.013                  0.015

    Fetch-add (non-value)

    version           iMac Clang 12.0.5   iMac GCC 10.2.0   Dirac Clang 12.0.0   Dirac GCC 11.1.0   Dirac Intel 2021.1.2   Dirac PGI 20.4
    gex_AD_OpNB_U64   0.010               0.010             0.015                0.014              0.015                  0.014
    defer promise     0.014               0.011             0.020                0.019              0.021                  0.017
    eager promise     0.011               0.011             0.015                0.018              0.015                  0.015
    defer future      0.14                0.10              0.14                 0.13               0.13                   0.16
    eager future      0.014               0.010             0.019                0.017              0.020                  0.025
    blocking          0.010               0.010             0.015                0.015              0.015                  0.014

    The blocking variant is unspecified: the prototype is correct when completion happens synchronously (which is the case here), but would be incorrect if completion were asynchronous. So it represents the best we can expect from specifying and correctly implementing operation_cx::as_blocking().

    The promise variants do not include obtaining the future and waiting on it. That cost can be amortized over many operations, or even elided if you know a priori that the operations complete synchronously.
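
    For example, a sketch of amortizing that cost over many operations (assuming ad is an atomic_domain<int64_t> constructed with the add operation, and gps[] holds N local global_ptrs):

      upcxx::promise<> prom;
      for (int i = 0; i < N; i++)
        ad.add(gps[i], 1, std::memory_order_relaxed,
               upcxx::operation_cx::as_promise(prom));  // register all completions on one promise
      prom.finalize().wait();                           // a single wait covers all N operations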

    In general, eager futures come within 40% of GASNet/blocking (except PGI, where there’s ~75% overhead), while eager promises get even closer (~25% on Dirac/GCC, even closer on the other system/compiler combinations).

    Given these results, my inclination is that adding as_blocking() is not worth it. Thoughts?

  16. Dan Bonachea

    Thanks for collecting these results @Amir Kamil !

    I agree these numbers are a compelling demonstration that (as hoped) the combination of as_eager enhancements in issue 107 and impl PR 345 (specifically as_eager_{promise,future}, the ready empty future optimization and non-value fetching AMO overloads) have "closed the gap" in overhead relative to gex_AD_OpNB_U64, which was the primary goal of the proposal in this issue. Once the default is changed to eager completion, programs can potentially reap these benefits without source code changes (although they'll need to use the new overloads for best performance of fetching-AMOs).
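
    For instance, a sketch of the eager-future idiom those enhancements enable (assuming ad is an atomic_domain<int64_t> and gptr is local; the as_eager_future spelling follows the naming above):

      upcxx::future<int64_t> f =
          ad.fetch_add(gptr, 1, std::memory_order_relaxed,
                       upcxx::operation_cx::as_eager_future());  // may return an already-ready future
      int64_t old = f.wait();  // no progress stall when the AMO completed synchronously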

    More importantly, the impl PR 345 enhancements are semantically much "cleaner" than the as_blocking() proposal, which suffers from non-graceful semantic degradation in the presence of remote pointers (either synchronously blocking for a network round-trip, or crashing/UB).

    I hereby withdraw the proposals in this issue.
