Enh: Provide RMA variants that are permitted to signal completion synchronously

Issue #107 resolved
Dan Bonachea created an issue

Recording an idea Scott and I discussed at SC17.

There is a potential performance pitfall in UPC++ for applications that use rput/rget on global pointers whose target memory is usually local but occasionally remote. Such an application may lack static knowledge of locality, so it cannot apply manual localization optimizations (downcasts and load/store) without introducing an app-level branch that duplicates the locality branch already taken inside the UPC++/GASNet runtime.

The problem is that our semantics require rput/rget to return a non-ready future; consequently, the current implementation allocates and tracks a new future-state object for every rput/rget, even those using shared-memory loopback, where the payload transfer completes synchronously inside the injection call. For small payloads, this future-management overhead is likely orders of magnitude more expensive than the underlying shared-memory load/store that performs the data movement.

If instead we provided a variant of rput/rget (and possibly rpc?) that was permitted to signal completion before returning from the injection call (rather than forcing that signal to be delayed until the next user-level progress), then synchronously completed shared-memory operations could return a trivially-ready future (probably a single static const object or special value designated for this purpose), eliminating all the management overhead.

The application would need to "opt in" to this behavior, because it is semantically visible: the application must be prepared for completion to be signaled before the injection call returns.

Under generalized completion, these injection calls should probably have "progress level: user", because they could run loopback lpc callbacks to signal completion before returning from injection.

Comments (9)

  1. Dan Bonachea reporter

This issue was discussed in the 1/10/18 meeting. We decided to prototype by March to assess the potential benefit, and possibly productize for September.

Based on performance measurements I collected for issue #108 (source code here):

     memcpy(4KB)                           0.121670 us
     self.lpc(noop0)                       0.527286 us
     upcxx::rput<double>(self)             0.557903 us
    

A memcpy of a full-page, 4 kilobyte payload is about 4 times faster than an 8-byte rput().wait() or a zero-payload loopback lpc().wait(), because it avoids the overheads of the UPC++ progress engine. This supports the value of the optimization proposed here when local_team loopback is the expected common case.

One interface idea presented was to spell the "opt-in" syntax using the same rput/rget calls plus an extension to the generalized completion framework, e.g. operation_cx::as_maybeready_future() and source_cx::as_maybeready_future() (with a better name, possibly as_eager_future?). Allowing this only for future signaling eliminates the possible need for "progress level: user" on the injection call.

There might still be motivation to allow the analogous behavior for synchronous LPC notifications, e.g. rput(val, gptr, operation_cx::as_maybesynchronous_lpc( default_persona(), func )), which would allow synchronous execution of func before returning from the rput call when the data movement is performed synchronously (and because the LPC is self-targeted).
