Enh: Provide RMA variants that are permitted to signal completion synchronously

Issue #107 resolved
Dan Bonachea created an issue

Recording an idea Scott and I discussed at SC17.

There is a potential performance pitfall in UPC++ for applications that use rput/rget on global pointers whose target memory is usually local but occasionally remote. Such an application may lack static knowledge of locality, so it cannot apply manual localization optimizations (downcasts and load/store) without introducing an app-level branch that duplicates the locality branch already taken inside the UPC++/GASNet runtime.

The problem is that our semantics require rput/rget to return a non-ready future; consequently, the current implementation allocates and tracks a new future-state object for every rput/rget, even those using shared-memory loopback, where the payload transfer completes synchronously inside the injection call. For small payloads, this future-management overhead is likely orders of magnitude more expensive than the underlying shared-memory load/store that performs the data movement.

If instead we provided a variant of rput/rget (and possibly rpc?) that was permitted to signal completion before returning from the injection call (rather than forcing that signal to be delayed until the next user-level progress), then synchronously completed shared-memory operations could return a trivially-ready future (probably a single static const object or special value designated for this purpose), eliminating all the management overhead.

The application would need to "opt in" to this behavior, because it is semantically visible: the application must be prepared for completion to be signaled before the injection call returns.

Under generalized completion, these injection calls should probably have "progress level: user", because they could run loopback lpc callbacks to signal completion before returning from injection.

Comments (9)

  1. Dan Bonachea reporter

This issue was discussed in the 1/10/18 meeting. We decided to prototype by March to assess the potential benefit, and possibly productize for September.

Based on performance measurements I collected for issue #108 (source code here):

     memcpy(4KB)                           0.121670 us
     self.lpc(noop0)                       0.527286 us
     upcxx::rput<double>(self)             0.557903 us
    

A memcpy of a full-page, 4 kilobyte payload is about 4 times faster than an 8-byte rput().wait() or a zero-payload loopback lpc().wait(), because it avoids the overheads of the UPC++ progress engine. This supports the value of the optimization proposed here when local_team loopback is the expected common case.

One interface idea presented was to spell the "opt-in" syntax using the same rput/rget calls plus an extension to the generalized completion framework, e.g. operation_cx::as_maybeready_future() and source_cx::as_maybeready_future() (with a better name, possibly as_eager_future?). Allowing this only for future signaling eliminates the possible need for "progress level: user" on the injection call.

There might still be motivation to allow the analogous behavior for synchronous LPC notifications, e.g. rput(val, gptr, operation_cx::as_maybesynchronous_lpc( default_persona(), func )), which would allow synchronous execution of func before returning from the rput call when the data movement is performed synchronously (and because the LPC is self-targeted).
