Enh: Streamline "empty RMA" operations

Issue #484 new
Dan Bonachea created an issue

Pull request 345 added optimizations for RMA operations that synchronously complete due to shared-memory bypass (i.e., operating on a pointer satisfying global_ptr<>::is_local()).

In that PR I identified that the same machinery can also be used to streamline the degenerate case of "empty RMA", i.e., bulk rget/rput operations called with argument count == 0, indicating that the data transfer is a no-op (a property that is notably independent of pointer locality). These operations vacuously complete synchronously, so they are amenable to bypassing libupcxx entry, and their future/promise completions may be eagerly satisfied before return.
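The idea in the paragraph above can be illustrated with a small standalone model. This is only a sketch and not UPC++ code: `SketchFuture`, `inject_transfer`, `rput_sketch`, and the `runtime_entries` counter are hypothetical names invented for illustration, and the actual proposed algorithm is the one in the linked comment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for a future: it only records whether the
// operation is already complete when the call returns.
struct SketchFuture { bool ready; };

// Hypothetical counter of how many times the "runtime" was entered.
static int runtime_entries = 0;

// Hypothetical slow path: models entering the runtime to start a transfer
// whose completion would normally be signaled during a later progress call.
SketchFuture inject_transfer(const char* src, char* dst, std::size_t count) {
  ++runtime_entries;
  std::memcpy(dst, src, count);
  return SketchFuture{false};  // completion deferred
}

// Sketch of a bulk put with the empty-RMA shortcut.
SketchFuture rput_sketch(const char* src, char* dst, std::size_t count) {
  if (count == 0) {
    // Nothing to transfer: complete eagerly without entering the runtime.
    // Note this check does not depend on where dst lives.
    return SketchFuture{true};
  }
  assert(src && dst);
  return inject_transfer(src, dst, count);
}
```

In this model, a call with `count == 0` returns a ready `SketchFuture` and leaves `runtime_entries` unchanged, while a non-empty call takes the slow path.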

This enhancement issue requests we deploy the eager completion optimization for "empty RMA". The proposed algorithm appears in this comment.

This change is currently "on hold" because it has the potential to introduce a new branch into the critical path of remote RMA operations. The plan is to deploy a compiler annotation (via the gasnet_tools interface) that should ensure the corresponding is-empty branch inside the GASNet RMA header can be optimized away, leading to no net growth in dynamic branch count.
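A rough sketch of the annotation idea follows. It does not use the gasnet_tools interface: `SKETCH_ASSUME` is a hypothetical macro built on the GCC/Clang builtin `__builtin_unreachable()`, and `lower_level_put` / `upper_level_put` are invented stand-ins for the GASNet RMA header and the calling layer. Whether a given compiler actually removes the inner branch is not guaranteed and would need to be checked in the generated code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical assumption macro (NOT the gasnet_tools annotation).
// On GCC/Clang it tells the optimizer that `cond` always holds here.
#if defined(__GNUC__) || defined(__clang__)
  #define SKETCH_ASSUME(cond) \
    do { if (!(cond)) __builtin_unreachable(); } while (0)
#else
  #define SKETCH_ASSUME(cond) ((void)0)
#endif

// Stand-in for the lower-level inline header, which carries its own
// is-empty check. Returns 1 if bytes were moved, 0 otherwise.
static inline int lower_level_put(char* dst, const char* src,
                                  std::size_t nbytes) {
  if (nbytes == 0) return 0;  // the existing is-empty branch
  assert(dst && src);
  std::memcpy(dst, src, nbytes);
  return 1;
}

// Stand-in for the calling layer with the new empty-RMA check.
static inline int upper_level_put(char* dst, const char* src,
                                  std::size_t nbytes) {
  if (nbytes == 0) return 0;   // new branch on the critical path
  SKETCH_ASSUME(nbytes != 0);  // lets the inlined inner check fold away
  return lower_level_put(dst, src, nbytes);
}
```

The intent is that after inlining, the non-empty path evaluates only one is-empty test instead of two, so the new check replaces the old branch rather than adding to it.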

Comments (3)

  1. Dan Bonachea reporter

    Pull request 356 adds a runtime warning (enabled by default in codemode=debug) indicating that an empty RMA was invoked.

    If you've seen evidence of this warning in your application, please let us know in this issue!

  2. Dan Bonachea reporter

    The observations in this issue also apply to upcxx::copy RMA, where the code paths are much more complicated but we similarly do not recognize or special-case empty transfers. As a result, zero-size transfers can propagate all the way down to a device-level call like cuMemcpyDtoHAsync(ByteCount=0). The CUDA library docs don't clarify the behavior of zero-length transfers, so it's even possible this could result in misbehavior today (and similarly with future devices). Even if it works as expected, I'd expect the CPU overhead of such an empty transfer (which at least entails CUDA stream operations) to be much higher than that of recognizing and short-circuiting empty transfers at entry to copy.

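The short-circuit suggested in the second comment for upcxx::copy can be modeled in the same spirit. This sketch makes no CUDA or UPC++ calls: `device_copy_stub` is a hypothetical stand-in for a device-level copy such as the one named in the comment, and `copy_unguarded`, `copy_guarded`, and `device_calls` are invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical counter of device-level calls issued by the model.
static int device_calls = 0;

// Hypothetical stand-in for a device-level asynchronous copy.
void device_copy_stub(char* dst, const char* src, std::size_t nbytes) {
  ++device_calls;
  if (nbytes != 0) std::memcpy(dst, src, nbytes);
}

// Models the current behavior: a zero-size request still reaches
// the device-level call.
void copy_unguarded(char* dst, const char* src, std::size_t nbytes) {
  device_copy_stub(dst, src, nbytes);
}

// Models the proposed behavior: empty transfers are recognized at entry.
// Returns true if the operation completed without a device-level call.
bool copy_guarded(char* dst, const char* src, std::size_t nbytes) {
  if (nbytes == 0) return true;  // short-circuit at entry to copy
  assert(dst && src);
  device_copy_stub(dst, src, nbytes);
  return false;
}
```

With the guard in place, an empty request never increments `device_calls`, so the behavior of zero-length device transfers stops mattering for this path.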