Tune reference memory kinds

Issue #620 new
Dan Bonachea created an issue

UPC++ features "native" memory kinds support for various devices, whereby upcxx::copy operations are accelerated by GASNet memory kinds support, such that data transfers stream directly between device memory and the RDMA-capable network card without passing through host memory. This is generally the most efficient mechanism for effecting such transfers, and performance is quite good where supported.

However, some systems lack native kinds support, for example in the absence of proper kernel drivers or on networks lacking the requisite hardware support. On such systems UPC++ uses a "reference" implementation of memory kinds that stages transfers involving device memory through host memory on one or both sides, as needed. This staging adds a semantically superfluous copy of the data payload; the extra copy is fundamentally necessary given the lack of support for a direct transfer, but it incurs a very noticeable performance hit relative to native kinds. As a result, the performance of reference kinds (where one or both sides involves device memory) has not received much attention.
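To make the cost concrete, here is a minimal sketch in plain C++ (no UPC++ or GASNet dependencies) contrasting the two data paths. The function names `device_to_host` and `network_put` are hypothetical stand-ins for a device-memory copy (e.g. a CUDA/HIP memcpy) and an RDMA put, modeled here with plain `memcpy`:

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical stand-ins: in a real system these would be a device-memory
// copy (e.g. cudaMemcpy) and an RDMA put issued to the network card.
static void device_to_host(char *dst, const char *src, size_t n) { std::memcpy(dst, src, n); }
static void network_put(char *dst, const char *src, size_t n)    { std::memcpy(dst, src, n); }

// Native kinds path: the NIC streams directly from device memory --
// a single data movement, no host staging.
void native_copy(char *remote_dst, const char *device_src, size_t n) {
  network_put(remote_dst, device_src, n);
}

// Reference kinds path: stage through a host bounce buffer --
// two data movements, one of them semantically superfluous.
void reference_copy(char *remote_dst, const char *device_src, size_t n) {
  std::vector<char> bounce(n);
  device_to_host(bounce.data(), device_src, n);  // the extra staging copy
  network_put(remote_dst, bounce.data(), n);
}
```

Both paths deliver the same payload; the reference path simply moves it twice, which is the overhead the refinements below aim to reduce where a specialization applies.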

However, there are a number of additional refinements that could be made to the reference kinds algorithm to improve performance by specializing for particular cases. For example:

  1. Reference copies between co-located (same-node but distinct) processes currently stage through host memory on each side that involves a device, despite the fact that both memories physically share a system bus and the network hardware is not involved. This could and should be replaced by peer-to-peer IPC support that eliminates the extra staging copies. We currently hope to deploy this optimization inside GASNet, where similar infrastructure could also benefit same-node transfers using native kinds.
  2. Reference copies combining remote host memory with local GPU memory currently incur an (overlapped) AMRequest round-trip in the critical path. This increases CPU overhead, network occupancy, and total operation latency, and introduces sensitivity to remote attentiveness. These AMs could be entirely eliminated by specialization, alleviating the associated performance degradation.
