- edited description
-
assigned issue to
- marked as minor
Consider packing optimization for rput-then-rpc where rpc exceeds Long args, but entire operation fits in one Medium
This issue regards the performance regression observed on Cori/KNL while testing PR#110. Performance data is in this Google Sheet, with the regression visible in both sheets/tabs (results from two distinct commits). This data was collected using the modifications to the nebr_exchange benchmark made in PR#114.
The problem observed is that while aries-conduit looks fine for the Haswell nodes of Cori, on the KNL nodes we see degradation (indicated by red cells w/ white text) of over 10% and as high as 32%. By degradation, I mean a reduction in the reported bandwidth relative to the develop branch prior to merging PR#110.
The problem is happening for RPC payloads of 256 and 512 bytes, each of which should be transmitted by a single AM Medium to be reunited with the `upcxx::rput` payload at the target. However, it is not seen for 1KB RPC payloads, which one expects to use the same protocol within UPC++.
It is only seen for rput payload sizes up to 32K bytes (by powers of 8). The performance is actually quite good at the next payload size measured (256KB). FWIW, there are no aries-conduit protocol transitions for Puts or Longs between 4KB and 8MB. So, I don't think it likely that the large change in performance between 32KB and 256KB rput payload is attributable to a conduit behavior exposed by the switch (made in PR#110) from Put to Long as the means to move the rput payload.
I will continue, as time allows, to look into aries-conduit for behaviors which might serve to explain the observed results. However, I am shifting assignment of this issue to John under the belief that the answer is more likely to be found within UPC++.
Comments (13)
-
reporter -
reporter I have tried a further modification to the test to force page-alignment of both the RMA and RPC buffers.
The performance on the 256-byte RPC payload shows no meaningful difference from the version used to generate the "pr110" data in the spreadsheet. This seems to rule out the possibility that different handling of source alignment between `gex_RMA_PutNB` and `gex_AM_RequestLong` could be responsible for the observed performance regression. -
reporter In brainstorming for possible explanations at the aries-conduit level, Dan observed that when evaluating the recent conduit changes to eliminate initiator chaining in the implementation of Long, we only compared the performance of Long before and after. We made no comparison of Long to client-level chaining.
So, I've tried to rule out the possibility that the observed regression was due to some unexpected conduit behavior in which the pre-PR110 use of "put-sync-send" might out-perform RequestLong. I have modified GASNet-EX's `testam` to include tests of an "emulation" of `RequestLong0` via `PutBlocking` + `RequestShort4`. Here the 4 arguments for the Short are the high and low words of the destination address and length, to provide the semantic equivalent of Long0. ReplyLong0 was used for the "pong" leg of the round-trip, since Put from AM handler context is out of the question. The following data, collected on KNL nodes of Cori, show the expected outcome: Long is uniformly faster than (this instance of) "put-sync-send", which is labeled "Chain" below:

```
G:       1 AMLong ping-pong roundtrip ReqRep:   0.961 sec    4.807 us     0.397 MB/s
G:       8 AMLong ping-pong roundtrip ReqRep:   0.965 sec    4.827 us     3.161 MB/s
G:      64 AMLong ping-pong roundtrip ReqRep:   1.036 sec    5.178 us    23.575 MB/s
G:     512 AMLong ping-pong roundtrip ReqRep:   1.080 sec    5.400 us   180.857 MB/s
G:    4096 AMLong ping-pong roundtrip ReqRep:   1.623 sec    8.113 us   963.016 MB/s
G:   32768 AMLong ping-pong roundtrip ReqRep:   3.069 sec   15.345 us  4073.012 MB/s
G:  262144 AMLong ping-pong roundtrip ReqRep:  14.539 sec   72.695 us  6878.072 MB/s
G: 2097152 AMLong ping-pong roundtrip ReqRep: 104.891 sec  524.455 us  7626.963 MB/s
L:       1 Chain ping-pong roundtrip ReqRep:    1.465 sec    7.324 us     0.260 MB/s
L:       8 Chain ping-pong roundtrip ReqRep:    1.469 sec    7.345 us     2.078 MB/s
L:      64 Chain ping-pong roundtrip ReqRep:    1.472 sec    7.362 us    16.581 MB/s
L:     512 Chain ping-pong roundtrip ReqRep:    1.539 sec    7.695 us   126.907 MB/s
L:    4096 Chain ping-pong roundtrip ReqRep:    1.929 sec    9.645 us   809.983 MB/s
L:   32768 Chain ping-pong roundtrip ReqRep:    3.400 sec   17.001 us  3676.231 MB/s
L:  262144 Chain ping-pong roundtrip ReqRep:   14.795 sec   73.974 us  6759.159 MB/s
L: 2097152 Chain ping-pong roundtrip ReqRep:  105.202 sec  526.010 us  7604.421 MB/s
```
-
reporter - attached Figure 0.pdf
Discussing this with @Dan Bonachea, it occurred to us that many of the problematic red entries are sending both a Long and a Medium, the aggregate payload of which could instead be carried in a single Medium. Relative to the 2-message case this halves the number of injection overheads incurred, at the expense of additional use of `memcpy()` (the rput payload probably copied at both ends). In addition to halving the number of messages, the overhead paid to pair them up via an `std::map` could also be removed.

As it happens, multiple GASNet-EX conduits already perform a version of this optimization, which we call "Packed Long". Rather than performing a Long using distinct transfers for payload and header (to be matched at the target), we pack them together (for sufficiently small payload and argument counts) into a single message and `memcpy()` the payload into place at the target. Here, also, the cost of matching at the target is eliminated.

The plot I am attaching shows results from several runs of a GASNet-level AMRequestLong ping-pong latency benchmark on Cori/KNL with the default (Packed Long for any size that fits) versus disabling this optimization. As you can see, even with the relatively slow `memcpy()` speed on the KNL, the Packed approach yields lower latency through about 2K.

Of course, applying such an optimization to rput-then-rpc would require some threshold to tune. So, this may not be a magic bullet. However, it might be worth consideration.
-
- changed milestone to 2020.3.0 release
Bulk roll-over of unresolved issues to next milestone
-
- changed milestone to 2020.9.0 release
This was discussed in the 2020-02-12 meeting and deferred to next release milestone.
-
- changed milestone to 2021.3.0 release
Mass roll-over of open issues to next release milestone
-
-
assigned issue to
I'm touching this code, but may or may not pursue a solution
-
assigned issue to
-
- changed title to post-pr110 perf. regression on Cori/KNL for rput-then-rpc
-
- changed milestone to 2021.9.0 release
Mass roll-over of open issues to next release milestone
-
- changed milestone to 2022.3.0 release
Mass roll-over of open issues to next release milestone
-
- changed milestone to 2022.9.0 release
Mass roll-over of open issues to next release milestone
-
- changed title to Consider packing optimization for rput-then-rpc where rpc exceeds Long args, but entire operation fits in one Medium
- removed milestone
- marked as enhancement
Cori is being retired before our next release.
Converting this to an enhancement request for the suggested protocol change, which could reduce packet injection overhead for a certain (uncommon?) class of rput-then-rpc operations, at the cost of increased `memcpy()` overhead on both sides.