Consider packing optimization for rput-then-rpc where the rpc exceeds Long args, but the entire operation fits in a single Medium

Issue #261 new
Paul Hargrove created an issue

This issue concerns the performance regression observed on Cori/KNL while testing PR#110. Performance data is in this Google Sheet, with the regression visible in both sheets/tabs (results from two distinct commits). The data was collected using the modifications to the nebr_exchange benchmark made in PR#114.

The problem observed is that while aries-conduit looks fine for the Haswell nodes of Cori, on the KNL nodes we see degradation (indicated by red cells with white text) of over 10% and as high as 32%. By degradation, I mean a reduction in the reported bandwidth relative to the develop branch prior to merging PR#110.

The problem occurs for RPC payloads of 256 and 512 bytes, each of which should be transmitted by a single AM Medium to be reunited with the upcxx::rput payload at the target. However, it is not seen for 1KB RPC payloads, which one would expect to use the same protocol within UPC++.
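
For reference, the user-level operation in question takes roughly the following form (an illustrative sketch, not the literal nebr_exchange code; the function name and buffer sizes are placeholders):

    #include <upcxx/upcxx.hpp>
    #include <vector>

    // rput-then-rpc: the rput payload moves to dst, and the RPC (with its own
    // serialized payload) runs at the target once that data has landed.
    void rput_then_rpc(upcxx::global_ptr<double> dst, const double *src, std::size_t count,
                       std::vector<char> rpc_arg /* 256 or 512 bytes in the slow cases */) {
      upcxx::rput(src, dst, count,
          upcxx::remote_cx::as_rpc([](std::vector<char> v) {
            // executes on the target after the rput payload is visible there
            (void)v;
          }, std::move(rpc_arg)));
    }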

It is only seen for rput payload sizes up to 32KB (stepping by powers of 8). The performance is actually quite good at the next payload size measured (256KB). FWIW, there are no aries-conduit protocol transitions for Puts or Longs between 4KB and 8MB. So, I don't think it is likely that the large change in performance between 32KB and 256KB rput payloads is attributable to a conduit behavior exposed by the switch (made in PR#110) from Put to Long as the means of moving the rput payload.

I will continue, as time allows, to look into aries-conduit for behaviors which might explain the observed results. However, I am shifting assignment of this issue to John in the belief that the answer is more likely to be found within UPC++.

Comments (13)

  1. Paul Hargrove reporter

    I have tried a further modification to the test to force page-alignment of both the RMA and RPC buffers.
    The performance at the 256-byte RPC payload shows no meaningful difference from the version used to generate the "pr110" data in the spreadsheet.

    This seems to rule out the possibility that different handling of source alignment between gex_RMA_PutNB and gex_AM_RequestLong could be responsible for the observed performance regression.

  2. Paul Hargrove reporter

    While brainstorming about possible explanations at the aries-conduit level, Dan observed that, when evaluating the recent conduit changes to eliminate initiator chaining in the implementation of Long, we only compared the performance of Long before and after. We made no comparison of Long against client-level chaining.

    So, I've tried to rule out the possibility that the observed regression was due to some unexpected conduit behavior in which the pre-PR110 use of "put-sync-send" might out-perform RequestLong. I have modified GASNet-EX's testam to include tests of an "emulation" of RequestLong0 via PutBlocking + RequestShort4. Here the 4 arguments of the Short are the high and low words of the destination address and length, providing the semantic equivalent of Long0. ReplyLong0 was used for the "pong" leg of the round-trip, since a Put from AM handler context is out of the question. The following data, collected on KNL nodes of Cori, show the expected outcome: Long is uniformly faster than (this instance of) "put-sync-send", which is labeled "Chain" below. A rough sketch of the emulation follows the data.

    G:       1 AMLong      ping-pong roundtrip ReqRep:   0.961 sec    4.807 us     0.397 MB/s
    G:       8 AMLong      ping-pong roundtrip ReqRep:   0.965 sec    4.827 us     3.161 MB/s
    G:      64 AMLong      ping-pong roundtrip ReqRep:   1.036 sec    5.178 us    23.575 MB/s
    G:     512 AMLong      ping-pong roundtrip ReqRep:   1.080 sec    5.400 us   180.857 MB/s
    G:    4096 AMLong      ping-pong roundtrip ReqRep:   1.623 sec    8.113 us   963.016 MB/s
    G:   32768 AMLong      ping-pong roundtrip ReqRep:   3.069 sec   15.345 us  4073.012 MB/s
    G:  262144 AMLong      ping-pong roundtrip ReqRep:  14.539 sec   72.695 us  6878.072 MB/s
    G: 2097152 AMLong      ping-pong roundtrip ReqRep: 104.891 sec  524.455 us  7626.963 MB/s
    
    L:       1 Chain       ping-pong roundtrip ReqRep:   1.465 sec    7.324 us     0.260 MB/s
    L:       8 Chain       ping-pong roundtrip ReqRep:   1.469 sec    7.345 us     2.078 MB/s
    L:      64 Chain       ping-pong roundtrip ReqRep:   1.472 sec    7.362 us    16.581 MB/s
    L:     512 Chain       ping-pong roundtrip ReqRep:   1.539 sec    7.695 us   126.907 MB/s
    L:    4096 Chain       ping-pong roundtrip ReqRep:   1.929 sec    9.645 us   809.983 MB/s
    L:   32768 Chain       ping-pong roundtrip ReqRep:   3.400 sec   17.001 us  3676.231 MB/s
    L:  262144 Chain       ping-pong roundtrip ReqRep:  14.795 sec   73.974 us  6759.159 MB/s
    L: 2097152 Chain       ping-pong roundtrip ReqRep: 105.202 sec  526.010 us  7604.421 MB/s
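
    Sketch of the emulation (with placeholder names my_tm and chain_reqh_idx, and assuming 64-bit addresses; this is not the literal testam modification):

        #include <gasnetex.h>
        #include <cstdint>

        static gex_TM_t my_tm;                               /* assumed initialized via gex_Client_Init */
        static const gex_AM_Index_t chain_reqh_idx = 200;    /* assumed registered handler index */

        /* Initiator side of "put-sync-send": a blocking Put of the payload, then a
         * Short whose 4 arguments carry the destination address and length as
         * high/low 32-bit words, i.e. the semantic equivalent of RequestLong0. */
        static void chain_request(gex_Rank_t peer, void *dst, const void *src, size_t nbytes) {
          gex_RMA_PutBlocking(my_tm, peer, dst, (void*)src, nbytes, 0);
          uintptr_t addr = (uintptr_t)dst;
          gex_AM_RequestShort4(my_tm, peer, chain_reqh_idx, 0,
                               (gex_AM_Arg_t)(addr >> 32), (gex_AM_Arg_t)(addr & 0xFFFFFFFF),
                               (gex_AM_Arg_t)((uint64_t)nbytes >> 32),
                               (gex_AM_Arg_t)(nbytes & 0xFFFFFFFF));
        }

        /* Target handler: reconstruct what a Long0 handler would have been given.
         * The payload is already in place; the "pong" leg of the ping-pong then
         * uses gex_AM_ReplyLong0, since a Put from handler context is prohibited. */
        static void chain_reqh(gex_Token_t token, gex_AM_Arg_t a_hi, gex_AM_Arg_t a_lo,
                               gex_AM_Arg_t n_hi, gex_AM_Arg_t n_lo) {
          void  *dst    = (void*)(((uintptr_t)(uint32_t)a_hi << 32) | (uint32_t)a_lo);
          size_t nbytes = ((size_t)(uint32_t)n_hi << 32) | (uint32_t)n_lo;
          (void)token; (void)dst; (void)nbytes;
        }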
    
  3. Paul Hargrove reporter

    Discussing this with @Dan Bonachea, it occurred to us that many of the problematic red entries are sending both a Long and a Medium, whose aggregate payload could instead be carried in a single Medium. Relative to the 2-message case this halves the number of injection overheads incurred, at the expense of additional use of memcpy() (the rput payload is probably copied at both ends). In addition to halving the number of messages, the overhead paid to pair the two up via a std::map could also be removed.

    As it happens, multiple GASNet-EX conduits already perform a version of this optimization, which we call "Packed Long". Rather than performing a Long using distinct transfers for payload and header (to be matched at the target), we pack them together (for sufficiently small payload and argument counts) into a single message and memcpy() the payload into place at the target. Here, also, the cost of matching at the target is eliminated.

    The plot I am attaching shows results from several runs of a GASNet-level AMRequestLong ping-pong latency benchmark on Cori/KNL, comparing the default (Packed Long for any size that fits) against runs with this optimization disabled. As you can see, even with the relatively slow memcpy() speed on the KNL, the Packed approach yields lower latency up through about 2KB.

    Of course, applying such an optimization to rput-then-rpc would require tuning a threshold, so this may not be a magic bullet. However, it might be worth consideration; a rough sketch of the idea follows.
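
    Sketch of what the packing could look like at the UPC++ runtime level, assuming the combined payload fits under gex_AM_LUBRequestMedium() (my_tm, packed_reqh_idx and run_rpc() are placeholders, not actual UPC++ internals):

        #include <gasnetex.h>
        #include <cassert>
        #include <cstdint>
        #include <cstring>
        #include <vector>

        static gex_TM_t my_tm;                               /* assumed initialized */
        static const gex_AM_Index_t packed_reqh_idx = 201;   /* assumed registered handler index */
        static void run_rpc(const char *buf, size_t n) {     /* placeholder: deserialize + execute RPC */
          (void)buf; (void)n;
        }

        /* Initiator: one Medium carries both the rput bytes and the serialized RPC,
         * at the cost of an extra source-side memcpy() into a staging buffer. */
        static void packed_rput_then_rpc(gex_Rank_t peer, void *dst,
                                         const void *rput_src, size_t rput_n,
                                         const void *rpc_buf, size_t rpc_n) {
          size_t total = rput_n + rpc_n;
          assert(total <= gex_AM_LUBRequestMedium());        /* else fall back to Long + Medium */
          std::vector<char> tmp(total);
          std::memcpy(tmp.data(), rput_src, rput_n);
          std::memcpy(tmp.data() + rput_n, rpc_buf, rpc_n);
          uintptr_t addr = (uintptr_t)dst;                   /* assumes 64-bit addresses */
          gex_AM_RequestMedium3(my_tm, peer, packed_reqh_idx,
                                tmp.data(), total, GEX_EVENT_NOW, 0,
                                (gex_AM_Arg_t)(addr >> 32), (gex_AM_Arg_t)(addr & 0xFFFFFFFF),
                                (gex_AM_Arg_t)rput_n);
        }

        /* Target: a second memcpy() places the rput bytes, then the RPC runs.
         * No std::map lookup is needed to pair the two halves. */
        static void packed_reqh(gex_Token_t token, void *buf, size_t nbytes,
                                gex_AM_Arg_t a_hi, gex_AM_Arg_t a_lo, gex_AM_Arg_t rput_n) {
          void *dst = (void*)(((uintptr_t)(uint32_t)a_hi << 32) | (uint32_t)a_lo);
          std::memcpy(dst, buf, (size_t)rput_n);
          run_rpc((const char*)buf + rput_n, nbytes - (size_t)rput_n);
          (void)token;
        }

    The crossover point at which two messages (with no extra copies) win back over one packed Medium is exactly the threshold that would need tuning.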

  4. Dan Bonachea

    Cori is being retired before our next release.

    Converting this to an enhancement request for the suggested protocol change, which could reduce packet injection overhead for a certain (uncommon?) class of rput-then-rpc operations, at the cost of increased memcpy() overhead on both sides.
