Discontiguous job layouts now require `configure --enable-discontig-ranks`

Issue #502 resolved
Paul Hargrove created an issue

Pull request 373 introduced use of GEX_FLAG_PEER_NEVER_NBRHD to remove a dynamic branch and dynamically dead inlined code from the performance-critical RMA paths. That optimization is based on the observation that the UPC++ runtime performs its own shared-memory RMA without calling GASNet-EX.

Unfortunately, the logic to construct local_team allows only contiguous ranges of ranks and this behavior is not likely to ever change (see issue 438). In the presence of discontiguous ranks on the same compute node, there is a very real possibility that local_team is only a subset of the GASNet-EX "nbrhd". In such cases, the use of GEX_FLAG_PEER_NEVER_NBRHD for RMA within the nbrhd is erroneous and rightly asserts in a debug build of GASNet-EX.
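To make the mismatch concrete, here is a minimal sketch of how a contiguous-only team can fall short of the full set of co-located ranks. It is a model, not the UPC++ or GASNet-EX implementation: `host_of`, `contiguous_local_range`, and `nbrhd_size` are hypothetical names, and the assumption that local_team covers the maximal contiguous run of same-node ranks around the caller is only one plausible reading of the behavior described above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open range [lo, hi) of ranks.
struct RankRange {
    std::size_t lo;
    std::size_t hi;
    std::size_t size() const { return hi - lo; }
};

// Hypothetical model: host_of[r] is an opaque node id for rank r.
// Returns the maximal contiguous run of ranks around `me` that share its
// node, i.e. what a contiguous-only local_team construction could cover.
RankRange contiguous_local_range(const std::vector<int>& host_of, std::size_t me) {
    assert(me < host_of.size());
    std::size_t lo = me;
    while (lo > 0 && host_of[lo - 1] == host_of[me]) --lo;
    std::size_t hi = me + 1;
    while (hi < host_of.size() && host_of[hi] == host_of[me]) ++hi;
    return RankRange{lo, hi};
}

// Number of ranks placed on the same node as `me`; models the nbrhd size.
std::size_t nbrhd_size(const std::vector<int>& host_of, std::size_t me) {
    assert(me < host_of.size());
    std::size_t n = 0;
    for (int h : host_of) {
        if (h == host_of[me]) ++n;
    }
    return n;
}
```

With a cyclic placement such as `{0, 1, 0, 1}` (ranks 0 and 2 on one node, 1 and 3 on the other), rank 0 gets a contiguous range of size 1 while two ranks share its node, so rank 2 is in the neighborhood but outside the modeled local_team. With a block placement such as `{0, 0, 1, 1}` the two sizes agree.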

I believe elimination of these assertion failures is a blocker for the upcoming release.
I will propose three distinct options in the comments.

Comments (5)

  1. Paul Hargrove reporter

    Options that I am aware of:

    1. Revert pull request 373.
    2. Make "discontiguous layouts" an error at runtime, instructing the user to fix their job spawn. As noted in issue 438, doing this cleanly requires a reduction at startup. A less clean approach would be to check locally that the sizes of local_team and the nbrhd are equal, and potentially exit non-collectively when they are not.
    3. Add a configure option to determine whether discontiguous layouts are permitted. If such layouts are permitted, then use of GEX_FLAG_PEER_NEVER_NBRHD would be disabled statically at library compile time. If such layouts are prohibited, then the flag would be used, and "option 2" behavior would prevail in the presence of a prohibited layout: an explanatory error at startup.

    My current preference is for option 3, with discontiguous layout prohibited by default.
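A rough sketch of how option 3 could combine a build-time switch with the local size check from option 2. Everything named here is hypothetical: `SKETCH_ALLOW_DISCONTIG_RANKS` stands in for whatever macro a configure option would define, and `check_rank_layout` takes the two sizes as plain arguments rather than querying the runtime. It is meant only to illustrate the control flow, not the eventual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical build-time switch; 0 (prohibited) is the assumed default.
#ifndef SKETCH_ALLOW_DISCONTIG_RANKS
#define SKETCH_ALLOW_DISCONTIG_RANKS 0
#endif

constexpr bool allow_discontig_ranks = (SKETCH_ALLOW_DISCONTIG_RANKS != 0);

// The "peer is never in the neighborhood" hint is only safe when
// discontiguous layouts are prohibited, so it is decided statically.
constexpr bool use_never_nbrhd_hint = !allow_discontig_ranks;

// Local (non-collective) startup check. Returns an empty string when the
// layout is acceptable, otherwise an explanatory error message.
std::string check_rank_layout(std::size_t local_team_size, std::size_t nbrhd_size) {
    // A contiguous local team can never exceed the set of co-located ranks.
    assert(local_team_size >= 1 && local_team_size <= nbrhd_size);
    if (allow_discontig_ranks || local_team_size == nbrhd_size) {
        return std::string();
    }
    return "discontiguous rank layout detected: local_team covers " +
           std::to_string(local_team_size) + " of " +
           std::to_string(nbrhd_size) +
           " co-located ranks; fix the job spawn or rebuild with "
           "discontiguous layouts permitted";
}
```

In a build that prohibits discontiguous layouts, a caller would run `check_rank_layout` once at startup and abort with the returned message if it is non-empty. Because each process checks only its own sizes, the exit is non-collective, which is the "less clean" trade-off noted in option 2.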

  2. Dan Bonachea

    issue #502: Add configure --enable-discontig-ranks

    By default we now prohibit discontiguous rank layouts with a hard error at startup, unless the library was configured with --enable-discontig-ranks.

    Fixes issue #502.

    → <<cset 8f56d7537945>>
