Reordering rank placements with UPC++

Issue #84 resolved
Hadia Ahmed created an issue

I am trying to change the ordering of ranks across nodes. When I try to use the -m (--distribution) option with srun, I get the following error from SLURM: #[unset]:_pmi_init:ERROR - PMI does not support SLURM's non-SMP rank distributions. Please use the MPICH_RANK_REORDER_METHOD env variable to obtain non-default rank placements. Aborting job.

When I use MPICH_RANK_REORDER_METHOD=0|2|3, UPC++ crashes with the following assertion failure: *** FATAL ERROR: Assertion failure at gasnetc_init_gni() at /global/u2/h/hahmed/upcxx/.nobs/art/3630c7cc67796df5898000e9cdb63682fbfa293f/GASNet-2017.9.0/gemini-conduit/gasnet_gemini.c:778: status == GNI_RC_SUCCESS

It works only with 1, which is the default. Please find the generated backtrace attached.
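
For readers unfamiliar with the reorder methods named above, here is a minimal sketch of how ranks map to nodes under values 0 (round-robin), 1 (SMP-style, the default) and 2 (folded), as those values are commonly documented for Cray MPICH. Value 3 (a custom order read from a file) is not modeled. The names `Placement`, `node_of_rank` and `ranks_on_node` are hypothetical helpers for illustration, and the sketch assumes every node hosts the same number of ranks; it is not code from Cray, GASNet or UPC++.

```cpp
#include <cassert>
#include <vector>

// Illustrative policies, numbered like the MPICH_RANK_REORDER_METHOD values.
enum class Placement { RoundRobin = 0, Smp = 1, Folded = 2 };

// Node index hosting `rank`, assuming `nodes` nodes with exactly `ppn`
// ranks each (an assumption of this sketch, not a requirement of SLURM).
int node_of_rank(int rank, int nodes, int ppn, Placement p) {
    assert(nodes > 0 && ppn > 0 && rank >= 0 && rank < nodes * ppn);
    switch (p) {
    case Placement::Smp:
        // Consecutive ranks fill one node before moving to the next.
        return rank / ppn;
    case Placement::RoundRobin:
        // Consecutive ranks go to successive nodes, wrapping around.
        return rank % nodes;
    case Placement::Folded: {
        // Like round-robin, but every other pass walks the nodes backwards.
        int pass = rank / nodes;
        int pos = rank % nodes;
        return (pass % 2 == 0) ? pos : nodes - 1 - pos;
    }
    }
    return -1; // unreachable for valid Placement values
}

// All ranks placed on `node`, in increasing rank order.
std::vector<int> ranks_on_node(int node, int nodes, int ppn, Placement p) {
    std::vector<int> result;
    for (int rank = 0; rank < nodes * ppn; ++rank) {
        if (node_of_rank(rank, nodes, ppn, p) == node) {
            result.push_back(rank);
        }
    }
    return result;
}
```

With 4 nodes and 2 ranks per node, node 0 hosts ranks {0, 1} under SMP-style placement, {0, 4} under round-robin and {0, 7} under folded placement, so only the default keeps consecutive ranks on the same node.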

Comments (15)

  1. Paul Hargrove

    @hadia620

    SLURM's srun on Cray platforms does not support anything but the default rank ordering (the -m options).
    So, there is nothing we can be expected to do about that.

    The MPICH_* environment variables are for mpich, and thus should not have any impact on the run of a UPC++ code.

    Line 778 of gasnet_gemini.c is checking that memory registration had succeeded and has nothing to do with ranks at all. So, I have no idea why changing an environment variable we don't use could cause this failure.

    I will investigate, but it looks likely that this is a GASNet-EX error rather than a UPC++ one.

  2. Dan Bonachea

    Worth noting - if the underlying question here is how to disable shared memory bypass between co-located processes and force more realistic network communication delays on a single node, try setting envvar: GASNET_SUPERNODE_MAXSIZE=1.

    The effect won't be identical to a cyclic process distribution (in particular the switch contention will differ), but should introduce the communication delays I think you are looking for. There is probably a similar setting for MPICH, but I don't know it off-hand (nor would I trust that it follows exactly the same loopback path through hardware).

  3. Hadia Ahmed reporter

    We are trying to compare off-node communication between UPC++ and MPI. I forgot to mention that the environment variable setting is working fine with the MPI application.

  4. Paul Hargrove

    Unfortunately, I don't currently see any "right incantation" to get any rank ordering other than the default on the Crays.
    If we had full support for teams, one could create a team with any ordering desired.
    However, that is not yet implemented.

    When I try running a simple GASNet test on Cori, I find that setting MPICH_RANK_REORDER_METHOD=0|2 does indeed lead to a failure. However, in a debug build the failure is an assertion that indicates (if you know the code) that the ranks are not what GASNet expects:

    node 2 error gasnetc_bootstrapExchange_gni() at /global/cscratch1/sd/hargrove/upcnightly-cori/EX-cori-aries-gnu/runtime/src/gasnet/gemini-conduit/gasnet_core.c:321: exchange failed: self data is incorrect
    

    So, it appears that Cray's MPICH_RANK_REORDER_METHOD is affecting (interfering with) something other than Cray's mpich. However, since srun cannot reorder ranks we should probably teach GASNet to honor this env var despite the unfortunate name.

    I have entered GASNet Bug 3647 - aries-conduit should honor MPICH_RANK_REORDER_METHOD

  5. Paul Hargrove

    Good news:
    Setting GASNET_SUPERNODE_MAXSIZE=1 should allow runs with MPICH_RANK_REORDER_METHOD=0|2.

    The bad news is that doing so also disables GASNet's shared-memory support.
    I have a real fix in mind, but not likely to finish it today.

  6. Dan Bonachea

    Good news: Setting GASNET_SUPERNODE_MAXSIZE=1 should allow runs with MPICH_RANK_REORDER_METHOD=0|2.

    The application in question only performs nearest-neighbor communication (ignoring barriers, which should only appear in the timed section for the "bad" version of the code), so this may be a workable interim solution.

  7. Hadia Ahmed reporter

    Does this mean that all ranks will be treated as if they are on different nodes even if they are on the same node?

  8. Dan Bonachea

    Does this mean that all ranks will be treated as if they are on different nodes even if they are on the same node?

    Yes, that is the effect of GASNET_SUPERNODE_MAXSIZE=1 - loopback communication uses the network API rather than shared-memory bypass. However the details of whether that loopback is performed by local firmware (skipping the switch) or travels over the wires to and from the physical switch varies by conduit.
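
To make the effect described above concrete, here is a simplified model of how ranks might be grouped into shared-memory "supernodes" under a size cap. It is a sketch only, not GASNet's actual algorithm: `assign_supernodes` and `uses_shared_memory` are hypothetical names, and treating a cap of 0 as "no limit" is an assumption of this sketch.

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// host_of[r] is the host index of rank r. Returns a supernode id per rank.
// Ranks on the same host share a supernode until it holds max_size members;
// max_size == 0 is treated as "no limit" (an assumption of this sketch).
std::vector<int> assign_supernodes(const std::vector<int>& host_of, int max_size) {
    std::vector<int> supernode(host_of.size());
    // host -> (id of the supernode currently being filled, its member count)
    std::map<int, std::pair<int, int>> open;
    int next_id = 0;
    for (std::size_t r = 0; r < host_of.size(); ++r) {
        auto it = open.find(host_of[r]);
        bool full = it != open.end() && max_size > 0 && it->second.second >= max_size;
        if (it == open.end() || full) {
            // Start a new supernode for this host.
            open[host_of[r]] = std::make_pair(next_id++, 0);
            it = open.find(host_of[r]);
        }
        supernode[r] = it->second.first;
        ++it->second.second;
    }
    return supernode;
}

// In this model, two ranks bypass the network only if they share a supernode.
bool uses_shared_memory(const std::vector<int>& supernode, int a, int b) {
    return supernode[a] == supernode[b];
}
```

In this model, four ranks on hosts {0, 0, 1, 1} form two supernodes when there is no cap, while a cap of 1 puts every rank in its own supernode, so even ranks 0 and 1 on the same host would communicate through the network API.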

  9. Hadia Ahmed reporter

    Ok. I agree this will work as a temporary solution for the cyclic distribution of our application.

  10. Paul Hargrove

    I am finding the "real fix" surprisingly simple (copying logic from the InfiniBand conduit).
    So, I think that correct operation without GASNET_SUPERNODE_MAXSIZE=1 will be possible tomorrow (if using the nightly collaborator-snapshot of GASNet-EX, which will require setting another variable for UPCXX).

  11. Paul Hargrove

    I have completed the "real fix" to allow GASNet (and thus upc++) to work correctly with MPICH_RANK_REORDER_METHOD settings other than the default, without the need to set GASNET_SUPERNODE_MAXSIZE=1 (with its adverse performance implications).

    This fix is in the GASNet-EX "collaborator-snapshot" used by the development version of upcxx.
    Hadia, you will need to start using the 'develop' branch of the upcxx git repo in order to get this fix.
    I would advise that you do so anyway, since there will eventually be other improvements and bug fixes that you will need to take advantage of.

  12. Hadia Ahmed reporter

    I switched to the develop branch, tested the reorder method, and it is working now. Thank you Paul.
