upcxx-run confusing failure when using smp on multiple nodes

Issue #191 resolved
Steven Hofmeyr created an issue

When executing upcxx-run with -N > 1 but with an executable compiled for smp, the script automatically sets GASNET_PSHM_NODES to be equal to the number of ranks, i.e. -n. This can result in failures such as

*** FATAL ERROR: Nodes requested (448) > maximum (255)

There should be a clearer failure message indicating that smp is not supported with multiple nodes.

Comments (4)

  1. Dan Bonachea

    *** FATAL ERROR: Nodes requested (448) > maximum (255)

    @shofmeyr: the error message you've quoted has nothing to do with the upcxx-run -N argument or multi-node. That's the error GASNet smp-conduit reports when asked to spawn more than 255 processes (the default limit for this single-node conduit). You will get the same error for upcxx-run -n 448 my-smp-program, without a -N argument.

    FWIW, the smp-conduit process limit can be raised (at a small cost in conduit metadata memory) by configuring GASNet with --enable-large-pshm.

    The upcxx-run -N argument is simply ignored by upcxx-run for smp-conduit executables: they do not support multi-node operation, so only -N 1 makes sense.

    I think perhaps the behavior should be for upcxx-run to issue a warning if an smp-conduit executable is launched with upcxx-run -N <nodes> where nodes != 1.

    Thoughts?

  2. Steven Hofmeyr reporter

    Yes, exactly. A warning would be good. Then it would be easy to diagnose that sort of error.

  3. Dan Bonachea

    issue #191: issue a warning for upcxx-run -N w/ smp-conduit

    Passing upcxx-run -N nodes for nodes > 1 now issues a warning on smp-conduit, which does not support multi-node operation.

    Resolves issue #191

    → <<cset 193db619d3d0>>

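
For illustration only, the check discussed in this thread could look roughly like the sketch below. This is not the code from the changeset above: the function name `multinode_warning`, the way the conduit name is passed in, and the warning wording are all assumptions made for this example.

```python
# Hypothetical sketch of the -N check discussed in this issue.
# Not the actual upcxx-run implementation; names and message text are assumed.

SMP_CONDUIT = "smp"  # single-node conduit, as described in the thread


def multinode_warning(conduit, nodes):
    """Return a warning string if -N > 1 is requested for smp-conduit, else None.

    conduit: conduit name the executable was built for (assumed to be known)
    nodes:   value passed to -N, or None if -N was not given
    """
    if conduit == SMP_CONDUIT and nodes is not None and nodes > 1:
        return (
            "WARNING: -N %d ignored: smp-conduit does not support "
            "multi-node operation" % nodes
        )
    return None
```

A launcher could print the returned string to stderr and then continue as if -N 1 had been given. The separate 255-process limit would still apply to -n unless GASNet was configured with --enable-large-pshm.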