Unexplained failure with IBM jsrun (not on Summit)

Issue #583 resolved
Paul Hargrove created an issue

I have been helping a MetaHipMer user get UPC++ built on a Summit-like system. In her runs of make check with a correctly configured build of UPC++, she saw every test fail:

Running tests with RANKS=4

Running test-hello_upcxx-ibv
FAILED (exitcode=255)
Running test-alloc-ibv
FAILED (exitcode=1)
Running test-atomics-ibv
FAILED (exitcode=1)
Running test-barrier-ibv
FAILED (exitcode=1)
Running test-collectives-ibv
FAILED (exitcode=1)
[...etc. ...]

Manual runs yield (in part)

Error: "timeout --foreground -k 120s 300s ./test-hello_upcxx-ibv" does not appear to execute a UPC++/GASNet executable

However, manual runs without the timeout --foreground -k 120s 300s wrapper appear to work.

This behavior seems, to me at least, contrary to the logic in bld/Makefile.tests, which attempts to validate the timeout command. So I am at a loss to explain how or why the user is seeing this.
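For reference, the by-hand comparison looks roughly like this (a sketch of my understanding of the two cases; the -np and -network arguments are illustrative, not copied from her exact command line):

# with the timeout wrapper (as make check uses it): produces the error quoted above
upcxx-run -np 4 -network ibv timeout --foreground -k 120s 300s ./test-hello_upcxx-ibv

# the same run without the wrapper: appears to work
upcxx-run -np 4 -network ibv ./test-hello_upcxx-ibv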

I have asked her to run make check TIMEOUT=false to reproduce the without-timeout behavior within the full test automation. I will update this issue when I hear the result, possibly changing the Component if it looks like an issue in our build infrastructure is at fault.
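Concretely, the requested re-run is the following (RANKS=4 is shown only to match the run above; per the request, TIMEOUT=false selects the without-timeout behavior):

make check RANKS=4 TIMEOUT=false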

Comments (3)

  1. Paul Hargrove reporter

    We've determined that the "does not appear to execute a UPC++/GASNet executable" error was due to having the wrong working directory. However, after correcting for that issue, her output from upcxx-run -vvvv -np 4 -network ibv ./test-hello_upcxx-ibv ends with the following from GASNet's wrapper around mpirun (configured to use jsrun on this system). Notably, there is not a single message from GASNet, as would be expected from the GASNET_VERBOSEENV and GASNET_SPAWN_VERBOSE settings implied by -vvvv:

    gasnetrun: running: /usr/tcetmp/bin/jsrun -p 4 --nrs 5 --cpu_per_rs ALL_CPUS --launch_distribution plane:1 --bind packed:40 -E UPCXX_SHARED_HEAP_SIZE -E UPCXX_VERBOSE -E GASNET_SPAWN_VERBOSE -E GASNET_SPAWN_HAVE_MPI -E GASNET_VERBOSEENV -E GASNET_ENVCMD -E GASNET_SPAWN_HAVE_PMI -E GASNET_PSHM_ENABLED -E GASNET_MAX_SEGSIZE -E GASNET_PREFIX -E GASNET_SPAWN_CONTROL -E GASNET_PLATFORM -E GASNET_SPAWN_CONDUIT [redacted full path]/./test-hello_upcxx-ibv

    Meanwhile, we've confirmed that jsrun -p 4 ... works to launch an MPI hello-world.
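    For anyone reproducing this by hand, the verbose output that -vvvv is supposed to trigger can also be requested explicitly; this is just a sketch using two of the variable names from the -E list above, and it should make the absence of any GASNet output even more conspicuous:

    GASNET_VERBOSEENV=1 GASNET_SPAWN_VERBOSE=1 \
      upcxx-run -vvvv -np 4 -network ibv ./test-hello_upcxx-ibv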

  2. Dan Bonachea

    Paul said:

    gasnetrun: running: /usr/tcetmp/bin/jsrun -p 4
    Meanwhile, we've confirmed that jsrun -p 4 ... works to launch an MPI hello-world.

    What about /usr/tcetmp/bin/jsrun -p 4 for MPI hello-world?
    It's not impossible that the /usr/tcetmp/bin/jsrun full path cached at configure time doesn't exist in the batch environment, where jsrun may live elsewhere...
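    A quick way to test that hypothesis from inside the batch environment (just a sketch; mpi_hello stands in for whatever MPI hello-world binary was used):

    # does the configure-time path exist on the batch/compute side?
    ls -l /usr/tcetmp/bin/jsrun

    # which jsrun is actually first on PATH there?
    command -v jsrun

    # does the cached full path behave like the bare 'jsrun -p 4' that worked?
    /usr/tcetmp/bin/jsrun -p 4 ./mpi_hello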

  3. Paul Hargrove reporter

    Turns out this was a "known issue" on the system and not specific to UPC++.

    CUDA 11 causes applications to suffer an early and silent death, while CUDA 10 is fine. A change of environment modules resolved this issue for the user.

    I am unclear on the connection between CUDA and her build of UPC++ without --with-cuda, but suspect that it is related to our use of mpi-spawner plus MPI's use of CUDA.
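    For the record, the fix on the user's side amounted to an environment-module swap, roughly along these lines (the module names here are hypothetical and site-specific, not taken from her report):

    # drop the CUDA 11 module in favor of a CUDA 10 one, then re-run
    module swap cuda/11.0 cuda/10.2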
