make check fails

Issue #612 closed
zs created an issue

This is how I arrive at the failed test:

wget https://bitbucket.org/berkeleylab/upcxx/downloads/upcxx-2023.3.0.tar.gz
tar xf upcxx-2023.3.0.tar.gz 
cd upcxx-2023.3.0
./configure 
make all
make check

The output is the following:
<details>

Building dependencies...
************
Compiling and running tests for the default network, NETWORKS='udp'.
Please, ensure you are in a proper environment for launching parallel jobs
(eg batch system session, if necessary) or the run step may fail.
************

Compiling test-hello_upcxx-udp                                         SUCCESS
Compiling test-alloc-udp                                               SUCCESS
Compiling test-atomics-udp                                             SUCCESS
Compiling test-barrier-udp                                             SUCCESS
Compiling test-collectives-udp                                         SUCCESS
Compiling test-dist_object-udp                                         SUCCESS
Compiling test-future-udp                                              SUCCESS
Compiling test-global_ptr-udp                                          SUCCESS
Compiling test-local_team-udp                                          SUCCESS
Compiling test-memory_kinds-udp                                        SUCCESS
Compiling test-rpc_barrier-udp                                         SUCCESS
Compiling test-rpc_ff_ring-udp                                         SUCCESS
Compiling test-rput-udp                                                SUCCESS
Compiling test-vis-udp                                                 SUCCESS
Compiling test-uts_ranks-udp                                           SUCCESS
Compiling test-persona-example-udp                                     SUCCESS
Compiling test-rput_thread-udp                                         SUCCESS
Compiling test-view-udp                                                SUCCESS

Result reports: /scratch/students/apptest/tmp/upcxx-2023.3.0/test-results/login02.lisc_2023-06-22_22:24:24

PASSED compiling 18 tests

Running tests with RANKS=4
Running test-hello_upcxx-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-alloc-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-atomics-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-barrier-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-collectives-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-dist_object-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-future-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-global_ptr-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-local_team-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-memory_kinds-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-rpc_barrier-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-rpc_ff_ring-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-rput-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-vis-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-uts_ranks-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-persona-example-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-rput_thread-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-view-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)

</details>

I’m on an HPC system with Slurm:

$ uname -r
4.18.0-477.13.1.el8_8.x86_64
$ lsb_release -dv
LSB Version:    :core-4.1-amd64:core-4.1-noarch
Description:    Oracle Linux Server release 8.8
$ sbatch --version
slurm 23.02.3

I guess the message at the beginning of the make check output is meant to be a clear hint, but I don’t know what it means. Could somebody explain it, point me to a tutorial (or other literature) and/or guide me step by step through what I have to do to get make check to succeed?

Thanks in advance.

Comments (9)

  1. Paul Hargrove

    ZS,

    I am afraid a "point me to a tutorial (or other literature) and/or guide" is not possible since there is such a wide variety of configurations for HPC systems. However, we'll do our best to help you out here in the issue tracker.

    If you are running make check within a Slurm allocation of nodes, such as via salloc or sbatch, then you are in the "proper environment for launching parallel jobs" mentioned in the output.

    Once you are certain you are running in such an environment AND there is no high-speed network such as InfiniBand, then running udp-conduit jobs may be as simple as adding the following three commands before make check (or any use of upcxx-run to launch UPC++ executables):

    export GASNET_SPAWNFN='C'
    export GASNET_CSPAWN_CMD='srun -n %N %C'
    export GASNET_WORKER_RANK='SLURM_PROCID'
    

    However, if there is a high-speed network such as InfiniBand or Omni-Path, then we should determine why it has not been detected at configure time (if it had been, then udp-conduit would not be the default).

    -Paul
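
For readers who prefer not to export the three settings into their whole shell session, they can be scoped to a single command. This is only a minimal sketch: the function name `with_srun_spawner` is hypothetical and not part of UPC++ or GASNet, and the values are simply the ones quoted in the comment above.

```shell
#!/bin/sh
# Hypothetical helper (not part of UPC++ or GASNet): run one command with the
# udp-conduit spawner settings from the comment above, leaving the calling
# shell's environment unchanged.
with_srun_spawner() {
    env GASNET_SPAWNFN='C' \
        GASNET_CSPAWN_CMD='srun -n %N %C' \
        GASNET_WORKER_RANK='SLURM_PROCID' \
        "$@"
}

# Example use (assumes srun is available on the system):
#   with_srun_spawner make check
```

Because the variables are passed through `env`, they apply only to the wrapped command and its children, which makes it easy to compare a run with and without them.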

  2. zs reporter

    Hi Paul,

    thanks a lot for your time! Thanks to your explanation and hints, I got it running. The problem was two-fold:
    1.) make check was run in bash on a single node
    2.) the make process was run in a local temporary directory under /tmp/.

    So, moving the data to a shared network file system and running on 4 nodes did the trick for me:

    salloc --nodes=4 make check  
    

    After all, the tests are run with RANKS=4.

    For future reference, these are the errors I encountered (only output of last test):

    running locally:

    $ make check
    ...
    Running test-view-udp
    *** GASNET ERROR: Environment variable SSH_SERVERS is missing.
    *** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
    FAILED (exitcode=134)
    

    running in “proper environment” with not enough nodes:

    $ salloc --nodes=2 make check
    ...
    Running test-view-udp
    FAILED (exitcode=143)
    

    running in “proper environment” without shared file system:

    $ salloc --nodes=4 make check
    ...
    FAILED (exitcode=143)
    Running test-view-udp
    slurmstepd: error: couldn't chdir to `/tmp/tmp.SJMDlroBs7/upcxx-2023.3.0': No such file or directory: going to /tmp instead
    slurmstepd: error: couldn't chdir to `/tmp/tmp.SJMDlroBs7/upcxx-2023.3.0': No such file or directory: going to /tmp instead
    slurmstepd: error: couldn't chdir to `/tmp/tmp.SJMDlroBs7/upcxx-2023.3.0': No such file or directory: going to /tmp instead
    slurmstepd: error: couldn't chdir to `/tmp/tmp.SJMDlroBs7/upcxx-2023.3.0': No such file or directory: going to /tmp instead
    slurmstepd: error: couldn't chdir to `/tmp/tmp.SJMDlroBs7/upcxx-2023.3.0': No such file or directory: going to /tmp instead
    slurmstepd: error: couldn't chdir to `/tmp/tmp.SJMDlroBs7/upcxx-2023.3.0': No such file or directory: going to /tmp instead
    timeout: failed to run command ‘./test-view-udp’: No such file or directory
    timeout: failed to run command ‘./test-view-udp’: No such file or directory
    timeout: failed to run command ‘./test-view-udp’: No such file or directory
    slurmstepd: error: run_script_as_user: couldn't change working dir to /tmp/tmp.SJMDlroBs7/upcxx-2023.3.0: No such file or directory
    slurmstepd: error: run_script_as_user: couldn't change working dir to /tmp/tmp.SJMDlroBs7/upcxx-2023.3.0: No such file or directory
    slurmstepd: error: run_script_as_user: couldn't change working dir to /tmp/tmp.SJMDlroBs7/upcxx-2023.3.0: No such file or directory
    timeout: failed to run command ‘./test-view-udp’: No such file or directory
    FAILED (exitcode=143)
    

    running in “proper environment” with enough nodes and with the current working directory on a shared network file system:

    $ salloc --nodes=4 make check
    Running test-view-udp
    Test result: SUCCESS (rank 0/4: nodeb18.lisc)
    

    Does this mean that upcxx can only run on multiple nodes? Could I somehow run the tests performed by make check on a single node?

    Thanks!
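
The "No such file or directory" failures above come from building in a node-local /tmp directory that the other nodes cannot see. A quick sanity check before running make check is to look at the filesystem type of the build directory. The following is a rough sketch, not a definitive test: both function names are hypothetical, the list of type names treated as shared is an assumption that will differ between sites, and `stat -f -c %T` is the GNU coreutils form of the command.

```shell
#!/bin/sh
# Hypothetical helper: classify a filesystem type name as "shared" or "local".
# ASSUMPTION: the names listed here are examples only; adjust them for the
# filesystems actually mounted at your site.
classify_fs_type() {
    case "$1" in
        nfs|lustre|gpfs|ceph|afs) echo shared ;;
        *) echo local ;;
    esac
}

# Hypothetical helper: print the filesystem type of a directory
# (default: the current directory). Requires GNU coreutils stat.
build_dir_fs_type() {
    stat -f -c %T "${1:-.}"
}

# Example use, run from the top of the build tree:
#   classify_fs_type "$(build_dir_fs_type .)"
```

A result of "local" for the build directory would be a hint to move the source tree to a shared location before launching on more than one node.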

  3. Paul Hargrove

    zs,

    I am pleased to hear that things are (mostly) working for you.

    I cannot immediately think of anything in UPC++ which would account for the failures to run 4 processes on 2 nodes, but I might be overlooking something. It should work. So, my best guess relates to Slurm. Can you please try the following (with the same environment variable settings):

    salloc --nodes=2 --ntasks=4 make check

    This differs from your previous attempt in that it is telling Slurm how many processes you plan to run. Similarly, the following is hopefully sufficient to run on a single node:

    salloc --nodes=1 --ntasks=4 make check

    -Paul

    EDIT: my original post had --tasks where --ntasks was intended.
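
To see whether Slurm was told about enough processes, the task count of the current allocation can be compared with the number of ranks the tests use (RANKS=4 in the output above). This is a diagnostic sketch only, under the assumption that Slurm exports `SLURM_NTASKS` when `--ntasks` is given; the function name `have_enough_tasks` is hypothetical. A failing check does not prove that a run will fail, since behaviour depends on how the site has configured Slurm.

```shell
#!/bin/sh
# Hypothetical diagnostic: succeed if the current Slurm allocation advertises
# at least as many tasks as the number of ranks passed as the first argument.
# ASSUMPTION: SLURM_NTASKS is set by Slurm when --ntasks was requested;
# if it is unset, the task count is treated as 0.
have_enough_tasks() {
    [ "${SLURM_NTASKS:-0}" -ge "$1" ]
}

# Example use inside an allocation:
#   have_enough_tasks 4 || echo "allocation advertises fewer than 4 tasks"
```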

  4. zs reporter

    Hi Paul,

    interestingly, the check did not work (even with --nodes=4). What did work, though, was the following:

    export GASNET_SPAWNFN='C'
    export GASNET_CSPAWN_CMD='srun -n %N %C'
    export GASNET_WORKER_RANK='SLURM_PROCID'
    make check
    # these worked too
    salloc --nodes=2 --ntasks=4 make check
    salloc --nodes=1 --ntasks=4 make check
    salloc make check
    

    I did not even have to prepend salloc to the make check command. So I added these variables to the module.

    Thanks!
