- changed status to new
make check fails
This is how I arrive at the failing tests:
wget https://bitbucket.org/berkeleylab/upcxx/downloads/upcxx-2023.3.0.tar.gz
tar xf upcxx-2023.3.0.tar.gz
cd upcxx-2023.3.0
./configure
make all
make check
The output is the following:
<details>
Building dependencies...
************
Compiling and running tests for the default network, NETWORKS='udp'.
Please, ensure you are in a proper environment for launching parallel jobs
(eg batch system session, if necessary) or the run step may fail.
************
Compiling test-hello_upcxx-udp SUCCESS
Compiling test-alloc-udp SUCCESS
Compiling test-atomics-udp SUCCESS
Compiling test-barrier-udp SUCCESS
Compiling test-collectives-udp SUCCESS
Compiling test-dist_object-udp SUCCESS
Compiling test-future-udp SUCCESS
Compiling test-global_ptr-udp SUCCESS
Compiling test-local_team-udp SUCCESS
Compiling test-memory_kinds-udp SUCCESS
Compiling test-rpc_barrier-udp SUCCESS
Compiling test-rpc_ff_ring-udp SUCCESS
Compiling test-rput-udp SUCCESS
Compiling test-vis-udp SUCCESS
Compiling test-uts_ranks-udp SUCCESS
Compiling test-persona-example-udp SUCCESS
Compiling test-rput_thread-udp SUCCESS
Compiling test-view-udp SUCCESS
Result reports: /scratch/students/apptest/tmp/upcxx-2023.3.0/test-results/login02.lisc_2023-06-22_22:24:24
PASSED compiling 18 tests
Running tests with RANKS=4
Running test-hello_upcxx-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-alloc-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-atomics-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-barrier-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-collectives-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-dist_object-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-future-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-global_ptr-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-local_team-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-memory_kinds-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-rpc_barrier-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-rpc_ff_ring-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-rput-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-vis-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-uts_ranks-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-persona-example-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-rput_thread-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
Running test-view-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
</details>
I’m on an HPC system with Slurm:
$ uname -r
4.18.0-477.13.1.el8_8.x86_64
$ lsb_release -dv
LSB Version: :core-4.1-amd64:core-4.1-noarch
Description: Oracle Linux Server release 8.8
$ sbatch --version
slurm 23.02.3
I guess that the message at the beginning of the make check output should be a clear hint, but I don’t know what it means. Could somebody explain it, point me to a tutorial (or other literature), and/or guide me step by step through what I have to do to get the make check command to succeed?
Thanks in advance.
Comments (9)
-
ZS,
I am afraid a "point me to a tutorial (or other literature) and/or guide" is not possible since there is such a wide variety of configurations for HPC systems. However, we'll do our best to help you out here in the issue tracker.
If you are running make check within a Slurm allocation of nodes (such as via salloc or sbatch), then you are in the "proper environment for launching parallel jobs" mentioned in the output.
Once you are certain you are running in such an environment AND there is no high-speed network such as InfiniBand, then running udp-conduit jobs may be as simple as adding the following three commands before make check (or any use of upcxx-run to launch UPC++ executables):
export GASNET_SPAWNFN='C'
export GASNET_CSPAWN_CMD='srun -n %N %C'
export GASNET_WORKER_RANK='SLURM_PROCID'
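For example, here is a minimal sketch of a batch script wrapping make check with those settings (the node/task counts, time limit, and path are placeholders to adapt to your site):
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --time=00:30:00
# udp-conduit: spawn the worker processes via srun rather than ssh
export GASNET_SPAWNFN='C'
export GASNET_CSPAWN_CMD='srun -n %N %C'
export GASNET_WORKER_RANK='SLURM_PROCID'
cd /path/to/upcxx-2023.3.0   # placeholder: your build directory
make check
Submitted with sbatch, this runs the same checks; with salloc you can instead set the exports in the interactive shell before running make check.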
However, if there is a high-speed network such as InfiniBand or Omni-Path, then we should determine why it has not been detected at configure time (if it had been, then udp-conduit would not be the default).
-Paul
-
- changed component to Support: Installation
- removed milestone
- marked as task
-
reporter Hi Paul,
thanks a lot for your time! Thanks to your explanation and hints, I got it running. The problem was two-fold:
1.) the make check command was run in bash on a single node
2.) the make process had been executed in a local temporary directory in /tmp/.
So, moving the data to a shared network file system and running it on 4 nodes did the trick for me:
salloc --nodes=4 make check
The tests are namely performed with RANKS=4.
For future reference, these are the errors I encountered (only the output of the last test is shown):
running locally:
$ make check
...
Running test-view-udp
*** GASNET ERROR: Environment variable SSH_SERVERS is missing.
*** FATAL ERROR: Error spawning SPMD worker threads. Exiting...
FAILED (exitcode=134)
running in “proper environment” with not enough nodes:
$ salloc --nodes=2 make check
...
Running test-view-udp
FAILED (exitcode=143)
running in “proper environment” without shared file system:
$ salloc --nodes=4 make check
...
FAILED (exitcode=143)
Running test-view-udp
slurmstepd: error: couldn't chdir to `/tmp/tmp.SJMDlroBs7/upcxx-2023.3.0': No such file or directory: going to /tmp instead
slurmstepd: error: couldn't chdir to `/tmp/tmp.SJMDlroBs7/upcxx-2023.3.0': No such file or directory: going to /tmp instead
slurmstepd: error: couldn't chdir to `/tmp/tmp.SJMDlroBs7/upcxx-2023.3.0': No such file or directory: going to /tmp instead
slurmstepd: error: couldn't chdir to `/tmp/tmp.SJMDlroBs7/upcxx-2023.3.0': No such file or directory: going to /tmp instead
slurmstepd: error: couldn't chdir to `/tmp/tmp.SJMDlroBs7/upcxx-2023.3.0': No such file or directory: going to /tmp instead
slurmstepd: error: couldn't chdir to `/tmp/tmp.SJMDlroBs7/upcxx-2023.3.0': No such file or directory: going to /tmp instead
timeout: failed to run command ‘./test-view-udp’: No such file or directory
timeout: failed to run command ‘./test-view-udp’: No such file or directory
timeout: failed to run command ‘./test-view-udp’: No such file or directory
slurmstepd: error: run_script_as_user: couldn't change working dir to /tmp/tmp.SJMDlroBs7/upcxx-2023.3.0: No such file or directory
slurmstepd: error: run_script_as_user: couldn't change working dir to /tmp/tmp.SJMDlroBs7/upcxx-2023.3.0: No such file or directory
slurmstepd: error: run_script_as_user: couldn't change working dir to /tmp/tmp.SJMDlroBs7/upcxx-2023.3.0: No such file or directory
timeout: failed to run command ‘./test-view-udp’: No such file or directory
FAILED (exitcode=143)
running in “proper environment” with enough nodes and with current working directory on network-/shared filesystem:
$ salloc --nodes=4 make check
Running test-view-udp
Test result: SUCCESS (rank 0/4: nodeb18.lisc)
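As a side note, here is a quick sketch for checking up front whether the current working directory is actually visible from the compute nodes (run from inside the allocation; the flags are just examples):
srun --ntasks-per-node=1 ls -d "$PWD"   # should print the path once per node, without errors
df -hT .                                # shows whether the directory is on a shared filesystem or a node-local one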
Does this mean that upcxx can only run on multiple nodes? Could I somehow run the tests performed by make check on a single node?
Thanks!
-
reporter - changed status to open
-
reporter - changed status to resolved
-
reporter - changed status to closed
-
zs,
I am pleased to hear that things are (mostly) working for you.
I cannot immediately think of anything in UPC++ which would account for the failures to run 4 processes on 2 nodes, but I might be overlooking something. It should work. So, my best guess relates to Slurm. Can you please try the following (with the same environment variable settings):
salloc --nodes=2 --ntasks=4 make check
This differs from your previous attempt in that it is telling Slurm how many processes you plan to run. Similarly, the following is hopefully sufficient to run on a single node:
salloc --nodes=1 --ntasks=4 make check
-Paul
EDIT: my original post had --tasks where --ntasks was intended.
-
reporter Hi Paul,
interestingly, the check did not work (even with --nodes=4). What did work, though, was the following:
export GASNET_SPAWNFN='C'
export GASNET_CSPAWN_CMD='srun -n %N %C'
export GASNET_WORKER_RANK='SLURM_PROCID'
make check
# these worked too
salloc --nodes=2 --ntasks=4 make check
salloc --nodes=1 --ntasks=4 make check
salloc make check
I did not even have to prepend salloc to the make check command. So I added these variables to the module; a rough sketch of what that setup amounts to is below.
Thanks!
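For reference, a minimal sketch of those settings as a plain shell file (the file name is made up; on a site using Lmod or environment-modules, the equivalent setenv calls would live in the modulefile instead):
# upcxx-udp-slurm.sh -- hypothetical helper; source it before building or running
# Use GASNet's "custom" spawner rather than ssh-based spawning:
export GASNET_SPAWNFN='C'
# Have srun launch the %N worker processes running command %C:
export GASNET_CSPAWN_CMD='srun -n %N %C'
# Workers read their rank from Slurm's SLURM_PROCID:
export GASNET_WORKER_RANK='SLURM_PROCID'
# usage: source ./upcxx-udp-slurm.sh && make check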