UPC++ installation and spawning on InfiniBand system
@jeremieLagraviere wrote:
Dear UPC++ Users,
A long time ago I encountered this same problem, but I got busy with something else and never actually solved it.
My program
Performs a rather simple sparse matrix-vector multiplication.
Works fine on a single node (smp), delivers consistent performance depending on the parameters, and performs as well as the identical UPC version I already have.
Toolchain
UPCXX Installation
UPC++ is installed like this on the supercomputer I am using:
module load Python/2.7.14-GCCcore-6.4.0-bare
module load openmpi.gnu/3.1.2
export GASNET_CONFIGURE_ARGS='--enable-pshm --disable-pshm-posix --enable-pshm-sysv --disable-aligned-segments --enable-segment-large --disable-fca --without-multiconf --enable-sptr-struct --disable-debug --enable-ibv'
export CC=/cluster/software/VERSIONS/gcc-7.2.0/bin/gcc
export CXX=/cluster/software/VERSIONS/gcc-7.2.0/bin/g++
export MPI_CC=/cluster/software/VERSIONS/openmpi.gnu-3.1.2/bin/mpicc
export MPI_CXX=/cluster/software/VERSIONS/openmpi.gnu-3.1.2/bin/mpicxx
export MPI_LIBS=/cluster/software/VERSIONS/openmpi.gnu-3.1.2/lib/libmpi.so
export MPI_CFLAGS="-O3"
export CFLAGS="-O3"
export LDFLAGS="-O3"
export CXXFLAGS="-O3"
./install installed
NOTE: This does NOT enable the mpi network conduit! Do you know why?
I am totally ready to change the settings and recompile UPC++ if any parameter seems wrong or incorrect to you.
Going multinode (starting with 2 nodes)
Compiling
I compile my program using this command:
$ make compileLegit
export UPCXX_GASNET_CONDUIT=ibv && export LD_LIBRARY_PATH=/cluster/software/VERSIONS/gcc-7.2.0/lib64:/cluster/software/VERSIONS/openmpi.gnu-3.1.2/lib:/cluster/software/EASYBUILD/Python/2.7.14-GCCcore-6.4.0-bare/lib && /usit/abel/u1/jeremie/myRepo/compilers/BUPCXX_GCC/installed/bin/upcxx -O3 main.cpp tools.cpp mainComputation.cpp fileReader.cpp timeManagement.cpp -I. -Iincludes/ -L/cluster/software/VERSIONS/openmpi.gnu-3.1.2/lib -lmpi -lm -o upcxxProgram/upcxxSpmv
Running
On the supercomputer I am using, I use these SLURM scripts:
Using MPIRUN
#!/bin/bash
#[...] parameters that are not relevant for debugging such as job name, time and account
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --mem-per-cpu=3900MB
#3025MB
source /cluster/bin/jobsetup
ulimit -c 0
#export GASNET_PHYSMEM_NOPROBE=1
export GASNET_PHYSMEM_MAX=54G
export UPCXX_SEGMENT_MB=3300
export GASNET_MAX_SEGSIZE=54G
module load Python/2.7.14-GCCcore-6.4.0-bare
module load openmpi.gnu/3.1.2
cd /cluster/home/jeremie/myRepo/pgm-jlg-upc-svn/trunk/v1.4.1
myDate="`date +%Y-%m-%d-%H.%M`"
export UPCXX_GASNET_CONDUIT=ibv
export LD_LIBRARY_PATH=/cluster/software/VERSIONS/gcc-7.2.0/lib64:/cluster/software/VERSIONS/openmpi.gnu-3.1.2/lib:/cluster/software/EASYBUILD/Python/2.7.14-GCCcore-6.4.0-bare/lib
mpirun -np 32 upcxxProgram/upcxxSpmv /cluster/home/jeremie/myRepo/pgm-jlg-upc-svn/trunk//dataset//D67MPI3Dheart.55 100 65536 &> /cluster/home/jeremie/myRepo/pgm-jlg-upc-svn/trunk/v1.4.1/output/output.$myDate
Using UPCXX-RUN
#!/bin/bash
#[...] parameters that are not relevant for debugging such as job name, time and account
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --mem-per-cpu=3900MB
#3025MB
source /cluster/bin/jobsetup
ulimit -c 0
#export GASNET_PHYSMEM_NOPROBE=1
export GASNET_PHYSMEM_MAX=54G
export UPCXX_SEGMENT_MB=3300
export GASNET_MAX_SEGSIZE=54G
module load Python/2.7.14-GCCcore-6.4.0-bare
module load openmpi.gnu/3.1.2
cd /cluster/home/jeremie/myRepo/pgm-jlg-upc-svn/trunk/v1.4.1
myDate="`date +%Y-%m-%d-%H.%M`"
export UPCXX_GASNET_CONDUIT=ibv
export LD_LIBRARY_PATH=/cluster/software/VERSIONS/gcc-7.2.0/lib64:/cluster/software/VERSIONS/openmpi.gnu-3.1.2/lib:/cluster/software/EASYBUILD/Python/2.7.14-GCCcore-6.4.0-bare/lib
export UPCXX_GASNET_CONDUIT=ibv && export UPCXX_SEGMENT_MB=1643 && export GASNET_MAX_SEGSIZE=59000MB && export GASNET_PSHM_NODES=32 && export LD_LIBRARY_PATH=/cluster/software/VERSIONS/gcc-7.2.0/lib64:/cluster/software/VERSIONS/openmpi.gnu-3.1.2/lib:/cluster/software/EASYBUILD/Python/2.7.14-GCCcore-6.4.0-bare/lib && /usit/abel/u1/jeremie/myRepo/compilers/BUPCXX_GCC/installed/bin/upcxx-run -n 32 upcxxProgram/upcxxSpmv ../dataset/D67MPI3Dheart.55 100 65536 &> /cluster/home/jeremie/myRepo/pgm-jlg-upc-svn/trunk/v1.4.1/output/output.$myDate
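Editor's sketch, not part of the original post: the two scripts above request 16 tasks per node with UPCXX_SEGMENT_MB=3300 (mpirun script) and UPCXX_SEGMENT_MB=1643 (upcxx-run command line), alongside GASNET_PHYSMEM_MAX=54G. Assuming that UPCXX_SEGMENT_MB is a per-process value and that the `G` suffix means 1024 MB, the per-node segment demand can be compared against that limit as follows. The function names are hypothetical helpers, not UPC++ or GASNet tools.

```shell
#!/bin/sh
# Per-node shared-segment demand = per-process segment size x processes per node.
# Assumption: UPCXX_SEGMENT_MB applies to each process separately.
segment_demand_mb() {
    # $1 = segment size in MB per process, $2 = processes per node
    echo $(( $1 * $2 ))
}

# Succeeds (exit status 0) when the demand does not exceed the limit.
fits_under_limit() {
    # $1 = demand in MB, $2 = limit in MB
    [ "$1" -le "$2" ]
}

# GASNET_PHYSMEM_MAX=54G from the scripts, assuming G means 1024 MB.
limit_mb=$(( 54 * 1024 ))

# 3300 is the value in the mpirun script, 1643 the one on the upcxx-run line.
for seg in 3300 1643; do
    demand=$(segment_demand_mb "$seg" 16)
    if fits_under_limit "$demand" "$limit_mb"; then
        echo "UPCXX_SEGMENT_MB=$seg x 16 = $demand MB: fits under $limit_mb MB"
    else
        echo "UPCXX_SEGMENT_MB=$seg x 16 = $demand MB: exceeds $limit_mb MB"
    fi
done
```

Under these assumptions both settings stay below the configured limit, so the segment sizes alone would not explain the failures reported below.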
Problem
When running my job with the script (MPIRUN) above, I get these errors:
*** FATAL ERROR: Requested spawner "(not set)" is unknown or not supported in this build
WARNING: Ignoring call to gasneti_print_backtrace_ifenabled before gasneti_backtrace_init
[...] same error again and again
WARNING: Ignoring call to gasneti_print_backtrace_ifenabled before gasneti_backtrace_init
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
*** FATAL ERROR: Requested spawner "(not set)" is unknown or not supported in this build
WARNING: Ignoring call to gasneti_print_backtrace_ifenabled before gasneti_backtrace_init
*** FATAL ERROR: Requested spawner "(not set)" is unknown or not supported in this build
WARNING: Ignoring call to gasneti_print_backtrace_ifenabled before gasneti_backtrace_init
*** FATAL ERROR: Requested spawner "(not set)" is unknown or not supported in this build
WARNING: Ignoring call to gasneti_print_backtrace_ifenabled before gasneti_backtrace_init
--------------------------------------------------------------------------
mpirun noticed that process rank 16 with PID 20057 on node c5-20 exited on signal 6 (Aborted).
When running my job with the script (UPCXX-RUN) above, I get these errors:
export UPCXX_GASNET_CONDUIT=ibv && export UPCXX_SEGMENT_MB=1643 && export GASNET_MAX_SEGSIZE=59000MB && export GASNET_PSHM_NODES=32 && export LD_LIBRARY_PATH=/cluster/software/VERSIONS/gcc-7.2.0/lib64:/cluster/software/VERSIONS/openmpi.gnu-3.1.2/lib:/cluster/software/EASYBUILD/Python/2.7.14-GCCcore-6.4.0-bare/lib && /usit/abel/u1/jeremie/myRepo/compilers/BUPCXX_GCC/installed/bin/upcxx-run -n 32 upcxxProgram/upcxxSpmv ../dataset/D67MPI3Dheart.55 100 65536
*** Failed to start processes on c1-17
*** FATAL ERROR: One or more processes died before setup was completed
WARNING: Ignoring call to gasneti_print_backtrace_ifenabled before gasneti_backtrace_init
/bin/sh: line 1: 22261 Aborted /usit/abel/u1/jeremie/myRepo/compilers/BUPCXX_GCC/installed/bin/upcxx-run -n 32 upcxxProgram/upcxxSpmv ../dataset/D67MPI3Dheart.55 100 65536
make: *** [runLegit] Error 134
Questions
Should I try to get the MPI conduit enabled when building UPC++? If yes, what should I change in the commands and environment variables shown earlier?
Should I compile using the MPI or the IBV conduit (UPCXX_GASNET_CONDUIT=ibv or UPCXX_GASNET_CONDUIT=mpi)?
Should I run my program on multiple nodes using mpirun or upcxx-run?
How do I get rid of the error about the requested spawner "(not set)"?
Thank you in advance for your help. --Jeremie
Comments (5)
-
-
reporter Hi Jeremie - My apologies for the trouble you encountered; let's get this fixed for you.
First, can you please confirm you are using the current v2018.9.0 release package of UPC++? I notice that your previous issue #109 was using a much older release, so I want to ensure we are discussing the same software.
I suspect the root problem here is that the GASNet configure detection of MPI support is failing, which silently disables both mpi-conduit and MPI spawning support in ibv-conduit. This in turn leads to the 'Requested spawner "(not set)"' error you are seeing when trying to launch with mpirun. MPI compatibility is not mandatory to run UPC++; it's just often the simplest spawning setup. If you don't have mpi-compat then you'll need to spawn with upcxx-run instead of mpirun, and you will need either a working PMI library or a passwordless-ssh setup for your compute nodes.
It's worth noting that even if we fix the portable mpi-conduit backend as a side effect of this effort, you should always prefer ibv-conduit on InfiniBand hardware (i.e. compile the app with UPCXX_GASNET_CONDUIT=ibv), as it will provide vastly better communication performance.
Looking at your install settings, I see a number of unnecessary, redundant, and/or incorrect settings that may be contributing factors. Please try the following settings:
module load Python/2.7.14-GCCcore-6.4.0-bare
module load openmpi.gnu/3.1.2
export GASNET_CONFIGURE_ARGS='--enable-pshm --disable-pshm-posix --enable-pshm-sysv --enable-ibv --enable-mpi-compat'
export CC=/cluster/software/VERSIONS/gcc-7.2.0/bin/gcc
export CXX=/cluster/software/VERSIONS/openmpi.gnu-3.1.2/bin/mpicxx
export MPI_CC=/cluster/software/VERSIONS/openmpi.gnu-3.1.2/bin/mpicc
./install new-install
Note I've made the following changes:
- Removed all the following settings: MPI_CXX (this is not a setting), MPI_LIBS (should not be necessary in a correct install), and all the -O3 settings (this is incorrect for the debug variant of the install tree, and enabled by default for the opt variant).
- Changed CXX to mpicxx to ensure proper UPC++ linkage with MPI (see docs/hybrid.md).
- The addition of --enable-mpi-compat to GASNET_CONFIGURE_ARGS tells GASNet configure to force MPI compatibility or drop dead with an error message explaining why it failed.
- The remaining GASNET_CONFIGURE_ARGS assume you need SystemV-based shared-memory bypass for some specific reason. Most Linux systems work best with the default POSIX-based PSHM support (in which case you could remove the three *pshm* arguments). In particular, the latest release of our software addresses some PSHM issues you reported in issue #109: GASNet Bug 3693 and Bug 3694.
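Editor's sketch, not part of the original reply: the suggested settings rely on CC, CXX, and MPI_CC each naming a real compiler binary, so one might verify the three paths before launching the install script. The helper name `check_compiler_vars` is hypothetical and is not part of UPC++ or GASNet.

```shell
#!/bin/sh
# Report whether each named environment variable points at an executable file.
# Returns non-zero if any variable is unset or does not name an executable.
check_compiler_vars() {
    status=0
    for var in "$@"; do
        # Indirect lookup of the variable whose name is stored in $var.
        eval "path=\${$var:-}"
        if [ -z "$path" ]; then
            echo "$var: not set"
            status=1
        elif [ -x "$path" ] && [ ! -d "$path" ]; then
            echo "$var: ok ($path)"
        else
            echo "$var: not an executable file ($path)"
            status=1
        fi
    done
    return $status
}
```

A possible use, after the export lines and before `./install new-install`, is `check_compiler_vars CC CXX MPI_CC || exit 1`, so a mistyped path fails early rather than surfacing later in the configure output.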
Please attach the entire console output from the install script (and any config.log file referenced by an error message) using these settings in your response.
Finally, if you get to a run step, please add the upcxx-run -vv option or (for mpirun) set GASNET_VERBOSEENV=1, to give us more diagnostics about spawning behavior.
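Editor's sketch, not part of the original reply: the two failure signatures quoted in this thread can be looked up in a saved job log before rerunning with extra diagnostics. The first mapping follows the explanation above (missing MPI spawning support). The second is an assumption drawn from the remark that upcxx-run without mpi-compat needs a working PMI library or passwordless ssh. The function name `triage_spawn_log` is hypothetical.

```shell
#!/bin/sh
# Scan a job log for the spawn failures quoted in this thread and print a hint.
# Returns 0 if at least one known signature was found, 1 otherwise.
triage_spawn_log() {
    status=1
    # Signature seen when launching an ibv-conduit binary with mpirun.
    if grep -qF 'Requested spawner "(not set)"' "$1"; then
        echo "spawner-not-set: MPI spawning support appears to be missing; rebuild with --enable-mpi-compat or launch with upcxx-run"
        status=0
    fi
    # Signature seen when upcxx-run could not start the remote processes.
    # Assumption: PMI or passwordless ssh is the first thing to check.
    if grep -qF 'Failed to start processes on' "$1"; then
        echo "start-failed: check the PMI library or passwordless ssh setup, then rerun with upcxx-run -vv"
        status=0
    fi
    return $status
}
```

For example, `triage_spawn_log output/output.$myDate` would print one hint per signature found in that run's log, using the log path from the scripts earlier in the thread.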
-
@bonachea thank you so much for your help. In the meantime I managed to get MPI compatibility when installing UPC++.
However, rest assured that I will use your approach for my next UPC++ installation!
-
@bonachea actually, just to be sure I am using the best possible options, I am rebuilding my UPC++ installation based on your suggestions.
-
reporter - changed status to duplicate
Duplicate of #195.
Thanks a lot Dan! However, I need to "refine" the bug detection, context, etc. before actually being able to report something meaningful. That is why I removed my post on the Google Group: I realized I could bypass the problem for now.