UPC++ installation and spawning on InfiniBand system

Issue #194 duplicate
Dan Bonachea created an issue

@jeremieLagraviere wrote:

Dear UPC++ Users,

A long time ago I encountered this same problem, but I got busy with something else and never actually solved it.

My program

• Performs a rather simple sparse matrix-vector multiplication.
• Works fine on a single node (smp), delivers consistent performance depending on the parameters, and performs as well as the identical UPC version I already have.

Toolchain

UPCXX Installation

UPC++ is installed like this on the supercomputer I am using:

    module load Python/2.7.14-GCCcore-6.4.0-bare
    module load openmpi.gnu/3.1.2
    export GASNET_CONFIGURE_ARGS='--enable-pshm --disable-pshm-posix --enable-pshm-sysv --disable-aligned-segments --enable-segment-large --disable-fca --without-multiconf --enable-sptr-struct --disable-debug --enable-ibv'
    export CC=/cluster/software/VERSIONS/gcc-7.2.0/bin/gcc
    export CXX=/cluster/software/VERSIONS/gcc-7.2.0/bin/g++
    export MPI_CC=/cluster/software/VERSIONS/openmpi.gnu-3.1.2/bin/mpicc
    export MPI_CXX=/cluster/software/VERSIONS/openmpi.gnu-3.1.2/bin/mpicxx
    export MPI_LIBS=/cluster/software/VERSIONS/openmpi.gnu-3.1.2/lib/libmpi.so
    export MPI_CFLAGS="-O3"
    export CFLAGS="-O3"
    export LDFLAGS="-O3"
    export CXXFLAGS="-O3"
    ./install installed

NOTE: This does NOT enable the mpi network conduit! Do you know why?

I am totally ready to change these settings and recompile UPC++ if any of the parameters seem wrong to you.


Going multinode (starting with 2 nodes)

Compiling

I compile my program using this command:

    $ make compileLegit 
    export UPCXX_GASNET_CONDUIT=ibv && export LD_LIBRARY_PATH=/cluster/software/VERSIONS/gcc-7.2.0/lib64:/cluster/software/VERSIONS/openmpi.gnu-3.1.2/lib:/cluster/software/EASYBUILD/Python/2.7.14-GCCcore-6.4.0-bare/lib && /usit/abel/u1/jeremie/myRepo/compilers/BUPCXX_GCC/installed/bin/upcxx -O3 main.cpp tools.cpp mainComputation.cpp fileReader.cpp timeManagement.cpp -I. -Iincludes/  -L/cluster/software/VERSIONS/openmpi.gnu-3.1.2/lib -lmpi -lm -o upcxxProgram/upcxxSpmv

Running

On the supercomputer I am using, I use these SLURM scripts:

Using MPIRUN

    #!/bin/bash
    #[...] parameters that are not relevant for debugging such as job name, time and account 
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=16
    #SBATCH --mem-per-cpu=3900MB
    #3025MB
    source /cluster/bin/jobsetup

    ulimit -c 0
    #export GASNET_PHYSMEM_NOPROBE=1


    export GASNET_PHYSMEM_MAX=54G
    export UPCXX_SEGMENT_MB=3300
    export GASNET_MAX_SEGSIZE=54G


    module load Python/2.7.14-GCCcore-6.4.0-bare
    module load openmpi.gnu/3.1.2

    cd /cluster/home/jeremie/myRepo/pgm-jlg-upc-svn/trunk/v1.4.1
    myDate="`date +%Y-%m-%d-%H.%M`"

    export UPCXX_GASNET_CONDUIT=ibv 

    export LD_LIBRARY_PATH=/cluster/software/VERSIONS/gcc-7.2.0/lib64:/cluster/software/VERSIONS/openmpi.gnu-3.1.2/lib:/cluster/software/EASYBUILD/Python/2.7.14-GCCcore-6.4.0-bare/lib
    mpirun -np 32 upcxxProgram/upcxxSpmv /cluster/home/jeremie/myRepo/pgm-jlg-upc-svn/trunk//dataset//D67MPI3Dheart.55 100 65536 &> /cluster/home/jeremie/myRepo/pgm-jlg-upc-svn/trunk/v1.4.1/output/output.$myDate

Using UPCXX-RUN

    #!/bin/bash
    #[...] parameters that are not relevant for debugging such as job name, time and account 
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=16
    #SBATCH --mem-per-cpu=3900MB
    #3025MB
    source /cluster/bin/jobsetup

    ulimit -c 0
    #export GASNET_PHYSMEM_NOPROBE=1


    export GASNET_PHYSMEM_MAX=54G
    export UPCXX_SEGMENT_MB=3300
    export GASNET_MAX_SEGSIZE=54G


    module load Python/2.7.14-GCCcore-6.4.0-bare
    module load openmpi.gnu/3.1.2

    cd /cluster/home/jeremie/myRepo/pgm-jlg-upc-svn/trunk/v1.4.1
    myDate="`date +%Y-%m-%d-%H.%M`"

    export UPCXX_GASNET_CONDUIT=ibv 

    export LD_LIBRARY_PATH=/cluster/software/VERSIONS/gcc-7.2.0/lib64:/cluster/software/VERSIONS/openmpi.gnu-3.1.2/lib:/cluster/software/EASYBUILD/Python/2.7.14-GCCcore-6.4.0-bare/lib
    export UPCXX_GASNET_CONDUIT=ibv &&  export UPCXX_SEGMENT_MB=1643  && export GASNET_MAX_SEGSIZE=59000MB && export GASNET_PSHM_NODES=32 && export LD_LIBRARY_PATH=/cluster/software/VERSIONS/gcc-7.2.0/lib64:/cluster/software/VERSIONS/openmpi.gnu-3.1.2/lib:/cluster/software/EASYBUILD/Python/2.7.14-GCCcore-6.4.0-bare/lib && /usit/abel/u1/jeremie/myRepo/compilers/BUPCXX_GCC/installed/bin/upcxx-run -n 32 upcxxProgram/upcxxSpmv ../dataset/D67MPI3Dheart.55 100 65536 &> /cluster/home/jeremie/myRepo/pgm-jlg-upc-svn/trunk/v1.4.1/output/output.$myDate

Problem

When running my job with the script (MPIRUN) above, I get these errors:

    *** FATAL ERROR: Requested spawner "(not set)" is unknown or not supported in this build
    WARNING: Ignoring call to gasneti_print_backtrace_ifenabled before gasneti_backtrace_init
    [...] same error again and again
    WARNING: Ignoring call to gasneti_print_backtrace_ifenabled before gasneti_backtrace_init
    -------------------------------------------------------
    Primary job  terminated normally, but 1 process returned
    a non-zero exit code. Per user-direction, the job has been aborted.
    -------------------------------------------------------
    *** FATAL ERROR: Requested spawner "(not set)" is unknown or not supported in this build
    WARNING: Ignoring call to gasneti_print_backtrace_ifenabled before gasneti_backtrace_init
    *** FATAL ERROR: Requested spawner "(not set)" is unknown or not supported in this build
    WARNING: Ignoring call to gasneti_print_backtrace_ifenabled before gasneti_backtrace_init
    *** FATAL ERROR: Requested spawner "(not set)" is unknown or not supported in this build
    WARNING: Ignoring call to gasneti_print_backtrace_ifenabled before gasneti_backtrace_init
    --------------------------------------------------------------------------
    mpirun noticed that process rank 16 with PID 20057 on node c5-20 exited on signal 6 (Aborted).

When running my job with the script (UPCXX-RUN) above, I get these errors:

    export UPCXX_GASNET_CONDUIT=ibv &&  export UPCXX_SEGMENT_MB=1643  && export GASNET_MAX_SEGSIZE=59000MB && export GASNET_PSHM_NODES=32 && export LD_LIBRARY_PATH=/cluster/software/VERSIONS/gcc-7.2.0/lib64:/cluster/software/VERSIONS/openmpi.gnu-3.1.2/lib:/cluster/software/EASYBUILD/Python/2.7.14-GCCcore-6.4.0-bare/lib && /usit/abel/u1/jeremie/myRepo/compilers/BUPCXX_GCC/installed/bin/upcxx-run -n 32 upcxxProgram/upcxxSpmv ../dataset/D67MPI3Dheart.55 100 65536
    *** Failed to start processes on c1-17
    *** FATAL ERROR: One or more processes died before setup was completed
    WARNING: Ignoring call to gasneti_print_backtrace_ifenabled before gasneti_backtrace_init
    /bin/sh: line 1: 22261 Aborted                 /usit/abel/u1/jeremie/myRepo/compilers/BUPCXX_GCC/installed/bin/upcxx-run -n 32 upcxxProgram/upcxxSpmv ../dataset/D67MPI3Dheart.55 100 65536
    make: *** [runLegit] Error 134

Questions

Should I try to get the MPI conduit enabled when building UPC++? If yes, what should I change in the commands and environment variables shown earlier?

Should I compile using the MPI or the IBV conduit (UPCXX_GASNET_CONDUIT=mpi or UPCXX_GASNET_CONDUIT=ibv)?

Should I run my program on multiple nodes using mpirun or upcxx-run?

How do I get rid of the error 'Requested spawner "(not set)" is unknown or not supported in this build'?

Thank you in advance for your help. --Jeremie

Comments (5)

  1. Jérémie Lagravière

    Thanks a lot Dan! However, I need to "refine" the bug detection, context, etc. before actually being able to report something meaningful. That is why I removed my post from the Google Group: I realized I could work around the problem for now.

  2. Dan Bonachea reporter

    Hi Jeremie - My apologies for the trouble you encountered, let's get this fixed for you.

    First thing, can you please confirm you are using the current v2018.9.0 release package of UPC++? I notice that your previous issue #109 was using a much older release, so I want to ensure we are discussing the same software.

    I suspect the root problem here is that the GASNet configure detection of MPI support is failing, which silently disables both mpi-conduit and MPI spawning support in ibv-conduit. This in turn leads to the 'Requested spawner "(not set)"' error you are seeing when trying to launch with mpirun. MPI compatibility is not mandatory to run UPC++; it's just often the simplest spawning setup. If you don't have mpi-compat then you'll need to spawn with upcxx-run instead of mpirun, and you will need either a working PMI library or a passwordless-ssh setup for your compute nodes. It's worth noting that even if we fix the portable mpi-conduit backend as a side effect of this effort, you should always prefer ibv-conduit on InfiniBand hardware (i.e. compile the app with UPCXX_GASNET_CONDUIT=ibv), as it will provide vastly better communication performance.
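
The launcher choice described in this comment can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function name `launch_cmd` and its `yes|no` first argument are hypothetical, and the function prints a command line instead of running anything. Whether your own build has MPI compatibility is something only the install output can tell you.

```shell
# Print (do not execute) a launch command for a UPC++ program built for ibv-conduit.
# Usage: launch_cmd <mpi_compat: yes|no> <nprocs> <program> [args...]
# NOTE: launch_cmd and its first argument are hypothetical names for this sketch.
launch_cmd() {
    mpi_compat=$1
    nprocs=$2
    shift 2
    if [ "$mpi_compat" = "yes" ]; then
        # MPI spawning support was built in, so mpirun can start the job
        echo "mpirun -np $nprocs $*"
    else
        # No MPI compatibility: spawn with upcxx-run (needs PMI or passwordless ssh)
        echo "upcxx-run -n $nprocs $*"
    fi
}
```

For example, `launch_cmd no 32 upcxxProgram/upcxxSpmv ../dataset/D67MPI3Dheart.55 100 65536` prints the `upcxx-run -n 32 ...` form used in the second SLURM script of the report.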

    Looking at your install settings I see a number of unnecessary, redundant and/or incorrect settings that may be contributing factors. Please try the following settings:

    module load Python/2.7.14-GCCcore-6.4.0-bare
    module load openmpi.gnu/3.1.2
    export GASNET_CONFIGURE_ARGS='--enable-pshm --disable-pshm-posix --enable-pshm-sysv --enable-ibv --enable-mpi-compat'
    export CC=/cluster/software/VERSIONS/gcc-7.2.0/bin/gcc
    export CXX=/cluster/software/VERSIONS/openmpi.gnu-3.1.2/bin/mpicxx
    export MPI_CC=/cluster/software/VERSIONS/openmpi.gnu-3.1.2/bin/mpicc
    ./install new-install
    

    Note I've made the following changes:

    • Removed all the following settings: MPI_CXX (this is not a setting), MPI_LIBS (should not be necessary in a correct install), all the -O3 settings (this is incorrect for the debug variant of the install tree, and enabled by default for the opt variant)
    • Changed CXX to mpicxx to ensure proper UPC++ linkage with MPI (see docs/hybrid.md)
    • The addition of --enable-mpi-compat to GASNET_CONFIGURE_ARGS tells GASNet configure to force MPI compatibility or drop dead with an error message explaining why it failed.
    • The remaining GASNET_CONFIGURE_ARGS assume you need SystemV-based shared-memory bypass for some specific reason - most Linux systems work best with the default POSIX-based PSHM support (in which case you could remove the three *pshm* arguments). In particular, the latest release of our software addresses some PSHM issues you reported in issue #109: GASNet Bug 3693 and Bug 3694.
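
Because the original settings were exported in the login shell, they may still be present when the install is repeated. A minimal sketch for clearing them first is given here; the helper names `clean_install_env` and `check_install_env` are hypothetical, and the variable list is simply the set of settings the first bullet says to remove.

```shell
# Unset the variables that the list above says to drop before re-running ./install.
# NOTE: both function names are hypothetical helpers for this sketch.
clean_install_env() {
    unset MPI_CXX MPI_LIBS MPI_CFLAGS CFLAGS CXXFLAGS LDFLAGS
}

# Report any of those variables that are still set; returns 1 if at least one is.
check_install_env() {
    rc=0
    for name in MPI_CXX MPI_LIBS MPI_CFLAGS CFLAGS CXXFLAGS LDFLAGS; do
        # ${VAR+x} expands to "x" only when VAR is set (even if it is empty)
        if eval "[ -n \"\${$name+x}\" ]"; then
            echo "still set: $name"
            rc=1
        fi
    done
    return $rc
}
```

Running `clean_install_env` and then `check_install_env` in the same shell that will run `./install` should print nothing if the cleanup worked.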

    Please attach the entire console output from the install script (and any config.log file referenced by an error message) using these settings in your response.
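
One way to capture the entire console output while still seeing it on screen is sketched here. The helper name `run_logged` and the log file name `install.log` are hypothetical choices for this example; only `./install new-install` comes from the settings above.

```shell
# Run a command, showing its combined stdout/stderr and saving a copy to a log file.
# Usage: run_logged <logfile> <command> [args...]
# NOTE: run_logged is a hypothetical helper; the pipeline's exit status is tee's,
# not the command's, so read the log to see whether the install succeeded.
run_logged() {
    logfile=$1
    shift
    "$@" 2>&1 | tee "$logfile"
}

# Example (not executed here):
#   run_logged install.log ./install new-install
```

The resulting log file, plus any config.log named in an error message, is what the request above asks for.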

    Finally, if you get to a run step, please add the upcxx-run -vv option or (for mpirun) set GASNET_VERBOSEENV=1, to give us more diagnostics about spawning behavior.
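
The two diagnostic options can be written out as full command lines. The sketch below only prints them; the function name `verbose_launch_cmd` is hypothetical. One assumption to check on your system: depending on the MPI implementation, an environment variable set in the job script may not reach ranks on remote nodes automatically (Open MPI's mpirun has a `-x` option for forwarding one).

```shell
# Print (do not execute) a launch command with extra spawning diagnostics enabled.
# Usage: verbose_launch_cmd <launcher: upcxx-run|mpirun> <nprocs> <program> [args...]
# NOTE: verbose_launch_cmd is a hypothetical helper for this sketch.
verbose_launch_cmd() {
    launcher=$1
    nprocs=$2
    shift 2
    case $launcher in
        # upcxx-run takes the verbosity flag directly
        upcxx-run) echo "upcxx-run -vv -n $nprocs $*" ;;
        # for mpirun, the diagnostics are requested through the environment
        mpirun)    echo "GASNET_VERBOSEENV=1 mpirun -np $nprocs $*" ;;
        *)         echo "unknown launcher: $launcher" >&2; return 1 ;;
    esac
}
```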

  3. Jérémie Lagravière

    @bonachea thank you so much for your help. In the meantime I managed to get MPI compatibility when installing UPC++.

    However, be sure that I will use your way to build my next UPC++ installation!

  4. Jérémie Lagravière

    @bonachea actually, just to be sure I am using the best possible options, I am rebuilding my UPC++ installation based on your suggestions.
