PMIx startup crashes in explicit OpenMPI MPI_Init at large scale on Abel (ibv/SLURM)

Issue #195 resolved
Jérémie Lagravière created an issue

Dear UPC++ Users,

This issue is a continuation of, and a more accurate version of, what was described here: https://bitbucket.org/berkeleylab/upcxx/issues/194/upc-installation-and-spawning-on That previous issue is out of date and most of the information in it is no longer relevant.

My Program

Performs a rather simple sparse matrix-vector multiplication.

Works fine on a single node (smp conduit), delivers consistent performance depending on the parameters, and performs as well as the identical UPC version I already have.

Today I added the following lines of code at the beginning of my program:

#include <mpi.h>
//...
int main (int argc, char *argv[])
{
    MPI_Init(NULL, NULL);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    //...

    MPI_Finalize();
}

Note: I have put MPI_Init() before upcxx::init() and MPI_Finalize() after upcxx::finalize().

Do not hesitate to tell me if you think this is a mistake. For info: adding MPI_Init and MPI_Finalize solves a problem that caused harmless crashes at the end of my jobs. Basically, with MPI_Init() and MPI_Finalize() my program run is cleaner.

For info 2: I do not mix UPC++ and MPI. My program is 100% UPC++.
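
For clarity, the overall structure looks roughly like this (a minimal sketch; the actual work in between is pure UPC++):

#include <mpi.h>
#include <upcxx/upcxx.hpp>

int main (int argc, char *argv[])
{
    MPI_Init(NULL, NULL);   // MPI is initialized first...
    upcxx::init();          // ...then UPC++

    // ... all the actual computation uses only UPC++ ...

    upcxx::finalize();      // UPC++ is shut down first...
    MPI_Finalize();         // ...then MPI
    return 0;
}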

Toolchain

UPCXX Installation

UPCXX Version

$ ./upcxx --version
UPC++ version 20180900 
Copyright (c) 2018, The Regents of the University of California,
through Lawrence Berkeley National Laboratory.
http://upcxx.lbl.gov

g++ (GCC) 7.2.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

UPC++ is installed like this on the supercomputer I am using:

export GASNET_CONFIGURE_ARGS='--enable-pshm --disable-pshm-posix --enable-pshm-sysv --disable-aligned-segments --enable-segment-large --disable-fca --without-multiconf --enable-sptr-struct --disable-debug --enable-ibv --enable-mpi'
export CC=/cluster/software/VERSIONS/gcc-7.2.0/bin/gcc
export CXX=/cluster/software/VERSIONS/gcc-7.2.0/bin/g++
export MPI_CC=/cluster/software/VERSIONS/openmpi.gnu-3.1.2/bin/mpicc
export MPI_CXX=/cluster/software/VERSIONS/openmpi.gnu-3.1.2/bin/mpicxx
export MPI_LIBS=/cluster/software/VERSIONS/openmpi.gnu-3.1.2/lib/libmpi.so
export MPI_CFLAGS="-O3"
export CFLAGS="-O3"
export LDFLAGS="-O3"
export CXXFLAGS="-O3"

@bonachea I noticed your remarks on this, but I had rebuilt my UPC++ toolchain yesterday, before I saw your advice, and it worked. I will update my UPC++ toolchain environment variables on my next build!

The UPC++ installation supports, among others, the smp, mpi, and ibv network conduits, which is usually what I need to run my programs on any supercomputer:

* smp for single node
* mpi for tests
* ibv for performance

Compiling

I compile my program using this command

$ make compileLegit 
export UPCXX_GASNET_CONDUIT=ibv && export LD_LIBRARY_PATH=/cluster/software/VERSIONS/gcc-7.2.0/lib64:/cluster/software/VERSIONS/openmpi.gnu-3.1.2/lib:/cluster/software/EASYBUILD/Python/2.7.14-GCCcore-6.4.0-bare/lib && /usit/abel/u1/jeremie/myRepo/compilers/BUPCXX_GCC/installed/bin/upcxx -O3 main.cpp tools.cpp mainComputation.cpp fileReader.cpp timeManagement.cpp -I. -Iincludes/  -L/cluster/software/VERSIONS/openmpi.gnu-3.1.2/lib -lmpi -lm -o upcxxProgram/upcxxSpmv

I have attached my Makefile and the parameter file I am using to compile my UPC++ program.

Runtime

On the supercomputer I am using (abel.uio.no), I use SLURM scripts like the following:

#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --output=//cluster/home/jeremie/myRepo/pgm-jlg-upc-svn/trunk/v1.4.1/output/slurmOutput.upcxx
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=16
#SBATCH --mem-per-cpu=3900MB
#3025MB
source /cluster/bin/jobsetup

ulimit -c 0

export GASNET_PHYSMEM_MAX=54G
export UPCXX_SEGMENT_MB=3300
export GASNET_MAX_SEGSIZE=54G


module load Python/2.7.14-GCCcore-6.4.0-bare
module load openmpi.gnu/3.1.2

cd /cluster/home/jeremie/myRepo/pgm-jlg-upc-svn/trunk/v1.4.1
myDate="`date +%Y-%m-%d-%H.%M`"

export UPCXX_GASNET_CONDUIT=ibv 
export LD_LIBRARY_PATH=/cluster/software/VERSIONS/gcc-7.2.0/lib64:/cluster/software/VERSIONS/openmpi.gnu-3.1.2/lib:/cluster/software/EASYBUILD/Python/2.7.14-GCCcore-6.4.0-bare/lib


#and then some command, see below for detailed examples and explanations
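
(For reference, assuming these values are meant to keep the UPC++ shared segments within the per-node limits: 16 ranks per node × 3300 MB per rank ≈ 52.8 GB, which stays below the 54G given to GASNET_PHYSMEM_MAX and GASNET_MAX_SEGSIZE, and below the 16 × 3900 MB ≈ 62.4 GB of memory reserved by SLURM per node.)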

Using MPIRUN

At the end of the SLURM script there is a command line that calls mpirun with my program and its arguments. Some of these commands work, and some lead to crashes.

Working

mpirun -n 128 upcxxProgram/upcxxSpmv /cluster/home/jeremie/myRepo/pgm-jlg-upc-svn/trunk//dataset//D67MPI3Dheart.55 1000 65536 &> /cluster/home/jeremie/myRepo/pgm-jlg-upc-svn/trunk/v1.4.1/output/output.$myDate
mpirun -n 256 upcxxProgram/upcxxSpmv /cluster/home/jeremie/myRepo/pgm-jlg-upc-svn/trunk//dataset//D67MPI3Dheart.55 1000 65536 &> /cluster/home/jeremie/myRepo/pgm-jlg-upc-svn/trunk/v1.4.1/output/output.$myDate

Not working

mpirun -n 512 upcxxProgram/upcxxSpmv /cluster/home/jeremie/myRepo/pgm-jlg-upc-svn/trunk//dataset//D67MPI3Dheart.55 1000 65536 &> /cluster/home/jeremie/myRepo/pgm-jlg-upc-svn/trunk/v1.4.1/output/output.$myDate
mpirun -n 256 upcxxProgram/upcxxSpmv /cluster/home/jeremie/myRepo/pgm-jlg-upc-svn/trunk//dataset//D67MPI3Dheart.55 1000 65536 &> /cluster/home/jeremie/myRepo/pgm-jlg-upc-svn/trunk/v1.4.1/output/output.$myDate

Using UPCXX-RUN

At the end of the SLURM script there is a command line that calls upcxx-run with my program and its arguments. Some of these commands work, and some lead to crashes.

Working

/usit/abel/u1/jeremie/myRepo/compilers/teststuff/installed/bin/upcxx-run -n 128 upcxxProgram/upcxxSpmv /cluster/home/jeremie/myRepo/pgm-jlg-upc-svn/trunk//dataset//D67MPI3Dheart.55 1000 65536 &> /cluster/home/jeremie/myRepo/pgm-jlg-upc-svn/trunk/v1.4.1/output/output.$myDate

Not working

/usit/abel/u1/jeremie/myRepo/compilers/teststuff/installed/bin/upcxx-run -n 256 upcxxProgram/upcxxSpmv /cluster/home/jeremie/myRepo/pgm-jlg-upc-svn/trunk//dataset//D67MPI3Dheart.55 1000 65536 &> /cluster/home/jeremie/myRepo/pgm-jlg-upc-svn/trunk/v1.4.1/output/output.$myDate

The problem

As you have seen in the previous section, some commands are categorized as "working" and some as "not working". Clearly this is directly connected to the number of cores and nodes I want to use. When reaching 256 / 512 / 1024 threads (16, 32, 64 nodes) the run is almost certain to crash (above 256 it is a 100% crash).

Error message

Basically the error messages are always the same (with different lengths depending on the run):

$ cat crash256WithN
[compute-11-35.local:07594] PMIX ERROR: PMIX TEMPORARILY UNAVAILABLE in file ptl_tcp.c at line 688
[compute-11-35.local:07593] PMIX ERROR: PMIX TEMPORARILY UNAVAILABLE in file ptl_tcp.c at line 688
[compute-11-35.local:07593] OPAL ERROR: Unreachable in file pmix2x_client.c at line 109
[compute-11-35.local:07596] PMIX ERROR: PMIX TEMPORARILY UNAVAILABLE in file ptl_tcp.c at line 688
[compute-11-35.local:07596] OPAL ERROR: Unreachable in file pmix2x_client.c at line 109
[compute-11-35.local:07598] PMIX ERROR: PMIX TEMPORARILY UNAVAILABLE in file ptl_tcp.c at line 688
[....]
[compute-11-35.local:07589] OPAL ERROR: Unreachable in file pmix2x_client.c at line 109
[compute-11-35.local:07591] PMIX ERROR: PMIX TEMPORARILY UNAVAILABLE in file ptl_tcp.c at line 688
[...]
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[compute-11-35.local:07587] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[compute-11-35.local:07584] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[compute-11-35.local:07585] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[compute-11-35.local:07590] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)

Some runs crash with a shorter message, which I think has the same cause:

$ cat crash512WithN 
[compute-18-22.local:06651] PMIX ERROR: PMIX TEMPORARILY UNAVAILABLE in file ptl_tcp.c at line 688
[compute-18-22.local:06651] OPAL ERROR: Unreachable in file pmix2x_client.c at line 109
[compute-18-22.local:06650] PMIX ERROR: PMIX TEMPORARILY UNAVAILABLE in file ptl_tcp.c at line 688
[compute-18-22.local:06650] OPAL ERROR: Unreachable in file pmix2x_client.c at line 109
[compute-18-22.local:06652] PMIX ERROR: PMIX TEMPORARILY UNAVAILABLE in file ptl_tcp.c at line 688
[compute-18-22.local:06652] OPAL ERROR: Unreachable in file pmix2x_client.c at line 109
[compute-18-22.local:06653] PMIX ERROR: PMIX TEMPORARILY UNAVAILABLE in file ptl_tcp.c at line 688
[compute-18-22.local:06653] OPAL ERROR: Unreachable in file pmix2x_client.c at line 109
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[compute-18-22.local:06650] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[compute-18-22.local:06653] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[compute-18-22.local:06651] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[compute-18-22.local:06652] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[47324,1],386]
  Exit code:    1
--------------------------------------------------------------------------

Questions

The problem, as you can see, is that above a certain number of nodes/cores the probability of a crash is 100% on the supercomputer I am using.

However, with a smaller number of nodes/cores everything runs fine.

Moreover, on this supercomputer I am certain that I can use up to 64 nodes with a GASNet/GASNet-EX based language, as my other programs are implemented in UPC and they run on 1024 threads / 64 nodes.

So I am guessing the problem could be one of the following:

* UPC++ compiler configuration
* Wrong or missing options when I compile my UPC++ program
* UPC++ or GASNet-EX has a problem interacting with the supercomputer I am using
* Some GASNet environment variables that I forgot to set or set wrongly
* Some issue with the runtime options I am using: am I "too close" to MPI when I use mpirun? Is that considered a "bad way" to do things with UPC++?

What are your guesses?

What can I do to stabilize this situation?

Thank you in advance for your help.

Jérémie

Comments (13)

  1. Dan Bonachea

    Hi @jeremieLagraviere sorry to hear you are still having problems.

    My suggestions:

    1. Given you are getting PMIx error messages from the MPI implementation at job spawn, the first thing I'd try is an MPI hello-world program that does not use UPC++/GASNet (see the sketch after this list). If that generates errors or fails to run correctly, then you should contact your system admins.
    2. I notice your batch script sets #SBATCH --nodes=16 --ntasks-per-node=16, which means you've reserved resources for 256 processes, so we should not expect to successfully run more than that (eg 512 ranks should not work in this job).
    3. Please re-install the current version of UPC++ in a clean directory using the configure and envvar settings I provided in issue #194
    4. Rebuild your application using upcxx -g (and remove the -O3). This will activate debug mode, which enables thousands of sanity checking assertions system-wide and should always be your first step in diagnosing correctness problems. (Once things are working you should rebuild the application with -O3 before gathering performance results).
    5. If your application does not make MPI calls, we should remove your spurious calls to MPI_Init and MPI_Finalize, since that likely just confuses the situation.
    6. At the run step, please add the upcxx-run -vv option or (for mpirun) set GASNET_VERBOSEENV=1, to give us more diagnostics about spawning behavior.
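
    As a starting point for item 1, something along these lines is sufficient (a minimal sketch using only standard MPI calls); build it with mpicxx and launch it with the same mpirun/SLURM setup that crashes:

        #include <mpi.h>
        #include <cstdio>

        int main(int argc, char *argv[])
        {
            MPI_Init(&argc, &argv);
            int world_size, world_rank;
            MPI_Comm_size(MPI_COMM_WORLD, &world_size);
            MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
            std::printf("hello from rank %d of %d\n", world_rank, world_size);
            MPI_Finalize();
            return 0;
        }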

    One other random note - according to your system documentation the Intel compilers should be available on your supercomputer. Provided it's a modern Intel version (17.0.2 or later) you should consider eventually using that instead of gcc, in order to get the best serial optimizations.

  2. Dan Bonachea

    @jeremieLagraviere is this problem resolved?

    We are preparing a release soon and it would be good to know if your problem represents a bug we should investigate.

  3. Jérémie Lagravière reporter

    @bonachea Sorry, I am a bit slow these days.

    1. At least I ran the MPI test file you sent me, and it ran with no issue.

    2. About your second point: I should have said in the original post that I was giving a somewhat generic SLURM batch file. Indeed, I change the options for mpirun/upcxx-run and #SBATCH depending on my targeted thread count/node count.

    3. It's done. It did not change anything about the problem described in the original post.

    4 and 6: TODO!

    5. It's done. It does not change anything about the stability of the program at runtime when using more than 16 nodes / 256 threads.

    Random note: using the Intel compiler on this supercomputer (Abel) to build and run applications based on UPC++ is really giving me a headache. Weirdly, it works fine on another supercomputer (Fram, https://documentation.sigma2.no/quick/fram.html). However, on that supercomputer (Fram), my program runs well on a single node, but when using multiple nodes it crashes with a segfault... This is probably another issue; right now I am staying on the supercomputer described in my original post (Abel).

  4. Dan Bonachea

    @Jérémie Lagravière : What's the status of this problem? Is it still impeding your progress? Have you tried with the current 2019.3.2 release?

  5. Jérémie Lagravière reporter

    @Dan Bonachea Thanks for the update.
    Yes, the problem is still occurring on the platform I am using.

    Since the original post, I have written another UPC++ program: a stencil computation for the heat equation.

    And, when using a certain number of cores/threads/nodes I get the same errors.

    The parameters are the same as before.

    My only solution is to re-run my program multiple times until it works.

    This time the heat equation is a much simpler program than the previous sparse matrix-vector multiplication… I was expecting that maybe the problem came from something in the SpMV program, but apparently not: even the heat equation crashes at a high number of cores/threads/nodes.

  6. Dan Bonachea

    @Jérémie Lagravière : Thanks for the update. My recommendations:

    Start a new issue for pure UPC++ exit behavior

    In your original report, you stated:

    For info: adding MPI_Init and MPI_Finalize solves a problem that caused harmless crashes at the end of my jobs. Basically, with MPI_Init() and MPI_Finalize() my program run is cleaner. For info 2: I do not mix UPC++ and MPI. My program is 100% UPC++.

    As far as I know, you have not reported these "harmless" exit crashes to us for a pure UPC++ program. This sounds like an independent problem, so can you please start a new issue with more details about what exit-time messages you see, using a debug build of a pure UPC++ program with no MPI_Init/MPI_Finalize calls, and a program that reproduces the crash? Please include complete output for spawning with the upcxx-run -vv option.
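
    To be concrete, the reproducer for that new issue can be as small as something like this (a minimal sketch of a pure UPC++ program with no MPI calls at all, built in debug mode with upcxx -g):

        #include <upcxx/upcxx.hpp>
        #include <cstdio>

        int main(int argc, char *argv[])
        {
            upcxx::init();
            std::printf("hello from rank %d of %d\n", upcxx::rank_me(), upcxx::rank_n());
            upcxx::barrier();   // ensure all ranks get here before shutdown
            upcxx::finalize();  // the exit-time behavior is what we want to observe
            return 0;
        }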

    Further examine MPI spawn behavior (for this issue)

    Based on all the evidence above, our best current guess is some bad interaction between your system's spawner and the MPI installation, possibly involving one or more misconfigured nodes that cause crashes when the job grows to include them. Since you are not actually using MPI for communication, we suspect the shortest path to solving your problem and getting you running smoothly is to remove MPI from the picture entirely (hence the suggestion above).

    However, if we wish to continue pursuing the issue in this report (OpenMPI crashing in MPI_Init), can you please re-confirm that you are able to successfully run the attached mpi-hello2.c program repeatedly at large scale over 1024 processes, using the same MPI library and batch job where the hybrid MPI/UPC++ program crashes? For example, write a single large-scale batch job that alternates running the pure MPI code and the MPI/UPC++ hybrid back to back in a loop a few hundred times.

  7. Paul Hargrove

    @Jérémie Lagravière, it has been over a year since this issue was last updated.
    Can you please indicate whether you are still having problems, or if we may close this issue.

  8. Jérémie Lagravière reporter

    @PHHargrove Yep you can close the issue.

    That is not to say that the problem is solved, but I am not using UPC++ these days, so I have no way to check whether the problem is still present or not.
