Bad Termination when running on XSEDE Comet cluster with mvapich

Issue #160 resolved
George Koubbe created an issue

Hello,

I installed mpi4py on the XSEDE Comet platform using MVAPICH, specifying it in mpi.cfg as follows:

# MVAPICH MPI example  
# ----------------  
[mvapich]  
mpi_dir              = /opt/mvapich2/intel/ib/ 
mpicc                = %(mpi_dir)s/bin/mpicc  
mpicxx               = %(mpi_dir)s/bin/mpicxx  
library_dirs         = %(mpi_dir)s/lib:/opt/intel/2018.1.163/lib/intel64:/etc/libibverbs.d
runtime_library_dirs = %(library_dirs)s
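
For reference, the build is pointed at this [mvapich] section by name; the from-source build step is roughly the following (a sketch, the install step may differ on your setup):

python setup.py build --mpi=mvapich
python setup.py install --user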

The command:

which mpicc

outputs:

/opt/mvapich2/intel/ib/bin/mpicc

The command:

mpicc --version

outputs:

icc (ICC) 18.0.1 20171018
Copyright (C) 1985-2017 Intel Corporation.  All rights reserved.

The command:

which mpirun

outputs:

/opt/mvapich2/intel/ib/bin/mpirun

The command:

mpirun --version

outputs:

HYDRA build details:
    Version:                                 3.2.1
    Release Date:                            General Availability Release
    CC:                              icc  -fPIC -O3  
    CXX:                             icpc  -fPIC -O3  
    F77:                             ifort -fPIC -O3  
    F90:                             ifort -fPIC -O3  
    Configure options:                       '--disable-option-checking' '--prefix=/opt/mvapich2/intel/ib' '--enable-shared' '--enable-sharedlibs=gcc' '--with-hwloc' '--with-ib-include=/usr/include/infiniband' '--with-ib-libpath=/usr/lib64' '--enable-fast=O3' '--with-limic2=/scratch/rolls/mpi-roll/BUILD/sdsc-mvapich2_intel_ib-2.3.2/../..//cache/build-limic' '--enable-avx' '--with-slurm=/usr/lib64/slurm' '--with-file-system=lustre' 'CC=icc' 'CFLAGS=-fPIC -O3 -O3' 'CPPFLAGS=-I/usr/include/infiniband -I/scratch/rolls/mpi-roll/BUILD/sdsc-mvapich2_intel_ib-2.3.2/mvapich2-2.3.2/src/mpl/include -I/scratch/rolls/mpi-roll/BUILD/sdsc-mvapich2_intel_ib-2.3.2/mvapich2-2.3.2/src/mpl/include -I/scratch/rolls/mpi-roll/BUILD/sdsc-mvapich2_intel_ib-2.3.2/mvapich2-2.3.2/src/openpa/src -I/scratch/rolls/mpi-roll/BUILD/sdsc-mvapich2_intel_ib-2.3.2/mvapich2-2.3.2/src/openpa/src -D_REENTRANT -I/scratch/rolls/mpi-roll/BUILD/sdsc-mvapich2_intel_ib-2.3.2/mvapich2-2.3.2/src/mpi/romio/include -I/include -I/include -I/usr/include/infiniband -I/scratch/rolls/mpi-roll/BUILD/sdsc-mvapich2_intel_ib-2.3.2/../..//cache/build-limic/include -I/include -I/include' 'CXX=icpc' 'CXXFLAGS=-fPIC -O3 -O3' 'FC=ifort' 'FCFLAGS=-fPIC -O3 -O3' 'F77=ifort' 'FFLAGS=-L/usr/lib64 -L/lib -L/lib -fPIC -O3 -O3' '--cache-file=/dev/null' '--srcdir=.' 'LDFLAGS=-L/usr/lib64 -L/lib -L/lib -L/lib -Wl,-rpath,/lib -L/lib -Wl,-rpath,/lib -L/usr/lib64 -L/scratch/rolls/mpi-roll/BUILD/sdsc-mvapich2_intel_ib-2.3.2/../..//cache/build-limic/lib -L/lib -L/lib' 'LIBS=-libmad -lrdmacm -libumad -libverbs -lrt -llimic2 -lpthread ' 'MPLLIBNAME=mpl'
    Process Manager:                         pmi
    Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
    Topology libraries available:            hwloc
    Resource management kernels available:   user slurm ll lsf sge pbs cobalt
    Checkpointing libraries available:       
    Demux engines available:                 poll select

However, when I run:

mpirun -n 4 -ppn 2 python demo/helloworld.py

I get the following:

WARNING: Error in initializing MVAPICH2 ptmalloc library.Continuing without InfiniBand registration cache support.
Hello, World! I am process 1 of 4 on comet-14-01.sdsc.edu.
Hello, World! I am process 0 of 4 on comet-14-01.sdsc.edu.
Hello, World! I am process 2 of 4 on comet-14-02.sdsc.edu.
Hello, World! I am process 3 of 4 on comet-14-02.sdsc.edu.
Error in system call pthread_mutex_destroy: Device or resource busy
    src/mpi/init/initthread.c:241
[rank 2] Assertion failed in file src/mpi/init/initthread.c at line 242: err == 0
[comet-14-02.sdsc.edu:mpi_rank_2][error_sighandler] Caught error: Segmentation fault (signal 11)
Error in system call pthread_mutex_destroy: Device or resource busy
    src/mpi/init/initthread.c:241
[rank 3] Assertion failed in file src/mpi/init/initthread.c at line 242: err == 0
[comet-14-02.sdsc.edu:mpi_rank_3][error_sighandler] Caught error: Segmentation fault (signal 11)
Error in system call pthread_mutex_destroy: Device or resource busy
    src/mpi/init/initthread.c:241
[rank 0] Assertion failed in file src/mpi/init/initthread.c at line 242: err == 0
[comet-14-01.sdsc.edu:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
Error in system call pthread_mutex_destroy: Device or resource busy
    src/mpi/init/initthread.c:241
[rank 1] Assertion failed in file src/mpi/init/initthread.c at line 242: err == 0
[comet-14-01.sdsc.edu:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 5740 RUNNING AT comet-14-02
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

The application itself appears to run fine, but the job exits with these bad-termination errors. What can I do to solve this?

Any suggestion would be greatly appreciated.

George Koubbe.

Comments (2)

  1. Lisandro Dalcin

    I have absolutely no idea. Maybe set LD_LIBRARY_PATH to list all the paths your MPI library needs. This is very unlikely to be an issue in mpi4py. You are using the Intel compiler to build mpi4py, but your Python was most likely built with GCC, so incompatibility issues may kick in. The initial warning you got is suspicious.

    Did you try to build and run a plain C example like demo/helloworld.c? Does it work?
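
    For reference, a minimal C MPI hello world along those lines would be something like the sketch below (not the exact demo/helloworld.c that ships with mpi4py). Build it with the same mpicc and launch it with the same mpirun line used for the Python demo:

    /* Minimal MPI hello world in C; a stand-in for demo/helloworld.c, not the exact file. */
    /* Build: mpicc -o helloworld helloworld.c                                             */
    /* Run:   mpirun -n 4 -ppn 2 ./helloworld                                              */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size, len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(name, &len);
        printf("Hello, World! I am process %d of %d on %s.\n", rank, size, name);
        MPI_Finalize();  /* if the crash also happens here, the MPI stack, not mpi4py, is at fault */
        return 0;
    }

    If that C program hits the same pthread_mutex_destroy error at exit, the problem lies in the MPI installation rather than in mpi4py.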

  2. George Koubbe reporter

    I contacted SDSC Comet user support, and they told me to add the following line before the mpirun command:

    export MV2_ENABLE_AFFINITY=0
    

    I tested it and it works perfectly.
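
    For completeness, the full run then looks like this (same process counts and script as above; the variable can also be exported in the Slurm batch script before the mpirun line):

    export MV2_ENABLE_AFFINITY=0
    mpirun -n 4 -ppn 2 python demo/helloworld.py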

    Sorry for bothering you here.
