Bad Termination when running on XSEDE Comet cluster with mvapich
Issue #160
resolved
Hello,
I installed mpi4py on the XSEDE Comet platform using MVAPICH, specifying it in mpi.cfg as follows:
# MVAPICH MPI example
# ----------------
[mvapich]
mpi_dir = /opt/mvapich2/intel/ib/
mpicc = %(mpi_dir)s/bin/mpicc
mpicxx = %(mpi_dir)s/bin/mpicxx
library_dirs = %(mpi_dir)s/lib:/opt/intel/2018.1.163/lib/intel64:/etc/libibverbs.d
runtime_library_dirs = %(library_dirs)s
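For context, a named section like [mvapich] is normally selected with the --mpi option of mpi4py's setup.py build step; a minimal sketch, assuming the config above is saved as mpi.cfg in the mpi4py source tree:
# Sketch: build mpi4py against the [mvapich] section defined above
python setup.py build --mpi=mvapich
python setup.py install --user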
The command:
which mpicc
outputs:
/opt/mvapich2/intel/ib/bin/mpicc
The command:
mpicc --version
outputs:
icc (ICC) 18.0.1 20171018
Copyright (C) 1985-2017 Intel Corporation. All rights reserved.
The command:
which mpirun
outputs:
/opt/mvapich2/intel/ib/bin/mpirun
The command:
mpirun --version
outputs:
HYDRA build details:
Version: 3.2.1
Release Date: General Availability Release
CC: icc -fPIC -O3
CXX: icpc -fPIC -O3
F77: ifort -fPIC -O3
F90: ifort -fPIC -O3
Configure options: '--disable-option-checking' '--prefix=/opt/mvapich2/intel/ib' '--enable-shared' '--enable-sharedlibs=gcc' '--with-hwloc' '--with-ib-include=/usr/include/infiniband' '--with-ib-libpath=/usr/lib64' '--enable-fast=O3' '--with-limic2=/scratch/rolls/mpi-roll/BUILD/sdsc-mvapich2_intel_ib-2.3.2/../..//cache/build-limic' '--enable-avx' '--with-slurm=/usr/lib64/slurm' '--with-file-system=lustre' 'CC=icc' 'CFLAGS=-fPIC -O3 -O3' 'CPPFLAGS=-I/usr/include/infiniband -I/scratch/rolls/mpi-roll/BUILD/sdsc-mvapich2_intel_ib-2.3.2/mvapich2-2.3.2/src/mpl/include -I/scratch/rolls/mpi-roll/BUILD/sdsc-mvapich2_intel_ib-2.3.2/mvapich2-2.3.2/src/mpl/include -I/scratch/rolls/mpi-roll/BUILD/sdsc-mvapich2_intel_ib-2.3.2/mvapich2-2.3.2/src/openpa/src -I/scratch/rolls/mpi-roll/BUILD/sdsc-mvapich2_intel_ib-2.3.2/mvapich2-2.3.2/src/openpa/src -D_REENTRANT -I/scratch/rolls/mpi-roll/BUILD/sdsc-mvapich2_intel_ib-2.3.2/mvapich2-2.3.2/src/mpi/romio/include -I/include -I/include -I/usr/include/infiniband -I/scratch/rolls/mpi-roll/BUILD/sdsc-mvapich2_intel_ib-2.3.2/../..//cache/build-limic/include -I/include -I/include' 'CXX=icpc' 'CXXFLAGS=-fPIC -O3 -O3' 'FC=ifort' 'FCFLAGS=-fPIC -O3 -O3' 'F77=ifort' 'FFLAGS=-L/usr/lib64 -L/lib -L/lib -fPIC -O3 -O3' '--cache-file=/dev/null' '--srcdir=.' 'LDFLAGS=-L/usr/lib64 -L/lib -L/lib -L/lib -Wl,-rpath,/lib -L/lib -Wl,-rpath,/lib -L/usr/lib64 -L/scratch/rolls/mpi-roll/BUILD/sdsc-mvapich2_intel_ib-2.3.2/../..//cache/build-limic/lib -L/lib -L/lib' 'LIBS=-libmad -lrdmacm -libumad -libverbs -lrt -llimic2 -lpthread ' 'MPLLIBNAME=mpl'
Process Manager: pmi
Launchers available: ssh rsh fork slurm ll lsf sge manual persist
Topology libraries available: hwloc
Resource management kernels available: user slurm ll lsf sge pbs cobalt
Checkpointing libraries available:
Demux engines available: poll select
However, when I run:
mpirun -n 4 -ppn 2 python demo/helloworld.py
I get the following:
WARNING: Error in initializing MVAPICH2 ptmalloc library.Continuing without InfiniBand registration cache support.
Hello, World! I am process 1 of 4 on comet-14-01.sdsc.edu.
Hello, World! I am process 0 of 4 on comet-14-01.sdsc.edu.
Hello, World! I am process 2 of 4 on comet-14-02.sdsc.edu.
Hello, World! I am process 3 of 4 on comet-14-02.sdsc.edu.
Error in system call pthread_mutex_destroy: Device or resource busy
src/mpi/init/initthread.c:241
[rank 2] Assertion failed in file src/mpi/init/initthread.c at line 242: err == 0
[comet-14-02.sdsc.edu:mpi_rank_2][error_sighandler] Caught error: Segmentation fault (signal 11)
Error in system call pthread_mutex_destroy: Device or resource busy
src/mpi/init/initthread.c:241
[rank 3] Assertion failed in file src/mpi/init/initthread.c at line 242: err == 0
[comet-14-02.sdsc.edu:mpi_rank_3][error_sighandler] Caught error: Segmentation fault (signal 11)
Error in system call pthread_mutex_destroy: Device or resource busy
src/mpi/init/initthread.c:241
[rank 0] Assertion failed in file src/mpi/init/initthread.c at line 242: err == 0
[comet-14-01.sdsc.edu:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
Error in system call pthread_mutex_destroy: Device or resource busy
src/mpi/init/initthread.c:241
[rank 1] Assertion failed in file src/mpi/init/initthread.c at line 242: err == 0
[comet-14-01.sdsc.edu:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 5740 RUNNING AT comet-14-02
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
The application itself appears to run fine, but it then terminates with these bad-termination errors. What can I do to solve this?
Any suggestions would be greatly appreciated.
George Koubbe.
Comments (3)
I have absolutely no idea. Maybe set LD_LIBRARY_PATH to list all the paths your MPI library needs. This is very unlikely to be an issue in mpi4py. You are using the Intel compiler to build mpi4py, but your Python was most likely built with GCC, so incompatibility issues may kick in. The initial warning you got is suspicious. Did you try to build and run a plain C example like demo/helloworld.c? Does it work?
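For reference, a minimal sketch of that test (the mpicc wrapper and the demo/helloworld.c path are taken from the report above; the output name is a placeholder):
# Sketch: compile and run the plain C hello world to check the MPI installation itself
mpicc demo/helloworld.c -o helloworld_c
mpirun -n 4 -ppn 2 ./helloworld_c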
reporter: I contacted SDSC Comet user support, and they told me to add the following line before mpirun:
export MV2_ENABLE_AFFINITY=0
I tested it and it works perfectly.
Sorry for the bother.
reporter changed status to resolved:
Add before mpirun:
export MV2_ENABLE_AFFINITY=0
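For anyone hitting this in a Slurm batch job, a sketch of where the export goes (node and task counts are placeholders, not from the original report):
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
# Disable MVAPICH2 CPU affinity before launching, as suggested by SDSC support above
export MV2_ENABLE_AFFINITY=0
mpirun -n 4 -ppn 2 python demo/helloworld.py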