Python segmentation fault with Open MPI 1.7.2
The Python interface of DOLFIN triggers a segmentation fault with this particular version of Open MPI.
case3| python demo_poisson.py
[case3:04260] *** Process received signal ***
[case3:04260] Signal: Segmentation fault (11)
[case3:04260] Signal code: Address not mapped (1)
[case3:04260] Failing at address: 0x2f
[case3:04260] [ 0] /lib64/libpthread.so.0(+0xf9f0) [0x7f22252829f0]
[case3:04260] [ 1] /lib64/libc.so.6(_IO_vfprintf+0x22d8) [0x7f2224f0dc48]
[case3:04260] [ 2] /lib64/libc.so.6(__vasprintf_chk+0xb5) [0x7f2224fbcce5]
[case3:04260] [ 3] /lib64/libc.so.6(__asprintf_chk+0x82) [0x7f2224fbcc22]
[case3:04260] [ 4] /usr/lib64/mpi/gcc/openmpi/lib64/libmpi.so.1(ompi_mpi_init+0x3bc) [0x7f2214bbd4fc]
[case3:04260] [ 5] /usr/lib64/mpi/gcc/openmpi/lib64/libmpi.so.1(PMPI_Init_thread+0x15d) [0x7f2214bd919d]
[case3:04260] [ 6] /LOCAL/Software/FEniCS-1.4.0/lib/libdolfin.so.1.4(_ZN6dolfin17SubSystemsManager8init_mpiEiPPci+0x8d) [0x7f221dc8987d]
[case3:04260] [ 7] /LOCAL/Software/FEniCS-1.4.0/lib/libdolfin.so.1.4(_ZN6dolfin17SubSystemsManager8init_mpiEv+0x31) [0x7f221dc89a61]
[case3:04260] [ 8] /LOCAL/Software/FEniCS-1.4.0/lib/libdolfin.so.1.4(_ZN6dolfin3MPI4sizeEP19ompi_communicator_t+0xd) [0x7f221dc88fbd]
[case3:04260] [ 9] /LOCAL/Software/FEniCS-1.4.0/lib/libdolfin.so.1.4(_ZN6dolfin3MPI11is_receiverEP19ompi_communicator_t+0x9) [0x7f221dc89019]
[case3:04260] [10] /LOCAL/Software/FEniCS-1.4.0/lib/libdolfin.so.1.4(_ZN6dolfin13RectangleMesh5buildEddddmmSs+0x49) [0x7f221d99e7c9]
[case3:04260] [11] /LOCAL/Software/FEniCS-1.4.0/lib/libdolfin.so.1.4(_ZN6dolfin13RectangleMeshC2EddddmmSs+0x8c) [0x7f221d99fb4c]
[case3:04260] [12] /LOCAL/Software/FEniCS-1.4.0/lib64/python2.7/site-packages/dolfin/cpp/_mesh.so(+0xf47b5) [0x7f220676d7b5]
[case3:04260] [13] /usr/lib64/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x265b) [0x7f2225559fbb]
[case3:04260] [14] /usr/lib64/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x221) [0x7f222555e811]
[case3:04260] [15] /usr/lib64/libpython2.7.so.1.0(+0xb1a7f) [0x7f2225542a7f]
[case3:04260] [16] /usr/lib64/libpython2.7.so.1.0(PyObject_Call+0x46) [0x7f222553df16]
[case3:04260] [17] /usr/lib64/libpython2.7.so.1.0(+0xaddfa) [0x7f222553edfa]
[case3:04260] [18] /usr/lib64/libpython2.7.so.1.0(PyObject_Call+0x46) [0x7f222553df16]
[case3:04260] [19] /usr/lib64/libpython2.7.so.1.0(+0xbe039) [0x7f222554f039]
[case3:04260] [20] /usr/lib64/libpython2.7.so.1.0(+0xbd66a) [0x7f222554e66a]
[case3:04260] [21] /usr/lib64/libpython2.7.so.1.0(PyObject_Call+0x46) [0x7f222553df16]
[case3:04260] [22] /usr/lib64/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x144e) [0x7f2225558dae]
[case3:04260] [23] /usr/lib64/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x4f6) [0x7f222555eae6]
[case3:04260] [24] /usr/lib64/libpython2.7.so.1.0(PyEval_EvalCode+0x32) [0x7f222558b812]
[case3:04260] [25] /usr/lib64/libpython2.7.so.1.0(+0x106f7d) [0x7f2225597f7d]
[case3:04260] [26] /usr/lib64/libpython2.7.so.1.0(PyRun_FileExFlags+0x92) [0x7f2225526010]
[case3:04260] [27] /usr/lib64/libpython2.7.so.1.0(PyRun_SimpleFileExFlags+0x308) [0x7f2225526bef]
[case3:04260] [28] /usr/lib64/libpython2.7.so.1.0(Py_Main+0xc60) [0x7f222552e81e]
[case3:04260] [29] /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f2224ee6be5]
[case3:04260] *** End of error message ***
Segmentation fault
This is a bug in Open MPI, triggered by the following line in SubSystemsManager.cpp:
SubSystemsManager::init_mpi(0, &c, MPI_THREAD_MULTIPLE);
which causes trouble with these particular lines in ompi_mpi_init:
/* if we were not externally started, then we need to setup
* some envars so the MPI_INFO_ENV can get the cmd name
* and argv (but only if the user supplied a non-NULL argv!), and
* the requested thread level
*/
if (NULL == getenv("OMPI_COMMAND") && NULL != argv && NULL != argv[0]) {
asprintf(&cmd, "OMPI_COMMAND=%s", argv[0]);
putenv(cmd);
}
which dereferences argv[0]. In DOLFIN's call, argv points at an uninitialized char*, so both NULL checks pass while argv[0] holds an indeterminate value (apparently the invalid address 0x2f seen in the trace above).
To get around this, just pass NULL as argv:
SubSystemsManager::init_mpi(0, NULL, MPI_THREAD_MULTIPLE);
This fixed the bug for me. The underlying issue is probably fixed in the latest version of Open MPI, but it can be very annoying on affected systems. One can also get around this bug by running the program with mpirun -n 1 python ..., presumably because mpirun sets OMPI_COMMAND and the branch above is then skipped.
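For reference, the pattern is easy to reproduce outside DOLFIN. A minimal standalone sketch (untested; it assumes an Open MPI 1.7.2 installation and mimics what DOLFIN's SubSystemsManager effectively does):

#include <mpi.h>

int main()
{
    int argc = 0;
    char* c;           // uninitialized, like DOLFIN's char* c
    char** argv = &c;  // argv != NULL and argv[0] is an indeterminate pointer
    int provided;

    // On Open MPI 1.7.2 this reaches the asprintf("%s", argv[0]) shown above
    // and should crash; with argv == NULL it initializes fine.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Finalize();
    return 0;
}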
Comments (13)
-
reporter changed milestone to 1.5
-
Quote from http://stackoverflow.com/questions/3024197/what-does-int-argc-char-argv-mean:
Since "The value of argv[argc] shall be 0" (C++03 §3.6.1/2), argv cannot be null.
-
I've seen this error before and it is annoying, but adding a work-around for every compiler or MPI bug isn't sustainable. Since the latest Open MPI series is 1.8 and the reported error is for 1.7, I don't think we should change the DOLFIN code. If it affects many users and is technically straightforward, we can have the build system print an error message.
-
reporter If Martin Alnas' quote is correct, then it is a FEniCS problem: see https://bitbucket.org/fenics-project/dolfin/src/2523d77bf847d9501fddd0206f7ebe3a7f99d158/dolfin/common/SubSystemsManager.cpp?at=master
char* c; SubSystemsManager::init_mpi(0, &c, MPI_THREAD_MULTIPLE);
should be then
char* c = ""; SubSystemsManager::init_mpi(0, &c, MPI_THREAD_MULTIPLE);
and everything is fine...
-
@felix_ospald Do I understand correctly from your last comment that you've tested with
char* c = "";
and it works? If yes, I can make a change.
-
reporter I did not test it, but the problem is that argv[0] points somewhere and Open MPI calls asprintf(&cmd, "OMPI_COMMAND=%s", argv[0]), which causes the crash. If you set char* c = "" then argv[0] == "" and there should be no problem. Also, if the standard says argv[argc] should point to a valid null character, then this means you should initialize char* c = "". I can test if it works, but I'm pretty certain.
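A two-line illustration of the difference (sketch only, with init_mpi standing in for SubSystemsManager::init_mpi):

char* bad;                           // indeterminate pointer: "%s" reads garbage
char* good = const_cast<char*>("");  // points at a readable empty string
// init_mpi(0, &bad,  MPI_THREAD_MULTIPLE);  // may segfault on Open MPI 1.7.2
// init_mpi(0, &good, MPI_THREAD_MULTIPLE);  // OMPI_COMMAND is set to ""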
-
reporter Ok, I recompiled and it works. I used
char* c = const_cast<char*>("");
-
reporter Maybe it also makes sense to introduce a parameters["mpi_params"] from which argv is built. This would allow passing arguments to MPI as listed here: http://linux.die.net/man/3/mpi_init
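A hypothetical sketch of what that could look like (no such parameter exists in DOLFIN; every name below is illustrative):

#include <sstream>
#include <string>
#include <vector>

// Split a flat option string into tokens; the returned strings own the
// storage that the argv pointers built below will point into.
std::vector<std::string> split_mpi_params(const std::string& params)
{
    std::vector<std::string> tokens;
    std::istringstream iss(params);
    std::string tok;
    while (iss >> tok)
        tokens.push_back(tok);
    return tokens;
}

// Build a null-terminated char* array over the tokens, so that
// argv[argc] == 0 as the standard quote above requires.
std::vector<char*> make_argv(std::vector<std::string>& tokens)
{
    std::vector<char*> argv;
    for (std::size_t i = 0; i < tokens.size(); ++i)
        argv.push_back(const_cast<char*>(tokens[i].c_str()));
    argv.push_back(0);
    return argv;
}

SubSystemsManager::init_mpi(static_cast<int>(tokens.size()), &argv[0], MPI_THREAD_MULTIPLE) would then receive a properly terminated argument vector.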
-
@garth-wells this seems like an easy fix to merge for the 1.5 release
-
-
assigned issue to
-
assigned issue to
-
- changed status to resolved
Fix in next
-
- removed milestone
Removing milestone: 1.5 (automated comment)
-
Actually, to totally fix this it would be even better to replace
by something like