Python segmentation fault Open MPI 1.7.2

Issue #384 resolved
Felix Ospald created an issue

The Python interface of DOLFIN produces a segmentation fault with this particular version of Open MPI.

case3| python demo_poisson.py
[case3:04260] *** Process received signal ***
[case3:04260] Signal: Segmentation fault (11)
[case3:04260] Signal code: Address not mapped (1)
[case3:04260] Failing at address: 0x2f
[case3:04260] [ 0] /lib64/libpthread.so.0(+0xf9f0) [0x7f22252829f0]
[case3:04260] [ 1] /lib64/libc.so.6(_IO_vfprintf+0x22d8) [0x7f2224f0dc48]
[case3:04260] [ 2] /lib64/libc.so.6(__vasprintf_chk+0xb5) [0x7f2224fbcce5]
[case3:04260] [ 3] /lib64/libc.so.6(__asprintf_chk+0x82) [0x7f2224fbcc22]
[case3:04260] [ 4] /usr/lib64/mpi/gcc/openmpi/lib64/libmpi.so.1(ompi_mpi_init+0x3bc) [0x7f2214bbd4fc]
[case3:04260] [ 5] /usr/lib64/mpi/gcc/openmpi/lib64/libmpi.so.1(PMPI_Init_thread+0x15d) [0x7f2214bd919d]
[case3:04260] [ 6] /LOCAL/Software/FEniCS-1.4.0/lib/libdolfin.so.1.4(_ZN6dolfin17SubSystemsManager8init_mpiEiPPci+0x8d) [0x7f221dc8987d]
[case3:04260] [ 7] /LOCAL/Software/FEniCS-1.4.0/lib/libdolfin.so.1.4(_ZN6dolfin17SubSystemsManager8init_mpiEv+0x31) [0x7f221dc89a61]
[case3:04260] [ 8] /LOCAL/Software/FEniCS-1.4.0/lib/libdolfin.so.1.4(_ZN6dolfin3MPI4sizeEP19ompi_communicator_t+0xd) [0x7f221dc88fbd]
[case3:04260] [ 9] /LOCAL/Software/FEniCS-1.4.0/lib/libdolfin.so.1.4(_ZN6dolfin3MPI11is_receiverEP19ompi_communicator_t+0x9) [0x7f221dc89019]
[case3:04260] [10] /LOCAL/Software/FEniCS-1.4.0/lib/libdolfin.so.1.4(_ZN6dolfin13RectangleMesh5buildEddddmmSs+0x49) [0x7f221d99e7c9]
[case3:04260] [11] /LOCAL/Software/FEniCS-1.4.0/lib/libdolfin.so.1.4(_ZN6dolfin13RectangleMeshC2EddddmmSs+0x8c) [0x7f221d99fb4c]
[case3:04260] [12] /LOCAL/Software/FEniCS-1.4.0/lib64/python2.7/site-packages/dolfin/cpp/_mesh.so(+0xf47b5) [0x7f220676d7b5]
[case3:04260] [13] /usr/lib64/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x265b) [0x7f2225559fbb]
[case3:04260] [14] /usr/lib64/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x221) [0x7f222555e811]
[case3:04260] [15] /usr/lib64/libpython2.7.so.1.0(+0xb1a7f) [0x7f2225542a7f]
[case3:04260] [16] /usr/lib64/libpython2.7.so.1.0(PyObject_Call+0x46) [0x7f222553df16]
[case3:04260] [17] /usr/lib64/libpython2.7.so.1.0(+0xaddfa) [0x7f222553edfa]
[case3:04260] [18] /usr/lib64/libpython2.7.so.1.0(PyObject_Call+0x46) [0x7f222553df16]
[case3:04260] [19] /usr/lib64/libpython2.7.so.1.0(+0xbe039) [0x7f222554f039]
[case3:04260] [20] /usr/lib64/libpython2.7.so.1.0(+0xbd66a) [0x7f222554e66a]
[case3:04260] [21] /usr/lib64/libpython2.7.so.1.0(PyObject_Call+0x46) [0x7f222553df16]
[case3:04260] [22] /usr/lib64/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x144e) [0x7f2225558dae]
[case3:04260] [23] /usr/lib64/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x4f6) [0x7f222555eae6]
[case3:04260] [24] /usr/lib64/libpython2.7.so.1.0(PyEval_EvalCode+0x32) [0x7f222558b812]
[case3:04260] [25] /usr/lib64/libpython2.7.so.1.0(+0x106f7d) [0x7f2225597f7d]
[case3:04260] [26] /usr/lib64/libpython2.7.so.1.0(PyRun_FileExFlags+0x92) [0x7f2225526010]
[case3:04260] [27] /usr/lib64/libpython2.7.so.1.0(PyRun_SimpleFileExFlags+0x308) [0x7f2225526bef]
[case3:04260] [28] /usr/lib64/libpython2.7.so.1.0(Py_Main+0xc60) [0x7f222552e81e]
[case3:04260] [29] /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f2224ee6be5]
[case3:04260] *** End of error message ***
Segmentation fault

This is a bug in Open MPI, triggered by the following line in SubSystemsManager.cpp:

SubSystemsManager::init_mpi(0, &c, MPI_THREAD_MULTIPLE);

which runs into trouble with these lines in ompi_mpi_init:

    /* if we were not externally started, then we need to setup
     * some envars so the MPI_INFO_ENV can get the cmd name
     * and argv (but only if the user supplied a non-NULL argv!), and
     * the requested thread level
     */
    if (NULL == getenv("OMPI_COMMAND") && NULL != argv && NULL != argv[0]) {
        asprintf(&cmd, "OMPI_COMMAND=%s", argv[0]);
        putenv(cmd);
    }

which accesses argv[0] even though argc is 0. Here argv[0] is the indeterminate dummy pointer c, so the asprintf call crashes.

To get around this, just pass NULL as argv:

SubSystemsManager::init_mpi(0, NULL, MPI_THREAD_MULTIPLE);

This fixed the crash for me. The underlying issue is probably fixed in the latest version of Open MPI, but it can be very annoying. One can also get around this bug by running the program with mpirun -n 1 python ...
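
For illustration, here is a minimal standalone program (not DOLFIN code, just a sketch assuming an Open MPI installation is available) contrasting the two call patterns at the MPI_Init_thread level:

    #include <mpi.h>

    int main()
    {
      int provided = -1;

      /* What DOLFIN 1.4 effectively does: argc == 0, but argv points at a
       * dummy char* c whose value is indeterminate, so Open MPI 1.7.2 passes
       * the "NULL != argv && NULL != argv[0]" check and hands a garbage
       * pointer to asprintf("OMPI_COMMAND=%s", argv[0]):
       *
       *   int argc = 0;
       *   char* c;
       *   char** argv = &c;
       *   MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
       */

      /* The workaround: pass NULL so the asprintf branch is skipped. */
      MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, &provided);

      MPI_Finalize();
      return 0;
    }

Passing NULL for both argc and argv has been allowed since MPI-2, so nothing is lost when no command-line arguments are available.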

Comments (13)

  1. Felix Ospald reporter

    Actually, to fix this completely it would be even better to replace

    MPI_Init_thread(&argc, &argv, required_thread_level, &provided);
    

    with something like

    if (argc == 0) argv = NULL;
    MPI_Init_thread(&argc, &argv, required_thread_level, &provided);
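
    For context, a sketch of where such a guard could sit inside an init_mpi-style wrapper; the function name, signature and surrounding checks here are hypothetical, not copied from DOLFIN 1.4:

    #include <mpi.h>
    #include <cstddef>

    // Hypothetical wrapper, for illustration only.
    int init_mpi_guarded(int argc, char* argv[], int required_thread_level)
    {
      int initialized = 0;
      MPI_Initialized(&initialized);
      if (initialized)
        return -1;

      // If the caller passed argc == 0, the contents of argv cannot be
      // trusted, so hand Open MPI a NULL argv instead.
      if (argc == 0)
        argv = NULL;

      int provided = -1;
      MPI_Init_thread(&argc, &argv, required_thread_level, &provided);
      return provided;
    }

    int main()
    {
      char* dummy = NULL;  // stand-in for the dummy pointer in the report
      init_mpi_guarded(0, &dummy, MPI_THREAD_MULTIPLE);
      MPI_Finalize();
      return 0;
    }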
    
  2. Prof Garth Wells

    I've seen this error before and it is annoying, but adding a workaround for every compiler or MPI bug isn't sustainable. Since the latest Open MPI series is 1.8 and the reported error is for 1.7, I don't think we should change the DOLFIN code. If it affects many users and is technically straightforward, we can have the build system print an error message.

  3. Prof Garth Wells

    @felix_ospald Do I understand correctly from your last comment that you've tested with char* c = ""; and it works? If yes, I can make a change.

  4. Felix Ospald reporter

    I did not test it, but the problem is that argv[0] points to an arbitrary address and Open MPI calls asprintf(&cmd, "OMPI_COMMAND=%s", argv[0]), which causes the crash. If you set char* c = "" then argv[0] == "" and there should be no problem. Also, if the standard says argv[argc] should point to a valid null character, then this means you should initialize char* c = "". I can test if it works, but I'm pretty certain.
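
    For reference, a small sketch of the approach under discussion (illustration only, not the actual DOLFIN code): the dummy argument is initialized to an empty string, so argv[0] is a valid empty string and Open MPI's asprintf has nothing unsafe to dereference.

    #include <mpi.h>

    int main()
    {
      // Dummy argv whose single entry points at a valid empty string, so
      // Open MPI's asprintf("OMPI_COMMAND=%s", argv[0]) formats "" safely.
      char empty[] = "";
      char* c = empty;
      char** argv = &c;
      int argc = 0;

      int provided = -1;
      MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

      MPI_Finalize();
      return 0;
    }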
