CarpetInterp hangs with openmpi

Issue #2023 closed
anonymous created an issue

On three occasions with two different simulations (TOV, BNS) the code hung. In two cases, I could attach a debugger to the running process, and in both cases the backtrace looked like this:

#0  0x00007f4ecfd5729a in __GI___pthread_mutex_lock (mutex=0xc40d730) at ../nptl/pthread_mutex_lock.c:79
#1  0x00007f4ec7224807 in ?? () from /usr/lib/openmpi/lib/openmpi/mca_btl_openib.so
#2  0x00007f4ecdae734a in opal_progress () from /usr/lib/libmpi.so.1
#3  0x00007f4ecda2d3b4 in ompi_request_default_wait_all () from /usr/lib/libmpi.so.1
#4  0x00007f4ec61cb7b7 in ompi_coll_tuned_sendrecv_actual () from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
#5  0x00007f4ec61d0df6 in ompi_coll_tuned_alltoallv_intra_pairwise () from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
#6  0x00007f4ecda3a00f in PMPI_Alltoallv () from /usr/lib/libmpi.so.1
#7  0x00000000015458fe in CarpetInterp::Carpet_DriverInterpolate (cctkGH_=<optimized out>, N_dims=<optimized out>, local_interp_handle=3, param_table_handle=6421, coord_system_handle=0, N_interp_points=224, interp_coords_type_code=130, coords_list=<optimized out>, N_input_arrays=6, input_array_variable_indices=0x7fff38fbf640, N_output_arrays=24, output_array_type_codes=0x7fff38fbf660, output_arrays=0x7fff38fbfb00)
at /home/wolfgang.kastaun/ET/Payne/Cactus/arrangements/Carpet/CarpetInterp/src/interp.cc:645
#8  0x000000000071ee4f in SymBase_SymmetryInterpolateFaces (cctkGH_=0xccb1be0, N_dims=<optimized out>, local_interp_handle=3, param_table_handle=6421, coord_system_handle=0, N_interp_points=224, interp_coords_type=130, interp_coords=0x7fff38fbe1a0, N_input_arrays=6, input_array_indices=0x7fff38fbf640, N_output_arrays=24, output_array_types=0x7fff38fbf660, output_arrays=0x7fff38fbfb00, faces=0)
at /home/wolfgang.kastaun/ET/Payne/Cactus/arrangements/CactusBase/SymBase/src/Interpolation.c:381

I'm not sure whether this is a problem with Carpet or with my OpenMPI installation. It happens only rarely, after a day or so of runtime. I'm using ET version Payne, gcc 4.9.2, and OpenMPI 1.6.5 on an InfiniBand interconnect. The code was compiled with OpenMP support and run with 4 threads per process.

Looking at various variables in Carpet_DriverInterpolate with the debugger, I noticed that the vector tmp was reported with a size of -488221088, but this could also be misreported by gdb, since the code was compiled with -O3.
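
To help distinguish a problem in the MPI stack or interconnect from a problem in Carpet, a standalone MPI_Alltoallv loop along the following lines could be run on the same nodes for a comparable length of time. This is only a minimal sketch; the message sizes and iteration counts are arbitrary and not taken from the failing runs. Build with the usual mpicc wrapper.

/* Minimal standalone MPI_Alltoallv stress test (sketch only).
   If this also stalls after a while, the hang is likely in the MPI stack
   or the interconnect rather than in CarpetInterp. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
  int rank, size, i, iter;
  int *counts, *displs;
  double *sendbuf, *recvbuf;
  const int count = 1024;   /* doubles exchanged with every rank; arbitrary */

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  sendbuf = malloc((size_t)count * size * sizeof *sendbuf);
  recvbuf = malloc((size_t)count * size * sizeof *recvbuf);
  counts  = malloc(size * sizeof *counts);
  displs  = malloc(size * sizeof *displs);

  for (i = 0; i < count * size; i++)
    sendbuf[i] = rank + i;
  for (i = 0; i < size; i++) {
    counts[i] = count;
    displs[i] = i * count;
  }

  /* Hammer the same collective that hangs in the backtrace above. */
  for (iter = 0; iter < 1000000; iter++) {
    MPI_Alltoallv(sendbuf, counts, displs, MPI_DOUBLE,
                  recvbuf, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);
    if (rank == 0 && iter % 10000 == 0)
      printf("iteration %d\n", iter);
  }

  free(sendbuf); free(recvbuf); free(counts); free(displs);
  MPI_Finalize();
  return 0;
}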

Keyword: OpenMPI
Keyword: Carpet

Comments (9)

  1. Ian Hinder

    This looks very similar to what we encountered with OmniPath on Minerva, which held up the acceptance for a few months. In that case, it was a bug in the OmniPath driver, which Intel eventually found and fixed using our reproducible example. This was with SpEC, not Cactus. It looks the same on the surface (a hang in MPI_Alltoallv), but you are using OpenMPI and InfiniBand, whereas we were using Intel MPI and OmniPath, so unfortunately this doesn't seem to be the same problem. You could also try compiling without optimisation to see if that gives better backtraces.

    Is your problem reproducible? What cluster is this? Have you tried different OpenMPI versions, or a different MPI implementation?

  2. anonymous reporter

    Replying to [comment:1 hinder]:

    This looks very similar to what we encountered with OmniPath on Minerva, which held up the acceptance for a few months. In that case, it was a bug in the OmniPath driver, which Intel eventually found and fixed using our reproducible example. This was with SpEC, not Cactus.

    Is that reproducible example a full SpEC run or a small example code I could test as well? Did you use hyperthreading and/or OpenMP?

    It looks the same on the surface (a hang in MPI_Alltoallv), but you are using OpenMPI and InfiniBand, whereas we were using Intel MPI and OmniPath, so unfortunately this doesn't seem to be the same problem. You could also try compiling without optimisation to see if that gives better backtraces.

    Is your problem reproducible?

    Not really. I did several short benchmarks (<20 min) with the same executable, and it happened in 2 out of 18. It also happened in one longer run after around 30 hours.

    What cluster is this?

    The new cluster "holodeck" for NR at the AEI Hannover. It's a 640-core Intel Xeon system.

    Have you tried different OpenMPI versions, or a different MPI implementation?

    Not yet; we need to install them first.

  3. Ian Hinder

    Unfortunately, we were unsuccessful in reproducing it in a small test case; it was a full SpEC run. There was no hyperthreading or OpenMP. In the runs where it was not reproducible, are the numerical results identical in each copy of the run, or does the code make different decisions, e.g. based on timing, in each case? SpEC has several places where it chooses the best algorithm based on runtime performance, which meant that the code paths differed between runs; that is why it didn't affect all of them. Only when we disabled all of these options did we manage to get something to fail reliably.
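
    For illustration only (this is a hypothetical sketch, not SpEC or Cactus code, and the ALGO_FORCE variable is made up): the pattern below shows how a timing-based algorithm choice can send otherwise identical runs down different code paths, and how pinning the choice makes every run, and therefore the failure, deterministic.

    /* Illustrative sketch only. A collective operation is implemented in two
       ways; the faster one is picked from a timing trial, so the chosen code
       path can differ from run to run. Setting the hypothetical ALGO_FORCE
       environment variable pins the choice. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    static void variant_a(MPI_Comm comm)   /* one communication pattern */
    {
      int x = 1, y;
      MPI_Allreduce(&x, &y, 1, MPI_INT, MPI_SUM, comm);
    }

    static void variant_b(MPI_Comm comm)   /* an equivalent, different pattern */
    {
      int x = 1, y;
      MPI_Reduce(&x, &y, 1, MPI_INT, MPI_SUM, 0, comm);
      MPI_Bcast(&y, 1, MPI_INT, 0, comm);
    }

    static int choose_variant(MPI_Comm comm)
    {
      const char *force = getenv("ALGO_FORCE");
      double t0, ta, tb;
      int local, global;

      if (force)                    /* pinned: deterministic code path */
        return atoi(force);

      /* Timing trial: the winner depends on system noise, so different runs
         of the same simulation can end up on different code paths here. */
      t0 = MPI_Wtime(); variant_a(comm); ta = MPI_Wtime() - t0;
      t0 = MPI_Wtime(); variant_b(comm); tb = MPI_Wtime() - t0;

      /* All ranks must agree on one choice, otherwise collectives mismatch. */
      local = (tb < ta) ? 1 : 0;
      MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_MAX, comm);
      return global;
    }

    int main(int argc, char **argv)
    {
      int choice, rank, step;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      choice = choose_variant(MPI_COMM_WORLD);
      if (rank == 0)
        printf("using variant %c\n", choice ? 'b' : 'a');

      for (step = 0; step < 1000; step++) {  /* main loop uses the chosen path */
        if (choice)
          variant_b(MPI_COMM_WORLD);
        else
          variant_a(MPI_COMM_WORLD);
      }

      MPI_Finalize();
      return 0;
    }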

  4. anonymous reporter

    Replying to [comment:3 hinder]:

    In the runs where it was not reproducible, are the numerical results identical in each copy of the run, or does the code make different decisions, e.g. based on timing, in each case?

    The hydro part always does the same thing; I'm not sure about McLachlan. By now it has also happened with Intel-compiled code, but using the same MPI. I tend to think it's a problem with our installation, not with Cactus.

  5. Ian Hinder

    McLachlan should be deterministic. I suspect the same; it feels like a system problem, not a Cactus problem.

  6. Roland Haas

    Wolfgang: any updates on this? If not, I will close this as "worksforme" in a week.
