OpenMP Heisenbug with Default Thornfile

Issue #2774 open
José Ferreira created an issue

Hello, I am facing an issue where a simulation run with the toolkit, built from the default thornlist, either runs as expected or crashes right at the beginning. This happens with tov_ET.par, which ships with the toolkit, and with a parfile that evolves a constant scalar field, which ships with the Scalar thorn also included in the toolkit.

I believe that the culprit is OpenMP, because the simulations run as expected if I restrict the run to a single thread at run time by setting OMP_NUM_THREADS=1, or if I disable OpenMP at compile time.

This bug occurs on both my laptop and my desktop, which have similar operating systems and software stacks.

In the following sections, I will write, line by line, the steps that I have performed in order to reproduce the bug, and my attempt at tracking it down.

Installing and Compiling the Toolkit

To avoid the CarpetX thorns, which for some reason fail to compile on my system, I start by downloading the previous release of the toolkit with

$ curl -kLO https://raw.githubusercontent.com/gridaphobe/CRL/ET_2023_05/GetComponents
$ chmod a+x GetComponents
$ ./GetComponents --parallel https://bitbucket.org/einsteintoolkit/manifest/raw/ET_2023_05/einsteintoolkit.th

and then change to the Cactus root directory

$ cd Cactus

I create the options file arch.cfg, which must be present in the parent directory, with the following contents

# Cactus configuration for Arch and Arch-based distros

## Decide which flags will be used at compile-time
OPTIMISE = yes
WARN     = yes
DEBUG    = no
PROFILE  = no
OPENMP   = yes

## Compilers
CPP = cpp
FPP = cpp
CC  = gcc
CXX = g++
F77 = gfortran
F90 = gfortran

## Default flags
CPPFLAGS = -DMPICH_IGNORE_CXX_SEEK
FPPFLAGS = -traditional
CFLAGS   = -g3 -march=native -std=gnu99
CXXFLAGS = -g3 -march=native -std=gnu++0x
F77FLAGS = -g3 -march=native -fcray-pointer -m128bit-long-double -ffixed-line-length-none -fno-range-check
F90FLAGS = -g3 -march=native -fcray-pointer -m128bit-long-double -ffixed-line-length-none -fno-range-check
LDFLAGS  = -rdynamic

## Optimization flags
CPP_OPTIMISE_FLAGS = -DKRANC_VECTORS # -DCARPET_OPTIMISE -DNDEBUG
FPP_OPTIMISE_FLAGS =                 # -DCARPET_OPTIMISE -DNDEBUG
C_OPTIMISE_FLAGS   = -Ofast
CXX_OPTIMISE_FLAGS = -Ofast
F77_OPTIMISE_FLAGS = -Ofast
F90_OPTIMISE_FLAGS = -Ofast

## Warning flags
CPP_WARN_FLAGS = -Wall
FPP_WARN_FLAGS = -Wall
C_WARN_FLAGS   = -Wall
CXX_WARN_FLAGS = -Wall
F77_WARN_FLAGS = -Wall
F90_WARN_FLAGS = -Wall

## Debug flags
CPP_DEBUG_FLAGS = -DCARPET_DEBUG -fsanitize=undefined -fsanitize=thread
FPP_DEBUG_FLAGS = -DCARPET_DEBUG -fsanitize=undefined -fsanitize=thread
C_DEBUG_FLAGS   = -O0            -fsanitize=undefined -fsanitize=thread
CXX_DEBUG_FLAGS = -O0            -fsanitize=undefined -fsanitize=thread
F77_DEBUG_FLAGS = -O0            -fsanitize=undefined -fsanitize=thread
F90_DEBUG_FLAGS = -O0            -fsanitize=undefined -fsanitize=thread

## Code profiling flags
CPP_PROFILE_FLAGS =
FPP_PROFILE_FLAGS =
C_PROFILE_FLAGS   = -pg
CXX_PROFILE_FLAGS = -pg
F77_PROFILE_FLAGS = -pg
F90_PROFILE_FLAGS = -pg

## OpenMP
CPP_OPENMP_FLAGS = -fopenmp
FPP_OPENMP_FLAGS = -fopenmp
C_OPENMP_FLAGS   = -fopenmp
CXX_OPENMP_FLAGS = -fopenmp
F77_OPENMP_FLAGS = -fopenmp
F90_OPENMP_FLAGS = -fopenmp

## Libraries location
LIBDIRS      =
MPI_DIR      = /usr
HDF5_DIR     = /usr
PTHREADS_DIR = NO_BUILD
LIBS              = gfortran open-pal z
C_LINE_DIRECTIVES = yes
F_LINE_DIRECTIVES = yes

I then create the configuration folder with

$ make base-config options=../arch.cfg THORNLIST=thornlists/einsteintoolkit.th

which reports no errors (the terminal output is attached as make-config.out), and then build the binary with

$ make -j $(nproc) base

which also reports no errors; the output is attached as make-binary.out.

To ensure full reproducibility, instead of doing this manually I created a very simple script, attached as run.sh, that reproduces the steps laid out in this section.

Running the Toolkit and Finding the Bug

With no errors so far, and the binary exe/cactus_base in place, I run one of the par files provided by default in the toolkit, in this case par/tov_ET.par

$ exe/cactus_base par/tov_ET.par

which crashes with output like

Rank 0 with PID 2329840 received signal 11
Writing backtrace to tov_ET/backtrace.0.txt
[1]    2329840 segmentation fault (core dumped)

The full output of this simulation is attached as run.out, along with backtrace.0.txt.

Interestingly, if I insist on running this simulation a few times, I eventually find a run where the bug doesn’t occur and the simulation seems to run as expected.

Therefore, this is not just a classical bug, it’s a Heisenbug!
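
To put a number on how often the crash occurs, the run can be repeated in a loop. The following is only a minimal sketch: count_failures is a hypothetical helper name, not something provided by the toolkit, and it looks only at the exit status of each run (a segmentation fault yields a non-zero status).

```shell
#!/bin/sh
# count_failures N CMD [ARGS...]
# Hypothetical helper (not part of the toolkit): run CMD N times and
# report how many runs ended with a non-zero exit status.
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Discard the output of the run; only the exit status matters here.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures/$runs runs failed"
}
```

With the function defined in the current shell, 20 runs could be tallied with

$ count_failures 20 exe/cactus_base par/tov_ET.par

Note that a run that does not crash continues until the simulation finishes, so this can take a while.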

Tracking the Bug

If I run the binary with OpenMP restricted to a single thread at run time, i.e.

$ OMP_NUM_THREADS=1 exe/cactus_base par/tov_ET.par

then I consistently get no errors, even after running the code a few dozen times.

This is why I am led to believe that OpenMP is the culprit behind the Heisenbug.
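
The same comparison can be repeated for several thread counts, to see whether the crash only appears once more than one thread is in use. This is a minimal sketch under that assumption; sweep_threads is a hypothetical helper name, and each thread count is tried only once, so an intermittent crash may need several sweeps to show up.

```shell
#!/bin/sh
# sweep_threads "COUNTS" CMD [ARGS...]
# Hypothetical helper: run CMD once for each thread count in the
# space-separated list COUNTS, with OMP_NUM_THREADS set accordingly,
# and print the exit status of each run.
sweep_threads() {
    counts=$1
    shift
    for n in $counts; do
        # env sets the variable only for this one invocation.
        env OMP_NUM_THREADS="$n" "$@" >/dev/null 2>&1
        echo "OMP_NUM_THREADS=$n exit=$?"
    done
}
```

It would be called as sweep_threads "1 2 4" followed by the binary and the same parfile as above.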

I created a new configuration of the toolkit with DEBUG=yes and OPTIMISE=no in the options file above, which produced the binary exe/cactus_base-debug. For completeness, the output of the configuration step is attached as make-config-debug.out and that of the build as make-binary-debug.out, although no errors were produced.

I don’t have that much experience in debugging low-level code, so I decided to disable optimizations and add -fsanitize=undefined and -fsanitize=thread to the GCC flags, which looked reasonable to me.

Running the debug version of the binary with the same parfile

$ exe/cactus_base-debug par/tov_ET.par

reports the same error over and over, and the program does not terminate (at least not within about one minute, after which I stopped it). The errors look something like

==================
WARNING: ThreadSanitizer: data race (pid=2419680)
  Read of size 8 at 0x7ffe815ca630 by thread T5:
    #0 grhydro_atmospherereset_._omp_fn.0 /home/undercover/misc/tmp/Cactus/arrangements/EinsteinEvolve/GRHydro/src/GRHydro_UpdateMask.F90:323 (cactus_base-debug+0xd14fbb9) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #1 gomp_thread_start /usr/src/debug/gcc/gcc/libgomp/team.c:129 (libgomp.so.1+0x20c95) (BuildId: 919d8c8c3093e63652b89795375dcf12dd9cb1d4)

  Previous write of size 8 at 0x7ffe815ca630 by main thread:
    #0 grhydro_atmospherereset_ /home/undercover/misc/tmp/Cactus/arrangements/EinsteinEvolve/GRHydro/src/GRHydro_UpdateMask.F90:321 (cactus_base-debug+0xd11394a) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #1 CCTKi_BindingsFortranWrapperGRHydro /home/undercover/misc/tmp/Cactus/configs/base-debug/bindings/Variables/GRHydro.c:37 (cactus_base-debug+0x155440a4) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #2 CCTK_CallFunction /home/undercover/misc/tmp/Cactus/src/main/ScheduleInterface.c:323 (cactus_base-debug+0x152bc047) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #3 CallScheduledFunction /home/undercover/misc/tmp/Cactus/arrangements/Carpet/Carpet/src/CallFunction.cc:440 (cactus_base-debug+0xa36120b) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #4 Carpet::CallFunction(void*, cFunctionData*, void*) /home/undercover/misc/tmp/Cactus/arrangements/Carpet/Carpet/src/CallFunction.cc:373 (cactus_base-debug+0xa35f44b) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #5 CCTKi_ScheduleCallFunction /home/undercover/misc/tmp/Cactus/src/main/ScheduleInterface.c:3096 (cactus_base-debug+0x152c6b74) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #6 ScheduleTraverseFunction /home/undercover/misc/tmp/Cactus/src/schedule/ScheduleTraverse.c:595 (cactus_base-debug+0x152d6b94) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #7 ScheduleTraverseGroup /home/undercover/misc/tmp/Cactus/src/schedule/ScheduleTraverse.c:369 (cactus_base-debug+0x152d5be6) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #8 ScheduleTraverseGroup /home/undercover/misc/tmp/Cactus/src/schedule/ScheduleTraverse.c:385 (cactus_base-debug+0x152d673d) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #9 ScheduleTraverseGroup /home/undercover/misc/tmp/Cactus/src/schedule/ScheduleTraverse.c:385 (cactus_base-debug+0x152d673d) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #10 CCTKi_DoScheduleTraverse /home/undercover/misc/tmp/Cactus/src/schedule/ScheduleTraverse.c:159 (cactus_base-debug+0x152d4d7e) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #11 ScheduleTraverse /home/undercover/misc/tmp/Cactus/src/main/ScheduleInterface.c:1400 (cactus_base-debug+0x152bedf0) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #12 CCTK_ScheduleTraverse /home/undercover/misc/tmp/Cactus/src/main/ScheduleInterface.c:919 (cactus_base-debug+0x152bcedf) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #13 ScheduleTraverse /home/undercover/misc/tmp/Cactus/arrangements/Carpet/Carpet/src/Initialise.cc:1393 (cactus_base-debug+0xa3f8929) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #14 CallRestrict /home/undercover/misc/tmp/Cactus/arrangements/Carpet/Carpet/src/Initialise.cc:529 (cactus_base-debug+0xa3e8be8) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #15 Carpet::Initialise(tFleshConfig*) /home/undercover/misc/tmp/Cactus/arrangements/Carpet/Carpet/src/Initialise.cc:121 (cactus_base-debug+0xa3ddc37) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #16 main /home/undercover/misc/tmp/Cactus/src/main/flesh.cc:80 (cactus_base-debug+0x1528f2e7) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)

  Location is stack of main thread.

  Location is global '<null>' at 0x000000000000 ([stack]+0xf7630)

  Thread T5 (tid=2419694, running) created by main thread at:
    #0 pthread_create /usr/src/debug/gcc/gcc/libsanitizer/tsan/tsan_interceptors_posix.cpp:1036 (libtsan.so.2+0x44219) (BuildId: 7e8fcb9ed0a63b98f2293e37c92ac955413efd9e)
    #1 gomp_team_start /usr/src/debug/gcc/gcc/libgomp/team.c:858 (libgomp.so.1+0x212df) (BuildId: 919d8c8c3093e63652b89795375dcf12dd9cb1d4)
    #2 CarpetLib::dist::pseudoinit(ompi_communicator_t*) /home/undercover/misc/tmp/Cactus/arrangements/Carpet/CarpetLib/src/dist.cc:200 (cactus_base-debug+0xac0eac8) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #3 CarpetMultiModelStartup /home/undercover/misc/tmp/Cactus/arrangements/Carpet/Carpet/src/CarpetStartup.cc:29 (cactus_base-debug+0xa3733e3) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #4 CCTK_CallFunction /home/undercover/misc/tmp/Cactus/src/main/ScheduleInterface.c:309 (cactus_base-debug+0x152bbf51) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #5 CCTKi_ScheduleCallFunction /home/undercover/misc/tmp/Cactus/src/main/ScheduleInterface.c:3096 (cactus_base-debug+0x152c6b74) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #6 ScheduleTraverseFunction /home/undercover/misc/tmp/Cactus/src/schedule/ScheduleTraverse.c:595 (cactus_base-debug+0x152d6b94) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #7 ScheduleTraverseGroup /home/undercover/misc/tmp/Cactus/src/schedule/ScheduleTraverse.c:369 (cactus_base-debug+0x152d5be6) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #8 CCTKi_DoScheduleTraverse /home/undercover/misc/tmp/Cactus/src/schedule/ScheduleTraverse.c:159 (cactus_base-debug+0x152d4d7e) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #9 ScheduleTraverse /home/undercover/misc/tmp/Cactus/src/main/ScheduleInterface.c:1400 (cactus_base-debug+0x152bedf0) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #10 CCTK_ScheduleTraverse /home/undercover/misc/tmp/Cactus/src/main/ScheduleInterface.c:919 (cactus_base-debug+0x152bcedf) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #11 CCTKi_CallStartupFunctions /home/undercover/misc/tmp/Cactus/src/main/CallStartupFunctions.c:50 (cactus_base-debug+0x1527f958) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #12 CCTKi_InitialiseCactus /home/undercover/misc/tmp/Cactus/src/main/InitialiseCactus.c:117 (cactus_base-debug+0x152a179e) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #13 main /home/undercover/misc/tmp/Cactus/src/main/flesh.cc:64 (cactus_base-debug+0x1528f271) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)

SUMMARY: ThreadSanitizer: data race /home/undercover/misc/tmp/Cactus/arrangements/EinsteinEvolve/GRHydro/src/GRHydro_UpdateMask.F90:323 in grhydro_atmospherereset_._omp_fn.0
==================

The full output of this run until I stopped is attached in run-debug.out (it’s rather large for a text file, sorry).
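
Because the same report repeats many times, the log is easier to read once the reports are grouped by location. The following is a minimal sketch, assuming the log contains SUMMARY lines like the one shown above; tsan_summary and run_first_report are hypothetical helper names. The second helper relies on the fact that ThreadSanitizer reads run-time flags from the TSAN_OPTIONS environment variable, where halt_on_error=1 makes it stop at the first report instead of continuing.

```shell
#!/bin/sh
# tsan_summary LOGFILE
# Hypothetical helper: count how often each distinct ThreadSanitizer
# SUMMARY line occurs in LOGFILE, most frequent first.
tsan_summary() {
    grep '^SUMMARY: ThreadSanitizer' "$1" | sort | uniq -c | sort -rn
}

# run_first_report CMD [ARGS...]
# Hypothetical helper: run CMD with ThreadSanitizer told to stop at the
# first report (halt_on_error is a ThreadSanitizer run-time flag).
run_first_report() {
    env TSAN_OPTIONS="halt_on_error=1" "$@"
}
```

For the attached log this would be

$ tsan_summary run-debug.out

and run_first_report would be given the debug binary and parfile as its arguments.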

Once again, by disabling OpenMP at runtime everything seems fine, with the exception of the error

Cactus/arrangements/CactusNumerical/MoL/src/Operators.c:332:31: runtime error: variable length array bound evaluates to non-positive value 0

which I don’t think is problematic, but I’ve decided to share it anyway.

If you have any general or specific tips, tricks or hacks to track down this bug more accurately, or advice on how to interpret the output of the previous tracebacks, they would be much appreciated.

Machine information

I’ve witnessed this behavior on two different machines with similar operating systems:

  • Legion

    • Type: Laptop
    • Processor: Quad-core Intel(R) Core(TM) i5-7300HQ CPU @ 2.50GHz (no hyper-threading)
    • GPU: Integrated + Nvidia 1050 Ti Mobile
    • OS: Manjaro (x86_64)
    • Kernel: Linux LTS 5.10.206
    • GCC: 13.2.1
    • OpenMP: 16.0.6
    • OpenBLAS: 0.3.26
    • hwloc: 2.10.0
  • Gravitino

    • Type: Desktop
    • Processor: Octa-core Intel(R) Core(TM) i7-9700 @ 3.00GHz (no hyper-threading)
    • GPU: Nvidia 1050 Ti
    • OS: Arch Linux (x86_64)
    • Kernel: Linux LTS 6.6.15
    • GCC: 13.2.1
    • OpenMP: 16.0.6
    • OpenBLAS: 0.3.26
    • hwloc: 2.10.0

There are no virtual environments; everything is managed by the global package manager using the latest releases in the corresponding repositories, and all binaries are linked against system libraries.

If you need any more information about any of the machines, or any of the steps provided above, do not hesitate to reply to this thread.

Thank you!

Comments (5)

  1. José Ferreira reporter

    After disabling Formaline during the compilation and in the par files, the incidence of the Heisenbug dropped significantly.

    Whereas before it would happen almost every time, now it happens very rarely, i.e. once every few dozen simulations or so.

    I would like to stress that the error always took place during the initialization procedure and never during the simulation.

  2. José Ferreira reporter

    I found two new situations where the bug manifests itself seemingly every time:

    • On trying to recover from checkpoints
    • By setting “TerminationTrigger” as an active thorn

    The behavior is the same: a crash during initialization, with a backtrace being written.

    EDIT: Now I’m trying out “TerminationTrigger” again and it seems fine…
