- changed status to open
- removed comment
Output a stack backtrace on a fatal error in a simulation
When a Cactus simulation aborts with a signal, it is often difficult to determine which part of the code led to the problem. The attached patch registers a signal handler on Carpet startup for signals 11 and 6 (segmentation fault and abort, e.g. from assert()). The handler writes a stack backtrace from each process to a file, with demangled symbol names. It uses some low-level and possibly unofficial APIs, and is likely not completely portable. However, I have tested it on Mac OS (gcc) and Linux (intel), and it works on both.
Part of this code was contributed by Justin Luitjens at the Carpet developers' workshop in summer 2010.
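For readers who have not seen this technique, here is a minimal sketch of the idea, not the attached patch itself. It assumes a platform that provides backtrace() and backtrace_symbols_fd() in <execinfo.h> (glibc, Mac OS). The function names and the per-pid file name are made up for illustration, and demangling is left out.

```cpp
#include <csignal>
#include <cstdio>
#include <execinfo.h>  // backtrace, backtrace_symbols_fd (not POSIX)
#include <fcntl.h>
#include <unistd.h>

// Write the current call stack to an already open file descriptor,
// one line per frame. Returns the number of frames captured.
int write_backtrace(int fd) {
  void *frames[64];
  const int n = backtrace(frames, 64);
  backtrace_symbols_fd(frames, n, fd);  // writes directly, no malloc
  return n;
}

// Hypothetical handler: one output file per process, named after the pid.
// Note: snprintf is not formally async-signal-safe; this is a sketch.
extern "C" void fatal_signal_handler(int sig) {
  char name[64];
  std::snprintf(name, sizeof name, "backtrace.%d.txt", (int)getpid());
  const int fd = open(name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd >= 0) {
    write_backtrace(fd);
    close(fd);
  }
  // Restore the default action and re-raise, so the process still
  // terminates with the original signal.
  std::signal(sig, SIG_DFL);
  std::raise(sig);
}

// Register the handler for segmentation faults and aborts
// (signals 11 and 6 on Linux).
void install_backtrace_handlers() {
  std::signal(SIGSEGV, fatal_signal_handler);
  std::signal(SIGABRT, fatal_signal_handler);
}
```

Calling install_backtrace_handlers() once at startup would be enough; a failing assert() then leaves a backtrace file behind before the process dies.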
Comments (16)
-
reporter -
- removed comment
It worked for me on Linux CentOS 5. However, the backtrace didn't include the symbols:
Backtrace from rank 1 pid 21481:
1. /lib64/libc.so.6(gsignal+0x35) [0x3531430265]
2. /lib64/libc.so.6(abort+0x110) [0x3531431d10]
3. /lib64/libc.so.6(__assert_fail+0xf6) [0x35314296e6]
4. /home/bruno/scratch/frozenstar/bbh/./cactus_einstein_patch [0x115cdbe]
5. /home/bruno/scratch/frozenstar/bbh/./cactus_einstein_patch [0x420eb7]
6. /home/bruno/scratch/frozenstar/bbh/./cactus_einstein_patch [0x420fd1]
7. /home/bruno/scratch/frozenstar/bbh/./cactus_einstein_patch [0xa0b9fd]
8. /home/bruno/scratch/frozenstar/bbh/./cactus_einstein_patch [0xa0955a]
9. /home/bruno/scratch/frozenstar/bbh/./cactus_einstein_patch [0x5928e0]
a. /home/bruno/scratch/frozenstar/bbh/./cactus_einstein_patch [0x11225fd]
b. /home/bruno/scratch/frozenstar/bbh/./cactus_einstein_patch [0x111b1c7]
c. /home/bruno/scratch/frozenstar/bbh/./cactus_einstein_patch [0x111c857]
d. /home/bruno/scratch/frozenstar/bbh/./cactus_einstein_patch [0x111ce83]
e. /home/bruno/scratch/frozenstar/bbh/./cactus_einstein_patch [0x411979]
f. /lib64/libc.so.6(__libc_start_main+0xf4) [0x353141d994]
10. /home/bruno/scratch/frozenstar/bbh/./cactus_einstein_patch(memcmp+0x381) [0x411679]
I had OPTIMISE = yes in my configuration file, but I also had CFLAGS = -g -debug all ... So -g doesn't seem to be passed along here. Without the symbols, debugging becomes much harder. Apart from this detail, I would say this patch looks good to apply.
-
reporter - removed comment
Try adding -rdynamic to LDFLAGS:
http://gcc.gnu.org/onlinedocs/gcc/Link-Options.html
This was necessary for me to get this to work. The patch uses undocumented glibc hacks, so we should test it on several exotic architectures before applying - it might need to be disabled if certain features are not available. According to Justin's comments, there is no platform-independent or supported way of doing this.
We could consider adding -rdynamic by default to LDFLAGS in the flesh, though I don't know what the consequences of this would be.
-
- removed comment
GCC says:
-rdynamic Pass the flag -export-dynamic to the ELF linker, on targets that support it. This instructs the linker to add all symbols, not only used ones, to the dynamic symbol table. This option is needed for some uses of "dlopen" or to allow obtaining backtraces from within a program.
This seems like a safe thing to do. Note that we override many defaults in the SimFactory, so I would begin by adding it to SimFactory's option lists on the machines that we often use.
-
repo owner - removed comment
When trying this with gcc (4.6) I needed to #include <cstring>. I also added code to write the backtrace.xx.txt files to out_dir, in case more than one run segfaults. Finally, I seem to require -ldl in LDFLAGS; otherwise the link fails with dladdr not found. -lcrypto (from OpenSSL) also allows me to link (I assume crypto links to dl), so maybe this never happened with the ET thorn lists. Since dladdr is not POSIX but a GNU extension, is there maybe an autoconf test for it? I have never used autoconf myself, so I have no idea.
-
- removed comment
Autoconf has generic macros to test for header files or for functions. Cactus wraps these. CCTK_CHECK_FUNCS(dladdr) may be all you need; look for CCTK_CHECK_FUNCS in configure.in.
-
repo owner - removed comment
I added diffs that use autoconf to detect dladdr and __cxa_demangle. Both are GNU extensions and/or glibc specific: they are present if -D_GNU_SOURCE or -std=gnuXXX is used, but not, e.g., on Kraken when using PGI with only _BSD_SOURCE.
-
- removed comment
The patch backtrace_amend_v2.diff removes two #include statements that shouldn't be there in the first place... Is that a patch on top of the first patch?
Anyway, please apply.
-
repo owner - removed comment
Yes, backtrace_amend_v2.diff replaces backtrace_amend.diff (I should have simply let trac replace the file). If you mean the cxxabi and dlfcn includes, those are still present and are actually required (they declare function prototypes and a struct). They are now below cctk.h.
-
reporter - assigned issue to
- removed comment
-
reporter - removed comment
I had to remove some errant square brackets from backtrace_autoconf.diff. When using the Intel compiler, the Dl_info structure is only available if _GNU_SOURCE is defined. This is because it is a GNU extension. In order to get this defined (both in Initialise.cc and during the autoconf test), we could add it to every optionlist that uses the Intel compiler. Or maybe we could add it in known_architectures. Erik?
-
reporter - removed comment
The autoconf magic also seems to not work quite right. I have tested it using GCC on Mac OS and Intel 11 on Linux. In each case, autoconf reports
checking for cxxabi.h... yes
checking for cxa_demangle... yes
checking for Dl_info.dli_sname... yes
checking for dladdr... yes
but cctk_Config.h contains
#define HAVE_CXA_DEMANGLE 1
/* #undef HAVE_DLADDR */
Since HAVE_DLADDR is not defined, the backtrace names are not demangled. Roland has volunteered to try to get the autoconf macros working.
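To illustrate why the missing HAVE_DLADDR matters, here is a rough sketch of how such guards typically interact; it is not the code from the patch. The two macro lines at the top stand in for cctk_Config.h in the state reported above, and demangle_name and describe_frame are hypothetical helper names.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Stand-in for the generated cctk_Config.h, in the state reported above.
#define HAVE_CXA_DEMANGLE 1
/* #undef HAVE_DLADDR */

#ifdef HAVE_CXA_DEMANGLE
#  include <cxxabi.h>
#endif
#ifdef HAVE_DLADDR
#  include <dlfcn.h>
#endif

// Demangle a C++ symbol name; return the input unchanged on failure
// or when no demangler is available.
std::string demangle_name(const char *mangled) {
#ifdef HAVE_CXA_DEMANGLE
  int status = 0;
  char *text = abi::__cxa_demangle(mangled, NULL, NULL, &status);
  if (status == 0 && text) {
    std::string result(text);
    std::free(text);  // the buffer was allocated with malloc
    return result;
  }
#endif
  return mangled;
}

// Describe one stack frame. The mangled name comes from dladdr, so
// without HAVE_DLADDR the demangler is never reached and only the raw
// address is printed.
std::string describe_frame(void *addr) {
#ifdef HAVE_DLADDR
  Dl_info info;
  if (dladdr(addr, &info) && info.dli_sname)
    return demangle_name(info.dli_sname);
#endif
  char buf[32];
  std::snprintf(buf, sizeof buf, "%p", addr);
  return buf;
}
```

With this configuration, demangle_name("_Z3foov") still yields "foo()", but describe_frame() can only return an address. That matches the symptom: the demangler is available but never gets a name to work on.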
-
- removed comment
_GNU_SOURCE has other effects as well, and may create all sorts of problems, e.g. with <math.h> or <stdlib.h>. We can try, but it could lead to a long chain of follow-on problems on the usual weird architectures (AIX, PGI/IBM compilers, non-x86 CPUs, etc.).
Since this is an architecture-specific piece of code, I suggest instead that we #define _GNU_SOURCE before including these include files, in such a way that only this source file is affected. If necessary, we can also add this #define to the autoconf macros.
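A file-local definition along these lines might look like the sketch below. It assumes the macro has to precede every #include in the file so that <dlfcn.h> exposes Dl_info. symbol_or_object is a hypothetical helper, used here only to show the struct being available.

```cpp
// Must come before the first #include of this translation unit, so
// that only this source file sees the GNU extensions.
#ifndef _GNU_SOURCE
#  define _GNU_SOURCE 1
#endif

#include <dlfcn.h>  // Dl_info and dladdr (GNU extension, not POSIX)
#include <string>

// Hypothetical helper: pick the most useful printable name out of a
// Dl_info record, preferring the symbol name over the object file name.
std::string symbol_or_object(const Dl_info &info) {
  if (info.dli_sname)
    return info.dli_sname;
  if (info.dli_fname)
    return info.dli_fname;
  return "??";
}
```

In a real handler the record would be filled with dladdr(addr, &info), which may require -ldl at link time as noted earlier in this thread. The #ifndef guard avoids a redefinition warning with compilers that already predefine the macro.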
-
repo owner - assigned issue to
- removed comment
-
reporter - removed comment
Backtrace patch committed in changeset:3345:d87fce06a3cd/Carpet.
Flesh patch committed in SVN revision changeset:"4726/Cactus flesh".
-
reporter - changed status to resolved
- removed comment