CCS interferes with multi-process debugging

Issue #591 resolved
Dan Bonachea created an issue

I'm seeing evidence on multiple systems that the CCS primary segment validation performed in upcxx::init() interferes with multi-process debugging, at least with gdb. Here's a simple demonstration using the nightly develop build on dirac with smp-conduit:

pcp-d-5 upcxx/test$ cat hello_upcxx.cpp 
#include <upcxx/upcxx.hpp>
#include <iostream>
#include <sstream>

int main() {
  upcxx::init();

  std::ostringstream oss;
  oss << "Hello from "<<upcxx::rank_me()<<" of "<<upcxx::rank_n()<<'\n';
  std::cout << oss.str() << std::flush;

  upcxx::finalize();
  return 0;
}

pcp-d-5 upcxx/test$ upcxx --version
UPC++ version 2022.9.7 upcxx-2022.9.7-31-g98dc364 / gex-stable-2023_03_09-0-ge170295
Citing UPC++ in publication? Please see: https://upcxx.lbl.gov/publications
Copyright (c) 2023, The Regents of the University of California,
through Lawrence Berkeley National Laboratory.
https://upcxx.lbl.gov

g++ (GCC) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

pcp-d-5 upcxx/test$ upcxx -network=smp -g hello_upcxx.cpp

pcp-d-5 upcxx/test$ env GASNET_PSHM_NODES=2 ./a.out     
Hello from 1 of 2
Hello from 0 of 2

pcp-d-5 upcxx/test$ env GASNET_PSHM_NODES=2 gdb ./a.out
GNU gdb (GDB) 11.2
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./a.out...
(gdb) r
Starting program: <redacted>/upcxx/test/a.out 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
[Detaching after fork from child process 31152]
Hello from 0 of 2
Hello from 1 of 2
[Inferior 1 (process 31114) exited normally]
(gdb) break main
Breakpoint 1 at 0x40585e: file hello_upcxx.cpp, line 6.
(gdb) r
Starting program: <redacted>/upcxx/test/a.out 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".

Breakpoint 1, main () at hello_upcxx.cpp:6
6         upcxx::init();
(gdb) c
Continuing.
[Detaching after fork from child process 31194]
*** FATAL ERROR (proc 0): 
//////////////////////////////////////////////////////////////////////
UPC++ fatal error:
 on process 0 (pcp-d-5)
 at <redacted>/berkeleylab-upcxx-develop/src/./ccs.cpp:1128
 in function: static void upcxx::detail::segmap_cache::verify_all()

Primary segment verification failed

To have UPC++ freeze during these errors so you can attach a debugger,
rerun the program with GASNET_FREEZE_ON_ERROR=1 in the environment.
//////////////////////////////////////////////////////////////////////

*** FATAL ERROR (proc 1): 
//////////////////////////////////////////////////////////////////////
UPC++ fatal error:
 on process 1 (pcp-d-5)
 at <redacted>/berkeleylab-upcxx-develop/src/./ccs.cpp:1128
 in function: static void upcxx::detail::segmap_cache::verify_all()

Primary segment verification failed

To have UPC++ freeze during these errors so you can attach a debugger,
rerun the program with GASNET_FREEZE_ON_ERROR=1 in the environment.
//////////////////////////////////////////////////////////////////////

*** NOTICE (proc 1): Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace.
*** Caught a fatal signal (proc 1): SIGABRT(6)
*** NOTICE (proc 0): Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace.

Program received signal SIGABRT, Aborted.

The problem can only occur in a job with two or more ranks (single-rank segment validation is trivial). It appears to be triggered by instructing the debugger to insert a breakpoint into the primary code segment before upcxx::init() executes (there are several spawner-dependent ways to accomplish this). It does not appear to matter whether the breakpoint location itself is before or after the call to upcxx::init(), provided it is somewhere in the primary code segment. The failure also appears to depend on breakpoints being set differently in the primary code segments of two or more worker processes. This occurs in the example above because upcxx::init() induces a call to fork() at the line [Detaching after fork from child process 31194], which creates the second worker process; that process runs outside gdb, without breakpoints. The problem is easiest to demonstrate using smp-conduit as shown above, but the defect has also been confirmed to affect network conduits, including ibv-conduit and udp-conduit (using GASNET_FREEZE to attach a debugger before CCS init).

This behavior is not surprising, since breakpoints are usually implemented by modifying the in-memory code segment to insert a trap instruction, which results in mismatched CCS segment hashes. I've reproduced this on several Linux systems, using gdb versions 7.6.1, 8.0.1, 9.2 and 11.2.
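To make the mechanism concrete, here is a minimal sketch of why a software breakpoint defeats a hash comparison between processes. Everything in it is an illustrative assumption rather than UPC++ code: the byte array stands in for a primary code segment, FNV-1a stands in for whatever hash CCS actually computes, and the helper names (make_segment, insert_breakpoint, segments_match) are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// FNV-1a over a byte buffer. This is a stand-in chosen for brevity;
// it is NOT claimed to be the hash that CCS uses.
std::uint64_t fnv1a(const std::vector<std::uint8_t>& bytes) {
  std::uint64_t h = 14695981039346656037ull;  // FNV offset basis
  for (std::uint8_t b : bytes) {
    h ^= b;
    h *= 1099511628211ull;  // FNV prime
  }
  return h;
}

// Hypothetical stand-in for one process's in-memory code segment:
// a few illustrative x86-64 instruction bytes (push/mov/mov/pop/ret).
std::vector<std::uint8_t> make_segment() {
  return {0x55, 0x48, 0x89, 0xe5, 0xb8, 0x00,
          0x00, 0x00, 0x00, 0x5d, 0xc3};
}

// Mimic a debugger planting a software breakpoint by overwriting
// one byte with the x86 int3 trap opcode (0xCC).
void insert_breakpoint(std::vector<std::uint8_t>& seg, std::size_t offset) {
  seg.at(offset) = 0xCC;
}

// Mimic a cross-process check: two segments "verify" only if their
// hashes are equal.
bool segments_match(const std::vector<std::uint8_t>& a,
                    const std::vector<std::uint8_t>& b) {
  return fnv1a(a) == fnv1a(b);
}
```

Two fresh copies from make_segment() compare equal under segments_match(). After insert_breakpoint(copy, 4) is applied to only one of them, the comparison fails, which mirrors the situation where one worker runs under gdb with a breakpoint and the forked worker does not.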

Comments (2)

  1. Dan Bonachea reporter

    There are two recommended workarounds for users encountering this problem in current releases:

    1. Configure with --disable-ccs-rpc, which avoids the problem by disabling the CCS feature. Note that the CCS feature is required to support applications with RPC callbacks in shared libraries.
    2. Link the executable using -Wl,--build-id