CCS Debugging

Introduction

This document gives an overview of the debugging information produced by the cross-code segment (CCS) RPC feature (see ccs-rpc.md for a general overview). It covers situations where CCS is required but not enabled, as well as various errors caused by invalid usage of CCS.

Linking with -Wl,--build-id is recommended when debugging a CCS-enabled application (see debugging.md). Otherwise, the debugger may modify the code segment, interfering with UPC++'s ability to identify and verify code segments by their hashes. This link option embeds an invariant identifier at link time. A segment in an executable or library built with -Wl,--build-id will have a numerical "segment #" in the CCS debug tables, while executables and libraries built without it will report "segment hash".

CCS Verification Failure

CCS verification is enabled by default with UPCXX_CODEMODE=debug and disabled with UPCXX_CODEMODE=opt. In debug mode, upcxx::init() also automatically verifies all segments. A verification error is most likely to happen when using a function in a dlopened library or if verification is manually enabled in opt mode. For UPC++ to verify that all ranks have loaded a library, and that they have loaded the same version of it, one of the upcxx::experimental::relo::verify_*() functions must be called so that UPC++ can check that all segment hashes match.

// ccs1.cpp
#include <dlfcn.h>
#include <stdexcept>
#include <upcxx/upcxx.hpp>

int main() {
  upcxx::init();
  void (*dlopen_function)() = nullptr;
  void* handle = dlopen("liblibrary2.so", RTLD_NOW);
  if (!handle)
    throw std::runtime_error(dlerror());
  dlopen_function = reinterpret_cast<void(*)()>(dlsym(handle, "dlopen_function"));
  if (!dlopen_function)
    throw std::runtime_error(dlerror());

  // This fixes the verification error
  // upcxx::experimental::relo::verify_all();

  // Would throw `upcxx::segment_verification_error`
  upcxx::rpc(0, dlopen_function).wait();
  upcxx::finalize();
  return 0;
}

If segment verification is enabled and an RPC call is made to a function in an unverified segment, an error like the following will be produced on the RPC initiator:

terminate called after throwing an instance of 'upcxx::segment_verification_error'
  what():  Attempted to use unverified segment [7fa70314e000-7fa70314e11d] to relocate function pointer 7fa70314e100.

[0] -------------------------------------------------------------------------------------------------------------------------------------------------------------------
[0] | Lookup for pointer:   0x7fa70314e100 (BAD VERIFICATION)                                                                                                         |
[0] |-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
[0] | dlpi_name (rank_me: 0)                                   | hash                                     | segment #    | flags    | start_addr     | end_addr       |
[0] |----------------------------------------------------------|------------------------------------------|--------------|----------|----------------|----------------|
[0] |   ./ccs2                                                 | b0c00b3c987f7972b54fa2ad2393055a00000000 | segment hash |        2 | 0x55cf7577c000 | 0x55cf75bad7e9 |
[0] |   linux-vdso.so.1                                        | 9a7cfd842208c5ed2d7313c8823432075f6ce683 |            0 |        2 | 0x7fffd0b64000 | 0x7fffd0b649ba |
[0] |   /usr/lib/gcc/x86_64-pc-linux-gnu/11.2.0/libstdc++.so.6 | e157510e2e2c2ca7ef590db749b6a57900000000 | segment hash |        2 | 0x7fa702fb1000 | 0x7fa7030a85a9 |
[0] |   /lib64/libm.so.6                                       | 85fcee4f2f9ffef9ce7c3b7bf1e7b27b00000000 | segment hash |        2 | 0x7fa702e50000 | 0x7fa702ebbb45 |
[0] |   /usr/lib/gcc/x86_64-pc-linux-gnu/11.2.0/libgcc_s.so.1  | 29f7f4fb3724acda32e5335087f54b5700000000 | segment hash |        2 | 0x7fa702e2b000 | 0x7fa702e3c0f9 |
[0] |   /lib64/libc.so.6                                       | 4d9050284432d45489efef807ddf9fa500000000 | segment hash |        2 | 0x7fa702c5a000 | 0x7fa702dc545c |
[0] |   /lib64/ld-linux-x86-64.so.2                            | df6c22986b268b6b41dffef5807cb47f00000000 | segment hash |        2 | 0x7fa703153000 | 0x7fa703176c9e |
[0] | * ./liblibrary2.so                                       | c120b080aef988bc88f687acc3756c2800000000 | segment hash |        1 | 0x7fa70314e000 | 0x7fa70314e11d |
[0] -------------------------------------------------------------------------------------------------------------------------------------------------------------------

The * to the left of ./liblibrary2.so indicates that this is the problem segment the error references. The flags column shows the bit-flag state of each segment: 0x1 indicates the segment is in the CCS cache, while 0x2 indicates a verified segment.
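The dlpi_name column header matches a field of glibc's struct dl_phdr_info, which is what dl_iterate_phdr() passes to its callback. The sketch below uses that glibc call to list the executable segments of the current process, which can be handy for cross-checking the start_addr and end_addr columns. It is only an illustration: whether UPC++ builds its table this way is an assumption, and the names LoadedSegment, collect_exec_segments, and list_exec_segments are hypothetical. glibc reports an empty name for the main executable, so names may not match the table exactly.

```cpp
#include <link.h>   // dl_iterate_phdr, struct dl_phdr_info (glibc / Linux only)
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record type for this sketch; not a UPC++ type.
struct LoadedSegment {
  std::string name;          // dlpi_name as reported by the dynamic loader
  std::uint64_t start_addr;  // first address of the executable PT_LOAD segment
  std::uint64_t end_addr;    // one past the last mapped byte of that segment
};

// Callback invoked once per loaded object (executable, shared library, vDSO).
static int collect_exec_segments(struct dl_phdr_info* info, std::size_t, void* data) {
  auto* out = static_cast<std::vector<LoadedSegment>*>(data);
  for (int i = 0; i < info->dlpi_phnum; ++i) {
    const ElfW(Phdr)& ph = info->dlpi_phdr[i];
    // Keep only loadable segments that are marked executable.
    if (ph.p_type != PT_LOAD || !(ph.p_flags & PF_X)) continue;
    std::uint64_t start = info->dlpi_addr + ph.p_vaddr;
    out->push_back({info->dlpi_name, start, start + ph.p_memsz});
  }
  return 0;  // returning 0 tells dl_iterate_phdr to continue with the next object
}

// Collect the executable segments of every object loaded in this process.
std::vector<LoadedSegment> list_exec_segments() {
  std::vector<LoadedSegment> segs;
  dl_iterate_phdr(collect_exec_segments, &segs);
  return segs;
}
```

Calling list_exec_segments() before and after a dlopen() should show the newly loaded library appear as an additional entry.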

CCS Asymmetry

CCS Verification errors can also occur if a library is loaded asymmetrically. Take the following example, where the library is only loaded on rank 0.

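// ccs2.cpp
#include <dlfcn.h>
#include <iostream>
#include <stdexcept>
#include <upcxx/upcxx.hpp>

int main() {
  upcxx::init();
  if (upcxx::rank_n() < 2) {
    std::cerr << "This test must be run with at least two ranks" << std::endl;
    return 2;
  }
  // Declared outside the rank check so it is still in scope for the RPC below
  void (*dlopen_function)() = nullptr;
  if (upcxx::rank_me() == 0) {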
    void* handle = dlopen("liblibrary2.so", RTLD_NOW);
    if (!handle)
      throw std::runtime_error(dlerror());
    dlopen_function = reinterpret_cast<void(*)()>(dlsym(handle, "dlopen_function"));
    if (!dlopen_function)
      throw std::runtime_error(dlerror());
  }

  upcxx::experimental::relo::verify_all();

  if (upcxx::rank_me() == 0) {
    // throws `upcxx::segment_verification_error`
    upcxx::rpc(1, dlopen_function).wait();
  }

  upcxx::finalize();
  return 0;
}

In this case, the verification error is due to a failed rather than a missing verification. The error looks nearly the same as the previous one, except that the flags field reads "5" rather than "1": the combination of the bad_verification (0x4) flag and the active (0x1) flag. In this contrived example, the error is easily fixed by having all ranks load the library. In a program where libraries are dlopened asymmetrically on demand, such as a Python library, the fix is more complicated. In such a use case, the exception could be caught and used to trigger all ranks to load the necessary library.

Asymmetry may also occur if different ranks receive different configurations, causing different versions of shared libraries to be loaded. If this occurs, it can be debugged by calling upcxx::experimental::relocation::debug_write_segment_table() and comparing the library file paths and hashes between ranks to find the offending libraries.
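Comparing tables by eye gets tedious with many libraries, so the comparison can be scripted. The following is a minimal sketch, assuming the tables from each rank have been captured as text in the same layout as the error output shown in this document (columns separated by |, a 40-character hexadecimal hash in the second column). The helper names parse_segment_table and diff_segment_tables are hypothetical and not part of UPC++.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Maps dlpi_name -> hash for one rank's table.
using SegmentHashes = std::map<std::string, std::string>;

// Strip leading and trailing whitespace.
static std::string trim(const std::string& s) {
  const char* ws = " \t\r\n";
  std::size_t b = s.find_first_not_of(ws);
  if (b == std::string::npos) return "";
  std::size_t e = s.find_last_not_of(ws);
  return s.substr(b, e - b + 1);
}

// Assumption: hashes are printed as exactly 40 hex digits, as in the samples.
static bool is_hash(const std::string& s) {
  if (s.size() != 40) return false;
  for (unsigned char c : s)
    if (!std::isxdigit(c)) return false;
  return true;
}

// Extract name/hash pairs from table text; header and rule lines are skipped
// because their second column is not a hash.
SegmentHashes parse_segment_table(const std::string& text) {
  SegmentHashes result;
  std::istringstream lines(text);
  std::string line;
  while (std::getline(lines, line)) {
    std::vector<std::string> cols;
    std::istringstream fields(line);
    std::string field;
    while (std::getline(fields, field, '|')) cols.push_back(trim(field));
    // cols[0] is the "[rank]" prefix, cols[1] the name, cols[2] the hash.
    if (cols.size() < 3 || !is_hash(cols[2])) continue;
    std::string name = cols[1];
    if (name.compare(0, 2, "* ") == 0) name = trim(name.substr(2));  // drop the problem marker
    result[name] = cols[2];
  }
  return result;
}

// Report libraries that are missing on one side or whose hashes differ.
std::vector<std::string> diff_segment_tables(const SegmentHashes& a, const SegmentHashes& b) {
  std::vector<std::string> report;
  for (const auto& kv : a) {
    auto it = b.find(kv.first);
    if (it == b.end()) report.push_back(kv.first + ": missing from second table");
    else if (it->second != kv.second) report.push_back(kv.first + ": hash mismatch");
  }
  for (const auto& kv : b)
    if (a.find(kv.first) == a.end()) report.push_back(kv.first + ": missing from first table");
  return report;
}
```

Keying on dlpi_name is a design choice for this sketch: the same library loaded from different paths on different ranks would be reported as missing rather than matched by hash.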

CCS Debugging Without Verification

The CCS segment verification feature requires symmetry across all processes. This may not be possible in heterogeneous configurations where only some ranks are able to load a library that UPC++ RPCs into directly. In such cases, CCS relocations can still be debugged manually. Without verification enabled, the error will occur on the RPC target process rather than the initiating process and will display the relocation token in the form of {hash, offset}.

*** FATAL ERROR (proc 1):
//////////////////////////////////////////////////////////////////////
UPC++ fatal error:
 on process 1 (abominable-gentoo)
 at /home/colin/upcxx/build-fpic/bld/upcxx.assert0.optlev3.dbgsym0.gasnet_seq.udp/include/upcxx/ccs.hpp:306
 in function: Fp upcxx::detail::function_token_ms::detokenize(upcxx::detail::segmap_cache&) const [with Fp = void (*)()]()

Attempted detokenization in unknown executable segment. See: docs/ccs-rpc.md.

[1] -------------------------------------------------------------------------------------------------------------------------------------------------------------------
[1] | Lookup for token: {c120b080aef988bc88f687acc3756c2800000000, 100} (FAILURE)                                                                                     |
[1] |-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
[1] | dlpi_name (rank_me: 1)                                   | hash                                     | segment #    | flags    | start_addr     | end_addr       |
[1] |----------------------------------------------------------|------------------------------------------|--------------|----------|----------------|----------------|
[1] |   ./ccs2                                                 | c1b3fe30c7213a7cb8d2aee70ce503f100000000 | segment hash |        0 | 0x556ddf12e000 | 0x556ddf1e8ff1 |
[1] |   linux-vdso.so.1                                        | 9a7cfd842208c5ed2d7313c8823432075f6ce683 |            0 |        0 | 0x7ffd991eb000 | 0x7ffd991eb9ba |
[1] |   /usr/lib/gcc/x86_64-pc-linux-gnu/11.2.0/libstdc++.so.6 | e157510e2e2c2ca7ef590db749b6a57900000000 | segment hash |        0 | 0x7f8b56926000 | 0x7f8b56a1d5a9 |
[1] |   /lib64/libm.so.6                                       | 85fcee4f2f9ffef9ce7c3b7bf1e7b27b00000000 | segment hash |        0 | 0x7f8b567c5000 | 0x7f8b56830b45 |
[1] |   /usr/lib/gcc/x86_64-pc-linux-gnu/11.2.0/libgcc_s.so.1  | 29f7f4fb3724acda32e5335087f54b5700000000 | segment hash |        0 | 0x7f8b567a0000 | 0x7f8b567b10f9 |
[1] |   /lib64/libc.so.6                                       | 4d9050284432d45489efef807ddf9fa500000000 | segment hash |        0 | 0x7f8b565cf000 | 0x7f8b5673a45c |
[1] |   /lib64/ld-linux-x86-64.so.2                            | df6c22986b268b6b41dffef5807cb47f00000000 | segment hash |        0 | 0x7f8b56ac8000 | 0x7f8b56aebc9e |
[1] -------------------------------------------------------------------------------------------------------------------------------------------------------------------

The above error message occurs if the ccs2 example above is run without verification enabled. Note that liblibrary2.so is missing from the segment table on the RPC target process. In other cases of asymmetry, such as different versions of the same library being loaded, it may be necessary to use upcxx::experimental::relocation::debug_write_segment_table() to compare between processes.
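When reading a token by hand, it can help to relate its offset to an address. In the sample output, the token offset 100 equals the distance between the failing pointer 0x7fa70314e100 and the start_addr 0x7fa70314e000 of liblibrary2.so in the earlier table. That suggests, but does not confirm, that the offset is a hexadecimal byte offset from start_addr. The sketch below works under that assumption and also assumes end_addr is inclusive; SegmentRange, offset_in_segment, and address_of_offset are hypothetical names, not UPC++ API.

```cpp
#include <cstdint>

// Address range of one row in the segment table (hypothetical helper type).
struct SegmentRange {
  std::uint64_t start_addr;
  std::uint64_t end_addr;  // assumed inclusive in this sketch
};

// Compute the offset of ptr from the start of seg.
// Returns false (leaving offset_out untouched) if ptr lies outside the segment.
bool offset_in_segment(const SegmentRange& seg, std::uint64_t ptr, std::uint64_t& offset_out) {
  if (ptr < seg.start_addr || ptr > seg.end_addr) return false;
  offset_out = ptr - seg.start_addr;
  return true;
}

// Inverse mapping: the address a token offset would refer to on a rank where
// the segment is loaded at seg. Returns false if the offset exceeds the segment.
bool address_of_offset(const SegmentRange& seg, std::uint64_t offset, std::uint64_t& addr_out) {
  if (offset > seg.end_addr - seg.start_addr) return false;
  addr_out = seg.start_addr + offset;
  return true;
}
```

On a rank where the library with the token's hash is loaded, the address computed this way can be handed to a debugger or symbol lookup tool to identify the function the token refers to.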

Segment Flags

The flags column of the segment table is a bit field that may have the following flags set:

  • touched (0x1): This segment has been used at least once. Colorized as bold.
  • verified (0x2): A verify_*() call has confirmed that this segment is identical on all ranks. Colorized as cyan.
  • bad_verification (0x4): A verify_*() call has determined that this segment is asymmetrically loaded. Colorized as red.
  • bad_segment (0x8): This segment has an RWX segment or TEXTRELs and is not built with -Wl,--build-id or equivalent. UPC++ is unable to create a common identifier for these segments and they cannot be used as direct targets for RPCs. Colorized as blue.

If colorized debug output is enabled, your console theme may change the displayed colors.
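Because the flags column is a bit field, a value such as 5 is easier to read once split into its named bits. The following is a small sketch that decodes a flags value using the names and bit values listed above. The enumerator and function names are hypothetical, and the sketch assumes the column is printed as a plain integer.

```cpp
#include <cstdint>
#include <string>

// Bit values as listed in this section; the enumerator names are illustrative
// and are not UPC++ identifiers.
enum SegmentFlag : std::uint32_t {
  kTouched         = 0x1,
  kVerified        = 0x2,
  kBadVerification = 0x4,
  kBadSegment      = 0x8,
};

// Turn a numeric flags value into a '|'-separated list of flag names.
std::string describe_flags(std::uint32_t flags) {
  if (flags == 0) return "none";
  std::string out;
  auto add = [&](std::uint32_t bit, const char* name) {
    if (flags & bit) {
      if (!out.empty()) out += '|';
      out += name;
    }
  };
  add(kTouched, "touched");
  add(kVerified, "verified");
  add(kBadVerification, "bad_verification");
  add(kBadSegment, "bad_segment");
  return out;
}
```

For example, describe_flags(5) yields "touched|bad_verification", the combination reported for the asymmetrically loaded library in the ccs2 example.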
