CCS Debugging
Introduction
This document gives an overview of the debugging information produced by the cross-code segment (CCS) RPC feature (see ccs-rpc.md for a general overview). It covers situations where CCS is required but not enabled, as well as various errors that result from invalid usage of CCS.
Linking with -Wl,--build-id is recommended when debugging a CCS-enabled application (see debugging.md). Otherwise, the debugger may modify the code segment, interfering with the ability of UPC++ to identify and verify code segments by their hashes. This link option embeds an invariant identifier at link time. A segment in an executable or library built with -Wl,--build-id will have a numerical "segment #" in the CCS debug tables, while executables and libraries built without it will report "segment hash".
CCS Verification Failure
CCS verification is enabled by default with UPCXX_CODEMODE=debug and disabled with UPCXX_CODEMODE=opt. upcxx::init() also automatically verifies all segments in debug mode. A verification error is most likely to happen when using a function in a dlopened library, or if verification is manually enabled in opt mode. For UPC++ to verify that all ranks have loaded a library, and that they have loaded the same version, one of the upcxx::experimental::relo::verify_*() functions must be called so that UPC++ can check that all segment hashes match.
    // ccs1.cpp
    #include <upcxx/upcxx.hpp>
    #include <dlfcn.h>
    #include <stdexcept>

    int main() {
      upcxx::init();
      void (*dlopen_function)() = nullptr;
      // Load the library after upcxx::init(), so init() has not verified it
      void* handle = dlopen("liblibrary2.so", RTLD_NOW);
      if (!handle) throw std::runtime_error(dlerror());
      dlopen_function = reinterpret_cast<void(*)()>(dlsym(handle, "dlopen_function"));
      if (!dlopen_function) throw std::runtime_error(dlerror());
      // Uncommenting the next line fixes the verification error
      // upcxx::experimental::relo::verify_all();
      // Without the verify call, this throws `upcxx::segment_verification_error`
      upcxx::rpc(0, dlopen_function).wait();
      upcxx::finalize();
      return 0;
    }
If segment verification is enabled and an RPC call is made to a function in an unverified segment, an error like the following will be produced on the RPC initiator:
    terminate called after throwing an instance of 'upcxx::segment_verification_error'
      what():  Attempted to use unverified segment [7fa70314e000-7fa70314e11d] to relocate function pointer 7fa70314e100.
    [0] -------------------------------------------------------------------------------------------------------------------------------------------------------------------
    [0] | Lookup for pointer: 0x7fa70314e100 (BAD VERIFICATION)                                                                                                           |
    [0] |-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
    [0] | dlpi_name (rank_me: 0)                                   | hash                                     | segment #    | flags    | start_addr     | end_addr       |
    [0] |----------------------------------------------------------|------------------------------------------|--------------|----------|----------------|----------------|
    [0] |   ./ccs2                                                 | b0c00b3c987f7972b54fa2ad2393055a00000000 | segment hash | 2        | 0x55cf7577c000 | 0x55cf75bad7e9 |
    [0] |   linux-vdso.so.1                                        | 9a7cfd842208c5ed2d7313c8823432075f6ce683 | 0            | 2        | 0x7fffd0b64000 | 0x7fffd0b649ba |
    [0] |   /usr/lib/gcc/x86_64-pc-linux-gnu/11.2.0/libstdc++.so.6 | e157510e2e2c2ca7ef590db749b6a57900000000 | segment hash | 2        | 0x7fa702fb1000 | 0x7fa7030a85a9 |
    [0] |   /lib64/libm.so.6                                       | 85fcee4f2f9ffef9ce7c3b7bf1e7b27b00000000 | segment hash | 2        | 0x7fa702e50000 | 0x7fa702ebbb45 |
    [0] |   /usr/lib/gcc/x86_64-pc-linux-gnu/11.2.0/libgcc_s.so.1  | 29f7f4fb3724acda32e5335087f54b5700000000 | segment hash | 2        | 0x7fa702e2b000 | 0x7fa702e3c0f9 |
    [0] |   /lib64/libc.so.6                                       | 4d9050284432d45489efef807ddf9fa500000000 | segment hash | 2        | 0x7fa702c5a000 | 0x7fa702dc545c |
    [0] |   /lib64/ld-linux-x86-64.so.2                            | df6c22986b268b6b41dffef5807cb47f00000000 | segment hash | 2        | 0x7fa703153000 | 0x7fa703176c9e |
    [0] | * ./liblibrary2.so                                       | c120b080aef988bc88f687acc3756c2800000000 | segment hash | 1        | 0x7fa70314e000 | 0x7fa70314e11d |
    [0] -------------------------------------------------------------------------------------------------------------------------------------------------------------------
The * to the left of ./liblibrary2.so marks the problem segment that the error references. The flags column shows the bit-flag state of each segment: 0x1 indicates that the segment is in the CCS cache, while 0x2 indicates a verified segment.
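The dlpi_name column header matches the dlpi_name field of struct dl_phdr_info, which glibc fills in for dl_iterate_phdr() callbacks. As a rough illustration of where rows like these can come from, the sketch below lists the executable segments of the current process and finds the one containing a given pointer. Whether UPC++ builds its table this way is an assumption, and ExecSegment, list_exec_segments and find_segment are hypothetical names, not part of the UPC++ API. It is Linux/glibc specific.

```cpp
#include <link.h>   // dl_iterate_phdr, struct dl_phdr_info (glibc)
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One row of a simplified, hypothetical segment listing.
struct ExecSegment {
    std::string name;      // dlpi_name (typically "" for the main executable)
    std::uintptr_t start;  // first address of an executable PT_LOAD segment
    std::uintptr_t end;    // one past its last address
};

// dl_iterate_phdr callback: record every executable PT_LOAD segment.
static int collect_segment(struct dl_phdr_info* info, std::size_t, void* data) {
    auto* out = static_cast<std::vector<ExecSegment>*>(data);
    for (int i = 0; i < info->dlpi_phnum; ++i) {
        const ElfW(Phdr)& ph = info->dlpi_phdr[i];
        if (ph.p_type == PT_LOAD && (ph.p_flags & PF_X)) {
            std::uintptr_t start = info->dlpi_addr + ph.p_vaddr;
            std::uintptr_t end = start + ph.p_memsz;
            std::string name(info->dlpi_name ? info->dlpi_name : "");
            out->push_back(ExecSegment{name, start, end});
        }
    }
    return 0;  // returning 0 continues the iteration
}

// Collect the executable segments of all currently loaded objects.
std::vector<ExecSegment> list_exec_segments() {
    std::vector<ExecSegment> segs;
    dl_iterate_phdr(collect_segment, &segs);
    return segs;
}

// Return the segment containing p, or nullptr if no segment does.
const ExecSegment* find_segment(const std::vector<ExecSegment>& segs,
                                const void* p) {
    auto addr = reinterpret_cast<std::uintptr_t>(p);
    for (const auto& s : segs)
        if (addr >= s.start && addr < s.end) return &s;
    return nullptr;
}
```

Calling list_exec_segments() before and after a dlopen() call should show the newly loaded library as an additional entry, which is the kind of segment that still needs verification in the example above.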
CCS Asymmetry
CCS verification errors can also occur if a library is loaded asymmetrically. Take the following example, where the library is loaded only on rank 0.
    // ccs2.cpp
    #include <upcxx/upcxx.hpp>
    #include <dlfcn.h>
    #include <iostream>
    #include <stdexcept>

    int main() {
      upcxx::init();
      if (upcxx::rank_n() < 2) {
        std::cerr << "This test must be run with at least two ranks" << std::endl;
        return 2;
      }
      // Declared outside the rank-0 block so it is still in scope for the RPC below
      void (*dlopen_function)() = nullptr;
      if (upcxx::rank_me() == 0) {
        // Only rank 0 loads the library, so the load is asymmetric
        void* handle = dlopen("liblibrary2.so", RTLD_NOW);
        if (!handle) throw std::runtime_error(dlerror());
        dlopen_function = reinterpret_cast<void(*)()>(dlsym(handle, "dlopen_function"));
        if (!dlopen_function) throw std::runtime_error(dlerror());
      }
      upcxx::experimental::relo::verify_all();
      if (upcxx::rank_me() == 0) {
        // throws `upcxx::segment_verification_error`
        upcxx::rpc(1, dlopen_function).wait();
      }
      upcxx::finalize();
      return 0;
    }
In this case, the verification error is due to a failed verification rather than a missing one. The error looks nearly the same as the previous one, except that the flags field reads "5" rather than "1": the combination of the bad_verification (0x4) flag and the active (0x1) flag. In this contrived example, the error is easily fixed by having all ranks load the library. In a program where libraries are dlopened asymmetrically as needed, such as a Python library, the fix is more complicated. In such a use case, this exception could be caught and used to trigger all ranks to load the necessary library.
Asymmetry may also occur if different ranks receive different configurations, leading to different versions of shared libraries being loaded. If this occurs, it can be debugged by calling upcxx::experimental::relocation::debug_write_segment_table() and comparing the library file paths and hashes between ranks to find the offending libraries.
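The comparison of paths and hashes between ranks can be sketched as a small helper. This is only an illustration: it assumes the segment tables from two ranks have already been parsed into path-to-hash maps by some other means, and SegmentTable, Mismatch and compare_segment_tables are hypothetical names that are not part of UPC++.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical per-rank snapshot: library path -> segment hash (hex string).
using SegmentTable = std::map<std::string, std::string>;

// One discrepancy found between two ranks.
struct Mismatch {
    std::string path;  // library file path as reported in the segment table
    std::string kind;  // human-readable description of the discrepancy
};

// Report libraries that are missing on one side or whose hashes differ.
std::vector<Mismatch> compare_segment_tables(const SegmentTable& first,
                                             const SegmentTable& second) {
    std::vector<Mismatch> out;
    for (const auto& entry : first) {
        auto it = second.find(entry.first);
        if (it == second.end())
            out.push_back({entry.first, "missing on second rank"});
        else if (it->second != entry.second)
            out.push_back({entry.first, "hash differs"});
    }
    for (const auto& entry : second)
        if (first.find(entry.first) == first.end())
            out.push_back({entry.first, "missing on first rank"});
    return out;
}
```

Applied to the ccs2 example, where only rank 0 loads ./liblibrary2.so, such a comparison would report that library as missing on the second rank.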
CCS Debugging Without Verification
The CCS segment verification feature requires symmetry across all processes. This may not be possible in heterogeneous configurations where only some ranks are able to load a library that UPC++ RPCs into directly. In such cases, CCS relocations can still be debugged manually. Without verification enabled, the error occurs on the RPC target process rather than the initiating process and displays the relocation token in the form {hash, offset}.
    *** FATAL ERROR (proc 1):
    //////////////////////////////////////////////////////////////////////
    UPC++ fatal error:
     on process 1 (abominable-gentoo)
     at /home/colin/upcxx/build-fpic/bld/upcxx.assert0.optlev3.dbgsym0.gasnet_seq.udp/include/upcxx/ccs.hpp:306
     in function: Fp upcxx::detail::function_token_ms::detokenize(upcxx::detail::segmap_cache&) const [with Fp = void (*)()]()

    Attempted detokenization in unknown executable segment. See: docs/ccs-rpc.md.
    [1] -------------------------------------------------------------------------------------------------------------------------------------------------------------------
    [1] | Lookup for token: {c120b080aef988bc88f687acc3756c2800000000, 100} (FAILURE)                                                                                     |
    [1] |-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
    [1] | dlpi_name (rank_me: 1)                                   | hash                                     | segment #    | flags    | start_addr     | end_addr       |
    [1] |----------------------------------------------------------|------------------------------------------|--------------|----------|----------------|----------------|
    [1] |   ./ccs2                                                 | c1b3fe30c7213a7cb8d2aee70ce503f100000000 | segment hash | 0        | 0x556ddf12e000 | 0x556ddf1e8ff1 |
    [1] |   linux-vdso.so.1                                        | 9a7cfd842208c5ed2d7313c8823432075f6ce683 | 0            | 0        | 0x7ffd991eb000 | 0x7ffd991eb9ba |
    [1] |   /usr/lib/gcc/x86_64-pc-linux-gnu/11.2.0/libstdc++.so.6 | e157510e2e2c2ca7ef590db749b6a57900000000 | segment hash | 0        | 0x7f8b56926000 | 0x7f8b56a1d5a9 |
    [1] |   /lib64/libm.so.6                                       | 85fcee4f2f9ffef9ce7c3b7bf1e7b27b00000000 | segment hash | 0        | 0x7f8b567c5000 | 0x7f8b56830b45 |
    [1] |   /usr/lib/gcc/x86_64-pc-linux-gnu/11.2.0/libgcc_s.so.1  | 29f7f4fb3724acda32e5335087f54b5700000000 | segment hash | 0        | 0x7f8b567a0000 | 0x7f8b567b10f9 |
    [1] |   /lib64/libc.so.6                                       | 4d9050284432d45489efef807ddf9fa500000000 | segment hash | 0        | 0x7f8b565cf000 | 0x7f8b5673a45c |
    [1] |   /lib64/ld-linux-x86-64.so.2                            | df6c22986b268b6b41dffef5807cb47f00000000 | segment hash | 0        | 0x7f8b56ac8000 | 0x7f8b56aebc9e |
    [1] -------------------------------------------------------------------------------------------------------------------------------------------------------------------
The error message above is produced when the ccs2 example is run without verification enabled. Note that liblibrary2.so is missing from the segment table on the RPC target process. In other cases of asymmetry, such as different versions of the same library being loaded, it may be necessary to use upcxx::experimental::relocation::debug_write_segment_table() to compare the segment tables between processes.
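To make the {hash, offset} lookup concrete, the following is a minimal model of relocation tokens, not the UPC++ implementation. It assumes a token pairs a segment hash with the pointer's offset from the segment start (which matches the example values, where pointer 0x7fa70314e100 in a segment starting at 0x7fa70314e000 yields offset 100, presumably in hex). Segment, Token, tokenize and detokenize are hypothetical names used only for this sketch.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Simplified segment table row: hash plus address range [start, end).
struct Segment {
    std::string hash;
    std::uint64_t start;
    std::uint64_t end;
};

// Simplified relocation token in the {hash, offset} form.
struct Token {
    std::string hash;
    std::uint64_t offset;
};

// Initiator side: turn a function address into a position-independent token.
std::optional<Token> tokenize(const std::vector<Segment>& table,
                              std::uint64_t ptr) {
    for (const auto& s : table)
        if (ptr >= s.start && ptr < s.end)
            return Token{s.hash, ptr - s.start};
    return std::nullopt;  // pointer is not inside any known segment
}

// Target side: map the token back to a local address, if the segment exists.
std::optional<std::uint64_t> detokenize(const std::vector<Segment>& table,
                                        const Token& t) {
    for (const auto& s : table)
        if (s.hash == t.hash && t.offset < s.end - s.start)
            return s.start + t.offset;
    return std::nullopt;  // no local segment with this hash: lookup fails
}
```

In this model, a target process whose table has no segment with the token's hash cannot produce an address, which corresponds to the FAILURE lookup reported on process 1 above.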
Segment Flags
The flags column of the segment table is a bit field that may have the following flags set:
- touched (0x1): This segment has been used at least once. Colorized as bold.
- verified (0x2): A verify_*() call has verified that this segment is identical on all ranks. Colorized as cyan.
- bad_verification (0x4): A verify_*() call has determined that this segment is asymmetrically loaded. Colorized as red.
- bad_segment (0x8): This segment has an RWX segment or TEXTRELs and is not built with -Wl,--build-id or equivalent. UPC++ is unable to create a common identifier for these segments and they cannot be used as direct targets for RPCs. Colorized as blue.
If colorized debug output is enabled, your console theme may change the displayed colors.
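When reading a numeric flags value from a segment table, it can help to expand it into flag names. The helper below is a small sketch using the bit values listed above; the SegmentFlag enum and describe_flags function are hypothetical and only mirror the names in this section, not any UPC++ declaration.

```cpp
#include <string>

// Bit values as listed in this section; the enum itself is illustrative.
enum SegmentFlag : unsigned {
    touched          = 0x1,
    verified         = 0x2,
    bad_verification = 0x4,
    bad_segment      = 0x8
};

// Expand a flags value from the segment table into readable flag names.
std::string describe_flags(unsigned flags) {
    static const struct { unsigned bit; const char* name; } names[] = {
        {touched, "touched"},
        {verified, "verified"},
        {bad_verification, "bad_verification"},
        {bad_segment, "bad_segment"},
    };
    std::string out;
    for (const auto& n : names) {
        if (flags & n.bit) {
            if (!out.empty()) out += " | ";
            out += n.name;
        }
    }
    return out.empty() ? "(none)" : out;
}
```

For example, the value 5 seen in the asymmetric-load case expands to the 0x1 and 0x4 flags, while the value 2 shown for most rows of the first table is the verified flag alone.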