Invalid GASNet call while deserializing a global ptr
So far I can only reproduce this error on cori with upcxx/nightly, but I will next try a development version on my desktop. It is triggered in the unit test of upcxx_utils' shared_global_ptr, which implements an interface similar to std::shared_ptr but with a global_ptr on the local_team. It is basically a global_ptr< pair< atomic<int64>, global_ptr<T> > >, where the atomic is incremented on construction and serialization and decremented on destruction, and the global pointers are freed when the count reaches 0. It has been working fine since 2020.03 up until I tried it on nightly, and I haven't been able to figure out what is breaking.
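For readers unfamiliar with the design, the reference-counting scheme described above can be sketched in plain C++. This is a single-process analogue under stated assumptions, not the upcxx_utils implementation: `control_block` and `counted_ptr` are hypothetical names, a raw `T*` stands in for `global_ptr<T>`, and copying stands in for serialization.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical stand-in for pair<atomic<int64>, global_ptr<T>>:
// a reference count stored next to the payload pointer.
template <typename T>
struct control_block {
  std::atomic<std::int64_t> count{0};
  T* payload{nullptr};  // assumption: raw pointer instead of global_ptr<T>
};

template <typename T>
class counted_ptr {
  control_block<T>* cb_{nullptr};

  // Incremented on construction and on copy (the analogue of serialization).
  void acquire() {
    if (cb_) cb_->count.fetch_add(1, std::memory_order_relaxed);
  }
  // Decremented on destruction; both allocations are freed at zero.
  void release() {
    if (cb_ && cb_->count.fetch_sub(1, std::memory_order_acq_rel) == 1) {
      delete cb_->payload;
      delete cb_;
    }
    cb_ = nullptr;
  }

public:
  counted_ptr() = default;
  explicit counted_ptr(T* p) : cb_(new control_block<T>) {
    assert(p != nullptr);
    cb_->payload = p;
    acquire();
  }
  counted_ptr(const counted_ptr& o) : cb_(o.cb_) { acquire(); }
  counted_ptr(counted_ptr&& o) noexcept : cb_(o.cb_) { o.cb_ = nullptr; }
  counted_ptr& operator=(counted_ptr o) noexcept {
    std::swap(cb_, o.cb_);  // the old block is released when o is destroyed
    return *this;
  }
  ~counted_ptr() { release(); }

  std::int64_t use_count() const { return cb_ ? cb_->count.load() : 0; }
  T* get() const { return cb_ ? cb_->payload : nullptr; }
};
```

In the real class the count lives in shared memory and deserialization has to inspect the received global_ptr, which is where the `is_local()` call in the trace below comes from.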
Commit 2f8220bd5a9650a044d1d7fe5e894e6f8e627b8d (sorry, I couldn't get the links to hit the latest commit).
I ran on 2 KNL nodes with 136 ranks, but it also fails on 1 node. All the ranks throw this error; here is the trace from proc 10:
*** Details for bug reporting (proc 10): config=RELEASE=2020.10.3,SPEC=1.16,PTR=64bit,debug,SEQ,timers_native,membars_native,atomics_native,atomic32_native,atomic64_native compiler=GNU/8.3.0 sys=x86_64-cnl-linux-gnu
[10] /usr/bin/gdb -nx -batch -x /tmp/gasnet_dQXI8Z '/global/cscratch1/sd/regan/mhm2-builds/upcxx-utils/build-debug-nightly/test/test_shared_global_ptr' 241334
[10] [Thread debugging using libthread_db enabled]
[10] Using host libthread_db library "/lib64/libthread_db.so.1".
[10] 0x00002aaaac7fe2ba in waitpid () from /lib64/libc.so.6
[10] To enable execution of this file add
[10] add-auto-load-safe-path /opt/gcc/10.1.0/snos/lib64/libstdc++.so.6.0.28-gdb.py
[10] line to your configuration file "/global/homes/r/regan/.gdbinit".
[10] To completely disable this security protection add
[10] set auto-load safe-path /
[10] line to your configuration file "/global/homes/r/regan/.gdbinit".
[10] For more information about this security protection see the
[10] "Auto-loading safe path" section in the GDB manual. E.g., run from the shell:
[10] info "(gdb)Auto-loading safe path"
[10] #0 0x00002aaaac7fe2ba in waitpid () from /lib64/libc.so.6
[10] #1 0x00002aaaac77b86f in do_system () from /lib64/libc.so.6
[10] #2 0x0000000020143c1b in gasneti_system_redirected (cmd=0x4054a580 <cmd> "/usr/bin/gdb -nx -batch -x /tmp/gasnet_dQXI8Z '/global/cscratch1/sd/regan/mhm2-builds/upcxx-utils/build-debug-nightly/test/test_shared_global_ptr' 241334", stdout_fd=8) at /global/cscratch1/sd/hargrove/upcxx-nightly-cori_knl-gnu/bld/upcxx_install/berkeleylab-upcxx-mk-develop/bld/GASNet-develop/gasnet_tools.c:1276
[10] #3 0x0000000020144395 in gasneti_bt_gdb (fd=8) at /global/cscratch1/sd/hargrove/upcxx-nightly-cori_knl-gnu/bld/upcxx_install/berkeleylab-upcxx-mk-develop/bld/GASNet-develop/gasnet_tools.c:1532
[10] #4 0x0000000020144bdd in gasneti_print_backtrace (fd=2) at /global/cscratch1/sd/hargrove/upcxx-nightly-cori_knl-gnu/bld/upcxx_install/berkeleylab-upcxx-mk-develop/bld/GASNet-develop/gasnet_tools.c:1810
[10] #5 0x00000000201451d9 in _gasneti_print_backtrace_ifenabled (fd=2) at /global/cscratch1/sd/hargrove/upcxx-nightly-cori_knl-gnu/bld/upcxx_install/berkeleylab-upcxx-mk-develop/bld/GASNet-develop/gasnet_tools.c:1943
[10] #6 0x0000000020142c96 in gasneti_error_abort () at /global/cscratch1/sd/hargrove/upcxx-nightly-cori_knl-gnu/bld/upcxx_install/berkeleylab-upcxx-mk-develop/bld/GASNet-develop/gasnet_tools.c:764
[10] #7 0x0000000020142e41 in _gasneti_fatalerror (msg=0x204a3208 "Invalid GASNet call (communication injection or poll) while executing a Request handler") at /global/cscratch1/sd/hargrove/upcxx-nightly-cori_knl-gnu/bld/upcxx_install/berkeleylab-upcxx-mk-develop/bld/GASNet-develop/gasnet_tools.c:800
[10] #8 0x000000002034b3c6 in gasneti_check_inject (for_reply=0) at /global/cscratch1/sd/hargrove/upcxx-nightly-cori_knl-gnu/bld/upcxx_install/berkeleylab-upcxx-mk-develop/bld/GASNet-develop/gasnet_internal.c:375
[10] #9 0x00000000203b00a0 in gasneti_Segment_QueryBound (tm=0xffffffffbfa9537f, rank=11, owneraddr_p=0x7fffffff3098, localaddr_p=0x0, size_p=0x7fffffff3090) at /global/cscratch1/sd/hargrove/upcxx-nightly-cori_knl-gnu/bld/upcxx_install/berkeleylab-upcxx-mk-develop/bld/GASNet-develop/gasnet_mmap.c:2315
[10] #10 0x0000000020065c86 in upcxx::backend::validate_global_ptr (allow_null=true, rank=11, raw_ptr=0x2aaaf7e003e0, heap_idx=0, KindSet=upcxx::memory_kind::host, T_align=8, T_name=0x2041b920 <typeinfo name for std::pair<std::atomic<long>, upcxx::global_ptr<int, (upcxx::memory_kind)1> >> "St4pairISt6atomicIlEN5upcxx10global_ptrIiLNS2_11memory_kindE1EEEE", short_context=0x2040b770 <upcxx::global_ptr<std::pair<std::atomic<long>, upcxx::global_ptr<int, (upcxx::memory_kind)1> > const, (upcxx::memory_kind)1>::is_local() const::__func__> "is_local", context=0x2040b6a0 <upcxx::global_ptr<std::pair<std::atomic<long>, upcxx::global_ptr<int, (upcxx::memory_kind)1> > const, (upcxx::memory_kind)1>::is_local() const::__PRETTY_FUNCTION__> "bool upcxx::global_ptr<const T, KindSet>::is_local() const [with T = std::pair<std::atomic<long int>, upcxx::global_ptr<int, (upcxx::memory_kind)1> >; upcxx::memory_kind KindSet = (upcxx::memory_kind)"...) at /global/cscratch1/sd/hargrove/upcxx-nightly-cori_knl-gnu/bld/upcxx_install/berkeleylab-upcxx-mk-develop/src/backend/gasnet/runtime.cpp:1416
[10] #11 0x00000000200293f2 in upcxx::global_ptr<std::pair<std::atomic<long>, upcxx::global_ptr<int, (upcxx::memory_kind)1> > const, (upcxx::memory_kind)1>::check (this=0x7fffffff36b0, allow_null=true,
short_context=0x2040b770 <upcxx::global_ptr<std::pair<std::atomic<long>, upcxx::global_ptr<int, (upcxx::memory_kind)1> > const, (upcxx::memory_kind)1>::is_local() const::__func__> "is_local", context=
0x2040b6a0 <upcxx::global_ptr<std::pair<std::atomic<long>, upcxx::global_ptr<int, (upcxx::memory_kind)1> > const, (upcxx::memory_kind)1>::is_local() const::__PRETTY_FUNCTION__> "bool upcxx::global_ptr<
const T, KindSet>::is_local() const [with T = std::pair<std::atomic<long int>, upcxx::global_ptr<int, (upcxx::memory_kind)1> >; upcxx::memory_kind KindSet = (upcxx::memory_kind)"...) at /usr/common/ftg
/upcxx/nightly/craype-2.6.2/knl/gnu/PrgEnv-gnu-6.0.5-8.3.0-2021.01.16/upcxx.debug.gasnet_seq.aries/include/upcxx/global_ptr.hpp:94
[10] #12 0x0000000020024bff in upcxx::global_ptr<std::pair<std::atomic<long>, upcxx::global_ptr<int, (upcxx::memory_kind)1> > const, (upcxx::memory_kind)1>::is_local (this=0x7fffffff36b0) at /usr/commo
n/ftg/upcxx/nightly/craype-2.6.2/knl/gnu/PrgEnv-gnu-6.0.5-8.3.0-2021.01.16/upcxx.debug.gasnet_seq.aries/include/upcxx/global_ptr.hpp:142
[10] #13 0x0000000020043215 in upcxx_utils::shared_global_ptr<int, (upcxx::memory_kind)1>::upcxx_serialization::deserialize<upcxx::detail::serialization_reader> (reader=..., storage=0x7fffffff3790) at
/global/homes/r/regan/workspace/mhm2/upcxx-utils/include/upcxx_utils/shared_global_ptr.hpp:247
[10] #14 0x0000000020043cdf in upcxx::detail::serialization_reader::read_into<upcxx_utils::shared_global_ptr<int, (upcxx::memory_kind)1>&&, true, upcxx_utils::shared_global_ptr<int, (upcxx::memory_kind
)1> > (this=0x7fffffff3930, raw=0x7fffffff3790) at /usr/common/ftg/upcxx/nightly/craype-2.6.2/knl/gnu/PrgEnv-gnu-6.0.5-8.3.0-2021.01.16/upcxx.debug.gasnet_seq.aries/include/upcxx/serialization.hpp:648
[10] #15 0x00000000200434a3 in upcxx::detail::serialization_tuple<std::tuple<upcxx_utils::shared_global_ptr<int, (upcxx::memory_kind)1>&&>, 0, 1>::deserialize_each<std::tuple<upcxx_utils::shared_global
_ptr<int, (upcxx::memory_kind)1> >, upcxx::detail::serialization_reader> (r=..., spot=0x7fffffff3918) at /usr/common/ftg/upcxx/nightly/craype-2.6.2/knl/gnu/PrgEnv-gnu-6.0.5-8.3.0-2021.01.16/upcxx.debug
.gasnet_seq.aries/include/upcxx/serialization.hpp:1552
[10] #16 0x000000002004305b in upcxx::serialization<std::tuple<upcxx_utils::shared_global_ptr<int, (upcxx::memory_kind)1>&&> >::deserialize<upcxx::detail::serialization_reader> (r=..., spot=0x7fffffff3
918) at /usr/common/ftg/upcxx/nightly/craype-2.6.2/knl/gnu/PrgEnv-gnu-6.0.5-8.3.0-2021.01.16/upcxx.debug.gasnet_seq.aries/include/upcxx/serialization.hpp:1605
[10] #17 0x0000000020042a8b in upcxx::detail::serialization_reader::read_into<std::tuple<upcxx_utils::shared_global_ptr<int, (upcxx::memory_kind)1>&&>, true, std::tuple<upcxx_utils::shared_global_ptr<i
nt, (upcxx::memory_kind)1> > > (this=0x7fffffff3930, raw=0x7fffffff3918) at /usr/common/ftg/upcxx/nightly/craype-2.6.2/knl/gnu/PrgEnv-gnu-6.0.5-8.3.0-2021.01.16/upcxx.debug.gasnet_seq.aries/include/upc
xx/serialization.hpp:648
[10] #18 0x0000000020041e6c in upcxx::detail::deserialized_bound_function_storage<upcxx::backend::send_awaken_lpc<upcxx_utils::shared_global_ptr<int, (upcxx::memory_kind)1>, upcxx_utils::shared_global_
ptr<int, (upcxx::memory_kind)1>&&>(int, upcxx::detail::lpc_dormant<upcxx_utils::shared_global_ptr<int, (upcxx::memory_kind)1> >*, std::tuple<upcxx_utils::shared_global_ptr<int, (upcxx::memory_kind)1>&&
>&&)::{lambda(std::tuple<upcxx_utils::shared_global_ptr<int, (upcxx::memory_kind)1> >&&)#1}, std::tuple<std::tuple<upcxx_utils::shared_global_ptr<int, (upcxx::memory_kind)1>&&> >, upcxx::detail::index_
sequence<0> >::deserialized_bound_function_storage(upcxx::detail::serialization_reader&) (this=0x7fffffff3910, r=...) at /usr/common/ftg/upcxx/nightly/craype-2.6.2/knl/gnu/PrgEnv-gnu-6.0.5-8.3.0-2021.0
1.16/upcxx.debug.gasnet_seq.aries/include/upcxx/bind.hpp:157
[10] #19 0x000000002004146b in upcxx::detail::deserialized_bound_function_base<upcxx::backend::send_awaken_lpc<upcxx_utils::shared_global_ptr<int, (upcxx::memory_kind)1>, upcxx_utils::shared_global_ptr
<int, (upcxx::memory_kind)1>&&>(int, upcxx::detail::lpc_dormant<upcxx_utils::shared_global_ptr<int, (upcxx::memory_kind)1> >*, std::tuple<upcxx_utils::shared_global_ptr<int, (upcxx::memory_kind)1>&&>&&
)::{lambda(std::tuple<upcxx_utils::shared_global_ptr<int, (upcxx::memory_kind)1> >&&)#1}, std::tuple<std::tuple<upcxx_utils::shared_global_ptr<int, (upcxx::memory_kind)1>&&> >, upcxx::detail::index_seq
uence<0>, true>::deserialized_bound_function_base(upcxx::detail::serialization_reader&) (this=0x7fffffff3910, r=...) at /usr/common/ftg/upcxx/nightly/craype-2.6.2/knl/gnu/PrgEnv-gnu-6.0.5-8.3.0-2021.01
.16/upcxx.debug.gasnet_seq.aries/include/upcxx/bind.hpp:206
[10] #20 0x000000002003ffcd in upcxx::detail::deserialized_bound_function<upcxx::backend::send_awaken_lpc<upcxx_utils::shared_global_ptr<int, (upcxx::memory_kind)1>, upcxx_utils::shared_global_ptr<int,
(upcxx::memory_kind)1>&&>(int, upcxx::detail::lpc_dormant<upcxx_utils::shared_global_ptr<int, (upcxx::memory_kind)1> >*, std::tuple<upcxx_utils::shared_global_ptr<int, (upcxx::memory_kind)1>&&>&&)::{l
ambda(std::tuple<upcxx_utils::shared_global_ptr<int, (upcxx::memory_kind)1> >&&)#1}, std::tuple<upcxx_utils::shared_global_ptr<int, (upcxx::memory_kind)1>&&> >::deserialized_bound_function(upcxx::detai
l::serialization_reader&) (this=0x7fffffff3910, r=...) at /usr/common/ftg/upcxx/nightly/craype-2.6.2/knl/gnu/PrgEnv-gnu-6.0.5-8.3.0-2021.01.16/upcxx.debug.gasnet_seq.aries/include/upcxx/bind.hpp:287
[10] #21 0x000000002003edde in upcxx::serialization<upcxx::bound_function<upcxx::backend::send_awaken_lpc<upcxx_utils::shared_global_ptr<int, (upcxx::memory_kind)1>, upcxx_utils::shared_global_ptr<int,
(upcxx::memory_kind)1>&&>(int, upcxx::detail::lpc_dormant<upcxx_utils::shared_global_ptr<int, (upcxx::memory_kind)1> >*, std::tuple<upcxx_utils::shared_global_ptr<int, (upcxx::memory_kind)1>&&>&&)::{l
ambda(std::tuple<upcxx_utils::shared_global_ptr<int, (upcxx::memory_kind)1> >&&)#1}, std::tuple<upcxx_utils::shared_global_ptr<int, (upcxx::memory_kind)1>&&> > >::deserialize<upcxx::detail::serializati
on_reader>(upcxx::detail::serialization_reader&, void*) (r=..., spot=0x7fffffff3910) at /usr/common/ftg/upcxx/nightly/craype-2.6.2/knl/gnu/PrgEnv-gnu-6.0.5-8.3.0-2021.01.16/upcxx.debug.gasnet_seq.aries
/include/upcxx/bind.hpp:368
[10] #22 0x000000002003d42d in upcxx::detail::command<upcxx::detail::lpc_base*>::the_executor<upcxx::bound_function<upcxx::backend::send_awaken_lpc<upcxx_utils::shared_global_ptr<int, (upcxx::memory_ki
nd)1>, upcxx_utils::shared_global_ptr<int, (upcxx::memory_kind)1>&&>(int, upcxx::detail::lpc_dormant<upcxx_utils::shared_global_ptr<int, (upcxx::memory_kind)1> >*, std::tuple<upcxx_utils::shared_global
_ptr<int, (upcxx::memory_kind)1>&&>&&)::{lambda(std::tuple<upcxx_utils::shared_global_ptr<int, (upcxx::memory_kind)1> >&&)#1}, std::tuple<upcxx_utils::shared_global_ptr<int, (upcxx::memory_kind)1>&&> >
, &upcxx::backend::gasnet::rpc_as_lpc::reader_of, &(void upcxx::backend::gasnet::cleanup<false, true>(upcxx::detail::lpc_base*))>(upcxx::detail::lpc_base*) (a#0=0x7fffffff39a0) at /usr/common/ftg/upcxx
/nightly/craype-2.6.2/knl/gnu/PrgEnv-gnu-6.0.5-8.3.0-2021.01.16/upcxx.debug.gasnet_seq.aries/include/upcxx/command.hpp:67
[10] #23 0x0000000020068045 in (anonymous namespace)::am_eager_restricted (buf=0x2aaaae860078, buf_size=32, buf_align=8) at /global/cscratch1/sd/hargrove/upcxx-nightly-cori_knl-gnu/bld/upcxx_install/berkeleylab-upcxx-mk-develop/src/backend/gasnet/runtime.cpp:2121
[10] #24 0x00000000201663c8 in gasneti_AMPSHM_service_incoming_msg (vnet=0x4056b4e0, isReq=1) at /global/cscratch1/sd/hargrove/upcxx-nightly-cori_knl-gnu/bld/upcxx_install/berkeleylab-upcxx-mk-develop/bld/GASNet-develop/gasnet_pshm.c:1208
[10] #25 0x000000002016802e in gasneti_AMPSHMPoll (repliesOnly=0) at /global/cscratch1/sd/hargrove/upcxx-nightly-cori_knl-gnu/bld/upcxx_install/berkeleylab-upcxx-mk-develop/bld/GASNet-develop/gasnet_pshm.c:1245
[10] #26 0x00000000200e0ad9 in gasnetc_AMPoll () at /global/cscratch1/sd/hargrove/upcxx-nightly-cori_knl-gnu/bld/upcxx_install/berkeleylab-upcxx-mk-develop/bld/GASNet-develop/aries-conduit/gasnet_core.c:1239
[10] #27 0x0000000020056c18 in _gasneti_AMPoll () at /global/cscratch1/sd/hargrove/upcxx-nightly-cori_knl-gnu/bld/upcxx_install/berkeleylab-upcxx-mk-develop/bld/GASNet-develop/gasnet_help.h:1210
[10] #28 0x0000000020056fb1 in _gasnet_AMPoll () at /global/cscratch1/sd/hargrove/upcxx-nightly-cori_knl-gnu/bld/upcxx_install/berkeleylab-upcxx-mk-develop/bld/GASNet-develop/gasnet_help.h:1343
[10] #29 0x0000000020067e13 in upcxx::progress (level=upcxx::progress_level::user) at /global/cscratch1/sd/hargrove/upcxx-nightly-cori_knl-gnu/bld/upcxx_install/berkeleylab-upcxx-mk-develop/src/backend/gasnet/runtime.cpp:2045
[10] #30 0x0000000020009f46 in upcxx::detail::future_wait_upcxx_progress_user::operator() (this=0x7fffffff46eb) at /usr/common/ftg/upcxx/nightly/craype-2.6.2/knl/gnu/PrgEnv-gnu-6.0.5-8.3.0-2021.01.16/u
pcxx.debug.gasnet_seq.aries/include/upcxx/future/future1.hpp:53
[10] #31 0x0000000020020855 in upcxx::future1<upcxx::detail::future_kind_shref<upcxx::detail::future_header_ops_general, false>, upcxx_utils::shared_global_ptr<int, (upcxx::memory_kind)1> >::wait<-1, u
pcxx::detail::future_wait_upcxx_progress_user>(upcxx::detail::future_wait_upcxx_progress_user&&) && (this=0x7fffffff46d8, progress=...) at /usr/common/ftg/upcxx/nightly/craype-2.6.2/knl/gnu/PrgEnv-gnu-
6.0.5-8.3.0-2021.01.16/upcxx.debug.gasnet_seq.aries/include/upcxx/future/future1.hpp:327
[10] #32 0x000000002000ecac in test_shared_global_ptr (argc=1, argv=0x7fffffff4c68) at /global/homes/r/regan/workspace/mhm2/upcxx-utils/test/test_shared_global_ptr.cpp:133
[10] #33 0x00000000200097a6 in main (argc=1, argv=0x7fffffff4c68) at /global/homes/r/regan/workspace/mhm2/upcxx-utils/test/main_shell.cpp:17
[10] [Inferior 1 (process 241334) detached]
Comments (7)
-
reporter -
- changed milestone to 2021.3.0 release
-
assigned issue to
- marked as blocker
This looks like a conflict between a debug check that @Paul Hargrove recently added in the GASNet bleeding-edge development branch and the UPC++ runtime. This GASNet change should not yet appear in the stable branches of GASNet or in the mk-develop branch, but it looks like both cori installs are currently pinned to the bleeding-edge develop which is affected.
We definitely need to resolve this before the next GASNet stable advance.
-
Here is a simple reproducer:
#include <upcxx/upcxx.hpp>
#include <iostream>
#include <assert.h>

using namespace upcxx;

struct A {
  global_ptr<int> g;
  bool local;
  A(global_ptr<int> gp) {
    this->g = gp;
    this->local = gp.is_local();
  }
  UPCXX_SERIALIZED_VALUES(g);
};

int main() {
  upcxx::init();
  if (upcxx::rank_me()) {
    auto f = upcxx::rpc(0, []() {
      global_ptr<int> gp = upcxx::new_<int>();
      A a(gp);
      return a;
    });
    A const &a = f.wait_reference();
    assert(a.g);
  }
  upcxx::barrier();
  if (!upcxx::rank_me()) std::cout << "SUCCESS" << std::endl;
  upcxx::finalize();
  return 0;
}
The root cause here is that the eager path for returning the upcxx::rpc() return value to the initiator performs deserialization in AM handler context. global_ptr<T> itself is TriviallySerializable, but other user-supplied deserialization code operating on global_ptr<T> values activates the (DEBUG mode only) global_ptr<T> validity checks, which violate the AM handler context restrictions (because I did not consider this case when writing it).
The likely solution is to disable the checking (or at least the problematic part) in this context.
-
Proposed solution in PR 309
-
- changed status to resolved
issue #440: Deploy a temporary solution for 'Invalid GASNet call' in deserialization
Add debug-mode only TLS to track the problematic AM handler context and skip the portion of global_ptr checking that issues the prohibited call. The solution notably preserves global_ptr segment bounds checking for same-process affinity, and degrades correctly in NDEBUG mode.
This will eventually be replaced with a better solution once GASNet segment query support is expanded.
Fixes issue #440.
→ <<cset b244cddc9a42>>
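As a rough illustration of the approach in that commit message, here is a minimal sketch, assuming a thread-local depth counter and a caller-supplied segment query. It is not the code from PR 309: `am_handler_depth`, `am_handler_scope`, `segment`, `segment_query_fn`, and `validate_address` are all hypothetical names, and the real check would be compiled out entirely in NDEBUG builds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical thread-local marker: greater than 0 while this thread
// is executing inside an AM handler.
thread_local int am_handler_depth = 0;

// RAII guard a runtime could place around handler execution.
struct am_handler_scope {
  am_handler_scope() { ++am_handler_depth; }
  ~am_handler_scope() { --am_handler_depth; }
  am_handler_scope(const am_handler_scope&) = delete;
  am_handler_scope& operator=(const am_handler_scope&) = delete;
};

struct segment {
  std::uintptr_t base;
  std::size_t size;
};

enum class check_result { ok, out_of_bounds, skipped };

// Stand-in for a peer segment query that is prohibited in handler context.
using segment_query_fn = segment (*)(int rank);

inline bool in_bounds(const segment& s, std::uintptr_t addr) {
  return addr >= s.base && addr - s.base < s.size;
}

inline check_result validate_address(std::uintptr_t addr, int rank, int my_rank,
                                     const segment& my_segment,
                                     segment_query_fn query_peer) {
  // Same-process affinity: checked against local data, always allowed.
  if (rank == my_rank)
    return in_bounds(my_segment, addr) ? check_result::ok
                                       : check_result::out_of_bounds;
  // Inside a handler the peer query must not be issued, so skip that part.
  if (am_handler_depth > 0) return check_result::skipped;
  return in_bounds(query_peer(rank), addr) ? check_result::ok
                                           : check_result::out_of_bounds;
}
```

The trade-off in this sketch is that pointers with remote affinity go unvalidated while a handler is running, which matches the "temporary solution" framing above.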
-
@Rob Egan: cori's upcxx/nightly module has already been updated to "pre-date" this defect, and starting tonight the upcxx/bleeding-edge module will include the fix that I've just merged.
-
PR 330 deploys a permanent and more robust solution to this problem, using the new gex_EP_QueryBoundSegmentNB() call in the forthcoming GASNet release.
The good news is that I cannot replicate it using the latest version from the mk-development branch on my single machine: 2635ac137feff9036fe24ca31092112145dc7922
On dirac I can reproduce it on both 1 and 2 nodes with the nightly version, but not the next latest upcxx/2020.11.0/auto. Here is the trace for 2 ranks on a single node