dist_object assertion error regression with 2023.3 on Ubuntu 18.04.6 and g++ 7.5.0-3ubuntu1~18.04

Issue #599 wontfix
Rob Egan created an issue
*** Details for bug reporting (proc 0): config=RELEASE=2023.3.0,SPEC=1.20,PTR=64bit,debug,SEQ,timers_forced_posixrt,membars_native,atomics_native,atomic32_native,atomic64_native compiler=GNU/7.5.0 sys=x86_64-pc-linux-gnu
[0] Invoking GDB for backtrace...
[0] /usr/bin/gdb -nx -batch -x /tmp/gasnet_tTHVsP '/work/gitlab-ci/scratch/upcxx-utils-cionly-e304f960-Upcxx2023.3-/build-dbg/test/test_flat_aggr_store' 3345
[0] [Thread debugging using libthread_db enabled]
[0] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[0] 0x00007ffff6d06337 in __GI___waitpid (pid=3347, stat_loc=stat_loc@entry=0x7fffffffb528, options=options@entry=0) at ../sysdeps/unix/sysv/linux/waitpid.c:30
[0] #0  0x00007ffff6d06337 in __GI___waitpid (pid=3347, stat_loc=stat_loc@entry=0x7fffffffb528, options=options@entry=0) at ../sysdeps/unix/sysv/linux/waitpid.c:30
[0] #1  0x00007ffff6c71047 in do_system (line=<optimized out>) at ../sysdeps/posix/system.c:149
[0] #2  0x000055555576672e in gasneti_system_redirected (cmd=0x555555dc58e0 <cmd> "/usr/bin/gdb -nx -batch -x /tmp/gasnet_tTHVsP '/work/gitlab-ci/scratch/upcxx-utils-cionly-e304f960-Upcxx2023.3-/build-dbg/test/test_flat_aggr_store' 3345", stdout_fd=3) at /dev/shm/upcxx-gitlab-runner-mhm2-builds/upcxx-2023.3.0/bld/GASNet-2023.3.0/gasnet_tools.c:1679
[0] #3  0x0000555555766e8b in gasneti_bt_gdb (fd=3) at /dev/shm/upcxx-gitlab-runner-mhm2-builds/upcxx-2023.3.0/bld/GASNet-2023.3.0/gasnet_tools.c:1944
[0] #4  0x00005555557677cd in gasneti_print_backtrace (fd=2) at /dev/shm/upcxx-gitlab-runner-mhm2-builds/upcxx-2023.3.0/bld/GASNet-2023.3.0/gasnet_tools.c:2228
[0] #5  0x0000555555767e60 in _gasneti_print_backtrace_ifenabled (fd=2) at /dev/shm/upcxx-gitlab-runner-mhm2-builds/upcxx-2023.3.0/bld/GASNet-2023.3.0/gasnet_tools.c:2359
[0] #6  0x0000555555764da6 in gasneti_error_abort () at /dev/shm/upcxx-gitlab-runner-mhm2-builds/upcxx-2023.3.0/bld/GASNet-2023.3.0/gasnet_tools.c:807
[0] #7  0x000055555576504c in gasneti_fatalerror_nopos (msg=0x555555a78e4d "\n%s") at /dev/shm/upcxx-gitlab-runner-mhm2-builds/upcxx-2023.3.0/bld/GASNet-2023.3.0/gasnet_tools.c:851
[0] #8  0x00005555556e84ec in upcxx::detail::fatal_error (msg=0x555555e2cdf0 "Failed condition: has_value()", title=0x555555a78e51 "assertion failure", func=0x555555a13da0 <upcxx::dist_object<std::function<void (KV)> >::operator*() const::__PRETTY_FUNCTION__> "T& upcxx::dist_object<T>::operator*() const [with T = std::function<void(KV)>]", file=0x555555a108b0 "/work/gitlab-ci/ci-install-upcxx-utils-cionly-upcxx-2023.3.0/include/upcxx/dist_object.hpp", line=221) at /dev/shm/upcxx-gitlab-runner-mhm2-builds/upcxx-2023.3.0/src/./diagnostic.cpp:63
[0] #9  0x00005555556e8524 in upcxx::detail::assert_failed (func=0x555555a13da0 <upcxx::dist_object<std::function<void (KV)> >::operator*() const::__PRETTY_FUNCTION__> "T& upcxx::dist_object<T>::operator*() const [with T = std::function<void(KV)>]", file=0x555555a108b0 "/work/gitlab-ci/ci-install-upcxx-utils-cionly-upcxx-2023.3.0/include/upcxx/dist_object.hpp", line=221, msg=0x555555e2cdf0 "Failed condition: has_value()") at /dev/shm/upcxx-gitlab-runner-mhm2-builds/upcxx-2023.3.0/src/./diagnostic.cpp:76
[0] #10 0x0000555555567f3c in upcxx::detail::assert_failed (func=0x555555a13da0 <upcxx::dist_object<std::function<void (KV)> >::operator*() const::__PRETTY_FUNCTION__> "T& upcxx::dist_object<T>::operator*() const [with T = std::function<void(KV)>]", file=0x555555a108b0 "/work/gitlab-ci/ci-install-upcxx-utils-cionly-upcxx-2023.3.0/include/upcxx/dist_object.hpp", line=221, str="Failed condition: has_value()") at /work/gitlab-ci/ci-install-upcxx-utils-cionly-upcxx-2023.3.0/include/upcxx/diagnostic.hpp:22
[0] #11 0x000055555557d61a in upcxx::dist_object<std::function<void (KV)> >::operator*() const (this=0x7fffffffe100) at /work/gitlab-ci/ci-install-upcxx-utils-cionly-upcxx-2023.3.0/include/upcxx/dist_object.hpp:221
[0] #12 0x000055555557413a in upcxx_utils::FlatAggrStore<KV>::set_update_func(std::function<void (KV)>) (this=0x7fffffffddb0, update_func=...) at /var/lib/gitlab-runner/builds/DSnhzJox/0/ExaBiome/upcxx-utils-cionly/include/upcxx_utils/flat_aggr_store.hpp:371
[0] #13 0x000055555556cc8d in test_flat_aggr_store (argc=1, argv=0x7fffffffe2a8) at /var/lib/gitlab-runner/builds/DSnhzJox/0/ExaBiome/upcxx-utils-cionly/test/test_flat_aggr_store.cpp:31
[0] #14 0x0000555555567cac in main (argc=1, argv=0x7fffffffe2a8) at /var/lib/gitlab-runner/builds/DSnhzJox/0/ExaBiome/upcxx-utils-cionly/test/main_shell.cpp:17
[0] [Inferior 1 (process 3345) detached]
*** Caught a fatal signal (proc 0): SIGABRT(6)
*** Details for bug reporting (proc 0): config=RELEASE=2023.3.0,SPEC=1.20,PTR=64bit,debug,SEQ,timers_forced_posixrt,membars_native,atomics_native,atomic32_native,atomic64_native compiler=GNU/7.5.0 sys=x86_64-pc-linux-gnu

This code works with previous versions of upcxx and on other more modern OS and on Ubuntu 20.04.5 with a slightly newer version of g++ Ubuntu 7.5.0-6ubuntu2. This machine is arguably out of Ubuntu support next month

The flat_aggr_store.hpp code that breaks is “*(this->update_func) = update_func;”

void set_update_func(UpdateFunc update_func) {
    barrier(aggr_team);  // to avoid race of first update
    if (max_store_size_per_target > 1 && store.empty())
      DIE("Invalid condition - FlatAggrStore not initialized yet - call set_size() after construction or clear()!\n");
    *(this->update_func) = update_func;
    barrier(aggr_team);  // to avoid race of first update
  }

where

template <typename T, typename... Data>
class FlatAggrStore {
 public:
  using RankStore = vector<T>;
  using RankStoreIterator = typename RankStore::iterator;
  using Store = vector<RankStore>;
  using UpdateFunc = std::function<void(T, Data &...)>;
  using DistUpdateFunc = dist_object<UpdateFunc>;

I know in the release notes for 2023.3 there have been changes to dist_objects and how they can be assigned / constructed / moved, but I do not think that this test should behave any differently than in 2022.9.

And as additional information, the advice that Colin offered in slack this week to copy-capture a task with a mutable lambda (instead of constructing a tuple copy and moving it) also broke on this old OS / g++ with a SEGFAULT instead of the assertion error above.

Let me know if yo think this is a but worthy of fixing. If it is, I’ll wait to upgrade this old machine, and if it is not, then just update any relevant deprecations and I’ll start upgrading its OS.

Comments (5)

  1. Dan Bonachea

    Hi @Rob Egan - Thanks for the report and apologies for the trouble.

    Based on the crash stack, it seems this->update_func has type upcxx::dist_object<std::function<void (KV)> >.

    This assertion error corresponds to a violation of the new precondition on dist_object <T>::operator*(), namely:

    Precondition: this->has_value().

    Although this precondition is new, all unmodified legacy code should trivially satisfy this condition, because before this release it was impossible to construct a dist_object lacking a value (and there is still no way to transition back to a state lacking a value).

    My guess is the actual problem is this code at flat_aggr_store.hpp:120 that I believe constructs the dist_object in question:

      // save the update function to use in both update and flush
      DistUpdateFunc update_func = DistUpdateFunc({});
    

    dist_object has been MoveConstructible for some time. With the 2023.3.0 release dist_object<T> becomes DefaultConstructible, and the particular expression above gains a potential overload ambiguity; i.e. (1) collective construction of a dist_object<UpdateFunc> with a default-constructed UpdateFunc value (the only option prior to 2023.3.0), or (2) non-collective default construction of dist_object<UpdateFunc> without a value (for more details see spec issue 192). In an effort to retain backwards-compatibility 2023.3.0 introduces a constructor overload (dist_object( std::initializer_list<X>)) specifically designed to resolve this ambiguity in favor of the first interpretation. However I suspect that type gimmick has somehow malfunctioned here.

    My unconfirmed guess is that g++ 7.5's overload resolution for this expression is buggy/non-conformant wrt std::initializer_list, leading it to incorrectly select the second/wrong overload resolution. It's especially puzzling that it did not issue an overload ambiguity error and just arbitrarily selected the wrong one.

    Please try this change:

    --- a/include/upcxx_utils/flat_aggr_store.hpp
    +++ b/include/upcxx_utils/flat_aggr_store.hpp
    @@ -117,7 +117,7 @@ class FlatAggrStore {
       string description;
    
       // save the update function to use in both update and flush
    -  DistUpdateFunc update_func = DistUpdateFunc({});
    +  DistUpdateFunc update_func = DistUpdateFunc({}, upcxx::world());
       // save all associated data structures as a tuple of a variable number of parameters
       std::tuple<Data &...> data;
    

    or if you prefer:

    --- a/include/upcxx_utils/flat_aggr_store.hpp
    +++ b/include/upcxx_utils/flat_aggr_store.hpp
    @@ -117,7 +117,7 @@ class FlatAggrStore {
       string description;
    
       // save the update function to use in both update and flush
    -  DistUpdateFunc update_func = DistUpdateFunc({});
    +  DistUpdateFunc update_func = DistUpdateFunc(UpdateFunc{});
       // save all associated data structures as a tuple of a variable number of parameters
       std::tuple<Data &...> data;
    

    either of these alternative spellings should more explicitly resolve the ambiguity in the desired way.

    It's strange this slipped through our testing, because our test/dist_object.cpp is attempting to test an exactly analogous case, and passed successfully on both older and newer versions of g++ (although possibly not exactly matching 7.5.0).

    Please report back with your findings, and update this issue if you discover any other C++ compiler versions that resolve this incorrectly.

  2. Rob Egan reporter

    Thanks Dan!

    That looks like it was the problem and constructing with the correctly typed empty object solved that and the cascade of similar problems throughout MHM2’s code (on that old machine).

    This looks like it was triggered by a bug in g++ 7.5.0-3 that is no longer present in Ubuntu 7.5.0-6.

  3. Dan Bonachea

    Hi Rob - thanks for confirming the diagnosis as an apparent (short-lived?) compiler bug.

    Please update this issue if you encounter a similar problem elsewhere.

  4. Log in to comment