Cannot register multiple completions against a non-copyable results type

Issue #408 resolved
Amir Kamil created an issue

This is a known issue, but it does not seem to be recorded in this issue tracker, so I’m added it now.

We do not currently allow multiple notifications for a completion that produces a non-copyable value. The reason is that lpc_dormat::awaken copies the result for all but the last completion, for which it moves the result. We currently have a runtime check for this in awaken.

There is nothing in the spec that prohibits requesting multiple notifications in this case. We do not require the result to be CopyConstructible.

This is only relevant to rpc, since that is the only operation that produces a non-TriviallySerializable result.

Sample program:

#include <upcxx/upcxx.hpp>
#include "../util.hpp"

using namespace upcxx;

struct T {
  int x;
  T() {}
  T(int x_) : x(x_) {}
  T(const T&) = delete;
  T(T&&) = default;
  UPCXX_SERIALIZED_FIELDS(x)
};

int main() {
  upcxx::init();
  print_test_header();

  auto futs = rpc(
      0,
      operation_cx::as_future() | operation_cx::as_future(),
      []() {
          return T(-3);
        }
    );
  UPCXX_ASSERT_ALWAYS(std::get<0>(futs).wait_reference().x == -3);
  UPCXX_ASSERT_ALWAYS(std::get<1>(futs).wait_reference().x == -3);

  print_test_success();
  upcxx::finalize();
}

Running this program produces:

Test: issue408.cpp
Ranks: 1
*** FATAL ERROR (proc 0): 
//////////////////////////////////////////////////////////////////////
UPC++ assertion failure:
 on process 0 (marauder.lan)
 at /Users/akamil/work/upcxx-akamil/bld/upcxx.assert1.optlev0.dbgsym1.gasnet_seq.smp/include/upcxx/lpc_dormant.hpp:169
 in function: void upcxx::detail::lpc_dormant<T>::awaken(std::tuple<T...> &&) [T = <T>]()

You have attempted to register multiple completions against a non-copyable results type.

To have UPC++ freeze during these errors so you can attach a debugger,
rerun the program with GASNET_FREEZE_ON_ERROR=1 in the environment.
//////////////////////////////////////////////////////////////////////

NOTICE: Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace. 
*** Caught a fatal signal (proc 0): SIGABRT(6)
Abort trap: 6

Options:

  1. Leave the implementation as is and either document the non-compliance or modify the spec to prohibit this.

    1. We incur 1 deserialization, 2 moves, and n-1 copies (in the copyable case) when there are n completions.
  2. We modify the logic in awaken (and send_awaken_lpc) so that the result is deserialized separately into each completion.

    1. We incur n deserializations, n moves, and no copies. But we pay the cost of deserialization for each completion. (Probably not a big deal, because who requests multiple notifications for the same completion event?)
  3. We build in more complex logic in awaken so that it deserializes for each completion only when the result is non-copyable, otherwise it deserializes into the first completion and then copies from there to the rest.

    1. We incur 1 deserialization, 1 move, and n-1 copies in the copyable case and n deserializations, n moves, and no copies in the non-copyable case.

I have a prototype implementation of #2, and #3 should be straightforward to implement over that.

Thoughts?

Comments (6)

  1. Dan Bonachea

    For the specific example Amir provides, I'd much rather see us merge those completions and return the same future twice, since that's the cheapest possible thing we could do at runtime. However as he points out it's also quite silly for the user to have done that in the first place, so we shouldn't worry too much about optimizing it.

    I think the more interesting case is:

    rpc(...,operation_cx::as_promise(p) | operation_cx::as_future(), ...)
    

    This is still a "silly" duplicate completion (and I don't know a good reason to do this), but at least here the user is requesting the result value be fulfilled into two explicitly different heap cells (one owned by the runtime and another owned by the implementation). So the question posed here is should we be deserializing N times, deserializing once and copying, or some hybrid of those based on Copyability.

    I think the only relevant spec-ese is 7.1.-18:

    If a completion event is associated with some values, the UPC++ runtime is permitted to pass separate copies of those values to each notification registered on that completion event.

    which I think implicitly agrees with the current implementation behavior of deserialize exactly once and copy if needed (and error if that's not possible). This semantics (ie Amir's option 1) is easy to explain, and we can easily clarify that asking for multiple completions on the same value-producing event is an error for non-Copyable values. I'm ok with that restriction because I think asking for duplicate completions on a value-producing event is already a strange/bad thing to be doing, and will carry a runtime cost regardless of how we implement it (even for Copyable types).

    I'm uneasy with Amir's options 2 + 3, because they seem to implicitly assume that deserialization and copy are semantically equivalent or at least interchangeable, and I can think of several different situations where this is not the case. Examples include:

    1. Types whose state is immutable or COW, where deserialization is expensive but copy is cheap due to state-sharing
    2. Conversely: Subset serialization, where deserialization may be quite cheap (allocate a large heap object but only populate a small portion) but copy is expensive (copying a bunch of cleared memory)
    3. Registration-based deserialization, where deserialization registers an incoming object with a central registry in global state using an identity that persists across object moves (or even copies). Here silently converting one deserialization into many deserializations could break the system invariants.

    In all these cases I think it would be surprising/confusing for us to silently change copies into deserialization, across UPC++ versions or across a change in CopyConstructibility of the value type. So I think l I'm currently leaning towards clarifying the implemented restriction as a spec constraint (option 1) -- until/unless someone can product a compelling motivation for requesting multiple completions on the same value-producing event at all (let alone for a non-Copyable type).

  2. Amir Kamil reporter

    If we keep the currently implemented semantics, the implementation can be modified to reduce one move in the common case:

    4. If there is only one completion, deserialize directly into that completion. Otherwise, deserialize into a temporary and copy/move it into each completion.

    • We incur 1 move if there is only one completion. If there are n > 1 completions, we incur 2 moves and n-1 copies.

    Note that due to type complications with const, we cannot easily copy from one completion to the others as suggested in option 3 above.

  3. Amir Kamil reporter

    PR 272 (now merged into develop) implements option 4 above. (If there is only one completion, deserialize directly into that completion. Otherwise, deserialize into a temporary and copy/move it into each completion.)

  4. Dan Bonachea

    Merge develop for 2021.3.0 release

    • develop: Bump version to 2021.3.0 for release Add spec 2021.3.0 PDF Update GASNet default URL ChangeLog updates INSTALL.md updates Tweak issue462 test for slightly stronger coverage. Fix issue 462 -- restore DefaultConstructibility of dist_id. Also fix stream-insertion operator. Add test. Update ChangeLog test/copy: issue #241: Add missing CUDA device synchronization test/copy: add error checking to cuda calls test/copy: Update tests to abide by the strengthened requirement copy: Revert issue #423 workaround Update UPCXX_ASSERT_MASTER_IFSEQ to enforce the strengthened requirement issue 423: prohibit SEQ-mode communication using non-master personas configure: allow nvcc setting to contain args Minor optimization to upcxx::current_persona() Update ChangeLog issue #455: don't force synchronous source_cx for rput(remote_cx) Add an rput completions coverage test Fix a think-o in UPCXX_ASSERT_ALWAYS_MASTER() ChangeLog: updates for recent ibv-conduit work Add issue 460 to the change log. Implement detail::invoke_result as an abstraction over std::result_of and std::invoke_result (issue 460). fixup: Resolve issue 430: cannot disable the de... tweak UPCXX_INTERNAL_ONLY to include "detail" in the symbol name Update ChangeLog Resolve issue 430: cannot disable the default network Update INSTALL.md list of (un)supported platforms Update ChangeLog Move upcxx::os_env to upcxx::experimental::os_env Move upcxx::say to upcxx::experimental::say Bump development version to 2020.11.11 Add ChangeLog entry for issue #459. Add ChangeLog entries for issues 25 and 276. Move destroy_heap, restore_heap, op_min, and op_max to experimental namespace. Add ChangeLog entry. Move notrivial reduce and broadcast to experimental namespace (issue 459). Move non-fast reduce ops to experimental. Fix for CI failure with Clang floor. Add detail:internal_only parameter to future1(impl_type) constructor. Update documentation to reflect move of future1 into detail. Add detail::internal_only parameter to future1::then_lazy(). Remove obsolete wait.hpp. Move assert_failed() and fatal_error() into detail namespace. Move view_default_iterator into detail namespace. Rename helper class templates. Move serialization_align_max to detail namespace. Rename operation_cx_as[internal]_future to operation_cx_as[_internal]_future_t. Move apply[_tupled]_as_future into detail namespace. Move future1 and future_is_trivially_ready into detail namespace. Remove long-obsolete reflection.hpp. Move global_fnptr into detail namespace. Move completions into detail namespace. Move digest into detail namespace. Move storage_size and friends into detail namespace. Move bind and helpers into the detail namespace. Protect two more persona_scope members. Proctect persona members. Protect persona_scope members (issue #276). update global_ptr::check() to use gex_EP_QueryBoundSegmentNB Add OFI as an unofficial conduit configure: enable GASNet memory kinds as needed rpc-ctor-trace: add coverage for VIS rput(as_rpc) VIS rput: ensure synchronous serialization of remote_cx rpc-ctor-trace: Add coverage for device memory and GEX MK copy: ensure synchronous serialization of remote_cx for reference PUT case copy: ensure synchronous serialization of remote_cx for MK PUT case copy: ensure synchronous serialization of remote_cx for GET cases Implement backend::prepare_deferred_am_master() rpc-ctor-trace: update expected counts Remove unnecessary move/copy of internal lambda completions deferred in the cuda queue rpc-ctor-trace: expand to exercise upcxx::copy Add a verbose-ctor-trace test lpc-ctor-trace: strengthen as_lpc coverage lpc-ctor-trace: expand to exercise upcxx::copy Remove extraneous qualifiers from friend declarations. Move UPCXX_INTERNAL_ONLY macro to backend_fwd.hpp. Add internal accessor for raw pointer of a global_ptr. Make members of team_id, dist_id, and global_fnptr private and add friend declarations. Wrap global_ptr members with a macro to signify that they are intended to be private. Address issue #276. Refactor deserialized move constructibility check to make it liklier to trip the static assertions. Add assertions to rpc/rpc_ff/as_rpc that deserialized arguments/return values are MoveConstructible. Add tests. Fix a major bug introduced a few commits ago in rpc-ctor-trace. Add a few more test cases. Move utility code to utility.hpp. Modify the text of an error message. Add static_asserts for movability/copyability of rpc/rpc_ff/as_rpc rvalue arguments. Add tests. Fix rpc-ctor-trace test to avoid violating movability requirements in rpc. issue427b: fix a harmless warning rpc-ctor-trace: remove unnecessary barriers Optimize computation of element size with alignment. Updates to documentation and test case. Fix some small issues from the last commit. Account for alignment constraints when computing space per element in serialization of a sequence. Fixes #427. Update Apple compiler detection utility.hpp: clarify a comment Update copy to use bound function instead of serialized completion. Remove obsolete declarations. Move event binding of completions back into completions_state to take advantage of the event-matching logic. Reorganize code and add comments. Rearchitect implementation of bind_event to work on completions in addition to completions_state. Allows reduction of copies/moves. Update documentation for bind_event() and remove requirement that it be called on an rvalue. Refactoring in cx_remote_dispatch to reduce future chaining. Rename a type for clarity. Undo obsolete changes introduced by PR#263. Add some comments. Excise serialization of completions in favor of directly serializing remote bound functions. Use bind_event() in rget and vis. Update ChangeLog Avoid other possible instances of issue #421 Fix issue #421: PGI optimizer problem with upcxx::copy() Add issue421c: upgraded version of issue421 Partially revert "bld/tests.mak: Ban upcxx::copy tests on PGI, avoiding issue 421" Revert "issue 421: Disable use of upcxx::copy with PGI" fixup^2: Support more delimiters in NETWORKS Add change log entry for PR 323. Update static assertion messages to reflect feedback on PR 323. Add static_asserts to detect use of new and new_array with array types. fixup: Support more delimiters in NETWORKS Support more delimiters in NETWORKS ChangeLog: document issue 422 as fixed configure: gasnet/version.git for auto-viv tarball configure: automate bootstrap of GASNet-EX sources configure: relax assumption on GASNet tarball Add issue #408 as fixed to change log. Resolve issue 446 configure: recognize new Intel compilers Add build/run commands to test-results logging Preserve return value of apply_as_future_then_lazy in all three cases. Consolidate apply_as_future_then and apply_as_future_then_lazy. Move logic for avoiding future overheads in apply_as_future().then[lazy] to apply.hpp. Make use of it in other contexts. Rearchitect rpc reply to sidestep future machinery when possible; reduces moves. Restore check for copyability in lpc_dormant::awaken. Deserialize into temporary storage and copy from there when there are multiple completions. Modify lpc_dormant::awaken to deserialize arguments separately for each completion. Fixes #408. test/util.hpp: Add a backwards compatilbity hack lpc-stress: strengthen the test to check LPC return path Add an lpc-stress test Bump development version to 2020.11.9 Update lpc-ctor-trace for issue #450 Update ChangeLog Fix issue #450: upcxx::lpc callback return of rvalue reference not decayed as specified Add issue450 test Allow building view in seq threadmode Add threaded SEQ testing (issue 451) bld/tests.mak: fix harmless typos in comments Update ChangeLog Add issue447 test Fix issue #447: REGRESSION: bulk upcxx::rput with l-value completions fixup: Add some alignment assertions Add some alignment assertions Relocate UPCXX_MPSC_QUEUE[FOO] defn to header fixup: issue #440: Deploy a temporary solution for 'Invalid GASNet call' in deserialization Update ChangeLog fix issue #245: persona-example deadlocks when --with-mpsc-queue=biglock intru_queue.hpp: Note a design limitation Bump development version to 2020.11.7 Remove PROTOTYPE BRANCH notices src/cuda.cpp: Fix a harmless warning from PGI 20.4 bld/tests.mak: Ban upcxx::copy tests on PGI, avoiding issue 421 Update ChangeLog issue 421: Disable use of upcxx::copy with PGI Revert "INSTALL.md: Prohibit PGI compilers for memory_kinds branch" Revert "Prohibit PGI on the memory_kinds branch" re-enable test/regression/issue336.cpp for dev-check test/shared-seg-query: fix a harmless pedantic sign-compare warning Bump patch number to 2020.11.5 Update ChangeLog Add shared segment query test issue #382: Expose shared heap usage at runtime build-devel.md: update ALCF server URL issue440.cpp: Fix a harmless -pedantic warning Update ChangeLog issue #440: Deploy a temporary solution for 'Invalid GASNet call' in deserialization strengthen the issue440.cpp test Add issue440 test Add GitLab CI information to developer's docs issue #438: Report degenerate local_team in upcxx-run -vv verbose output Bump copyright year mpi-hybrid.md: Update udp-conduit recommended CSPAWN settings fixup: CI testing: add probe for OpenMP support Move GitLab CI and dev-ci scripts to a new repo Update ChangeLog configure: default to 8kb max-medium on aries-conduit dev-check: work-arounds for known issues CI testing: add probe for OpenMP support GitLab CI: renames to get junit/xml reporting GitLab CI: enable junit/xml report generation Update ChangeLog Add issue #432 test issue #432: Some upcxx::copy cases do not discharge() properly aries: cap default UPCXX_RPC_EAGER_THRESHOLD to 8kb Bump patch number to 2020.11.3 Update ChangeLog runtime.cpp: ensure max_medium calc accounts for GEX_FLAG_AM_PREPARE_LEAST_ALLOC rma_put_then_am_master_protocol(): Fix gex_AM_MaxRequestMedium arg count UPCXX_RPC_EAGER_THRESHOLD tweaks Update spelling for GASNET_NATIVE_NP_ALLOC_REQ_MEDIUM bench/misc_perf: expand rpc tests to cover the internal code paths Statically optimize am_send_buffer for NPAM Repair am_send_buffer tiny buffer optimization Add NPAM support for Eager RPC injection Add post-configure check for conduit native NPAM Add UPCXX_RPC_EAGER_THRESHOLD_LOCAL to control local threshold Upgrade prepare_am with local threshold and NPAM prereqs issue #261: tweak gasnet::rma_put_then_am_master runtime.hpp: Remove dead code that is no longer maintained RPC: Hoist TM/rank translation overheads rpc.hpp: "Invert" tm/no-tm overloads of rpc/rpc_ff bench/misc_perf: Add test of rput-then-rpc Change UPCXX_RPC_EAGER_THRESHOLD default to the maximum value issue #164: Add UPCXX_RPC_EAGER_THRESHOLD knob Add an RPC microbenchmark Update ChangeLog move remaining upcxx code in to separate header file Move UPC++ dl_malloc defines into separate header file Complete dlmalloc function list Name shift UPC++'s dlmalloc symbols Fix dev-check when configured using --without-cuda tests.mak: Add bench directory to dev-check Add issue428 test Update ChangeLog issue #428: Regression in rpc(team,rank,..,view) overload resolution leads to confusing type errors ChangeLog: spell-check issue #420: Suppress harmless GASNet debug-mode configure warning GitLab CI / CI scripts: de-version macOS bits Remove the deprecated top-level install script GitLab CI: Resume testing of Theta PrgEnv-cray Resolve issue #425 copy-cover: Fix harmless sign-compare warnings from clang 11 cuda.hpp: Fix a harmless warning from clang 11 -Wall Revert "Set GASNet 2020.11.0 tarball URL" Bump version to 2020.11.1 for mk-develop Bump package version for 2020.11.0 release Set GASNet 2020.11.0 tarball URL Merge 2020.3.2 section into ChangeLog Bump development version to 2020.10.1 Add PDFs for 2020.11.0 spec and 2020.10.0 guide INSTALL.md: Expand instructions for GDR memory kinds INSTALL.md: Prohibit PGI compilers for memory_kinds branch Add prototype branch notices to major documents Update ChangeLog syncx3 updated examples from guide re-sync updated examples from guide sync updated examples from guide Change configuration of private segment tests Add support for using private segment for host array Fix order of arguments being passed Add host to host transfers Use promises rather than futures in flood benchmark Local GPU and remote host tests Prohibit PGI on the memory_kinds branch Makefile: Fix Git version stamp detection copy-cover: Exercise remote-only completion events copy-cover: Add source completion validation copy-cover: sweep buf sizes fixupx2: Add new copy-cover test fixup: Implement UPCXX_BUG4148_WORKAROUND fixup: Add new copy-cover test Implement UPCXX_BUG4148_WORKAROUND and RC-only opt for puts copy.hpp: factor initiator variable upcxx::copy: Tune use_mk for no RC fixup: issue #221: Fix upcxx::copy() mishandling of private memory arguments GitLab CI: PGI testers opt-in only (issue #421) Upgrade global_ptr::check() to bounds-check device pointers Change device_allocator to throw upcxx::bad_segment_alloc device_allocator construct: prohibit empty active segments Add bad-segment-alloc test to validate device_allocator exn behavior device_allocator: fix undocumented device memory allocation Fix delete of array in view-accumulate.cpp to delete[] example/prog-guide: apply updates from guide repo cuda_microbenchmark: correct windowing logic Less specific GitLab tags for Xcode runners. correct an assert typo appearing in multiple functions Increase max_heaps in the reference implementation to 33 Avoid issue 423 for upcxx::copy() Deploy GPUDirect transfers in upcxx::copy Construct GEX EP/MK/Segment objects Implement heap_idx recycling Add negative compilation tests copy: assert additional preconditions test/regression/issue405: fix a violated copy() precondition upcxx::copy()/detail::rma_copy_local() use memcpy copy-cover: exploit fix to issue421 Add new copy-cover test issue #221: Fix upcxx::copy() mishandling of private memory arguments Change default --with-gasnet URL to memory_kinds nightly snapshot Split issue 421 test into two, one with lambdas and another with an asymmetric fuction object. Fix issue #421. Add issue421 Ban PGI 20.7+ due to known bugs fixup: Bump required GEX spec version to 0.10 and remove cruft test/copy: update for new collective requirement on device_allocator construction Add collective device_allocator creation Add a high-level coverage test for memory kinds fix an unreported bug in cuda_device move constructor fix unreported bug in device_allocator template inheritance Improve heap_state safety Probe for GASNET_MAXEPS Rework internal memory kinds objects global_ptr: Rename device_ field to heap_idx_

    → <<cset efd61603c031>>

  5. Log in to comment