Regression: upcxx::copy(remote_cx::as_rpc), PR 289 fix breaks PGI optimizer
Test program:
#include <upcxx/upcxx.hpp>
#include <iostream>
#include <assert.h>
using namespace upcxx;
int main() {
upcxx::init();
global_ptr<int> gp1 = new_<int>();
global_ptr<int> gp2 = new_<int>();
*gp1.local() = 42;
static bool done = false;
copy(gp1, gp2, 1, remote_cx::as_rpc([]() { done = true; }));
while (!done) upcxx::progress();
assert(*gp2.local() == 42);
barrier();
delete_(gp1);
delete_(gp2);
if (!rank_me()) std::cout << "SUCCESS" << std::endl;
upcxx::finalize();
return 0;
}
This compiles and works as expected in all of 2019.9.0, 2020.3.2 and 2020.3.8. However on our pending release branch (effectively equal to develop) this fails to compile as follows:
In file included from /usr/local/pkg/upcxx-dirac/gcc-10.2.0/nightly-2020.10.14/upcxx.debug.gasnet_seq.smp/include/upcxx/team_fwd.hpp:4,
from /usr/local/pkg/upcxx-dirac/gcc-10.2.0/nightly-2020.10.14/upcxx.debug.gasnet_seq.smp/include/upcxx/team.hpp:4,
from /usr/local/pkg/upcxx-dirac/gcc-10.2.0/nightly-2020.10.14/upcxx.debug.gasnet_seq.smp/include/upcxx/backend.hpp:7,
from /usr/local/pkg/upcxx-dirac/gcc-10.2.0/nightly-2020.10.14/upcxx.debug.gasnet_seq.smp/include/upcxx/allocate.hpp:8,
from /usr/local/pkg/upcxx-dirac/gcc-10.2.0/nightly-2020.10.14/upcxx.debug.gasnet_seq.smp/include/upcxx/upcxx.hpp:5,
from copy-remote_cx.cpp:1:
/usr/local/pkg/upcxx-dirac/gcc-10.2.0/nightly-2020.10.14/upcxx.debug.gasnet_seq.smp/include/upcxx/bind.hpp: In instantiation of 'struct upcxx::detail::deserialized_bound_function_base<upcxx::detail::copy<upcxx::completions<upcxx::rpc_cx<upcxx::remote_cx_event, upcxx::bound_function<main()::<lambda()> > > > >::<lambda(void*)>::<lambda()>::<lambda(cxs_remote_t&&)>, std::tuple<upcxx::detail::completions_state<upcxx::detail::event_is_remote, upcxx::detail::rput_event_values, upcxx::completions<upcxx::rpc_cx<upcxx::remote_cx_event, upcxx::bound_function<main()::<lambda()> > > >, 0> >, upcxx::detail::index_sequence<0>, true>':
/usr/local/pkg/upcxx-dirac/gcc-10.2.0/nightly-2020.10.14/upcxx.debug.gasnet_seq.smp/include/upcxx/bind.hpp:281:12: required from 'struct upcxx::detail::deserialized_bound_function<upcxx::detail::copy<upcxx::completions<upcxx::rpc_cx<upcxx::remote_cx_event, upcxx::bound_function<main()::<lambda()> > > > >::<lambda(void*)>::<lambda()>::<lambda(cxs_remote_t&&)>, upcxx::detail::completions_state<upcxx::detail::event_is_remote, upcxx::detail::rput_event_values, upcxx::completions<upcxx::rpc_cx<upcxx::remote_cx_event, upcxx::bound_function<main()::<lambda()> > > >, 0> >'
/usr/local/pkg/upcxx-dirac/gcc-10.2.0/nightly-2020.10.14/upcxx.debug.gasnet_seq.smp/include/upcxx/bind.hpp:301:70: required from 'struct upcxx::bound_function<upcxx::detail::copy<upcxx::completions<upcxx::rpc_cx<upcxx::remote_cx_event, upcxx::bound_function<main()::<lambda()> > > > >::<lambda(void*)>::<lambda()>::<lambda(cxs_remote_t&&)>, upcxx::detail::completions_state<upcxx::detail::event_is_remote, upcxx::detail::rput_event_values, upcxx::completions<upcxx::rpc_cx<upcxx::remote_cx_event, upcxx::bound_function<main()::<lambda()> > > >, 0> >'
/usr/local/pkg/upcxx-dirac/gcc-10.2.0/nightly-2020.10.14/upcxx.debug.gasnet_seq.smp/include/upcxx/copy.hpp:238:24: required from 'typename upcxx::detail::completions_returner<upcxx::detail::event_is_here, upcxx::detail::rput_event_values, typename std::decay<_Tp>::type>::return_t upcxx::detail::copy(int, upcxx::intrank_t, void*, int, upcxx::intrank_t, void*, std::size_t, Cxs&&) [with Cxs = upcxx::completions<upcxx::rpc_cx<upcxx::remote_cx_event, upcxx::bound_function<main()::<lambda()> > > >; typename upcxx::detail::completions_returner<upcxx::detail::event_is_here, upcxx::detail::rput_event_values, typename std::decay<_Tp>::type>::return_t = void; upcxx::intrank_t = int; std::size_t = long unsigned int]'
/usr/local/pkg/upcxx-dirac/gcc-10.2.0/nightly-2020.10.14/upcxx.debug.gasnet_seq.smp/include/upcxx/copy.hpp:87:24: required from 'typename upcxx::detail::completions_returner<upcxx::detail::event_is_here, upcxx::detail::rput_event_values, typename std::decay<Cxs>::type>::return_t upcxx::copy(upcxx::global_ptr<const T, KindSet>, upcxx::global_ptr<T, K>, std::size_t, Cxs&&) [with T = int; upcxx::memory_kind Ks = upcxx::memory_kind::host; upcxx::memory_kind Kd = upcxx::memory_kind::host; Cxs = upcxx::completions<upcxx::rpc_cx<upcxx::remote_cx_event, upcxx::bound_function<main()::<lambda()> > > >; typename upcxx::detail::completions_returner<upcxx::detail::event_is_here, upcxx::detail::rput_event_values, typename std::decay<Cxs>::type>::return_t = void; std::size_t = long unsigned int]'
copy-remote_cx.cpp:14:61: required from here
/usr/local/pkg/upcxx-dirac/gcc-10.2.0/nightly-2020.10.14/upcxx.debug.gasnet_seq.smp/include/upcxx/bind.hpp:213:7: error: no type named 'type' in 'class std::result_of<upcxx::detail::copy<upcxx::completions<upcxx::rpc_cx<upcxx::remote_cx_event, upcxx::bound_function<main()::<lambda()> > > > >::<lambda(void*)>::<lambda()>::<lambda(cxs_remote_t&&)>&&(upcxx::detail::completions_state<upcxx::detail::event_is_remote, upcxx::detail::rput_event_values, upcxx::completions<upcxx::rpc_cx<upcxx::remote_cx_event, upcxx::detail::deserialized_bound_function<main()::<lambda()> > > >, 0>&&)>'
213 | operator()() && {
| ^~~~~~~~
/usr/local/pkg/upcxx-dirac/gcc-10.2.0/nightly-2020.10.14/upcxx.debug.gasnet_seq.smp/include/upcxx/bind.hpp: In instantiation of 'struct upcxx::bound_function<upcxx::detail::copy<upcxx::completions<upcxx::rpc_cx<upcxx::remote_cx_event, upcxx::bound_function<main()::<lambda()> > > > >::<lambda(void*)>::<lambda()>::<lambda(cxs_remote_t&&)>, upcxx::detail::completions_state<upcxx::detail::event_is_remote, upcxx::detail::rput_event_values, upcxx::completions<upcxx::rpc_cx<upcxx::remote_cx_event, upcxx::bound_function<main()::<lambda()> > > >, 0> >':
/usr/local/pkg/upcxx-dirac/gcc-10.2.0/nightly-2020.10.14/upcxx.debug.gasnet_seq.smp/include/upcxx/copy.hpp:238:24: required from 'typename upcxx::detail::completions_returner<upcxx::detail::event_is_here, upcxx::detail::rput_event_values, typename std::decay<_Tp>::type>::return_t upcxx::detail::copy(int, upcxx::intrank_t, void*, int, upcxx::intrank_t, void*, std::size_t, Cxs&&) [with Cxs = upcxx::completions<upcxx::rpc_cx<upcxx::remote_cx_event, upcxx::bound_function<main()::<lambda()> > > >; typename upcxx::detail::completions_returner<upcxx::detail::event_is_here, upcxx::detail::rput_event_values, typename std::decay<_Tp>::type>::return_t = void; upcxx::intrank_t = int; std::size_t = long unsigned int]'
/usr/local/pkg/upcxx-dirac/gcc-10.2.0/nightly-2020.10.14/upcxx.debug.gasnet_seq.smp/include/upcxx/copy.hpp:87:24: required from 'typename upcxx::detail::completions_returner<upcxx::detail::event_is_here, upcxx::detail::rput_event_values, typename std::decay<Cxs>::type>::return_t upcxx::copy(upcxx::global_ptr<const T, KindSet>, upcxx::global_ptr<T, K>, std::size_t, Cxs&&) [with T = int; upcxx::memory_kind Ks = upcxx::memory_kind::host; upcxx::memory_kind Kd = upcxx::memory_kind::host; Cxs = upcxx::completions<upcxx::rpc_cx<upcxx::remote_cx_event, upcxx::bound_function<main()::<lambda()> > > >; typename upcxx::detail::completions_returner<upcxx::detail::event_is_here, upcxx::detail::rput_event_values, typename std::decay<Cxs>::type>::return_t = void; std::size_t = long unsigned int]'
copy-remote_cx.cpp:14:61: required from here
/usr/local/pkg/upcxx-dirac/gcc-10.2.0/nightly-2020.10.14/upcxx.debug.gasnet_seq.smp/include/upcxx/bind.hpp:301:70: error: no match for call to '(upcxx::detail::deserialized_bound_function<upcxx::detail::copy<upcxx::completions<upcxx::rpc_cx<upcxx::remote_cx_event, upcxx::bound_function<main()::<lambda()> > > > >::<lambda(void*)>::<lambda()>::<lambda(cxs_remote_t&&)>, upcxx::detail::completions_state<upcxx::detail::event_is_remote, upcxx::detail::rput_event_values, upcxx::completions<upcxx::rpc_cx<upcxx::remote_cx_event, upcxx::bound_function<main()::<lambda()> > > >, 0> >) ()'
301 | std::declval<detail::deserialized_bound_function<Fn, B...>&&>()()
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
/usr/local/pkg/upcxx-dirac/gcc-10.2.0/nightly-2020.10.14/upcxx.debug.gasnet_seq.smp/include/upcxx/bind.hpp:303:5: warning: 'decltype (declval<upcxx::detail::deserialized_bound_function<Fn, B ...>&&>()()) upcxx::bound_function<Fn, B>::operator()() && [with Fn = main()::<lambda()>; B = {}]' used but never defined
303 | operator()() &&;
| ^~~~~~~~
This was not noticed until now because we have no automated coverage of remote_cx
applied to copy()
(yet).
We need to quickly triage this problem and make a decision to either fix it or determine what to rollback for the release. Based on git history, pull request #279 seems the very likely root cause - it's the only modification to copy.hpp during the interval in question and it changed the type of the cxs_remote
variable which appears to be what broke things.
Comments (18)
-
-
reporter @Amir Kamil - I agree the current code is "doing it wrong". Clearly we should always be serializing and deserializing the
as_rpc
argument before invocation, even for the case that it happens to run locally - so I fully agree with my understanding of your changes, and that they probably solve a second problem I had not noticed.However I think the new test code introduced in pull request #289 to evaluate/trigger this secondary issue is flawed. Let's iterate on that in the PR, but regardless the copy.hpp fixes appear to be what we need.
-
reporter - changed status to resolved
Fix issue
#421.→ <<cset affa327fc7a8>>
-
reporter Merge pull request #289 into memory_kinds
- issue421:
Split issue 421 test into two, one with lambdas and another with an asymmetric fuction object.
Fix issue
#421.
→ <<cset 5cec52e45a14>>
- issue421:
Split issue 421 test into two, one with lambdas and another with an asymmetric fuction object.
Fix issue
-
reporter - changed status to open
-
reporter - marked as major
PR 289 was merged to the memory_kinds branch, this defect remains on develop (and in forthcoming 2020.10.0).
As documented and triaged in pull request #289, the currently best-known solution had the unpleasant side effect of breaking most/all use of
upcxx::copy()
with the PGI optimizer for reasons we don't yet fully understand.We hope to find a better resolution to this problem post-release. In the meantime, GPU users are advised to favor the forthcoming memory_kinds prototype (for this and other reasons).
-
reporter Last night's results on EX-summit-ibv-pgi-opt for the memory_kinds branch @ 5cec52e post-merge of PR 289 show the expected opt-mode regression on the following tests:
- cannon_cuda
- copy
- cuda_microbenchmark
- cuda_vecadd
- h-d
- jac3d
Some of these failures had not been seen before (likely because they are not part of dev-check), but they all invoke
upcxx::copy()
and are thus subject to the known problem. The merge of PR 289 is the only significant change on this branch since these tests all passed on 10-15 @ 322ffe7. -
reporter - changed milestone to 2021.3.0 release
Mass roll-over of open issues to next release milestone
-
reporter - changed title to Regression: upcxx::copy(remote_cx::as_rpc), PR 289 fix breaks PGI optimizer
- changed version to mk-develop branch
- marked as blocker
Fixing this regression for the PGI optimizer (or possibly disclaiming PGI) is a blocker for merging mk-develop and the next release.
-
-
reporter pull request 311 "restores" PGI support by banning the use of
upcxx::copy()
when using PGI.NOT a solution to this issue, just an imperfect compromise in the interests of expediency to avoid stalling unrelated development.
-
reporter - changed status to resolved
Merge branch 'mk-develop' into develop
All conflicts are resolved in favor of mk-develop, making the result textually identical to the last commit on mk-develop.
conflicts:
- .gitlab-ci.yml - replaced with new out-of-tree link
- ChangeLog - removed a duplicate addition of 2020.3.2, otherwise trivial
- bench/cuda_microbenchmark.cpp - minor conflict due to a bugfix on develop and a rewrite on mk-develop (kept the latter).
- docs/spec.pdf - Revision 2020.11.0-draft is kept as latest
- src/backend/gasnet/runtime_internal.hpp - trivial
- src/version.hpp - trivial
-
utils/upcxx-run - trivial
-
mk-develop: src/cuda.cpp: Fix a harmless warning from PGI 20.4 bld/tests.mak: Ban upcxx::copy tests on PGI, avoiding issue 421 Update ChangeLog issue 421: Disable use of upcxx::copy with PGI Revert "INSTALL.md: Prohibit PGI compilers for memory_kinds branch" Revert "Prohibit PGI on the memory_kinds branch" re-enable test/regression/issue336.cpp for dev-check test/shared-seg-query: fix a harmless pedantic sign-compare warning Bump patch number to 2020.11.5 Update ChangeLog Add shared segment query test issue
#382: Expose shared heap usage at runtime build-devel.md: update ALCF server URL issue440.cpp: Fix a harmless -pedantic warning Update ChangeLog issue#440: Deploy a temporary solution for 'Invalid GASNet call' in deserialization strengthen the issue440.cpp test Add issue440 test Add GitLab CI information to developer's docs issue#438: Report degenerate local_team in upcxx-run -vv verbose output Bump copyright year mpi-hybrid.md: Update udp-conduit recommended CSPAWN settings fixup: CI testing: add probe for OpenMP support Move GitLab CI and dev-ci scripts to a new repo Update ChangeLog configure: default to 8kb max-medium on aries-conduit dev-check: work-arounds for known issues CI testing: add probe for OpenMP support GitLab CI: renames to get junit/xml reporting GitLab CI: enable junit/xml report generation Update ChangeLog Add issue#432test issue#432: Some upcxx::copy cases do not discharge() properly aries: cap default UPCXX_RPC_EAGER_THRESHOLD to 8kb Bump patch number to 2020.11.3 Update ChangeLog runtime.cpp: ensure max_medium calc accounts for GEX_FLAG_AM_PREPARE_LEAST_ALLOC rma_put_then_am_master_protocol(): Fix gex_AM_MaxRequestMedium arg count UPCXX_RPC_EAGER_THRESHOLD tweaks Update spelling for GASNET_NATIVE_NP_ALLOC_REQ_MEDIUM bench/misc_perf: expand rpc tests to cover the internal code paths Statically optimize am_send_buffer for NPAM Repair am_send_buffer tiny buffer optimization Add NPAM support for Eager RPC injection Add post-configure check for conduit native NPAM Add UPCXX_RPC_EAGER_THRESHOLD_LOCAL to control local threshold Upgrade prepare_am with local threshold and NPAM prereqs issue #261: tweakgasnet::rma_put_then_am_master
runtime.hpp: Remove dead code that is no longer maintained RPC: Hoist TM/rank translation overheads rpc.hpp: "Invert" tm/no-tm overloads of rpc/rpc_ff bench/misc_perf: Add test of rput-then-rpc Change UPCXX_RPC_EAGER_THRESHOLD default to the maximum value issue#164: Add UPCXX_RPC_EAGER_THRESHOLD knob Add an RPC microbenchmark Update ChangeLog move remaining upcxx code in to separate header file Move UPC++ dl_malloc defines into separate header file Complete dlmalloc function list Name shift UPC++'s dlmalloc symbols Fix dev-check when configured using --without-cuda tests.mak: Add bench directory to dev-check Add issue428 test Update ChangeLog issue#428: Regression inrpc(team,rank,..,view)
overload resolution leads to confusing type errors ChangeLog: spell-check issue#420: Suppress harmless GASNet debug-mode configure warning GitLab CI / CI scripts: de-version macOS bits Remove the deprecated top-level install script GitLab CI: Resume testing of Theta PrgEnv-cray Resolve issue#425copy-cover: Fix harmless sign-compare warnings from clang 11 cuda.hpp: Fix a harmless warning from clang 11 -Wall Revert "Set GASNet 2020.11.0 tarball URL" Bump version to 2020.11.1 for mk-develop Bump package version for 2020.11.0 release Set GASNet 2020.11.0 tarball URL Merge 2020.3.2 section into ChangeLog Add PDFs for 2020.11.0 spec and 2020.10.0 guide INSTALL.md: Expand instructions for GDR memory kinds INSTALL.md: Prohibit PGI compilers for memory_kinds branch Add prototype branch notices to major documents Update ChangeLog syncx3 updated examples from guide re-sync updated examples from guide sync updated examples from guide Change configuration of private segment tests Add support for using private segment for host array Fix order of arguments being passed Add host to host transfers Use promises rather than futures in flood benchmark Local GPU and remote host tests Prohibit PGI on the memory_kinds branch Makefile: Fix Git version stamp detection copy-cover: Exercise remote-only completion events copy-cover: Add source completion validation copy-cover: sweep buf sizes fixupx2: Add new copy-cover test fixup: Implement UPCXX_BUG4148_WORKAROUND fixup: Add new copy-cover test Implement UPCXX_BUG4148_WORKAROUND and RC-only opt for puts copy.hpp: factor initiator variable upcxx::copy: Tune use_mk for no RC fixup: issue#221: Fix upcxx::copy() mishandling of private memory arguments GitLab CI: PGI testers opt-in only (issue#421) Upgrade global_ptr::check() to bounds-check device pointers Change device_allocator to throw upcxx::bad_segment_alloc device_allocator construct: prohibit empty active segments Add bad-segment-alloc test to validate device_allocator exn behavior device_allocator: fix undocumented device memory allocation Fix delete of array in view-accumulate.cpp to delete[] example/prog-guide: apply updates from guide repo cuda_microbenchmark: correct windowing logic Less specific GitLab tags for Xcode runners. correct an assert typo appearing in multiple functions Increase max_heaps in the reference implementation to 33 Avoid issue 423 for upcxx::copy() Deploy GPUDirect transfers inupcxx::copy
Construct GEX EP/MK/Segment objects Implement heap_idx recycling Add negative compilation tests copy: assert additional preconditions test/regression/issue405: fix a violated copy() precondition upcxx::copy()/detail::rma_copy_local() use memcpy copy-cover: exploit fix to issue421 Add new copy-cover test issue#221: Fix upcxx::copy() mishandling of private memory arguments Change default --with-gasnet URL to memory_kinds nightly snapshot Split issue 421 test into two, one with lambdas and another with an asymmetric fuction object. Fix issue#421. Add issue421 Ban PGI 20.7+ due to known bugs fixup: Bump required GEX spec version to 0.10 and remove cruft test/copy: update for new collective requirement on device_allocator construction Add collective device_allocator creation Add a high-level coverage test for memory kinds fix an unreported bug in cuda_device move constructor fix unreported bug in device_allocator template inheritance Improve heap_state safety Probe for GASNET_MAXEPS Rework internal memory kinds objects global_ptr: Rename device_ field to heap_idx_
→ <<cset 240d2d069d40>>
-
reporter - removed version
Removing version: mk-develop branch (automated comment)
-
reporter - changed status to open
This was accidentally/automatically resolved - this problem remains, breaking the PGI optimizer
-
reporter -
assigned issue to
- changed version to Development Branch
I have a lead on a potential workaround, so I'm taking ownership (at least temporarily).
-
assigned issue to
-
reporter Possible solution in pull request #321
-
reporter - changed status to resolved
Fix issue
#421: PGI optimizer problem withupcxx::copy()
This commit converts a lambda capture of a remote completion on the
upcxx::copy()
loopback path into explicit passing through a heap location (as done along other paths).This change is observed to resolve the stack corruption issues seen on dirac/x86_64/linux/{smp,udp} with PGI 20.4/opt.
→ <<cset ee6311365ce8>>
-
reporter Merge pull request #321 into develop
- issue421-pgi:
Update ChangeLog
Avoid other possible instances of issue
#421Fix issue#421: PGI optimizer problem withupcxx::copy()
Add issue421c: upgraded version of issue421 Partially revert "bld/tests.mak: Ban upcxx::copy tests on PGI, avoiding issue 421" Revert "issue 421: Disable use of upcxx::copy with PGI"
→ <<cset 9735a6789df2>>
- issue421-pgi:
Update ChangeLog
Avoid other possible instances of issue
- Log in to comment
This compilation failure was introduced by PR #268.
However, if my understanding is correct, the root issue is that remote-completion handling in
copy()
is broken in several ways, making assumptions that are invalid:upcxx::rank_me() != rank_d && upcxx::rank_me() != rank_s
), it capturescxs_remote
by value in a lambda, then sends that lambda out over the wire. This only works for TriviallySerializable remote completions.upcxx::rank_me() == rank_d
), it directly invokes the bound function inside ofcxs_remote
. As of PR #268, this does not link, but more fundamentally, it violates the semantics of RPC completion which require the arguments to be serialized an deserialized.upcxx::rank_me() == rank_s && upcxx::rank_me() != rank_d
), it makes copies (by lambda capture) of the deserialized remote completion. (Currently, the type iscxs_remote_t
, but it should bedeserialized_type_t<cxs_remote_t>
.) This is not valid in general, as there is no requirement that the deserialized arguments toas_rpc()
by copyable.Here’s what I think is needed to fix these cases:
upcxx::bind()
to bindcxs_remote
, as is currently done in the fourth case.serialization_traits<cxs_remote_t>::deserialized_value([cxs_remote,*cxs_remote_heaped])
.cxs_remote
and capture the pointer instead.Here’s the diff: