UPCXX_SERIALIZED_VALUES misoptimized by GCC{7,8,9} with -O2+
Minimal test case:
#include <upcxx/upcxx.hpp>
#include <iostream>
struct A {
int x;
A(int v) : x(v) {}
UPCXX_SERIALIZED_VALUES(x);
};
int main() {
upcxx::init();
upcxx::rpc(0,[](A const &a) {
assert(a.x == 10);
}, A(10)).wait();
upcxx::finalize();
return 0;
}
This works correctly on MacOS/Clang with either -g or -O and MacOS/GCC9 with -g, but the assertion fails on the latter with -O. At this point, we do not know why this is failing.
Official response
Comments (12)
-
-
reporter I don’t claim to understand what GCC is doing, but I have found that inserting a SW barrier between deserialization of the values and construction of the resulting object fixes the issue:
diff --git a/src/serialization.hpp b/src/serialization.hpp index ec06d5a0..5f5be947 100644 --- a/src/serialization.hpp +++ b/src/serialization.hpp @@ -1087,6 +1087,7 @@ namespace upcxx { template<typename Obj, typename Reader, typename ...Ptrs> static Obj* deserialize(Reader &r, void *spot, Ptrs ...ptrs) { //return ::new(spot) Obj(static_cast<typename std::remove_pointer<Ptrs>::type&&>(*ptrs)...); + asm volatile("": : :"memory"); return Obj::upcxx_serialization::template construct<Obj>(spot, static_cast<typename std::remove_pointer<Ptrs>::type&&>(*ptrs)...); }
-
Portably deploying a memory barrier in a public header is non-trivial (unless we resort to something like
std::atomic_thread_fence()
). We have portable memory barriers in gasnet_tools.h, but that's not available in this header.It's worth trying if an extern do-nothing function call (eg into libupcxx) inserted at the same location can achieve the same effect. It's also easy to deploy a memory barrier in the library *.cpp files inside the do-nothing function if that's required.
Ideally any workaround can be made conditional on affected compiler versions, but a single static function call is cheap enough that we could just conservatively pay the cost more places than strictly required.
-
reporter A do-nothing function call does appear to fix the issue. PR forthcoming.
-
I've reproduced this on el-capitan with gcc 9.3.0 when using
-O2
or higher (-O1 is correct).So far, the following workarounds seem effective:
"Big hammer":
--- a/src/utility.hpp +++ b/src/utility.hpp @@ -106,7 +106,7 @@ namespace detail { // constructed as such. template<typename T> T* launder_unconstructed(T *p) noexcept { - asm("" : "+rm"(p) : "rm"(p) :); + asm volatile ("" : "+rm"(p) : "rm"(p) : "memory"); return p; }
"Surgical":
--- a/src/utility.hpp +++ b/src/utility.hpp @@ -106,7 +106,8 @@ namespace detail { // constructed as such. template<typename T> T* launder_unconstructed(T *p) noexcept { - asm("" : "+rm"(p) : "rm"(p) :); + volatile char *q = (volatile char *)p; + asm volatile ("" : "+rm"(*q) : "rm"(q) : ); return p; }
Basically, I don't think our
launder_unconstructed
as written is giving the optimizer the signal we expect, because it only asserts that the pointer valuep
may have changed and nothing about the memory*p
where the object lives that we are trying to launder. -
Another effective workaround:
--- a/src/utility.hpp +++ b/src/utility.hpp @@ -155,11 +155,11 @@ namespace detail { template<typename T> T* construct_trivial(void *dest, const void *src, std::true_type deft_ctor, std::true_type triv_copy) { using T1 = typename std::remove_const<T>::type; T1 *ans = reinterpret_cast<T1*>(::new(dest) T1); detail::template memcpy_aligned<alignof(T1)>(ans, src, sizeof(T1)); - return ans; + return detail::launder_unconstructed<T1>(ans); } template<typename T> T* construct_trivial(void *dest, const void *src, std::true_type deft_ctor, std::false_type triv_copy) { using T1 = typename std::remove_const<T>::type; ::new(dest) T1;
-
Another effective workaround:
--- a/src/utility.hpp +++ b/src/utility.hpp @@ -156,7 +156,7 @@ namespace detail { T* construct_trivial(void *dest, const void *src, std::true_type deft_ctor, std::true_type triv_copy) { using T1 = typename std::remove_const<T>::type; T1 *ans = reinterpret_cast<T1*>(::new(dest) T1); - detail::template memcpy_aligned<alignof(T1)>(ans, src, sizeof(T1)); + std::memcpy(ans, src, sizeof(T1)); return ans; } template<typename T>
We seem to be tripping something very brittle in the analysis.
-
reporter Here’s the extern do-nothing workaround:
diff --git a/src/serialization.cpp b/src/serialization.cpp index a85a5121..f5d9d48b 100644 --- a/src/serialization.cpp +++ b/src/serialization.cpp @@ -10,6 +10,12 @@ constexpr std::uintptr_t hunk_size_large = 8192; constexpr std::uintptr_t align_max = serialization_align_max; // shorthand +namespace upcxx { + namespace detail { + void opaque() {} // used for inserting opaque function calls into code + } +} + void upcxx::detail::serialization_writer<false>::grow(std::size_t size0, std::size_t size1) { static_assert(2*align_max <= hunk_size_small, "Small hunk size (hunk_size_small) not big enough"); diff --git a/src/serialization.hpp b/src/serialization.hpp index ec06d5a0..5ce39aad 100644 --- a/src/serialization.hpp +++ b/src/serialization.hpp @@ -1072,6 +1072,8 @@ namespace upcxx { } }; + void opaque(); // used for inserting opaque function calls into code + template<typename TupRefs, int n> struct serialization_values_each<TupRefs, n, n> { template<typename Prefix> @@ -1087,6 +1089,7 @@ namespace upcxx { template<typename Obj, typename Reader, typename ...Ptrs> static Obj* deserialize(Reader &r, void *spot, Ptrs ...ptrs) { //return ::new(spot) Obj(static_cast<typename std::remove_pointer<Ptrs>::type&&>(*ptrs)...); + opaque(); return Obj::upcxx_serialization::template construct<Obj>(spot, static_cast<typename std::remove_pointer<Ptrs>::type&&>(*ptrs)...); }
-
- changed title to UPCXX_SERIALIZED_VALUES misoptimized by GCC{7,8,9} with -O2+
- marked as blocker
More verbose test case:
#include <upcxx/upcxx.hpp> #include <iostream> struct A { int x; A(int v) : x(v) {} A(A &&other) : x(other.x){} // move cons #if COPYABLE A(const A &) = default; #else A(const A &) = delete; // non-copyable #endif UPCXX_SERIALIZED_VALUES(x); }; using namespace upcxx; int main() { upcxx::init(); rpc(0,[](A const &a) { int val = a.x; std::cout << val << std::endl; assert(val == 10); }, A(10)).wait(); upcxx::barrier(); if (!upcxx::rank_me()) { std::cout << "SUCCESS" << std::endl; } upcxx::finalize(); return 0; }
I've now replicated the problem on dirac Linux/x86_64/opt using develop @ acc5936 with all of:
- gcc 7.1.0
- gcc 7.4.0
- gcc 8.2.0
- gcc 9.1.0
- gcc 9.3.0
The following gcc versions do NOT appear to be affected, at least for the
spec-issue144b
test case in pull request #251 and the reduced test case:- gcc 6.4.0
- gcc 10.1.0
- gcc 10.2.0
The problem is not specific to develop - I've also replicated the problem using the following releases on dirac:
- UPC++ 2020.3.0 + gcc 9.3.0
- UPC++ 2020.3.2 + gcc 8.4.0
Observations/Conclusions:
- problem is not specific to macOS, also affects Linux
- problem affects a relatively wide swath of gcc versions: 7.x,8.x,9.x
- problem affects UPC++ 2020.3.0 .. develop
- All of the cases listed above pass at
-O1
and fail at-O2
or-O3
. - All results are unchanged whether the class is copyable or not.
- All results are unchanged if the argument object is declared on the stack and moved in.
- All results are unchanged if the argument object is declared on the stack, made copyable and passed in by lvalue.
- So far the only user-side workaround that appears 100% effective is to avoid
UPCXX_SERIALIZED_VALUES()
entirely, eg useUPCXX_SERIALIZED_FIELDS
(which additionally requires make the class DefaultConstructible)
Upgrading to priority to release blocker.
-
I have not confirmed, but based on gcc versions I suspect PrgEnv-gnu may be impacted on all three of the XC systems I have access to, as well as some of our installs on Summit.
- Cori
- gcc/8.3.0 is default
- gcc/9.3.0 is newest
- Theta
- gcc/9.3.0 is both default and newest
- Piz-Daint
- gcc/8.3.0 is both newest and default
In the case of Summit, the default gcc is 6.4.0 which is unaffected according to Dan's testing on Dirac. However, we also have installs for gcc 7,8,9 versions, built at the request of the ExaBiome team who needs 8 or 9 due to their use of C++14 or 17 features.
- Cori
-
Proposed solution now in pull request #255
-
- changed status to resolved
Workaround issue
#400Insert a theoretically unnecessary call to launder along the path that deserializes a TriviallyCopyable type via memcpy.
Resolves issue
#400→ <<cset ceac43d7d4a6>>
- Log in to comment
More verbose test case:
I've now replicated the problem on dirac Linux/x86_64/opt using develop @ acc5936 with all of:
The following gcc versions do NOT appear to be affected, at least for the
spec-issue144b
test case in pull request #251 and the reduced test case:The problem is not specific to develop - I've also replicated the problem using the following releases on dirac:
Observations/Conclusions:
-O1
and fail at-O2
or-O3
.UPCXX_SERIALIZED_VALUES()
entirely, eg useUPCXX_SERIALIZED_FIELDS
(which additionally requires make the class DefaultConstructible)Upgrading to priority to release blocker.