UPCXX_SERIALIZED_VALUES misoptimized by GCC{7,8,9} with -O2+

Issue #400 resolved
Amir Kamil created an issue

Minimal test case:

#include <upcxx/upcxx.hpp>
#include <iostream>

struct A {
  int x;
  A(int v) : x(v) {}
  UPCXX_SERIALIZED_VALUES(x);
};

int main() {
  upcxx::init();

  upcxx::rpc(0,[](A const &a) {
                 assert(a.x == 10); 
               }, A(10)).wait();

  upcxx::finalize();
  return 0;
}

This works correctly on MacOS/Clang with either -g or -O and MacOS/GCC9 with -g, but the assertion fails on the latter with -O. At this point, we do not know why this is failing.

Official response

  • Dan Bonachea

    More verbose test case:

    #include <upcxx/upcxx.hpp>
    #include <iostream>
    
    struct A {
      int x;
      A(int v) : x(v) {}
      A(A &&other) : x(other.x){} // move cons
      #if COPYABLE
        A(const A &) = default;
      #else
        A(const A &) = delete; // non-copyable
      #endif
      UPCXX_SERIALIZED_VALUES(x);
    };
    
    using namespace upcxx;
    
    int main() {
      upcxx::init();
    
      rpc(0,[](A const &a) {
            int val = a.x;
            std::cout << val << std::endl;
            assert(val == 10);
            }, A(10)).wait();
    
      upcxx::barrier();
      if (!upcxx::rank_me()) { std::cout << "SUCCESS" << std::endl; }
    
      upcxx::finalize();
      return 0;
    }
    

    I've now replicated the problem on dirac Linux/x86_64/opt using develop @ acc5936 with all of:

    • gcc 7.1.0
    • gcc 7.4.0
    • gcc 8.2.0
    • gcc 9.1.0
    • gcc 9.3.0

    The following gcc versions do NOT appear to be affected, at least for the spec-issue144b test case in pull request #251 and the reduced test case:

    • gcc 6.4.0
    • gcc 10.1.0
    • gcc 10.2.0

    The problem is not specific to develop - I've also replicated the problem using the following releases on dirac:

    • UPC++ 2020.3.0 + gcc 9.3.0
    • UPC++ 2020.3.2 + gcc 8.4.0

    Observations/Conclusions:

    • problem is not specific to macOS, also affects Linux
    • problem affects a relatively wide swath of gcc versions: 7.x,8.x,9.x
    • problem affects UPC++ 2020.3.0 .. develop
    • All of the cases listed above pass at -O1 and fail at -O2 or -O3.
    • All results are unchanged whether the class is copyable or not.
    • All results are unchanged if the argument object is declared on the stack and moved in.
    • All results are unchanged if the argument object is declared on the stack, made copyable and passed in by lvalue.
    • So far the only user-side workaround that appears 100% effective is to avoid UPCXX_SERIALIZED_VALUES() entirely, eg use UPCXX_SERIALIZED_FIELDS (which additionally requires make the class DefaultConstructible)

    Upgrading to priority to release blocker.

Comments (12)

  1. Amir Kamil reporter

    I don’t claim to understand what GCC is doing, but I have found that inserting a SW barrier between deserialization of the values and construction of the resulting object fixes the issue:

    diff --git a/src/serialization.hpp b/src/serialization.hpp
    index ec06d5a0..5f5be947 100644
    --- a/src/serialization.hpp
    +++ b/src/serialization.hpp
    @@ -1087,6 +1087,7 @@ namespace upcxx {
           template<typename Obj, typename Reader, typename ...Ptrs>
           static Obj* deserialize(Reader &r, void *spot, Ptrs ...ptrs) {
             //return ::new(spot) Obj(static_cast<typename std::remove_pointer<Ptrs>::type&&>(*ptrs)...);
    +        asm volatile("": : :"memory");
             return Obj::upcxx_serialization::template construct<Obj>(spot, static_cast<typename std::remove_pointer<Ptrs>::type&&>(*ptrs)...);
           }
    

  2. Dan Bonachea

    Portably deploying a memory barrier in a public header is non-trivial (unless we resort to something like std::atomic_thread_fence()). We have portable memory barriers in gasnet_tools.h, but that's not available in this header.

    It's worth trying if an extern do-nothing function call (eg into libupcxx) inserted at the same location can achieve the same effect. It's also easy to deploy a memory barrier in the library *.cpp files inside the do-nothing function if that's required.

    Ideally any workaround can be made conditional on affected compiler versions, but a single static function call is cheap enough that we could just conservatively pay the cost more places than strictly required.

  3. Dan Bonachea

    I've reproduced this on el-capitan with gcc 9.3.0 when using -O2 or higher (-O1 is correct).

    So far, the following workarounds seem effective:

    "Big hammer":

    --- a/src/utility.hpp
    +++ b/src/utility.hpp
    @@ -106,7 +106,7 @@ namespace detail {
       // constructed as such.
       template<typename T>
       T* launder_unconstructed(T *p) noexcept {
    -    asm("" : "+rm"(p) : "rm"(p) :);
    +    asm volatile ("" : "+rm"(p) : "rm"(p) : "memory");
         return p;
       }
    

    "Surgical":

    --- a/src/utility.hpp
    +++ b/src/utility.hpp
    @@ -106,7 +106,8 @@ namespace detail {
       // constructed as such.
       template<typename T>
       T* launder_unconstructed(T *p) noexcept {
    -    asm("" : "+rm"(p) : "rm"(p) :);
    +    volatile char *q = (volatile char *)p;
    +    asm volatile ("" : "+rm"(*q) : "rm"(q) : );
         return p;
       }
    

    Basically, I don't think our launder_unconstructed as written is giving the optimizer the signal we expect, because it only asserts that the pointer value p may have changed and nothing about the memory *p where the object lives that we are trying to launder.

  4. Dan Bonachea

    Another effective workaround:

    --- a/src/utility.hpp
    +++ b/src/utility.hpp
    @@ -155,11 +155,11 @@ namespace detail {
         template<typename T>
         T* construct_trivial(void *dest, const void *src, std::true_type deft_ctor, std::true_type triv_copy) {
           using T1 = typename std::remove_const<T>::type;
           T1 *ans = reinterpret_cast<T1*>(::new(dest) T1);
           detail::template memcpy_aligned<alignof(T1)>(ans, src, sizeof(T1));
    -      return ans;
    +      return detail::launder_unconstructed<T1>(ans);
         }
         template<typename T>
         T* construct_trivial(void *dest, const void *src, std::true_type deft_ctor, std::false_type triv_copy) {
           using T1 = typename std::remove_const<T>::type;
           ::new(dest) T1;
    
  5. Dan Bonachea

    Another effective workaround:

    --- a/src/utility.hpp
    +++ b/src/utility.hpp
    @@ -156,7 +156,7 @@ namespace detail {
         T* construct_trivial(void *dest, const void *src, std::true_type deft_ctor, std::true_type triv_copy) {
           using T1 = typename std::remove_const<T>::type;
           T1 *ans = reinterpret_cast<T1*>(::new(dest) T1);
    -      detail::template memcpy_aligned<alignof(T1)>(ans, src, sizeof(T1));
    +      std::memcpy(ans, src, sizeof(T1));
           return ans;
         }
         template<typename T>
    

    We seem to be tripping something very brittle in the analysis.

  6. Amir Kamil reporter

    Here’s the extern do-nothing workaround:

    diff --git a/src/serialization.cpp b/src/serialization.cpp
    index a85a5121..f5d9d48b 100644
    --- a/src/serialization.cpp
    +++ b/src/serialization.cpp
    @@ -10,6 +10,12 @@ constexpr std::uintptr_t hunk_size_large = 8192;
    
     constexpr std::uintptr_t align_max = serialization_align_max; // shorthand
    
    +namespace upcxx {
    +  namespace detail {
    +    void opaque() {} // used for inserting opaque function calls into code
    +  }
    +}
    +
     void upcxx::detail::serialization_writer<false>::grow(std::size_t size0, std::size_t size1) {
       static_assert(2*align_max <= hunk_size_small, "Small hunk size (hunk_size_small) not big enough");
    
    diff --git a/src/serialization.hpp b/src/serialization.hpp
    index ec06d5a0..5ce39aad 100644
    --- a/src/serialization.hpp
    +++ b/src/serialization.hpp
    @@ -1072,6 +1072,8 @@ namespace upcxx {
           }
         };
    
    +    void opaque(); // used for inserting opaque function calls into code
    +
         template<typename TupRefs, int n>
         struct serialization_values_each<TupRefs, n, n> {
           template<typename Prefix>
    @@ -1087,6 +1089,7 @@ namespace upcxx {
           template<typename Obj, typename Reader, typename ...Ptrs>
           static Obj* deserialize(Reader &r, void *spot, Ptrs ...ptrs) {
             //return ::new(spot) Obj(static_cast<typename std::remove_pointer<Ptrs>::type&&>(*ptrs)...);
    +        opaque();
             return Obj::upcxx_serialization::template construct<Obj>(spot, static_cast<typename std::remove_pointer<Ptrs>::type&&>(*ptrs)...);
           }
    

  7. Dan Bonachea

    More verbose test case:

    #include <upcxx/upcxx.hpp>
    #include <iostream>
    
    struct A {
      int x;
      A(int v) : x(v) {}
      A(A &&other) : x(other.x){} // move cons
      #if COPYABLE
        A(const A &) = default;
      #else
        A(const A &) = delete; // non-copyable
      #endif
      UPCXX_SERIALIZED_VALUES(x);
    };
    
    using namespace upcxx;
    
    int main() {
      upcxx::init();
    
      rpc(0,[](A const &a) {
            int val = a.x;
            std::cout << val << std::endl;
            assert(val == 10);
            }, A(10)).wait();
    
      upcxx::barrier();
      if (!upcxx::rank_me()) { std::cout << "SUCCESS" << std::endl; }
    
      upcxx::finalize();
      return 0;
    }
    

    I've now replicated the problem on dirac Linux/x86_64/opt using develop @ acc5936 with all of:

    • gcc 7.1.0
    • gcc 7.4.0
    • gcc 8.2.0
    • gcc 9.1.0
    • gcc 9.3.0

    The following gcc versions do NOT appear to be affected, at least for the spec-issue144b test case in pull request #251 and the reduced test case:

    • gcc 6.4.0
    • gcc 10.1.0
    • gcc 10.2.0

    The problem is not specific to develop - I've also replicated the problem using the following releases on dirac:

    • UPC++ 2020.3.0 + gcc 9.3.0
    • UPC++ 2020.3.2 + gcc 8.4.0

    Observations/Conclusions:

    • problem is not specific to macOS, also affects Linux
    • problem affects a relatively wide swath of gcc versions: 7.x,8.x,9.x
    • problem affects UPC++ 2020.3.0 .. develop
    • All of the cases listed above pass at -O1 and fail at -O2 or -O3.
    • All results are unchanged whether the class is copyable or not.
    • All results are unchanged if the argument object is declared on the stack and moved in.
    • All results are unchanged if the argument object is declared on the stack, made copyable and passed in by lvalue.
    • So far the only user-side workaround that appears 100% effective is to avoid UPCXX_SERIALIZED_VALUES() entirely, eg use UPCXX_SERIALIZED_FIELDS (which additionally requires make the class DefaultConstructible)

    Upgrading to priority to release blocker.

  8. Paul Hargrove

    I have not confirmed, but based on gcc versions I suspect PrgEnv-gnu may be impacted on all three of the XC systems I have access to, as well as some of our installs on Summit.

    • Cori
      • gcc/8.3.0 is default
      • gcc/9.3.0 is newest
    • Theta
      • gcc/9.3.0 is both default and newest
    • Piz-Daint
      • gcc/8.3.0 is both newest and default

    In the case of Summit, the default gcc is 6.4.0 which is unaffected according to Dan's testing on Dirac. However, we also have installs for gcc 7,8,9 versions, built at the request of the ExaBiome team who needs 8 or 9 due to their use of C++14 or 17 features.

  9. Log in to comment