Deploy Atomic Domains

Issue #20 resolved
Former user created an issue

We had some good discussion yesterday. While the template tricks discussed are certainly cool, it is my opinion that their added value may not be enough to justify the departure from idiomatic C++ practices. It does not seem likely for a user to make a mistake in specifying the set of atomic ops capable of contending for the same memory. Algorithms involving atomics require very careful thought. The state space of what each concurrent agent could be doing is monstrous. For their own sanity, a programmer would be wise to keep atomic algorithms textually localized. At least that's how I approach racy code. Would anyone disagree that this is the likely case for our users? If we can assume the atomic-op-set will not be wrong, then I think the goofiness of domains-as-types only clutters things up.

This brings us back to specifying the atomic-op-set at runtime. The UPC API makes domain construction collective. @bonachea said that restriction could be lifted. But I think he later contradicted that statement when saying that there may be some state associated with the domain (the mutex for AM-based AMO's lacking a CPU equivalent instruction). If there is such state, then domain construction would have to be collective (or at least named) so that when AM-based atomics land, they have the information necessary to find the state (again, the proper mutex).

My goal here is to plead with gasnet that they not adopt the collective requirement for domain construction since it limits how the upper layer can present the API to the user (for instance, it takes on-the-fly memoization off the table). So long as we agree that a non-collective API has an efficient implementation everywhere there shouldn't be any resistance. Two such implementations could be:

  1. Let the user name the domain, then gasnet would use that name in a local hashtable or something to get the mutex (first occurrence allocates the mutex).

  2. Just statically allocate a bunch of mutexes and hash the memory address of the AMO to determine the mutex. One mutex would work, more would only reduce contention.

Number 2 looks like the winner to me: no names for the user and no per-domain state management for gasnet, how nice! (@bonachea @PHHargrove, I know that both of you are well aware of these strategies. I'm just trying to produce a self-contained post.)
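
A minimal sketch of strategy 2, assuming a fixed-size pool and a simple shift-and-modulo hash. The names `kNumStripes`, `mutex_for` and `emulated_fetch_mul` are made up for illustration and are not gasnet API; multiplication stands in for any op lacking a CPU instruction.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>

// Hypothetical pool size. One mutex would be correct; more only reduce contention.
constexpr std::size_t kNumStripes = 64;

inline std::array<std::mutex, kNumStripes> &stripe_pool() {
  static std::array<std::mutex, kNumStripes> pool;  // statically allocated, no per-domain state
  return pool;
}

// Map the target address of an AMO to one mutex in the pool.
inline std::mutex &mutex_for(const void *addr) {
  auto bits = reinterpret_cast<std::uintptr_t>(addr);
  // Drop the low bits so neighbouring 8-byte words spread across stripes.
  return stripe_pool()[(bits >> 3) % kNumStripes];
}

// What an AM handler on the target rank might run for an op the CPU cannot do natively.
inline std::int64_t emulated_fetch_mul(std::int64_t *target, std::int64_t operand) {
  std::lock_guard<std::mutex> guard(mutex_for(target));
  std::int64_t old = *target;
  *target = old * operand;
  return old;
}

inline void example_usage() {
  std::int64_t word = 3;
  assert(emulated_fetch_mul(&word, 5) == 3);  // returns the previous value
  assert(word == 15);
}
```

Because the mutex is derived from the address alone, two ranks never need to agree on a domain name to find it.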

Now that we're hopefully all on board with non-collective and non-named domain construction, I would like to ask for a little bit more, again in the name of flexibility for the layers above. I would like a guarantee that any two domains over the same team and op-set are concurrently compatible. I want this to work:

team = ...;
opset = FETCH_ADD | FETCH_XOR;

// construct equivalent domains (possible that id's compare unequal)
d1 = gasnet_atomic_domain(team, opset);
d2 = gasnet_atomic_domain(team, opset);

gp = /*some gasnet analog to global pointer*/

// these non-blocking atomics race to access gp along different but compatible domains
handle_t op1 = gasnet_fetch_add_nb(d1, gp);
handle_t op2 = gasnet_fetch_xor_nb(d2, gp);

With non-collective and non-named domain construction semantics, I think you'll have to admit this is always legal, since there would be no way to equate the domains of AMO's originating from different ranks other than comparing the arguments passed to their construction. Please guarantee it!

Now, in my limited imagination, I can only conceive of four possible implementations for the ops of a domain:

  1. All CPU-native: for the case when the domain's team is densely connected via shared memory windows (node local) and all ops are cpu-supported.
  2. All NIC-native: when the team is not node local, but every op in the op-set is NIC supported.
  3. All AM-based over CPU-natives: when the team is not node local but all ops are cpu-supported.
  4. All AM-based with mutexes: otherwise.

Is there a conceivable architecture in which you would actually mix these within the same domain? As in fetch_add goes cpu-native, but fetch_xor goes nic-native? That's just silly, right? Great, then the only information actually needed by gasnet after the domain is constructed is a four-valued enum. Producing id's for domains would be unnecessary, just pass the internal enum back to the user. This way if they do domain1 == domain2, it will just work. You could even recapture ~most~ of the error catching without carrying the full construction argument list around. I'm not looking for a guarantee that domains reduce to an enum, I just want the guarantee that domains are a value-semantics thing. So, not an id pointing to internally tracked state. This way I don't have to worry about properly destroying domains or creating too many of them redundantly.
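
To make the four cases concrete, here is a rough sketch of how a domain could reduce to a plain enum value. Everything here is an assumption for illustration: `amo_impl`, `capabilities` and `choose_impl` are invented names, and a real runtime would get the capability masks from the conduit rather than from the caller.

```cpp
#include <cassert>
#include <cstdint>

// The four implementations listed above, as a byte-copyable value.
enum class amo_impl : std::uint8_t { cpu_native, nic_native, am_over_cpu, am_with_mutex };

using opset_t = std::uint32_t;  // one bit per atomic op

// Hypothetical summary of which ops the CPU and the NIC support.
struct capabilities {
  opset_t cpu_ops;
  opset_t nic_ops;
};

inline bool all_in(opset_t wanted, opset_t supported) {
  return (wanted & ~supported) == 0;
}

// Walks the four cases in the order given in the list.
inline amo_impl choose_impl(bool team_is_node_local, opset_t ops, capabilities caps) {
  const bool cpu_ok = all_in(ops, caps.cpu_ops);
  if (team_is_node_local && cpu_ok) return amo_impl::cpu_native;
  if (!team_is_node_local && all_in(ops, caps.nic_ops)) return amo_impl::nic_native;
  if (!team_is_node_local && cpu_ok) return amo_impl::am_over_cpu;
  return amo_impl::am_with_mutex;
}

inline void example_usage() {
  const opset_t FETCH_ADD = 0x1, FETCH_XOR = 0x2;
  const capabilities caps{FETCH_ADD | FETCH_XOR, FETCH_ADD};
  // Two redundant constructions compare equal because they are just values.
  amo_impl d1 = choose_impl(false, FETCH_ADD | FETCH_XOR, caps);
  amo_impl d2 = choose_impl(false, FETCH_ADD | FETCH_XOR, caps);
  assert(d1 == d2);
}
```

Nothing in this sketch needs destruction or an id lookup, which is the property being asked for; whether the mapping is cheap enough in practice depends on how the capability masks are obtained.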

Thus, I have concluded. My wishlist is such:

  1. Non-collective, non-named atomic domain construction.
  2. Redundantly constructed equivalent domains are compatible.
  3. Domains are values, not objects. The gasnet_atomicdomain_t is at most a byte-copyable struct. No destruction necessary. (this implies point 2).
  4. Domain construction and compatibility/equivalence testing is dirt-cheap, as in it does not query the nic-drivers for atomic capabilities. I want to know that on-the-fly construction will be negligible in the face of the fastest thing: CPU-native atomics. If the op-set is encoded in a bitmap, and querying a team for its being entirely node-local is dirt cheap, then I see no reason why this can't be true in the context of the implementation I have outlined.

These are the properties I would like to expose to upcxx users in our domain analog.

I eagerly await hearing about all the things I have overlooked (or gotten miserably wrong) due to cognitive error or ignorance.

Official response

  • Dan Bonachea

    This issue was discussed in the 1/10/18 Pagoda meeting, and subsequently in the 1/11/18 GEX atomics meeting. We resolved Steve would look into spec and implementation for March, with extensions that may slip to September.

    Roughly the proposed update to the atomics interface looks like this:

    namespace atomic { 
      enum operation { 
            get, set, 
            add, fetch_add, // May shorten to “fadd” ?
            sub, fetch_sub,
            inc, fetch_inc,
            dec, fetch_dec,
            cas             // May lengthen to “cswap”
      };
    
      template <typename T>
      class domain { // MoveConstructible, Destructible 
         typedef T value_type;
      };  
    
      template <typename T>
      const domain<T> &implicit_domain(); 
        // returns ref to "catch-all" atomic domain 
        // created for type T at upcxx::init()
    }
    
    template <atomic::operation OP, typename T, 
                 typename Completions=decltype(operation_cx::as_future())>
    RType atomic_op(global_ptr <T> p, T val1, [T val2,] // val2 for cas
                           std::memory_order order,
                           Completions cxs=Completions{},
                           const atomic::domain<T> &domain = 
                                   atomic::implicit_domain<T>());
      // possibly also provide a convenience overload with the last two
      // defaulted arguments in opposite order
    
    // Examples:
    using namespace upcxx;
    future<int64_t> f1 = atomic_op<atomic::get>(gptr, std::memory_order_relaxed);
    future<> f2 = atomic_op<atomic::set>(gptr, val, std::memory_order_relaxed);
    future<> f3 = atomic_op<atomic::add>(gptr, arg, std::memory_order_relaxed);
    future<int64_t> f4 = atomic_op<atomic::fetch_inc>(gptr, std::memory_order_relaxed);
    

    The "beginner" user code above can basically ignore the existence of Atomic Domains, and might be the only supported way to use domains in the March release. The initial implementation of atomic::implicit_domain<T>() would return the atomic domain created collectively at startup for that type (one for each of T=(u)int{32,64}) that includes every possible atomic operation. Assuming we limit ourselves to the ops above, this restricted set of types+ops is chosen to ensure offload on both InfiniBand and aries for supported types with even this straightforward implementation (other networks vary).

    A more aggressive implementation of atomic::implicit_domain<T>() could use template meta-programming "tricks" to restrict each of those four init-created domains to include only the set of operations actually "uttered" by atomic_op<OP,T>() calls linked into the program. This allows us to expand the universe of operations without penalizing the application for a lack of offload support for operations it never statically calls (however it would still include operations that are never dynamically reached, and conflates all application data of type T into a single atomicity domain and NIC-level algorithm).
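
    One way such a "trick" could work is static self-registration: every instantiation of the op template drags in a static member whose initializer records the op in a global mask. The sketch below is only an illustration under that assumption; `op_registrar`, `uttered_ops` and `atomic_op_stub` are hypothetical names, and it relies on the common behavior that these initializers run before main().

```cpp
#include <cassert>
#include <cstdint>

enum class operation : unsigned { get, set, add, fetch_add, cas };

constexpr std::uint32_t op_bit(operation op) {
  return std::uint32_t{1} << static_cast<unsigned>(op);
}

// Function-local static avoids ordering problems between initializers.
inline std::uint32_t &uttered_ops() {
  static std::uint32_t mask = 0;
  return mask;
}

template <operation OP>
struct op_registrar {
  static const bool registered;
};

// Runs once per instantiated OP during static initialization.
template <operation OP>
const bool op_registrar<OP>::registered = ((uttered_ops() |= op_bit(OP)), true);

// Stand-in for atomic_op<OP,T>(): merely naming it instantiates the registrar.
template <operation OP>
void atomic_op_stub() {
  (void)&op_registrar<OP>::registered;  // odr-use forces the static member to exist
}

// Neither function needs to be called for its op to count as "uttered".
void rarely_taken_path() { atomic_op_stub<operation::cas>(); }
void hot_path() { atomic_op_stub<operation::fetch_add>(); }

inline bool was_uttered(operation op) { return (uttered_ops() & op_bit(op)) != 0; }

void example_check() {
  assert(was_uttered(operation::cas));   // recorded although never dynamically reached
  assert(!was_uttered(operation::add));  // never named anywhere in the program
}
```

    This also shows the limitation noted above: `cas` is recorded even if `rarely_taken_path()` never runs.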

    Eventually (possibly in the next release) we add the mechanism for user-constructed explicit Atomic Domains, eg:

    namespace atomic {
      template <typename T>
      class domain {
        // constructor:
        domain(std::vector<atomic::operation> ops, upcxx::team &t = world(), 
                    int flags=0 /* possible optional tuning knobs */) {}
        team &team();  // convenience accessor
      };
    }
    
    // create several domains for various different purposes
    
    atomic::domain<int64_t> cas_dom({atomic::get, atomic::set, atomic::cas }); 
        // this set of ops IS offloadable on aries
    
    atomic::domain<int64_t> mult_dom({atomic::get, atomic::set, atomic::cas, 
                                                         atomic::fetch_mult }); 
        // this set of ops is NOT offloadable on aries
    
    atomic::domain<int64_t> local_dom({atomic::get, atomic::set, atomic::cas }, 
                                                             local_team()); 
       // this domain uses only shared memory transport
    
    int64_t r1 = atomic_op<atomic::cas>(gptr1, oldval, newval, 
                           std::memory_order_relaxed, operation_cx::as_future(), 
                           cas_dom).wait();  // offloaded to aries
    
    int64_t r2 = atomic_op<atomic::cas>(gptr2, oldval, newval, 
                           std::memory_order_relaxed, operation_cx::as_future(), 
                           mult_dom).wait(); // not offloaded (AM-based)
    int64_t r3 = atomic_op<atomic::fetch_mult>(gptr2, val, 
                           std::memory_order_relaxed, operation_cx::as_future(), 
                           mult_dom).wait(); // not offloaded (AM-based)
    
    int64_t r4 = atomic_op<atomic::cas>(gptr3, oldval, newval, 
                           std::memory_order_relaxed, operation_cx::as_future(), 
                           local_dom).wait();  // uses CPU atomic instruction
    

    This allows atomic ops implemented with different (non-coherent) protocols over the same type (int64_t in this case) to be used in the same program (operating on distinct data). Note none of the other approaches discussed in earlier comments can provide aries offload for r1 above, because safely doing so in general requires the programmer's omniscient/global knowledge that the accessed data regions are disjoint (eg that gptr1 and gptr2 are not aliased on any rank executing this code). Similarly, the r4 access can safely bypass the aries adapter and use a (faster) CPU atomic instruction on shared memory, because the domain created over local_team() requires/enforces that gptr3.is_local(), and that any concurrent accesses to that data are also from members of local_team() using the same shared-memory protocol.

    Note on declaration details: If we wanted even stronger static checking we could optionally cram the operations into variadic template arguments of atomic::domain and check it matches at operation time (although the supported operations are an unordered set and we don't want to make it clumsy for users to declare them). Regardless, GASNet debug mode already checks the operation passed at runtime matches the provided domain.
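
    A small sketch of what the variadic-template variant could look like, assuming C++11. `static_domain`, `contains` and `check` are hypothetical names used only to show the compile-time membership test.

```cpp
#include <cassert>
#include <type_traits>

enum class operation { get, set, add, fetch_add, cas };

// contains<OP, Ops...>::value is true when OP appears in Ops...
template <operation OP, operation... Ops>
struct contains : std::false_type {};

template <operation OP, operation First, operation... Rest>
struct contains<OP, First, Rest...>
    : std::integral_constant<bool, OP == First || contains<OP, Rest...>::value> {};

// Hypothetical domain type that carries its op set in the type.
template <typename T, operation... Ops>
struct static_domain {
  using value_type = T;

  template <operation OP>
  static constexpr bool supports() {
    return contains<OP, Ops...>::value;
  }

  // Would be called at the top of atomic_op<OP>(): rejects a mismatch at compile time.
  template <operation OP>
  static void check() {
    static_assert(contains<OP, Ops...>::value,
                  "operation is not in this domain's op set");
  }
};

inline void example_usage() {
  using cas_dom = static_domain<long, operation::get, operation::set, operation::cas>;
  cas_dom::check<operation::cas>();  // compiles
  // cas_dom::check<operation::add>();  // would fail the static_assert
  assert(cas_dom::supports<operation::get>());
}
```

    Note that `static_domain<long, operation::get, operation::cas>` and `static_domain<long, operation::cas, operation::get>` are distinct types in this sketch, which is the clumsiness with unordered sets mentioned above.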

    CC: @PHHargrove @shofmeyr

Comments (36)

  1. BrianS

    This still leaves the problem of query. I think the querying happens before this point in the code, through a team-based query function.

    class team
    {
        public:
        // construction patterns
    .
    .
    .
        uint32_t hardware_enabled_mask(); // just riffing here. for this team, what atomics have hardware support?
        bool compatible_atomics(uint32_t mask); // can I make a domain with this combination of atomics over this team?
    };
    

    We would include in the list of atomics the case where there is a fast local atomic::read
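
    Still riffing, a sketch of how the two queries could relate, assuming "compatible" means every requested op has hardware support over the team. The bit values and the class name `team_sketch` are placeholders, and the hardware mask is passed in by hand here instead of being discovered.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bit assignments for the op mask.
enum : std::uint32_t {
  op_get = 0x1,
  op_set = 0x2,
  op_fetch_add = 0x4,
  op_cas = 0x8,
  op_fetch_xor = 0x10
};

class team_sketch {
 public:
  explicit team_sketch(std::uint32_t hw_mask) : hw_mask_(hw_mask) {}

  // For this team, which atomics have hardware support?
  std::uint32_t hardware_enabled_mask() const { return hw_mask_; }

  // True when every op in mask is hardware supported (assumed meaning of "compatible").
  bool compatible_atomics(std::uint32_t mask) const { return (mask & ~hw_mask_) == 0; }

 private:
  std::uint32_t hw_mask_;
};

inline void example_usage() {
  team_sketch t(op_get | op_set | op_fetch_add | op_cas);
  assert(t.compatible_atomics(op_get | op_cas));
  assert(!t.compatible_atomics(op_fetch_add | op_fetch_xor));
}
```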

  2. BrianS

    Would we want to catch the case where a user issues a get on a global_ptr that is participating in atomic operations? Have we settled on global_ptr being the holder for atomic operations?

  3. Former user Account Deleted reporter

    @bvstraalen, correct, I have not addressed how upcxx will let users query the perf of atomics. I'm hopeful for this:

    enum upcxx::atomic_opset {
      atomic_op_fetch_add = 0x1,
      atomic_op_fetch_xor = 0x2,
      atomic_op_exchange = 0x4,
      atomic_op_compare_exchange = 0x8,
      atomic_op_get = 0x10,
      atomic_op_put = 0x20,
      /*etc*/
    };
    
    template<class T> // type on which atomics operate
    struct upcxx::atomic_domain {
      atomic_domain(team&, atomic_opset); // fast construction
      bool attentiveness_required() const; // fast query
    };
    

    Doing a vanilla get(global_ptr) while atomics are concurrently happening would probably be ok assuming the domain was told to support that at construction. We'll see what @bonachea and @PHHargrove have to say.

    global_ptr should denote the address of the atomic. Like so:

    template<class T>
    T upcxx::atomic_fetch_add(atomic_domain<T> dom, global_ptr<T> addr, T x);
    
    // or this, it saves the user typing "upcxx::" and "atomic_"
    T atomic_domain<T>::fetch_add(global_ptr<T> addr, T x);
    
  4. Dan Bonachea

    A few comments:

    First, the reference implementation of UPC 1.3 atomics (written fully in UPC using UPC language-level locks) is here. It's a low-performance but complete/compliant implementation I wrote that demonstrates some of the ways mutexes could be used to implement the interface, in the absence of AM (we'd never do it this way inside GASNet, but some of the concepts are the same).

    The UPC spec defines domain creation as collective for several reasons:

    1. Any given application is only expected to need a small handful of domains (usually just one) so creating it with a one-liner at startup does not seem overly burdensome.
    2. It ensures that all participating threads have a globally-unique NAME for the domain in the form of a pointer, which can be referenced during AMOs in the cheapest possible way (ie a phaseless pointer-to-shared dereference into the symmetric heap). Even if creation involves collective communication in some implementation, this was seen as the "right" tradeoff because domain creation should be exceedingly rare (and hence the performance is irrelevant) whereas AMO accesses may be critical-path and hence need to be highly optimized (because in the extreme case the actual AMO part might be a single atomic CPU instruction).
    3. Domain object constructors provide a convenient place to do any resource allocation that might be required in a particular implementation. In the extreme this might even involve opening additional hardware resources on a NIC.
    4. Domain object constructors provide a convenient place for the implementation to perform any other expensive logic or setup required - eg querying the adapter for optype support, or walking a decision tree to select the right implementation based on the domain creation arguments. We want to ensure the critical-path access contains the absolute bare minimum code and none of this setup crap that can be factored out.
    5. Domain object constructors give a place for app users to hang hints to the implementation, that can be processed at initialization and not during critical-path access.
    6. Domain names give a way for the application to ask about the expected performance characteristics of the current implementation under the provided set of constraints.
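
    To illustrate points 3 and 4, here is a toy sketch in which the constructor walks the decision logic once and the access path is a single indirect call. It is not how GASNet is implemented; `domain_sketch`, `fetch_add_cpu` and `fetch_add_locked` are invented for the example, and the "slow" protocol is just a mutex-guarded update standing in for an AM-based path.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>

using fetch_add_fn = std::int64_t (*)(std::atomic<std::int64_t> *, std::int64_t);

// Fast protocol: a single CPU atomic instruction.
inline std::int64_t fetch_add_cpu(std::atomic<std::int64_t> *target, std::int64_t v) {
  return target->fetch_add(v, std::memory_order_relaxed);
}

// Stand-in for a slower protocol (e.g. AM plus mutex on the target).
inline std::int64_t fetch_add_locked(std::atomic<std::int64_t> *target, std::int64_t v) {
  static std::mutex m;
  std::lock_guard<std::mutex> guard(m);
  std::int64_t old = target->load(std::memory_order_relaxed);
  target->store(old + v, std::memory_order_relaxed);
  return old;
}

class domain_sketch {
 public:
  // All "expensive" selection logic lives here, outside the critical path.
  explicit domain_sketch(bool cpu_native_ok)
      : fetch_add_(cpu_native_ok ? &fetch_add_cpu : &fetch_add_locked) {}

  // Critical path: no capability queries, no decision tree.
  std::int64_t fetch_add(std::atomic<std::int64_t> *target, std::int64_t v) const {
    return fetch_add_(target, v);
  }

 private:
  fetch_add_fn fetch_add_;
};

inline void example_usage() {
  std::atomic<std::int64_t> counter{10};
  domain_sketch dom(true);  // constructed once in setup code
  assert(dom.fetch_add(&counter, 5) == 10);
  assert(counter.load() == 15);
}
```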

    Now when I said the collective creation was not fundamental, I meant that if we devise an alternate mechanism to address ALL of the above properties then we could consider it.

    In particular, I'm strongly against John's ideas of "non-named"/"non-object" domains, because I believe they force additional non-trivial overheads into the critical-path access operation, which I consider unacceptable.

    Great, then the only information actually needed by gasnet after the domain is constructed is a four-valued enum. Producing id's for domains would be unnecessary, just pass the internal enum back to the user.
    

    We may be able to do something along these lines, although we should discuss details further. In particular, I definitely won't guarantee that mapping the domain arguments to the result will be fast enough to shove into the critical path. We really want the application to factor domain construction out into setup code.

    I want to know that on-the-fly construction will be negligible in the face of the fastest thing: CPU-native atomics
    

    I think what you are asking for may be fundamentally unobtainable in some cases, and definitely overconstrains the implementations without sufficiently good motivation.

  5. BrianS

    It would seem the best practice is to separate an expensive operation, like domain creation, from operations that must be as fast as possible, like altering or reading the atomic state. A UPC++ design should reflect that. Is that manifest in the UPC design? Domains are alloc'd:

    rupc_all_atomicdomain_alloc(upc_type_t type, upc_op_t ops, upc_atomichint_t hints);
    

    rupc_all indicates collective, which should signal the user that this is an expensive operation and should be amortized across many invocations of rupc_atomic_strict and rupc_atomic_relaxed.

    There were use cases where strict and relaxed are not a property of the domain?

  6. Former user Account Deleted reporter

    @bvstraalen, you are right that Dan's design, by splitting domain construction from issuing the atomic ops, gives implementations a natural place to put expensive logic, if any is necessary. I was hoping to hear that no expensive logic would be necessary for modern conduits. I'm pretty sure a Cray GNI implementation wouldn't require this splitting, but my experience with RMA hardware stops there.

    I don't know what you mean by this: There were use cases where strict and relaxed are not a property of the domain?

    The thing I don't like about the split construction API is that it forces the user to thread the domain object through their callstack as an argument or stash it somewhere (like a global) for later use. Also, it's one more object lifetime the user has to set up and destroy. If I am forced to do these things when my gut tells me it's unnecessary for the hardware I'm on, my eye twitches a little. This is only a matter of my personal tastes, but it happens to contradict Dan's first point: I am saying this is too burdensome in that it becomes a needless code blemish on the hardware I care about.

    For the upcxx spec, I think the right approach for now is to present a very limited atomic semantics such that we don't need to expose domains at all. I'm thinking we just enumerate the atomic ops people are likely to want, and then say that they cannot race to the same memory location concurrently, eg fetch_add's only work with other fetch_add's (possibly implicitly allow for concurrent reads and writes). Then at upcxx init time, we just create one gasnet domain per operation internally and dispatch to it under the covers. This API could easily be extended at a later time to support mixing of atomic ops by exposing domains through the upcxx API.

  7. BrianS

    UPC has two atomic functions

    void rupc_atomic_strict(upc_atomicdomain_t *domain,
        void * restrict fetch_ptr, upc_op_t op,
        shared void * restrict target,
        const void * restrict operand1,
        const void * restrict operand2);
    #define upc_atomic_strict rupc_atomic_strict
    
    void rupc_atomic_relaxed(upc_atomicdomain_t *domain,
        void * restrict fetch_ptr, upc_op_t op,
        shared void * restrict target,
        const void * restrict operand1,
        const void * restrict operand2);
    #define upc_atomic_relaxed rupc_atomic_relaxed
    

    so strict and relaxed are distinguished at the atomic operation site, not the domain building call.

    I think the main differences here are domain building is collective and named in the UPC world. They are assumed to not be fast constructable. Building all possible domains in init seems non-scalable to me (domains are combinatoric) and I don't want them being invisible to the user. We also have to allow mixed op types (we need atomic read, atomic write, atomic fetch_add on the same location in most use cases).

    I don't know what you mean by nameable.

    What's wrong with giving users

    upc_atomicdomain_t *rupc_all_atomicdomain_alloc(upc_type_t type, upc_op_t ops, upc_atomichint_t hints);
    

    In a UPC++ code?

    If an atomic domain has been created once with the same parameters, then the cached domain is handed back to the user.... or does that lead to a reference-counting/garbage collection problem? We can make team a factory for atomic domains

    class atomic_domain_base;
    
    template <class T>
    class atomic_domain : public atomic_domain_base {};
    
    class team
    {
      public:
        template<class T>
        std::shared_ptr<atomic_domain<T> > atomic_domain(atomic_opset_t set);
      private:
        // Key is made from typeid(T), atomic_opset_t  
       std::unordered_map<std::pair<std::size_t, atomic_opset_t>, std::shared_ptr<atomic_domain_base> > m_atomic_domains;
    };
    

    Not quite the right implementation, but it lets us cache and retrieve atomic_domains. If the team goes out of scope and is destructed then we can verify the shared_ptrs are all use_count==1, but in a multithreaded model these lifetime issues are all tricky. Teams are likely to have long lifetimes, perhaps long enough to amortize usage but still be automatic variables on the stack, or live as long as a module using UPC++ is in scope...

    I'm trying to find a lifetime design that is not global but also not on-the-fly. If the user keeps a team around a long time then in their code they can pull the suitable atomic domain from the team in O(1) time with no communication after the first time. They can also call these build functions once outside their main loops and just discard the return value, thus creating a place users can put timing code to measure set up costs.

    If teams are scoped for the user, then just grabbing the current team is handled.

    As long as a remote atomic operation takes a little while, the local look up should not be bad.
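
    For what it's worth, a compilable variant of the team-as-factory idea might look like the sketch below. It departs from the snippet above in a few assumed ways: the member is renamed `get_atomic_domain` so it does not clash with the class template, the key uses `std::type_index` instead of a hash of T, and `std::map` replaces `std::unordered_map` because `std::pair` has no standard hash. No communication or thread safety is modeled.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <typeindex>
#include <typeinfo>
#include <utility>

using atomic_opset_t = std::uint32_t;

struct atomic_domain_base {
  virtual ~atomic_domain_base() = default;
};

template <class T>
struct atomic_domain : atomic_domain_base {
  explicit atomic_domain(atomic_opset_t ops) : opset(ops) {}
  atomic_opset_t opset;
};

class team {
 public:
  // First call for a given (T, opset) builds the domain; later calls hit the cache.
  template <class T>
  std::shared_ptr<atomic_domain<T>> get_atomic_domain(atomic_opset_t ops) {
    auto key = std::make_pair(std::type_index(typeid(T)), ops);
    auto it = domains_.find(key);
    if (it == domains_.end()) {
      it = domains_.emplace(key, std::make_shared<atomic_domain<T>>(ops)).first;
    }
    // Safe downcast: the key records which T the entry was created with.
    return std::static_pointer_cast<atomic_domain<T>>(it->second);
  }

 private:
  std::map<std::pair<std::type_index, atomic_opset_t>,
           std::shared_ptr<atomic_domain_base>> domains_;
};

inline void example_usage() {
  team t;
  auto first = t.get_atomic_domain<std::int64_t>(0x3);
  auto again = t.get_atomic_domain<std::int64_t>(0x3);
  assert(first == again);  // same cached object, no rebuild
}
```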

  8. Former user Account Deleted reporter

    I see no technical issue with collective domain construction. It is a good and prudent design. I just like the on-the-fly API better because it seems less awkward to me, not necessarily less error prone.

    Nameable means that if there is some state associated with the domain on each rank, then when shipping an AM to do the atomic I need some kind of agreed upon name that will allow me to find where that state lives on the remote rank. Making the construction collective affords an easy recipe for the gasnet implementation to generate names internally, since they're the ones doing this state management.

    There's nothing wrong with exposing rupc_all_atomicdomain_alloc in upcxx, except according to my subjective tastes I would prefer an API that did not need that at all.

    We could pursue caching domains like in your example, but that only works if the domain constructor is non-collective. Collectivity breaks lazy construction. That's why the first thing on my list is a request that it won't be collective.

    My last suggestion was internally building one domain per operation (or maybe two or four for the cases of allowing combinations of concurrent reads/writes). So definitely not the full combinatoric space. Allowing fetch_xor to work alongside fetch_add is not something ECP (or anyone?) is clamoring for. We are hard pressed to find customers for anything other than fetch_add actually. I think limiting our ops to only playing well with themselves is just fine for now. And like I said, being too restrictive is not the problem. We can always add domains later. Maybe we can even get some spooky agency to pay for it as an enhancement.

  9. BrianS

    "team as factory" might give us enough on-the-fly ness to have cleaner code, while still allowing a definable place where domains live and are retrievable without communicating when it is not required.

    Perhaps std::pair<size_t, atomic_opset_t> is the agreed upon name. size_t is the hash of T.

    I think I see what is driving this: The desire to invoke remote atomics from inside an rpc. That is a tough feature to ask for. We would also need to encode the team, since some atomic domains might only be over a tight cluster of ranks that have a fast atomic.

  10. Former user Account Deleted reporter

    Exactly! Issuing atomics from within rpc's becomes a little tricky since you'll need a way for the rpc to locate the local rank's domain instance. The easiest thing to do is shove the domain in a file-scoped variable after creation so that it's easy to grab from anywhere in the code on any rank. If that seems too hackish, then this is exactly the problem we addressed in the calling of member functions across ranks. There needs to be a universal name for a distributed object and some registry for translating that to local "this" pointers. So if we want to expose domains that can't be done on-the-fly (per gasnet's choice in constructor collectivity) then we would be best building it on a dist_object for uniformity. If domains could be built on the fly, there would be no more need for this name-to-instance translation.

  11. BrianS

    I would have liked to use file scoped atomic domains, but that will preclude team-scoped atomic domains. (I don't know how to bring the correct team to a file scoped atomic domain).

    The tricky part of dist_object is that it now shows up in the critical path of an atomic operation. If we can't declare the dist_object "closed" at some point in the computation (like after an init phase) then it needs to be mutex protected, and now we have a mutex in the atomic operation critical path.

    I don't think we can do it. Even a "this" object lookup in a dist_object might be a critical path problem. We will almost certainly be judged by our ability to exploit hardware remote atomics. Anything in the critical path is a risk for our performance. If the remote atomic is being serviced by an active message we might do OK with a lookup.

    I think to have optimal remote atomics the atomic domain needs state. I think looking up state on a remote rank will be slow and hence I would recommend we do not encourage this programming style. I would like to find ways to use native atomic operations in rpc calls.

    The rest of the design is likely fine. atomic domain lookup can be done, but I don't think we will implement this capability for the coming year, nor spec it (except perhaps as an Approved Extension). I think team scoped atomic domains is the right granularity. Not static initialization scoped, nor file-scoped (the only team that might be legal at file-scope is the global team).

  12. Dan Bonachea

    A few random thoughts:

    I think it makes sense to attach atomicdomains as sub-objects to a team object. We definitely want team-specific atomicdomains, and any communication (including remote atomics) will need to state the team it's operating on (either explicitly or implicitly, depending on library design). Team construction is inarguably a collective operation (over at least the members of the new team), so that's a natural place to also collectively create whatever atomicdomains the team will need, and establish a replicated table of atomicdomains embedded in the local team datastructure of every team member. After setup time, no communication should ever be required to "find the right domain" for an AMO access, and ideally this is always a fast O(1) lookup. There is no such thing as a "remote" domain, in the same way there is no such thing as a "remote" MPI_Communicator - it is a distributed object that is always accessed via the representative local to the code initiating the access.

    The initial release might restrict teams to contain at most one atomicdomain per datatype, in which case AMO access need only name the team and datatype (which needs to happen anyhow, either explicitly or implicitly). Later releases could relax this restriction and allow a team to contain several atomicdomains for a single type (with different opsets), and then the AMO access needs to name the team, datatype, and domain index (which can be a small integer agreed upon during team/domain creation).
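
    A rough sketch of the replicated-table idea, under the assumption that every team member attaches domains in the same order during collective setup so the small-integer indices agree everywhere. `team_table_sketch` and `domain_entry` are illustrative names only, and nothing collective is actually modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Local representative of one atomicdomain (contents are placeholders).
struct domain_entry {
  std::uint32_t opset;
};

class team_table_sketch {
 public:
  // Called by every member during (collective) setup, in the same order.
  std::size_t attach_domain(std::uint32_t opset) {
    domains_.push_back(domain_entry{opset});
    return domains_.size() - 1;  // the agreed-upon small integer
  }

  // AMO access path: O(1) local lookup, no communication.
  const domain_entry &domain(std::size_t index) const { return domains_.at(index); }

 private:
  std::vector<domain_entry> domains_;
};

inline void example_usage() {
  team_table_sketch rank0, rank1;  // two members' local copies of the same team
  std::size_t i0 = rank0.attach_domain(0x7);
  std::size_t i1 = rank1.attach_domain(0x7);
  assert(i0 == i1);  // same index names the same domain on both ranks
  assert(rank0.domain(i0).opset == rank1.domain(i1).opset);
}
```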

    With this design you don't need any file-scope or namespace-scope globals for atomic domains. However you do still need a way to handle the special "primordial" team that is not explicitly created by the program. This could be handled by a special version of the collective call that does atomicdomain setup for the primordial team, or alternatively atomicdomain creation can be a collective call that attaches the created domain to a named team.

    Of course none of this solves the problem of how to best implement the team datastructure, which needs to be a distributed object that can be efficiently named by any member (even inside an RPC), but that's an orthogonal problem that needs to be solved for other reasons.

    Finally an important clarification to something John mentioned:

    I'm thinking we just enumerate the atomic ops people are likely to want, and then say 
    that they cannot race to the same memory location concurrently, eg fetch_add's only 
    work with other fetch_add's (possibly implicitly allow for concurrent reads and writes). 
    

    The parenthetical remark is a much bigger problem than you seem to think. In particular, many hardware-supported network atomics will not be coherent wrt CPU stores, and some might also be non-atomic wrt CPU loads. The solution we are planning here is that any "concurrent reads and writes" must also go through the GASNet AMO interface with the same atomicdomain. On Aries in particular the writes will need to be funneled through the NIC to ensure coherency. Any "bare" load/stores on locations concurrently being modified by fetch_add will have undefined results. This means the end-user API needs a way to "spell" read and write wrt an atomicdomain. UPC uses the atomicops UPC_GET and UPC_SET for this purpose.

    Brian wrote:

    UPC has two atomic functions so strict and relaxed are distinguished at the atomic operation site, not the domain building call. 
    

    That's correct - I should clarify that AMO's always enforce coherence of the accessed memory, but there's a separate orthogonal property that we didn't get time to cover in our last call regarding their fencing behavior wrt surrounding operations - this is especially relevant since AMOs are often used to implement cross-thread synchronization. strict and relaxed in UPC refer to the memory consistency behavior of the AMO access with respect to surrounding (non-conflicting) operations, in particular whether the compiler/runtime can reorder surrounding accesses across the AMO. This is something the UPC++ API will also need to deal with, although the version 1.0 answer may be to always enforce the strictest possible fencing at AMO accesses.

  13. Former user Account Deleted reporter

    Dan, your interpretation of the parenthetical remark is exactly what I was thinking. Though I was more hopeful about Aries allowing bare load/store, since I would assume an AMO is the type of thing that fits in a single flit.

    The relaxed/strict issue does not apply to us since we won't be pushing anything stronger than the gasnet memory model. Blocking calls are sequentially consistent, async calls force you to manage the dependencies explicitly.

  14. Dan Bonachea

    This issue was discussed in the 1/10/18 Pagoda meeting, and subsequently in the 1/11/18 GEX atomics meeting. We resolved that Steve would look into spec and implementation for March, with extensions that may slip to September.

    Roughly the proposed update to the atomics interface looks like this:

    namespace atomic { 
      enum operation { 
            get, set, 
            add, fetch_add, // May shorten to “fadd” ?
            sub, fetch_sub,
            inc, fetch_inc,
            dec, fetch_dec,
            cas             // May lengthen to “cswap”
      };
    
      template <typename T>
      class domain { // MoveConstructible, Destructible 
         typedef T value_type;
      };  
    
      template <typename T>
      const domain<T> &implicit_domain(); 
        // returns ref to "catch-all" atomic domain 
        // created for type T at upcxx::init()
    }
    
    template <atomic::operation OP, typename T, 
                 typename Completions=decltype(operation_cx::as_future())>
    RType atomic_op(global_ptr <T> p, T val1, [T val2,] // val2 for cas
                           std::memory_order order,
                           Completions cxs=Completions{},
                           const atomic::domain<T> &domain = 
                                   atomic::implicit_domain<T>());
      // possibly also provide a convenience overload with the last two
      // defaulted arguments in opposite order
    
    // Examples:
    using namespace upcxx;
    future<int64_t> f1 = atomic_op<atomic::get>(gptr, std::memory_order_relaxed);
    future<> f2 = atomic_op<atomic::set>(gptr, val, std::memory_order_relaxed);
    future<> f3 = atomic_op<atomic::add>(gptr, arg, std::memory_order_relaxed);
    future<int64_t> f4 = atomic_op<atomic::fetch_inc>(gptr, std::memory_order_relaxed);
    

    The "beginner" user code above can basically ignore the existence of Atomic Domains, and might be the only supported way to use domains in the March release. The initial implementation of atomic::implicit_domain<T>() would return the atomic domain created collectively at startup for that type (one for each of T=(u)int{32,64}) that includes every possible atomic operation. Assuming we limit ourselves to the ops above, this restricted set of types+ops is chosen to ensure offload on both InfiniBand and aries for supported types with even this straightforward implementation (other networks vary).

    A more aggressive implementation of atomic::implicit_domain<T>() could use template meta-programming "tricks" to restrict each of those four init-created domains to include only the set of operations actually "uttered" by atomic_op<OP,T>() calls linked into the program. This allows us to expand the universe of operations without penalizing the application for a lack of offload support for operations it never statically calls (however it would still include operations that are never dynamically reached, and conflates all application data of type T into a single atomicity domain and NIC-level algorithm).
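
    One possible shape for such a "trick" is sketched below, purely as an assumption about how it could be done and not as the planned implementation. Every instantiation of the hypothetical sketch::atomic_op<OP,T>() odr-uses a static registrar object, whose constructor records OP in a per-type set. All names here (sketch, uttered_ops, registrar, never_called) are invented for the example, and the body of atomic_op is a non-atomic placeholder. On common toolchains the registrars are constructed before main(), although the standard permits an implementation to defer that initialization.

```cpp
#include <set>

namespace sketch {

enum class operation { get, set, add, fetch_add, cas };

// Per-type registry of operations named anywhere in the program.
// A function-local static avoids static-initialization-order problems.
template <typename T>
std::set<operation> &uttered_ops() {
  static std::set<operation> ops;
  return ops;
}

// One registrar object exists for each (OP, T) pair that gets instantiated.
template <operation OP, typename T>
struct registrar {
  registrar() { uttered_ops<T>().insert(OP); }
  static registrar instance;
};
template <operation OP, typename T>
registrar<OP, T> registrar<OP, T>::instance;

template <operation OP, typename T>
T atomic_op(T *p) {
  // Taking the address odr-uses the static member, forcing its instantiation.
  (void)&registrar<OP, T>::instance;
  return *p;  // placeholder: a real version would inject the network operation
}

// Never called in this example, yet it still causes fetch_add to be registered
// for long, because the registration is driven by instantiation, not execution.
long never_called(long *p) { return atomic_op<operation::fetch_add>(p); }

}  // namespace sketch
```

    The fact that never_called() registers fetch_add without ever running matches the caveat above: the set reflects what is statically uttered, not what is dynamically reached.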

    Eventually (possibly in the next release) we add the mechanism for user-constructed explicit Atomic Domains, eg:

    namespace atomic {
      template <typename T>
      class domain {
        // constructor:
        domain(std::vector<atomic::operation> ops, upcxx::team &t = world(), 
                    int flags=0 /* possible optional tuning knobs */) {}
        upcxx::team &team();  // convenience accessor
      };
    }
    
    // create several domains for various different purposes
    
    atomic::domain<int64_t> cas_dom({atomic::get, atomic::set, atomic::cas }); 
        // this set of ops IS offloadable on aries
    
    atomic::domain<int64_t> mult_dom({atomic::get, atomic::set, atomic::cas, 
                                                         atomic::fetch_mult }); 
        // this set of ops is NOT offloadable on aries
    
    atomic::domain<int64_t> local_dom({atomic::get, atomic::set, atomic::cas }, 
                                                             local_team()); 
       // this domain uses only shared memory transport
    
    int64_t r1 = atomic_op<atomic::cas>(gptr1, oldval, newval, 
                           std::memory_order_relaxed, operation_cx::as_future(), 
                           cas_dom).wait();  // offloaded to aries
    
    int64_t r2 = atomic_op<atomic::cas>(gptr2, oldval, newval, 
                           std::memory_order_relaxed, operation_cx::as_future(), 
                           mult_dom).wait(); // not offloaded (AM-based)
    int64_t r3 = atomic_op<atomic::fetch_mult>(gptr2, val, 
                           std::memory_order_relaxed, operation_cx::as_future(), 
                           mult_dom).wait(); // not offloaded (AM-based)
    
    int64_t r4 = atomic_op<atomic::cas>(gptr3, oldval, newval, 
                           std::memory_order_relaxed, operation_cx::as_future(), 
                           local_dom).wait();  // uses CPU atomic instruction
    

    This allows atomic ops implemented with different (non-coherent) protocols over the same type (int64_t in this case) to be used in the same program (operating on distinct data). Note none of the other approaches discussed in earlier comments can provide aries offload for r1 above, because safely doing so in general requires the programmer's omniscient/global knowledge that the accessed data regions are disjoint (eg that gptr1 and gptr2 are not aliased on any rank executing this code). Similarly, the r4 access can safely bypass the aries adapter and use a (faster) CPU atomic instruction on shared memory, because the domain created over local_team() requires/enforces that gptr3.is_local(), and that any concurrent accesses to that data are also from members of local_team() using the same shared-memory protocol.

    Note on declaration details: If we wanted even stronger static checking we could optionally cram the operations into variadic template arguments of atomic::domain and check it matches at operation time (although the supported operations are an unordered set and we don't want to make it clumsy for users to declare them). Regardless, GASNet debug mode already checks the operation passed at runtime matches the provided domain.
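
    For comparison, the variadic-template variant could look roughly like the sketch below. This is an assumption about one way to write it, not part of the proposal. The names sketch::domain, sketch::contains and sketch::atomic_op are hypothetical, and the operation bodies are non-atomic placeholders. Membership is checked with a static_assert, and the order in which the operations are listed does not matter.

```cpp
#include <type_traits>

namespace sketch {

enum class operation { get, set, add, fetch_add, cas };

// contains<OP, Ops...>::value is true if OP appears anywhere in Ops.
template <operation OP, operation... Ops>
struct contains : std::false_type {};

template <operation OP, operation First, operation... Rest>
struct contains<OP, First, Rest...>
    : std::integral_constant<bool,
                             OP == First || contains<OP, Rest...>::value> {};

// The operation set is part of the domain's type.
template <typename T, operation... Ops>
struct domain {
  template <operation OP>
  static constexpr bool supports() {
    return contains<OP, Ops...>::value;
  }
};

template <operation OP, typename T, operation... Ops>
T atomic_op(const domain<T, Ops...> &, T *p, T val) {
  static_assert(contains<OP, Ops...>::value,
                "operation is not part of this atomic domain");
  T old = *p;
  // Non-atomic placeholder semantics, for illustration only.
  if (OP == operation::add || OP == operation::fetch_add) *p = old + val;
  if (OP == operation::set) *p = val;
  return old;
}

}  // namespace sketch
```

    Calling sketch::atomic_op<sketch::operation::cas>(dom, ...) on a domain declared without cas fails to compile. The cost is the one noted above: the operation list becomes part of the type, so two domains with the same set listed in a different order are distinct types.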

    CC: @PHHargrove @shofmeyr

  15. Paul Hargrove

    @bonachea Thanks for putting all of that together from our call notes.
    I believe it accurately reflects what we discussed and I endorse this interface proposal.

    @shofmeyr You should note that after you departed the call Dan and I fleshed out a lot of additional details.

  16. Dan Bonachea

    I'm attaching a proof-of-concept prototype I tossed together to help validate my design proposal. It's incomplete but I think demonstrates how the main moving parts fit together.

    There may be more elegant ways to accomplish some of this, and in particular note it will need tweaks to work within generalized completion.

  17. john bachan

    Is there interest in reducing the initiation syntax from atomic_op<fetch_add>(...) to fetch_add(...)? This would involve hoisting the enums to constexpr global objects (each having operator()) and all being of a common user facing base type so they can be collected into containers for easy domain construction.

  18. Dan Bonachea

    Is there interest in reducing the initiation syntax from atomic_op<fetch_add>(...) to fetch_add(...)? This would involve hoisting the enums to constexpr global objects (each having operator()) and all being of a common user facing base type so they can be collected into containers for easy domain construction.

    I like John's idea, assuming all the template goop can be made to work.

    It gives us nicer initiation syntax, but the common base class still makes it easy to express APIs that operate on sets of operations or are parametric over operation.

    I assume you mean the ops would still be in an atomic namespace (ie upcxx::atomic::fetch_add(...)), since the token "atomic" is pretty key to what the function is doing and upcxx::add() seems too ambiguous (eg we may eventually want a upcxx::reduce::add for collectives, or named binary operators for some other purpose).
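
    A minimal sketch of how that could fit together is below, assuming each operation is a constexpr function object derived from a common base that carries only an integer id. All names live in a hypothetical sketch::atomic namespace, the operation bodies are non-atomic placeholders acting on raw pointers, and none of this is the actual UPC++ interface.

```cpp
#include <utility>
#include <vector>

namespace sketch {
namespace atomic {

// Common user-facing base type: an operation is identified by a small id.
struct operation {
  int id;
  constexpr explicit operation(int i) : id(i) {}
};

struct load_t : operation {
  constexpr load_t() : operation(0) {}
  template <typename T>
  T operator()(const T *p) const { return *p; }  // placeholder, not atomic
};

struct fetch_add_t : operation {
  constexpr fetch_add_t() : operation(1) {}
  template <typename T>
  T operator()(T *p, T val) const {  // placeholder, not atomic
    T old = *p;
    *p = old + val;
    return old;
  }
};

// Global objects give the short initiation syntax: fetch_add(ptr, val).
constexpr load_t load{};
constexpr fetch_add_t fetch_add{};

// Domain construction takes a container of the base type. Slicing the derived
// objects down to operation is intentional, since only the id is needed.
class domain {
 public:
  explicit domain(std::vector<operation> ops) : ops_(std::move(ops)) {}
  bool supports(const operation &op) const {
    for (const operation &o : ops_)
      if (o.id == op.id) return true;
    return false;
  }

 private:
  std::vector<operation> ops_;
};

}  // namespace atomic
}  // namespace sketch
```

    The same objects then serve both purposes: sketch::atomic::fetch_add(&x, 4L) initiates an operation, and sketch::atomic::domain d({sketch::atomic::load, sketch::atomic::fetch_add}) collects them for domain construction.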

  19. Steven Hofmeyr

    This seems like going backwards to me - why not have the atomic::fetch_add syntax to start with?

    I'm still against the atomic_op<fetch_add> form - I think it's unwieldy and ugly. Users that want an easy way to switch operations could use their own templates, if they wanted to shun macros.

    I know Dan wants the atomic_op approach; what do others think?

  20. Dan Bonachea

    @shofmeyr - I'm not arguing for a particular initiation syntax, I'm arguing that there should be a base class/enum/type representing all atomic operations, and that same type should be used in both initiation and domain construction. John's proposal accomplishes that and still provides syntax that "looks" like the current free functions (for the benefit of novice users), but underneath the covers it still has a base class that supports operation-generic calling.

  21. Steven Hofmeyr

    The API in the current pull request is as follows:

    namespace upcxx {
      // supported atomic operations
      enum class atomic_op : int { load, store,
                                   add, fetch_add,
                                   sub, fetch_sub,
                                   inc, fetch_inc,
                                   dec, fetch_dec,
                                   compare_exchange };
    
      // atomic domain - T must be an integral type of signed or unsigned 32- or 64-bits
      template<typename T>
      class atomic_domain {
        // the constructor takes a vector of atomic operations; currently, flags is unsupported
        atomic_domain(std::vector<atomic_op> const &ops, int flags = 0);
        ~atomic_domain();
        // the atomic operations
        // all operate on an integral type at a global pointer location 
        // all return futures by default
        // all have default memory_order_relaxed
        template<typename Cxs = completions<future_cx<operation_cx_event> > >
        RType store(global_ptr<T> gptr, T val, std::memory_order order = std::memory_order_relaxed, Cxs cxs = Cxs{{}});
        template<typename Cxs = completions<future_cx<operation_cx_event> > >
        RType load(global_ptr<T> gptr, std::memory_order order = std::memory_order_relaxed, Cxs cxs = Cxs{{}});
        template<typename Cxs = completions<future_cx<operation_cx_event> > >
        RType inc(global_ptr<T> gptr, std::memory_order order = std::memory_order_relaxed, Cxs cxs = Cxs{{}});
        template<typename Cxs = completions<future_cx<operation_cx_event> > >
        RType dec(global_ptr<T> gptr, std::memory_order order = std::memory_order_relaxed, Cxs cxs = Cxs{{}});
        template<typename Cxs = completions<future_cx<operation_cx_event> > >
        RType fetch_inc(global_ptr<T> gptr, std::memory_order order = std::memory_order_relaxed, Cxs cxs = Cxs{{}});
        template<typename Cxs = completions<future_cx<operation_cx_event> > >
        RType fetch_dec(global_ptr<T> gptr, std::memory_order order = std::memory_order_relaxed, Cxs cxs = Cxs{{}});
        template<typename Cxs = completions<future_cx<operation_cx_event> > >
        RType add(global_ptr<T> gptr, T val1, std::memory_order order = std::memory_order_relaxed, Cxs cxs = Cxs{{}});
        template<typename Cxs = completions<future_cx<operation_cx_event> > >
        RType sub(global_ptr<T> gptr, T val1, std::memory_order order = std::memory_order_relaxed, Cxs cxs = Cxs{{}});
        template<typename Cxs = completions<future_cx<operation_cx_event> > >
        RType fetch_add(global_ptr<T> gptr, T val, std::memory_order order = std::memory_order_relaxed, Cxs cxs = Cxs{{}});
        template<typename Cxs = completions<future_cx<operation_cx_event> > >
        RType fetch_sub(global_ptr<T> gptr, T val, std::memory_order order = std::memory_order_relaxed, Cxs cxs = Cxs{{}});
        template<typename Cxs = completions<future_cx<operation_cx_event> > >
        RType compare_exchange(global_ptr<T> gptr, T val1, T val2, std::memory_order order = std::memory_order_relaxed, Cxs cxs = Cxs{{}});
      };
    }
    

    Using the atomic domain:

    upcxx::atomic_domain<unsigned int> ad_ui({upcxx::atomic_op::store, upcxx::atomic_op::load, upcxx::atomic_op::fetch_add});
    upcxx::global_ptr<unsigned int> gp_ui = upcxx::allocate<unsigned int>();
    ad_ui.store(gp_ui, (unsigned)1);
    auto x = ad_ui.load(gp_ui).wait();
    auto y = ad_ui.fetch_add(gp_ui, 5).wait();
    
  22. Dan Bonachea

    @shofmeyr - thanks for the update!

    Some questions:

    1. Is there any reason the constructor copies the vector by value? Should this perhaps instead be: atomic_domain(std::vector<atomic_op> const & ops,...
    2. What C++ Concepts will atomic_domain implement? (MoveConstructible, Destructible, Assignable, etc)
    3. Are the permitted std::memory_order values restricted based on the operation (as in spec Draft 5) or always all supported values (ie even when they are not meaningful)?
  23. BrianS

    I'm guessing we are not going to make generalized completion for atomic_domain operations. I'm ok if that's correct, but is there anything incompatible between atomic operations and generalized completion?

  24. Steven Hofmeyr

    @bonachea:

    1. nope - changed it now.
    2. Currently, the only one is Destructible.
    3. The upc++ implementation doesn't do this. It appears that neither does gasnet...
  25. Dan Bonachea

    @shofmeyr :

    Currently, the only one is Destructible.

    I'm worried that lacking DefaultConstructible and MoveConstructible might make it more awkward or slower for the expected use case where applications declare their one or few upcxx::atomic_domain objects as global variables initialized early at startup and used throughout. We don't want to force users to thread these through all their API call signatures, since a global works just fine for a monolithic application. But we also don't want to incur an extra pointer indirection when doing shared memory atomics because we forced their global to have type atomic_domain *.

  26. Dan Bonachea

    Are the permitted std::memory_order values restricted based on the operation (as in spec Draft 5) or always all supported values (ie even when they are not meaningful)? The upc++ implementation doesn't do this. It appears that neither does gasnet...

    The GEX spec/impl allows memory fencing on atomics that the C++11 std::atomics consider non-meaningful. UPC++ needs to take a stance on which to emulate, and my inclination is it should continue to follow the C++ example.
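
    For reference, the C++11 rule being referred to is that an atomic load must not use memory_order_release or memory_order_acq_rel, and an atomic store must not use memory_order_consume, memory_order_acquire or memory_order_acq_rel, while read-modify-write operations accept any order. A check along those lines could be sketched as follows. The names sketch::kind and sketch::order_is_meaningful are hypothetical and not taken from the UPC++ spec or implementation.

```cpp
#include <atomic>

namespace sketch {

enum class kind { load, store, read_modify_write };

// Mirrors the preconditions std::atomic places on memory orders.
constexpr bool order_is_meaningful(kind k, std::memory_order o) {
  return k == kind::load
             ? (o != std::memory_order_release &&
                o != std::memory_order_acq_rel)
         : k == kind::store
             ? (o == std::memory_order_relaxed ||
                o == std::memory_order_release ||
                o == std::memory_order_seq_cst)
             : true;  // read-modify-write operations accept every order
}

}  // namespace sketch
```

    A debug build could assert this at each operation site, mapping load and store to the first two kinds and the arithmetic and compare_exchange operations to read_modify_write.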

    The GEX behavior is motivated by the fact that GEX does not impose a memory consistency model on clients, so we don't prohibit a runtime from doing things like gex_AD_Op*(GEX_OP_GET,GEX_FLAG_AD_REL) and gex_AD_Op*(GEX_OP_SET,GEX_FLAG_AD_ACQ) to fold memory fences into atomics where they normally wouldn't be used, under the assumption the client "knows what it's doing" wrt its memory model.
