Library: Atomic Memory Operations (AMO)

Former user Account Deleted

``` I'd like to see a bit of discussion on the pros and cons of the three approaches before we decide.

My perspective is that there are two issues of some degree of importance:

1) How full a collection of types and operations we want. There are quite a few entry points and our experience is that only a few are used. But if implementers and documenters are not concerned, I don't think users will be.

2) How we "spell" them. Again, I don't think users will be concerned as they will likely use a macro in their code no matter what we do :)

```

Reported by `wwc@uuuuc.us` on 2012-03-16 11:23:33

2012-03-16T11:23:33+00:00

Former user Account Deleted

``` Is it acceptable if we came up with a subset of the BUPC functions?

(1) I dislike the "local" versions of the atomic functions (they feel like oversweet syntax sugar to me)

(2) I don't really "get" the mswap operator, and I can't identify any particular unique use of it. ```

Reported by `ga10502` on 2012-04-24 21:07:49

2012-04-24T21:07:49+00:00

Former user Account Deleted

``` Gheorghe wrote: (1) I dislike the "local" versions of the atomic functions (they feel like oversweet syntax sugar to me) (2) I don't really "get" the mswap operator, and I can't identify any particular unique use of it.

In response to (1): How does one perform atomic operations on private pointers if one removes the "local" functions? Manual "privatization" of pointer-to-shared is a common optimization, and the upc_cast() under consideration for the spec will make it MORE common. Unless one also provides a way to convert private->shared [Ick!] then the "local" variants of the atomic operations will be needed to avoid potentially forcing the user to keep track of both private and shared pointers to the same datum.

In response to (2): The "mswap" (masked swap) is, I believe, intended to aid in the implementation of atomic updates to "flag bits" (think bit-fields w/o the help of the syntax). It is among the SHMEM atomics, and thus made its way onto the "short list" when collecting information from Lauren about her community of programmers. ```
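For readers unfamiliar with masked swap, here is a minimal sketch in plain C11 (not UPC) of what such an operation does to a word of flag bits. The name `masked_swap_u64` is hypothetical, and the compare-and-swap loop is only one possible emulation; the SHMEM and BUPC versions may be specified or implemented differently.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: atomically replace only the bits of *p selected by
 * mask with the corresponding bits of value, and return the previous word. */
static uint64_t masked_swap_u64(_Atomic uint64_t *p, uint64_t mask, uint64_t value)
{
    assert(p != NULL);
    uint64_t old = atomic_load_explicit(p, memory_order_relaxed);
    uint64_t desired;
    do {
        /* Keep the unmasked bits of the current word, take masked bits from value. */
        desired = (old & ~mask) | (value & mask);
        /* On failure, old is refreshed with the current contents of *p. */
    } while (!atomic_compare_exchange_weak_explicit(
                 p, &old, desired, memory_order_relaxed, memory_order_relaxed));
    return old;
}
```

Bits outside the mask are never disturbed, which is the property a bit-field update needs.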

Reported by `phhargrove@lbl.gov` on 2012-04-24 21:22:19

2012-04-24T21:22:19+00:00

Former user Account Deleted

``` Hi Paul, in response to your argument about private pointers - I could argue (not excessively facetiously) that it is none of my damn business what you do with private pointers - UPC is about shared data and pointers. ... In fact I think Yili used that as one of his arguments to shut me down when I argued that collectives should take private pointers to data.

I agree with you that converting private pointers to "fat" pointers may not be such a hot idea. It so happens that in xlupc we could do it without breaking a sweat, but I cannot say whether that would be a path to suicide on other systems.

Damn me for seeing both sides of the issue. But in return I would have you acknowledge the essential awkwardness of having "local" versions of every UPC function.

```

Reported by `ga10502` on 2012-04-25 14:41:14

2012-04-25T14:41:14+00:00

Former user Account Deleted

``` I *do* acknowledge that "local" versions of every function would be a mess, but I doubt that polymorphism as an alternative will get many supporters. So, it comes down (in my mind) to what do you LOSE if no local variant is included. One can argue against local versions of collectives by claiming one can always make a copy. The act of copying some atomic datum sort of destroys its purpose. So, I think an argument could be made for why this might be a special case. However, I won't get too hung up on this, as BUPC will continue to support local atomics as an extension if they are not included in the spec.

So, are there any other opinions on the inclusion/exclusion of atomic operation on pointer-to-private? ```

Reported by `phhargrove@lbl.gov` on 2012-04-25 20:09:01

2012-04-25T20:09:01+00:00

Former user Account Deleted

``` Going back to Bill's two issues: 1) How full a collection of types and operations we want. 2) How we "spell" them.

To (1): I think the minimal set of types that my users are interested in is T={int64_t, uint64_t}. For operations, the primary interest is in fetch-and-OP and OP (no fetch), where OP={ADD, AND, OR, XOR}. While there is interest in compare-and-swap, I think this is a good bit further down their list of priorities. I am pretty sure that we only care about AMOs as relaxed shared accesses.

The interest in the non-fetching atomic OP is that it would be a non-blocking call for which completion is only guaranteed by the next fence. The goal would be that one could issue a large set of atomic OPs for high throughput.

This is definitely a reduced subset from what BUPC and Cray offer. Maybe this answers my position on George's mswap question. I understand that some implementers may want to expand the type and operation sets, but I think this is the minimal set that I care about.

To (2): While everyone will probably have their own desired flavor of spelling, I would probably go for something relatively short, like:

TYPE upc_amo_fopT(OP, shared TYPE* p, TYPE v);
void upc_amo_opT(OP, shared TYPE* p, TYPE v);

With this, we could use the existing upc_op_t definitions for OP (only accepting a subset of them, naturally). This would bring up the "where should we put the upc_op_t enum?" issue, as it is currently part of the collectives library and sharing this with an AMO library (which would make sense) would mean they'd need some common header for these types. This is just another version of the upc_flag_t discussion in Issue #10. ```

Reported by `nspark.work` on 2012-04-28 21:46:29

2012-04-28T21:46:29+00:00

Former user Account Deleted

``` 1) I support Nick's request for having atomic OPs without fetch because they can have better performance when fetch is not needed. And I see at least one app (Graph500) can benefit from atomic OPs without fetch. I would like to propose to extend OP to include MAX and MIN, i.e., OP={ADD, AND, OR, XOR, MAX, MIN}. FYI, MPI_Accumulate is something similar.

2) Does UPC guarantee atomicity for basic ops with built-in types? For example, assuming int64_t == long long in C99:

shared int64_t *p;
int64_t a;

Is there a difference between:
i) (*p) += a;
ii) upc_amo_op_int64(ADD, p, a);

3) For the discussion of AMOs with private/local pointers, if we want to include them in UPC spec, we should probably consider their compatibility and/or potential redundancy with C11 atomics.

```

Reported by `yzheng@lbl.gov` on 2012-04-29 00:56:33

2012-04-29T00:56:33+00:00

Former user Account Deleted

``` I don't think my users really care about local atomics. It might make more sense to address shared atomics now and save local atomics for the bigger discussion of whether UPC moves to C11. ```

Reported by `nspark.work` on 2012-04-30 20:49:37

2012-04-30T20:49:37+00:00

Former user Account Deleted

``` With regard to "Does UPC guarantee atomicity for basic ops with built-in types?", the answer is unequivocally no. As far as the memory model is concerned, (*p) += a; becomes (*p) = (*p) + a; which becomes (in pseudo code, READ is either a relaxed or strict read of a shared object, WRITE is either a strict or relaxed write of a shared object):

READ( *p ) => t1
READ( a ) => t2
t1 + t2 => t3
t3 => WRITE( *p )

There is nothing to guarantee that some other thread doesn't come in and modify *p or a after the local thread reads it, but before it writes the new result back to *p. UPC statements do not have transaction semantics (though it'd likely be a useful extension if anyone wants to come up with such a proposal!). Assuming all strict accesses, the compiler/runtime must ensure that this race is consistent in that all threads observe the same ordering, but it doesn't need to do anything to prevent the race from occurring. For relaxed accesses, it doesn't even need to do that, though local ordering must still be maintained. ```
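The decomposition above can be sketched in plain C11, used here only as an analogy for a shared UPC object. The function names are hypothetical. The first function spells out the separate read, add, and write steps; the second performs the update as one indivisible read-modify-write.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* What "(*p) += a" amounts to under the memory model described above:
 * a separate read, add, and write. Another thread may modify *p between
 * the load and the store, and that update is then lost. */
static void add_non_atomic(_Atomic int64_t *p, int64_t a)
{
    assert(p != NULL);
    int64_t t1 = atomic_load_explicit(p, memory_order_relaxed);  /* READ(*p) => t1  */
    int64_t t3 = t1 + a;                                         /* t1 + t2  => t3  */
    atomic_store_explicit(p, t3, memory_order_relaxed);          /* t3 => WRITE(*p) */
}

/* A single atomic read-modify-write: no concurrent update can be lost. */
static void add_atomic(_Atomic int64_t *p, int64_t a)
{
    assert(p != NULL);
    (void)atomic_fetch_add_explicit(p, a, memory_order_relaxed);
}
```

With one thread the two functions give the same result; they differ only when another thread updates `*p` concurrently.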

Reported by `sdvormwa@cray.com` on 2012-05-11 16:22:23

2012-05-11T16:22:23+00:00

Former user Account Deleted

``` To amplify Yili's point about adds w/o a fetch. Does this make sense as far as semantics?

Level 1: basic atomic operation. Essentially, guaranteeing that e.g. a+=b happens atomically. Examples: atomic increment, atomic set, atomic or, xor, ...

Level 2: fetch + basic atomic operation. There is one for every operation defined in Level 1. The value *before* the operation is returned to the user.

Level 3: compare + fetch + op. The operation supplies two values - a "compare" value and an "update" value - and returns the "old" value. The operation is executed if the "old" value matches the "compare" value. The "old" value is returned in any case. Typical example: compare-and-swap, which is really a compare+fetch+set.

Is there anything you can think of that is not covered by this taxonomy?

```
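The three levels can be illustrated with a small C11 sketch. This is an analogy only: the `amo_*` names and the `amo_op_t` enum are hypothetical stand-ins, not the proposed UPC interface, and Level 3 is shown only for its compare-and-swap case.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { AMO_ADD, AMO_AND, AMO_OR, AMO_XOR } amo_op_t;  /* hypothetical */

/* Level 2: fetch + op. Returns the value held *before* the update. */
static int64_t amo_fop_i64(amo_op_t op, _Atomic int64_t *p, int64_t v)
{
    assert(p != NULL);
    switch (op) {
    case AMO_ADD: return atomic_fetch_add_explicit(p, v, memory_order_relaxed);
    case AMO_AND: return atomic_fetch_and_explicit(p, v, memory_order_relaxed);
    case AMO_OR:  return atomic_fetch_or_explicit(p, v, memory_order_relaxed);
    case AMO_XOR: return atomic_fetch_xor_explicit(p, v, memory_order_relaxed);
    }
    assert(!"unknown op");
    return 0;
}

/* Level 1: op only. The old value is simply discarded in this sketch. */
static void amo_op_i64(amo_op_t op, _Atomic int64_t *p, int64_t v)
{
    (void)amo_fop_i64(op, p, v);
}

/* Level 3 with op == set, i.e. compare-and-swap. Returns the old value,
 * which equals cmp exactly when the swap took place. */
static int64_t amo_cas_i64(_Atomic int64_t *p, int64_t cmp, int64_t val)
{
    assert(p != NULL);
    int64_t old = cmp;
    atomic_compare_exchange_strong_explicit(
        p, &old, val, memory_order_relaxed, memory_order_relaxed);
    return old;
}
```

Level 1 is written on top of Level 2 here for brevity; a real implementation might issue a cheaper non-fetching operation instead.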

Reported by `ga10502` on 2012-05-22 13:41:31

2012-05-22T13:41:31+00:00

Former user Account Deleted

``` --- AMO Taxonomy & Hardware Support --- I don't think this taxonomy covers the "masked swap" in the BUPC AMO extensions. I don't know how strongly people feel about this particular AMO, but (and this may be a stupid reason), I would be inclined to leave it out for the sake of having a more concise set of function declarations.

Again, the spelling isn't /that/ important, but I think this would be a relatively terse set:

void upc_amo_opT( upc_op_t op, shared TYPE* ptr, TYPE val );
TYPE upc_amo_fopT( upc_op_t op, shared TYPE* ptr, TYPE val );
TYPE upc_amo_cfopT( upc_op_t op, shared TYPE* ptr, TYPE cmp, TYPE val );

From what I can see, there seem to be compare-and-swap extensions, but not the more-general compare+fetch+op. For the implementers, would this general 'Level 3' AMO be an implementation challenge? More specifically, would it not see the same level of hardware support that the others do? Or, would a lack of hardware support for a general 'Level 3' AMO constrain the performance of compare-and-swap (or Level 1 & 2 AMOs) in order to guarantee atomicity?

--- Local AMO Support --- Thinking back to the issue of local AMO support, it seems from existing extensions that local-pointer AMOs are generally not atomic with respect to shared-pointer AMOs. It's a somewhat confusing point, so (maybe I'm beating a dead horse) I'd probably be inclined to leave out the local AMOs to prevent this sort of confusion. I expect the typical use case for AMOs to be on shared memory anyway, but maybe that's an incorrect assumption.

--- Relaxed vs. Strict --- One item not yet discussed here is whether the AMO function definitions should explicitly address whether the accesses are strict or relaxed (as in BUPC) or elide the distinction (as I think is the case in Cray UPC). I think I'd prefer to leave out the relaxed/strict distinction in the AMO function definition and leave the access to be determined by the reference-type qualifier (or the associated pragma). ```

Reported by `nspark.work` on 2012-05-22 14:59:59

2012-05-22T14:59:59+00:00

Former user Account Deleted

``` Yes, oops - I forgot the masked-swap operation. I support Nick's motion to leave it out *unless* someone can think of a "killer app" for this. Please speak up :)

--- Hardware support ---

'Level 3' would obviously not be a challenge for IBM - I would not have suggested it otherwise [insert evil grin here]. But you bring up an important point. All these operations can be emulated given a set of basic primitives - and those primitives are different on every vendor's HW.

Is there a canonical subset of these operations that will have "native" performance on most vendors' HW?

If this canonical subset can be identified, maybe we should highlight this subset in some way in the AMO specification?

---- Local AMO support ---- If we add local AMOs they should be interoperable with shared ones - or else a lot of user confusion will result. So binary decision: either guarantee interoperability or leave them out completely (not UPC's concern).

```

Reported by `ga10502` on 2012-05-23 14:05:00

2012-05-23T14:05:00+00:00

Former user Account Deleted

``` Gheorghe wrote:

Yes, oops - I forgot the masked-swap operation. I support Nick's motion to leave it out *unless* someone can think of a "killer app" for this. Please speak up :)

Tracker issue #35 discusses write to shared bit fields without disrupting adjacent ones. Providing that assurance would require a masked-swap operation exist within the runtime implementation. If that is the case, then the question becomes whether one exposes this capability to the UPC user as a part of the atomics library as well. ```

Reported by `phhargrove@lbl.gov` on 2012-05-23 20:23:30

2012-05-23T20:23:30+00:00

Former user Account Deleted

``` Nick wrote:

I don't think this taxonomy covers the "masked swap" in the BUPC AMO extensions. I don't know how strongly people feel about this particular AMO, but (and this may be a stupid reason), I would be inclined to leave it out for the sake of having a more concise set of function declarations.

Berkeley includes the masked-swap due to input we received from Lauren Smith. We are quite willing to leave it out of the spec and retain it as only a Berkeley extension.

I think I'd prefer to leave out the relaxed/strict distinction in the AMO function definition and leave the access to be determined by the reference-type qualifier (or the associated pragma).

Unless I am missing something important, what Nick requests above is not possible in a library function. Neither the relaxed/strict qualification of the pointer nor the pragma in effect at the call site can be known inside the called function. Now if this were UPC++ we might have a chance via polymorphism assuming relaxed/strict are significant in the type matching.

If support were "deeper" than a library of functions (including some compiler support), then what Nick requests would become possible. That would make atomic operations more along the lines of "compiler intrinsics" than functions. I don't have any strong objection to that, but it may significantly raise the burden on an implementer (the actual burden being very implementation specific already). ```

Reported by `phhargrove@lbl.gov` on 2012-05-23 20:33:12

2012-05-23T20:33:12+00:00

Former user Account Deleted

``` Leaving aside the issue of compiler support, is there any implementation of UPC where the difference between strict and relaxed is *not* a UPC fence?

Could we leave strict AMOs out of the picture and rely on users being able to bracket the AMOs with fences?

```

Reported by `ga10502` on 2012-05-30 11:43:56

2012-05-30T11:43:56+00:00

Former user Account Deleted

``` Gheorghe asked:

Leaving aside the issue of compiler support, is there any implementation of UPC where the difference between strict and relaxed is *not* a UPC fence?

Could we leave strict AMOs out of the picture and rely on users being able to bracket the AMOs with fences?

It is not as simple as that...

In the BUPC implementation of "upc_fence" we need to include both architectural memory fences and a compiler optimization fence. In the AMOs for some architectures, the atomic instructions already imply the architectural memory fence (the LOCK prefix on x86/x86-64 being the most important example to those outside of IBM). So, asking a user on such an architecture to use BOTH an AMO and a upc_fence would result in TWO (or more, see below) memory fences.

Additionally, what is the user expected to use:

Option 1) upc_fence; relaxed_AMO();
Option 2) relaxed_AMO(); upc_fence;
Option 3) upc_fence; relaxed_AMO(); upc_fence;

In Option 1 it is possible for shared access after the AMO to move "up" and take place between the AMO and the fence. This is OK for "release" semantics.

Conversely, in Option 2 shared accesses before the AMO might "move down" and take place after the AMO. This is OK for "acquire" semantics.

Only with Option 3 do we get the property that the name "strict AMO" implies to me: all shared access issued before the AMO complete, then the AMO completes before any later references can begin. That is what I believe 5.1.2.3 of the UPC 1.2 spec says for a strict access, and is therefore what I think we should provide for a "strict AMO".

BUPC's strict AMOs are intended to "work like" Option 3, but typically w/o incurring THREE architectural memory fences. ```
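As a rough analogy, the three options can be written with C11 fences around a relaxed fetch-and-add. C11 fences are not the same thing as `upc_fence`, and the function names below are hypothetical, so treat this only as a sketch of where the ordering points sit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Option 1: fence, then relaxed AMO. Earlier accesses are ordered before
 * the AMO (release-like); later accesses may still move up past it. */
static int64_t fadd_fence_before(_Atomic int64_t *p, int64_t v)
{
    assert(p != NULL);
    atomic_thread_fence(memory_order_release);
    return atomic_fetch_add_explicit(p, v, memory_order_relaxed);
}

/* Option 2: relaxed AMO, then fence. Later accesses are ordered after
 * the AMO (acquire-like); earlier accesses may still move down past it. */
static int64_t fadd_fence_after(_Atomic int64_t *p, int64_t v)
{
    assert(p != NULL);
    int64_t old = atomic_fetch_add_explicit(p, v, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);
    return old;
}

/* Option 3: fences on both sides of the AMO. */
static int64_t fadd_fence_both(_Atomic int64_t *p, int64_t v)
{
    assert(p != NULL);
    atomic_thread_fence(memory_order_seq_cst);
    int64_t old = atomic_fetch_add_explicit(p, v, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return old;
}

/* One operation that carries its own ordering, so the implementation can
 * avoid issuing redundant fence instructions around it. */
static int64_t fadd_ordered(_Atomic int64_t *p, int64_t v)
{
    assert(p != NULL);
    return atomic_fetch_add_explicit(p, v, memory_order_seq_cst);
}
```

The last form is closest in spirit to a "strict AMO" entry point: the caller writes one call, and the library decides how many hardware fences the target actually needs.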

Reported by `phhargrove@lbl.gov` on 2012-06-01 04:28:17 - Labels added: Spec-1.3

2012-06-01T04:28:17+00:00

Former user Account Deleted

Reported by `phhargrove@lbl.gov` on 2012-06-01 06:06:09 - Labels added: Milestone-Spec-1.3 - Labels removed: Spec-1.3

2012-06-01T06:06:09+00:00

Former user Account Deleted

``` I see your point. I withdraw my proposal about no strict AMOs. So it boils down to a choice between:

Move AMOs deep inside the compiler just to help figure out whether AMOs are strict or relaxed, based on whether we are in strict or relaxed mode, the variable is denoted as strict or relaxed etc.

Make AMO strictness/relaxedness explicit and double our namespace complexity.

This is similar to the "strict library approach vs. get the language involved" dichotomy that also plagues issues 41 (nonblocking memory copies) and 42 (nonblocking collectives). I sense that we will have to take a unified approach to decide all three of these.

```

Reported by `ga10502` on 2012-06-15 15:26:50

2012-06-15T15:26:50+00:00

Former user Account Deleted

``` As suggested by Nick, I'd like to have part-time ownership of this issue as we write it up. I'm willing to take full-time ownership, but I certainly don't want to muscle anyone else out. -- George ```

Reported by `ga10502` on 2012-06-15 17:18:41

2012-06-15T17:18:41+00:00

Former user Account Deleted

``` In issue #41 I am backing down from my position that changes to upc_fence()'s implementation are unacceptable. So, in this issue, perhaps we should poll the implementations to determine if "getting the compiler involved" is a reasonable possibility before assuming that it is not. While that would mean that the proposed extension is not strictly (pun totally intentional) a pure library, it would avoid the doubled namespace.

So, the question is:

Does your implementation (or could it w/o excessive burden) have sufficient "smarts" to distinguish calls to an AMO in which a dereference of the pointer argument is strict vs relaxed?

For Berkeley UPC, the answer is YES. As a source-to-source translator we generate different calls to our communication library for strict and non-strict accesses. By treating AMOs as compiler intrinsics, rather than as calls to arbitrary C functions, we could leverage the same internal mechanism(s) to implement distinct relaxed/strict versions INTERNALLY, while using only a "generic" name in the user's code.

So, are other implementers able/willing to consider AMOs that have a polymorphic aspect with respect to relaxed-vs-strict? ```

Reported by `phhargrove@lbl.gov` on 2012-06-16 01:14:05

2012-06-16T01:14:05+00:00

Former user Account Deleted

``` I don't really see a doubling of the interface for strict/relaxed to be that much of a problem. Yes, it makes our header file a little bit longer, but the documentation can be written in a generic way (as in the BUPC AMO spec) to cover both cases and avoid page bloat of the spec. This seems preferable to creating a large number of compiler intrinsics (which will then be harder to change as the spec evolves) or trying to explain to the user how this is a library but has magical extra properties. It also allows third party implementations of atomics (eg proof of concept prototypes, open source reference implementations) which would otherwise be prohibited.

There are already examples of this type of interface doubling in the C spec, for a similar reason (lack of argument polymorphism): See the wide character library in C99 7.24 and wchar.h (which basically duplicates stdio.h, string.h, time.h, and ctype.h in their entirety). ```
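To illustrate how a doubled interface can share one description, here is a minimal C11 sketch in which a macro stamps out a relaxed and a strict entry point from a single definition. The macro and function names are hypothetical, and mapping "strict" to `memory_order_seq_cst` is an assumption made for the analogy, not a statement about UPC's memory model.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical generator: one body, two entry points per operation. */
#define DEFINE_AMO_FADD(suffix, order)                                   \
    static int64_t amo_fadd_i64_##suffix(_Atomic int64_t *p, int64_t v)  \
    {                                                                    \
        assert(p != NULL);                                               \
        return atomic_fetch_add_explicit(p, v, order);                   \
    }

DEFINE_AMO_FADD(relaxed, memory_order_relaxed)  /* amo_fadd_i64_relaxed */
DEFINE_AMO_FADD(strict,  memory_order_seq_cst)  /* amo_fadd_i64_strict  */
```

The header grows by one line per variant, while the documentation can describe the operation once and note the suffix rule separately.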

Reported by `danbonachea` on 2012-06-16 19:19:44

2012-06-16T19:19:44+00:00

Former user Account Deleted

``` I think I'm okay with the interface doubling from using a suffix for strict or relaxed AMOs. As Dan points out, it doesn't necessarily ruin the documentation. I didn't realize at first how this would affect the compiler or the pure library approach. I'd also like to be part of writing the spec text, along with George (and Yili, I think).

I am curious as to what Cray does with their current global AMOs with regard to strict vs. relaxed accesses in their extensions.

(Updated 'Type' to "Enhancement") ```

Reported by `nspark.work` on 2012-06-18 21:16:10 - Labels added: Type-Enhancement - Labels removed: Type-Defect

2012-06-18T21:16:10+00:00

Former user Account Deleted

``` Our global AMOs are essentially treated as relaxed updates. This is true even when forcing relaxed accesses to be strict via pragma or the inclusion of upc_strict.h (which is probably a bug now that I think about it). ```

Reported by `sdvormwa@cray.com` on 2012-06-18 21:36:54

2012-06-18T21:36:54+00:00

Former user Account Deleted

``` Steven wrote:

Our global AMOs are essentially treated as relaxed updates.

Does this mean that Cray AMOs cannot be used, for instance, to implement a semaphore (because the UP lacks release semantics and the DOWN lacks acquire semantics) without an additional strict reference (such as a upc_fence)?

I am asking because I want to better understand what users expect to DO with AMOs.

For the case where the value of the atomic variable is of importance by itself (as an accumulator, for instance) the relaxed access is sufficient. However, once you use the atomic variable's value to control when/if one accesses additional locations (spinlock, semaphore, etc.) there needs to be a "strict" somewhere. As I illustrated for George, there is a strong motivation to avoid making the user insert fences for this purpose. Does anybody have users that use atomics in this way? ```

Reported by `phhargrove@lbl.gov` on 2012-06-18 22:06:10

2012-06-18T22:06:10+00:00

Former user Account Deleted

``` I think I'll have to expand a little--they don't really fit in with the current UPC memory model right now.

The global AMOs are "relaxed" in the sense that they do not provide a full fence like strict accesses do. They do provide acquire semantics, so relaxed accesses issued after an AMO will be ordered "correctly". You still technically need a strict write (or a fence followed by a relaxed write) for release semantics. However, many users have noticed that a relaxed write alone works in most cases--and is much faster--and therefore leave the fence out until something breaks.

With regard to what users do with them, I can't really answer that because we typically don't get to see source code from the customers that use them. That said, I'd guess that it's more the former (updating a value) than the latter (synchronization) at this point given the bugs we've seen to date. ```

Reported by `sdvormwa@cray.com` on 2012-06-18 23:45:55

2012-06-18T23:45:55+00:00

Former user Account Deleted

``` I'm coming a bit late to this discussion, but I really like that we're exploring passing an atomic op enum to a few functions instead of having one function per operation. Cray has been stuck with supporting a variety of _amo_* functions because that was how it was originally implemented, but internally we use an enum passed to just a few functions, very much like Comment #11. The legacy support has caused numerous headaches when adding support for new AMO operations just due to entry point explosion.

Also, for historical reasons, the Cray AMO extensions work on either local or shared data. Aside from these extensions, our users have the option of using the same builtin syntax that GCC provides for local AMOs in C; however, our GCC-style builtins and the Cray AMO extensions are not atomic with respect to each other due to the way the hardware works. Therefore, I can fully sympathize with not wanting to provide local AMOs in UPC because if we did so, it would be natural for users to expect the local UPC AMOs to be atomic with respect to the global UPC AMOs...and some systems may not be able to support that. ```

Reported by `johnson.troy.a` on 2012-06-19 14:57:16

2012-06-19T14:57:16+00:00

Former user Account Deleted

``` Troy said:

I really like that we're exploring passing an atomic op enum to a few functions instead of having one function per operation.

Would the implementers here be interested in reducing the interface size by including the TYPE as a function parameter? I had thought about that, but it does not seem to be common in UPC (including extensions -- except for the BUPC Value-Based Collectives interface).

George generalized compare-and-swap into compare-fetch-op, noting with an evil grin that IBM could support the general case. Is this general case of interest to other vendors (and would they be hardware-supported)? Or is CAS the common subset of this class that is supported by most vendors?

From a spec-writing perspective, would it make sense for the spec to include compare-fetch-op with "set" as the only required op and leave other operations as vendor-supported options? This could allow us to potentially expand the list of required operations in future releases if multiple networks increased hardware AMO support without drastically changing the AMO spec. ```

Reported by `nspark.work` on 2012-06-19 15:32:18

2012-06-19T15:32:18+00:00

Former user Account Deleted

``` Paul wrote:

BUPC's strict AMOs are intended to "work like" Option 3, but typically w/o incurring THREE architectural memory fences.

Why does option 3 incur three architectural memory fences? It seems like a trivial peephole optimization for the compiler to throw away the superfluous fences, assuming the compiler has sufficient knowledge of how the target runtime works. ```

Reported by `sdvormwa@cray.com` on 2012-06-19 15:46:42

2012-06-19T15:46:42+00:00

Former user Account Deleted

``` I haven't read the AMO spec in detail, but would like to note that it is convenient if the compare-swap operation supports 128-bit data types (presumably aligned on at least a 64-bit boundary). This comes up in UPC applications (and UPC runtimes) when there is a need to compare-swap a pointer-to-shared value. For GUPC, using the "struct" PTS representation on a 64-bit host, a fully general PTS is stored in a 128-bit container. Perhaps a feature macro is needed that indicates whether the AMO implementation supports compare-swap on 128-bit values. Also, perhaps, the minimum alignment needs to be indicated via a preprocessor macro. ```

Reported by `gary.funck` on 2012-06-19 15:50:51

2012-06-19T15:50:51+00:00

Former user Account Deleted

``` Going back to the memory semantics, perhaps we should consider providing more fence options than simply upc_fence? This would benefit both the AMOs and the non-blocking proposal (our non-blocking proposal includes acquire semantics for the completion of non-blocking operations). Should that be split out to a separate issue? ```

Reported by `sdvormwa@cray.com` on 2012-06-19 16:11:18

2012-06-19T16:11:18+00:00

Former user Account Deleted

``` In comment #28 Steven wrote:

Paul wrote:

BUPC's strict AMOs are intended to "work like" Option 3, but typically w/o incurring THREE architectural memory fences.

Why does option 3 incur three architectural memory fences? It seems like a trivial peephole optimization for the compiler to throw away the superfluous fences, assuming the compiler has sufficient knowledge of how the target runtime works.

I agree that this is a trivial optimization if atomics are "known" to the compiler. But the current implementation is a LIBRARY and the compiler doesn't know a call to an AMO from any other function call. ```

Reported by `phhargrove@lbl.gov` on 2012-06-19 17:09:05

2012-06-19T17:09:05+00:00

Former user Account Deleted

``` In comment #27 Nick asks:

Would the implementers here be interested in reducing the interface size by including the TYPE as a function parameter?

This would not work for any function which returns a value. So we would need to pass a pointer to the result in any function generating a result. For this reason I dislike passing the type.
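The point about return values can be seen in a small C11 sketch. Both the `amo_type_t` enum and the function names are hypothetical. Once the type is a run-time parameter, no single C return type fits, so the fetched value has to come back through a `void *`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { AMO_T_INT32, AMO_T_INT64 } amo_type_t;  /* hypothetical */

/* Type passed as a parameter: operand and result travel through untyped
 * pointers, and the compiler can no longer check that they match. */
static void amo_fadd_generic(amo_type_t type, void *target,
                             const void *operand, void *result)
{
    assert(target != NULL && operand != NULL && result != NULL);
    switch (type) {
    case AMO_T_INT32:
        *(int32_t *)result = atomic_fetch_add_explicit(
            (_Atomic int32_t *)target, *(const int32_t *)operand,
            memory_order_relaxed);
        break;
    case AMO_T_INT64:
        *(int64_t *)result = atomic_fetch_add_explicit(
            (_Atomic int64_t *)target, *(const int64_t *)operand,
            memory_order_relaxed);
        break;
    }
}

/* Type encoded in the name: the result is an ordinary return value. */
static int64_t amo_fadd_i64(_Atomic int64_t *p, int64_t v)
{
    assert(p != NULL);
    return atomic_fetch_add_explicit(p, v, memory_order_relaxed);
}
```

The typed variant costs one entry point per type but keeps calls usable inside expressions and type-checked at compile time.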

George generalized compare-and-swap into compare-fetch-op, noting with an evil grin that IBM could support the general case. Is this general case of interest to other vendors (and would they be hardware-supported)? Or is CAS the common subset of this class that is supported by most vendors?

If even ONE required operation lacks h/w support, then we risk requiring ALL operations to be implemented via software just to ensure they are all atomic with respect to each other. Therefore I strongly support the idea that compare-and-swap be required but nothing more general. I would actually go so far as to discourage documenting OPTIONAL atomics in the spec text because this would encourage writing of non-portable code.

What I *would* encourage is that vendors providing extensions to the atomics (more operations, more types, support for "private", etc) all agree OUTSIDE OF THE SPEC on the "spelling" of their extensions. This paves a smooth(er) path to their later addition to the spec, and eases their use. ```

Reported by `phhargrove@lbl.gov` on 2012-06-19 17:27:09

2012-06-19T17:27:09+00:00

Former user Account Deleted

``` Paul wrote:

Therefore I strongly support the idea that compare-and-swap be required but nothing more general. I would actually go so far as to discourage documenting OPTIONAL atomics in the spec text because this would encourage writing of non-portable code.

What I *would* encourage is that vendors providing extensions to the atomics (more operations, more types, support for "private", etc) all agree OUTSIDE OF THE SPEC on the "spelling" of their extensions. This paves a smooth(er) path to their later addition to the spec, and eases their use.

If you're going to go that far, I'd say let's just abandon the AMOs in the spec altogether. Putting only compare-and-swap in the spec would encourage users to only use compare-and-swap in portable codes. So, for example, if they needed to do an atomic fetch-and-add in a portable fashion (say, to atomically reserve array elements...), they'd need to do something like:

do {
    old = last;
    new = old + reservation_size;
} while( upc_amo_cas( &last, old, new ) != old );

This is going to perform terribly on most systems, particularly in the presence of contention, which will only get worse as you scale up the number of threads. The point of adding atomics to the spec is to make codes run faster, not slower. ```
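For comparison, here are both ways of reserving array elements in plain C11, as an analogy to the UPC calls under discussion. The function names are hypothetical. The compare-and-swap version may retry an unbounded number of times under contention, while the fetch-and-add version always completes in a single atomic operation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Reserve n elements using only compare-and-swap. Every failed attempt
 * means another thread won the race, and this thread must try again. */
static long reserve_with_cas(_Atomic long *last, long n)
{
    assert(last != NULL);
    long old = atomic_load_explicit(last, memory_order_relaxed);
    while (!atomic_compare_exchange_weak_explicit(
               last, &old, old + n,
               memory_order_relaxed, memory_order_relaxed)) {
        /* old now holds the value another thread installed; retry. */
    }
    return old;  /* index of the first reserved element */
}

/* Reserve n elements with a native fetch-and-add: one operation, no retry. */
static long reserve_with_fadd(_Atomic long *last, long n)
{
    assert(last != NULL);
    return atomic_fetch_add_explicit(last, n, memory_order_relaxed);
}
```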

Reported by `sdvormwa@cray.com` on 2012-06-19 18:54:24

2012-06-19T18:54:24+00:00

Former user Account Deleted

``` Perhaps we could add a query function(macro? intrinsic?) that could be used to figure out which amos an implementation supports. Then users could do something to the effect of:

if ( UPC_AMO_SUPPORTED( UPC_OP_FADD, UPC_TYPE_LONG ) ) {
    myidx = upc_amo_fadd( &last, reservation_size );
} else if ( UPC_AMO_SUPPORTED( UPC_OP_CAS, UPC_TYPE_LONG ) ) {
    do {
        old = last;
        new = old + reservation_size;
    } while( upc_amo_cas( &last, old, new ) != old );
    myidx = old;
} ```
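A run-time flavor of such a query could look like the following C11 sketch. Everything here is hypothetical: the enums, the capability table, and the `amo_supported` function are illustrations of the idea, not part of any existing UPC implementation. A real runtime would presumably fill in the table from the hardware at startup.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { AMO_OP_FADD, AMO_OP_CAS, AMO_OP_COUNT } amo_op_t;
typedef enum { AMO_TYPE_INT32, AMO_TYPE_INT64, AMO_TYPE_COUNT } amo_type_t;

/* Hypothetical capability table: which (op, type) pairs are native. */
static bool amo_table[AMO_OP_COUNT][AMO_TYPE_COUNT] = {
    [AMO_OP_FADD] = { [AMO_TYPE_INT64] = true },
    [AMO_OP_CAS]  = { [AMO_TYPE_INT64] = true },
};

/* Run-time query, so one executable can adapt to different platforms. */
static bool amo_supported(amo_op_t op, amo_type_t type)
{
    return amo_table[op][type];
}

/* Reserve n slots using the best operation available; -1 if none is. */
static int64_t reserve_slots(_Atomic int64_t *last, int64_t n)
{
    assert(last != NULL);
    if (amo_supported(AMO_OP_FADD, AMO_TYPE_INT64))
        return atomic_fetch_add_explicit(last, n, memory_order_relaxed);
    if (amo_supported(AMO_OP_CAS, AMO_TYPE_INT64)) {
        int64_t old = atomic_load_explicit(last, memory_order_relaxed);
        while (!atomic_compare_exchange_weak_explicit(
                   last, &old, old + n,
                   memory_order_relaxed, memory_order_relaxed)) {
            /* old was refreshed with the current value; retry. */
        }
        return old;
    }
    return -1;
}
```

Because the check happens at run time, the same binary takes the fetch-and-add path on one system and the compare-and-swap path on another.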

Reported by `sdvormwa@cray.com` on 2012-06-19 19:12:00

2012-06-19T19:12:00+00:00

Former user Account Deleted

``` Steven wrote in comment #33:

Paul wrote:

Therefore I strongly support the idea that compare-and-swap be required but nothing more general.

[...] If you're going to go that far, I'd say let's just abandon the AMOs in the spec altogether.

Sorry if I was unclear about what I was objecting to.

I DO WANT the "Level 2" fetch-and-op AMO's, such as the "upc_amo_fadd" in Steven's example. I DO NOT WANT George's "Level 3" COMPARE-fetch-op for op != "set" [see comment 10 for Level 1,2,3 descriptions]

Now that I think more about it, I actually don't see how "compare-fetch-OP" is more useful than compare-and-swap. Specifically, if the OP is going to take place only if the comparison is TRUE, then I must have KNOWN the previous value and could have used compare-and-swap having computed the OP against the KNOWN previous value. So, I guess I've just debunked my own original implementability argument against these ops, and replaced it with a they-are-just-syntactic-sugar argument.

```
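The syntactic-sugar argument can be sketched in a few lines. The following uses C11 atomics on a single node as a stand-in for UPC AMOs; `compare_fetch_add` is a hypothetical name for a "Level 3" compare-fetch-add, shown only to illustrate that it reduces to one compare-and-swap whose new value is computed from the expected value.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical compare-fetch-add: add operand to *target only if
 * *target == expected. Returns the value observed in *target. */
long compare_fetch_add(_Atomic long *target, long expected, long operand)
{
    assert(target != NULL);
    long seen = expected;
    /* The caller already knows the value the op would be applied to,
     * so the result (expected + operand) can be computed up front. */
    atomic_compare_exchange_strong(target, &seen, expected + operand);
    /* On success seen == expected; on failure seen holds the current value. */
    return seen;
}
```

The same reduction applies to any other OP: compute `OP(expected, operand)` locally and pass it as the new value of the compare-and-swap.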

Reported by `phhargrove@lbl.gov` on 2012-06-19 19:34:39

2012-06-19T19:34:39+00:00

Former user Account Deleted

Reported by `gary.funck` on 2012-07-03 18:07:50 - Labels added: Type-Lib-Required - Labels removed: Type-Enhancement

2012-07-03T18:07:50+00:00

Former user Account Deleted

Reported by `gary.funck` on 2012-07-03 18:10:25

2012-07-03T18:10:25+00:00

Former user Account Deleted

``` Cray UPC supports the following AMO extensions. c = currently supported, f = supported on future hardware. If supported, both a fetching and a non-fetching version exist. (For the bitwise ops, type doesn't really matter, so you can typecast fp32 and fp64 pointers to int32 and int64 pointers and use the integer AMO extensions, but we don't _directly_ provide them.)

                  int32  int64  fp32  fp64
add                 f      c     f     f
and                 f      c
and-with-xor        f      c
compare-and-swap    f      c
min                 f      f     f     f
max                 f      f     f     f
swap                f      c
or                  f      c
xor                 f      c

Observations:

1) Existing Cray network hardware does not support atomic 32-bit, floating-point, or min/max operations. Cray UPC does not support unsigned integer AMOs, whereas BUPC does. Therefore, I strongly believe that there needs to be a query mechanism, suggested in Comment #34, for users to figure out if an AMO is supported. It is NOT acceptable to say that an implementation must use software to emulate the operations that it does not support in hardware because in order to keep the operations atomic with respect to each other it would be necessary to implement them all in software, negating any benefit from the hardware. Furthermore, I think the query mechanism needs to be a function call because otherwise it will not be possible to compile a code once and run the same executable on two different platforms that differ only in the supported flavors of AMOs.

2) Providing entry points for {types} x {operations} x {fetching} is unwieldy for everyone...users, implementers, specification writers. It gets even worse if you add {blocking/non-blocking} to the mix.

Straw Proposal:

/* Returns 1 if the specified AMO is supported or 0 otherwise. /fetching/ is non-zero to request a fetching AMO.
   /type/ and /op/ specify the data type and operation to be performed. */
int upc_amo_exists( int fetching, upc_amo_type_t type, upc_amo_op_t op );

/* Atomically performs operation /op/ on the memory pointed to by /target/. The data type of the operation is specified by /type/.
   If /fetched/ != NULL, then the previous value is fetched and stored in the memory pointed to by /fetched/.
   Operands for the operation are pointed to by /operand1/ and /operand2/; /operand2/ may be NULL for some operations.
   Warning: Operations are not guaranteed to be atomic with respect to non-UPC AMO operations. */
void upc_amo( void* fetched, upc_amo_type_t type, upc_amo_op_t op, shared void* target, void* operand1, void* operand2 );

Example:

shared long x;
upc_lock_t *x_lock;  /* assumed to be allocated elsewhere, e.g. with upc_all_lock_alloc() */
...
if ( upc_amo_exists( 0, UPC_AMO_TYPE_LONG, UPC_AMO_OP_ADD ) ) {
    long one = 1L;
    upc_amo( NULL, UPC_AMO_TYPE_LONG, UPC_AMO_OP_ADD, &x, &one, NULL );
} else {
    upc_lock( x_lock );
    x += 1L;
    upc_unlock( x_lock );
}
```
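For readers who want to experiment with the shape of this interface, here is a single-node mock built on C11 atomics. Everything in it is an assumption made for illustration: the names `amo_exists`, `amo`, `AMO_TYPE_*` and `AMO_OP_*` are hypothetical, the "supported" set is arbitrary, there is only one operand, and `amo` returns an error code where the straw proposal returns `void`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef enum { AMO_TYPE_LONG, AMO_TYPE_DOUBLE } amo_type_t;       /* hypothetical */
typedef enum { AMO_OP_ADD, AMO_OP_SWAP, AMO_OP_MIN } amo_op_t;    /* hypothetical */

/* Pretend capability table: only add and swap on long are "in hardware". */
int amo_exists(int fetching, amo_type_t type, amo_op_t op)
{
    (void)fetching;  /* this mock supports fetching and non-fetching forms alike */
    return type == AMO_TYPE_LONG && (op == AMO_OP_ADD || op == AMO_OP_SWAP);
}

/* Generic entry point: one function instead of {types} x {ops} x {fetching}.
 * Returns 0 on success, -1 if the requested AMO is not supported. */
int amo(void *fetched, amo_type_t type, amo_op_t op,
        void *target, const void *operand1)
{
    assert(target != NULL && operand1 != NULL);
    if (!amo_exists(fetched != NULL, type, op))
        return -1;
    _Atomic long *t = target;
    long v = *(const long *)operand1;
    long prev = (op == AMO_OP_ADD) ? atomic_fetch_add(t, v)
                                   : atomic_exchange(t, v);
    if (fetched != NULL)
        *(long *)fetched = prev;
    return 0;
}
```

The run-time query is what lets one executable adapt to platforms with different AMO support, at the cost of a dispatch on every call.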

Reported by `johnson.troy.a` on 2012-07-25 18:55:57

2012-07-25T18:55:57+00:00

Former user Account Deleted

``` I haven't been following the details of the proposed AMO library, but am wondering whether operations (esp. compare-and-swap) on 128 bit data types are planned/proposed?

As a use case, consider a double buffering scheme with two buffer pointers, where it is convenient and efficient to swap the pointers atomically when switching buffers. Here, the buffer indexes might be two 64-bit indexes, or perhaps two PTS's represented in a 64-bit packed format.

Or simply, swapping a single fully general PTS which is represented as a 128-bit quantity on 64-bit targets.

```

Reported by `gary.funck` on 2012-08-03 05:20:35

2012-08-03T05:20:35+00:00

Former user Account Deleted

``` I don't like the idea proposed in comment #38 of query functions for which AMOs are supported. It complicates the user code, but more importantly just passes the buck of lowered performance to the application. Specifically, instead of the runtime emulating the required AMO in software, now that emulation is happening in the user application (where it's likely to be slower and more error-prone). With this approach, any portable UPC app using the AMO's would also need to fold in code for a fully software implementation of every AMO it uses and switch to that implementation if any of the queries fail. This defeats the code-factorization goal of having a library.

It seems better to restrict the AMO's to a core subset that all UPC implementations must support, and choose that subset wisely to allow hardware implementation on platforms of interest.

It's worth noting that a 64-bit compare-and-swap is sufficient to implement EVERY operation in the current AMO proposal (including 32-bit, unsigned, min/max, and float/double), although it implies an additional read and a possible retry under heavy contention (which should still be significantly cheaper than a fully software implementation using upc_locks). Since Cray supports that operation in hardware, why not use that to perform the operations lacking direct hardware support? (Note I'm proposing this rewriting be done within the implementation, rather than in the UPC program as proposed in comment #34).

```
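As an illustration of that claim, the sketch below builds an atomic min and an atomic double-precision add out of nothing but a 64-bit compare-and-swap. It uses C11 atomics on a single node as a stand-in for a network CAS; the function names `fetch_min_i64` and `fetch_add_f64` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Atomic min on a signed 64-bit integer; returns the previous value. */
int64_t fetch_min_i64(_Atomic int64_t *target, int64_t v)
{
    assert(target != NULL);
    int64_t old = atomic_load(target);           /* the additional read */
    /* Stop as soon as the stored value is already <= v, or the CAS lands. */
    while (v < old && !atomic_compare_exchange_weak(target, &old, v)) {
        /* old was refreshed by the failed CAS; re-check and retry */
    }
    return old;
}

/* Atomic add on a double held in a 64-bit word; returns the previous value. */
double fetch_add_f64(_Atomic uint64_t *target, double v)
{
    assert(target != NULL);
    uint64_t old_bits = atomic_load(target), new_bits;
    double old;
    do {
        memcpy(&old, &old_bits, sizeof old);      /* reinterpret bits as double */
        double sum = old + v;
        memcpy(&new_bits, &sum, sizeof new_bits);
    } while (!atomic_compare_exchange_weak(target, &old_bits, new_bits));
    return old;
}
```

Each failed CAS costs another round trip, so the cost of this emulation grows with contention on the target word.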

Reported by `danbonachea` on 2012-08-03 12:10:55

2012-08-03T12:10:55+00:00

Former user Account Deleted

``` "It's worth noting that a 64-bit compare-and-swap is sufficient to implement EVERY operation in the current AMO proposal (including 32-bit, unsigned, min/max, and float/double), although it implies an additional read and a possible retry under heavy contention (which should still be significantly cheaper than a fully software implementation using upc_locks). Since Cray supports that operation in hardware, why not use that to perform the operations lacking direct hardware support?"

Because it is NOT significantly cheaper than a fully software implementation using upc_locks. It's fine if you have a couple dozen threads, but once the number of threads goes beyond a certain point, the network (and more importantly, bus) contention degrades performance way past that of a scalable lock algorithm. This gets even worse as you add more and more threads (cores) to a single network endpoint. Using compare-and-swap is only a tolerable workaround in the absence of contention. ```

Reported by `sdvormwa@cray.com` on 2012-08-03 14:48:23

2012-08-03T14:48:23+00:00

Former user Account Deleted

``` Here's my attempt at a summary of today's discussion on the telecon.

Points of Motivation:
- UPC users want high-performance AMOs.
- A UPC library should be robust (over types and operations), portable, and vendor-independent.

The current proposal (as of SVN r61) would restrict some vendors into implementations that /may/ limit performance. Also, there is / may be interest in expanding the types and operations specified by the current proposal.

Proposed Solutions (as I saw and recall them):

1) Provide a standard interface to the AMOs (e.g., upc_amo() in Comment #38) that supports a robust range of types and operations. Also, provide a function that allows a user to specify which types and operations the user will call, from which the library determines whether it will use a software-based or hardware-based implementation. Thus, all the possible AMOs are always available to the user; however, the implementation may use hardware acceleration either by default (if all types/operations are supported) or based on the user's hint. The hinting function would likely need to be called before any AMO calls are made in order for the implementation to choose the right "mode".

2) Expanding on (1), the hinting function can also accept (in addition to desired types and operations), some provided use-case parameter (e.g., high throughput or low latency), from which the library may select one of potentially-many possible implementations (either in hardware or software).

3) Expanding on (2), the hinting function is replaced by an atomicity-domain function that returns a handle for AMOs that supports a user-specified set of types and operations for a particular use-case. AMOs would only be guaranteed to be atomic with respect to multiple calls using the same handle. Atomicity would not be guaranteed for AMO calls using separate handles.

For (1-3), I think it was expressed that it could be good to have:
- A query function that can say what the hardware is capable of supporting (so that a user may restrict his type/operation choice to guarantee hardware support)
- A priority function in which a user could specify types, operations, and a use-case of varying priority such that the implementation chooses the best AMO implementation and specifies the supported types and operations.

My understanding of (1) and (2) is that they would both set the atomicity mode (i.e., hardware or some software implementation) at the hinting call (or elision thereof) and it would be henceforth fixed. This is problematic for libraries that may use AMOs not specified by the hinting call or that may make a hinting call before the user's call. I think that (3) is the only library-friendly (or library-general) approach.

I think the query function definitely makes sense, however, I'm not really sure I understand how one would effectively use the priority function. It would seem that this could lead back to heavily-IFDEF'd code (if the priority function selects an implementation that doesn't support the lower-priority types/operations), which I think would be avoided without it.

It is my preference (and I think this was part of all three proposed solutions) that the AMOs can operate on ALL shared addresses. Any restriction of use would be at the user's discretion and NOT be specified by either the hinting or handle functions.

Ideally, I think it would be good for the less performance-minded users if there was either a default handle that supported all specified types and operations in an implementation-defined way or some default parameter that returns a "universal" handle.

Please add or correct anything here that I may have misrepresented (and please respond with your comments and feedback!).

From this discussion, I'll draft up another AMO proposal, which will be posted for comments and a follow-up telecon discussion (for whoever is interested). ```

Reported by `nspark.work` on 2012-08-03 22:17:53

2012-08-03T22:17:53+00:00

Former user Account Deleted

``` All "brand new" library proposals are targeted for starting in the "Optional" library document. Promotion to the "Required" document comes later after at least 6 months residence in the ratified Optional document, and other conditions described in the Appendix A spec process. ```

Reported by `danbonachea` on 2012-08-17 17:53:59 - Labels added: Type-Lib-Opt - Labels removed: Type-Lib-Required

2012-08-17T17:53:59+00:00

Former user Account Deleted

``` Set default Consensus to "Low". ```

Reported by `gary.funck` on 2012-08-19 23:26:19 - Labels added: Consensus-Low

2012-08-19T23:26:19+00:00

Former user Account Deleted

``` At the last call, Bill asked for someone to pick a "big issue" for discussion at the next call. Considering the importance of AMOs to many of the UPC users, and the very undecided state that it was left in after the call-before-last, Gary, Yili, and I thought it would be good to discuss AMOs on the next call. Here is something that I hope will re-light the fire of discussion.

Based on my summary of the last discussion in Comment 43, I propose the following based on Option 3 with Query (but no Priority):

# A Proposed Usage Scenario #

A user creates an atomicity domain object by specifying a set of operations, a set of types, and an implementation mode. This object is a handle to some AMO implementation. Depending on the specified mode, the implementation may be in hardware or software.

A user makes AMO calls as either upc_amo_relaxed() or upc_amo_shared() that otherwise look very much like Troy's upc_amo() proposal in Comment 38. Atomicity is only guaranteed for accesses using the same domain.

The library also provides upc_amo_query() so that a user can test whether a set of operations and types is supported for a given mode. There will be a default mode that supports all of a Spec-specified set of ops and types.

# A Bit More Detail #

Define a type (e.g., upc_amo_domain_t) to represent an atomicity domain, which specifies a set of operations and datatypes over which access to a memory location in a given synchronization phase is guaranteed to be atomic if and only if no other mechanisms or atomicity domains are used to access the same memory location in the same synchronization phase.

Define a type (e.g., upc_amo_mode_t) that a user can use to indicate the AMO implementation "mode" desired with the following acceptable constants UPC_AMO_DEFAULT, UPC_AMO_HARDWARE, UPC_AMO_LATENCY, UPC_AMO_BANDWIDTH.

UPC_AMO_DEFAULT would be an implementation-defined default mode that will support ALL of a specified set of types and operations for AMOs. This would almost surely be a software-based implementation. Using this mode, AMOs would always be portable, but not necessarily high performance.

UPC_AMO_HARDWARE would force the use of hardware-supported AMOs. It is likely that every implementation would vary in the set of types and operations supported for AMOs under this mode. I suspect that users who use this mode would do so out of performance reasons and would not, in general, expect cross-platform compatibility.

UPC_AMO_LATENCY and UPC_AMO_BANDWIDTH would indicate a user preference for low-latency or high-bandwidth (is "throughput" a better term here?) atomics, respectively. IIRC, on the last AMO-centric call, either Steven or Troy noted at least a few times that a user favoring high throughput of atomic memory accesses may not necessarily want the hardware implementation.

Initially, I envisioned atomicity domain initialization (and destruction) as happening similarly to how it's done with UPC locks; so I had upc_amo_domain_alloc/free() functions. Yili (and a user) suggested making this a static initializer and strongly encouraging compiler optimizations. I think that this static initialization approach makes more sense.

# Issues or Problems That I Foresee #

This is definitely more complex than what most of my users would like to see. They would most likely only need a short list of operations on 64-bit integer types and only ever use one domain.

In the scheme above, what happens with the static initialization call when a user specifies the hardware-supported mode, but specifies types or operations not supported by the hardware?

Code Portability: Performance-minded users might likely use the UPC_AMO_HARDWARE mode, however, this code would almost surely NOT be portable. One user strongly encouraged identifying a common subset of types/operations that could be supported across the implementation space (but there were concerns, at least initially in our discussions, that this intersection is the empty set). Speaking to vendors' capabilities, Cray has posted what they support; IBM has expressed (I think) that they're quite flexible in support; and it's not clear to me what SGI, HP, and InfiniBand support as hardware atomics.

Implementer/Implementation Burden: This proposal likely requires quite a bit of work on the part of the implementers to implement software and, where applicable, hardware support, as well as a whole new API.

-----------------------

Hopefully this is something to get the discussion restarted. I have a few more thoughts and comments from users that I'll post tomorrow. ```

Reported by `nspark.work` on 2012-08-28 21:29:07

2012-08-28T21:29:07+00:00

Former user Account Deleted

``` I apologize for being late and stupidly responding to the email list instead of posting here. I will learn.

My comments regarding the last post are below. I tried to read everything said already but might have missed a few details. It is not my intent to repeat previously stated points.

1. For remote atomics, we need to be more explicit about what we mean by hardware vs. software. Here is a partial spectrum of options:

- a single packet injected into the network induces a hardware atomic operation in remote memory (example: Blue Gene/Q can do this, at least in one direction)
- an active message that executes atomically in software on the remote side does an atomic instruction in hardware (example: Blue Gene/P DCMF_Send w/ handler that does lwarx+stwcx)
- an active message that does not execute atomically in software grabs a lock, does the update, then releases the lock
- three active messages acquire a remote lock, perform the update, then release the lock

I'm not saying that the good options here are portable or that the bad ones made sense in any situation, but I hope to convince people that our terminology is inadequate.

2. UPC_AMO_HARDWARE should not be an option. It is meaningless to talk about hardware or software implementations on their own. What a user wants to know is fast or slow for a particular usage. Whether or not the implementation uses hardware for that shouldn't be visible to the user.

I'd like to know the use-case where the user specifying UPC_AMO_LATENCY or UPC_AMO_BANDWIDTH is actually going to help. Shouldn't the compiler have enough semantic information already to know whether or not to dump a bunch of nonblocking atomics into the network and flush them all at once versus using a blocking atomic operation and waiting on the round-trip?

3. Regarding any software-based implementations of AMOs, I don't see how portable and slow has any benefit over the user doing atomics themselves with the existing UPC lock functionality.

Furthermore, while it's rather ambiguous what is meant by a software implementation, I would love to know what reasonable architecture can do remote atomics as active messages faster than it can in hardware, just from a network architecture curiosity perspective, as noted previously by someone on a call that I wasn't part of.

4. Regarding Yili's desire for static allocation only...

Can someone describe how the compiler can make use of this? I'm not aware of any architecture where statically allocating memory for remote atomics has any performance benefit. Requiring static allocation is really unpleasant from a usability perspective. It reminds me of FORTRAN77, which is hardly a language we want to hold in high esteem, except perhaps for writing math libraries.

5. Here is my proposal:

Why not just define upc_atomic_int_t without specifying the size? The header can define UPC_ATOMIC_INT_MAX appropriately. On a 32b system like BGP, UPC_ATOMIC_INT_MAX=2^31 while on BGQ or anywhere else that is x86_64 or IA64, it's going to be >2^50 (One can imagine a few bits being reserved to enable hardware atomics). The operations that are provided by reasonable hardware have already been discussed at length.

If these and only these operations were supported, UPC would feel a lot like C, which seems like a reasonable guiding principle. Using upc_atomic_int_t solves all of the major issues with 32b vs. 64b. I think that supporting 128b atomics is not a good idea because it's getting ahead of where the hardware is and forces a slow implementation in software.

I really don't understand why UPC can't support just these operations and let the user implement others in software on their own. The performance hit associated with forcing software emulation of simple operations when hardware exists is far greater than the benefit of allowing non-simple operations. ```

Reported by `jeff.science` on 2012-08-30 15:43:03

2012-08-30T15:43:03+00:00

Former user Account Deleted

``` Here are some inlined comments/responses to both Comment 47 and Jeff's previous email, which had a few additional comments.

Jeff said (in Comment 47):

1. For remote atomics, we need to be more explicit about what we mean by hardware vs. software. Here is a partial spectrum of options: [...]

I'm sure this only illustrates my overly-simplified understanding of things, but I see the options you present as representing one "hardware" implementation (to me, this means that the NIC does the work) and three "software" solutions (the exact implementation of which shouldn't be specified by the UPC Spec). In general, I think that my users want to be able to get to the NIC's atomic capabilities.

Jeff said (in his initial email):

So we're talking about atomics that exceed the atomicity of the particular hardware, thereby forcing a general implementation to use software despite the fact that all reasonable supercomputing hardware provides a reasonable set of remote hardware atomics or at least enables something close to them, with the caveat that not all capability is available on all systems (e.g. BGP can only do 32b atomics in hardware, while DMAPP only exposes 64b atomics).

I think the caveat you present is the basis for the challenge in defining a UPC atomics specification. From the last call and the discussion on Issue 7, I think it's reasonably clear that the goals of (1) a "robust" AMO library that supports a wide range of types and operations and (2) a "hardware-accelerated" (again, I see this as the NIC doing the work) generally pull in opposition to each other.

I think you are misunderstanding the intent of the "atomicity domain" (or I did a terrible job explaining my intent). The use of atomicity domains is intended such that a user is *not* forced into a low-performance software implementation, but has the option to use the network hardware to accelerate shared-pointer AMOs if the requested types/operations are supported by the hardware.

Jeff said (email):

I disagree that trying to use hardware as much as possible leads to lack of portability. There is a set of atomics that are widely portable. One need only inspect Intel TBB's atomics or GCC atomic extensions for examples. TBB atomics are not just available on Intel processors. We've ported TBB to PPC32, PPC64 and POWER7 and I believe my collaborator who did the PPC ports previously did DEC Alpha, Sparc and a whole bunch of weirdo processors with non-x86 memory models.

My interest in the/a UPC AMO specification is to have AMOs on UPC shared pointers. I think the field of network hardware support for atomics is a little less even. In the long term, I would love to see a standard base of network atomics supported by all the major HPC vendors. The intersection may be non-empty at present, but specifying the minimal common set doesn't quite satisfy the interested parties that want to also provide a robust AMO API in UPC.

Jeff said (email):

Does anyone know of hardware that does not support atomic load, store, fetch-and-add and compare-and-swap for either 32b or 64b integers? Do we want features beyond these?

I definitely want to see fetch-and-{and, or, xor} added to your list. Otherwise, I don't see the point in defining an AMO spec. I also want to see the non-fetch versions; i.e., atomic-{add, and, or, xor}. You've already noted that DMAPP only exposes 64-bit atomics. I would be okay with only 64-bit atomics, but I'm not sure everyone else would.

Jeff said (Comment 47):

4. Regarding Yili's desire for static allocation only... [...]

It's funny how perspectives differ between users. One of my users said he *wouldn't* want to dynamically allocate domains. He wants a statically-allocated domain in a library he'll use across his applications that supports exactly what he wants on the hardware he uses.

I'm not a compiler person, so I probably make too many assumptions about what magic can happen behind the scenes, however, I can imagine a scenario in which the compiler, after seeing that only one "hardware-only" domain is used, does something like replace the upc_amo() calls with inlined calls to the network API. I suspect this would provide some performance benefit on some systems where pass-through function calls might be costly, especially if one was making a lot of atomic accesses.

Jeff said (Comment 47):

Why not just define upc_atomic_int_t without specifying the size? The header can define UPC_ATOMIC_INT_MAX appropriately. On a 32b system like BGP, UPC_ATOMIC_INT_MAX=2^31 while on BGQ or anywhere else that is x86_64 or IA64, it's going to be >2^50 (One can imagine a few bits being reserved to enable hardware atomics). The operations that are provided by reasonable hardware have already been discussed at length.

I don't think that restricting an atomic datatype to something less than the full 64-bits is a good idea. I'm not pushing for 128-bit atomics, but I think 64-bit and 32-bit atomics make sense. If your solution is to give users access to 50 bits of a 64-bit integer (and you seem to feel strongly about exposing hardware performance), how does the NIC properly handle bitwise operations on only the "user bits"? And how much of an application already using a vendor's extensions for 64-bit AMOs would have to be rewritten to only use the "user bits." I don't see how your solution /wouldn't/ force a software implementation.

Jeff said (Comment 47):

I really don't understand why UPC can't support just these operations and let the user implement others in software on their own. The performance hit associated with forcing software emulation of simple operations when hardware exists is far greater than the benefit of allowing non-simple operations.

Again, by creating a domain with the UPC_AMO_HARDWARE flag, the intent is that one would *not* be forced into a software implementation, but be explicitly selecting a hardware-only implementation. The last thing that I want is to force software emulation of simple AMOs in all cases. But I also appreciate the view that a UPC AMO library could benefit from providing a vendor-independent, wide base of types/ops for users who might not demand the highest level of performance.

Jeff said (email):

Another issue here is ordering. What's the ordering requirement in UPC? So I dump 100 atomics into the network targeting the same remote address. Are they going to complete in-order? That's probably a much bigger performance hit on Cray Gemini or PERCS than anything entailed by latency vs. bandwidth. Forgive me for being a UPC noob if there's something in the language spec. about ordering of load-store already, but I would be surprised if UPC can say more than C, which is going to depend on the architecture for ordering semantics. Requiring order in remote atomics when load-store aren't necessarily ordered on various processors seems unreasonable.

Ordering would necessarily be controlled by whether one makes a relaxed atomic access or a strict atomic access. I expect that my users would almost only ever use relaxed atomic accesses, so no ordering of the atomics would be expected (until one uses a upc_fence). ```

Reported by `nspark.work` on 2012-08-30 17:05:01

2012-08-30T17:05:01+00:00

Former user Account Deleted

``` Not sure if anyone else has looked at it for inspiration yet, but the way that C++11 provides atomics is that there are separate atomic types (e.g., std::atomic<int>) and each type provides an is_lock_free() member function to query whether AMOs on that type are implemented using locks or not. An implementation could have is_lock_free() return true for an int type but false for a long type. The fact that one uses locks and the other doesn't use locks is not an issue because int and long are essentially in separate atomicity domains, to use the terminology of our recent discussion. It differs from our atomicity domain concept in that is_lock_free() covers all required atomic operations on the type, so if there is hardware support for an atomic add but an atomic xor requires a lock, then the implementation would report that the atomic type is not lock free and it would use locks for both. Thus, the user that cares only about fast atomic adds is out of luck. The idea of creating an atomicity domain in UPC and specifying which operations actually will be used in the program is one way out of this problem, but it is more complicated. So I'm undecided whether a UPC atomicity domain concept should cover just the type or the type and the possible operations. ```
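C11 offers the same per-type query as the C++11 facility described above, through `atomic_is_lock_free()` and the `ATOMIC_*_LOCK_FREE` macros in `<stdatomic.h>`. The sketch below is only an illustration of that style of query; the wrapper names (`int_amos_lock_free`, `llong_amos_lock_free`, `require_fast_int_amos`) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Per-type queries: the answer covers ALL atomic operations on the type,
 * not individual operations such as add or xor. */
bool int_amos_lock_free(void)
{
    _Atomic int probe = 0;
    return atomic_is_lock_free(&probe);
}

bool llong_amos_lock_free(void)
{
    _Atomic long long probe = 0;
    return atomic_is_lock_free(&probe);
}

/* A performance-minded program could refuse to run on a lock-based fallback. */
void require_fast_int_amos(void)
{
    assert(int_amos_lock_free());
}
```

Because the query is per type, a program that only needs a fast atomic add has no way to ask about that one operation, which is the gap the atomicity-domain idea tries to close.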

Reported by `johnson.troy.a` on 2012-09-06 17:59:22

2012-09-06T17:59:22+00:00

Former user Account Deleted

``` This is in response to Jeff's comment 47: "4. Regarding Yili's desire for static allocation only...

Can someone describe how the compiler can make use of this? I'm not aware of any architecture where statically allocating memory for remote atomics has any performance benefit. Requiring static allocation is really unpleasant from a usability perspective. It reminds me of FORTRAN77, which is hardly a language we want to hold in high esteem, except perhaps for writing math libraries."

YZ: I guess there is a confusion between Atomic Domains and Atomic Variables. And this is mostly because we haven't formally defined what is an Atomic Domain yet :-). Nick had a draft document about the definition of Atomic Domains and the related API but it's not published yet.

The suggested static allocation/initialization is only for Atomic Domains. An atomic domain is not an atomic variable nor an address to such a variable. Roughly speaking, an atomic domain defines a domain in which the atomic operations are atomic with respect to each other in the same domain.

The motivation of Atomic Domains was inspired by some observations about practical usage of atomic operations and earlier discussions on the topic.

1) There can be more than one hardware component in a compute node that can access/modify the same memory. An example would be a Cray XK6 node with a Gemini NIC and a GPU -- the CPU, the NIC (connected by HyperTransport) and the GPU (connected by PCI) all support "hardware atomic operations" to the host memory but the atomic operations issued from one component are not atomic with respect to those from other components.

2) In many practical cases, an app only needs a particular atomic op type in a small region of code or data. For example, one may need compare-and-swap when implementing a lock-free queue. Another may need fetch-and-add for implementing a Particle-In-Cell code. But it's uncommon to apply both compare-and-swap and fetch-and-add to a single memory location.

Therefore, we attempt to provide a mechanism (through Atomic Domains) for users to specify the intended usage pattern of atomic ops and thus give the implementation more information and opportunities to optimize performance.

Proposed API for Atomic Domains (by Nick):

upc_domain_t *upc_global_domain_alloc(upc_op_t ops, upc_type_t types, upc_amo_mode_t mode);

upc_domain_t *upc_all_domain_alloc(upc_op_t ops, upc_type_t types, upc_amo_mode_t mode);

void upc_domain_free(upc_domain_t *ptr);

void upc_amo_strict(upc_domain_t *domain, void *fetch_ptr, upc_op_t op, upc_type_t type, shared void *target, void *operand1, void *operand2);

void upc_amo_relaxed(upc_domain_t *domain, void *fetch_ptr, upc_op_t op, upc_type_t type, shared void *target, void *operand1, void *operand2);

What I suggested for static allocation/initialization for Atomic Domains is something like:

upc_domain_t global_domain = DOMAIN_INITIALIZER(UPC_CSWAP, UPC_INT64 | UPC_UINT64);

This permits but doesn't require compiler optimization because the intended atomic usage pattern is known at compile time. It also makes common usage easier to write (no worry about alloc and free).

```

Reported by `yzheng@lbl.gov` on 2012-09-06 18:08:55

2012-09-06T18:08:55+00:00
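
For readers who want something concrete, here is a minimal sketch in plain C11 of the idea behind a statically initialized domain: the domain is only a description of which operations and types will be used, and each AMO call is checked against it. Everything prefixed `dom_`/`DOM_` is hypothetical (the thread does not define bit encodings or a struct layout), and an `_Atomic int64_t` stands in for a location in UPC shared space.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bitmask encodings; the proposal does not specify values. */
enum { DOM_CSWAP = 1 << 0, DOM_ADD = 1 << 1 };
enum { DOM_INT64 = 1 << 0, DOM_UINT64 = 1 << 1 };

/* A domain here is just the declared usage pattern: ops x types. */
typedef struct {
    unsigned ops;
    unsigned types;
} dom_domain_t;

/* Static initializer, in the spirit of DOMAIN_INITIALIZER(ops, types). */
#define DOM_DOMAIN_INITIALIZER(ops_, types_) { (ops_), (types_) }

/* True when the domain was declared to cover this op/type pair. */
static bool dom_permits(const dom_domain_t *d, unsigned op, unsigned type)
{
    return (d->ops & op) == op && (d->types & type) == type;
}

/* Fetch-and-add that refuses to run outside the declared usage pattern.
 * Returns false (and leaves *target untouched) when the domain does not
 * cover ADD on INT64. */
static bool dom_fetch_add_i64(const dom_domain_t *d, _Atomic int64_t *target,
                              int64_t operand, int64_t *fetched)
{
    if (!dom_permits(d, DOM_ADD, DOM_INT64))
        return false;
    int64_t old = atomic_fetch_add(target, operand);
    if (fetched != NULL)
        *fetched = old;
    return true;
}
```

Because the masks are compile-time constants, a compiler could in principle specialize calls on such a domain, but nothing in this sketch requires it.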

Former user Account Deleted

``` What seems to be shaping up is a situation analogous to MPI communicators and MPI collective implementation choices. Atomic domains are in some sense similar to MPI communicators - certainly from a syntax point of view. Building on the analogy:

Have we decided what happens when the same shared pointer is used in two different atomic operations in two different domains? Are the results supposed to be defined?

(My answer: heck no, you are on your own)

Is there an ordering constraint between atomic operations executed on different domains?

(My answer: usual UPC ordering semantics regardless of domains - unlike MPI matching semantics w.r.t. communicators)

Picking a domain implementation policy is akin to picking a collective implementation in PAMI. The QoS description of each collective is hairy, and the matrix of "best implementation for a particular situation" is a complex one. Some examples that come to mind:

"-> Implementation 1 is HW only, requires no progress on the receiver, has reasonable latency and BW but has nonscalable behavior beyond 1000 tasks exerting pressure on the same pointer.
-> Implementation 2 is SW only, completely scalable to the limit of the system and with reasonable latency, but uses twice the bandwidth of implementation 1 and requires an active message handler to run on the thread affine to the pointer."

These kinds of implementations defy a short description. Facing a similar situation with collectives (accelerated vs. restricted to tori vs restricted to fixed point operations etc etc), the PAMI designers have opted for giving the users complete choice by providing a listing of names on demand, and allowing the user to make choices. In addition, "choice 0" is always the most general choice. Can we do something similar here?

```

Reported by `ga10502` on 2012-09-07 14:58:07

2012-09-07T14:58:07+00:00

Former user Account Deleted

``` As mentioned in an earlier comment by Troy, the C11 spec adds a complete AMO library that exposes functionality that is very similar to what we're trying to expose with our library. I think we need some high-level community discussion regarding whether C11 compatibility is ever a goal for UPC; but regardless of how that matter is resolved (and assuming C11 gains wide acceptance and implementation), I think whatever we come up with for AMOs will generate a very obvious question from every C11-savvy user. Namely, "Why doesn't UPC just support C11 AMO's with a pointer-to-shared?" and "Why can't I use C11 and UPC AMOs together on the same locations?". To play devil's advocate, a minimalistic way we could define AMO's for UPC would be to simply insert (shared _Atomic int) and (shared _Atomic long) into the atomic types specified in C11 7.17.6, and import that library wholesale. I'm NOT necessarily advocating that we should do this, but assuming that we intentionally disregard C11's <stdatomic.h> and define a whole new library with a wildly different interface, we should at least have a convincing "story" of why we diverged from an already-specified library with a nearly-identical target usage in our "base language" and motivate that doing so is necessary for UPC.

Towards this end I've done a careful audit of the C11 <stdatomic.h> and compiled a summary of interesting observations relative to the UPC AMO ideas floating around:

- C11 AMO's operate only on types specifically designed for atomic access, eg atomic_int and atomic_long, which can represent the same values as their base type, but are NOT guaranteed to have the same size or representation as their base type (eg they may include a lock data structure). They include all the signed and unsigned integral types and boolean, no floating point types.

- The atomic operations they support:
    - swap and compare-swap
    - fetch-and-... add, sub, bitwise or, bitwise xor, bitwise and
    - store (aka write) and fetch (aka read)
    - test-and-set and clear (only for a special atomic_flag boolean type)
    - static and dynamic init (non-atomic)

  Notably OMITS min/max.

- ALL the atomic operations are supported on ALL the atomic types (excepting non-meaningful combinations), although not necessarily guaranteed to be implemented in hardware

- Macros query whether all the AMOs on a particular TYPE are "lock free" (ie fast hardware implementation), where the answer can be "always", "never" or "location dependent"

- A runtime query function can ask whether all AMO's on a particular location are "lock free" (yes or no)

- The actual AMO calls are declared using C11's new type-generic facility (_Generic), so the function the user invokes does NOT contain the type name - they invoke a type-independent symbol like: result = atomic_fetch_add(&my_amo, 2) and the implementation magically does the "right thing" based on the type of my_amo. If UPC ever decides to adopt any part of C11, this simple feature could make both the AMO and collective library interfaces look much cleaner to the user.

- The "consistency" of the AMO's (what we would call strict and relaxed) is an optional final enum argument to each operation, where the default basically behaves like UPC strict (AMO includes acquire and release fences). This defines the ordering constraints of AMO's wrt other AMO's and unrelated operations. The enum defines six different consistency "modes" for the AMO, of which at most four seem to have distinct meanings for any given AMO (relaxed, acquire, release, acquire-and-release).

- AMO's atomicity is guaranteed at the actual memory location, and in particular still "works" even if you mmap the location into several places in virtual memory and invoke AMO's on all of them. I haven't heard anyone mention this issue wrt UPC AMOs, but it's one we need to somehow address.

- They include both "strong" and "weak" versions of compare-swap, where the latter may fail spuriously, eg on load-linked/store-conditional architectures. I *think* the issue here is that in cases with poor cache alignment it's possible to construct cases that livelock on these architectures if you blindly spin on LL/SC to implement a "strong" compare-swap on an unlucky location. I haven't heard anyone mention this issue wrt UPC AMOs, but it's one we may need to address.

- AMO's and regular C reads and writes do not conflict in memory because this is prohibited by the type system.

- Similarly, atomicity is on a typed basis - one cannot "carve up" an atomic long type using a cast or union and try to call AMOs on the constituent bytes. (We should probably prohibit that as well).

A few misc comments: The C11 AMO library provides most/all of the functionality we're trying to specify, although they may have chosen slightly different points in the requirements/guarantees design space than we'd have tailor-made for UPC. Which of these do we feel are important to change for our library?

IMHO the most significant divergence is C11 defines T and atomic_T as separate, incompatible types. How much does this matter to UPC programmers? C11 Implementations are "encouraged" (but not required) to define atomic_T to have the same size and representation as T, which would permit casting them to regular types for non-atomic access in a separate synchronization phase. If we decided this feature was sufficiently important to UPC programmers, we could presumably still adopt C11 AMO's in UPC and add this additional implementation restriction for all UPC implementations (ie change "encouraged" to "required").

The implementation "trick" that allows that representation restriction to work for types requiring software emulation is to stash the hidden lock datastructure(s) in the runtime system rather than in user memory space. This could be as stupid as a "One big lock" protecting all software-emulated AMO's (of a given type), or could be tuned for improved concurrency by introducing a (probably fixed-size) table of locks indexed by a hash on the AMO location address. The latter would need to hash on physical addresses to retain the mmap atomicity guarantee.

The other major way in which our current UPC discussion regarding AMOs differs is the introduction of more detailed performance queries and hints regarding the implementation of given AMO/type combinations. We should consider if it's possible to get a similar effect by adding some query/hint functions to what C11 specifies.

Incidentally, C11 also standardizes a thread library <threads.h> that is basically POSIX threads in a thin disguise. Dynamic fork/join parallelism in a fully-shared memory space, mutexes, condition variables, thread-local storage. The memory model is roughly UPC relaxed, with fences provided by the AMO's and sync ops in <threads.h>. Dynamic threading doesn't play particularly well with UPC features, but luckily C11 threading is an optional feature, so if we ever adopted C11 we could take the easy out and simply recommend UPC implementations define __STDC_NO_THREADS__, to disable it entirely.

```

Reported by `danbonachea` on 2012-09-07 16:48:47

2012-09-07T16:48:47+00:00
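
To make the audit above concrete, here is a short sketch of the C11 `<stdatomic.h>` features it mentions: the type-generic fetch-and-add, the optional memory-order argument, the per-type lock-free macro, and a weak compare-swap retry loop. The `c11_` function names are illustrative only; the `atomic_*` calls and `ATOMIC_LONG_LOCK_FREE` are standard C11. Since C11 omits min/max, `c11_fetch_max` shows one way a user would have to build it from compare-swap.

```c
#include <stdatomic.h>

/* Type-generic fetch-and-add; the default ordering is sequentially
 * consistent (roughly "strict" in UPC terms). */
static long c11_bump(atomic_long *counter, long delta)
{
    return atomic_fetch_add(counter, delta);
}

/* Same operation with the consistency mode passed explicitly. */
static long c11_bump_relaxed(atomic_long *counter, long delta)
{
    return atomic_fetch_add_explicit(counter, delta, memory_order_relaxed);
}

/* Compile-time query: 0 = never lock-free, 1 = sometimes, 2 = always. */
static int c11_long_lock_free_class(void)
{
    return ATOMIC_LONG_LOCK_FREE;
}

/* C11 has no atomic max, so emulate it with a weak compare-swap loop.
 * The weak form may fail spuriously, which the loop absorbs; on failure
 * 'seen' is refreshed with the current value. Returns the value observed
 * before any update. */
static long c11_fetch_max(atomic_long *target, long candidate)
{
    long seen = atomic_load(target);
    while (seen < candidate &&
           !atomic_compare_exchange_weak(target, &seen, candidate)) {
        /* retry with the refreshed 'seen' */
    }
    return seen;
}
```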

Former user Account Deleted

``` George: I think AMO domains are a lot more like MPI_Win than MPI_Comm. Your first example sounds the same as a memory location being in two different windows.

In any case, the conclusions do not change.

Regarding whether or not UPC should be like PAMI, the answer should be no for many reasons. If someone wants all the dials that PAMI has, then they should just use PAMI. You should be able to manually switch UPC implementation semantics in software any more than you should be able to switch between the x86 and PowerPC memory model in a C program.

Dan: I completely agree with your point that doing anything deviant from C11 requires a very strong justification. I certainly hope that UPC grows to be a superset of C11 (perhaps without optional features) some day.

Does anyone know what compilers (if any) implement C11 atomics? ```

Reported by `jeff.science` on 2012-09-07 17:05:13

2012-09-07T17:05:13+00:00

Former user Account Deleted

``` "- AMO's atomicity is guaranteed at the actual memory location, and in particular still "works" even if you mmap the location into several places in virtual memory and invoke AMO's on all of them. I haven't heard anyone mention this issue wrt UPC AMOs, but it's one we need to somehow address."

Is mmap() added to C11? I thought that came from POSIX. Does plain C11 (without any extensions such as POSIX) actually permit access to a single object via pointers that don't compare as equal?

"The latter would need to hash on physical addresses to retain the mmap atomicity guarantee."

This will be *very* difficult to implement on many common systems, as user-level software generally doesn't have knowledge of the physical address (for good reasons), and UPC implementations aren't generally done in the kernel. ```

Reported by `sdvormwa@cray.com` on 2012-09-07 17:05:25

2012-09-07T17:05:25+00:00

Former user Account Deleted

``` Correction: "You should <NOT> be able to manually switch UPC implementation semantics in software any more than you should be able to switch between the x86 and PowerPC memory model in a C program." ```

Reported by `jeff.science` on 2012-09-07 17:06:32

2012-09-07T17:06:32+00:00

Former user Account Deleted

``` A part of the talk of C11 makes me uneasy. The UPC AMO started out as a *library only*.

My $0.02: can we define a near-term goal of just coming up with a usable library, without prejudicing a larger effort that will blend atomic types seamlessly into the language? I have of course no objection to looking at C11 for inspiration. I just don't feel the urgency to resolve the whole problem *right now*.

Done ranting. As to the substance of "being like PAMI", the choice is fairly obvious:

no allowance for decision whatsoever. You get a single choice for implementation, and you live with it. This is the simplest, but I have the feeling that most of us are not in this camp.

*some* allowance for a decision, but without clear performance guarantees. Thus, the choice is by design a bit nebulous: "if you pick me you *may* get HW to execute your AMOs". Nasty.

A raw list of capabilities without commentary on what each capability is actually good for. This is the "PAMI way", and I can see why Jeff likes it not: the capability list is by definition non-portable.

Expert system: the system picks an implementation based on criteria you state when you create the domain. I'd love to have such a thing. We tried to do this for PAMI - there are endless published papers about it - just not a system that I trust to work. Yet.

Nirvana: the system guesses/autotunes the implementation you really wanted, no questions asked.

```

Reported by `ga10502` on 2012-09-07 17:50:54

2012-09-07T17:50:54+00:00

Former user Account Deleted

```

"Is mmap() added to C11? I thought that came from POSIX. Does plain C11 (without any extensions such as POSIX) actually permit access to a single object via pointers that don't compare as equal?"

Here is the exact C11 verbiage I'm referring to:

"NOTE Operations that are lock-free should also be address-free. That is, atomic operations on the same memory location via two different addresses will communicate atomically. The implementation should not depend on any per-process state. This restriction enables communication via memory mapped into a process more than once and memory shared between two processes."

AFAIK there is no way within strictly-conforming C11 to actually accomplish this mapping, so I think this is just a nod towards POSIX. We could conceivably "solve" this in UPC by explicitly prohibiting such user-level mapping in the first place.

```

Reported by `danbonachea` on 2012-09-07 18:01:32

2012-09-07T18:01:32+00:00

Former user Account Deleted

``` "The latter would need to hash on physical addresses to retain the mmap atomicity guarantee."

I am not certain that we want to allow the UPC *user* to mmap() the shared heap multiple times, making the statement above a "red herring" from the point of view of a UPC-only implementation.

Multiple mappings of the UPC shared heap for use of shared memory within a multi-core node would be fully controlled by the UPC runtime and thus easier to address (no pun intended) than a fully general case. In particular I doubt that arbitrary alignments are used, meaning that hashing on the offset within the page would be sufficient to correctly hash any/all pointers to a given object to the same slot in the table of locks (or equivalent).

I agree w/ George that we should NOT look to pick up C11 language features (_Generic in particular could be problematic), and I don't think that was Dan's intent. I think we should look at C11 for a "design" that our future users may already know and which a larger (I hope) community of experts has already vetted. However the "binding" of that design for UPC would need to be a LIBRARY - hopefully one that was nearly trivial to implement in terms of C11's stdatomic.h when available. ```

Reported by `phhargrove@lbl.gov` on 2012-09-07 18:03:42

2012-09-07T18:03:42+00:00
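
A rough sketch of the lock-table idea discussed above, in plain C11: software-emulated AMOs take a lock chosen by hashing the target's offset within its page, so two page-aligned mappings of the same memory pick the same slot. The page size, table size, granule and all `lt_` names are assumptions for illustration, and a simple spin lock stands in for whatever a real runtime would use.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LT_PAGE_SIZE 4096u /* assumed page size */
#define LT_SLOTS     64u   /* assumed fixed table size */
#define LT_GRANULE   8u    /* assumed bytes covered per hash step */

/* Static storage is zero-initialized, i.e. every lock starts out free. */
static atomic_bool lt_locks[LT_SLOTS];

/* Hash on the offset within the page only, so the slot does not depend
 * on where the page happens to be mapped. */
static size_t lt_slot_for(const void *addr)
{
    uintptr_t offset = (uintptr_t)addr % LT_PAGE_SIZE;
    return (size_t)((offset / LT_GRANULE) % LT_SLOTS);
}

static void lt_lock(size_t slot)
{
    /* Spin until the previous value was "free". */
    while (atomic_exchange_explicit(&lt_locks[slot], true,
                                    memory_order_acquire)) {
    }
}

static void lt_unlock(size_t slot)
{
    atomic_store_explicit(&lt_locks[slot], false, memory_order_release);
}

/* Software-emulated fetch-and-add on a plain (non-atomic) long. */
static long lt_emulated_fetch_add(long *target, long delta)
{
    size_t slot = lt_slot_for(target);
    lt_lock(slot);
    long old = *target;
    *target = old + delta;
    lt_unlock(slot);
    return old;
}
```

Note that this only serializes emulated AMOs against each other; it does nothing for plain reads and writes or hardware AMOs touching the same location, which is part of what the domain discussion is trying to pin down.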

Former user Account Deleted

``` I think an atomics library-only implementation is fine as long as this permits a typedef atomic_T-like entity rather than requiring C89 built-ins and some registration garbage. It is effectively impossible to do atomics portably without a typedef because of alignment issues that we cannot assume from an arbitrary C compiler (within the src2src model of BUPC, for example). Sometimes one has to ram the stupid int in a struct to make atomics work in a reasonable way (see the BGP atomics API, for example). ```

Reported by `jeff.science` on 2012-09-07 21:02:54

2012-09-07T21:02:54+00:00

Former user Account Deleted

``` "It is effectively impossible to do atomics portably without a typedef because of alignment issues that we cannot assume from an arbitrary C compiler "

This is an excellent point, and probably another reason why C11's library requires the use of typedefs like atomic_int, rather than operating on objects declared with raw basic types as we seem to be leaning towards. If we go with a raw types design, this is a problem we will need to address. For example, if the UPC user does something like this:

typedef struct { char c; int64_t i; } S_t;
shared S_t *S = upc_alloc(sizeof(S_t));
upc_amo_fetchincrement(&(S->i));

There are ABIs where the int64_t field won't be 8-byte aligned, and where issuing a hardware atomic instruction on such a field could generate a bus error. One possible solution would be to place the burden upon the user and state that if the pointer argument to the AMO is not suitably aligned, behavior is undefined.

This resolution dodges the issue, but doesn't provide a solution if the user really needs to perform AMOs on struct elements. The only option open to such a user would be to use non-portable (but widely available) alignment built-ins like align(), which is incidentally standardized in C11-6.7.5 with the _Alignas() specifier. If we're concerned that this will be a frequent portability problem for our AMO users, we could decide to specify some typedefs for suitably-aligned versions of the basic types (with identical size and representation) and recommend their (optional) usage when declaring struct fields that will be passed to the AMO library.

Alternatively, we could just abandon the effort to support AMO on raw basic types and take the C11 approach of only supporting AMOs on the provided typedefs. This also solves the alignment portability problem, but makes it more tedious to incrementally add AMO's to existing applications, and arguably burdens the common usage to prevent problems in uncommon usage.

```

Reported by `danbonachea` on 2012-09-08 10:08:04

2012-09-08T10:08:04+00:00
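
As an illustration of the struct-field alignment concern, here is a small C11 sketch. In the plain struct the offset of `i` is whatever the ABI chooses; adding the C11 alignment specifier to the member forces 8-byte alignment. The type names and the `al_is_aligned` helper are hypothetical, and ordinary private memory stands in for UPC shared space.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Offset and alignment of 'i' here are decided by the ABI. */
typedef struct {
    char c;
    int64_t i;
} S_plain;

/* C11 alignment specifier on the member that will be an AMO target. */
typedef struct {
    char c;
    alignas(8) int64_t i;
} S_aligned;

/* Returns nonzero when p is a multiple of 'alignment' bytes. */
static int al_is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}
```

With `S_aligned`, `offsetof(S_aligned, i)` is a multiple of 8 and the struct itself is at least 8-byte aligned, so `&s.i` is a safe AMO target wherever 8-byte alignment is the requirement. For `S_plain` the same is true on many ABIs but not guaranteed.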

Former user Account Deleted

``` As a follow-up to the discussion relating to C11 support for atomics and how/if this might apply to UPC AMO's, I found an interesting discussion here:

"Emulating C11 compiler features with gcc: _Atomic" http://gustedt.wordpress.com/2012/01/17/emulating-c11-compiler-features-with-gcc-_atomic/

In C11, _Atomic comes in two flavors, as a type qualifier (like const or volatile) or as type specifier (like in struct or array declarations). [...]

The author goes on to describe how he handles atomic types in his P99 macro pre-processor package: http://p99.gforge.inria.fr/ "P99 - Preprocessor macros and functions for C99"

"P99 is a suite of macro and function definitions that ease the programming in modern C, aka C99. By using new tools from C99 we implement default arguments for functions, scope bound resource management, transparent allocation and initialization, ...

By using special features of some compilers and operating systems, we also are able to provide an emulation of a large part of the new C standard, C11.

P99 heavily depends on a decent support for C99 of compilers. We have set up a test program that may be used as a first indication if a compiler is compatible with that. Please see the directory c99-conformance for some results of such tests."

Although it is unlikely that we will adopt the P99 approach re: AMO's or C11 atomics, I am passing along the reference for consideration. There are a number of other potentially useful aspects of the P99 project, unrelated to AMO's. 1) C99 conformance test results, for example: http://p99.gforge.inria.fr/c99-conformance/c99-conformance-gcc-4.5.html 2) The P99 macros use C99 features in some novel ways. Given that UPC is derived from C99, it is both possible and likely that current UPC programmers do not make use of some C99 features that might improve the maintainability and reliability of their UPC programs, and that might perhaps improve the productivity of programming in UPC.

```

Reported by `gary.funck` on 2012-09-08 18:38:23

2012-09-08T18:38:23+00:00

Former user Account Deleted

``` "One possible solution would be to place the burden upon the user and state that if the pointer argument to the AMO is not suitably aligned, behavior is undefined."

How does one write portable UPC that meets the alignment requirements of all possible ABIs? Is this not significantly more onerous than using a new type?

What if UPC is going to run in a heterogenous environment such as may exist in Cray Cascade? Do we think that alignment alone is going to solve the problems of atomics from distributed UPC threads that live on Intel Xeon, Intel MIC and/or NVIDIA processors? I don't recall what CUDA does for atomics but I can imagine in some future system that the atomics live in special hardware (e.g. Cray Gemini atomics living in the NIC) that requires a typedef.

I really don't see the use of atomic_int, for example, as "tedious to incrementally add AMO's to existing applications". Is the tedium coming from codes that want to replace lock/update/unlock with atomic operations and it is not obvious to the user which declarations correspond to variables that will require this treatment?

Perhaps it could be permitted - only when possible - to cast T to atomic_T. This allows the common case where typedef is required to support the evolution of existing codes, although it would not be portable. ```

Reported by `jeff.science` on 2012-09-09 17:49:07

2012-09-09T17:49:07+00:00

Former user Account Deleted

``` "How does one write portable UPC that meets the alignment requirements of all possible ABIs? Is this not significantly more onerous than using a new type?"

Not really. I believe the alignment issue only arises when the target of the AMO is located inside an aggregate type (a struct field), which may be an uncommon or less important case for typical HPC apps. For the simple case of AMO's on elements of an array of longs (the common case?), alignment should not be a problem on any reasonable architecture. If we believe that is the common case and the common case does not suffer from alignment problems, then it would be unfortunate to burden the common case with additional syntax that only affects the portability of the uncommon case.

"What if UPC is going to run in a heterogenous environment ... I can imagine in some future system that the atomics live in special hardware (e.g. Cray Gemini atomics living in the NIC) that requires a typedef"

This is a red herring. I believe hardware that *requires* AMO targets to live in special memory locations (where "special" means something more than UPC shared heap) are firmly beyond the scope of what we're trying to handle with the UPC AMO library. The C11 AMO library also does not handle implementations that require memory to be allocated "specially", although it does permit AMO's to be "faster" ("lock-free") on certain locations. In any case, a typedef alone is not enough to implement "special" memory, because it does not handle dynamic allocation.

"I really don't see the use of atomic_int, for example, as "tedious to incrementally add AMO's to existing applications". Is the tedium coming from codes that want to replace lock/update/unlock with atomic operations and it is not obvious to the user which declarations correspond to variables that will require this treatment?"

It's tedious because large portions of the code may already be written that allocate and pass around long-lived application data structures as base types (eg arrays of longs). Adding an AMO operation on an element in a performance-critical loop can be a localized and incremental change if the AMO supports basic types, otherwise adding the AMO may require adjusting many type expressions globally throughout the application to accommodate the AMO call being added to one corner of the program.

"Perhaps it could be permitted - only when possible - to cast T to atomic_T. This allows the common case where typedef is required to support the evolution of existing codes, although it would not be portable."

My point is that the typedef is NOT required in the common case. We've already decided we don't want to allow AMO library metadata inside user data space, so we're only talking about alignment requirements here. Many architectures never have alignment problems for AMOs, and the rest only have a problem for struct fields.

I was suggesting that the AMO's statically accept memory arguments of either atomic_T or T, and in the latter case be permitted to abort at runtime in the rare situation that alignment requirements are violated. This supports easy (cast-free) incremental insertion of AMO operations, but also provides the mechanism to write fully-portable programs with AMOs that are guaranteed to satisfy alignment constraints.

If we feel supporting both atomic_T and T makes the API too "wide", we could just support the former and recommend the use of casts as you suggest, which is basically how C11 AMO's work. If we go this route then no special verbiage is required for those casts, because that's already covered by C99-6.3.2.3: "A pointer to an object or incomplete type may be converted to a pointer to a different object or incomplete type. If the resulting pointer is not correctly aligned for the pointed-to type, the behavior is undefined."

```

Reported by `danbonachea` on 2012-09-10 00:05:07

2012-09-10T00:05:07+00:00
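
One way to picture "accept a plain T, but be permitted to fail at runtime on misalignment" is the sketch below. It operates directly on a plain `long` using the GCC/Clang `__atomic_fetch_add` builtin, so it is compiler-specific rather than standard C; the function name, the use of `sizeof(long)` as the required alignment, and the choice to return an error code (where an implementation might abort instead) are all assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Atomic fetch-and-add on a plain long, refusing misaligned targets.
 * Takes void* so a misaligned address can be checked before it is ever
 * converted to long*. Returns 0 on success, -1 if the target is not
 * aligned to sizeof(long) (a real library might abort here). */
static int chk_fetch_add_long(void *target, long delta, long *fetched)
{
    if ((uintptr_t)target % sizeof(long) != 0)
        return -1;
    /* GCC/Clang builtin; no atomic_long typedef needed on the target. */
    long old = __atomic_fetch_add((long *)target, delta, __ATOMIC_SEQ_CST);
    if (fetched != NULL)
        *fetched = old;
    return 0;
}
```

Existing code that passes around arrays of `long` can call this without touching any declarations, which is the incremental-adoption argument made above; the cost is that the alignment guarantee moves from the type system to a runtime check.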

Former user Account Deleted

``` Dan wrote:

"Not really. I believe the alignment issue only arises when the target of the AMO is located inside an aggregate type (a struct field), which may be an uncommon or less important case for typical HPC apps. For the simple case of AMO's on elements of an array of longs (the common case?), alignment should not be a problem on any reasonable architecture."

AIX on PPC64 is not reasonable by Dan's definition, because the ABI ensures 8-byte alignment for "double", but *not* for (void *) or long. The rules for 8-byte types inside structs on AIX are bizarre and not worth describing here. ```

Reported by `phhargrove@lbl.gov` on 2012-09-10 00:53:43

2012-09-10T00:53:43+00:00

Former user Account Deleted

``` "AIX on PPC64 is not reasonable by Dan's definition, because the ABI ensures 8-byte alignment for "double", but *not* for (void *) or long. The rules for 8-byte types inside structs on AIX are bizarre and not worth describing here."

I don't think we're proposing to allow UPC AMO's on (void *), because pointer-to-local living in shared space is already a "dicey" programming construct. I haven't heard anyone mention AMO's on pointer-to-shared, but the alignment of that type is already under the exclusive control of the UPC compiler.

The lack of alignment guarantees for arrays of longs in AIX-PPC64 (and possibly for other arrays of basic types elsewhere) is not likely to be an issue for UPC, because the AMO library already requires the target to live in shared space. In the case of static allocation of a shared array of longs, the UPC implementation can (easily?) provide 8-byte alignment of that shared object to make this problem go away (but this would not be required by the AMO library spec). In the case of dynamic allocation, upc_alloc and friends are already guaranteed to provide the necessary alignment for arrays of longs due to ISO C 7.20.3: "The pointer returned if the allocation succeeds is suitably aligned so that it may be assigned to a pointer to any type of object and then used to access such an object or an array of such objects in the space allocated", because "any type of object" includes atomic_long.

Now that we've broached the subject, I think we should consider eventually supporting compare-and-swap on a pointer-to-shared, since that would enable UPC programmers to portably implement lock-free data structures in shared memory. This usage case wasn't mentioned by our UPC users on the call, but I believe this is an important class of AMO clients, judging by the enormous literature on lock-free programming for non-UPC languages. This is something that could possibly be added as an enhancement in 1.4.

I'm aware that many UPC platforms use a 128-bit PTS representation and therefore probably cannot provide atomics on PTS exclusively in instruction-level hardware. However, there are also platforms using a 64-bit PTS representation, and also many cases where an implementation could do significantly better than upc_lock/pointer_swap/upc_unlock, even for 128-bit PTS. It's also worth noting that an implementation of the UPC AMO library should be permitted to use software-based atomics for one type (eg PTS) and fully-hardware atomics for a different, incompatible type (eg long) in the same "domain" without creating a problem. Hopefully the AMO spec being drafted will already prohibit touching the same memory location using two differently-typed AMO operations within a synchronization phase.

```

Reported by `danbonachea` on 2012-09-10 02:14:07

2012-09-10T02:14:07+00:00

Former user Account Deleted

``` "I don't think we're proposing to allow UPC AMO's on (void *)..."

So no one ever does pointer arithmetic with atomics? I don't know that "shared pointer to local" is the only use case here either. I am a UPC noob but isn't "shared pointer to shared" legal (and potentially useful)? ```

Reported by `jeff.science` on 2012-09-10 02:17:35

2012-09-10T02:17:35+00:00

Former user Account Deleted

``` "I am a UPC noob but isn't "shared pointer to shared" legal (and potentially useful)?"

Yes it is, which is the topic of my last two paragraphs that recommend including AMO library support for compare-and-swap on a pointer-to-shared (PTS). I wouldn't push for full-blown PTS arithmetic using AMOs, not because it would be impossible but because this would complicate the interface with the blocksize information required to perform PTS arithmetic, and lock-free algorithms generally only require pointer CAS, not pointer arithmetic (and the latter can be implemented using the former with an additional read). Adding PTS CAS support would subsume the need for AMOs on (void *) in shared space, which as I mentioned are only meaningful in very restricted cases (and I don't think anyone is pushing for them).

```

Reported by `danbonachea` on 2012-09-10 04:15:18

2012-09-10T04:15:18+00:00
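
The remark that pointer arithmetic "can be implemented using [CAS] with an additional read" can be sketched in C11 as follows. An ordinary atomic C pointer stands in for a pointer-to-shared here, so the blocksize-aware arithmetic that a real PTS would need is reduced to plain `seen + n`; the function name is hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Atomically advance *cursor by n elements using only a read plus
 * compare-and-swap. Returns the pointer value observed before the
 * update (fetch-and-advance). */
static long *cas_fetch_advance(long *_Atomic *cursor, ptrdiff_t n)
{
    long *seen = atomic_load(cursor); /* the "additional read" */
    /* On failure 'seen' is refreshed, so the new value is recomputed
     * from the current pointer on every retry. */
    while (!atomic_compare_exchange_weak(cursor, &seen, seen + n)) {
    }
    return seen;
}
```

For a real pointer-to-shared the `seen + n` step is where the blocksize would enter, but it stays on the caller's side of the CAS, which is why a CAS-only interface is enough.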

Former user Account Deleted

``` As promised, and hopefully without too much delay, here is a straw proposal for an "atomicity domain"-based AMO library.

-- upc_domain_t -- An object of type upc_domain_t represents an atomicity domain, which specifies a set of operations and datatypes over which access to a memory location in a given synchronization phase is guaranteed to be atomic if and only if no other mechanisms or atomicity domains are used to access the same memory location in the same synchronization phase.

-- upc_amo_mode_t -- The following constants of type upc_amo_mode_t shall be defined to allow user-provided indication of a preferred usage mode to the implementation.

UPC_AMO_DEFAULT, An implementation-defined default mode
UPC_AMO_HARDWARE, Use only hardware-supported atomic memory operations

-- Supported Operations & Types -- The atomicity mode UPC_AMO_DEFAULT will support the following combination of operations and data types:

AND OR XOR NOT SET CSWAP ADD MIN MAX
I, I32 X X X X X X X X X
L, I64 X X X X X X X X X
UI, U32 X X X X X X X X X
UL, U64 X X X X X X X X X
F32 X X X
F64 X X X

The atomicity mode UPC_AMO_HARDWARE will provide support for an implementation-defined subset of the data types and operations supported by UPC_AMO_DEFAULT.

-- Domain Initialization --
upc_domain_t UPC_DOMAIN_INIT(upc_op_t ops, upc_type_t types, upc_amo_mode_t mode);

A static initializer that creates a upc_domain_t object that supports the operations and types specified by ops and types, respectively, in the AMO implementation mode, mode.

int upc_all_domain_init(upc_domain_t *domain, upc_op_t ops, upc_type_t types, upc_amo_mode_t mode);

A dynamic initializer for a upc_domain_t object that initializes the specified object to support the operations and types specified by ops and types, respectively, in the AMO implementation mode, mode.

-- Support Query --
int upc_amo_query(upc_amo_mode_t mode, upc_op_t ops, upc_type_t types);

Returns 1 if the AMO implementation mode, mode, supports all pair-wise combinations of types and operations specified by types and ops, respectively; returns 0 otherwise.

-- Atomic Operations --
void upc_amo_strict(upc_domain_t *domain, void *fetch_ptr, upc_op_t op, upc_type_t type, shared void *target, void *operand1, void *operand2);

void upc_amo_relaxed(upc_domain_t *domain, void *fetch_ptr, upc_op_t op, upc_type_t type, shared void *target, void *operand1, void *operand2);

where:
- *target = *target OP *operand1, for OP in {AND, OR, XOR, ADD, MIN, MAX}
- *target = (*target == *operand1) ? *operand2 : *target, for CSWAP
- *target = *operand1, for SET
- *target = (*target), for NOT
and the original value of target is stored in fetch_ptr if and only if fetch_ptr != NULL. ```

Reported by `nspark.work` on 2012-09-10 16:35:23

2012-09-10T16:35:23+00:00
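
The update rules in the straw proposal above can be written out as a single-threaded reference model. This is only a sketch under stated assumptions: the `amo_op_t` enum, the `amo_apply_i64` name, and the restriction to `int64_t` are illustrative and not part of the proposal, NOT is assumed to mean bitwise complement, and nothing here provides atomicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical operation codes; the proposal does not fix their spelling. */
typedef enum { AMO_AND, AMO_OR, AMO_XOR, AMO_NOT, AMO_SET,
               AMO_CSWAP, AMO_ADD, AMO_MIN, AMO_MAX } amo_op_t;

/* Reference semantics only: a real AMO would perform this
   read-modify-write atomically on a shared target. */
void amo_apply_i64(int64_t *fetch_ptr, amo_op_t op, int64_t *target,
                   const int64_t *operand1, const int64_t *operand2)
{
    assert(target != NULL);
    int64_t old = *target;
    int64_t a = operand1 ? *operand1 : 0;   /* NOT takes no operand */

    switch (op) {
    case AMO_AND: *target = old & a; break;
    case AMO_OR:  *target = old | a; break;
    case AMO_XOR: *target = old ^ a; break;
    case AMO_ADD: *target = old + a; break;
    case AMO_MIN: *target = old < a ? old : a; break;
    case AMO_MAX: *target = old > a ? old : a; break;
    case AMO_SET: *target = a; break;
    case AMO_NOT: *target = ~old; break;    /* assumed: bitwise complement */
    case AMO_CSWAP:
        if (old == a) {
            assert(operand2 != NULL);
            *target = *operand2;
        }
        break;
    }
    if (fetch_ptr != NULL)
        *fetch_ptr = old;   /* original value is returned only on request */
}
```

Passing `NULL` for `fetch_ptr` gives the non-fetching form of each operation, which is how one entry point covers both variants.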

Former user Account Deleted

``` "and also many cases where an implementation could do significantly better than upc_lock/pointer_swap/upc_unlock, even for 128-bit PTS."

Could you be more specific? I'm having trouble coming up with anything better, given that a programmer could use the lock to protect more than just the pointer swap. Also, note that implementations that have 128-bit PTS representation probably also have the restriction that 128-bit reads and writes are not atomic at the hardware level, so you'll need 128-bit load and store atomics as well, and these wouldn't play nice with strict reads/writes... ```

Reported by `sdvormwa@cray.com` on 2012-09-10 16:56:36

2012-09-10T16:56:36+00:00

Former user Account Deleted

``` Why no atomic SET for FP types? Don't we need to support atomic SET/GET for all types? Load/store atomicity is not necessarily at word granularity. ```

Reported by `jeff.science` on 2012-09-10 17:33:56

2012-09-10T17:33:56+00:00

Former user Account Deleted

``` "Why no atomic SET for FP types? Don't we need to support atomic SET/GET for all types? Load/store atomicity is not necessarily at word granularity."

Note that the memory model implies sequential consistency for strict references. I'm pretty sure that requires that strict memory operations are atomic (or at least appear to be). Admittedly, the spec is extremely vague on the mapping from source expression to memory model operations, so this may not be the full story. I'll defer to others more knowledgeable on the intricacies of the memory model. ```

Reported by `sdvormwa@cray.com` on 2012-09-10 17:49:30

2012-09-10T17:49:30+00:00

Former user Account Deleted

``` "Why no atomic SET for FP types? Don't we need to support atomic SET/GET for all types? Load/store atomicity is not necessarily at word granularity."

It certainly wouldn't bother me to have atomic SET for floating-point types. In general, I think a lot less about floating-point types and what one would do with them. If we also need an atomic GET, we could probably just add a NO-OP operation, so you'd do a fetching NO-OP and get the value without any modification. ```

Reported by `nspark.work` on 2012-09-10 17:53:42

2012-09-10T17:53:42+00:00
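
The fetching NO-OP idea has a direct analogue in C11 atomics, shown here as a sketch in plain C rather than UPC: a fetch-and-add of zero returns the current value and leaves it unchanged. The helper names are made up for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Atomic GET expressed as a fetching operation with an identity operand. */
int64_t get_via_fetch_add(_Atomic int64_t *target)
{
    return atomic_fetch_add(target, 0);   /* adds 0, so the value is unchanged */
}

/* The dedicated GET that a separate operation code would map to. */
int64_t get_via_load(_Atomic int64_t *target)
{
    return atomic_load(target);
}
```

Both helpers return the same value; the difference is that the first is a read-modify-write and the second is a plain atomic read.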

Former user Account Deleted

``` I assume that AMOs are going to work in relaxed mode, too, which is why atomic GET/SET are required, or should a user instead declare these references as strict? It might at least be convenient to define atomic GET/SET, if only as another way to declare a reference to be strict.

```

Reported by `jeff.science` on 2012-09-10 17:54:47

2012-09-10T17:54:47+00:00

Former user Account Deleted

``` The more I think about the straw proposal in Comment 68, the more I'd like to coalesce the domain and "mode" types and have an implementation provide "everything" (UPC_AMO_DEFAULT, likely in software) or just what the hardware can support (UPC_AMO_HARDWARE). Regarding all the different ways one could implement atomics "in hardware", I think that would fall in as one of the "implementation defined" aspects of the hardware-only domain. For Cray and IBM, I would /expect/ that they would let their NICs do the work, but any implementation could do whatever the implementation chose to do.

I also think it would be worth re-working the query to have separate types/ops parameters for integer and floating-point types. The support matrix I suggested doesn't work in the pair-wise matching of types and ops, if integer and FP types are considered together. ```

Reported by `nspark.work` on 2012-09-10 17:59:13

2012-09-10T17:59:13+00:00
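
The pair-wise matching problem can be seen in a small model of the query. This is a sketch with assumed details: the flag names and values are hypothetical, only two types are modelled, and the floating-point type is assumed to support ADD, MIN, and MAX but no bitwise operations.

```c
#include <assert.h>

/* Hypothetical bit flags; names and values are illustrative only. */
enum { OP_AND = 1, OP_OR = 2, OP_XOR = 4, OP_ADD = 8, OP_MIN = 16, OP_MAX = 32 };
enum { TY_I64 = 1, TY_F64 = 2 };

/* Operations assumed to be supported for each type. */
static unsigned supported_ops(unsigned type)
{
    if (type == TY_I64)
        return OP_AND | OP_OR | OP_XOR | OP_ADD | OP_MIN | OP_MAX;
    if (type == TY_F64)
        return OP_ADD | OP_MIN | OP_MAX;   /* assumption: no bitwise ops on FP */
    return 0;
}

/* Returns 1 only if every requested op is supported for every requested type. */
int amo_query(unsigned ops, unsigned types)
{
    for (unsigned t = TY_I64; t <= TY_F64; t <<= 1) {
        if ((types & t) && (ops & ~supported_ops(t)) != 0)
            return 0;
    }
    return 1;
}
```

A request for XOR and ADD over both types fails even though each combination the caller actually needs (XOR on integers, ADD on floating point) is supported, which is the motivation for separate integer and floating-point parameters.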

Former user Account Deleted

``` Comment 73 was about comment 71, btw.

Regarding comment 72, FETCH-AND-OP with NO-OP is exactly how MPI-3 does atomic GET, btw.

I wonder if there is an advantage to having a separation. Unless the compiler can remove both the operation and the input operand, there is going to be packet overhead in doing GET via FETCH-AND-OP. Some systems have special packets for 8-byte communication, which could work for atomic GET (assuming perhaps that all GETs were atomic and thus nothing special was required to achieve this). In this case we want to force the user to declare their intent to do GET specifically.

The other issue is that the compiler can probably do more to optimize if it doesn't have to inspect the input operand of FETCH-AND-OP to distinguish a "GET" from a GET. ```

Reported by `jeff.science` on 2012-09-10 18:00:08

2012-09-10T18:00:08+00:00

Former user Account Deleted

``` "access to a memory location in a given synchronization phase is guaranteed to be atomic if and only if NO OTHER MECHANISMS or atomicity domains are used to access the same memory location in the same synchronization phase."

This part of the straw spec answers all the questions about concurrent strict accesses - basically, touching the same memory location with an AMO and a strict operation in the same synchronization phase means atomicity is no longer guaranteed. I would go even stronger and state that behavior is completely undefined if you do this. Every atomic type needs to have GET and SET via the AMO library, which is the "right" way to perform concurrent read/writes during an AMO synchronization phase.

"Note that the memory model implies sequential consistency for strict references. I'm pretty sure that requires that strict memory operations are atomic (or at least appear to be)."

As discussed elsewhere in other threads, the memory model does NOT guarantee freedom from word-tearing, which is a related but orthogonal issue. It would be impossible to implement strict if we required strict operations to be tearing-free, because you can perform a "single" strict write of arbitrarily large size using a struct type. Programs with race conditions need to be aware of the word-tearing issue, even if the races use only strict operations. Most architectures won't word tear for 1-, 2-, 4-, or 8-byte accesses, but this is platform-specific.

In any case, we are specifically prohibiting concurrent access with non-AMO read/writes (strict or otherwise), both to ensure atomicity can be implemented when hardware support is absent, and to prevent word-tearing from being an issue.

In response to comment 69: "'and also many cases where an implementation could do significantly better than upc_lock/pointer_swap/upc_unlock, even for 128-bit PTS.'

Could you be more specific? I'm having trouble coming up with anything better, given that a programmer could use the lock to protect more than just the pointer swap."

An active message is sent; upon arrival, software grabs a lock, does the update, then releases the lock and sends a reply. For an OP-and-fetch that's one network round-trip of latency. The naive UPC-level implementation of upc_lock/pointer-read/pointer-write/upc_unlock in general will induce four network round trips of latency, for approximately a 4x slowdown on a latency-dominated network.

```

Reported by `danbonachea` on 2012-09-10 20:17:33

2012-09-10T20:17:33+00:00
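
The round-trip argument above is simple arithmetic, sketched here in C. The function name and the 2.0 microsecond round-trip time used in the note below are assumptions for illustration, not measurements from any system.

```c
#include <assert.h>

/* Latency of one AMO in a purely latency-bound model, in microseconds. */
double amo_latency_us(int round_trips, double round_trip_us)
{
    assert(round_trips > 0 && round_trip_us > 0.0);
    return round_trips * round_trip_us;
}
```

With an assumed 2.0 microsecond round trip, the active-message path costs `amo_latency_us(1, 2.0)` and the lock/read/write/unlock path costs `amo_latency_us(4, 2.0)`, a ratio of four regardless of the round-trip time chosen.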

Former user Account Deleted

``` A few comments on the actual proposal:

int upc_all_domain_init(upc_domain_t *domain, upc_op_t ops, upc_type_t types, upc_amo_mode_t mode);

This function should not return a value, because it should never fail (unless the arguments are totally bogus values, which should be a fatal error). The user is stating he wants this set of ops and types, give me a domain with the "fastest" implementation that supports all the pairwise combos.

The mode argument makes no sense here - the user always wants the "fastest" implementation you can provide. I would change that argument to be a upc_amohint_t (or something similar) where the values are something like UPC_AMO_LATENCY, UPC_AMO_THROUGHPUT and allow for additional implementation-defined hints.

Finally, if we're going to support statically-initialized domains, I don't see a reason to also provide a dynamic one. However, we DO need to think about how these domain_t types are handled by the user, specifically whether they can be passed around by value (like a upc_handle_t) or whether they must be handed by reference (like a upc_lock_t *). The latter allows the implementation to store state directly in the upc_domain_t, the former requires a level of indirection in the implementation and opens the door to memory leaks inside the implementation lacking a destructor, and potentially makes it more difficult to implement the static constructor. Whichever we settle on should be used consistently throughout - if constructors return a domain_t by value, then the AMO's should also take the domain by value. Otherwise drop the static initializer and work everything by reference. Finally, we should state that domain construction is collective and domain handles are thread-specific, so we don't have domain handles being passed around in shared memory.

"I also think it would be worth re-working the query to have separate types/ops parameters for integer and floating-point types. The support matrix I suggested doesn't work in the pair-wise matching of types and ops, if integer and FP types are considered together."

An easy way to fix this is to change it so each atomicity domain is only over a SINGLE type. We already have to prohibit concurrent AMO access to a given memory location using two different C-types within a synchronization phase, so might as well fold this requirement directly into the domain abstraction. This means every argument of (upc_type_t types) should be (upc_type_t type), and the type argument can be dropped from the actual AMO call because it is redundant and stored in the domain.

We haven't discussed the spelling of upc_type_t values and the exact C types they correspond to, but we need to specify those.

Combining all of the above, we end up with a spec like:

upc_domain_t UPC_DOMAIN_INIT(upc_op_t ops, upc_type_t type, upc_amohint_t hints);

void upc_amo_strict(upc_domain_t *domain, void *fetch_ptr, upc_op_t op, shared void *target, void *operand1, void *operand2);

Example usage for an atomic add:

upc_domain_t mydom = UPC_DOMAIN_INIT((UPC_ADD|UPC_CSWAP), UPC_TYPE_INT32, UPC_AMO_LATENCY);
shared int32_t val;
int32_t addend = 42; /* operand1 is a void *, so the addend is passed by address */
upc_amo_strict(&mydom, 0, UPC_ADD, &val, &addend, 0);

```

Reported by `danbonachea` on 2012-09-10 20:48:08

2012-09-10T20:48:08+00:00
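
The single-type domain idea above can be sketched in plain C: the domain records its type and operation set once, and each call recovers the operand width from the domain instead of taking a type argument. All names here are hypothetical, and the update is not atomic; the sketch only shows the bookkeeping.

```c
#include <assert.h>
#include <stdint.h>

enum { OP_ADD = 1, OP_CSWAP = 2, OP_XOR = 4 };        /* hypothetical flags */
typedef enum { TYPE_INT32, TYPE_INT64 } amo_type_t;   /* hypothetical types */

typedef struct {
    unsigned   ops;    /* operations requested at creation */
    amo_type_t type;   /* the single type this domain covers */
} amo_domain_t;

/* Returns 0 on success, -1 if ADD was not requested for this domain.
   The operand width comes from the domain, not from the call. */
int amo_add(const amo_domain_t *dom, void *target, const void *operand)
{
    if (!(dom->ops & OP_ADD))
        return -1;
    if (dom->type == TYPE_INT32)
        *(int32_t *)target += *(const int32_t *)operand;
    else
        *(int64_t *)target += *(const int64_t *)operand;
    return 0;
}
```

Because the type lives in the domain, two calls through the same domain cannot disagree about the width of the target, which is the restriction the domain is meant to enforce.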

Former user Account Deleted

``` "An active message is sent; upon arrival, software grabs a lock, does the update, then releases the lock and sends a reply. For an OP-and-fetch that's one network round-trip of latency. The naive UPC-level implementation of upc_lock/pointer-read/pointer-write/upc_unlock in general will induce four network round trips of latency, for approximately a 4x slowdown on a latency-dominated network."

This is true for most atomics. However, compare-and-swap is usually done in a spin-loop until it succeeds. Most uses of compare-and-swap that I've seen look something like this:

do {
    tmp = val
    new_val = f(tmp)
} while( cas( &val, tmp, new_val ) != tmp )

Instead of compare-and-swap, users can just use locks to prevent anyone from touching the value until you've updated it.

lock( val_lock )
val = f( val )
unlock( val_lock )

This avoids the retry loop entirely by changing the algorithm. This is something only the user can do. That said, some users just want to run their lock-free code everywhere, regardless of whether or not it is a good idea on a given platform, so let's consider the two software approaches mentioned.

With a software atomic implementation naively based on locks, the minimum latency is dramatically increased due to the extra network round-trips. The maximum latency is hopefully capped by a fair locking algorithm though.

With a software atomic implementation based on active messages, you avoid 3 extra network round-trips, but are dependent on someone actively processing messages on the remote end, meaning the maximum latency is potentially unbounded. Additionally, since we're talking about systems where the shared pointer doesn't fit into 64 bits, there are scalability issues to consider. In particular, full message queues become much more common. Thus, I question if this is always better than naively using locks.

In general, the problem with software atomics is that there are many ways of implementing them, but which implementation to choose can be extremely problem-dependent. Users know more about the problem they are trying to solve, and are thus better equipped to make that decision. Implementers are unlikely to provide multiple implementations for the user to choose from (too expensive), so the user is tough out of luck if the implementation's choice doesn't match well with their problem. I suppose someone could write a bunch of reference implementations using different algorithms in UPC to get around this though... ```

Reported by `sdvormwa@cray.com` on 2012-09-10 23:17:37

2012-09-10T23:17:37+00:00
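
The two update styles contrasted above can be written with C11 atomics. This is a sketch in plain C on local memory, not UPC: the update function `f` is arbitrary, the helper names are invented, and a spinning `atomic_flag` stands in for `upc_lock_t`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

static int64_t f(int64_t x) { return 2 * x + 1; }   /* arbitrary update */

/* Lock-free style: retry until no other thread changed val in between. */
void update_with_cas(_Atomic int64_t *val)
{
    int64_t tmp = atomic_load(val);
    /* On failure tmp is refreshed with the current value and f is recomputed. */
    while (!atomic_compare_exchange_weak(val, &tmp, f(tmp)))
        ;
}

/* Lock-based style: hold the lock across the read, the computation, and the write. */
void update_with_lock(atomic_flag *lock, int64_t *val)
{
    while (atomic_flag_test_and_set(lock))
        ;                        /* spin until the lock is acquired */
    *val = f(*val);
    atomic_flag_clear(lock);     /* release */
}
```

Under contention the first version may evaluate `f` several times per update while the second evaluates it exactly once, which is the algorithmic trade-off the comment describes.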

Former user Account Deleted

``` "This avoids the retry loop entirely by changing the algorithm. This is something only the user can do. That said, some users just want to run their lock-free code everywhere... Users know more about the problem they are trying to solve, and are thus better equipped to make that decision."

My argument for inclusion of PTS CAS was to support lock-free algorithms. Lock-free algorithms require a pointer CAS AMO. The question of whether or not the user SHOULD be using a lock-free algorithm on a given platform for a given situation is orthogonal - there are cases where lock-free data structures are probably faster (possibly much faster), and also cases where upc_lock-ing with a traditional data structure may be faster. This is part of application tuning, and it is not our place to dictate that decision. If we fail to provide PTS CAS then we are taking that decision away from the user and effectively prohibiting the use of lock-free data structures.

```

Reported by `danbonachea` on 2012-09-11 00:21:04

2012-09-11T00:21:04+00:00
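
As an illustration of why lock-free structures need a pointer CAS, here is the push operation of a linked stack using C11 atomics. It is a sketch in plain C: ordinary pointers stand in for pointers-to-shared, the type and function names are invented, and pop is omitted because a correct lock-free pop needs extra care (the ABA problem) that is outside the point being made.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct node {
    int          value;
    struct node *next;
} node_t;

/* Lock-free push: link the new node, then publish it with a pointer CAS. */
void stack_push(node_t *_Atomic *head, node_t *n)
{
    n->next = atomic_load(head);
    /* If head moved, n->next is refreshed with the new head and we retry. */
    while (!atomic_compare_exchange_weak(head, &n->next, n))
        ;
}
```

The only atomic operation required is compare-and-swap on the head pointer; no pointer arithmetic is involved.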

Former user Account Deleted

``` "My argument for inclusion of PTS CAS was to support lock-free algorithms. Lock-free algorithms require a pointer CAS AMO. The question of whether or not the user SHOULD be using a lock-free algorithm on a given platform for a given situation is orthogonal - there are cases where lock-free data structures are probably faster (possibly much faster), and also cases where upc_lock-ing with a traditional data structure may be faster. This is part of application tuning, and it is not our place to dictate that decision. If we fail to provide PTS CAS then we are taking that decision away from the user and effectively prohibiting the use of lock-free data structures."

Yes, I agree that it should be included for completeness. My intent was merely to point out that the choice of which software implementation is best is usually more dependent on the problem at hand than on the platform details, and implementations are not likely to provide many choices given the development cost. Therefore, a generic software atomics library, with multiple algorithms, that worked on a wide variety of conforming UPC implementations would be useful--more so if such a library could continue to make use of hardware atomics provided by an implementation. ```

Reported by `sdvormwa@cray.com` on 2012-09-11 00:56:53

2012-09-11T00:56:53+00:00

Former user Account Deleted

``` Some last-minute follow up...

--- Initializers and Modes/Hints --- Dan said:

"This function should not return a value, because it should never fail (unless the arguments are totally bogus values, which should be a fatal error). The user is stating he wants this set of ops and types, give me a domain with the "fastest" implementation that supports all the pairwise combos.

The mode argument makes no sense here - the user always wants the "fastest" implementation you can provide. I would change that argument to be a upc_amohint_t (or something similar) where the values are something like UPC_AMO_LATENCY, UPC_AMO_THROUGHPUT and allow for additional implementation-defined hints.

[...]

An easy way to fix this is to change it so each atomicity domain is only over a SINGLE type."

This approach makes sense to me. Turning the mode into a hint, which was the intent all along (I think), makes more sense to me with the per-type domain (from Dan, below). I think it eases the ability to "mix-and-match" types and ops and hints, knowing that you'll get the fastest version available as long as you don't over-specify the ops.

I think it simplifies AMO domain creation, unless you wanted AMOs on *lots* of types. What it complicates is whether we want to provide a default domain that supports "everything". We'd now have to provide "everything" domains on a per-type basis. That said, people looking into AMOs would likely be fine creating what they need and wouldn't necessarily need to have an "everything" domain.

Dan said:

"If we're going to support statically-initialized domains, I don't see a reason to also provide a dynamic one."

When I've talked to my users about the domain idea, they've always said they'd use statically-initialized domains. I think I put the dynamic initializer in there for completeness. Jeff did previously say that "Requiring static allocation is really unpleasant from a usability perspective."

I think Dan's example at the end of Comment 77 looks pretty reasonable.

--- AMO Types --- Dan said:

"We haven't discussed the spelling of upc_type_t values and the exact C types they correspond to, but we need to specify those."

For completeness with the types supported by the UPC Collectives, I would have upc_type_t support:

UPC_CHAR, UPC_SHORT, UPC_INT, UPC_LONG, UPC_UCHAR, UPC_USHORT, UPC_UINT, UPC_ULONG, UPC_INT8, UPC_INT16, UPC_INT32, UPC_INT64, UPC_UINT8, UPC_UINT16, UPC_UINT32, UPC_UINT64, UPC_FLOAT, UPC_DOUBLE, UPC_LDOUBLE, UPC_PTS

although not all of these would be supported by AMOs or UPC Collectives. I think it would be a complete-enough set for use in other (future) libraries, too.

--- AMO OPs --- Continuing along the lines of the AMO types, I would like to see upc_op_t (7.3.2.1) pulled out of the UPC Collectives library and made a common type for AMOs, Collectives, and whatever else may also want it. In addition to what is in 7.3.2.1, we would add UPC_GET, UPC_SET, UPC_NOT, and UPC_CSWAP. Both UPC Collectives and UPC Atomic would each have to specify which operations are supported.

--- Supported Types & Operations --- Revising the table from before...

CSWAP AND OR XOR NOT GET SET ADD MIN MAX
I, I32 X X X X X X X X X X
L, I64 X X X X X X X X X X
UI, U32 X X X X X X X X X X
UL, U64 X X X X X X X X X X
F32 X X X X X
F64 X X X X X
PTS X

```

Reported by `nspark.work` on 2012-09-26 19:51:41

2012-09-26T19:51:41+00:00

Former user Account Deleted

``` SHORT VERSION: Static initializers cannot be supported by software-emulated AMO domains, so this feature should be dropped.

AMO's by definition imply coordination between multiple threads. We've previously discussed the need for domain creation to be a "collective" thing, with participation from all the threads which intend to use the domain to perform "conflicting" AMOs. Eventually that may be collective over a team, for now it means all the threads. The reason for this collective requirement is to allow the implementation of the library to setup whatever metadata it needs to ensure atomicity.

In the case of a domain where all the requested OPs can be supported via hardware instructions, probably nothing special needs to be done during domain creation - except probably record the type size provided by the user, and possibly stash away the other arguments for debug checking (all of which can be done locally with no coordination or computation). The actual AMO ops in this case boil down to a thin wrapper around a hardware instruction, prefaced by a check that this domain is hardware-supported.

However in the case of a domain that needs to perform locking in software, more probably needs to happen. Assume we are interested in a "good" implementation where each software-emulated domain is implemented to operate independently of other software domains (rather than a "dumb" approach of just using a single big lock for all software-emulated domains, which would lead to artificial serialization contention). In this implementation each domain needs its own dedicated "lock" resource, which might be a upc_lock_t, an OS mutex, or something similar. The important point is that domain creation involves allocating this lock resource, and very importantly also some form of COLLECTIVE coordination (probably a broadcast) to ensure all threads participating in the domain creation agree upon the identity/location of the lock resource protecting this domain.

For dynamic domain creation, this is not a problem - we can require the domain allocation function to be called collectively in the user's program, and the library can perform whatever allocation and coordination it requires inside that call. However, I'm now realizing that permitting static allocation of AMO domains would create serious problems for a software implementation. Consider the "simplest" case of a file-scope statically-initialized domain. Our base C99 language does not support running arbitrary code to initialize static objects (as for example in C++ or Java); the only support which is guaranteed to exist in the compiler is static initialization to a constant value. This means the static initializer implementation basically must be a macro, and at best it can simply stuff the arguments into some kind of struct.

This does not allow the software implementation to dynamically allocate any lock resources it may need (especially if those locks need to live in the shared heap), nor does it permit collective coordination to uniquely associate the corresponding per-thread domain objects with each other and/or that dedicated lock resource. It might be possible to play some games with delayed initialization of lock resources to skirt the first issue, but I believe the second is fundamental - if the application statically creates multiple domains with similar arguments that must be emulated in software, there is no way for the library to reliably "match up" the domain resources across threads.

I believe this argues strongly for dropping the idea of static initialization of AMO domains entirely. AMO domain creation should be accomplished by a call to a collective initializer library function, which creates the domain and passes back a pointer or handle to it. It means a line of code per domain in the user program, but the alternative design severely curtails software implementation, and by extension the entire AMO library (since software implementation is our "fall back" position for anything not natively supported in hardware). ```

Reported by `danbonachea` on 2012-09-27 08:24:28

2012-09-27T08:24:28+00:00
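
The limitation described above can be made concrete with a small C sketch. The struct layout, the macro, and the function names are all hypothetical, and an `atomic_flag` stands in for whatever lock resource a software implementation would use. The collective agreement step across threads is not modelled at all.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

typedef struct {
    unsigned     ops;
    int          type;
    atomic_flag *lock;   /* per-domain lock resource; NULL if never allocated */
} amo_domain_t;

/* All a C99 static initializer can do: copy constants into the struct. */
#define AMO_DOMAIN_STATIC_INIT(ops, type) { (ops), (type), NULL }

/* A run-time initializer can allocate the lock the software path needs. */
amo_domain_t *amo_domain_alloc(unsigned ops, int type)
{
    amo_domain_t *d = malloc(sizeof *d);
    if (d == NULL)
        return NULL;
    d->lock = malloc(sizeof *d->lock);
    if (d->lock == NULL) {
        free(d);
        return NULL;
    }
    atomic_flag_clear(d->lock);   /* start unlocked */
    d->ops = ops;
    d->type = type;
    return d;
}

void amo_domain_free(amo_domain_t *d)
{
    if (d != NULL) {
        free(d->lock);
        free(d);
    }
}
```

A statically initialized domain ends up with no lock at all, and two such domains created with the same arguments are indistinguishable, whereas the allocating call has a place to create and identify the lock.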

Former user Account Deleted

``` Updated interface proposal:
--------------------------

TYPE upc_amodomain_t

The type upc_amodomain_t is an opaque UPC type. upc_amodomain_t is a shared datatype with incomplete type (as defined in [ISO/IEC00 Sec 6.2.5]). Objects of type upc_amodomain_t may therefore only be manipulated through pointers.

Two pointers to upc_amodomain_t that reference the same AMO domain object will compare as equal. The results of applying upc_phaseof(), upc_threadof(), and upc_addrfield() to such pointers are undefined.

INITIALIZATION
upc_amodomain_t *upc_all_amodomain_alloc(upc_op_t ops, upc_type_t type, upc_amohint_t hints);

The upc_all_amodomain_alloc function dynamically allocates an AMO domain and returns a pointer to it.

upc_all_amodomain_alloc is a collective function. The return value on every thread points to the same AMO domain object.

The AMO domain created supports AMO calls to operate on objects of a unique type, specified by the "type" parameter. The upc_type_t values and the corresponding type they specify are listed in table [ref].

The "ops" parameter specifies the atomic operations to be supported by the domain. The valid upc_op_t values and their meanings are listed in table [ref]. Multiple upc_op_t values can be combined by using the bitwise OR operator (|), and each value has a unique bitwise representation that can be unambiguously tested using the bitwise AND operator (&).

The "ops" parameter shall only specify operations within the set permitted for "type" (as defined in table [ref]), otherwise behavior is undefined.

The "hints" parameter provides a performance hint to the implementation. It shall be equal to one of the following values:
0 : default behavior
UPC_AMO_HINT_LATENCY : requests the implementation to minimize latency of AMO operations
UPC_AMO_HINT_THROUGHPUT : requests the implementation to maximize throughput of AMO operations
UPC_AMO_HINT_* : Implementation-defined additional hint values
The implementation is free to ignore the "hints" parameter.

USAGE
void upc_amo_strict(upc_amodomain_t *domain, void *fetch_ptr, upc_op_t op, shared void *target, void *operand1, void *operand2);

Description to be added...

Example usage for an atomic add:
-------------------------------
shared int32_t val;
int main() {
    int32_t addend = 42; /* operand1 is a void *, so the addend is passed by address */
    upc_amodomain_t *mydom = upc_all_amodomain_alloc((UPC_ADD|UPC_GET|UPC_SET), UPC_TYPE_INT32, UPC_AMO_HINT_LATENCY);
    upc_amo_strict(mydom, 0, UPC_ADD, &val, &addend, 0);
}

Notes:
------
This is not the only possible interface, but I'm intentionally mimicking the upc_lock_t interface, because it is a well-proven interface which is established and familiar to UPC users. The UPC I/O library has an elaborate system for providing implementation hints (and including implementation-defined hints), but for now I'm assuming we don't need that much complexity and only expect a few possible hints. We should consider whether we need a domain destructor, i.e.: void upc_all_amodomain_free(upc_amodomain_t *d). If we're convinced that all possible clients will only have a small constant number of domains, then perhaps that's unnecessary.

```

Reported by `danbonachea` on 2012-09-27 08:25:18

2012-09-27T08:25:18+00:00
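
The requirement that operation values combine with `|` and test with `&` amounts to a one-bit-per-operation encoding. The sketch below shows that encoding in plain C; the constant names and bit positions are assumptions, since the proposal leaves the actual values to a table that is not yet written.

```c
#include <assert.h>

/* Hypothetical one-bit-per-operation encoding; values are illustrative. */
enum {
    AMO_OP_ADD   = 1u << 0,
    AMO_OP_GET   = 1u << 1,
    AMO_OP_SET   = 1u << 2,
    AMO_OP_CSWAP = 1u << 3
};

/* True when exactly one operation bit is set. */
int amo_op_is_single(unsigned op)
{
    return op != 0 && (op & (op - 1)) == 0;
}

/* True when op is a single operation that the domain was created with. */
int amo_domain_supports(unsigned domain_ops, unsigned op)
{
    return amo_op_is_single(op) && (domain_ops & op) != 0;
}
```

A domain created with `AMO_OP_ADD | AMO_OP_GET | AMO_OP_SET` accepts an ADD call and rejects a CSWAP call, and a debug build could use the same test to flag calls that pass a combined mask where a single operation is expected.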

Former user Account Deleted

``` Nick said: "We'd now have to provide "everything" domains on a per-type basis. That said, people looking into AMOs would likely be fine creating what they need and wouldn't necessarily need to have an "everything" domain."

I don't think we should provide any pre-defined "everything" domains. Such a domain would by definition include all the ops (for a given type) and therefore would activate the most general (read "slowest") implementation for that type. The whole point of domains is for the user to explicitly tell the library which ops his module needs, so that the fastest possible implementation meeting those restricted needs can be provided. Pre-defining fully general domains would undermine that design goal.

Nick said: "Continuing along the lines of the AMO types, I would like to see upc_op_t (7.3.2.1) pulled out of the UPC Collectives library and made a common type for AMOs, Collectives, and whatever else may also want it. In addition to what is in 7.3.2.1, we would add UPC_GET, UPC_SET, UPC_NOT, and UPC_CSWAP. Both UPC Collectives and UPC Atomic would each have to specify which operations are supported."

Issue 10 discusses this factorization, and I agree this should be a high priority to complement the inclusion of AMO's in 1.3. UPC_GET and UPC_SET are definitely required for every type, since we prohibit direct access during AMO phases, and these are our guaranteed "tear-free" reads and writes. I assume your "UPC_NOT" is a bitwise complement, otherwise it should be named "UPC_LOGNOT" to match the established convention in the collectives. I haven't heard anyone argue for atomic logic operations (UPC_LOGAND, UPC_LOGOR), so let's leave those out of the supported AMO list for now.

Nick said: "For completeness with the types supported by the UPC Collectives, I would have upc_type_t support:

UPC_CHAR, UPC_SHORT, UPC_INT, UPC_LONG,"

I don't recall any committee discussion regarding AMO's on types smaller than 32-bits. I don't personally object to that, and I agree it would be nice to provide it for completeness and consistency with the UPC collectives and C11 AMOs. For software-based implementations it should only be a small additional burden to support additional types, and I suspect every implementation will need a fall-back software implementation anyhow.

```

Reported by `danbonachea` on 2012-09-28 23:42:33 - Blocked on: #10

2012-09-28T23:42:33+00:00

Former user Account Deleted

``` Dan said:

"I don't think we should provide any pre-defined "everything" domains."

I thought there were people (not me) that wanted to provide that as a convenience factor. Maybe I was mistaken. Either way, I don't think my users would use an "everything" domain for the very reason you note. They want the fastest for what they know they need. I'm happy to scratch the "everything domain" idea.

Dan said:

"I don't recall any committee discussion regarding AMO's on types smaller than 32-bits."

I am not at all advocating for AMOs on <32-bit types. I would probably be happy with 64-bit types only, but considering some notes about BG/P's AMO support (I think), 32-bit AMOs make sense, too. I noted that "although not all of these would be supported by AMOs or UPC Collectives" and I would consider AMOs to only apply to the 32- and 64-bit types we've discussed so far, as well as the pointer-to-shared type. ```

Reported by `nspark.work` on 2012-10-01 21:12:14

2012-10-01T21:12:14+00:00

Former user Account Deleted

``` Since "we" (I'm not sure how many people are really concerned about the AMO discussion) seem to be considering AMO domains on a per-type basis (i.e., only guaranteeing atomicity on a per-type basis), I was wondering whether it would be worth revisiting an "old style" proposal (e.g., upc_amo_xor_U64_r()) with a restricted operation set (defined on a per-type basis). My biggest concern with the domain approach, especially given that domains must be dynamically allocated, is general user acceptance of such a "heavy" API (cf. existing AMO extensions).

If we went with a subset of the operations noted in Comment 38 and Comment 81, could we define an easier-to-use API that is supported by Cray, IBM, SGI, and the IB-based solutions (HP, SGI), even if it isn't as robust (i.e., no MIN or MAX)? Would fetching and non-fetching versions of {ADD, AND, OR, XOR, GET, SET, CSWAP} for integer types be a supportable, minimal set across implementations?

Maybe we leave operations like MIN and MAX for a future update when vendor support is more widespread. Floating-point support wouldn't be an issue because of the per-type atomicity; we wouldn't include bit-wise ops for FP types. And most FP AMOs would probably be done in software anyway.

Then, we could let an implementation define macros like UPC_AMO_TYPE_IS_FAST to provide some information to the user about the "fastness" of the AMO. ```
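For comparison, C11 already exposes a per-type hint of this kind through the `ATOMIC_*_LOCK_FREE` macros in `<stdatomic.h>`. The sketch below is plain C11, not UPC. The helper `describe_lock_free` and the macro `COUNTER_IS_FAST` are names made up for illustration, and `UPC_AMO_TYPE_IS_FAST` itself is only a proposal in this thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

/* Hypothetical helper: turn a C11 ATOMIC_*_LOCK_FREE value into text.
   C11 defines 0 = never lock-free, 1 = sometimes, 2 = always. */
static const char *describe_lock_free(int level)
{
    switch (level) {
    case 0:  return "never";
    case 1:  return "sometimes";
    case 2:  return "always";
    default: return "unknown";
    }
}

/* Compile-time selection, analogous to testing a proposed
   UPC_AMO_<TYPE>_IS_FAST macro before choosing an algorithm.
   COUNTER_IS_FAST is an illustrative name, not part of any spec. */
#if ATOMIC_LLONG_LOCK_FREE == 2
#define COUNTER_IS_FAST 1
#else
#define COUNTER_IS_FAST 0
#endif
```

Because the macros are usable in `#if`, a program can pick a code path at compile time, which is the usage a `*_IS_FAST` macro would presumably enable as well.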

Reported by `nspark.work` on 2012-10-03 14:22:21

2012-10-03T14:22:21+00:00

Former user Account Deleted

``` "Floating-point support wouldn't be an issue because of the per-type atomicity; we wouldn't include bit-wise ops for FP types."

What's the hangup for bit-wise operations? These operations just work on bits and don't need to know whether those bits happen to be an integer or a floating point value (or anything else for that matter). Not including them will just cause users to play dirty games with unions or bad pointer aliasing. We might as well do that inside the library instead of forcing users to do it. ```

Reported by `sdvormwa@cray.com` on 2012-10-03 14:34:32

2012-10-03T14:34:32+00:00

Former user Account Deleted

``` "What's the hangup for bit-wise operations? [...]"

I just didn't think many people would care to AND, OR, or XOR two floating-point values. I guess that I just don't see what's helpful about that. I think that ADD, GET, SET, CSWAP all make sense; I just can't see how you'd use the others in a meaningful way. Also, I don't necessarily feel like the UPC AMO library needs to provide everything under the sun; I think it should provide those things that are meaningful or helpful to users. ```

Reported by `nspark.work` on 2012-10-03 14:39:56

2012-10-03T14:39:56+00:00

Former user Account Deleted

``` I support Comment 86 by Nick. I like the idea of starting off from a small and simple API and enhancing it over time as hardware evolves. Based on Comment 38, it seems int64 is the only type we need to support now if we decide to go with this direction.

```

Reported by `yzheng@lbl.gov` on 2012-10-03 14:46:40

2012-10-03T14:46:40+00:00

Former user Account Deleted

``` "I just didn't think many people would care to AND, OR, or XOR two floating-point values."

This is very true. I'd expect bit-wise ops to take appropriately sized integer values as masks. ;)

"I guess that I just don't see what's helpful about that. I think that ADD, GET, SET, CSWAP all make sense; I just can't see how you'd use the others in a meaningful way."

Just like with integers, they're extremely useful for directly manipulating various portions of the raw representation. For instance, with an IEEE representation, adding XOR allows users to atomically negate a floating point value, and adding AND allows users to atomically find the absolute value.

On an unrelated note, are you including SWAP under SET? SWAP is very important in many lock-free algorithms.

"Also, I don't necessarily feel like the UPC AMO library needs to provide everything under the sun; I think it should provide those things that are meaningful or helpful to users."

Agreed. ```
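A minimal sketch of the negate and absolute-value tricks mentioned above, written in plain C11 rather than UPC. It assumes an IEEE 754 binary64 `double` stored in a 64-bit cell; the helper names (`bits_of`, `double_of`, `atomic_negate`, `atomic_fabs`) are illustrative only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(double) == sizeof(uint64_t), "assumes a 64-bit double");

#define SIGN_BIT UINT64_C(0x8000000000000000)

/* Reinterpret the bits of a double as a 64-bit integer and back. */
static uint64_t bits_of(double d) { uint64_t u; memcpy(&u, &d, sizeof u); return u; }
static double double_of(uint64_t u) { double d; memcpy(&d, &u, sizeof d); return d; }

/* Atomically flip the sign bit: x = -x. */
static void atomic_negate(_Atomic uint64_t *cell)
{
    atomic_fetch_xor(cell, SIGN_BIT);
}

/* Atomically clear the sign bit: x = fabs(x). */
static void atomic_fabs(_Atomic uint64_t *cell)
{
    atomic_fetch_and(cell, ~SIGN_BIT);
}
```

The masks are integers of the same width as the floating-point value, which matches the expectation that bit-wise ops on FP types would take integer masks as operands.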

Reported by `sdvormwa@cray.com` on 2012-10-03 14:52:12

2012-10-03T14:52:12+00:00

Former user Account Deleted

``` "Based on Comment 38, it seems int64 is the only type we need to support now if we decide to go with this direction."

I disagree. I think we should still support [u]int(32|64), float/double, and PTS, but (according to Comment 38) Cray would only define UPC_AMO_INT64_IS_FAST == 1. I think the big benefit of the per-type atomicity is that it lets a vendor, like Cray, tell the user which AMO types are fast and which are not, while still providing reasonably broad support across the library. ```

Reported by `nspark.work` on 2012-10-03 14:52:16

2012-10-03T14:52:16+00:00

Former user Account Deleted

``` "Just like with integers, they're extremely useful for directly manipulating various portions of the raw representation. For instance, with an IEEE representation, adding XOR allows users to atomically negate a floating point value, and adding AND allows users to atomically find the absolute value."

I hadn't thought of that, but that is a neat use for bit-wise ops on FP types. Since most NIC-based AMOs seem to be atomic across types, I would imagine that your future FP support (noted in Comment 38) wouldn't be tied down by UPC allowing bit-wise FP ops.

"On an unrelated note, are you including SWAP under SET? SWAP is very important in many lock-free algorithms."

Yeah, I was considering SWAP = FETCH + SET. If we go the "explicit" route (like the current BUPC and Cray UPC AMO extensions), then we'll probably have two functions like: "void upc_amo_set()" and "TYPE upc_amo_swap()" (I'm intentionally leaving out the type and relaxed/shared stuff here). ```

Reported by `nspark.work` on 2012-10-03 14:58:52

2012-10-03T14:58:52+00:00

Former user Account Deleted

``` For whatever it's worth to this discussion, I talked to our Fortran rep about how their standards committee was settling on which AMO operations to include for coarrays in the next standard (since Fortran currently has only an atomic read and an atomic write). He said that their approach is to require a set of operations that all implementations will be expected to support (add, and, or, xor, swap, compare-and-swap) and they basically assume that any "reputable hardware" will support them all efficiently. The idea of bothering with domains, a query mechanism for what is fast or slow, or worrying about some poor vendor that had hardware support for only a subset and was therefore forced into software atomics seemed ridiculous to him. Then again, they also have a specific implementation-defined integer data type for atomics. ```

Reported by `johnson.troy.a` on 2012-10-03 15:17:23

2012-10-03T15:17:23+00:00

Former user Account Deleted

``` "Also, I don't necessarily feel like the UPC AMO library needs to provide everything under the sun; I think it should provide those things that are meaningful or helpful to users."

I think the problem is that while any specific user is likely to only consider a small set of AMO types and ops "meaningful and helpful", that set will differ between users/programs/modules, and the union of those requirements ends up being large. The domain API allows users to express which subset they care about for the current algorithm and the implementation to provide the best possible performance within those constraints. On the flip side, it allows implementations to expose whatever hardware AMO support they have for a particular algorithm, without requiring the hardware to support the complete list of ops in the UPC spec (most of which won't be used by any single algorithm).

A secondary, more subtle feature of domains is they can improve concurrency by directly expressing the independence of AMOs from different modules or program entities, which need not contend for the same atomicity resources (software locks in the case of software-implemented AMOs). Imagine a massively-parallel application that divides its threads into a large number of teams, each of which creates an AMO domain for access to each team's shared data. If the data type in question happens to require a software AMO (eg atomic add on a floating-point accumulator), then members of each team contend for the team's lock while performing the AMO, but AMO's from different teams can always proceed independently. Without domains, then EVERY thread in the application potentially contends for ONE lock, even for non-conflicting updates, which could make a massive difference in overall performance. Once UPC officially grows teams, it also gives us a natural way to ensure the locks associated with the domain are located "near" the team using them, which may mean the difference between network communication and shared-memory locking, another potentially enormous performance win. One could even imagine team domain creation that activates a fully hardware implementation when the team members are all local to a shared-memory node, but falls back to a software implementation when the team includes members who are remote over a commodity network lacking AMO features. C11 doesn't worry about such issues because its target platform is fully shared-memory threading with relatively small-scale concurrency, where these issues have far less impact. I suspect the same is true for the majority of the Fortran community (those not using co-array features). It's our job to ensure the design we specify allows good scaling on large-scale systems (high thread concurrency), and large-scale applications (high module concurrency), and I think domains provide that.

From a programmability standpoint, I don't think the domain proposal is overly cumbersome or "heavy", relative to a "domain-less" API. The domain is created once at module initiation, with one line of code. Thereafter for AMO operations the difference we're talking about is whether the user writes something like:

upc_amo_strict(int64dom, 0, UPC_ADD, &val, 42, 0);

as opposed to something like:

upc_amo_strict_int64(0, UPC_ADD, &val, 42, 0);

ie the only effective difference at the point of use is whether the operand type is expressed using the domain argument or in the spelling of the function name. In my code above the difference is 2 characters, and could be less if I named my domain variable differently. How is this a major burden? The tiny additional complexity (for the programmer) seems well worth the potential performance gain.

"I think the big benefit of the per-type atomicity is that it lets a vendor, like Cray, tell the user which AMO types are fast and which are not, while still providing reasonably broad support across the library."

We seem to have consensus that per-type atomicity is a mandatory design feature. It allows the implementation (and performance) to vary based on the operand type, and it prohibits nasty interactions of atomics of different type widths conflicting on the same memory locations, which seems like a Good Thing. Also, C11 atomics take the same position, so it will be a familiar restriction.

""On an unrelated note, are you including SWAP under SET? SWAP is very important in many lock-free algorithms." Yeah, I was considering SWAP = FETCH + SET. "

To clarify, most lock-free algorithms I've seen require an atomic pointer COMPARE and swap, not just an unconditional swap. This is NOT equivalent to a fetch and set. For reference, C11 DOES provide atomic pointer CAS (via uintptr_t) and even atomic pointer increment (which I'm not advocating for UPC), but does NOT provide any atomic ops on floating point types (although given that many HPC codes are so heavily dominated by FLOPs it seems reasonable for UPC to add support for atomics on floats).

"Then, we could let an implementation define macros like UPC_AMO_TYPE_IS_FAST to provide some information to the user about the "fastness" of the AMO."

This seems like a nice usability feature. C11 has something similar and slightly more sophisticated. I think we can choose to provide this independently of any other design decisions.

```
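To make the swap versus compare-and-swap distinction concrete, here is a small sketch in plain C11 (not UPC). The `struct node` type and the functions `swap_head` and `push` are illustrative names. Only the push side of a lock-free stack is shown, since a correct pop needs additional care.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct node { int value; struct node *next; };

/* Unconditional swap (fetch + set): always installs the new value
   and returns whatever was there before. */
static struct node *swap_head(struct node *_Atomic *head, struct node *n)
{
    return atomic_exchange(head, n);
}

/* Lock-free push: installs n only if head still equals the value we
   last observed. On failure, expected is refreshed and we retry. */
static void push(struct node *_Atomic *head, struct node *n)
{
    struct node *expected = atomic_load(head);
    do {
        n->next = expected;
    } while (!atomic_compare_exchange_weak(head, &expected, n));
}
```

With only `swap_head`, a concurrent push between reading the old head and installing the new one would be silently overwritten; the conditional update in `push` is what detects that case.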

Reported by `danbonachea` on 2012-10-04 04:41:52

2012-10-04T04:41:52+00:00

Former user Account Deleted

``` Nick - I'm in the process of drafting the spec for upc_type_t, based on your comment 81. I noticed that C11 defines the following types for AMOs that aren't mentioned in your list, so I wanted to float these past you (no pun intended):

These I suspect are irrelevant for the majority of UPC codes:
  _Bool
  wchar_t

These are for issuing AMOs that perform local pointer arithmetic, which perhaps we don't care about?:
  (u)intptr_t - integer type convertible to/from (void *)
  ptrdiff_t - the result of subtracting two pointers

These seem potentially more applicable to UPC codes:
  (u)intmax_t - the widest available integer type
  size_t - the size of objects

Finally, the C11 AMOs use the stdint minimum-width types:
  (u)int_least{8,16,32,64}_t and (u)int_fast{8,16,32,64}_t
instead of the fixed-width types stated in comment 81:
  (u)int{8,16,32,64}_t
The main difference is the minimum-width types may be larger than requested and are required for C99 compliance, whereas the fixed-width types must be exact-sized and are technically optional by C99/C11. In practice every current HPC system I've encountered (all the ones supporting GASNet) provides all the exact-sized types, so perhaps this is a non-issue for the platforms of interest. However to be safe we might want to add a clause to the AMO spec stating that support for AMOs on the fixed-width types is implementation-specified.

```
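As a sketch of how code can cope with the exact-width types being optional: C99 requires `<stdint.h>` to define `INTn_MAX` exactly when `intN_t` is provided, so the limit macro can serve as a feature test. The names `amo_word_t`, `AMO_WORD_IS_EXACT` and `amo_word_bits` are hypothetical and not part of any proposal in this thread.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* The exact-width types are optional in C99/C11, and INT64_MAX is
   defined only when int64_t exists, so it doubles as a feature test. */
#ifdef INT64_MAX
typedef int64_t amo_word_t;        /* exact 64-bit type is available */
#define AMO_WORD_IS_EXACT 1
#else
typedef int_least64_t amo_word_t;  /* always available, may be wider */
#define AMO_WORD_IS_EXACT 0
#endif

/* Storage width of the chosen type in bits (including padding bits,
   which an exact-width type does not have). */
static int amo_word_bits(void)
{
    return (int)(sizeof(amo_word_t) * CHAR_BIT);
}
```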

Reported by `danbonachea` on 2012-10-07 20:38:25

2012-10-07T20:38:25+00:00

Former user Account Deleted

``` I have worked on the reference implementation today, in the faint hope that I'd flush out a few things that were not obvious in the documentation.

a) I had to redefine all types, and I had to make them bitmask-able. Something like this:

enum {
  UPC_AMO_CHAR   = (1<<0),
  UPC_AMO_SHORT  = (1<<1),
  UPC_AMO_INT    = (1<<2),
  UPC_AMO_LONG   = (1<<3),
  UPC_AMO_UCHAR  = (1<<4),
  UPC_AMO_USHORT = (1<<5),
  ... etc ...
};

Same goes for the operations. I don't know whether the original UPC definitions of upc_type_t are asking for bit-maskable values. Something to consider.

b) There is no good error reporting mechanism in UPC. What if I give the domain allocator a hint that I want the superfast HW implementation, but then ask for support for types that are not supported by HW?

-- should the domain creation fail?
-- should the domain creation succeed, but subsequent operations fail?
-- should failures be fatal (i.e. kill the UPC program), or should we have status codes returned by upc_amo_strict and upc_amo_relaxed?

I implemented upc_amo_query() in any case to check whether a particular domain implements a particular combination of types and operations.

c) the domain constructors do not return shared objects, and therefore the _all_ and _global_ protocol does not apply. This is a pity, because it breaks consistency with the existing lock and memory allocators.

-- should we follow the consistency rules of memory and lock allocators at all costs?

If we talk on the phone tomorrow, I will have some code. Suggestions as to where to post the code? even if it's throwaway stuff (as it most likely is), it might provide a further platform for discussion.

```

Reported by `ga10502` on 2012-10-08 21:26:50

2012-10-08T21:26:50+00:00

Former user Account Deleted

``` "I have worked on the reference implementation today, in the faint hope that I'd flush out a few things that were not obvious in the documentation."

Excellent - this exercise should definitely provide additional insights.

"I had to redefine all types, and I had to make them bitmask-able."

I don't think there is any reason for the values of upc_type_t to each have a dedicated bit. These are never OR'd together in the proposed interfaces we've tossed around. Note each domain should only be for a SINGLE unique type - I believe we have consensus that we should not combine multiple types within an atomicity domain (mostly because it sidesteps word tearing issues). The upc_types.h header proposal I drew up for issue 10 only requires the type macros to have distinct values (and also spells them differently than what you pasted).

"What if I give the domain allocator a hint that I want the superfast HW implementation, but then ask for support for types that are not supported by HW? "

There should be no "superfast HW" hint - the hints are of the form "tune for latency", "tune for throughput" etc, but in all cases "do the best you can for this combination of type and ops" is implied. In any case the hint value should not cause an error - it's a hint, not a demand. There should not be a hint value to select a particular vendor-specific implementation, because that would create portability problems.

If the user specifies a prohibited combination of type and op (eg UPC_PTS with UPC_MULT), or type/op values that don't exist, then that should be a violation of a "shall" constraint in the AMO spec - which means behavior is undefined, but this should be a fatal error in a high-quality implementation. I see no motivation to provide graceful failure for such programming errors.

"the domain constructors do not return shared objects, and therefore the _all_ and _global_ protocol does not apply. This is a pity, because it breaks consistency with the existing lock and memory allocators. "

Please see comment 83 - I intentionally followed the form of the upc_lock_t interface, which I think addresses your comment...

```

Reported by `danbonachea` on 2012-10-08 21:42:52

2012-10-08T21:42:52+00:00

Former user Account Deleted

``` "If the user specifies a prohibited combination of type and op (eg UPC_PTS with UPC_MULT), or type/op values that don't exist, then that should be a violation of a "shall" constraint in the AMO spec - which means behavior is undefined, but this should be a fatal error in a high-quality implementation. I see no motivation to provide graceful failure for such programming errors."

The user specifying type/op combinations that aren't supported by the implementation should be a fatal error (or undefined behavior), and an implementation should support all required combinations in the spec. However, I think we would want to explicitly permit implementations to support a documented super-set of the required combinations so they can introduce new (non-portable) functionality for users to experiment with. This way implementations wouldn't need a duplicate AMO API, and satisfying the "two independent implementations" criterion for future extensions becomes a bit easier. ```

Reported by `sdvormwa@cray.com` on 2012-10-08 21:56:52

2012-10-08T21:56:52+00:00

Former user Account Deleted

``` In lieu of a better solution I'll spam all of you with the tar file of the current implementation. Please comment. I expect most of the code to change.

"I don't think there is any reason for the values of upc_type_t to each have a dedicated bit. These are never OR'd together in the proposed interfaces we've tossed around. Note each domain should only be for a SINGLE unique type - I believe we have consensus that we should not combine multiple types within an atomicity domain (mostly because it sidesteps word tearing issues). The upc_types.h header proposal I drew up for issue 10 only requires the type macros to have distinct values (and also spells them differently than what you pasted)."

Misunderstanding on my part, then. Looking purely at the API proposed somewhere above in this thread - I interpreted the type and op arguments for domain creation as bitmasks of desired operation. I may agree with Dan about the type, but it certainly makes no sense to me to create one domain for each operation. So my argument will still apply to the operations.

"There should be no "superfast HW" hint - the hints are of the form "tune for latency", "tune for throughput" etc, but in all cases "do the best you can for this combination of type and ops" is implied. In any case the hint value should not cause an error - it's a hint, not a demand. There should not be a hint value to select a particular vendor-specific implementation, because that would create portability problems."

OK. No errors upon domain creation. I can still ask to execute an AMO on an un-implemented operation on a domain. So AMOs can still issue "not implemented" errors. Hence my choice to return a status code from the AMO. Needless to say I'm not very strongly attached to this code, so I can be talked out of this choice.

"Please see comment 83 - I intentionally followed the form of the upc_lock_t interface, which I think addresses your comment..."

I guess the question in my mind is this: is an AMO domain a shared object? A lock is obviously a shared object.

If the AMO domain is shared, a single thread can free it, and the whole alloc/free protocol translates directly from memory allocation and lock allocation.

If the AMO domain is local, then the protocol does not apply.

```

Reported by `ga10502` on 2012-10-09 12:53:35

2012-10-09T12:53:35+00:00

Former user Account Deleted

``` "In lieu of a better solution I'll spam all of you with the tar file of the current implementation."

You can click the link next to the paper clip icon with the label "Attach a file" to add the file to a ticket.

```

Reported by `gary.funck` on 2012-10-09 13:10:39

2012-10-09T13:10:39+00:00

Former user Account Deleted

``` On 10/09/12 09:07:27, Gheorghe Almasi wrote:

The code is already somewhat out of date based on Dan's comments.

(See attached file: upc_amo_0.01.tar.gz)

```

Reported by `gary.funck` on 2012-10-09 14:29:32

<hr>

*Attachment: [upc_amo_0.01.tar.gz](https://storage.googleapis.com/google-code-attachments/upc-specification/issue-7/comment-101/upc_amo_0.01.tar.gz)*

2012-10-09T14:29:32+00:00

Former user Account Deleted

``` "I may agree with Dan about the type, but it certainly makes no sense to me to create one domain for each operation. So my argument will still apply to the operations."

Agreed - upc_op_t values are specified to have unique bits in the issue10 proposal, specifically to allow creation of AMO domains that support multiple operations. However it turns out there are far fewer OPs than TYPEs that we may decide to support, so we should not run out of bits.

"I can still ask to execute an AMO on an un-implemented operation on a domain. So AMOs can still issue "not implemented" errors."

I assume by "unimplemented" you mean a situation where the user created a domain for upc_type_t T and upc_op_t (OP1|OP2|OP3), and then tried to invoke an AMO to perform OP4. This is a programming error and should just be undefined behavior (ie fatal error in a high-quality runtime implementation).

Note the implementation is required to provide a "working" AMO domain for any valid combination of type and ops the user requested at domain creation time (where "valid combination" is defined by the SPEC, not the implementation). This is what we mean when we say we are requiring all implementations to provide a fully general fallback mode (probably implemented in software with locks) that can correctly handle any combination permitted by the spec.

"I guess the question in my mind is this: is an AMO domain a shared object? A lock is obviously a shared object."

My proposal in comment 83 makes them a shared object just like a upc_lock_t. This is not the only possible design, but nobody has raised an objection to it yet. I think making them shared objects leverages familiarity with upc_lock_t, and provides some small usability features that may occasionally be useful (eg the ability to construct a complicated, dynamic data structure in shared memory with embedded references to amo domain objects that protect certain independent parts of the data structure).

"If the AMO domain is shared, a single thread can free it, and the whole alloc/free protocol translates directly from memory allocation and lock allocation."

So far nobody has presented a strong case for providing a "free" operation for amo domains at all (since our expected usage case is a small constant number of objects that have a lifetime of the entire program). A C purist would argue to include one to prevent memory leaks, so perhaps we should provide one in the interests of making the interface "complete" (and add clarifying text to make sure the programmer never uses a domain that was freed by any thread). It should probably work just like upc_lock_free().

```
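A small sketch of the "one bit per operation" idea and the check an implementation might run when an AMO is issued on a domain. It is plain C with hypothetical stand-ins (`op_t`, the `OP_*` constants, `struct domain`, `domain_allows`); the real `upc_op_t` values and domain representation are whatever the issue 10 and AMO proposals end up specifying.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for upc_op_t values: one bit per operation,
   so a set of operations can be expressed by OR-ing them together. */
typedef unsigned int op_t;
enum {
    OP_ADD   = 1u << 0,
    OP_AND   = 1u << 1,
    OP_OR    = 1u << 2,
    OP_XOR   = 1u << 3,
    OP_GET   = 1u << 4,
    OP_SET   = 1u << 5,
    OP_CSWAP = 1u << 6
};

/* Hypothetical stand-in for a domain: a single type plus a set of ops. */
struct domain { int type; op_t ops; };

/* True when op names exactly one operation and the domain was
   created with that operation enabled. */
static bool domain_allows(const struct domain *d, op_t op)
{
    bool single_bit = op != 0 && (op & (op - 1)) == 0;
    return single_bit && (d->ops & op) != 0;
}
```

A high-quality runtime could turn a `false` result into a fatal error, in line with treating an unrequested operation as a programming error.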

Reported by `danbonachea` on 2012-10-09 14:34:53

2012-10-09T14:34:53+00:00

Former user Account Deleted

``` "is an AMO domain a shared object?"

Note that even if we decide that AMO domains are shared objects, like the upc_lock_t-inspired shared object design in comment 83, nothing prevents an implementation from storing whatever local information it wants, to speed up performance. In fact, one of the most obvious optimizations for a software-based AMO implementation (one built on UPC locks) would be to allocate one upc_lock_t object per-THREAD as part of each AMO domain, since AMO operations on locations with affinity to different threads are guaranteed not to conflict. This allows AMO ops on locations with affinity to the calling thread to always complete without communication. Also, it would be nice for error checking like validating the requested op to happen without communication in the library.

For example (modifying George's code, WARNING dry-coded):

#include <assert.h>
#include <upc.h>
#include <upc_types.h>

typedef shared void upc_amodomain_t;

typedef struct {
  upc_lock_t *lock;
  upc_type_t type;
  upc_op_t ops;
} upc_amodomain_internalrep_t;

upc_amodomain_t *upc_all_amodomain_alloc(upc_op_t ops, upc_type_t type, upc_amohint_t hints) {
  shared [1] upc_amodomain_internalrep_t *domain =
      upc_all_alloc(THREADS, sizeof(upc_amodomain_internalrep_t));
  shared [1] upc_amodomain_internalrep_t *localdom = domain + MYTHREAD;
  localdom->lock = upc_global_lock_alloc(); /* one lock per thread */
  localdom->ops = ops;
  localdom->type = type;
  /* we ignore hint */
  upc_barrier; /* make every thread's entry visible before the domain is used */
  return domain;
}

void upc_amo_strict(upc_amodomain_t *domain, void *fetch_ptr, upc_op_t op,
                    shared void *target, void *operand1, void *operand2) {
  assert(domain);
  shared [1] upc_amodomain_internalrep_t *dom = domain;
  shared [1] upc_amodomain_internalrep_t *localdom = dom + MYTHREAD;
  assert(localdom->ops & op); /* use local metadata to check for errors, and retrieve the type */
  upc_type_t tgttype = localdom->type;
  int tgtthread = upc_threadof(target);
  upc_lock_t *tgtlock = dom[tgtthread].lock; /* could also add a directory to cache this locally */
  upc_lock(tgtlock);
  ...
}

```
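For readers without a UPC toolchain, the per-thread-lock idea can be sketched in plain C11. This is only an analogue: `atomic_flag` spinlocks stand in for `upc_lock_t`, `NTHREADS` stands in for `THREADS`, and the owner index is passed explicitly where UPC would call `upc_threadof(target)`. The name `locked_fetch_add` is illustrative.

```c
#include <assert.h>
#include <stdatomic.h>

#define NTHREADS 4   /* stand-in for UPC's THREADS */

/* One spinlock per owning thread, in place of one upc_lock_t each. */
static atomic_flag owner_lock[NTHREADS] = {
    ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT
};

/* Software fetch-and-add on a double that has affinity to `owner`.
   Only updates to data of the same owner contend for the same lock. */
static double locked_fetch_add(int owner, double *target, double operand)
{
    assert(owner >= 0 && owner < NTHREADS);
    while (atomic_flag_test_and_set_explicit(&owner_lock[owner],
                                             memory_order_acquire))
        ;   /* spin until the owner's lock is free */
    double old = *target;
    *target = old + operand;
    atomic_flag_clear_explicit(&owner_lock[owner], memory_order_release);
    return old;
}
```

Updates that target different owners take different locks and so never wait on each other, which is the concurrency benefit described for the per-thread layout.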

Reported by `danbonachea` on 2012-10-09 15:22:38

2012-10-09T15:22:38+00:00

Former user Account Deleted

``` Revision: 154
Author: ga10502@gmail.com
Date: Tue Oct 9 08:03:50 2012
Log: Added reference upc_amo library
http://code.google.com/p/upc-specification/source/detail?r=154

Added:
 /trunk/reference_code
 /trunk/reference_code/upc_amo
 /trunk/reference_code/upc_amo/Makefile
 /trunk/reference_code/upc_amo/upc_amo.h
 /trunk/reference_code/upc_amo/upc_amolib.upc
 /trunk/reference_code/upc_amo/upc_amotest.upc

```

Reported by `gary.funck` on 2012-10-09 15:32:11

2012-10-09T15:32:11+00:00

Former user Account Deleted

``` Arguably, the inclusion of stdint.h was unnecessary, and the upc_amo.h header file is the correct place for the inclusion of stdint.h.

Revision: 155
Author: gary.funck@gmail.com
Date: Tue Oct 9 08:21:57 2012
Log: Misc. fixes to AMO ref. implementation. Include <stdint.h> and fix prototypes.
http://code.google.com/p/upc-specification/source/detail?r=155

Modified:
 /trunk/reference_code/upc_amo/upc_amo.h
 /trunk/reference_code/upc_amo/upc_amolib.upc
 /trunk/reference_code/upc_amo/upc_amotest.upc

If we agree, I will revert the change and add the include to upc_amo.h.

```

Reported by `gary.funck` on 2012-10-09 15:34:57

2012-10-09T15:34:57+00:00

Former user Account Deleted

Yes Gary. Do commit. Sorry for the oversight.

Reported by ga10502 on 2012-10-09 19:05:57

2012-10-09T19:05:57+00:00

Former user Account Deleted

Re: comment 103 by Dan: good insight on local locks - you might want to test & checkin.


In any case the domain becomes a shared object distributed across threads, with each
thread performing local access. A fair bit of creation overhead and some (not much)
access overhead. I can re-code it as such and the question is settled.

Reported by ga10502 on 2012-10-09 19:11:16

2012-10-09T19:11:16+00:00

Former user Account Deleted

At present, none of the stdint types show up in the AMO interface, therefore, I don't
think inclusion of stdint.h from within upc_amo.h is indicated.  We can discuss.

Reported by gary.funck on 2012-10-09 19:12:13

2012-10-09T19:12:13+00:00

Former user Account Deleted

I should have said: I will also implement Dan's clarification w.r.t. types. I may not
be able to checkin by the time of the phone conference, but I will correctly code the
domain-as-a-shared object and single-type-domain changes discussed here today.

Reported by ga10502 on 2012-10-09 19:20:34

2012-10-09T19:20:34+00:00

Former user Account Deleted

"good insight on local locks - you might want to test & checkin."

I was just demonstrating that requiring the AMO domain to be a shared object should
not impose any significant use-time performance penalty in a good implementation. I'm
not sure the per-thread-locks optimization belongs in a "reference" implementation
- after correctness and conformance, is the next goal of that implementation performance
or simplicity?

As for domain creation overhead, my implementation strategy in comment 103 for a shared
domain object should have roughly the same communication overhead as the non-shared
domain object code currently in SVN - it just replaces a upc_all_lock_alloc collective
with a upc_all_alloc collective, which have roughly the same cost in a good implementation.
It also assumes upc_global_lock_alloc() can always be completed locally, but that's
the whole point of the per-thread-locks optimization (which can be disabled independently
of the use of a shared domain object).

Reported by danbonachea on 2012-10-09 19:30:11

2012-10-09T19:30:11+00:00

Former user Account Deleted

I just realized one side effect of single-type domains: we lose one of the many parameters
to the upc_amo_* calls (the type becomes superfluous, since the domain already specifies
it).

The drawback is that the testing function will now have to create multiple domains,
one for each tested type. But I tend to agree with Dan that "real life" applications
will use only one data type for AMOs - most frequently [u]int32 and [u]int64.

Reported by ga10502 on 2012-10-09 19:42:29

2012-10-09T19:42:29+00:00

Former user Account Deleted

George - your latest code push adds a non-collective "global_alloc" allocator for domains.
I don't see a strong motivation for providing this, and given the amount of implementation
grumbling about upc_global_alloc, I'm not sure we should be providing that feature.
I'd advocate keeping it simple with just upc_all_amodomain_alloc, and MAYBE a free
operation.

Incidentally, I believe your current implementation also happens to be incorrect (it
fails to initialize the remote thread data structures).

Reported by danbonachea on 2012-10-09 19:51:48

2012-10-09T19:51:48+00:00

Former user Account Deleted

I just committed an "interim draft".  It's definitely not complete.  Due to my tardiness,
I've attached the PDF render (in case not everyone has access to LaTeX right now).

Looking over George's code, I'm not entirely certain that I understand the
utility of upc_amo_query().  Initially, my thoughts for a query function were to give
the user a means to test whether the AMOs specified by the type and ops were supported
"in hardware."  George's code seems to indicate that it queries the domain to see whether
the domain supports the specific AMOs.

Would it make sense to have upc_amo_query() return upc_amohint_t, so that the user
can know what hints (maybe we do need a UPC_AMO_HINT_HARDWARE) are valid for a type
and set of ops?  In which case, it wouldn't be querying the domain, but the implementation
(which was what I thought the initial intent was).

Reported by nspark.work on 2012-10-09 19:54:47

<hr> * Attachment: upc-lib-atomic-ops-spec-draft.pdf

2012-10-09T19:54:47+00:00

Former user Account Deleted

Dan, correct- I just discovered and fixed the problem in the global allocator, have
yet to checkin. I implemented the function and didn't write a test for it - a stupid
mistake. Apologies.

As far as including or not including, I don't have a strong opinion.

Reported by ga10502 on 2012-10-09 19:55:19

2012-10-09T19:55:19+00:00

Former user Account Deleted

Nick - A few minor nitpicks on the current document text, which weren't mentioned in
the call:

* Please add the following paragraph to 7.4.1, for conformance with issue 91:

    Unless otherwise noted, all of the functions, types and macros
    specified in Section~\ref{upc-amo}
    are declared by the header {\tt <upc\_amo.h>}.

* Also add this paragraph:

    Every inclusion of {\tt <upc\_amo.h>} has the effect of including {\tt <upc\_types.h>}.

* The function Synopsis sections should not #include <upc.h>, just delete that line
from each.

* For the broken links to upc_type_t and upc_op_t, you should use the latex macros:
   \upcopsection \upctypesection
  I've just added these to the common preamble, so you can use them now, although the
referenced sections won't exist until the issue10 branch is merged.

Reported by danbonachea on 2012-10-11 03:36:18

2012-10-11T03:36:18+00:00

Former user Account Deleted

Dan, I think I addressed your notes (and those from Tuesday's call) in the latest commit.

One thing that stood out when I was typing up the example for upc_amo_relaxed() was
that to do something like an increment by a fixed value, you'd have to actually have
that value stored in a constant somewhere.

Dan had presented the example:
> shared int32_t val;
> int main() {
>   upc_amodomain_t *mydom = upc_all_amodomain_alloc((UPC_ADD|UPC_GET|UPC_SET), UPC_TYPE_INT32, UPC_AMO_HINT_LATENCY);
>   upc_amo_strict(mydom, 0, UPC_ADD, &val, 42, 0);
> }

However, this wouldn't perform "val = val + 42", since 42 must be passed as a (void*).
 I think that this would actually have to be:
> shared int32_t val;
> const int32_t incr = 42;
> int main() {
>   upc_amodomain_t *mydom = upc_all_amodomain_alloc((UPC_ADD|UPC_GET|UPC_SET), UPC_TYPE_INT32, UPC_AMO_HINT_LATENCY);
>   upc_amo_strict(mydom, 0, UPC_ADD, &val, &incr, 0);
> }

It's not much worse, but it may motivate us to add a UPC_INCR so that users don't have
to have "TYPE ONE = 1;", which just feels silly to me.  Unfortunately, this could balloon
and motivate adding UPC_NOT and potentially others.

Reported by nspark.work on 2012-10-11 15:15:08

2012-10-11T15:15:08+00:00

Former user Account Deleted

"One thing that stood out when I was typing up the example for upc_amo_relaxed() was
that to do something like an increment by a fixed value, you'd have to actually have
that value stored in a constant somewhere."

I concur - my example was incorrect. This is an unfortunate side-effect of providing
a type-generic interface via a level of indirection (ie void *), rather than using
generic functions (as in C11). It's a minor nuisance, but I don't see it as a show-stopper.
I wouldn't argue to add a UPC_INCR, unless we believed there was a platform that could
do atomic-add-one faster than atomic-add-N.
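
To make the contrast concrete, here is a minimal C11 sketch (plain C, not UPC). The names `sketch_type_t`, `sketch_fetch_add` and `sketch_add_42` are hypothetical and are not part of the proposed AMO library; the point is only that a `void *` interface forces the operand into a named object, whereas C11's generic `atomic_fetch_add` takes it by value.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type tag standing in for upc_type_t (not the real definition). */
typedef enum { SKETCH_INT32, SKETCH_INT64 } sketch_type_t;

/* Type-generic fetch-and-add through void *. One C signature cannot take
   "an integer of any width" by value, so the operand arrives by reference.
   fetch may be NULL when the caller does not want the old value. */
static void sketch_fetch_add(sketch_type_t type, void *target,
                             const void *operand, void *fetch)
{
    if (type == SKETCH_INT32) {
        int32_t old = atomic_fetch_add((_Atomic int32_t *)target,
                                       *(const int32_t *)operand);
        if (fetch != NULL) *(int32_t *)fetch = old;
    } else {
        int64_t old = atomic_fetch_add((_Atomic int64_t *)target,
                                       *(const int64_t *)operand);
        if (fetch != NULL) *(int64_t *)fetch = old;
    }
}

/* The caller needs a named constant just to pass 42. */
static int32_t sketch_add_42(void)
{
    _Atomic int32_t val = 0;
    const int32_t incr = 42;
    sketch_fetch_add(SKETCH_INT32, &val, &incr, NULL);
    /* With C11's by-value generic function this would simply be:
       atomic_fetch_add(&val, 42); */
    return atomic_load(&val);
}
```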

Reported by danbonachea on 2012-10-11 17:13:36

2012-10-11T17:13:36+00:00

Former user Account Deleted

"I concur - my example was incorrect. This is an unfortunate side-effect of providing
a type-generic interface via a level of indirection (ie void *), rather than using
generic functions (as in C11). It's a minor nuisance, but I don't see it as a show-stopper.
I wouldn't argue to add a UPC_INCR, unless we believed there was a platform that could
do atomic-add-one faster than atomic-add-N."

It's also a potential performance hit on some platforms due to the additional memory
loads/stores if the operands are pass-by-reference.

Reported by sdvormwa@cray.com on 2012-10-11 17:25:13

2012-10-11T17:25:13+00:00

Former user Account Deleted

"It's also a potential performance hit on some platforms due to the additional memory
loads/stores if the operands are pass-by-reference."

Yes, but the AMO itself always imposes a heavyweight instruction (or more than one),
and often even a network traversal - so assuming the operand is in a stack variable,
it's almost guaranteed to be in cache and the cost of loading it should be lost in the
noise. A good optimizer that understands the AMO interface could even put the operand
in a register and use it directly for the AMO. 

However, we SHOULD add const qualifiers to the operands to help analysis.

Reported by danbonachea on 2012-10-11 18:41:19

2012-10-11T18:41:19+00:00

Former user Account Deleted

Hi Nick - thanks for the new draft, it's looking much improved. Here are some comments
on the new stuff:

"Added upc_amolock_t for the macros UPC_AMO_LOCK_FREE and UPC_AMO_NOT_LOCK_FREE.  I'm
not crazy about the name, though."

I'm also not crazy about those exact names. C11 defines the term "lock-free atomic"
to mean something rather stronger than we want to imply - specifically that they are
safe for use in signal handlers (which I'm pretty sure we don't want, especially in
distributed implementations of UPC).

Also, I don't think it's consistent with C99 library design philosophy to introduce
a whole new type (upc_amolock_t) to represent the boolean return value of a single
function. I propose we delete this type and change the query function to something
like:

int upc_amo_isfast(upc_type_t type, upc_op_t ops, shared void *addr);

\np The {\tt upc\_amo\_isfast} function queries the implementation to determine the
expected performance of performing a {\tt upc\_amo\_relaxed} call on {\tt addr}, using
a domain allocated with the arguments {\tt type} and {\tt ops}. The call returns non-zero
if the performance is expected to be comparable to the fastest expected performance
of {\tt upc\_amo\_relaxed} for any combination of {\tt addr}, {\tt type} and {\tt ops}.
Otherwise the function returns zero.~\footnote{
This function allows the implementation to report which combinations of type, ops and
alignment are best supported (eg using hardware atomic instructions). Some implementations
may also return zero when {\tt upc\_threadof(addr)} is not equal to the calling thread, to
indicate the additional cost of remote access.
}

Let's also move this query function to the last section of the library spec, to eliminate
forward references and because it's an "optional" call that many users can safely ignore.

"Added upc_amodomain_free() and upc_all_amodomain_free().  Did we want both?"

I believe the consensus was to only provide collective allocators and deallocators.
Non-collective ones can easily be added in a future version if someone can make a convincing
argument that they are important, but let's drop those for now.

7.4: The naming of the feature macro and library header should match for usability.
Since we use the amo abbreviation in function names, let's just change __UPC_ATOMIC__
 to __UPC_AMO__

7.4.2: The new op table is a definite improvement, but needs a few tweaks. UPC_PTS
needs to also support UPC_GET and UPC_SET, so let's pull those out of "Numeric Ops"
and put them in a new category called "Accessors" that also includes UPC_CSWAP.

7.4.2: Add a paragraph that says something like:
The UPC_GET, UPC_SET and UPC_CSWAP value macros are defined in <upc_amo.h>. All other
UPC_* value macros mentioned in this subsection are defined by <upc_types.h> (see \upcopsection
and \upctypesection).

7.4.4.2: The "mode" parameter should be called "hints" to match the text.

7.4.4.2: The direct references to upc_op_t in upc_types.h should probably instead be
a reference to the common requirements section (which in turn references upc_types.h).
Otherwise the text gives the impression that the function prohibits the library-extension
values (UPC_GET, UPC_SET and UPC_CSWAP).

7.4.4.2 para 8:  "UPC_INT64" in the text should be "int64_t"

7.4.4.5: add const qualifiers to operand1 and operand2. We should probably also add
"restrict" to all the pointers (except domain), to indicate none are permitted to alias.

7.4.4.5: Something needs to be said about the behavior of UPC_CSWAP on a NaN floating
point value. Perhaps just a footnote on the definition of CSWAP saying that behavior
is undefined if *target or *operand1 is a NaN value.

7.4.4.5 p4: I think we need to be a bit more explicit about the memory model behavior
of AMOs. Specifically, we need to say that upc_amo_strict constitutes a strict read
followed by a strict write, issued by the calling thread. Similarly for relaxed. Exceptions
to this are UPC_GET (which is just a read) and UPC_SET (which is just a write). Here I'm
assuming UPC_CSWAP should be considered an unconditional write (as currently specified),
even though sometimes it writes the old value. I'm also wondering if we need to formalize
what it means for amo accesses to be "atomic" with respect to other accesses using the
same domain, or whether that's sufficiently obvious from the informal description.

7.4.4.5: Let's add a paragraph stating that arguments operand1 and operand2 shall be
a null pointer value for operations that don't use them (which allows implementations
to issue an error). I think it should also be an error to call UPC_GET with a null
fetch pointer, because that's just silly and probably represents a programming error.

7.4.4.5: As currently written, UPC_SET with a non-null fetch pointer has the same effect
as an unconditional atomic swap. Is that what we want? If so it should probably be
noted as it is not immediately obvious.

Reported by danbonachea on 2012-10-11 21:24:23

2012-10-11T21:24:23+00:00

Former user Account Deleted

I would like to enter a vote from the user community for including UPC_INC and UPC_DEC.
 Maybe it doesn't matter (ie, maybe we could wrap the call with our own macro).  But,
 it seems that one is usually doing one of two things:  doing local stuff into a variable
and then contributing that to a shared spot (this is probably the high throughput case),
or doing fetch_and_inc (this is probably the low latency case).  So with the addition
of UPC_INC and UPC_DEC we would have a clean way to handle "99%" of the usage.

Reported by prmerkey on 2012-10-15 18:57:30

2012-10-15T18:57:30+00:00

Former user Account Deleted

>So with the addition of UPC_INC and UPC_DEC we would have a clean way to handle "99%"
of the usage.

I agree that "add 1" and "subtract 1" are a very important usage case for the AMO library.
However I'm not convinced that adding syntactic sugar for those two cases provides
any benefit over just using the general UPC_ADD. It doesn't add much "shorthand" for
the programmer, because due to the current upc_amo_relaxed() API you would still have
to type a NULL for the operand parameter anyhow. Also, as you say many programmers
will hide the actual amo call inside a macro anyhow. 

There might be a motivation if some hardware supports ONLY atomic-add-one and not atomic-add-N
(or provides an atomic-add-one that is significantly faster). I don't recall encountering
any such hardware when implementing the GASNet atomics, but perhaps someone can provide
a motivating example system.

Reported by danbonachea on 2012-10-15 20:21:19

2012-10-15T20:21:19+00:00

Former user Account Deleted

Dan wrote:
> There might be a motivation if some hardware supports ONLY atomic-add-one and not
> atomic-add-N (or provides an atomic-add-one that is significantly faster). I don't
> recall encountering any such hardware when implementing the GASNet atomics, but
> perhaps someone can provide a motivating example system.

While not HPC-relevant in my opinion, IA64 is one platform which has a single-instruction
atomic inc/dec (actually can add or subtract certain small powers of 2 like 1, 2, 4
or 16 if I recall) but requires CAS (with retry) to add arbitrary values.
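
For readers unfamiliar with the "CAS (with retry)" approach, a minimal C11 sketch of an atomic add of an arbitrary value built only from compare-and-swap follows. This is generic C11, not IA64-specific code and not taken from any UPC runtime; the function name `add_n_via_cas` is made up for illustration.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomic fetch-and-add of an arbitrary n using only compare-and-swap.
   Returns the value observed immediately before the successful update. */
static int64_t add_n_via_cas(_Atomic int64_t *target, int64_t n)
{
    int64_t old = atomic_load(target);
    /* On failure, atomic_compare_exchange_weak stores the current value of
       *target into old, so each retry recomputes old + n from fresh data. */
    while (!atomic_compare_exchange_weak(target, &old, old + n)) {
        /* another update intervened (or a spurious failure); retry */
    }
    return old;
}
```

Under contention the loop may run several times, which is the cost difference relative to a single-instruction increment.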

I am "neutral" on the issue of including INC and DEC operations.

Reported by phhargrove@lbl.gov on 2012-10-15 20:30:11

2012-10-15T20:30:11+00:00

Former user Account Deleted

On Blue Gene/Q, atomic inc and dec can be done in hardware, whereas add and sub require
an active-message.  If a user only wants inc and dec, then they can get them in hardware,
which is obviously a huge performance improvement in the case where asynchronous agency
is disabled, and still a measurable benefit even when asynchronous agency is enabled.

One might argue, "but this is a special case because Blue Gene/Q is <pejorative adjective>."
 However, the reason that BGQ can do inc and dec but not add and sub has to do with
what certain hardware components can do without extensive redesign, and thus I believe
that it is reasonable to expect other hardware may behave this way as well, particularly
if one has 64-bit atomic words.

Reported by jeff.science on 2012-10-15 21:41:10

2012-10-15T21:41:10+00:00

Former user Account Deleted

Based on Jeff's input, I am leaning slightly toward adding INC and DEC operations.

The existence of even just 1 HPC platform w/ a non-trivial difference between add-1
and add-N makes this worth consideration.  Since the API is passing the operand by-reference
I don't think that it is reasonable, in general, to expect the compiler to be able
to recognize unity as a special case.

Reported by phhargrove@lbl.gov on 2012-10-15 22:14:24

2012-10-15T22:14:24+00:00

Former user Account Deleted

Reference code updated to Nick's current spec (revision 2.1). SVN rev = 170.

Reported by ga10502 on 2012-10-15 23:40:49

2012-10-15T23:40:49+00:00

Former user Account Deleted

"The existence of even just 1 HPC platform w/ a non-trivial difference between add-1
and add-N makes this worth consideration.  Since the API is passing the operand by-reference
I don't think that it is reasonable, in general, to expect the compiler to be able
to recognize unity as a special case."

I agree that one HPC platform is sufficient justification, but I'm not convinced this
example quite qualifies, so I'd like to hear more about the technical issue. It sounds
like we're talking about the difference between an active message, and some other "faster"
network message? What ballpark overall latency are we talking for each option?

It's important to note that while the optimizer might need to be somewhat smart to
infer *operand == 1, detecting this special case does NOT require a smart runtime system.
Specifically, it only takes the cost of one very fast branch for the upc_amo_relaxed()
implementation to realize it's doing the special case of atomic-add-one instead of
atomic-add-N, and use a faster code path where available. I suspect the cost of this
branch would be completely lost in the noise for anything involving network communication
on any large-scale system. 
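
A rough C11 sketch of the runtime branch I have in mind is below. Both `hw_fetch_inc` and `general_fetch_add` are hypothetical stand-ins (here they are both ordinary C11 atomics), and the sketch assumes the fast increment is atomic with respect to the general add path, which is exactly the property that would need to be confirmed on a real platform.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for a platform's fast atomic increment (hypothetical). */
static int64_t hw_fetch_inc(_Atomic int64_t *target)
{
    return atomic_fetch_add(target, 1);
}

/* Stand-in for the general, slower atomic-add-N path (hypothetical). */
static int64_t general_fetch_add(_Atomic int64_t *target, int64_t n)
{
    return atomic_fetch_add(target, n);
}

/* One branch on the by-reference operand selects the code path.
   took_fast_path exists only so the sketch can be observed in a test. */
static int64_t sketch_amo_add(_Atomic int64_t *target, const int64_t *operand,
                              bool *took_fast_path)
{
    if (*operand == 1) {
        *took_fast_path = true;
        return hw_fetch_inc(target);
    }
    *took_fast_path = false;
    return general_fetch_add(target, *operand);
}
```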

The cost of that branch might be more noticeable on a small-scale, cache-coherent shared-memory
system where even "remote" AMOs might have a cost under a dozen cycles. However our
current API isn't really designed to streamline that regime anyhow without heavy optimization
of the library call, since we pass so many dynamic parameters that affect the implementation.

I'm more interested to know if there are examples of systems that can do atomic-add-1
in a "lock-free" instruction, but require locks to implement atomic-add-N. This would
provide a stronger motivation for INC/DEC, because on such a system the runtime branch
trick would not suffice (ie the implementation cannot prove that all conflicting updates
are atomic-add-1, so it has to ignore the hardware support and use locks to ensure
atomicity). On such a system, if the domain allocator specified INC/DEC but not ADD,
the implementation could use the lock-free hardware instruction, but otherwise would
have to use locks. Paul's IA64 example is close, but that system would probably use
CAS to implement atomic-add-N without locks and therefore would not suffer this problem.

Reported by danbonachea on 2012-10-15 23:53:10

2012-10-15T23:53:10+00:00

Former user Account Deleted

In any case, the INC/DEC feature also seems like one that can easily be added at any
point. We could even resolve to leave it out for the "optional library" 1.3 release,
then based on implementation/usage experience add it for the 1.4 "required library"
release. It's certainly easier to add features than to take them away.

I think it's more pressing to resolve issues with the general framework (eg those raised
in comment 120), which would be more difficult to change later without breaking backwards
compatibility.

Reported by danbonachea on 2012-10-16 00:13:12

2012-10-16T00:13:12+00:00

Former user Account Deleted

"I agree that one HPC platform is sufficient justification, but I'm not convinced this
example quite qualifies, so I'd like to hear more about the technical issue. It sounds
like we're talking about the difference between an active message, and some other "faster"
network message? What ballpark overall latency are we talking for each option?"

INC and DEC are actually modifications to load/store instructions, so they are literally
just a Put with a bit mask on the address to induce the store modification.

I confess to not having tested this feature myself yet but the above is based upon
detailed conversations with IBM over almost two years, so I have some confidence in
their accuracy.  The local version of these operations is very much in use in many
parts of BG/Q system software so we know that it works.  What is untested is the remote
version, wherein Put/Get implement load/store.

If necessary, I'll write some tests for this, perhaps with IBM's help, but it's not
high on priority list this week.

Reported by jeff.science on 2012-10-16 01:23:41

2012-10-16T01:23:41+00:00

Former user Account Deleted

"The local version of these operations is very much in use in many parts of BG/Q system
software so we know that it works.  What is untested is the remote version, wherein
Put/Get implement load/store... If necessary, I'll write some tests for this"

Regardless of the absolute performance of this Put/Get remote increment instruction,
in general it still has to touch the network, which means the cost of a highly-predictable
runtime branch to differentiate atomic-add-one from atomic-add-N should still be lost
in the noise (right?). 

The key question to answer is whether these special INC/DEC instructions are atomic
with respect to the "best" way to implement atomic-add-N (whatever that might be..
active message?).

Reported by danbonachea on 2012-10-16 02:29:25

2012-10-16T02:29:25+00:00

Former user Account Deleted

"Regardless of the absolute performance of this Put/Get remote increment instruction,
in general it still has to touch the network, which means the cost of a highly-predictable
runtime branch to differentiate atomic-add-one from atomic-add-N should still be lost
in the noise (right?)."

Put/Get are going to be single packet RDMA.  Arbitrary increment via Send (active-message)
is going to have to wait until the target polls or a comm thread can be scheduled.
 The performance details here are extremely dependent on the implementation of the
runtime and probably not suitable for this forum.  However, if you want Paul and I
to go into vivid detail on how we use PAMI on BGQ, I suppose we can do that.  I know
Paul is less concerned about BGQ relative to PERCS though, so perhaps he doesn't want
to take a strong position here.

My overarching position is that hardware is better than software, so doing AMOs with
RDMA is superior to AMs.

"The key question to answer is whether these special INC/DEC instructions are atomic
with respect to the "best" way to implement atomic-add-N (whatever that might be..
active message?)."

I would do lwarx+stwcx inside of the AM handler so I imagine this would be atomic w.r.t.
inc/dec+load/store, but I'm going to be noncommittal until I either test this myself
or have a long conversation with IBM about the atomicity of the MU vs. the CPU.

Reported by jeff.science on 2012-10-16 02:48:40

2012-10-16T02:48:40+00:00

Former user Account Deleted

"My overarching position is that hardware is better than software, so doing AMOs with
RDMA is superior to AMs."

No doubt - although I think you're missing part of my point. Specifically, the implementation
of UPC_ADD for BG/Q should already be able to use the fast hardware inc/dec instructions
when the user is adding one, by simply testing for the case where the user passed *operand
== 1. This optimization should probably be performed for UPC_ADD regardless of whether
we expose an explicit UPC_INCR in the public API (which is the current question under
debate). Provided the special inc/dec instructions are atomic wrt the lwarx+stwcx used
to implement atomic-add-N, then it shouldn't matter to this machine whether
we expose UPC_INCR in the AMO library or not - the implementation will still do "the
fast hardware thing" when the user adds 1 (regardless of how he expresses it), and
"the slower but still correct thing" when he adds a non-unit value.

Reported by danbonachea on 2012-10-16 03:06:44

2012-10-16T03:06:44+00:00

Former user Account Deleted

Re: comment 132

At this point, I need to talk to IBM and/or test these things myself.  I am making
too many assumptions to trust that the fastpath option you describe will work as expected.

I should reread the latest AMO proposal, but if there's still that option (which I
don't like) to query for what is "in hardware" then your "optimize for operand=1" is
not exposed to the user.  If that query is gone, then that's great, because - as I
said before - I don't think it makes a lot of sense.

Reported by jeff.science on 2012-10-16 03:14:53

2012-10-16T03:14:53+00:00

Former user Account Deleted

I did a bit of fact checking with Phil Heidelberger of IBM Research. He is telling me:
remote atomic add (arbitrary integers) packets can be dispatched in the 0.5 usec range,
remote test+add can be done in twice that time (obviously, return time is involved),
anything like a compare-and-swap requires active messages - and you are looking at
2-3 usec for that. 

I don't think he said anything about INC and DEC being privileged. So Jeff, if you
have sources that tell you otherwise, let's hear them. ... I don't have /Q access so
I can't test myself.

Reported by ga10502 on 2012-10-16 20:25:17

2012-10-16T20:25:17+00:00

Former user Account Deleted

Obviously, Phil knows what he's talking about, but I think he's referring to what one
can do in hardware and/or at the SPI level.  Rochester said the SPI can do arbitrary
fixed integer increment for some finite set of these (maybe restricted to integers
that are the sum of <8 powers of 2 or something weird like that), but that they cannot
be exposed at the PAMI level, meaning no sane implementation of UPC can use them. 
Neither GASNet nor XLUPC drops to the SPI on BGQ; rather, both use PAMI (which probably
has something to do with PERCS being the preferred platform).

The good news in the theoretical case where one does UPC on BGQ using SPI is that the
compiler can deal with the weirdness of the packing of the op codes into the packet,
whereas a library runtime that uses a higher-level interface cannot.

Anyways, I still need to do more investigating.  It would be nice to have another example
where INC/DEC are cheaper than ADD/SUB so that the whole issue does not depend on our
(my) understanding of BGQ.

Reported by jeff.science on 2012-10-16 20:32:36

2012-10-16T20:32:36+00:00

Former user Account Deleted

Posting this, now that the Google Code site is no longer read-only.

I've incorporated most of Dan's proposed changes in Issue 7, Comment 120 in the attached
proposal, with the following notes/comments/exceptions.

> 7.4.4.5: Something needs to be said about the behavior of UPC_CSWAP on a NaN floating
point value. Perhaps just a footnote on the definition of CSWAP saying that behavior
is undefined if *target or *operand1 is a NaN value.

I am by no means familiar with the intricacies of NaNs or IEEE 754, but from what I've
read, it seems that *target == *operand1 should evaluate as False (and thus, not perform
the swap), though it might raise a floating-point exception.  If the latter is the
case, then I guess we may have to say the behavior is undefined, but do our arithmetic operators
not suffer from the same issues with NaN?

> 7.4.4.5 p4: I think we need to be a bit more explicit about the memory model behavior
of AMOs.  Specifically, we need to say that upc_amo_strict constitutes a strict read
followed by a strict write, issued by the calling thread. Similarly for relaxed.  [...]

I am fine with this for strict AMOs, however, the "similarly for relaxed" AMOs doesn't
seem quite so clear.  It's not an issue of per-thread reordering, but restricting the
interleaving of accesses in the total ordering of accesses on that address.  So we'd
have to add some extra restriction and saying that the relaxed-read/relaxed-write pair
happen atomically just seems a little hand-wavey, considering that we have a formal
memory model.  Does the memory model permit an atomically-coupled relaxed-read/relaxed-write
operation?

Beyond Dan's comments, it seems like the UPC_INC/DEC question is still open.  I think
that I favor it as a convenience item, mostly because it feels silly to declare "const
int one = 1;" just to do an increment, even though it is just one line (per type).

Also, in an off-list comment, George asked about why I had (and removed) "UPC_AMO_HINT_DEFAULT
== 0" (and just made it and later used 0).  In my mind, it makes sense to have macros
for all the acceptable values, but since many may just use the default mode, I like
the clean-ness of being able to just pass 0.  So, I put it back in in the hopes that
there will be some comments on it.

Reported by nspark.work on 2012-10-19 17:14:57

<hr> * Attachment: upc-lib-atomic-ops-spec-draft.pdf

2012-10-19T17:14:57+00:00

Former user Account Deleted

Thanks for the new draft - I'll have more detailed comments after I've reviewed it,
but responding to your high-level points:

> NaNs ... it seems that *target == *operand1 should evaluate as False (and thus, not
> perform the swap), though it might raise a floating-point exception.  If the latter
> is the case, then I guess we may have to say the behavior is undefined, but do our
> arithmetic operators not suffer from the same issues with NaN?

I'm interested to hear from users whether they think it makes sense to perform FP atomics
involving NaN values. I suspect this is a "you should never do that" scenario, in which
case undefined behavior may be acceptable for any AMO's on NaNs. My main concern is
to ensure that we don't specify a semantic for handling of NaN's that could end up
slowing down the common case FP atomics. In some cases we can rely on NaNs to be handled
automatically by IEEE-754 compliant hardware, but in other cases requiring well-defined
behavior for NaN's might force implementations to insert additional software checks
in the "fast path".

In the case of arithmetic operations, a floating-point CPU unit already has to process
the operands, so for example an atomic-floating-add should automatically "do the right
thing" for a quiet NaN. However for a signalling NaN it's entirely likely the FP unit
may be "remote" to the thread invoking the AMO, so generating a signal on the caller
would likely place a check-for-signalling-nan directly in the common-case FP code path.
The particular case I mentioned, UPC_CSWAP, is "special" because hardware CAS operations
are generally bit-oriented and simply compare-and-swap a given number of bytes. If
you allow NaN's to have undefined behavior, the FP CSWAP operation can internally be
performed by casting the bits to an appropriately-sized integral type and performing
an architectural "type-less" CAS operation, so the entire FP CSWAP on a locally-addressable
location can often be a single atomic instruction. However if NaN's are required to
have well-defined behavior, the implementation would need to insert specific checks
for NaN that would impact the performance of the common-case FP CSWAP (because the
bitwise representation for NaN is not unique, and also does not automatically compare
unequal via bitwise comparison). 
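
A small C11 sketch of the "type-less" approach, for illustration only: the float is reinterpreted as a 32-bit integer and compared bit-for-bit. The helper names `float_bits` and `float_cswap_bitwise` are invented here and are not part of any proposed API; the sketch assumes `float` is 32 bits wide.

```c
#include <math.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a float as its 32-bit pattern (assumes 32-bit float). */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Compare-and-swap on a float stored as raw bits. The comparison is
   bitwise, not floating-point ==. Returns true if the swap happened. */
static bool float_cswap_bitwise(_Atomic uint32_t *target,
                                float expected, float desired)
{
    uint32_t expected_bits = float_bits(expected);
    return atomic_compare_exchange_strong(target, &expected_bits,
                                          float_bits(desired));
}
```

With this implementation a NaN whose bit pattern matches the stored one does swap, even though `nan == nan` is false under IEEE 754 comparison. Avoiding that would require an explicit NaN check before the CAS, which is the extra fast-path cost described above.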

> >  we need to say that upc_amo_strict constitutes a strict read followed by a strict
write, issued by the calling thread. Similarly for relaxed. 
> I am fine with this for strict AMOs, however, the "similarly for relaxed" AMOs doesn't
seem quite so clear.  

The first point I was raising is that the current text defines behavior as a "strict shared
access" and "relaxed shared access", neither of which is defined anywhere. The memory model
is constructed in terms of reads and writes of each strict/relaxed/local flavor, and
does not formally use the term "access" anywhere (except when referring to a mathematical
set which is a union of reads and writes). Aside from GET and SET, each AMO operation
truly includes both a read and a write operation. The two operations need to be atomically
"coupled" somehow, but that doesn't change the fact the AMO is both consuming data
(read) and producing data (write). 

> It's not an issue of per-thread reordering, but restricting the interleaving of accesses
in the total ordering of accesses on that address.  

Yes exactly - We need to specify the memory semantics of the AMO to ensure that a series
of conflicting AMO's issued by one thread (even relaxed ones) don't get "optimized"
in ways that would break the behavioral properties we wish to ensure.

Consider this pseudocode: (where relaxed_CAS(addr,old,new) == UPC_CSWAP on addr with
old and new values that returns non-zero on "success")

shared int flag = 0
result1 = relaxed_CAS(&flag, 0, 1)
result2 = relaxed_CAS(&flag, 1, 2)
result3 = relaxed_CAS(&flag, 2, 3)
assert(result1 && result2 && result3);

If we assume no other thread is touching flag, I THINK we want to guarantee this code
always passes the assertion. Specifically, we want the on-thread read/write dependencies
constrained to be resolved using program order on the issuing thread, so that the code
above behaves as if you'd instead written something like this using relaxed shared
reads and relaxed shared writes (which IS guaranteed to work):

upc_lock();
result1 = (flag == 0);
flag = 1;
result2 = (flag == 1);
flag = 2;
result3 = (flag == 2);
flag = 3;
upc_unlock();
assert(result1 && result2 && result3);

We'd want to prohibit the compiler/runtime from reordering those conflicting CAS operations
issued by a single thread in ways that would cause the assertion to fail, even though
they are "relaxed". However under the memory model, relaxed shared reads (even conflicting
ones) are allowed to "pass each other", so an AMO cannot be specified solely as a relaxed
shared read, or it would allow optimizations that broke the property checked by the
assertion. There are similar but more complicated reasons why an AMO cannot be solely
a relaxed shared write.
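
As a point of comparison only, the same chain can be written with C11 relaxed atomics. This is a sketch in plain C, not UPC, and the helper names are hypothetical. In the C11 model the three calls are guaranteed to succeed when no other thread touches `flag`, because every atomic object has a single modification order and a read-modify-write always reads the latest value in it. It says nothing about what the UPC memory model guarantees; it only shows the property we would like the AMO spec to provide.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Relaxed compare-and-swap; returns true on success. */
static bool relaxed_cas(atomic_int *addr, int old_val, int new_val)
{
    return atomic_compare_exchange_strong_explicit(
        addr, &old_val, new_val,
        memory_order_relaxed, memory_order_relaxed);
}

/* Single-threaded chain of conflicting relaxed CAS operations. */
static bool cas_chain_succeeds(void)
{
    atomic_int flag = 0;
    bool result1 = relaxed_cas(&flag, 0, 1);
    bool result2 = relaxed_cas(&flag, 1, 2);
    bool result3 = relaxed_cas(&flag, 2, 3);
    return result1 && result2 && result3 && atomic_load(&flag) == 3;
}
```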

> So we'd have to add some extra restriction and saying that the relaxed-read/relaxed-write
pair happen atomically 
> just seems a little hand-wavey, considering that we have a formal memory model. 
Does the memory model permit an 
> atomically-coupled relaxed-read/relaxed-write operation?

No there is nothing in the current framework to provide this for us. Calling the behavior
an "access" just hides the problem behind an undefined term, and suffers from the problems
above. I was hoping it would be sufficient to state something like "upc_amo_relaxed
atomically performs a relaxed shared read of addr followed by a relaxed shared write
of addr", and further define the term "atomically" to require that "the read and write
accesses comprising one AMO cannot appear (to any thread) to have been interleaved
(or word-torn) with the read/write pair of a conflicting AMO to the same location".
This is still admittedly a bit hand-wavy, but seems fairly clear. Defining the atomicity
in the full-blown formalism is probably possible with a lot of legwork, but I don't
think it would improve clarity.

Reported by danbonachea on 2012-10-19 19:24:41

2012-10-19T19:24:41+00:00

Former user Account Deleted

> I'm interested to hear from users whether they think it makes sense to perform FP
atomics involving NaN values.

I would be surprised if they cared much at all about floating-point AMO behavior, provided
how we specify FP AMO behavior doesn't adversely affect integer AMO behavior.

> This is still admittedly a bit hand-wavy, but seems fairly clear. Defining the atomicity
in the full-blown formalism is probably possible with a lot of legwork, but I don't
think it would improve clarity.

I think that's reasonable.  The formalism might not be necessary because we guarantee
atomicity only through this library (and a single domain), so we're putting a restriction
on the library, not the memory model.  If you do anything outside the AMO library (and
the specific domain), you're operating outside of our realm of guarantees.  I feel
more comfortable with this explanation now, so I'll put something in the proposal soon.

Reported by nspark.work on 2012-10-19 20:27:46

2012-10-19T20:27:46+00:00

Former user Account Deleted

Nick - It appears you never committed your LaTeX sources from the latest AMO draft,
dated Oct 19th.

Could you please immediately email me whatever you have (or commit them), so that we
can proceed from the latest version?

Reported by danbonachea on 2012-10-23 21:13:43

2012-10-23T21:13:43+00:00

Former user Account Deleted

I don't yet have the document sources, so I'll just list some outstanding issues with
the AMO spec, for discussion in the telecon. Once I get the sources from Nick I'll
apply a few editorial fixes and also whatever resolutions are reached in the telecon.

* UPC_MULT: I just realized the AMO spec omits UPC_MULT as a numeric operation. I'm
not sure if this was accidental or intentional. Do we want to include fetch-and-multiply
in the list of available numeric AMO ops? I don't see any reason to prohibit it, or
any way to efficiently get the same effect in general.

* Barrier separation: We require that any given memory location accessed by an AMO
is only accessed by a single AMO domain at a time, and is isolated from regular reads/writes
or access via independent AMO domains. This is an important semantic requirement to
ensure AMOs can be correctly implemented in software - the key requirement is to ensure
that only accesses from the AMO domain are "racing", and that there are no concurrent
outside accesses racing. The current spec 7.4.2 requires this "isolation" to be on
the granularity of "a given synchronization phase", which in UPC parlance requires
a UPC barrier to separate any external data races on those locations. This is certainly
sufficient, but is not strictly necessary. Should we consider relaxing this to allow
the use of other forms of synchronization (e.g., user-constructed sync with strict operations)
to ensure the lack of data races?

* NaNs: as mentioned in comment 137, we need to say something about the behavior of
NaNs in FP AMOs. Something needs to be said about the behavior of both signalling NaNs
and quiet NaNs in each category of operation. My opinion:
   1. FP AMOs where any operand is a signalling NaN should have undefined behavior.
Providing reliable signalling is likely to be impossible in many implementations, and
we want to allow for the possibility of hardware where a FP signal from a signalling
NaN can be delivered to a thread that might not correspond to the thread invoking
the AMO library. 
   2. Specify that a UPC_CSWAP operation where either operand is a quiet NaN has undefined
behavior (or possibly implementation-specified behavior). See comment 137.
   3. I think we can safely require well-defined behavior for numeric operations on
quiet NaNs (and get/set), unless someone sees a problem with that. These cases should
automatically "just work" on any system with IEEE-754 compliant FP units, unless someone
is planning to do something freaky like software-emulated FP arithmetic on a NIC.
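
One plausible reason quiet NaNs are awkward for UPC_CSWAP specifically (this is my reading; comment 137 is not reproduced here) is that a value comparison and a bitwise comparison disagree on them. The C sketch below assumes IEEE-754 doubles; the helper names are hypothetical.

```c
#include <assert.h>   /* used by the checks in the note below */
#include <math.h>
#include <string.h>

/* Value comparison, as the C == operator does it:
 * a NaN never compares equal to anything, including itself. */
int equal_by_value(double a, double b)
{
    return a == b;
}

/* Bitwise comparison, as an integer compare-and-swap applied to the
 * same bytes would do it: identical bit patterns always match. */
int equal_by_bits(double a, double b)
{
    return memcmp(&a, &b, sizeof a) == 0;
}
```

With `double q = NAN;`, `equal_by_value(q, q)` is false while `equal_by_bits(q, q)` is true, and the pair `0.0` / `-0.0` disagrees in the opposite direction. An implementation that maps an FP compare-swap onto an integer hardware CAS would therefore give different answers from one that compares with `==`.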

Reported by danbonachea on 2012-10-24 11:01:02

2012-10-24T11:01:02+00:00

Former user Account Deleted

AMO open issues were discussed at length in the 10/24 telecon. High-level decisions
reached:

1. We should be permissive and generally include as many ops as seem useful to users
and/or enable potential optimizations on some hardware. Therefore, we resolved to add
the following upc_op_t values as supported numeric AMOs:
    UPC_INC   atomic fetch-and-increment
    UPC_DEC   atomic fetch-and-decrement
    UPC_SUB   atomic fetch-and-subtract
    UPC_MULT  atomic fetch-and-multiply

2. We decided to adopt the semantics proposed in comment 140 for NaNs in FP AMOs.

3. In all other cases AMOs should have the same semantics as the corresponding C99
operator, in particular well-defined 2's complement integer overflow behavior.

4. We will continue to require language-level barriers for AMO isolation. This requirement
may be relaxed in a future revision of the spec.

I'll work on integrating these resolutions into the proposal and distribute a new draft
in the next few days.
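
For context on why a dedicated UPC_MULT is worth having: without it, a user would presumably fall back on a compare-and-swap retry loop. The sketch below shows that fallback with C11 atomics as an analogy (not UPC; `fetch_mul` is a hypothetical name). It also illustrates resolution 3, since the stored value is whatever the built-in `*` operator computes, including unsigned wraparound.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical helper: emulate fetch-and-multiply with a CAS retry loop.
 * Returns the value that was in *target immediately before the multiply. */
unsigned int fetch_mul(_Atomic unsigned int *target, unsigned int operand)
{
    assert(target != NULL);
    unsigned int old = atomic_load_explicit(target, memory_order_relaxed);
    /* On failure the compare-exchange refreshes 'old' with the current
     * value, so the product is recomputed on every retry. */
    while (!atomic_compare_exchange_weak_explicit(target, &old, old * operand,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed)) {
        /* retry */
    }
    return old;
}
```

Under contention every failed iteration repeats the read and the multiply, which is the inefficiency a native fetch-and-multiply could avoid on hardware that supports one.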

Reported by danbonachea on 2012-10-27 03:41:24 - Labels added: Priority-Critical, Consensus-High - Labels removed: Priority-Medium, Consensus-Low

2012-10-27T03:41:24+00:00

Former user Account Deleted

Attached is the latest draft proposal, also in SVN as r178. It includes resolutions
for all outstanding issues, as discussed in comments 137-141.

I think we're close to a final version, so I'm making the official announcement and
moving this to PendingApproval. Mailed 10/28/2012.

Reported by danbonachea on 2012-10-29 04:38:09 - Status changed: PendingApproval

<hr> * Attachment: upc-lib-atomic-ops-draft2.3.pdf

2012-10-29T04:38:09+00:00

Former user Account Deleted

The name upc_amo.h seems a bit obscure.

Suggestion: name the header upc_atomic.h

Reported by gary.funck on 2012-10-29 17:01:41

2012-10-29T17:01:41+00:00

Former user Account Deleted

> Suggestion: name the header upc_atomic.h

This seems like a good suggestion - I'm in favor of this change provided nobody else
objects.

Reported by danbonachea on 2012-10-29 23:23:00

2012-10-29T23:23:00+00:00

Former user Account Deleted

> Suggestion: name the header upc_atomic.h

On second thought, I think it's important to be consistent in our naming. So if we
don't like the "amo" abbreviation and are changing to "atomic" in the header, we
should also rename __UPC_AMO__ -> __UPC_ATOMIC__, and probably also the types and function
names that use "amo":
  upc_amodomain_t -> upc_atomicdomain_t
  upc_all_amodomain_alloc() -> upc_all_atomicdomain_alloc()
  upc_amo_relaxed() -> upc_atomic_relaxed()
  UPC_AMO_HINT_LATENCY ->  UPC_ATOMIC_HINT_LATENCY
  etc.
This is not a difficult change to perform (and there will never be a better time for
such a change), but it's a bit broader than first appearances. 

How do others feel about the naming scheme?

Reported by danbonachea on 2012-10-30 12:46:12

2012-10-30T12:46:12+00:00

Former user Account Deleted

Re: "In all other cases AMOs should have the same semantics as the corresponding C99
operator, in particular well-defined 2's complement integer overflow behavior."

Are we trying to require that AMOs behave like the 2's complement case in C99?  Because
that is not required, and is only one of three options described by the C99 standard.  It seems
dangerous to bind UPC to one of those three.

Do I misunderstand what you are saying here?

Reported by jeff.science on 2012-10-30 18:57:26

2012-10-30T18:57:26+00:00

Former user Account Deleted

"Are we trying to require that AMOs behave like the 2's compliment case in C99?  Because
that is not required and one of three options described by the C99 standard.  It seems
dangerous to bind UPC to one of those three."

My understanding is that we'd require the AMOs to behave equivalently to non-atomic
operations in the underlying C99 implementation.  I had assumed Dan's comment in comment
141 was referring to the common implementation choice of 2's complement representation
of signed integers as an example of where this matters.

Reported by sdvormwa@cray.com on 2012-10-30 19:43:22

2012-10-30T19:43:22+00:00

Former user Account Deleted

Yeah, at the most recent conference call we decided that the only semantic difference
within a given implementation between a C op and the corresponding UPC atomic op should
be the fact that it's atomic.  Otherwise things get weird for the user because they
can't introduce atomicity without bringing along some other semantic change that they
don't want to think about when they just want to make something atomic.

Reported by johnson.troy.a on 2012-10-30 20:05:27

2012-10-30T20:05:27+00:00

Former user Account Deleted

Steve and Troy are correct. We are trying to require AMO numeric operators to compute
the same result as the non-atomic C99 operator in the same implementation. Here is
the actual proposal text that addresses this:

\np In all other cases, the value computed by {\tt op} and stored in {\tt *target}
    shall be equal to the value that would have been computed by passing the operands
    to the corresponding built-in language operator. In particular, this requires that
    overflows, underflows and quiet NaN values are handled as specified in [ISO/IEC00].

Reported by danbonachea on 2012-10-31 04:44:43

2012-10-31T04:44:43+00:00

Former user Account Deleted

Attached is an updated draft proposal.

The most notable change is the global renaming of the acronym "amo" to "atomic", as
suggested by Gary and approved in the last telecon to enhance code readability.

Reported by danbonachea on 2012-11-03 03:55:40

<hr> * Attachment: upc-lib-atomic-ops-draft2.4.pdf

2012-11-03T03:55:40+00:00

Former user Account Deleted

This PendingApproval change appeared in the SC12 Draft 3 release.
It was officially ratified at the 11/29 telecon.

Reported by danbonachea on 2012-11-29 20:03:22 - Status changed: Ratified

2012-11-29T20:03:22+00:00

Former user Account Deleted

After implementing atomics for GUPC I noticed the following three things in the spec:

* Should UPC_CSWAP also require a not null fetch_ptr? 7.6.4.3 Paragraph 6 states that
fetch_ptr shall not be a null pointer for UPC_GET operation.  Should that also be true
for UPC_CSWAP operation as return value is the only indication if operation succeeded?

* Should UPC_CSWAP have operand1 as the new value for target? All the operations (7.6.4.3
Paragraph 4) have operand1 as the new value for atomic variable.  Except for UPC_CSWAP
where new value is in operand2.

* UPC_PTS/UPC_CSWAP operation require pointer compare to ignore phase. This requirement
prevents implementation from using the native CSWAP operations when pointer-to-shared
can fit in some integral container (e.g. uint64_t) for which CSWAP already exists.
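
To make the third point concrete, here is a sketch in plain C with a purely hypothetical packed layout (40-bit address, 12-bit thread, 12-bit phase in one uint64_t; this is not GUPC's actual representation). Two pointers-to-shared that differ only in phase must compare equal under the spec'd semantics, but a native 64-bit CSWAP compares raw bits and would treat them as different.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: [ addr:40 | thread:12 | phase:12 ] */
#define PHASE_BITS 12u
#define PHASE_MASK ((UINT64_C(1) << PHASE_BITS) - 1)

uint64_t pack_pts(uint64_t addr, uint32_t thread, uint32_t phase)
{
    assert(addr < (UINT64_C(1) << 40));
    assert(thread < (1u << 12));
    assert(phase <= PHASE_MASK);
    return (addr << 24) | ((uint64_t)thread << 12) | phase;
}

/* What a native 64-bit compare-and-swap effectively tests. */
int pts_equal_raw(uint64_t a, uint64_t b)
{
    return a == b;
}

/* What pointer-to-shared equality requires: phase is ignored. */
int pts_equal_ignoring_phase(uint64_t a, uint64_t b)
{
    return (a & ~PHASE_MASK) == (b & ~PHASE_MASK);
}
```

For `pack_pts(0x1000, 3, 0)` and `pack_pts(0x1000, 3, 5)` the raw test fails while the phase-ignoring test succeeds, so an implementation cannot simply hand the packed word to the hardware CSWAP.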

Reported by nvukicevic on 2013-05-28 21:12:10

2013-05-28T21:12:10+00:00

Former user Account Deleted

"Should UPC_CSWAP also require a not null fetch_ptr? 7.6.4.3 Paragraph 6 states that
fetch_ptr shall not be a null pointer for UPC_GET operation.  Should that also be true
for UPC_CSWAP operation as return value is the only indication if operation succeeded?"

More precisely, the return value is the only indication that the operation resulted
in a swap.  The operation still "succeeds" even if no swap occurs--the result is simply
the same as the previous value.  While I can't think of any reason to not check the
result (speculative updates perhaps?), I don't see any good reason to force users to
provide it either.  The implementation can trivially write the result to a garbage
location if necessary on a given platform.

"Should UPC_CSWAP have operand1 as the new value for target? All the operations (7.6.4.3
Paragraph 4) have operand1 as the new value for atomic variable.  Except for UPC_CSWAP
where new value is in operand2."

The Cray atomic intrinsics, gcc atomic intrinsics, and C++11 atomics have the compare
value first and the replacement value second, so the current definition at least matches
some existing practice.  I don't have a strong preference either way though.

"UPC_PTS/UPC_CSWAP operation require pointer compare to ignore phase. This requirement
prevents implementation from using the native CSWAP operations when pointer-to-shared
can fit in some integral container (e.g. uint64_t) for which CSWAP already exists."

I think we discussed this on one of the telecons and decided that the benefits of the
equality check matching the semantics of the base language outweighed the performance
penalty.  See issue 8 though. ;)

Reported by sdvormwa@cray.com on 2013-05-28 21:47:41

2013-05-28T21:47:41+00:00

Former user Account Deleted

> "Should UPC_CSWAP also require a not null fetch_ptr? 7.6.4.3 Paragraph 6 states
> that fetch_ptr shall not be a null pointer for UPC_GET operation.  Should that
> also be true for UPC_CSWAP operation as return value is the only indication if
> operation succeeded?"
>
> More precisely, the return value is the only indication that the operation
> resulted in a swap.  The operation still "succeeds" even if no swap occurs--the
> result is simply the same as the previous value.  While I can't think of any
> reason to not check the result (speculative updates perhaps?), I don't see any
> good reason to force users to provide it either.  The implementation can trivially
> write the result to a garbage location if necessary on a given platform.

I am aware of some low-level C codes that perform a CAS without checking the result.
So, there is still some use for CSWAP w/o a return.
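
A small sketch of both points, using C11 atomics as an analogy (the `cswap` wrapper and its parameter names are hypothetical and are not the UPC signature): the compare value comes first and the replacement second, and the caller may pass a null fetch_ptr when the prior value is not needed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical wrapper: compare-and-swap with an optional result pointer.
 * The call completes normally whether or not a swap took place. */
void cswap(_Atomic int *target, int *fetch_ptr, int compare, int replacement)
{
    assert(target != NULL);
    int observed = compare;
    /* The strong CAS leaves the prior value of *target in 'observed'
     * in both the swapped and the not-swapped case. */
    atomic_compare_exchange_strong(target, &observed, replacement);
    if (fetch_ptr != NULL)
        *fetch_ptr = observed;   /* discarded when the caller passes NULL */
}
```

A caller that wants to know whether the swap happened compares `*fetch_ptr` against `compare`; a caller doing a fire-and-forget update passes NULL.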

Reported by phhargrove@lbl.gov on 2013-05-28 21:58:41

2013-05-28T21:58:41+00:00
