Library: non-blocking memory copy extensions

Issue #41 new
Former user created an issue

Originally reported on Google Code with ID 41

This is to log the UPC non-blocking memory copy library extensions.

For more information, please see
https://sites.google.com/a/lbl.gov/upc-proposals/extending-the-upc-memory-copy-library-functions

Reported by yzheng@lbl.gov on 2012-05-22 23:41:42

Comments (112)

  1. Former user Account Deleted

    Reported by `phhargrove@lbl.gov` on 2012-06-01 03:43:19 - Labels added: Spec-1.3

  2. Former user Account Deleted

    Reported by `phhargrove@lbl.gov` on 2012-06-01 06:08:21 - Labels added: Milestone-Spec-1.3 - Labels removed: Spec-1.3

  3. Former user Account Deleted

    ``` Given the difference between Cray's and Berkeley's positions on the non-blocking memory copy proposal, I was hoping to restart the discussion in the hopes of having some consensus.

    Generally speaking, my (and my users') position between the two proposals is that we prefer non-blocking memory copy functions *NOT* to be independent/agnostic of a upc_fence or strict synchronization. That is, we essentially support the Cray position.

    My understanding is that the biggest motivation for the fence/strict independence is that some users may start a non-blocking copy and then call a function or library that calls a fence internally. While we recognize that this may happen and it would eliminate much of the benefit of the non-blocking copy, we feel that using a fence is inherently an expensive operation that should be used judiciously, but should (from 6.6.1.5) apply to "all shared accesses."

    I think it is somewhat at odds with UPC's philosophy to provide an independent communication channel within UPC that is still essentially called UPC, but has to be managed separately from the "traditional" shared UPC accesses.

    As a far less important issue, I prefer the "_nb/_nbi" suffix to the "_async/_asynci" suffix. ```

    Reported by `nspark.work` on 2012-06-11 15:44:54

  4. Former user Account Deleted

    ``` Let me attempt to connect this issue with the UPC collectives 2.0 issue (appropriately numbered Issue 42). There, too, we have a problem of not being able to use upc_fence to guarantee completion of operations.

    If we can formulate upc_fence and handles in a way that allows libraries to use it as an extension tool, we could deal with the (very valid) Berkeley objections and make Troy happy too.

    Of course Bill will be upset. So that's the price to pay.

    :)

    ```

    Reported by `ga10502` on 2012-06-15 15:12:52

  5. Former user Account Deleted

    ``` First, I attach the latest documents from Berkeley and Cray that may facilitate discussion and help clarify confusions.

    I think it's logical that "upc_fence" would sync all outstanding implicit non-blocking operations. But how about explicit handle operations?

    For example,

    /* foo may be only in some binary format developed by a third party */
    void foo() { upc_fence; }

    h = upc_memcpy_nb(...);
    foo();
    sync(h);

    Cray's position: It's a user error to call upc_fence (and thus foo()) before sync(h).

    Berkeley's position: upc_fence has no effect on h.

    Neither seems to be perfect. Any suggestions or comments?

    In addition, we should carefully consider and define the differences between local completion and global completion as stated in Cray's document.

    ```

    Reported by `yzheng@lbl.gov` on 2012-06-15 16:58:43
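
    To make the "local completion" versus "global completion" distinction concrete, here is a minimal sketch in the spirit of the Cray document. The split sync calls (`upc_sync_local`/`upc_sync_global`) are hypothetical names for its lsync/gsync notions, `upc_handle_t`/`upc_memput_nb` come from the proposals under discussion, and whether one handle may be synced at both levels is itself one of the open details.

    ```
    #include <string.h>
    #include <upc.h>
    /* upc_handle_t, upc_memput_nb, and the two sync calls below are
     * hypothetical, following the proposals discussed in this issue. */

    #define NBYTES 4096
    shared [NBYTES] char dst[NBYTES * THREADS];  /* one block per thread */
    char src[NBYTES];

    void put_to_neighbor(void) {
        int peer = (MYTHREAD + 1) % THREADS;
        upc_handle_t h = upc_memput_nb(&dst[(size_t)peer * NBYTES], src, NBYTES);

        upc_sync_local(h);       /* "local completion": src may be reused,
                                    but the data need not be visible at
                                    the target yet */
        memset(src, 0, NBYTES);  /* safe: the source buffer is released */

        upc_sync_global(h);      /* "global completion": the transfer is
                                    now visible to all threads */
    }
    ```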


  6. Former user Account Deleted

    ``` I understand that the community Nick represents is in favor of something more like Cray's version than Berkeley's version. While that is not my personal preference, I am willing to accept the input of the USERS as more relevant than the distastes of a lone implementer. So, let's see where this leads us...

    I don't have a problem with implementation of syncing implicit NB ops in upc_fence(). I never did, except that doing so w/o the matching behavior for explicit handles seemed a stupid half-way commitment.

    It has been the interaction of strict accesses (and therefore upc_fence()) with explicit-handle NB ops that has been my main concern (the implementation costs). In the interest of reaching consensus I will concede that strict accesses must complete NB ops. Specifically, I concede that placing non-blocking memcpy functions outside of the UPC memory model is unacceptable to the community.

    As Yili mentions in the previous comment, we are still in some disagreement on explicit-handle operations. My main concern is the one Yili expresses: Cray's current proposal that a upc_fence() is illegal between init and sync makes it "difficult" to call external code (to achieve communication-computation overlap). In fact, the current Cray proposal would require different code depending on whether the code in a called function includes ANY strict accesses.

    My hope is to "meet half-way" with something that has the most desirable properties. I think that permitting strict accesses and upc_fence(), while keeping the handle "live", permits the user to write code without the need to know if any external functions (or even their own in a large project) contain fences or strict accesses. The Cray-proposed behavior of PROHIBITING the sync after a fence or strict access seems sufficiently distasteful to me that I am willing to drop my objections to handle-tracking overheads to avoid it (lesser of two evils in my mind).

    Would the following be acceptable to Cray:

    + "strict access" in what follows implicitly includes calls to upc_fence()
    + a strict access between init and sync of an explicit-handle NB op is permitted.
    + such a strict access causes completion of all outstanding NB transfers (both implicit and explicit-handle) EXACTLY as they do any normal relaxed access (no special-case spec language required)
    + However, any handle from an operation initiated, but not yet synced, before the strict access is still "live" and therefore an explicit sync call is REQUIRED to free any associated resources
    + Note: "complete" is with respect to the memory model while "synced" is with respect to retiring the handle. The "completion" occurs no later than the "sync", but can be forced earlier with a strict access.

    One additional thought that occurs: If the user uses "#include <upc_strict.h>" or "#pragma upc strict" then ALL shared accesses between the init and sync of an NB call would become strict. This feels to me like another reason to keep handles "live" and allow the same code to work in either strict or relaxed mode.

    Also, I endorse the inclusion of "restrict" in the prototypes, which appears to have been unintentionally omitted from the Berkeley proposal. It was not our intent to support overlapping src/dest.

    NOTE: In Berkeley UPC we introduce our extensions with a bupc_* prefix rather than using the upc_* namespace. This means that if the eventual specification differs from our initial versions, user codes can continue to use the bupc_*-prefixed "legacy" versions rather than seeing their code break when they update to a compiler that implements the new spec and therefore changes the semantics of the upc_* versions. So, I would recommend that to save Cray's users from some pain we adopt the "_async" family of names to NOT collide with Cray's current implementations (which may differ from the final spec semantics). ```

    Reported by `phhargrove@lbl.gov` on 2012-06-15 23:16:54
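
    A minimal sketch of the compromise outlined in the preceding comment, assuming the proposal's explicit-handle API (`upc_nb.h`, `upc_memput_nb`, `upc_sync`, and `upc_handle_t` are names from the drafts under discussion, not settled spec): a fence executed by external code between init and sync is legal and forces memory-model completion, but the handle stays live and must still be synced.

    ```
    #include <stddef.h>
    #include <upc.h>
    #include <upc_nb.h>   /* assumed header for the proposed extensions */

    #define NBYTES 1024
    shared [NBYTES] char dst[NBYTES * THREADS];
    char src[NBYTES];

    extern void external_library_call(void);  /* may execute upc_fence internally */

    void overlap_example(void) {
        int peer = (MYTHREAD + 1) % THREADS;
        upc_handle_t h = upc_memput_nb(&dst[(size_t)peer * NBYTES], src, NBYTES);

        /* Permitted under the proposed compromise: if this call issues a
         * upc_fence, that fence completes the transfer as far as the
         * memory model is concerned... */
        external_library_call();

        /* ...but the handle is still "live", so an explicit sync is still
         * required to retire it and free implementation resources. */
        upc_sync(h);
    }
    ```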

  7. Former user Account Deleted

    ``` All our (Cray's) concerns were with regard to the memory model--specifically that NB operations be treated as relaxed operations and thus are ordered by fences. We proposed that having the sync come after a fence be "undefined" behavior (note, undefined does not mean illegal) to ensure that no strictly conformant program did this. An implementation would then be free to break the memory model in a non-conformant program, and thus could permit outstanding NBs past a fence like Berkeley currently permits. We didn't mean to make it an error to call the sync routine after a fence, merely to make it so that users couldn't rely on any particular behavior in that case and subtly discourage its use in portable applications.

    Of course, there's no need for this if fences/strict accesses are explicitly defined to complete outstanding NBs as far as the memory model is concerned. In that case we have no problems requiring the handle be explicitly sync'd even after a fence. ```

    Reported by `sdvormwa@cray.com` on 2012-06-16 00:44:20

  8. Former user Account Deleted

    ``` Excellent! It sounds like Cray and Berkeley have converged on their most significant differences. Now we need a volunteer to draft a new proposal that we can look over to make sure we agree on the little stuff too.

    I am sorry if Yili or I mis-characterized Cray's intended semantic for sync-after-fence.

    Berkeley will continue to offer our bupc_* async-memcpy extensions, which operate outside of the memory model, and will add upc_* async-memcpy functions which behave just as other relaxed accesses with respect to the memory model. ```

    Reported by `phhargrove@lbl.gov` on 2012-06-16 01:28:31

  9. Former user Account Deleted

    ``` Just a note that there was an email exchange external to this issue forum that resulted in the following conclusion:

    The LaTeX for the Cray proposal will be uploaded to the SVN repository (my task, once I figure out how). Everyone then will be able to edit that version until we have something that we can recommend as a unified non-blocking proposal.

    I also wanted to note the relationship between this issue and Issue 7 Comment #30: http://code.google.com/p/upc-specification/issues/detail?id=7#c30 Answering that question would help us to include clearer language describing the semantics of the sync functions. ```

    Reported by `johnson.troy.a` on 2012-06-19 18:49:09

  10. Former user Account Deleted

    Reported by `gary.funck` on 2012-07-03 18:07:50 - Labels added: Type-Lib-Required - Labels removed: Type-Enhancement

  11. Former user Account Deleted

    ``` I've steered clear of this discussion until now, but as the original author of both the Berkeley async proposal and much of the memory model, I'd like to provide some perspective that I believe is missing from the current discussion. I don't claim to state an official position for the Berkeley group; this is my own expert opinion.

    * Executive summary * I'm arguing that the non-blocking memcpy functions should NOT be synchronized in any way by upc_fence or other strict accesses issued between init and sync. The non-blocking library includes an explicit and required synchronization call, and the transfer should be permitted to continue during the entire "transfer interval" between the initiation and successful sync, regardless of the actions of the calling thread (or libraries it invokes) in that interval. As far as the memory model is concerned, the data transfer behaves as a set of relaxed read/write operations of unspecified size and order, which are "issued" by the calling thread at an unspecified time during the transfer interval. The ops are not affected by fences or other strict operations issued in the transfer interval, because it is explicitly unspecified whether they were issued "before" or "after" any such fences. I believe this semantic "dodge" makes it totally compatible with the memory model. Furthermore, the operations specified by the library should not imply any fencing semantics, which needlessly complicate the interface and may impose performance degradation through unwanted semantics. The transfers are relaxed operations and any required synchronization should be explicitly added by the user as strict operations or other fences around the transfer interval.

    Point 1: *The memory model already allows optimization of "blocking" memcopies*

    By B.3.2.1: "For non-collective functions in the UPC standard library (e.g. upc_mem{put, get, cpy}), any implied data accesses to shared objects behave as a set of relaxed shared reads and relaxed shared writes of unspecified size and ordering, issued by the calling thread." In English, this means as far as the memory model and compiler are concerned, the accesses implied by the existing "blocking" memcopy functions are handled as an unspecified set of relaxed accesses. Specifically, the compiler/runtime is already free to reorder them with respect to any surrounding relaxed access, subject only to the limitations of static analysis (or runtime checking) to avoid reordering conflicting writes or passing a strict operation. High-quality implementations will thus already achieve some communication overlap from calls to the existing memcopy libraries. In proposing an extension to these existing libraries, we must specify semantics that allow significantly MORE aggressive overlap to occur, leading to measurable performance gain - otherwise the complexity of the proposed extension is not justified.

    Point 2: *The primary goal of the async feature is performance*

    Async memcopies do not add any expressiveness to the language - ie any async data movement operations can already be expressed with their fully blocking counterparts. I doubt anyone would argue that explicitly async memcopies are more concise or elegant than a blocking call, nor do they improve the readability or debuggability of the UPC program. On the contrary, the programmer has chosen to sacrifice all of these features to some extent, all in order to (hopefully) reap an improvement in performance by explicitly requesting a communication overlap optimization which is either too hard (due to limitations of static analysis) or too costly (overhead of dynamic optimization) for the compiler to perform automatically. As performance is the primary and overriding goal of this library feature, great care should be taken to avoid any semantic roadblocks with the potential to artificially hinder performance of the library under realistic usage scenarios.

    Point 3: *Async calls are an annotation that explicitly suppresses the memory model*

    The whole point of the explicitly asynchronous memcopies is to provide an annotation from the user to the compiler/runtime asserting that the accesses performed by the copy do not conflict with any accesses or other operations that occur between the initiation and the sync call. The user is asking the compiler to "trust" this assertion and maximize the performance of the transfer while ignoring any potential conflicts. This obviously includes conflicting read/write accesses to the memory in question (otherwise the assertion is meaningless). I believe it ALSO should apply to any strict operations or fences that may occur (possibly in hidden callees or external libraries that defeat static analysis). It makes no sense to "suppress" the memory model for one type of conflict but preserve it for another.

    Yes, this finessing of the memory model makes these async calls harder to write, understand and debug than their blocking counterparts, but that's precisely the price the programmer is paying for a chance at improved performance. The C language has a long history of features with pointy edges that give you enough rope to hang yourself, in exchange for allowing the programmer to get "closer to the machine" and micromanage behavior where it matters. The async extensions are just the latest example of such an "advanced feature", and we should not saddle them with semantic half-measures that try to make them slightly more user-friendly at the expense of potentially sacrificing any amount of performance (which is their primary reason for existence).

    Point 4: *Async calls should not include additional fencing semantics*

    The current proposal is deeply mired in providing fencing semantics, ensuring operations are locally or globally visible in the synchronization calls. This approach couples inter-thread synchronization with data transfer, making the operations more heavyweight and simultaneously imposing MORE synchronization semantics than their blocking counterparts. For example, a upc_memput_nb which returns UPC_COMPLETE_HANDLE or is immediately followed by a gsync currently implies "global visibility", which is MORE than the guarantees on blocking upc_memput and may be considerably more costly. The proposal also introduces several concepts (eg. "half-fences" and "local visibility") which are not defined by the current memory model, and may be tricky to formally define. It appears to advocate a usage case where async transfers behave sufficiently "strict-like" that after a sync the user can issue a RELAXED flag write to synchronize other threads. This is completely backward from the UPC philosophy and best practice, which is to use relaxed operations for data movement, and strict operations to update flag variables and perform synchronization.

    In my opinion this piggybacking of fencing semantics on the library calls should be removed entirely. An important usage class of applications want to perform a large number of non-blocking operations in phases separated by user-provided barriers (3D FFT being one obvious example), and any fencing semantics on the individual operations is undesirable and only has the potential to degrade performance. These applications don't care at all about local or global completion of anything before the next barrier, and don't want to pay for the library computing it under the covers or imposing hidden fences.

    The transfers performed by the library should abstractly behave as a set of relaxed ops with respect to the memory model. There is no difference between local and global completion, because the accesses in question are entirely relaxed. They behave exactly as relaxed operations issued at an unspecified time during the transfer interval. They are not affected by fences or other strict operations issued in the transfer interval, because it is explicitly unspecified whether they were issued "before" or "after" any such fences. The same logic implies that conflicting accesses in the transfer interval also return undefined results. A successful sync call indicates the relaxed operations have all been "performed", thus ensuring any subsequent conflicting operations issued by the calling thread see the updated values. Programs that wish to enforce global visibility of a transfer should explicitly issue a fence or other strict operation after the sync call.

    The approach I'm describing significantly simplifies the current proposal (removing many unnecessary functions), makes the semantics easier to understand (by removing all the fence-related goop), and at the same time removes semantics which have the potential to reduce performance. It also brings it more in line with the memory model and the semantics of the existing blocking operations. I believe more high-level discussion of this nature is prudent before accepting the current semantics, which seem problematic in many ways.

    ```

    Reported by `danbonachea` on 2012-08-03 12:29:39
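
    As one concrete illustration of the barrier-phase usage class described in Point 4, here is a sketch using implicit-handle operations. The `_nbi`/`upc_synci` names follow one of the suffix conventions debated in this thread and are assumptions, not settled API; the point is that no individual transfer carries fencing semantics of its own, and the phase barrier supplies all inter-thread synchronization.

    ```
    #include <stddef.h>
    #include <upc.h>
    #include <upc_nb.h>   /* assumed header */

    #define CHUNK   8192
    #define NCHUNKS 16
    shared [CHUNK * NCHUNKS] char buf[CHUNK * NCHUNKS * THREADS];
    char slab[CHUNK * NCHUNKS];

    void exchange_phase(void) {
        int peer = (MYTHREAD + 1) % THREADS;
        /* Issue many implicit-handle transfers; none of them implies any
         * local or global completion guarantee of its own. */
        for (int c = 0; c < NCHUNKS; c++)
            upc_memput_nbi(&buf[(size_t)peer * CHUNK * NCHUNKS + (size_t)c * CHUNK],
                           slab + (size_t)c * CHUNK, CHUNK);
        upc_synci();   /* complete this thread's outstanding implicit ops */
        upc_barrier;   /* the phase barrier provides inter-thread sync */
    }
    ```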

  12. Former user Account Deleted

    ``` Regarding Comment #14, I very strongly disagree on all points and most of these issues have been considered prior to forming the consensus proposal. See comments below.

    "The transfers are relaxed operations and any required synchronization should be explicitly added by the user as strict operations or other fences around the transfer interval."

    No. A fence or strict access MUST complete ALL prior non-blocking operations to be compatible with the existing UPC memory model. Therefore, if you want to have multiple non-blocking operations in flight and then use a fence or strict access to complete ONE of them, you end up forcing completion of ALL of them. This is not acceptable for software pipelining of large copies.

    "Point 1: *The memory model already allows optimization of "blocking" memcopies*"

    No, existing UPC implementations implement upc_mem* as blocking copies. I believe both BUPC and Cray concluded that the functions had to be blocking via different reasoning and there's a comment somewhere (not in this Issue) that explains both lines of reasoning. The goal of this proposal is to provide non-blocking copy functions.

    "Point 2: *The primary goal of the async feature is performance* Async memcopies do not add any expressiveness to the language - ie any async data movement operations can be already be expressed with their fully blocking counterparts."

    Given that upc_mem* functions already exist in UPC 1.2, this is not a very compelling argument. I equally could argue that the existing upc_mem* functions are unnecessary because the language already provides expressiveness in terms of relaxed-access loops that a compiler should be able to automatically convert to the equivalent of a upc_mem* call. I could use the Cray compiler as a proof-of-concept that this optimization is possible and argue that the existing upc_mem* functions should never have existed, but I think there is a benefit to programmers in having these functions to make their intention explicit instead of relying on optimization.

    "Point 3: *Async calls are an annotation that explicitly suppress the memory model

    Programming language constructs that sit outside the memory model are dangerous and confusing. A memory model provides a way of reasoning about a program. No matter what part of my program I'm currently looking at, I have confidence that the rest of the program and any libraries that I've linked to written in that same language are following the same rules. If I write "upc_fence" then I know that nothing is in flight immediately after that statement executes -- I don't have to wonder if there's something elsewhere in the code that has a memory-model-exempt copy ongoing on which my upc_fence has no effect.

    "Point 4: *Async calls should not include additional fencing semantics*"

    Ideally, no, but we need a way to manage individual non-blocking transfers. We want to keep the existing meaning of upc_fence and explain semantics relative to that as much as possible. The half-fence is essentially how the Cray implementation works now.

    "The approach I'm describing significantly simplies the current proposal (removing many unnecessary functions), makes the semantics easier to understand (by removing all the fence-related goop) and at the same time removes semantics which have the potential to reduce performance."

    See above response regarding ignoring the memory model.

    "It also brings it more in line with the memory model and the semantics of the existing blocking operations. I believe more high-level discussion of this nature is prudent before accepting the current semantics, which seem problematic in many ways."

    No, it does not bring it in line with the memory model because it explicitly denies being part of the memory model.

    ```

    Reported by `johnson.troy.a` on 2012-08-03 14:53:09
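
    To make the software-pipelining concern concrete, here is a sketch with explicit handles in which the oldest piece is retired while the newest is still in flight; a upc_fence anywhere in the loop would drain every outstanding transfer and serialize the pipeline. The API names, and the assumption that syncing UPC_COMPLETE_HANDLE is a no-op, are taken from the drafts under discussion; `produce_piece` is a hypothetical data generator.

    ```
    #include <stddef.h>
    #include <upc.h>
    #include <upc_nb.h>   /* assumed header */

    #define PIECE   (1 << 16)
    #define NPIECES 8
    shared [PIECE * NPIECES] char dst[PIECE * NPIECES * THREADS];
    char stage[2][PIECE];   /* double-buffered staging area */

    extern void produce_piece(char *buf, int i);  /* hypothetical */

    void pipelined_put(int peer) {
        upc_handle_t h[2] = { UPC_COMPLETE_HANDLE, UPC_COMPLETE_HANDLE };
        for (int i = 0; i < NPIECES; i++) {
            int s = i & 1;
            upc_sync(h[s]);   /* retire only the piece last sent from stage[s] */
            produce_piece(stage[s], i);
            h[s] = upc_memput_nb(&dst[(size_t)peer * PIECE * NPIECES
                                      + (size_t)i * PIECE],
                                 stage[s], PIECE);
            /* A upc_fence here would force completion of BOTH in-flight
             * pieces -- exactly the over-synchronization objected to above. */
        }
        upc_sync(h[0]);
        upc_sync(h[1]);
    }
    ```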

  13. Former user Account Deleted

    ``` "A fence or strict access MUST complete ALL prior non-blocking operations to be compatible with the existing UPC memory model. "

    I think you missed an important point - I'm basically arguing the definition of "prior". I'm proposing that the accesses implied by the library are issued at an UNSPECIFIED time between the initiation and successful sync call, so any intervening fences need not synchronize them because it cannot be proven that those anonymous accesses were issued "prior" to that fence - therefore no violation of the memory model can be observed. Explicitly async semantics already introduce the user to the concept of an asynchronous data transfer agent, and I'm arguing that agent is issuing the abstract memory operations at an intentionally unspecified time within the interval between init and sync.

    Whatever semantics we come up with will need to be explained within the context of the formal memory model, and the easiest way to do this is to define the library's effects as a set of abstract operations. I propose to allow these operations to be abstractly issued anywhere within the transfer window, whereas you seem to be arguing they should be nailed down to all be issued at the initiation. I believe my approach is cleaner and allows for higher performance. I am NOT ignoring the memory model, I'm just defining the library semantics in such a way that fences don't interfere with its operation.

    "existing UPC implementations implement upc_mem* as blocking copies"

    This is an implementation decision, and is not required by the current (looser) specification. There have been prototype UPC implementations that perform software caching at runtime that relax this decision to provide some overlap for "blocking" operations. In any case, the upc_mem* operations do NOT imply any strict accesses or fences, so the current async proposal is definitely adding additional synchronization where none exists in the blocking version.

    ```

    Reported by `danbonachea` on 2012-08-03 15:27:31

  14. Former user Account Deleted

    ``` "I think you missed an important point - I'm basically arguing the definition of "prior". I'm proposing that the accesses implied by the library are issued at an UNSPECIFIED time between the initiation and successful sync call, so any intervening fences need not synchronize them because it cannot be proven that those anonymous accesses were issued "prior" to that fence - therefore no violation of the memory model can be observed. Explicitly async semantics already introduce the user to the concept of an asynchronous data transfer agent, and I'm arguing that agent is issuing the abstract memory operations at an intentionally unspecified time within the interval between init and sync."

    No, the memory model becomes completely broken (or alternatively, the async updates are useless) if we do this. Fences prevent backwards movement of relaxed accesses as well, so either all threads observe the results before the fence or all threads observe the results after. Relaxed accesses cannot be reordered with respect to fence, so regardless of when the "access" occurs for an async call, all threads still have to agree on it. In order to guarantee that all threads agree, most implementations are going to have to either sync at the fence, or delay starting the operation until after the last fence prior to the user's sync, thus defeating the purpose of allowing an async to bypass a fence in the first place.

    "I am NOT ignoring the memory model, I'm just defining the library semantics in such a way that fences don't interfere with its operation."

    And thus ignoring the rules surrounding reordering relaxed accesses in the presence of fences.

    "There have been prototype UPC implementations that perform software caching at runtime that relax this decision to provide some overlap for "blocking" operations."

    Remind me again what the trend is for available memory per-thread? I don't think that this can be considered a useful solution given that we barely have enough space to track the address ranges that are outstanding, let alone all the data. ```

    Reported by `sdvormwa@cray.com` on 2012-08-03 17:17:50

  15. Former user Account Deleted

    ``` "No, the memory model becomes completely broken (or alternatively, the async updates are useless) if we do this. Fences prevent backwards movement of relaxed accesses as well, so either all threads observe the results before the fence or all threads observe the results after. Relaxed accesses cannot be reordered with respect to fence, so regardless of when the "access" occurs for an async call, all threads still have to agree on it. In order to guarantee that all threads agree, most implementations are going to have to either sync at the fence, or delay starting the operation until after the last fence prior to the user's sync, thus defeating the purpose of allowing an async to bypass a fence in the first place."

    The abstract relaxed accesses which comprise the transfer need not be issued as a group all at once - they are a set of relaxed read/write operations that can be issued any time after the initiation call and before a successful synchronization returns. As such any fences in the transfer interval may occur before a subset of them have been issued. Understand I'm not suggesting this as an IMPLEMENTATION, this is merely the formalism I propose to define the semantics of the call within the existing formalism of the memory model, and has the side-effect that intervening operations do not interfere with the asynchronous transfer.

    More importantly, I'm proposing the source memory is required to be constant and the contents of the destination memory are explicitly undefined between the init and sync, so threads are not permitted to be "peeking" before sync anyhow, which is another reason that reordering with respect to intervening operations cannot be observed.

    ```

    Reported by `danbonachea` on 2012-08-03 17:30:02

  16. Former user Account Deleted

    ``` I'm mostly with Troy on this one. If an async memory operation has been waited upon already, it should be synchronized by fence - or we are in semantic hell.

    By my lights, just because an async memput has been waited on does not mean that the data has been deposited remotely. It only means that the send buffer can be reused. In this situation, if I go with Dan I will *never* know whether the transfer has finally completed. If I go with Troy then the fence will guarantee completion. Thus, Troy :)

    I have not thought through all other possible orderings of events (e.g. put, fence, wait).

    ```

    Reported by `ga10502` on 2012-08-03 17:54:56

  17. Former user Account Deleted

    ``` "If an async memory operation has been waited upon already, it should be synchronized by fence - or we are in semantic hell. "

    I don't believe there's any disagreement about that - once a sync operation has returned successfully (ie "waited upon") and you've subsequently issued a fence, the operation is guaranteed to be "complete" as far as all threads are concerned. Both approaches ensure this. The argument centers around fences that are issued BEFORE the sync, while the operation is still "in-flight", and what they mean.

    "if I go with Dan I will *never* know whether the transfer has finally completed"

    That's not the case :) In my interpretation, the transfer has "completed" with respect to the calling thread once the sync returns successfully (ie subsequent conflicting data accesses are preserved). The next strict operation ensures they are complete with respect to all threads. ```

    Reported by `danbonachea` on 2012-08-03 18:01:43

  18. Former user Account Deleted

    ``` "existing UPC implementations implement upc_mem* as blocking copies"

    As an additional counter-example, consider the important case of a system with full hardware shared memory support, like a large SMP. The Berkeley implementation of upc_memput/upc_memget on such a system boils down to a C99 memcpy(), which at ABI level results in a series of load/store instructions in a loop. Other UPC implementations on such hardware probably look similar. There are no architectural memory fences or compiler reordering fences inserted before or after the operation, because none are dictated by the UPC memory model for upc_memput/upc_memget. As a result, on a modern memory hierarchy these load/stores can and will be aggressively reordered with respect to surrounding, non-conflicting memory load/stores, which may correspond to surrounding relaxed operations that access memory with affinity to different language-level threads. The necessary cache coherency, write buffering, conflict checking and load/store reordering is performed entirely in hardware by all modern processors with shared-memory support. The current memory model specification for upc_memput/upc_memget is intentionally permissive of this implementation.

    Now it's true that an async memory copy facility will probably enable the largest performance benefit on loosely-coupled, distributed-memory hardware. However the semantics should be designed to still allow reasonable performance when run on the simple case of cache coherent shared memory hardware. The semantic insertion of memory fences around the proposed async operations to enforce the suggested completion guarantees has the potential to make async operations significantly MORE expensive and expose LESS overlap on shared memory hardware than the equivalent upc_mem{put,get} call which includes no extraneous fencing semantics. This seems fundamentally broken.

    ```

    Reported by `danbonachea` on 2012-08-03 23:32:34
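
    A sketch of the shared-memory point being made here: on a cache-coherent machine, upc_memput can lower to an ordinary memcpy over the local view of the shared destination, with no architectural or compiler fences, because the memory model treats the implied accesses as relaxed. This illustrates the argument only, not any particular implementation.

    ```
    #include <stddef.h>
    #include <string.h>

    /* Sketch only: a hypothetical shared-memory runtime's upc_memput,
     * where dst_local_view is the local address of the shared destination.
     * No fences surround the copy, so the hardware remains free to reorder
     * these stores with surrounding non-conflicting loads and stores. */
    void smp_upc_memput(void *dst_local_view, const void *src, size_t n) {
        memcpy(dst_local_view, src, n);
    }
    ```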

  19. Former user Account Deleted

    ``` You're misinterpreting "blocking" as inserting a full fence, which is a much stronger statement. The basic problem is that two relaxed operations issued by the same thread must be observed in program order. Therefore, the implementation must guarantee the ordering between upc_mem* copies and relaxed operations to those memory addresses by the calling thread. Because it is difficult in general to prove there are no other accesses to those memory addresses, implementations must do something to prevent incorrect orderings. In the case of your "large SMP", the hardware takes care of it. On distributed systems that lack such hardware support, one simple solution is to simply block until the upc_mem* operation is globally visible before issuing any further operations. ```

    Reported by `sdvormwa@cray.com` on 2012-08-04 03:57:13

  20. Former user Account Deleted

    ``` Sorry, the second sentence in comment 22 should have been "two relaxed operations to the same memory location issued by the same thread", not just "two relaxed operations issued by the same thread". ```

    Reported by `sdvormwa@cray.com` on 2012-08-04 04:04:14
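
    The corrected requirement can be shown directly: two relaxed operations by one thread to the same location must be observed in program order, so an implementation must order a upc_mem* copy against a later conflicting access by the calling thread. A minimal example:

    ```
    #include <upc_relaxed.h>

    shared int x;   /* affinity to thread 0 */

    void same_location_ordering(void) {
        if (MYTHREAD == 1) {
            int tmp = 42;
            upc_memput(&x, &tmp, sizeof(int));  /* relaxed write(s) to x */
            x = 7;                              /* relaxed write to the SAME location */
            /* Any thread that observes both values must see 42 before 7.
             * On a network with multiple paths to memory, the runtime has
             * to enforce this in software, e.g. by waiting for the memput
             * to complete before issuing the store. */
        }
    }
    ```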

  21. Former user Account Deleted

    ``` Troy said: "existing UPC implementations implement upc_mem* as blocking copies"

    sdvormwa said: "the implementation must guarantee the ordering between upc_mem* copies and relaxed operations to those memory addresses by the calling thread."

    Correct - I'm very aware of the memory consistency requirements involved for upc_mem*, having written the memory model, the relevant spec language, and several implementations myself :).

    I was responding to Troy and illustrating that there are platforms of interest which can satisfy those requirements without any sort of "blocking" in the implementation whatsoever. Cache coherent shared memory platforms can implement upc_mem* as a simple set of load/stores, which the hardware is then free to aggressively reorder in both directions with respect to surrounding non-conflicting load/stores that correspond to other relaxed operations. The 1.2 spec for those operations is deliberately loose enough to allow this important optimization in hardware.

    One of my many criticisms of the current async proposal is that it cannot achieve this level of overlap on the important case of cache-coherent shared-memory systems, because the semantics require the insertion of "half-fences" and guarantees of local/global visibility, which are significantly more heavyweight and would inhibit the hardware-provided reordering optimizations. As such, the simple case of upc_mem*_async();gsync() would be expected to perform SLOWER and provide LESS overlap than the existing analogous upc_mem* call for those systems. It makes no sense to introduce an "async" library whose semantics for memory overlap are often MORE restrictive than the existing synchronous counterparts.

    The larger point I'm trying to make here is that in introducing an async interface, we should be careful to specify semantics that are uniformly a RELAXATION relative to the existing synchronous library. As performance is the primary and solitary justification for the user to call the more complicated interface, any semantics with the potential to hinder performance on platforms of interest should be rejected. The sole source of this semantic relaxation is the introduction of a "transfer interval", between the init and sync call, where the application promises "no threads are looking at my src/dst buffers", and therefore the implementation is free to perform the data transfer operations inside that interval without worrying about any of the activities of the ongoing computation. I contend this "promise" from the application should extend all the way until the library sync call, and consequently it is impossible for any thread to observe a violation of the memory model by inserting fences before the library operation has been synced (because no thread is allowed to be looking at the memory in question during that time). Stated another way, the application makes a promise not to look at the transfer memory until after a successful sync, and is not permitted to dynamically "change its mind" by issuing a fence in the middle of that interval - the assertion is made at the init call and remains in force until a successful sync.

    ```

    Reported by `danbonachea` on 2012-08-04 09:21:58

  22. Former user Account Deleted

    ``` Without the half-fence though, you have no way of implementing the very important use case of notifying another thread that the async transfer has completed without doing a MUCH more expensive full fence, as any relaxed operation may be reordered before the gsync. That is the prime motivator of our insistence on including the half-fence on the gsync, as this use case is one of the most important to our customers. Such a half fence may seem useless (or even too expensive) on systems where its performance is roughly in line with that of a full fence, but there are systems where the full fence is significantly more expensive. ```

    Reported by `sdvormwa@cray.com` on 2012-08-04 16:47:09

  23. Former user Account Deleted

    ``` "The sole source of this semantic relaxation is the introduction of a "transfer interval", between the init and sync call, where the application promises "no threads are looking at my src/dst buffers", and therefore the implementation is free to perform the data transfer operations inside that interval without worrying about any of the activities of the ongoing computation."

    So in other words, use of the async operations tells the implementation to ignore the memory model for a given range of memory locations until the gsync is reached. The memory model is confusing enough already without the additional headache of language constructs that ignore it. ```

    Reported by `sdvormwa@cray.com` on 2012-08-04 17:09:55

  24. Former user Account Deleted

    ``` "Without the half-fence though, you have no way of implementing the very important use case of notifying another thread that the async transfer has completed without doing a MUCH more expensive full fence, as any relaxed operation may be reordered before the gsync."

    I acknowledge this is an important usage case, one that we refer to as a "signalling put", ie performing a memput and notifying the target of arrival. We've seen this application desire in other places as well. I agree that a user who is given ONLY the async library and no additional tools would need to implement this using a flag write - under the Berkeley semantics, he would need to use a strict write after sync. The proposed Cray semantics include extra semantics intended to allow him to use a relaxed write for signalling, however as I previously expressed, using relaxed operations for synchronization seems highly "sketchy" and completely contrary to the UPC memory model philosophy and officially stated "best practice". I'm not even convinced this is guaranteed to be correct, lacking a formal proof (and a formal definition of "half-fence"). Even if it works for this very specific case, encouraging the use of relaxed writes for synchronization as a UPC programming practice seems like a very Bad Idea and likely to lead to nightmarish race condition bugs for users.

    In any case, I would argue the correct solution to this application requirement is NOT to saddle the async primitives with additional semantics that allow the user to "roll his own" questionable synchronization. A far better solution to that usage case is to introduce a DIFFERENT library function that encapsulates exactly the semantics required - ie perform a non-blocking memput and update a flag at the target when it arrives. Such an interface is more user-friendly, less error-prone, and ADDITIONALLY has the potential to eliminate an entire network round-trip for the signal write on a loosely-coupled system. We've been talking about introducing such an interface for a long time, and Berkeley UPC includes a working prototype implementation, described here: http://upc.lbl.gov/publications/upc_sem.pdf If this is truly an important usage case for your customers, then I suggest we split that discussion into a separate issue and consider a library call to meet that need.

    Let's set aside that usage case for the moment and assume we independently arrive at a library solution that provides an encapsulated solution for that application requirement. With that in place, can we agree to remove these potentially costly half-fence semantics from the proposed interface? The fact that Cray can implement them efficiently on one platform of interest does not justify their serious potential performance impact on other hardware.

    ```

    Reported by `danbonachea` on 2012-08-04 17:20:49
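
    For reference, the signalling-put pattern at issue, sketched under the Berkeley-style semantics (header and function names are assumed from the drafts under discussion): the signal is a strict write issued after the sync. Under the Cray half-fence gsync, a relaxed write in its place is argued to suffice, which is precisely the point in dispute above.

    ```
    #include <stddef.h>
    #include <upc.h>
    #include <upc_nb.h>   /* assumed header */

    #define NBYTES 1024
    shared [NBYTES] char box[NBYTES * THREADS];
    strict shared int flag[THREADS];   /* strict-qualified: used only for sync */
    char payload[NBYTES];

    void signalling_put(int peer) {
        upc_handle_t h = upc_memput_nb(&box[(size_t)peer * NBYTES], payload, NBYTES);
        upc_sync(h);      /* transfer performed w.r.t. the calling thread */
        flag[peer] = 1;   /* strict write publishes the transfer to the target */
    }

    void wait_for_data(void) {
        while (flag[MYTHREAD] == 0)
            ;   /* strict read: spin until signalled */
        /* the payload in box[MYTHREAD * NBYTES ...] is now visible */
    }
    ```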

  25. Former user Account Deleted

    ``` "I'm not even convinced this is guaranteed to be correct, lacking a formal proof (and a formal definition of "half-fence")."

    We already have an implicit partial definition, as it is what prevents updates to the same memory location from being reordered. A gsync simply becomes a single relaxed operation that covers all memory locations. ```

    Reported by `sdvormwa@cray.com` on 2012-08-04 17:30:24

  26. Former user Account Deleted

    ``` "The fact that Cray can implement them efficiently on one platform of interest does not justify their serious potential performance impact on other hardware."

    The existing memory model already imposes serious performance impacts on large distributed memory systems (including at least Cray and IBM systems), as they have to jump through enormous hoops to prevent the reordering of operations to the same memory location. These large distributed memory systems are also those that benefit the most from async operations, as the remote memory latency is so large, and the bandwidth relatively low compared to local accesses. Are you seriously saying that we should constrain those systems even more because of concerns with the relatively small impact on smp systems that don't have much to gain from using the asyncs in the first place? ```

    Reported by `sdvormwa@cray.com` on 2012-08-04 17:49:47

  27. Former user Account Deleted

    ``` "So in other words, use of the async operations tells the implementation to ignore the memory model for a given range of memory locations until the gsync is reached. The memory model is confusing enough already without the additional headache of language constructs that ignore it."

    I'm sorry but you're still not getting it. I'm not proposing to ignore the memory model. I think what's lacking is an understanding of what the memory model actually guarantees - I highly recommend you go re-read the formal semantics in Appendix B (the actual model, not the Cliff notes in 5.1).

    The memory model is NOT an operational description of a virtual machine, nor does it prescribe the contents of memory, even in the abstract. It is sometimes convenient to think about and discuss it in an operational sense, but that is NOT the basis of the formalism, and ultimately that mode of reasoning may be misleading and diverge from the true guarantees.

    The memory model is defined entirely in terms of relaxed and strict reads and write operations, and for a given execution trace of a VALID UPC program it determines whether the execution was "UPC Consistent", in that one can construct the appropriate partial orders <_t and total order <_strict that satisfy the guarantees it provides. I'm not going to paste in the entire formalism here - it's all in appendix B. However, a VERY important and deliberate property of the model is that it does not make any guarantees about the possible results of operations that did not occur. Stated another way, if the execution trace did not directly "observe" a violation of the model, then the execution is consistent, regardless of what tricks the implementation may be playing under the covers (whose effects were not observed by any application read operations).

    The Berkeley semantics for async are that it is ERRONEOUS for any application thread to modify the source buffer or in any way access the destination buffers during the transfer interval, between the library initiation call and the successful return of library sync call. A program that "cheats" and touches these buffers in the forbidden interval is an INVALID UPC program, and the memory model does not provide ANY guarantees whatsoever for an invalid program. Valid UPC programs Shall Not touch those buffers within the transfer interval, and this property makes it IMPOSSIBLE for them to observe exactly how those buffers were accessed by the library, and how those accesses may or may not have been affected by other non-related fences or synchronization constructs. Because all executions of valid programs are prohibited from observing any violations, by definition the memory model is preserved and the executions are "UPC Consistent". This is the essence of how the memory model works - if VALID programs cannot tell the difference, then the model is not violated.

    ```

    Reported by `danbonachea` on 2012-08-04 17:50:09
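
    A sketch of what the "transfer interval" contract forbids under these semantics (API names assumed from the proposal): both accesses marked ERRONEOUS below make the program invalid, which is exactly why no valid program can observe how intervening fences interact with the in-flight transfer.

    ```
    #include <stddef.h>
    #include <upc.h>
    #include <upc_nb.h>   /* assumed header */

    #define N 1024
    shared [N] int dst[N * THREADS];
    int src[N];

    void transfer_interval_rules(int peer) {
        upc_handle_t h = upc_memput_nb(&dst[(size_t)peer * N], src, N * sizeof(int));

        src[0] = -1;                     /* ERRONEOUS: the source must stay
                                            constant during the transfer interval */
        int v = dst[(size_t)peer * N];   /* ERRONEOUS: destination contents are
                                            undefined before the sync */
        upc_sync(h);
        src[0] = -1;                     /* fine: the transfer interval is over */
        (void)v;
    }
    ```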

  28. Former user Account Deleted

    ``` I understand the memory model argument completely. My qualm is with the ERRONEOUS part, as I think it is both confusing to programmers and difficult to detect. That combination will lead to programming mistakes that are extremely hard to debug. Simply saying "this program is invalid, so the behavior is undefined" is a nice cop-out for the language designer, but it's not so nice for the programmers. ```

    Reported by `sdvormwa@cray.com` on 2012-08-04 18:03:08

  29. Former user Account Deleted

    ``` " My qualm is with the ERRONEOUS part, as I think it is both confusing to programmers and difficult to detect. That combination will lead to programming mistakes that are extremely hard to debug."

    I'm sorry but that's the very semantic basis of what's involved in any explicitly asynchronous transfer library. This is precisely the additional complexity that sophisticated users are accepting when they choose to use any async library. If users cannot figure out how to leave the buffers untouched during the transfer interval, then they have no business using an asynchronous library.

    Are you seriously arguing that we should care about the observed behavior of ERRONEOUS programs? I can easily devise many erroneous programs that lead to very bizarre and inexplicable behaviors on any system of your choice, without even touching the UPC libraries. Our task as specification writers is to clearly define the contract between the user (who writes programs which the spec judges to be VALID) and the implementation (which generates executions with the guaranteed behavior for those valid programs).

    ```

    Reported by `danbonachea` on 2012-08-04 18:11:58

  30. Former user Account Deleted

    ``` "Are you seriously arguing that we should care about the observed behavior of ERRONEOUS programs?"

    No, I'm arguing that async transfers should be defined as relaxed operations, and not excuse them from the rules regarding the ordering of relaxed operations in the presence of a fence. Then we don't need to bother with your proposed cop-out in the first place. Does this argument mean that some systems won't benefit as much from the asyncs? Yes. But those same systems get more benefit from the existing routines, so it balances out nicely. ```

    Reported by `sdvormwa@cray.com` on 2012-08-04 18:22:29

  31. Former user Account Deleted

    ``` I don't think it's productive to continue this discussion in the present mode. It really feels like this is devolving into a textual shouting match, which is not a useful form of idea-sharing, collaboration or consensus building. I believe both sides have stated their positions, but the discussion has drifted from impartial analysis of the core issues to "I like my way, I hate your way, lets see how I can make the opposite side look ridiculous".

    This is obviously a highly contentious issue, involving significant semantic subtlety and non-trivial implications for existing implementations. I believe some impartial moderation is called for, and some face-to-face (or at least voice-to-voice) interaction. I believe one of the action items from Friday's telecon was to setup a telecon devoted to this issue amongst interested parties.

    Can we try that as a next step for making progress on some of these issues? ```

    Reported by `danbonachea` on 2012-08-04 18:25:56

  32. Former user Account Deleted

    ``` That's probably a good idea. ```

    Reported by `sdvormwa@cray.com` on 2012-08-04 18:28:51

  33. Former user Account Deleted

    ``` Taking a step back and looking at the currently archived discussions, it seems to me that at the core of the disagreement is that the Cray and Berkeley proposals are trying to meet different needs. The analogy that comes to mind is automotive: Berkeley has designed a "manual transmission" API and Cray has designed "automatic transmission". Each has its place in the world, but we are now trying to pick exactly one to go into the UPC spec.

    I have always believed that the Berkeley design is correct for the goals it is meant to address, and I suspect that if I understood the background of Cray proposal better I would also find it equally well suited to its design goals. So, I'd like to suggest that on the conference call we might start by trying to discuss/understand the GOALS rather than the various issues regarding the designs that have evolved to meet those goals.

    Since the Berkeley semaphore/signaling-put extension (see http://upc.lbl.gov/publications/upc_sem.pdf) which Dan recently mentioned is (I believe) intended to address synchronization goals vaguely similar to Cray's async memcpy proposal, it may be helpful to at least skim that document before the call. ```

    Reported by `phhargrove@lbl.gov` on 2012-08-05 20:19:09

  34. Former user Account Deleted

    ``` "I don't think it's productive to continue this discussion in the present mode. It really feels like this is devolving..."

    No, you started a useful discussion. If it has felt as if it is devolving, please be aware that this discussion comes at a relatively late phase, after a subcommittee was formed to develop a consensus proposal. I'm not sure why you weren't involved on the BUPC side of things. I agree that we need to discuss this issue at length on the recently scheduled telecon, but I don't think that precludes discussion online because the telecon isn't for two more weeks.

    "the core of the disagreement is that the Cray and Berkeley proposals are trying to meet different needs. The analogy that comes to mind is automotive: Berkeley has designed a "manual transmission" API and Cray has designed "automatic transmission". Each has its place in the world, but we are now trying to pick exactly one to go into the UPC spec."

    When working on the consensus proposal, I viewed the difference as BUPC and Cray starting from different origins, but both wanting a solution that is consistent with the memory model and useful to users. I saw BUPC as starting with the memory model, fitting in _async extensions, and then attempting to make them useful to users by exempting the extensions from normal fence semantics. I saw Cray as starting with a description of what the users wanted to do, writing _nb extensions to let them do it, then making them fit into the memory model without losing their utility by introducing the half-fence concept.

    While we're talking philosophy here, I think it's very important in this discussion that we not lose sight of the UPC spec as being the primary mechanism whereby users can find out how the language -- and presumably their compiler -- works. UPC isn't like C or C++ where users can find zillions of books and online resources to help them out. We should try to minimize putting things in the spec where the spec says one thing but 99% of implementations will do something that is apparently completely different but compliant. For example, the reason that we're even discussing this problem is that the existing upc_mem* functions are blocking on many implementations. The spec doesn't make them blocking, and any users reading the spec will see that they just wrap up a bunch of relaxed accesses into a convenient function call, but the functions generally are blocking and performance-conscious users must think of them that way. To continue that example, I believe that users will view a non-blocking call as initiating the copy before it returns because most implementations will do that. If the spec does not require that behavior, then we're again in the same confusing situation where there are basically two standards: (1) the UPC spec, and (2) how most UPC implementations work. ```

    Reported by `johnson.troy.a` on 2012-08-06 16:45:26

  35. Former user Account Deleted

    ``` After a chat about the NB memcpy proposals with one of my users, I thought I should pass along one thing he said:

    I do strongly feel that the semantics for nonblocking reads/writes should be the same as for the nonblocking collectives (if and when they get implemented). So any discussion of this should take that into account, even though the collectives are in a different proposal. (I don't see really needing the extra flexibility of the Berkeley proposal for reads and writes, but I'm less sure about collectives.)

    ```

    Reported by `nspark.work` on 2012-08-06 21:49:18

  36. Former user Account Deleted

    ``` "While we're talking philosophy here, I think it's very important in this discussion that we not lose sight of the UPC spec as being the primary mechanism whereby users can find out how the language -- and presumably their compiler -- works."

    Philosophically, I strongly disagree that the spec should be geared as a training tool for users, or as a substitute for vendor-provided documentation of implementation-specific behaviors. Behavioral descriptions of particular implementations or even expected implementations have no place in a formal language spec. The specification is a contract between all users and all implementations, and historically the UPC spec always strived to specify necessary and sufficient semantics - ie the minimally necessary restrictions on the implementation to provide sufficient functionality for the user. As the spec gains implementation restrictions and operational codification of behavior, you reduce the space of legal implementations and optimizations, potentially leading to performance degradation. Programming languages have a much longer life cycle than hardware systems, so as language writers we must be sensitive not only to current implementation strategies and platforms, but must also do our best to allow for improvement via future strategies and hardware. It's difficult to accurately predict where hardware will be in 5 or 10 years, but minimizing spec requirements to necessary and sufficient conditions gives us the most "wiggle room" to accommodate a changing hardware landscape in the future.

    "the reason that we're even discussing this problem is that the existing upc_mem* functions are blocking on many implementations. The spec doesn't make them blocking, and any users reading the spec will see that they just wrap up a bunch of relaxed accesses into a convenient function call, but the functions generally are blocking and performance-conscious users must think of them that way. "

    To address your specific point about existing upc_mem* behavior, there is a very important semantic difference between "blocking" (ie synchronous) and strict (ie surrounded by fences that prevent access movement). These may happen to have similar performance characteristics under naive translation on a current distributed system, but are quite different on current systems with hardware shared memory support. One could imagine future systems with better hardware support for UPC where the difference could be even more significant. The difference is also quite important as far as the compiler is concerned - the relaxed semantics of upc_mem* allows for a good optimizer and/or a smart runtime system to intelligently reorganize and schedule the data transfer, using only serial/local data analysis. The appearance of any fences severely limits what an optimizer can do, because full parallel analysis with complete program information is usually required for provably correct transformations around fences. The fact that some implementations make no effort to exploit this semantic does not mean that the spec should be written to preclude such optimizations, which is why upc_mem* has the semantic specification that it does.

    " To continue that example, I believe that users will view a non-blocking call as initiating the copy before it returns because most implementations will do that. If the spec does not require that behavior, then we're again in the same confusing situation where there are basically two standards: (1) the UPC spec, and (2) how most UPC implementations work."

    I see no compelling reason to require implementations to issue all accesses before returning from initiation, even in software. I can easily imagine implementations that could improve throughput under high load by delaying initiation based on network status. At the hardware level, we WANT the operations to be "in-flight" for as much of the transfer interval as required (that's the entire point of using an async library), and the asynchronous agent (eg the RDMA engine) should have the freedom to initiate accesses that perform the transfer when appropriate based on network resource status.

    ```

    Reported by `danbonachea` on 2012-08-06 22:54:27
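
    A minimal sketch of the distinction drawn in the second paragraph above between relaxed and strict (fenced) transfers; the two wrapper functions are illustrative, and only upc_memput and upc_fence are standard:

    ```
    #include <stddef.h>
    #include <upc_relaxed.h>

    /* What the spec specifies: upc_memput is a bundle of relaxed writes,
       which an optimizer or runtime may reorder, coalesce, or reschedule. */
    void put_relaxed(shared char *dst, const char *src, size_t n) {
        upc_memput(dst, src, n);
    }

    /* What a naive implementation effectively delivers: fences around the
       transfer, pinning it in program order and blocking such optimization. */
    void put_strict_like(shared char *dst, const char *src, size_t n) {
        upc_fence;
        upc_memput(dst, src, n);
        upc_fence;
    }
    ```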

  37. Former user Account Deleted

    ``` Cray's proposal is trying to solve the problem that the same-address restriction prevents the compiler/run-time library from making the existing upc_mem* routines non-blocking on machines where the hardware provides multiple paths to (remote) memory, and thus must rely on software to make ordering guarantees. Most high-performance scalable networks (including both Cray's and IBM's current offerings) are designed in this way, as it provides greater bandwidth and improved resilience against hardware failures. Looking 10 years out, we don't see this situation changing significantly, as most networks are moving more and more in this direction. We therefore don't believe it is reasonable to expect hardware support on large distributed memory systems for the foreseeable future.

    To enforce the ordering in software, an implementation must track operations that are "in-flight" and resolve conflicts in some way. One proposed approach to this is software caching of relaxed accesses. However, we do not believe this is a viable approach (in the context of this discussion) for large systems for the same reason it's not done in hardware: lack of memory. The size of your cache determines the upper limit on the amount of data you can have in-flight. Non-blocking operations are most useful when you have a lot of data to move, and the cache must be relatively small so there's still enough room for user data. It is also complex to implement and can easily hurt performance more than it helps without per-application tuning.

    Another approach is to track which shared memory locations have operations that are in-flight, and insert syncs of some kind when a conflict is detected. There's still a memory problem, but instead of large contiguous transfers being the problem, smaller "random-access" transfers kill the scalability of this approach, as the implementation can't efficiently store lots of "random" scattered memory addresses, and must therefore rely on much more coarse tracking. I believe this is what IBM claimed to be doing (with a bit-vector permitting a single "in-flight" operation per remote thread/node if I recall correctly?) on one of the earlier phone conferences. Cray does something similar to this, with the caveat that upc_mem* routines are always synced before returning for various reasons. However, there is a noticeable overhead to this tracking, particularly on some important (to our customers) access patterns.

    Other approaches either can't handle relatively common corner cases (static compiler analysis) or don't take advantage of available hardware offload mechanisms and have other scalability issues (active messages/RPC). We therefore need some help from the user to get around this.

    The "half-fence" that we proposed on the global sync formally provides acquire semantics on relaxed accesses. This is necessary to permit pairwise synchronization with a remote thread via relaxed operations to notify that thread that the non-blocking operation is complete. It is important that this be done with relaxed operations, as using strict operations would unnecessarily sync other non-blocking operations (which may include much more than simply the user's explicit use of the proposed routines!). If another method of providing this functionality is made available, either via a new type of fence (upc_fence_acquire/upc_fence_release?) or Berkeley's semaphore proposal (which I haven't read yet), then I don't think we'd have a problem dropping this part of our proposal.

    In terms of spec changes, I believe our proposal is much more conservative than Berkeley's. Importantly, the new restrictions on accessing memory locations involved in a call to one of the proposed routines apply ONLY to the calling thread in our proposal. As far as all the other threads are concerned, the proposed routines behave just like the existing upc_mem* routines, and thus no changes to the memory model are required--minus the "half-fence", which I think Dan has convinced me could be better provided in a different manner. The proposed routines are simply another way to perform relaxed shared memory accesses, with the benefit/caveat that the same-address restriction is lifted between the initiation of the operation and the sync. We believe this behavior is sufficient to provide the amount of communication/computation overlap users desire without adding significant additional complexity to the memory model.

    We DO NOT believe permitting non-blocking operations to continue beyond a fence provides any useful additional functionality (perhaps you could provide an example where this is necessary?). We DO believe that allowing it will confuse users who expect upc_fence (or worse, a UPC barrier!) to be a full memory barrier. Additionally, it is a non-trivial task for the implementation to detect and warn users when they've (hopefully accidentally) written "illegal" code that accesses memory locations involved in a call to one of the proposed routines on a thread other than the calling thread before the sync, and will therefore be hard-pressed to aid the user in debugging the problems that this will cause. We previously proposed adding a class of "super-relaxed" operations, which were relaxed operations that didn't have the same-address restriction. It was rejected because of concerns it'd be too confusing to users, and added too much complexity to the memory model. I can't imagine this is any less confusing, given that the legality of a user's code won't be immediately obvious nor easily provable in all cases.

    "Taking a step back and looking at the currently archived discussions, it seems to me that at the core of the disagreement is that the Cray and Berkeley proposals are trying to meet different needs. The analogy that comes to mind is automotive: Berkeley has designed a "manual transmission" API and Cray has designed "automatic transmission". Each has its place in the world, but we are now trying to pick exactly one to go into the UPC spec."

    I think this is exactly the case, though I don't quite understand your automatic versus manual transmission analogy. To my mind, a better analogy would be traffic at a street light. Cray proposed a system that allows the user to say "trust me, I'll make it through before it turns red" to allow vehicles to continue when the light turns yellow, but doesn't allow anyone through a red light. Berkeley proposed letting some vehicles go right through a red light, and denying insurance claims if an accident occurs due to the "illegal driving" of a vehicle with a green light hitting them. ```

    Reported by `sdvormwa@cray.com` on 2012-08-07 14:13:01
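
    A sketch of the aliasing problem described above: under the UPC 1.2 same-address rule, an implementation that cannot prove the two destinations never overlap must complete (or precisely track) the first transfer before issuing the second. The function is illustrative:

    ```
    #include <stddef.h>
    #include <upc_relaxed.h>

    /* dst1 and dst2 come from the caller; with separate compilation the
       compiler cannot prove they never alias, so the runtime must
       conservatively order these two transfers in software. */
    void exchange(shared int *dst1, shared int *dst2,
                  const int *src1, const int *src2, size_t n) {
        upc_memput(dst1, src1, n * sizeof(int));
        upc_memput(dst2, src2, n * sizeof(int));
    }
    ```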

  38. Former user Account Deleted

    ``` Because our common goal is to develop a consensus proposal, may I propose the following: let's discuss the disagreement points one by one instead of referring to the whole proposal. I think there are good points on both sides so why not combine and agree on the best.

    Here is my attempt to summarize the current disagreements:

    1) Should upc_fence (strict memory ops in general) guarantee the completion of outstanding non-blocking memory operations? A subcommittee of 5 people (including myself) had agreed to "Yes". But since there are some different opinions now, let's revisit this issue.

    2) Should the "sync" calls have fence/half-fence semantics?

    3) Should there be both local and global sync functions?

    4) Function naming (minor)

    Please add and/or change the discussion points if you have any others. I hope the list of disagreements will converge to zero as our discussion goes along.

    ```

    Reported by `yzheng@lbl.gov` on 2012-08-07 16:27:55

  39. Former user Account Deleted

    ``` "let's discuss the disagreement points one by one instead of referring to the whole proposal. I think there are good points on both sides so why not combine and agree on the best."

    Yili- You've summarized the low-level technical differences between the two approaches, but I don't think that's the correct level of discussion at this time. I think what these discussions have revealed is that the two proposals differ in their details because they were designed with a different set of high-level goals and to satisfy a different set of user needs. The technical details mostly follow logically from those differing goals. We cannot arrive at a consistent and well-designed interface by resolving technical points in a vacuum, without first straightening out the high-level goals of the interface.

    sdvormwa@cray.com: "The proposed routines are simply another way to perform relaxed shared memory accesses, with the benefit/caveat that the same-address restriction is lifted between the initiation of the operation and the sync. We believe this behavior is sufficient to provide the amount of communication/computation overlap users desire without adding significant additional complexity to the memory model."

    I think Paul is correct that we need a high-level discussion about goals of the interface. Alleviating the same-address restriction is nice, but is NOT the major goal the Berkeley proposal was trying to accomplish. Conflicting writes to the same address from a single thread with no intervening fence is not a pattern we expect in well-tuned applications, because it represents a neglected opportunity for communication coalescing. That being said, it may occasionally happen and still needs to be handled correctly, but it's not the case we're most interested in tuning for. Neither are we designing the async memcpy library to specifically serve as a "signalling put" - this is an important usage case that we feel deserves its own separate library interface and should not be conflated with pure asynchronous data movement.

    Our goal with the Berkeley async transfer library was to enable far more aggressive overlap of communication with unrelated computation and other communication. We are trying to overlap the entire cost of a communication, and allow it to asynchronously continue undisturbed without interference from unrelated operations. The boundaries of the asynchronicity are defined by the init and sync library calls (as part of the "contract" between the app and library), not by random fences that may happen to occur in the unrelated code. The need we are trying to meet is for the user to explicitly assert "perform this transfer in the background, and I will explicitly call you again when I need to ensure it has completed" - this is a familiar paradigm in other parallel libraries. I think it would be more surprising to a user who has invoked the async library to find that, when he calls an unrelated application module written in UPC, the async transfers in his module suddenly stop achieving overlap because the callee module uses a fence somewhere to synchronize some completely unrelated data.

    ```

    Reported by `danbonachea` on 2012-08-07 18:21:46

  40. Former user Account Deleted

    ``` "Conflicting writes to the same address from a single thread with no intervening fence is not a pattern we expect in well-tuned applications, because it represents a neglected opportunity for communication coalescing."

    That is not the problem though. The issue is that unless the implementation can PROVE there are no conflicting writes, it must conservatively assume there are, which impacts just about all codes. Good compiler analysis can help in some cases, but there are important cases that it can't help with, usually due to other language design decisions--separate compilation probably being the most obvious. Runtime caching / tracking / coalescing can all help sometimes as well, but the memory overhead limits their usefulness, and they tend to not scale well beyond a certain number of threads. ```

    Reported by `sdvormwa@cray.com` on 2012-08-07 20:07:38

  41. Former user Account Deleted

    ``` I'm not sure if there is any substantial difference in the high-level goals of this extension -- skipping the adjectives, isn't the high-level goal the same on both sides: enabling communication/computation and communication/communication overlap? (Note: I would like to save the discussion about half-fence-at-sync for a different post.)

    Actually, for many common cases where no fence is used between nb init and nb sync, both the original Berkeley and Cray proposals behave similarly, if not the same. The main disagreement is on how to handle the special case when a fence is used between an init and the corresponding sync.

    danbonachea: "Alleviating the same-address restriction is nice, but is NOT the major goal the Berkeley proposal was trying to accomplish. Conflicting writes to the same address from a single thread with no intervening fence is not a pattern we expect in well-tuned applications, because it represents a neglected opportunity for communication coalescing. That being said, it may occasionally happen and still needs to be handled correctly, but it's not the case we're most interested in tuning for. "

    I think "alleviating the same-address restriction" is NOT a goal but a Mechanism to achieve the goal of overlapping. Because of the same-address restriction, the UPC compiler/runtime cannot perform reordering optimization for 99% of common cases where there are actually no same-address accesses but the compiler/runtime just cannot prove its absence. Another way to view the nb memcpy functions is that they provide a library approach for users to express "super relaxed" data accesses.

    I like Steve's analogy of "allowing outstanding non-blocking memory operations to pass a fence is like allowing cars to pass a red light". While there could be special situations to justify such violations, I generally prefer to obey the traffic laws.

    ```

    Reported by `yzheng@lbl.gov` on 2012-08-07 20:26:00

  42. Former user Account Deleted

    ``` "The issue is that unless the implementation can PROVE there are no conflicting writes, it must conservatively assume there are, which impacts just about all codes."

    I completely agree - this is ONE of the main motivations for an explicitly asynchronous library. My point is that it's not the ONLY reason for using such a library and not the sole design goal, as your text I quoted in comment #43 seems to indicate. Specifically, it is not "sufficient" for the library to provide a tool to suppress the "same-address" restriction; we also want the semantics to enable full overlap of the communication with other, fully-general and unrelated activity (which the user asserts does not touch the transfer buffers).

    ```

    Reported by `danbonachea` on 2012-08-07 20:38:59

  43. Former user Account Deleted

    ``` " isn't the high-level goal the same on both sides: enable communication/computation and communication/communication overlaps?"

    Both sides probably agree to that broad statement, but we need a more detailed and concrete description of the types of usage cases we wish to support, and how the library fits into those cases.

    "I like Steve's analogy of "allowing outstanding non-blocking memory operations to pass a fence is like allowing cars to pass a red light". While there could be special situations to justify such violations, I generally prefer to obey the traffic laws. "

    I don't think we should be debating formal semantics by analogy.

    However since people seem seduced by the analogy, I think Steve's characterization is flawed. I think the Berkeley semantics are better described as an overhead expressway - it bypasses all the city traffic below and is unaffected by city traffic lights, because the laws of the road guarantee the cars below cannot even SEE the highway traffic, let alone interact with it. The on-ramps and off-ramps are clearly defined by the library calls which define where cars enter and exit the normal flow of relaxed operations on the city streets, but while they're "in-flight" on the expressway they operate completely independently of everything else.

    ```

    Reported by `danbonachea` on 2012-08-07 21:17:37

  44. Former user Account Deleted

    ``` "Conflicting writes to the same address from a single thread...not the case we're most interested in tuning for"

    Same for us. It's a rare case that has unfortunate performance consequences for the common case in at least two vendor implementations. We don't optimize for it happening; we try to deal with it in a way that minimizes the impact that its very existence has on the common case.

    "Neither are we designing the async memcpy library to specifically serve as a signalling put"

    Cray calls this a put-with-notify and we're interested in that functionality becoming part of UPC. If it is separate from the _async/_nb functions, then so be it, but it does mean introducing more library functions than if _async/_nb could be used instead.

    "The boundaries of the asynchronicity are defined by the init and sync library calls...not by random fences"

    "Because all executions of valid programs are prohibited from observing any violations, by definition the memory model is preserved and the executions are "UPC Consistent". This is the essence of how the memory model works - if VALID programs cannot tell the difference, then the model is not violated." [Comment #30]

    Let me paraphrase that to make sure that I've got it, and then come at this from a slightly different angle than I have before. I still have my previous objections about the async fence behavior, but I want to look at upc_barrier because I think users will find that more surprising...

    The BUPC async proposal adds something to UPC that violates the memory model and then hides the fact that the memory model is being violated by declaring that otherwise legal programs that could observe the violation are now illegal. For example, normally it is legal for two threads to modify the same data from opposite sides of a barrier and I could use this legal behavior to detect the async memory model violation, but instead it is declared that if there is an unsynchronized async to this data, then my program is illegal; i.e., even if I can run my program and demonstrate the memory model violation, the evidence is inadmissible.

    I don't think that this approach is valid for extending UPC (at least not in the backwards compatible manner that we want for UPC 1.3) because it could break the intent of existing code by removing the only mechanism that the programmer has to ensure that there is no ongoing communication: upc_barrier. If I have a collective function in an existing library, I may have used a upc_barrier upon entry and exit to ensure that I can do what I want with any memory in between. Currently this is a foolproof way of guarding against what comes before and after my collective library function and the only burden on my client is to call the function in a collective context. With asyncs added, my barriers no longer offer complete protection and the burden shifts to the library client to ensure that any asyncs touching the data do not cross a call to this function somewhere in their call graph.

    I can see an argument that the library code is still legal and client code just needs to be more careful with the new language feature, but I don't think it's a very nice thing to do to people in a 1.2 -> 1.3 change because it essentially changes the contracts of existing functions. The contract here changes from "I promise to call this function in a collective context" to "I promise to call this function in a collective context and further promise not to be asynchronously touching any memory that the function may touch." This change is particularly awkward if the client doesn't have complete knowledge of all memory locations that the function may touch. ```

    Reported by `johnson.troy.a` on 2012-08-08 16:42:55
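
    A sketch of the library pattern described in the last two paragraphs; the function and array are illustrative:

    ```
    #include <upc.h>

    shared int scratch[THREADS];

    /* Under UPC 1.2 the entry/exit barriers guarantee no communication is
       in flight while the body runs. An unsynced async whose transfer
       interval spans this call would silently void that guarantee. */
    void lib_collective_op(void) {
        upc_barrier;                   /* entry: quiesce all prior accesses  */
        scratch[MYTHREAD] = MYTHREAD;  /* free to touch any shared memory    */
        upc_barrier;                   /* exit: results visible to all peers */
    }
    ```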

  45. Former user Account Deleted

    ``` I have four major concerns with allowing the routines to continue past fences. The first two are philosophical, while the final two are potential future problems I see as an implementer.

    1. Allowing it adds restrictions on threads other than the calling thread. This is counter-intuitive, at least to me, as the one-sided model implies threads are independent outside of explicit inter-thread synchronization. If the routines are synced by fences, other threads are not impacted by a thread's use of these routines at all.

    2. The existing memory model is difficult to understand, but complete. With this change, the memory model is no longer complete, as we've introduced a relaxed access with special rules that aren't reflected by the memory model. We can (and did) go back and forth all day about whether or not this breaks the memory model, but it certainly complicates the task of trying to understand it.

    3. Violations of the access rules are relatively easy to detect on the calling thread, either through static analysis or fairly cheap run-time checks. Detecting violations on other threads is a much more difficult problem, as every thread must be aware of every other thread's non-blocking operations. This will make debugging extremely difficult.

    4. I think this will eventually create a de facto memory model for the "illegal" codes, which like it or not, users will end up writing. They'll find that the undefined results are acceptable on one implementation, and then other implementations will have to provide the same behavior for compatibility when the users port their code. Since this could have very significant performance (not to mention implementation design) implications, I'd much prefer to hammer this out ahead of time rather than be stuck with a de facto definition that hamstrings us later.

    Additionally, I still don't see a motivating need for allowing these to pass fences. While Dan's vague "what-if" scenario could indeed cause problems, I'm having trouble coming up with a specific situation that it would apply to (ignoring signalling puts/gets, which we've agreed to handle separately). Could someone give a more concrete example where this functionality would be required? Without some way of addressing the concerns I listed above, I don't think we should be adding this to the spec unless we have a specific use-case in mind--one that can't be done any other way. Undefined behavior should be a last resort for specification writers, particularly when the trigger is so hard to detect.

    "I don't think we should be debating formal semantics by analogy."

    Agreed. I just put it in there to lighten up the conversation after I didn't understand Paul's analogy. That said, yours was pretty good, though highways generally have actual physical barriers preventing city traffic from interacting with the highway traffic. ```

    Reported by `sdvormwa@cray.com` on 2012-08-08 16:51:01

  46. Former user Account Deleted

    ``` I have been asked to contribute an opinion here. It is a long thread, and a passionate one. Of the several possibilities discussed, I extracted two that seemed reasonable.

    1) asynchronous memory operations have local completion semantics (i.e. waiting for an async memput only guarantees that the send buffer is reusable after a put). Fences operate on asynchronous operations just like on "normal" blocking ones.

    2) asynchronous memory operations have global completion semantics (i.e. waiting for an async memput guarantees that it is now remotely complete as well). Fences do not operate on asynchronous operations - indeed, there is no point since just waiting already does the job.

    There were others half-mentioned (or maybe I misunderstood the heated dialogue) - like remote memory ops that don't fence at all - that is, we *never* know whether they have ever remotely completed. I will not consider such scenarios.

    I prefer (1) over (2) (which puts me in cahoots w/ the Cray guys rather than Dan, I think). Here is why: because the (1) semantics is unsurprising. It is in line with what I already know about UPC - that relaxed writes have local completion semantics - I only know that send buffers can be reused when the write returns. (1) is *also* in line with MPI and shmem, to the best of my understanding - this may not be an argument for you, but sure is for me.

    I'm not sure what you will say about processor ordering of asynchronous puts to the same remote address. I would love it if you could make yourself say that the *start* of the operation is what determines order - not when you wait for completion. This, again, would be unsurprising. It can be implemented with some effort - I will claim that the effort is the same as we are already making to order blocking puts on an unordered network.

    You spent a lot of time talking about fences and their interaction with half-finished asynchronous operations. This seems like a red herring to me - if you are a crazy enough programmer to use non-blocking operations - need I elaborate on the perils of non-blocking remote operations? - well, in that case making sure that there are no pesky strict accesses, barriers and so forth between the op start and the wait should be child's play.

    If you end up going for (2), it's still kind of OK ... it's different, but still has a kind of internal consistency. Fences would simply ignore non-blocking operations. You would order remote puts w.r.t each other based on when you wait for them - not when you start them. You could order remote puts w.r.t. normal blocking puts by employing strategic fences (although you'd be kissing goodbye to performance if you did that). It's serviceable ... but personally I don't really like it; it's a much larger change relative to what UPC users are used to in terms of ordering and fences.

    My $0.02 in 1966 issue pennies ... if you have to flame me, do it gently.

    ```

    Reported by `ga10502` on 2012-08-10 02:44:16
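
    A sketch contrasting the two options above, assuming the upc_memput_nb/upc_waitsync spellings from the drafts under discussion; the code is identical under both options, only the guarantee at each line differs:

    ```
    #include <stddef.h>
    #include <upc_relaxed.h>
    #include <upc_nb.h>   /* proposed header */

    void put_once(shared char *remote_dst, const char *local_buf, size_t n) {
        upc_handle_t h = upc_memput_nb(remote_dst, local_buf, n);
        upc_waitsync(h);  /* (1): local_buf reusable, data may still be in flight
                             (2): the put is already remotely complete           */
        upc_fence;        /* (1): completes the put like any relaxed write
                             (2): ignores the in-flight async; redundant here    */
    }
    ```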

  47. Former user Account Deleted

    ``` I'm glad to hear we're making some progress on teasing out the "signalling-put" usage scenario, and perhaps agreeing that it deserves a separate, dedicated library call, to avoid conflating it with the desire for pure asynchronous data movement free from inter-thread synchronization. If we at least have consensus on that goal, then perhaps we can make progress on both extensions independently.

    However, I'm very dismayed the Cray team is insisting on continuing to claim the Berkeley semantics somehow "violate" or "break" the memory model. This claim is totally and completely FALSE. Repetition of this unproven and patently false claim only demonstrates a lack of understanding of the Berkeley semantics, the memory model, or perhaps both, and is not conducive to productive discussion. I sincerely HOPE there is an honest lack of understanding here, and that this is not some kind of underhanded negotiation tactic attempting to discredit an opposing view with disparaging emotional claims. In any case I hope we can drop this false preconception and unsubstantiated mode of argument and have an open-minded discussion of the factual issues in precise technical terms, devoid of sloppy generalizations. This will be my last textual attempt to try and describe how the Berkeley semantics fit into the memory model. If someone is still not "getting it" then we should table the memory model discussion until the call. I have limited internet access for the next week until the call anyhow.

    The UPC memory model is not some touchy-feely generalization or hand-wavy set of properties, nor is it anything like a virtual machine abstraction that one might sketch on the standard UPC shared memory diagram. It is a formal mathematical model that defines consistent execution traces of shared read/write accesses in valid programs. The conditions for validity are very explicitly and mathematically defined, and the guaranteed properties of strict accesses (and by extension fences) are corollaries that follow from the formal definition. The memory model is not automatically "violated" simply because it exhibits a property that someone may find surprising or counter-intuitive, nor because it may allow behaviors that have no analogue on one's favorite architecture under naive translation. This is not a weakness in the definition, it reflects an inherent complexity of relaxed memory models, which are by nature subtle and often surprising or counter-intuitive.

    Now all that being said, we all obviously strive to present users with language and library tools that neatly wrap up useful functionality with an interface that is as intuitive and unsurprising as possible, while still efficiently providing the desired functionality. However, programmers who insist upon writing invalid or perverse programs will end up with undefined or unintuitive behaviors, respectively. We should not cripple the performance or utility of our expected target usage scenarios in order to better accommodate these outliers. Parallel programming is hard, and C is a low-level language - programmers need considerable sophistication to use UPC correctly and effectively, and that goes double for "advanced" features like explicitly asynchronous data movement within a relaxed memory model. Debugging tools and automated error detection are valuable tools, but ultimately we are dealing with a non-type-safe HPC language with numerous "pointy edges", and the feasibility of automated correctness checking should not take precedence over functionality and performance concerns in evaluating new language features.

    A VERY important and deliberate property of the memory model is that it does not make any guarantees about the possible results of operations that did not occur. Stated another way, if the execution trace did not directly "observe" a violation of the model, then the execution is consistent, regardless of what tricks the implementation may be playing under the covers (whose effects were not observed by any application read operations).

    The Berkeley semantics for async are that it leads to UNDEFINED BEHAVIOR for any application thread to modify the source buffer or in any way access the destination buffers during the transfer interval, between the library initiation call and the successful return of library sync call. A program that "cheats" and touches these buffers in the forbidden interval is an INVALID UPC program, and the memory model does not provide ANY guarantees whatsoever for an invalid program. Valid UPC programs Shall Not touch those buffers within the transfer interval, and this property makes it IMPOSSIBLE for them to observe exactly how those buffers were accessed by the library, and how those accesses may or may not have been affected by other non-related fences or synchronization constructs. Because all executions of valid programs are prohibited from observing any violations, by definition the memory model is preserved and the executions are "UPC Consistent". This is the essence of how the memory model works - if VALID programs cannot tell the difference, then the model is not violated.

    Troy wrote: "I don't think that this approach is valid for extending UPC (at least not in the backwards compatible manner that we want for UPC 1.3) because it could break the intent of existing code by removing the only mechanism that the programmer has to ensure that there is no ongoing communication: upc_barrier. If I have a collective function in an existing library, I may have used a upc_barrier upon entry and exit to ensure that I can do what I want with any memory in between. Currently this is a foolproof way of guarding against what comes before and after my collective library function and the only burden on my client is to call the function in a collective context. With asyncs added, my barriers no longer offer complete protection and the burden shifts to the library client to ensure that any asyncs touching the data do not cross a call to this function somewhere in their call graph."

    I agree the property you mention is CURRENTLY true of pure UPC 1.2 programs as a consequence of combining the memory model with the highly restricted set of communication operations available in 1.2. It's not a direct property of the memory model; it just happens to be true based on what you can currently express in the 1.2 language. The property is notably NOT preserved once you link in anything outside UPC, including MPI operations and even asynchronous file I/O, both of which are commonplace in enterprise applications. It's debatable whether that property was originally intentional or even if it's a desirable high priority moving forward. In any case the fact that admitting the Berkeley semantics into the 1.3 toolbox can remove this guarantee does not constitute a violation of the memory model, or any officially stated corollary thereof. It's not a backwards compatibility issue because legacy programs don't use the new features and therefore remain unaffected. Programmers who add calls to the async library are burdened with ensuring the buffers remain untouched throughout the transfer interval - this is precisely the additional complexity that sophisticated users are accepting when they choose to use any async library. If users cannot figure out how to leave the buffers untouched during the transfer interval, then they have no business using an asynchronous library.

    However, regarding this point, I think it's worth separating discussion of the property no-asyncs-crossing-barriers (mentioned above) from no-asyncs-crossing-fences (which includes barriers as a special case, but MANY more operations as well). We could conceivably decide to require the init and sync call to appear in the same barrier phase (and this could be easily checked in a debug implementation), independently of what we decide regarding fences. My inclination is that explicitly async transfer intervals should be permitted to span either one (for reasons of composability), but I would like some user feedback about how important Troy's property is to them.

    ga10502 said: "1) asynchronous memory operations have local completion semantics (i.e. waiting for an async memput only guarantees that the send buffer is reusable after a put). Fences operate on asynchronous operations just like on "normal" blocking ones.

    2) asynchronous memory operations have global completion semantics (i.e. waiting for an async memput guarantees that it is now remotely complete as well). Fences do not operate on asynchronous operations - indeed, there is no point since just waiting already does the job."

    We all appreciate your input, although unfortunately I don't think either of your "summaries" above corresponds to the possibilities we're currently discussing. The Berkeley semantics are very *close* to option #1, except the fences you mention are only relevant to the async operation when they occur outside the transfer interval, and fences during the transfer interval are not relevant to the completion of the async. Option #2 is close to the Cray proposal of several weeks ago, but I think consensus is moving towards loosening those built-in fences and instead introducing a dedicated signalling-put library to meet the need it was trying to address. Both sides should probably summarize the semantics they are currently proposing - I'll post a followup shortly that does this.

    "I'm not sure what you will say about processor ordering of asynchronous puts to the same remote address. I would love it if you could make yourself say that the *start* of the operation is what determines order - not when you wait for completion. This, again, would be unsurprising. It can be implemented with some effort - I will claim that the effort is the same as we are already making to order blocking puts on an unordered network."

    This "effort" you mention is one of the primary performance bottlenecks we are trying to alleviate by introducing this library. The programmer asserts there are no conflicting accesses to the source or destination buffers for the entire period delimited by the init/sync calls (the "transfer interval"), and the implementation does a better job of communication overlap as a result. If you need a stronger ordering guarantee, the existing upc_mem* libraries already provide that semantic. Burdening the async library with such a semantic would make it nearly indistinguishable from the existing calls and defeat the purpose of introducing a new library.

    "You spent a lot of time talking about fences and their interaction with half-finished asynchronous operations. This seems like a red herring to me - if you are a crazy enough programmer to use non-blocking operations - need I elaborate on the perils of non-blocking remote operations? - well, in that case making sure that there are no pesky strict accesses, barriers and so forth between the op start and the wait should be child's play."

    Your statement is a very strong argument for Berkeley's view regarding the behavior of fences and other strict ops that appear in the transfer interval. This is the largest point of remaining contention. However, let me clarify that neither proposal is currently *prohibiting* fences in the transfer interval - the disagreement centers around what effect they have on the async operation in progress. If you never place a fence or strict op in the transfer interval, then the difference is irrelevant and both sides agree on behavior. However, I think both sides want to allow strict ops during the transfer interval for reasons of overlap and composability - ie so that UNRELATED computation and communication can continue.

    ```

    Reported by `danbonachea` on 2012-08-10 16:10:41

  48. Former user Account Deleted

    ``` In preparation for the upcoming telecon discussion, I'm splitting out the key sections of the old Berkeley proposal, which included a number of additional functions that are not currently under consideration for inclusion. Both Cray and Berkeley proposals MOSTLY agree on the use of handle types (and the different flavors of handle-based synchronization), and the flavors of memput/memget that should be provided - so to focus on the important issues and avoid tedious duplication I'll show a single exemplar function. This text obviously still needs a lot of work in translation to formal "spec-ese", but for now I'm tending towards verbosity in the interests of clarity:

    #include <upc_nb.h>

    upc_handle_t upc_memcpy_nb(shared void * restrict dst,
                               shared const void * restrict src,
                               size_t n);

    Semantics
    ---------
    1 This operation has the same data movement effect as the corresponding upc_memcpy call defined in the UPC Language Specification section 7.2.5, except it is split-phase.
    2 The specified operation is initiated with a call to the above function, which returns an explicit handle representing the operation in-flight.
    3 The operation remains "in-flight" until after a successful call to upc_waitsync or upc_trysync on the returned handle.
    4 The contents of all affected destination memory are undefined while the operation is in-flight, and if the contents of any source memory change while the operation is in-flight, the result is undefined.
    5 The order in which non-blocking operations complete is intentionally unspecified - the system is free to coalesce and/or reorder non-blocking operations with respect to other blocking or non-blocking operations, or operations initiated from a separate thread - the only ordering constraints that must be satisfied are those explicitly enforced using the synchronization functions (i.e. the accesses comprising the non-blocking operation are only guaranteed to be issued somewhere in the interval between initiation and successful synchronization on that operation).
    6 The effect on conflicting accesses OUTSIDE the transfer interval is AS-IF the transfer were performed as a set of relaxed shared reads and relaxed shared writes of unspecified size and order, issued at an unspecified time within the transfer interval by the calling thread. Conflicting accesses INSIDE the transfer interval are prohibited by semantic 4. "INSIDE" and "OUTSIDE" are defined by the Precedes() program order for accesses from the calling thread; accesses from other threads are considered "INSIDE" unless every possible and valid <_strict relationship orders them before the init call or after the sync call.

    Key properties of the Berkeley async semantics:
    -----------------------------------------------
    1. The program is prohibited from changing the source buffer between the initiation call and the return of a successful sync call (the "transfer interval").
    2. Similarly, the contents of the destination buffer are UNDEFINED during the entire transfer interval (until a successful sync call returns).
    3. Fences, barriers, and other operations (to unrelated memory locations) are fully permitted during the transfer interval, but the transfer interval is unaffected and extends until the matching sync call (stated differently, the operation that was explicitly initiated remains "in-flight" until the program explicitly calls the sync operation for it).
    4. Because the contents of destination memory remain UNDEFINED until the return of a sync call, it is impossible for any valid program to "observe" any effect on the destination buffers caused by any fences or strict operations issued during the transfer interval.
    5. After a successful sync call returns, subsequent reads/writes issued by the thread that invoked the transfer are immediately guaranteed to observe the effects of the transfer without any fences. This is exactly the same semantic guarantee as the existing upc_mem* calls.
    6. There are no strict operations, fences or "global completion" guarantees built into the library calls. The library performs pure data movement, not inter-thread synchronization. If such fences or inter-thread synchronization are required (eg to guarantee other threads see the effects of the transfer), they can be added around the library calls by the program using existing UPC mechanisms. These are exactly the same consistency guarantees as the existing upc_mem* calls, and the "best practice" documented in B.3.2.1.
    7. The formal semantics of the library are defined in terms of minimally-constrained relaxed operations, exactly analogous to the existing upc_mem* library, as described in spec v1.2 section B.3.2.1. The size, order and temporal position of these operations within the transfer interval are deliberately unspecified. However, due to properties #1, #2 and #4, such details cannot be observed by any valid program anyhow.
    8. Because of #6 and #7, the expression wait_sync(upc_mem*_async()) is semantically identical to the blocking version upc_mem*(), and is expected to exhibit a very similar performance profile in high-quality implementations.

    Notable differences from the Cray proposal:
    -------------------------------------------
    1. In Berkeley asyncs the library transfer interval does not end when a thread issues a fence or other strict operation (possibly from within an unrelated library call or other callee). This enables deeper communication overlap in large, multi-module/multi-author applications.
    2. Berkeley asyncs do not assume the application always wants to couple thread synchronization with data transfer. In this way they are more primitive and also potentially lighter-weight, and designed to efficiently apply to a broader set of usage cases. They enable applications to coalesce synchronization overheads for multiple data transfers (eg think bulk-synchronous HALO exchange), rather than incurring thread synchronization costs on a per-operation basis.
    3. The Berkeley asyncs are semantically very close to the existing upc_mem* library. They intentionally add the minimal possible semantics to enable explicitly asynchronous data transfer, and remain orthogonal to issues of thread synchronization. The memory model formalism used to describe the upc_mem* library also applies to the Berkeley asyncs.

    ```

    Reported by `danbonachea` on 2012-08-10 16:13:33
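
    For concreteness, a minimal usage sketch of the exemplar function above; the header name, upc_memcpy_nb, and upc_waitsync follow the draft text, while the overlapped computation is an illustrative stand-in:

    ```
    #include <upc_relaxed.h>
    #include <upc_nb.h>

    shared double src[THREADS];
    shared double dst[THREADS];

    extern void unrelated_compute(void);  /* hypothetical; touches neither buffer */

    void step(void) {
        /* initiate: the dst element is undefined until the sync returns */
        upc_handle_t h = upc_memcpy_nb(&dst[MYTHREAD],
                                       &src[(MYTHREAD + 1) % THREADS],
                                       sizeof(double));
        unrelated_compute();  /* overlap; under the Berkeley semantics any
                                 fences in here leave h in flight */
        upc_waitsync(h);      /* transfer interval ends; dst is defined */
    }
    ```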

  49. Former user Account Deleted

    ``` "However, I'm very dismayed the Cray team is insisting on continuing to claim the Berkeley semantics somehow "violate" or "break" the memory model."

    Let UPC 1.2 have memory model M. I claim that the BUPC async proposal, by adding things to the language that are immune to certain aspects of M, like fences, creates memory model M'. The BUPC async proposal is fully compatible with M'.

    I believe that BUPC considers M == M' so that when we say that the BUPC asyncs violate M, it is interpreted as nonsense, since M' was constructed for the BUPC asyncs and they are compatible with it. I do not believe that M == M' because if I try to reason about a program using only M and that program contains BUPC asyncs, then I can come to incorrect conclusions about my program's behavior.

    I understand your point that the memory model "is not some touchy-feely generalization or hand-wavy set of properties," but based on responses to other Issues on this site, the UPC 1.3 memory model should remain equivalent to the UPC 1.2 memory model for backward compatibility and to avoid confusing users. We might be quibbling over what "equivalent" means. I think you are saying that the "formal mathematical model" in the appendix has not changed and the new stuff that the BUPC async proposal would add is declared to be independent, and therefore M == M'. I'm saying that I can't necessarily apply M and M' to the same program and reach the same conclusions; therefore, to me, as someone trying to apply the memory model as a tool for program interpretation, M != M'.

    The original Cray proposal and the newer consensus proposal stay completely within M. The half-fence idea is already part of M because upc_fence already has two roles in blocking movement of relaxed accesses; there is simply no user-accessible syntax for using one of its two roles without the other. In this instance, we needed only one of the two existing roles, similar to the way that upc_barrier can be divided into upc_notify and upc_wait.

    Hopefully that clears up what I mean by "break" or "violate." ```

    Reported by `johnson.troy.a` on 2012-08-10 17:21:26

  50. Former user Account Deleted

    ``` " I do not believe that M == M' because if I try to reason about a program using only M and that program contains BUPC asyncs, then I can come to incorrect conclusions about my program's behavior."

    The memory model has not "changed" in any way. The Berkeley asyncs do not propose to change it at all. What is changing is the set of OPERATIONS that you can express in the language, and this is true of both proposals. Furthermore, the "reasoning" properties you refer to remain true for all VALID UPC programs (ie those that do not break the library interface and access undefined values). The memory model only defines behaviors for VALID UPC programs and that remains true as well.

    Pure 1.2 programs do not use the new libraries at all, so trivially meet the validity requirement of the library. 1.3 programs that add calls to the new libraries must follow the rules of the interface and not touch the buffers during the transfer interval, and any properties still remain true. The only way to observe the differences you claim would be through an INVALID program (ie one that touches buffers in the forbidden interval). I challenge you to write a VALID program using the Berkeley library semantics that breaks any property of interest. We're not interested in the behavior of INVALID programs, ie those that access undefined values or break rules which lead to undefined results, because therein lies the road to madness.

    Where the two proposals differ is the point at which the transfer interval ends, ie the extent of the undefined buffers. Berkeley states it unconditionally extends from the explicit init to the explicit sync. Cray states it can conditionally end earlier if the calling thread performs a strict operation. This difference is a philosophical one and really has NOTHING to do with the formal memory model at all - it's simply the declared boundary of the asynchronicity of the library we're ADDING. Either option can fit perfectly inside the existing memory model without changes. We should focus our discussion on the user impact of this difference and the pros and cons of either way, instead of wasting time inventing falsehoods about how either way "breaks" the memory model.

    "The half-fence idea is already part of M because upc_fence already has two roles in blocking movement of relaxed accesses; there is simply no user-accessible syntax for using one of its two roles without the other. "

    I encourage you to carefully re-read the formal memory model. The terms "half-fence", "local completion" and "global completion" do not appear anywhere in the entire 1.2 spec. Any library spec that wishes to use such terms to specify consistency semantics would need to introduce FORMAL mathematical definitions that fit within the existing formalism.

    I'd also remind you that consistency semantics of upc_fence are clearly defined in B.5.3.1: "A upc fence statement implies a strict write followed by a strict read: SW(l_synch, 0) ; SR(l_synch, 0)" The "half-fence" you're looking for is probably just a strict write: SW(l_synch, 0) but since it sounds like we're probably dropping that semantic anyhow it's not really relevant anymore.

    ```

    Reported by `danbonachea` on 2012-08-10 20:42:36

  51. Former user Account Deleted

    ``` I think we should just wait for the conference call now. :) Neither one of us is convincing the other and I think that we'll need to have a majority of non-BUPC, non-Cray folks help solve this issue, preferably with input on what users want. ```

    Reported by `johnson.troy.a` on 2012-08-10 20:53:01

  52. Former user Account Deleted

    ``` "I think we should just wait for the conference call now. :) Neither one of us is convincing the other and I think that we'll need to have a majority of non-BUPC, non-Cray folks help solve this issue, preferably with input on what users want."

    I agree - as I mentioned earlier this textual discussion doesn't seem to be progressing us towards resolving anything. I was just hoping to narrow the discussion points to the precise technical differences so we can focus on resolving what actually matters.

    In preparation for the call, I'd like everyone to consider the following program:

    1:  shared int p = 0;    /* assume no other threads are touching these locations */
    2:  shared int q = -10;
    3:  ...
    4:  if (MYTHREAD == 2) {
    5:  #ifdef UPC_NB_MEM_
    6:    upc_handle_t h = upc_memcpy_async(&p,&q,sizeof(int)); /* start an async copy that eventually writes -10 to p */
    7:  #endif
    8:    p = 20;
    9:    int l_p = p;
    10:   printf("%i",l_p);
    11: }

    Under UPC 1.2 (without UPC_NB_MEM_), I can state the following property concerning lines 8-9: "Because no other thread is touching p in this synchronization phase, and this thread performs a relaxed write of the value 20 to p on line 8, then the immediately subsequent relaxed read of p on line 9 from this thread is guaranteed to return 20.". More concisely and intuitively, "No other thread is touching p, so when I write p and then immediately read p I get back the value I wrote".

    When you enable 1.3 UPC_NB_MEM_ and add the EARLIER call to upc_memcpy_async on line 6, this property is no longer true. The spec allows implementations which sometimes print 20, sometimes print -10, and sometimes print something completely different. This is because the transfer interval for the unsynced library call has made the contents of p UNDEFINED, and this program is writing and reading that buffer while it has an undefined value, leading to undefined behavior. This is an incorrect use of the library, and it has invalidated a very simple property derivable from the 1.2 memory model regarding accesses elsewhere in the program.

    Note moreover that everything stated about the program above is true for BOTH the Berkeley and Cray libraries. Under BOTH semantics this program is reading undefined values by the addition of the earlier library call and now has undefined behavior. This is an example of how the addition of the library features proposed by Berkeley OR Cray can invalidate simple reasoning about the memory model in programs that use the new libraries incorrectly. This doesn't mean the memory model is "broken", it just means there is a new way to write programs with undefined behavior that the memory model does not govern. Programmers who embark upon using a new library feature are burdened with using it CORRECTLY, otherwise they get undefined behavior and commonly derived properties of the memory model may not hold true.

    ```

    Reported by `danbonachea` on 2012-08-10 23:01:45
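
    For contrast, a sketch of a valid use of the same operations, continuing the program above with its declarations of p and q (the sync spelling follows the exemplar interface posted earlier): syncing the handle before touching p restores the 1.2 reasoning under either proposal.

    ```
    upc_handle_t h = upc_memcpy_async(&p, &q, sizeof(int));
    upc_waitsync(h);   /* transfer interval ends; p is defined again (== -10) */
    p = 20;            /* no transfer in flight, so the same-address rule     */
    int l_p = p;       /* guarantees this read returns 20                     */
    printf("%i", l_p);
    ```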

  53. Former user Account Deleted

    ``` The following is NOT intended to pour gasoline on any fires.

    I just wanted to take a moment to restate a key point of the Berkeley proposal that might help reduce confusion. In particular, we keep stating that the Berkeley proposal does not "break", "modify" or "circumvent" the existing (Appendix B) memory model. Yet so far the explanations we have offered have not been effective at convincing Troy or Steven of that assertion. So, I want to try again (twice in fact). I hope this will help.

    To us (at least Dan and me) it is natural to define an "asynchronous" transfer as taking place at an UNdefined point in time (or more than one) between the init and sync calls. This is what naturally happens when the init call queues work to the network and the sync call blocks for completion of that work. Making that UNdefined nature of the transfer part of the specified semantics of a UPC-level async interface leads "naturally" to a situation (described in two ways below) in which an implementation can legally perform the async transfer entirely without interaction with strict accesses which take place in the transfer interval. So, we believe that this is in no way an "unnatural" behavior, even if it may be "unintuitive" to some/many.

    Note that the following are just two approaches to explaining the SAME idea, not two alternative/competing ideas.

    EXPLANATION #1:

    The Berkeley async data movement functions do NOT declare themselves as having a special exemption from the memory model's requirement that a strict operation (fences included) will order all preceding relaxed operations. However, the semantics are constructed such that an implementation may LEGALLY perform the transfers without interaction with strict references, as follows...

    What they DO is define a "transfer interval" which extends from the start of the init call to the end of the sync call, and then define the transfer as being performed by an UNDEFINED sequence of (one or more) relaxed operations which may take place at any time (and in any order) for the duration of that interval.

    BECAUSE the time(s) at which the relaxed operations occur is NOT defined, there is no way to "prove" that any portion of the transfer is the result of a relaxed operation which precedes the strict operation. Therefore, an ordering in which all of the relaxed operations FOLLOW the strict operation (which itself must PRECEDE the sync call, or we'd not be concerned about it in this discussion) is "admissible" (for lack of a better word) under the CURRENT memory model. The constraint that the destination memory is undefined during the interval (just as with an MPI_Irecv, FWIW) prohibits a correct program from performing reads which could observe any ordering other than one in which the transfer occurs entirely after the strict operation(s).

    EXPLANATION #2:

    1) The specified semantics allow a "move-all-data-at-sync" implementation (of course they do not require it) in which the ENTIRE relaxed transfer takes place IN the sync function and thus FOLLOWS any strict operations executed WITHIN the transfer interval.

    2) No legal UPC program (meaning it obeys the specification's prohibition against reading the destination or writing the source) may OBSERVE the actual order/timing in which the data is written to the destination, NOR may it modify the source data (for instance to "sneak" information about the timing of the transfer into the destination memory).

    3) Therefore, an implementation is permitted to transfer the data without "forced completion" by strict operations which occur in the transfer interval because the only OBSERVABLE (by legal programs) execution is indistinguishable from move-all-data-at-sync (in which the data movement FOLLOWS the strict operations).

    ```

    Reported by `phhargrove@lbl.gov` on 2012-08-11 01:02:39
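
    A minimal sketch of the interval semantics described in the two explanations above, again assuming the upc_memcpy_nb/upc_sync spellings, with a[] an ordinary relaxed shared array: the fence orders the normal relaxed writes, but no legal program can observe whether the library's transfer happened before or after it.

    ```
    upc_handle_t h = upc_memcpy_nb( &dst[0], &src[0], nbytes );  /* interval opens */
    a[0] = 1;       /* ordinary relaxed write, ordered by the fence below    */
    upc_fence;      /* orders a[0] vs. a[1]; proves nothing about dst/src    */
    a[1] = 2;
    upc_sync( h );  /* interval closes: dst defined, src safe to modify      */
    ```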

  54. Former user Account Deleted

    ``` As another interesting data point, the UPC I/O library in spec 1.2 section 7.4.7 already includes asynchronous modification of memory, and is the closest example of explicitly asynchronous library behavior in the current spec:

    "For asynchronous read operations, the contents of the destination memory are undefined until after a successful upc_all_fwait_async or upc_all_ftest_async on the file handle. For asynchronous write operations, the source memory may not be safely modified until after a successful upc_all_fwait_async or upc_all_ftest_async on the file handle."

    The behavior of the async I/O library is consistent with the Berkeley approach, in that the transfer interval and undefined buffers ends at the library sync call, not at earlier arbitrary fences or other strict operations. Using async I/O in UPC 1.2 it is ALREADY possible to construct scenarios where even a upc_barrier does not "quiesce" all in-flight changes to memory, and a library continues to modify destination buffers after a fence or barrier.

    ```

    Reported by `danbonachea` on 2012-08-11 02:09:33
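
    A hedged sketch of the UPC-IO pattern being cited; only upc_all_fwait_async is named in the quoted spec text, so treat upc_all_fread_local_async and its argument list as assumptions made purely for illustration:

    ```
    /* fd is an open upc_file_t*; flags is a upc_flag_t sync mode (details omitted) */
    upc_all_fread_local_async( fd, buffer, sizeof(int), nelems, flags );
    upc_barrier;                  /* does NOT make buffer defined                 */
    upc_all_fwait_async( fd );    /* per 7.4.7, only this ends the undefined span */
    /* buffer contents are defined only from this point on */
    ```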

  55. Former user Account Deleted

    ``` RE Comment 57:

    Hi Paul. I think Dan previously made the argument in your Explanation #1. Here's my problem with it...

    The existing upc_mem* functions are already collections of relaxed accesses that can issue in any order between two fences; however, to obey the same-address ordering rule, a upc_mem* implementation generally must wait until the accesses fully complete before returning. It would not be correct for a UPC compiler to reorder a upc_mem* call with respect to a upc_fence, either before the call or after the call:

    upc_memput( ... );   upc_fence;           --x-->   upc_fence;           upc_memput( ... );

    upc_fence;           upc_memput( ... );   --x-->   upc_memput( ... );   upc_fence;

    The major utility of a upc_mem*_nb function is that it can start issuing accesses and not wait for them to complete before returning, allowing both overlapping communication and overlapping communication with computation. Ideally the accesses issue as soon as possible, but it would NOT be correct for a compiler to issue them before the previous fence:

    h = upc_memput_nb( ... );            upc_fence;
    upc_fence;                 --x-->    h = upc_memput_nb( ... );
    sync( h );                           sync( h );

    I haven't seen anyone argue that the above is legal using the BUPC async functions, and that's good.

    Now, going the other direction, it also should not be legal to allow the accesses to issue AFTER the next fence. BUPC claims that for the async functions the accesses could all issue after the fence, or even all issue at the sync. This situation is as if the _nb call was made after the fence:

    h = upc_memput_nb( ... );            upc_fence;
    upc_fence;                 --?-->    h = upc_memput_nb( ... );
    sync( h );                           sync( h );

    For this transformation to be legal with an async call when it is not legal with the existing upc_mem* calls is truly bizarre. If I'm an application programmer and I want to incrementally introduce greater overlap, I'd probably start by doing this:

    upc_memput( ... );   -->   h = upc_memput_nb( ... );
                               sync( h );

    And then move the upc_memput_nb call upward and move the sync call downward in the program as far as I could. Fences are brick walls here and that's fine -- that is to be expected. (In the original Cray proposal, if the sync made it all the way to the fence, you could remove it.) Then, after doing that, I might get a bit fancier and start software pipelining, which is where we get to the idea of providing an lsync as well as a gsync so that buffers can be reused and completion can be guaranteed for individual copies.

    To summarize, an asynchronous upc_mem call plus a sync (lsync) that doesn't behave like its synchronous counterpart would be severely unintuitive and not what we want to offer our users. ```

    Reported by `johnson.troy.a` on 2012-08-11 04:07:30
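
    A sketch of the incremental refactoring Troy outlines above, with upc_memput_nb/sync as stand-in spellings from this thread and unrelated_computation() as a hypothetical placeholder:

    ```
    /* Step 0: original blocking call. */
    upc_memput( dst, src, nbytes );

    /* Step 1: mechanical split -- intended to behave identically. */
    upc_handle_t h = upc_memput_nb( dst, src, nbytes );
    sync( h );

    /* Step 2: hoist the init up and sink the sync down past unrelated
     * work to gain overlap; under the Cray rules, a fence is a brick
     * wall that neither the init nor the sync may cross. */
    h = upc_memput_nb( dst, src, nbytes );
    unrelated_computation();    /* must not touch dst or src */
    sync( h );
    ```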

  56. Former user Account Deleted

    ``` "The major utility of a upc_mem*_nb function is that it can start issuing accesses and not wait for them to complete before returning, "

    No, that's only half the benefit. It also allows the implementation to delay issue based on network conditions. RDMA systems usually have a limited-depth queue for network injection, and it can be advantageous under contention to post operations to a software queue which is serviced later as the network drains. Even if the hardware accepts the operation immediately, an RMA offload engine may continue to read source buffers after the init returns, resulting in injections which are "issued" after init returns and other client code has run. In extreme cases of contention, injection might not even occur until within the sync call, although this is hopefully not a common case as it would defeat overlap for that operation.

    I think this may be a core point of philosophical confusion - when we say the relaxed operations may be issued ANYWHERE in the transfer interval, we really mean that. This is not some semantic dodge, we really envision platforms that will exhibit this behavior. Cray needs to loosen its preconception that all operations are issued before the init call returns, as this is NOT a requirement we want to add - doing so would severely inhibit the overlap potential of this library on systems of interest.

    "For this transformation to be legal with an async call when it is not legal with the existing upc_mem* calls is truly bizarre."

    The existing library calls are monolithic and inseparable. The new library calls are asynchronous and require two related calls, init and sync. Asynchronous calls introduce the concept of a transfer interval, which have no analogue in the blocking call. The behavior of this transfer interval is what we're discussing, so the "behavior" of a non-existent transfer interval in the blocking calls is irrelevant.

    "asynchronous upc_mem call plus a sync (lsync) that doesn't behave like its synchronous counterpart would be severely unintuitive and not what we want to offer our users."

    If they "behave" identically, then there's no point in introducing a new library. The new library adds the ability to express important semantic loosening which an advanced user can exploit to achieve overlap and improved performance. This is not a UPC 101 feature - this is a library for advanced users who want to maximize the performance of their app and are willing to learn the slightly more complicated rules for using them correctly. Perhaps Cray's goal in this regard is different.

    "want to incrementally introduce greater overlap..move the upc_memput_nb call upward and move the sync call downward in the program as far as I could. Fences are brick walls"

    You're looking at this as a compiler writer performing an automatic transformation based on static analysis. I agree that a static optimization which attempts to move the init and sync calls would have to stop movement at fences, without a great deal more parallel analysis. But this interface is NOT a UPC compilation target - implementations have their own internal interfaces for that. This is a user API where the programmer has full knowledge of buffer access patterns, and can place his init and sync library calls correctly to ensure he doesn't touch buffers during the transfer interval where they are undefined. He can add fences and other unrelated computation and communication during the transfer interval which do not touch the in-flight buffers, and everything still works. Even if he takes the approach of starting with init();sync(); and slowly spreading the calls, he CAN safely move them past unrelated computation, communication and even syncs using his Programmer Knowledge to ensure he doesn't move them past the code which consumes/overwrites the transfer buffers or synchronization which signals other threads to do so. The fact this transformation cannot blindly and automatically be performed by an optimizer in most interesting cases is a large part of the motivation for exposing an explicitly asynchronous library in the first place whereby an advanced user can do this manually.

    ```

    Reported by `danbonachea` on 2012-08-11 05:12:48

  57. Former user Account Deleted

    ``` Troy wrote, in part:

    Now, going the other direction, it also should not be legal to allow the accesses to issue AFTER the next fence. BUPC claims that for the async functions the accesses could all issue after the fence, or even all issue at the sync. This situation is as if the _nb call was made after the fence:

    In that last sentence the phrase "as if the _nb call" distills the essence of the difference between our lines of thought. In your line of thought the relaxed accesses are firmly associated with the _nb call. In the Berkeley line of thought the association is weaker, with the _nb call only defining one end of an interval.

    So, I do NOT agree that the transformation you describe (moving the _nb call past the next fence) is illegal. In fact, it is my position that it is allowed PRECISELY because the _nb call's semantics (as given in the Berkeley proposal) are designed to "give license" to perform this transformation (but ONLY with respect to fences within the transfer interval). In the Berkeley proposal use of an async transfer is an assertion by the user that the source and destination memories are untouched within the transfer interval, and the transformation is permitted.

    This *is* what we (Berkeley) want to offer to our users, and where our "philosophies" differ.

    I don't feel this behavior is "unintuitive" when one thinks about asynchronous data movement by network hardware or other "asynchronous agent". I AGREE that it does not follow "intuitively" from thinking about "sync(upc_memFOO_nb(...))" as equivalent to "upc_memFOO(...)", but that is NOT an equivalence we provide, and this could be made clear(er) in the specification/documentation if desired.

    Let's look at this difference in another way. Given the following:

        h = upc_memput_nb( ... );
        other_work1();
        upc_fence;
        other_work2();
        sync( h );

    Cray's proposal PROHIBITS transformation into:

        other_work1();
        upc_fence;
        other_work2();
        h = upc_memput_nb( ... );   /* has moved DOWN 3 lines */
        sync( h );

    and it REQUIRES a transformation equivalent to:

        h = upc_memput_nb( ... );
        other_work1();
        sync( h );                  /* has moved UP 2 lines */
        upc_fence;
        other_work2();

    So, why is moving the sync() across the fence REQUIRED while moving upc_memput_nb() is PROHIBITED? To me THAT is an unintuitive aspect of the Cray proposal.

    In the Berkeley proposal, the init and sync are endpoints of an interval with neither one "anchoring" the transfer. Thus both transformations are legal, but neither is required (again this only applies to movement across fences WITHIN the transfer interval).

    ```

    Reported by `phhargrove@lbl.gov` on 2012-08-11 05:43:51

  58. Former user Account Deleted

    ``` "Cray needs to loosen its preconception that all operations are issued before the init call returns"

    We do not have a preconception that all are issued; that's not even how our own hardware works. We expect that the implementation will start issuing accesses in the init call. It very well may offload the whole transfer to an entirely separate piece of hardware and, sure, that hardware may be overwhelmed and not do anything for a while, but the point is that it tries to start the transfer. The asynchronous call is essentially the programmer saying please try to start this transfer right away and return as soon as you can so that I can do something else in the meantime.

    "The behavior of this transfer interval is what we're discussing, so the "behavior" of a non-existent transfer interval in the blocking calls is irrelevant."

    No, it is quite relevant because there are existing programs that use the blocking upc_mem* calls that will migrate to the non-blocking calls for better performance and the programmer needs a semantic equivalence to get started.

    "If they "behave" identically, then there's no point in introducing a new library."

    They should be identical until you separate them. This is the equivalence that I'm talking about. The fact that the transfer window vanishes to nothing for the blocking call does not invalidate this equivalence. The utility comes from being able to separate them and that is the point of extending the library interface. The separation process is familiar and similar to replacing a upc_barrier with a upc_notify plus upc_wait and then stretching them apart. Before they are separated they are by definition identical to the original upc_barrier.

    "You're looking at this as a compiler writer performing an automatic transformation based on static analysis. ... The fact this transformation cannot blindly and automatically be performed by an optimizer in most interesting cases is a large part of the motivation for exposing an explicitly asynchronous library in the first place whereby an advanced user can do this manually."

    The advanced user can achieve overlap while still respecting fences and barriers. I don't believe that whatever additional overlap might be made possible by being able to suspend the normal fence and barrier rules for the duration of the async transfer would be worth the ensuing weirdness of not having a blocking/non-blocking+sync equivalence and not having a upc_barrier complete all accesses. ```

    Reported by `johnson.troy.a` on 2012-08-11 06:23:56

  59. Former user Account Deleted

    ``` "I AGREE that it does not follow "intuitively" from thinking about "sync(upc_memFOO_nb(...))" as equivalent to "upc_memFOO(...)", but that is NOT an equivalence we provide, and this could be made clear(er) in the specification/documentation if desired."

    It's an equivalence that we intentionally provided. Are you suggesting that we have the UPC 1.3 spec not guarantee the equivalence but as a vendor document that ours is equivalent? That's possible to do, but less standardization is worse for users and I'd like to avoid that if possible.

    "So, why is moving the sync() across the fence REQUIRED while moving upc_memput_nb() is PROHIBITED? To me THAT is an unintuitive aspect of the Cray proposal."

    We don't allow any access to cross any fence. It does not matter that the access is being performed for the UPC thread via some asynchronous agent. The accesses are still logically being performed by a particular UPC thread and its accesses are ordered with respect to its fences. We follow this rule consistently. As another example of an asynchronous agent, if a UPC thread has OpenMP threads underneath it and the OpenMP threads issue accesses, then a upc_fence completes all accesses issued by all OpenMP threads that are part of the UPC thread. (I know this isn't possible for GASNet-based UPC implementations because you aren't allowed to sync a handle from a different pthread.) ```

    Reported by `johnson.troy.a` on 2012-08-11 07:08:26
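
    A sketch of the OpenMP-under-UPC scenario Troy describes (illustrative only; whether a given implementation supports issuing shared accesses from OpenMP threads is implementation-specific, and shared_array is a placeholder):

    ```
    #pragma omp parallel
    {
        /* each OpenMP thread issues relaxed shared writes on behalf
         * of the enclosing UPC thread */
        shared_array[ omp_get_thread_num() ] = 1;
    }
    upc_fence;   /* Cray's stated rule: this completes ALL accesses issued
                    by all OpenMP threads belonging to this UPC thread */
    ```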

  60. Former user Account Deleted

    ``` While I clearly haven't converted Troy, it feels to me like both sides now understand how the respective specifications' goals/requirements differ. Hopefully things are now clear enough to an "outside observer" that others can make up their minds prior to the conf call.

    Troy wrote:

    "Are you suggesting that we have the UPC 1.3 spec not guarantee the equivalence but as a vendor document that ours is equivalent?"

    No. What I was suggesting was that if the Berkeley design is chosen and one is concerned that users would expect the equivalence, then the spec could make a clear(er) statement about the lack of equivalence.

    We don't allow any access to cross any fence.

    Berkeley instead leaves undefined which side of the fence the access(es) occur on, thus removing the issue of "crossing" one. This is a fundamental difference and each side has presented arguments for their own approach and against the other. I think these past few comments have been more "level headed" than some previous ones and allowed us to state our positions w/o name-calling or "mischaracterization" of the other.

    "It does not matter that the access is being performed for the UPC thread via some asynchronous agent."

    As Dan mentioned, the UPC I/O specification also has a model of asynchronous data movement similar to what the Berkeley async memcopy proposal contains. Does the Cray implementation also complete async I/Os at fences and other strict references? If NO, what is the reasoning behind this different interpretation from how you now want async memcopy operations to be defined? If YES, does any other vendor share the interpretation that upc_fence must complete UPC I/O operations? (Perhaps a tough question to answer, since I seem to recall that no other vendor has a UPC I/O implementation at the moment). ```

    Reported by `phhargrove@lbl.gov` on 2012-08-11 08:41:41

  61. Former user Account Deleted

    ``` "They should be identical until you separate them. This is the equivalence that I'm talking about."... ""I AGREE that it does not follow "intuitively" from thinking about "sync(upc_memFOO_nb(...))" as equivalent to "upc_memFOO(...)", but that is NOT an equivalence we provide, and this could be made clear(er) in the specification/documentation if desired."

    Let me clarify here, because the discussion has become quite arcane for the casual reader. Under the Berkeley proposal, the exact line:

        sync(upc_memFOO_nb(...));

    IS SEMANTICALLY EQUIVALENT to:

        upc_memFOO(...);

    This is the special case of an EMPTY transfer interval, so all the discussion about how the opposing views handle that interval trivially collapses to no difference. Both lines of code encapsulate a synchronous transfer that consumes and produces the transfer buffers before returning, so surrounding code on this thread is guaranteed to observe updated values.

    Cray's current "half-fence" semantics for the first line additionally force some synchronization with other threads before sync() returns, which means the split-phase version generates MORE synchronization (and presumably more performance overhead) than the purely blocking version. This over-synchronization can never cause a correct program to become incorrect, but it IS a difference nonetheless. So unless they follow through on removing that part of the semantics, it is the CRAY proposal that lacks semantic equivalence to the blocking version.

    Now when it comes to TRANSFORMING the split-phase call and separating the init and sync, the interval becomes non-empty and that's where more care is required. Note that once you start moving ANY non-comment lines of code around, the program is by definition no longer the same program - what we're talking about is transformations which are guaranteed to have no semantically observable effect on the BEHAVIOR of the program, relative to the one containing a blocking call. This is very different from saying the programs are EQUIVALENT, because the entire point is to overlap communication and improve performance (so at the very least the network traffic and performance will not be "equivalent", and we want that).

    Neither proposal allows movement of init or sync past accesses which potentially conflict with the buffers in-flight - this could potentially cause the program to access undefined data, and thereby demonstrate a change in behavior and correctness. Neither proposal allows movement past a fence IN GENERAL, although under the Berkeley proposal that transformation is SOMETIMES valid, if the fence is synchronizing unrelated data and provably does NOT affect access to the transfer buffers by some other thread. Both types of transformation require information that compilers usually cannot (or realistically, can never) prove, so we're talking about manual transformations by an advanced user who understands his data dependencies. An advanced user could use the same knowledge to move a blocking upc_mem* call past fences in his program that he knows are unrelated to the data transfer.

    "We don't allow any access to cross any fence. ... We follow this rule consistently. ... a upc_fence completes all accesses issued by all OpenMP threads that are part of the UPC thread. "

    What about with MPI non-blocking operations?:

    Thread 0                                  Thread 1
    MPI_Irecv(&destbuffer,...,1,&handle);     ...
    upc_fence;                                MPI_Recv(...)   /* from T0 */
    MPI_Send(...)   /* to T1 */               MPI_Send(...)   /* to T0 */
    ...
    MPI_Wait(&handle,...)

    I hope your fence implementation doesn't stall to await completion of that Irecv (which is asynchronously writing to destbuffer), or one could easily construct many examples where that additional synchronization creates deadlock.

    ```

    Reported by `danbonachea` on 2012-08-11 18:13:17

  62. Former user Account Deleted

    ``` MPI functions are a red herring because they do not perform UPC shared accesses.

    "As Dan mentioned, the UPC I/O specification also has a model of asynchronous data movement similar to what the Berkeley async memcopy proposal contains. Does the Cray implementation also complete async I/Os at fences and other strict references?"

    First, I'll directly answer your question by explaining that our UPC async I/O isn't very asynchronous at this point, so fences and strict accesses don't worry about this problem, but even if it was fully asynchronous, technically, the fences don't have to worry about this problem...

    UPC I/O asyncs are weird and different from either the Cray or the BUPC proposal. A UPC thread calling a UPC I/O async function actually doesn't perform any UPC shared accesses as part of the call!

    "Each call to upc_all_{fread,fwrite}_shared[_async] with a common file pointer behaves as if the read/write operations were performed by a single, distinct, anonymous thread which is different from any compute thread (and different for each operation)." [UPC 1.2 A.1.1.3 Paragraph 3]

    According to UPC I/O, if the calling thread executes a fence, then it doesn't affect the ordering of the accesses being done by this other anonymous thread because a fence is an operation local to one thread. Personally I think the UPC I/O spec is pretty wacky and this particular paragraph raises a bunch of questions like: What data has affinity to the anonymous thread? Why isn't it counted by THREADS? Why does the description of collective functions in the spec not have to exclude these threads -- presumably they don't have to participate in a barrier? Where does this thread come from?

    But I digress. Fact is, the current I/O appendix deals with the async issues by introducing the concept of anonymous UPC threads and neither the BUPC proposal nor the Cray proposal went that route. With the existing upc_mem* functions and the Cray non-blocking functions, the UPC thread calling the function makes relaxed accesses that respect fences. The UPC thread calling a BUPC async function also makes relaxed accesses, but ignores any fences or barriers between the init and the sync. The claim is that the accesses occur at an undefined point relative to the intervening fences so that the fence has no effect, but the calling UPC thread is still logically the entity performing the accesses.

    If BUPC would like to redefine their async functions as causing accesses to be performed by an anonymous thread, then I will concede that a BUPC async gets to ignore the fences, but then BUPC needs to flesh out this concept of an anonymous thread a lot more. I really don't like that the anonymous thread concept is mentioned in passing in the I/O appendix as if it were a trivial thing because it's not.

    If the definition of anonymous threads can be pinned down, then we need to see if users expect the calling thread to be the thing logically responsible for issuing the accesses or if they're comfortable thinking in terms of anonymous threads.

    Personally I don't like the anonymous thread concept and I hope we don't have to go there. UPC programs (currently) are supposed to have exactly THREADS number of threads for their entire lifetime. I much prefer saying that the accesses are performed by the calling thread and hope that we end up with that. ```

    Reported by `johnson.troy.a` on 2012-08-12 02:57:24

  63. Former user Account Deleted

    ``` "MPI functions are a red herring because they do not perform UPC shared accesses."

    MPI is NOT a red herring, because MPI functions are very commonly used in hybrid UPC/MPI programs to modify shared memory locations. Hybrid UPC+MPI apps are a very important usage class, and this property of using upc_fence to quiesce all changes to shared memory that Cray seems so married to simply doesn't hold true for those apps.

    "Personally I don't like the anonymous thread concept and I hope we don't have to go there. UPC programs (currently) are supposed to have exactly THREADS number of threads for their entire lifetime. I much prefer saying that the accesses are performed by the calling thread and hope that we end up with that."

    I agree that we should not extend the anonymous thread concept from UPC-IO. It is well-motivated for the parallel I/O library for a number of different reasons, mostly having to do with how collective parallel I/O libraries work, but those aren't applicable to non-collective upc_mem*_async. My only point in mentioning the UPC-IO example was to demonstrate that 1.2 already includes operations that asynchronously modify shared memory and are not governed by upc_fence and upc_barrier. So the fairy-tale of using upc_barrier to quiesce the whole world doesn't even work in pure UPC-1.2, let alone UPC+MPI+POSIX-async-IO... This derived and undocumented property only works in a very narrowly limited subset of UPC, so we should stop worrying about preserving it.

    "The UPC thread calling a BUPC async function also makes relaxed accesses, but ignores any fences or barriers between the init and the sync. "

    False - as we've said over and over the thread does not IGNORE them, in fact any fences or barriers continue to govern normal relaxed accesses issued by the application. However the library's relaxed accesses which comprise the async memcpy are permitted to take place anywhere inside the transfer interval, and the transfer buffers remain officially "off-limits" for the duration of the transfer interval (until the sync), so no valid UPC program can PROVE whether or not the fences in the interval had any effect on the accesses that comprise the transfer.

    ```

    Reported by `danbonachea` on 2012-08-12 03:44:20

  64. Former user Account Deleted

    ``` "MPI is NOT a red herring, because MPI functions are very commonly used in hybrid UPC/MPI programs to modify shared memory locations. Hybrid UPC+MPI apps are a very important usage class, and this property of using upc_fence to quiesce all changes to shared memory that Cray seems so married to simply doesn't hold true for those apps."

    Correct -- we don't expect a upc_fence to affect MPI. I said that MPI is a red herring because Paul asked how our fences handled MPI_Send. The answer is that UPC fences have no effect on MPI communication. If you have a UPC program and it calls MPI_Send, the shared qualifiers get cast off. From the perspective of the UPC program, the call is to a function that can touch only local data, and at most the upc_fence is obligated to do an mfence() to enforce ordering, which generally won't do anything to an MPI transfer, but that's fine.

    "So the fairy-tale of using upc_barrier to quiesce the whole world doesn't even work in pure UPC-1.2, let alone UPC+MPI+POSIX-async-IO"

    I never said that upc_barrier covered other communication models. It quiesces UPC shared accesses for the UPC threads involved in the barrier (the anonymous UPC I/O threads are evidently not participants in upc_barrier). We should preserve this property. MPI calls do not make UPC shared accesses. An OpenMP thread or a POSIX thread could make a UPC shared access, and those have to be covered by fences and barriers.

    From the above discussion, it has become apparent that we want the following things that part of the BUPC team does not want:

    - Full equivalence between existing upc_mem* functions and their upc_mem*_nb counterparts plus a local sync. (I believe Yili agreed with this point when we worked on the consensus proposal.)
    - upc_barrier should continue to quiesce all shared accesses.
    - Non-blocking shared accesses are ordered by fences, no matter if the fence is before, within, or after the init-sync interval, just like normal shared accesses.
    - The ability to fully (globally) complete an individual nb transfer without forcing completion of other communication.

    Note that as part of the compromise approach, we conceded requiring the sync call to be made for the _nb calls instead of allowing the user to omit it because BUPC needed that to free resources, and BUPC conceded that the sync call could be optional for _nbi calls. ```

    Reported by `johnson.troy.a` on 2012-08-12 05:20:39
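
    A sketch of the equivalence requested in the first bullet above, using the thread's lsync terminology (upc_lsync is a hypothetical spelling for the local-completion sync; dst, src, nbytes are placeholders):

    ```
    /* Under the Cray position these two fragments must be indistinguishable: */
    upc_memput( dst, src, nbytes );                       /* blocking              */

    upc_handle_t h = upc_memput_nb( dst, src, nbytes );   /* split-phase           */
    upc_lsync( h );                                       /* local completion only */
    ```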

  65. Former user Account Deleted

    ``` " - Full equivalence between existing upc_mem* functions and their upc_mem*_nb counterparts plus a local sync. (I believe Yili agreed with this point when we worked on the consensus proposal.)"

    See comment 65 - the Berkeley proposal provides this exact equivalence, the CRAY proposal does not (because it adds extra synchronization which is not present in the blocking version).

    " - upc_barrier should continue to quiesce all shared accesses."

    Programmers who care about this property can achieve it by placing their barriers outside transfer intervals. Comment #56 demonstrates why programmers need to be aware of the boundaries of their transfer intervals to write correct programs with this library, under either proposal.

    In my opinion this unmoderated textual conversation has become a waste of time on both sides. Neither side is barely listening to the other, and we're certainly not making progress. Nobody aside from the two arguing parties seems to be reading it. I'm done with this thread.

    ```

    Reported by `danbonachea` on 2012-08-12 17:41:41

  66. Former user Account Deleted

    ``` "See comment 65 - the Berkeley proposal provides this exact equivalence, the CRAY proposal does not (because it adds extra synchronization which is not present in the blocking version)."

    I want equivalence for the nb call plus the lsync, not the nb call plus the gsync. It's the gsync call that has the half-fence, so I think you've confused the two proposed sync calls.

    Another example to think about before the call...

    assume THREADS == 2

    shared int x[THREADS];
    int delivered;
    ...
    if ( MYTHREAD == 0 ) {
      x[0] = 1;

      /* BUPC: Starts an async transfer window that is terminated by a sync call.
       * Cray: Starts an async transfer window that is terminated by a sync call
       *       or by any fence. */
      upc_memput_[async/nb] with target x[1], source x[0], for sizeof( int ) bytes

      ... /* Computation or other communication. */ ...

      /* BUPC: Has no effect on the async transfer.
       * Cray: Forces completion of the async transfer. */
      upc_barrier; upc_barrier; upc_barrier;

      /* BUPC: Required to terminate the async transfer window and to free
       *       resources.
       * Cray: Concedes that the sync can be required in UPC 1.3 to give other
       *       implementations an opportunity to free resources, but does not
       *       currently require it because any fence will terminate the window
       *       and Cray has sufficient resource tracking without it. */
      upc_waitsync / upc_gsync
    } else {
      upc_barrier;

      /* BUPC: This access is illegal, even though a barrier separates it from
       *       Thread 0's x[0] = 1 assignment, because Thread 0 has not yet
       *       sync'd a copy that uses x[0] as a source buffer.
       * Cray: This access is legal and has no effect on Thread 0's async copy
       *       because the barrier already forced completion of the copy. */
      x[0] = 2;

      upc_barrier;

      /* BUPC: This local access is illegal because Thread 0 has not yet sync'd
       *       a copy that uses x[1] as a destination buffer. The value of x[1]
       *       may be 0, 1, 2, or even something else.
       * Cray: This local access is legal and will read that x[1] == 1. */
      delivered = ( x[1] > 0 );

      upc_barrier;

      if ( ( delivered == 0 ) && ( x[1] == 1 ) ) {
        /* BUPC: The claim is that this program is already illegal because it
         *       contains illegal accesses so this test proves nothing. Cray
         *       considers the accesses to be legal, such that if this block
         *       executes, it proves that the async read the x[0] value from
         *       one side of a barrier but wrote it to x[1] on the other side
         *       of the barrier, thereby detecting a relaxed access that
         *       crossed a barrier. BUPC claims this situation is possible but
         *       not legally observable and therefore not a memory model
         *       violation.
         * Cray: This block will never execute because the x[1] value was
         *       delivered back at the first barrier. */
      }
    } ```

    Reported by `johnson.troy.a` on 2012-08-12 20:34:19

  67. Former user Account Deleted

    ``` A related issue is the semantics for non-blocking data movement as they will be applied to an eventual non-blocking collectives proposal. This is related to the current topic because it has been stated (By George in comment #4) that it is strongly desired that the "Collectives 2.0" proposal, when revised, use the same model/semantics for asynchronous data movement as the NB memcpy family (to the extent that can be done, given that one is point-to-point while the other is collective).

    It is my position that the Berkeley semantics (destination(s) undefined during a transfer interval spanning from init to sync) is equally/easily applicable to specification of semantics for non-blocking collectives. I don't have a similar confidence that the Cray semantics would work well for collectives, but feel that to be fair Cray should make that judgement, not me. So, here are two "points of concern" which come to mind regarding NB collectives.

    1) Unlike the point-to-point NB memcopy interfaces, there is no distinguished initiator. For instance, given a Gather operation: is the root thread the initiator of all data movement (Pull), or are all threads initiators of the pieces with local affinity (Push)? This is important to define with the Cray semantics because it determines which strict operations would be required to complete the collective operation. "All strict operations on all participating threads" is one possible answer that is completely unambiguous. Another unambiguous option might be "strict operations fence all data movement to and from memory with affinity to the issuing thread".

    2) Even in the presence of a clear (and agreeable to all parties) definition of which strict operations complete which portions (or the entirety) of a collective operation, I wonder if such a semantic is practical to implement given that most scalable implementations involve spanning trees and thus multiple intermediaries along the data movement path (again envisioning a Gather as an example). Thus, a strict reference may require global communication/coordination even when issued by a single thread (thus NOT collectively). ```

    Reported by `phhargrove@lbl.gov` on 2012-08-12 21:09:50

  68. Former user Account Deleted

    ``` In comment #68, Troy wrote: "I said that MPI is a red herring because Paul asked how our fences handled MPI_Send."

    No, I did not. No reference to MPI_Send appears in any comments I have authored.

    Keeping this discussion civil is hard enough without people putting words in my mouth. Perhaps it was an honest mistake, since Dan (who DID post a mixed UPC/MPI example) and I have similar viewpoints on most things, but that mistake just reinforced Dan's observation that "Neither side is barely listening to the other". ```

    Reported by `phhargrove@lbl.gov` on 2012-08-12 21:38:01

  69. Former user Account Deleted

    ``` "No, I did not. No reference to MPI_Send appears in any comments I have authored."

    Oops, yep, that was a Dan comment in #65. My apologies. 72 comments is a lot to keep track of!

    I'm sorry if you and Dan don't feel that I'm listening. I've tried to respond to all comments as quickly as possible (it's the weekend, even...) because I had hopes that we could resolve more issues prior to the call. This online discussion is a wonderful tool and has shed a lot of light on how different people interpret the UPC 1.2 spec as well as plans for changes. The fact that a discussion occurred that did not cause any of the three of us to switch sides is unfortunate, but it happens, and believe it or not there are other issues on this site that I'm MUCH more interested in resolving than this one. Hopefully there is enough info here now that others who will be on the call can come in knowing which design they support, or even come in with a new third idea. ```

    Reported by `johnson.troy.a` on 2012-08-12 23:02:04

  70. Former user Account Deleted

    ``` Dan wrote: "In my opinion this unmoderated textual conversation has become a waste of time on both sides. Neither side is barely listening to the other, and we're certainly not making progress. Nobody aside from the two arguing parties seems to be reading it. I'm done with this thread."

    Lest my lack of interaction on this topic be interpreted as implicit agreement/disagreement or apathy -- I have followed the exchanges with interest and formed my own opinion. Although the debate has sometimes been heated, I agree with Troy that many good points have been made and issues raised on both sides.

    For the record, as I've stated before: I'm not generally in favor of adding any async. operations to UPC, because to me they feel too "MPI like" and because they introduce a degree of anachronism (by design) that makes parallel programs written in UPC even more difficult to understand and analyze. Separately, I questioned the suitability of some enhancements made in the NPB benchmark suite that used Berkeley's async. memory copy operations. At the time, it was a fairly easy call that those non-standard vendor-only extensions should be excluded from the NPB suite given its role as a reference benchmark. Even when an async. memory op. library is accepted as a required library, I'd probably question its use in the default implementations of the NPB benchmarks, but I'd have less ground to stand on.

    That said, given that the decision has been made to add an async. memory op. library to the required UPC library, I support Berkeley's position that the transmission interval between an init and sync. call should be defined in a way that it is unaffected by strict accesses or upc_fence statements that occur within the transmission interval and conversely those strict accesses and upc_fence statements do not affect the semantics of the async. memory op. library calls.

    My main reason for supporting Berkeley's suggested definition is based on the assumption that a programmer who uses these async. memory op. library calls is likely a more expert/advanced UPC programmer, and that this programmer has decided to use the async. memory operation calls to improve program performance. To gain the most performance improvement that the async. mem. op. library has to offer, I think that the programmer will willingly accept the responsibility to ensure that the source data buffers remain untouched during the transfer interval and that the destination buffer is not valid until the relevant sync. call has returned.

    I also support the addition of a signalling put capability, perhaps as a separate library. I would in fact like to see it added as a required library in the 1.3 spec. if it has been determined to be both suitably well-defined and understood.

    If the decision is made to maintain the "half fence" (a term coined in comment #24), and then Berkeley holds to their position of maintaining their bupc_* calls as a response, I would view this as a failing of the UPC language design and specification process. In fact, as a UPC compiler and runtime implementer, my reaction is: "why bother?" when considering the implementation of the new required library. Keeping separate bupc_ and cray_ forms of the library call seems superior to me, because at least it emphasizes the differences and doesn't offer the illusion of standardization.

    Given that the current Cray definition of the async. mem. ops. is more restrictive than Berkeley's, there is no backwards compatibility issue for their users if the Berkeley semantics are accepted. That is: programs written to the Cray definition of the library will continue to work as they did before. Thus it seems that decision of imposing additional restrictions on the use of the async. mem. ops. reduces to a philosophical difference, as noted in one of the preceding comments.

    In forming my opinion, I found the following points to be most compelling:

    1. Comment #23: "One of my many criticisms of the current async proposal is that it cannot achieve this level of overlap on the important case of cache-coherent shared-memory systems, because the semantics require the insertion of "half-fences" and guarantees of local/global visibility, which are significantly more heavyweight and would inhibit the hardware-provided reordering optimizations. As such, the simple case of upc_mem*_async();gsync() would be expected to perform SLOWER and provide LESS overlap than the existing analogous upc_mem* call for those systems. It makes no sense to introduce an "async" library whose semantics for memory overlap are often MORE restrictive than the existing synchronous counterparts."

    2. Comment #43: "Our goal with the Berkeley async transfer library was to enable far more aggressive overlap of communication with unrelated computation and other communication. We are trying to overlap the entire cost of a communication, and allow it to asynchronously continue undisturbed without interference from unrelated operations."

    3. comment #47: "I think the Berkeley semantics are better described as an overhead expressway - it bypasses all the city traffic below and is unaffected by city traffic lights, because the laws of the road guarantee the cars below cannot even SEE the highway traffic, let alone interact with it. The on-ramps and off-ramps are clearly defined by the library calls which define where cars enter and exit the normal flow of relaxed operations on the city streets, but while they're "in-flight" on the expressway they operate completely independently of everything else."

    4. comment #56: "This doesn't mean the memory model is "broken", it just means there is a new way to write programs with undefined behavior that the memory model does not govern. Programmers who embark upon using a new library feature are burdened with using it CORRECTLY, otherwise they get undefined behavior and commonly derived properties of the memory model may not hold true."

    Although I appreciate that some users have indicated their preference for the "half-fence" semantics proposed by Cray, I wonder if they would still support that preference if they were told that the performance of the library call might be impacted by this restriction?

    ```

    Reported by `gary.funck` on 2012-08-13 05:14:54

  71. Former user Account Deleted

    ``` Hi Gary. Thanks for offering another perspective. I would like to point out that the async plus the gsync is not equivalent to the existing upc_mem* call because the existing call guarantees only local completion whereas the gsync is global completion (and the gsync is where the half fence is). So by converting an existing upc_mem* call to an async plus an lsync, which is equivalent, there is no new performance hit. ```

    Reported by `johnson.troy.a` on 2012-08-13 05:52:03

  72. Former user Account Deleted

    ``` Correction: "introduce a degree of anachronism" should read "introduce a degree of asynchronism". (My mis-use of the spell checker.) ```

    Reported by `gary.funck` on 2012-08-13 05:58:59

  73. Former user Account Deleted

    ``` In Comment 75, Troy wrote (in part): "So by converting an existing upc_mem* call to an async plus an lsync, which is equivalent, there is no new performance hit."

    Understood. Thanks for the clarification.

    ```

    Reported by `gary.funck` on 2012-08-13 06:00:30

  74. Former user Account Deleted

    ``` Gary Funck wrote (in part):

    "If the decision is made to maintain the "half fence" (a term coined in comment #24), and then Berkeley holds to their position of maintaining their bupc_* calls as a response, I would view this as a failing of the UPC language design and specification process. In fact, as a UPC compiler and runtime implementer, my reaction is: "why bother?" when considering the implementation of the new required library."

    Gary,

    Berkeley UPC has an obligation to backwards compatibility for our existing users' codes which demands that we retain these entry points at least for some reasonable deprecation period during which users can convert their codes to the standardized interfaces. This would be true regardless of what semantics are chosen for the standard. How close or far the standard's semantics are from those of our current interfaces will determine if there are two distinct implementations within our runtime, or just a layer of #defines.

    I fully expect that Cray has a similar commitment to keeping their users' codes operating unchanged, at least in the near term. I don't think any of us are retaining legacy interfaces for the purpose of undermining the standardization process.

    "Although I appreciate that some users have indicated their preference for the "half-fence" semantics proposed by Cray, I wonder if they would still support that preference if they were told that the performance of the library call might be impacted by this restriction?"

    It is our hope that the signaling-put (which will have both blocking and async flavors) will prove applicable to the same usage case (producer-consumer notification) that the "half-fence" or "gsync" are intended to address. Quoting now from Steven in comment #40:

    > The "half-fence" that we proposed on the global sync formally provides acquire semantics on relaxed
    > accesses. This is necessary to permit pairwise synchronization with a remote thread via relaxed
    > operations to notify that thread that the non-blocking operation is complete.

    While the signaling-put is intended to "package" the relaxed (blocking or non-blocking) operation and the pairwise synchronization into a single operation. ```

    Reported by `phhargrove@lbl.gov` on 2012-08-13 06:10:53
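
    A hedged sketch of the producer-consumer pattern the signaling-put targets, using the spellings from the Berkeley extension proposal (bupc_sem_t, bupc_memput_signal, bupc_sem_wait); the exact standardized signatures were still open, and N, local_src, and consume_data are placeholders:

    ```
    shared [] int buf[N];     /* destination with affinity to the consumer    */
    bupc_sem_t *sem;          /* semaphore allocated on the consumer's thread */

    /* Producer: one-sided put plus remote notification in a single call. */
    bupc_memput_signal( &buf[0], local_src, N * sizeof(int), sem, 1 );

    /* Consumer: no fence, barrier, or gsync needed to learn the data arrived. */
    bupc_sem_wait( sem );
    consume_data( buf );      /* safe: the semaphore wait ordered the payload  */
    ```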

  75. Former user Account Deleted

    ``` An addition to my comment 74: If a consensus can't be reached where both Cray and Berkeley agree to deprecate their flavors of the async. mem. op libraries in favor of the new standard required library, I would move to defer inclusion of the async. mem. op. library in version 1.3 of the specification.

    A separate editorial note: I understand the advantage of vendors offering new libraries prefixed with something that identifies the vendor, to avoid confusion that the new library appears to be part of the UPC standard, for example. But from my perspective this has created a Tower of Babel where, for example, we have separately named libraries for returning cpu timer ticks, or we have async. mem. op. libraries with differing semantics (and differing names). What I would much prefer is that vendors conferred and then drafted a single proposed library that appears in the Proposed library section of the UPC specification. It is this form of the library that users might experiment with so that ultimately the definition is finalized and moved to "required" or "optional". Users who use the proposed library accept the responsibility to change their code as the library definition evolves before it is finalized. ```

    Reported by `gary.funck` on 2012-08-13 06:12:05

  76. Former user Account Deleted

    ``` Paul wrote (in part): "Berkeley UPC has an obligation to backwards compatibility for our existing users' codes which demands that we retain these entry points at least for some reasonable deprecation period during which users can convert their codes to the standardized interfaces. This would be true regardless of what semantics are chosen for the standard. How close or far the standard's semantics are from those of our current interfaces will determine if there are two distinct implementations within our runtime, or just a layer of #defines."

    I understand that those vendor-specific interfaces will have to be preserved -- for a while. The question is: when/if those old vendor-specific interfaces are deprecated and then removed. My guess is that if they offer distinct capabilities separate from the new standard library that they will never be deprecated. It is this potential for divergence that bothers me and in my view significantly lessens the value of a standard definition.

    ```

    Reported by `gary.funck` on 2012-08-13 06:20:31

  77. Former user Account Deleted

    ``` Gary wrote (in part):

    "My guess is that if they offer distinct capabilities separate from the new standard library that they will never be deprecated. It is this potential for divergence that bothers me and in my view significantly lessens the value of a standard definition."

    There must be a balance struck between the individual implementer's freedom to "innovate" and the "good of the community". I suspect that the economic realities faced by commercial vendors will make them less quick to remove the implementation-specific interfaces than the not-for-profits. So, there may be no "right" answer. However, that won't prevent me from putting forth a set of SUGGESTIONS:

    + As long as a new library is Optional there should be no "pressure" for removal of a vendor-specific (near) equivalent. This is intended to encourage head-to-head comparison which will help form a community consensus as to whether the Optional library should be promoted to Required status or potentially dropped if a better solution is to be found by embracing an alternative in a future Spec revision.

    + Once a library has reached Required status, the first compiler release by a given vendor to claim to support the new spec version should document any "competing" vendor-specific library/feature as deprecated.

    + Once a library has reached Required status, a vendor should REMOVE "competing" extensions with their first compiler to claim to support the NEXT spec, or after 2 calendar years (whichever is sooner). Example: a compiler for UPC spec v1.4 drops vendor-specific libs which compete with Required libs introduced in the 1.3 spec.

    Since there is no UPC "branding" there is no manner by which we can "enforce" these suggestions, but I'd like to think we can all agree to behave as members of a cooperative community.

    Thoughts? ```

    Reported by `phhargrove@lbl.gov` on 2012-08-13 06:48:24

  78. Former user Account Deleted

    ``` I spent a lot of time thinking about this over the weekend (in between moving stuff into the new apartment, so I missed all the new comments...). I did end up finding a significant use case for allowing non-blocking calls to continue beyond a fence, one that I could not restructure to avoid the fence. In light of that, I sat down with Troy this morning and discussed what we could do to make this use case work. We still have some reservations about the Berkeley proposal, but I think Cray can get a lot closer to it now that we have a good concrete use case to work with. We haven't fully fleshed things out yet, but here are the major changes that we're considering at the moment:

    1. Remove the *_gsync() and "half-fence" from the non-blocking proposal. There would only be a single sync call mostly matching the semantics of the original Berkeley proposal. As Dan pointed out, most of the desire for this call on Cray's part is subsumed by the signalling put proposal from Berkeley (Could this be added to the issue tracker? I'd also like to see a signalling get in there...). I still think that the functionality provided by the "half-fence" could be generally useful in the future, so I'll make a separate proposal for language-level support for the addition of less restrictive fences. And yes, that proposal will include the proposed formal changes to the memory model.

    2. Permit non-blocking operations to continue beyond a fence with some restrictions.

    3. Only the calling/initiating thread would be explicitly restricted from accessing memory locations involved in the call until the sync. I'm still working out the formal modifications to the memory model, but the gist of it is that the read and write operations involved in a non-blocking call occur at some unspecified time between the initiation and the sync, and are treated as relaxed (or local for the non-shared arguments to memget/memput) operations, thus they can be re-ordered with respect to other relaxed operations, including those occurring "after" the sync in the source. However, all threads must agree on their order with respect to strict operations issued by the calling thread, and thus must also agree on the set of relaxed operations with which they can be re-ordered.

    I think these changes would suffice to enable the functionality and performance that we're looking for in non-blocking routines, hopefully without significantly burdening implementations or adding undue complexity to memory access rules. Importantly, this restricts the possible results of overlapping accesses to involved memory locations rather than making such programs illegal, which should make debugging and formal analysis of programs using the non-blocking operations simpler.

    Programs using the existing Berkeley routines could switch to using the standard ones with no changes. Programs using the existing Cray routines might require changes, though I think we could safely make this change as the routines were mostly used to avoid the overheads caused by the same-address restriction.

    This is still a rough idea, so comments are welcome. ```

    Reported by `sdvormwa@cray.com` on 2012-08-13 16:44:09

  79. Former user Account Deleted

    ``` To help clarify comment 82, consider the following short UPC program:

    #include <upc_relaxed.h>

    relaxed shared [2] int src[2*THREADS];
    relaxed shared [2] int dst[2*THREADS];
    strict shared int phase = 0;

    int main() {
      int i;
      upc_handle_t handle;

      if ( THREADS < 2 ) return -1;

      upc_forall( i = 0; i < 2*THREADS; ++i; &src[i] )
        src[i] = dst[i] = MYTHREAD;
      upc_barrier;

      if ( MYTHREAD == 0 ) {
        phase = 1;
        handle = upc_memcpy_nb( &dst[2], &src[0], 2 * sizeof(int) );
        phase = 2;
        upc_sync( handle );
        phase = 3;
      } else {
        int my_phase[2], my_dst[2];
        my_phase[0] = phase;
        my_dst[0] = dst[2];
        my_dst[1] = dst[3];
        my_phase[1] = phase;
      }
      return 0;
    }

    Obviously, if a thread's first read of phase (into my_phase[0]) observes the value '3', then that thread is guaranteed to observe the value '0' for both locations of dst, as the sync call occurs before thread 0's strict write of the value '3' into phase. Likewise, if a thread's final read of phase (into my_phase[1]) observes the value '0', then that thread is guaranteed to observe the value '1' for both locations of dst, as the initiation call occurs after thread 0's strict write of the value '1' into phase.

    If my_phase[0] reads any of '0', '1', or '2' from phase and the value read into my_phase[1] doesn't prevent it, or my_phase[1] reads any of '1', '2', or '3' and the value read into my_phase[0] doesn't prevent it, then the values read into my_dst[0] and my_dst[1] could be either '0' or '1' (or something else entirely, as noted in UPC 1.2 B.3.2.1 paragraph 2). This is due to the relaxed operations associated with the upc_memcpy_nb() occurring at an unspecified time between the initiation and the sync, and thus the values aren't guaranteed except outside of that region.

    However, the values must be consistent across threads. Therefore, the following behavior is disallowed:

    Thread 1 observes:
      my_dst[0] is 0
      my_phase[1] is 1

    Thread 2 observes:
      my_phase[0] is 2
      my_dst[0] is 1

    Because all threads must observe the operations occur in the same order with respect to strict operations on the calling thread, it is not permitted for thread 2 to observe the original value of dst[2] after observing the value '2' for phase if thread 1 has observed the updated value of dst[2] before observing the strict write of the value '2' into phase.

    Formally, thread 1 and thread 2's orderings include (with the first argument representing the thread issuing the operation)

    T1: RW(T1,dst[2],1) SW(T0,phase,1) RW(T0,dst[2],0) RR(T1,dst[2],0) SR(T1,phase,1) SW(T0,phase,2)
    T2: RW(T1,dst[2],1) SW(T0,phase,1) SW(T0,phase,2) SR(T2,phase,2) RR(T2,dst[2],1) RW(T0,dst[2],0)

    Since thread 1's and thread 2's orderings disagree on when RW(T0,dst[2],0) occurs relative to SW(T0,phase,2), the orderings are incompatible, and thus disallowed.

    The following similar behavior is allowed though:

    Thread 1 observes:
      my_dst[0] is 0
      my_phase[1] is 2

    Thread 2 observes:
      my_phase[0] is 2
      my_dst[0] is 1

    There are now at least two potential orderings for thread 1, one matching the scenario above, where the write to phase happens after the write to dst,

    T1: RW(T1,dst[2],1) SW(T0,phase,1) RW(T0,dst[2],0) RR(T1,dst[2],0) SW(T0,phase,2) SR(T1,phase,2)

    and one where the write to phase happens before the write to dst

    T1: RW(T1,dst[2],1) SW(T0,phase,1) SW(T0,phase,2) RW(T0,dst[2],0) RR(T1,dst[2],0) SR(T1,phase,2)

    Notice that both generate the observed behavior, as thread 1 doesn't have enough information to show where RW(T0,dst[2],0) occurred relative to SW(T0,phase,2). Thread 2 still has

    T2: RW(T1,dst[2],1) SW(T0,phase,1) SW(T0,phase,2) SR(T2,phase,2) RR(T2,dst[2],1) RW(T0,dst[2],0)

    which is compatible with the second option for thread 1's ordering. Thus, there is a valid ordering on all threads and the behavior is therefore allowed. ```

    Reported by `sdvormwa@cray.com` on 2012-08-13 19:59:57

  80. Former user Account Deleted

    ``` "1. Remove the *_gsync() and "half-fence" from the non-blocking proposal. There would only be a single sync call mostly matching the semantics of the original Berkeley proposal."

    I'm glad to hear we may be making some progress here.

    "As Dan pointed out, most of the desire for this call on Cray's part is subsumed by the signalling put proposal from Berkeley (Could this be added to the issue tracker? Also, I'd also like to see a signalling get in there...)."

    See issue 80: http://code.google.com/p/upc-specification/issues/detail?id=80

    "However, all threads must agree on their order with respect to strict operations issued by the calling thread,"

    I still strongly believe it's a mistake to go down that road. It's simpler and less confusing to state the contents of the destination buffer are undefined for the entire transfer interval (until after the sync), which also gives implementations more flexibility wrt in-flight operations. I don't see a motivation for accommodating programs that want to touch the transfer buffers (from ANY thread) before the explicit sync.

    Regarding your example in comment 83:

    If non-zero threads read my_phase[1]==0, then the reads of dest are guaranteed to have seen the initial values. We agree there.

    However in all other cases, the Berkeley semantics state that if non-zero threads read anything except '3' into my_phase[0], then the subsequent reads of dst are reading undefined values (because the program failed to ensure they occurred after the successful upc_sync). The contents of the destination buffer remain undefined for ALL threads until after the upc_sync, which in practice means other threads cannot safely read the contents until the initiating thread has used a strict operation or other synchronization method to signal that the upc_sync has successfully completed.
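    (For illustration, a minimal sketch of the correctly-synchronized pattern just described, using the proposed upc_nb.h interface; the strict flag "done" is an illustrative addition, not part of the proposal:)

    #include <upc_relaxed.h>
    #include <upc_nb.h>

    relaxed shared [2] int src[2*THREADS], dst[2*THREADS];
    strict shared int done = 0;

    int main() {
      if ( THREADS < 2 ) return 0;
      upc_barrier;
      if ( MYTHREAD == 0 ) {
        src[0] = 42; src[1] = 43;  /* values to publish (affinity to thread 0) */
        upc_handle_t h = upc_memcpy_nb( &dst[2], &src[0], 2 * sizeof(int) );
        /* ... unrelated overlapped work; dst[2..3] is off-limits here ... */
        upc_sync( h );   /* transfer now complete on the initiator */
        done = 1;        /* strict write: publish completion */
      } else if ( MYTHREAD == 1 ) {
        while ( !done ) ; /* strict reads: wait for the signal */
        int v = dst[2];   /* guaranteed to observe the transferred value */
      }
      return 0;
    }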

    Furthermore, "undefined" should not be read as "the prior or subsequent value", or "something I can explain via the memory model", it really is COMPLETELY undefined (ie garbage). In particular, there is no guarantee that two different threads should read the same undefined values, or that anything read should look anything like the original or final values. Once a program has strayed into reading undefined values, the memory model provides no guarantees whatsoever, and there is no need for the undefined values to be "explainable" in terms of the memory model. As a consequence of this, VALID programs can never even observe "where" in the transfer interval the "unspecified" data transfer operations occurred, let alone agree on where they occurred.

    This philosophy gives implementations the maximal freedom within the transfer interval to perform the transfer via the most efficient means possible, with guaranteed freedom from conflicting accesses by perverse programs. We want the destination buffer to remain completely "off-limits" until the explicit sync succeeds. This flexibility enables a number of optimizations that would not otherwise be possible. We expect anything less than this to have a measurable performance impact on the common case for platforms of interest, and therefore to be unacceptable.

    "restricts the possible results of overlapping accesses to involved memory locations rather than making such programs illegal, which should make debugging and formal analysis of programs using the non-blocking operations simpler."

    This is a valid consideration, but in my opinion should be completely secondary to any performance-related concerns. The advanced user elected to use this advanced interface solely to gain performance - that's the entire motivation for the added complexity of this interface. That being said, there are various potential approaches to debugging the use of this library - the simplest being to replace non-blocking calls back to their blocking equivalents and see what changes. Another approach, taken by dynamic race condition detectors, would be to explicitly write "trap" values into the destination buffers during the transfer interval and signal when threads touch them.

    ```

    Reported by `danbonachea` on 2012-08-16 00:57:10

  81. Former user Account Deleted

    ``` "However in all other cases, the Berkeley semantics state that if non-zero threads read anything except '3' into my_phase[0], then the subsequent reads of dst are reading undefined values (because the program failed to ensure they occurred after the successful upc_sync). The contents of the destination buffer remain undefined for ALL threads until after the upc_sync, which in practice means other threads cannot safely read the contents until the initiating thread has used a strict operation or other synchronization method to signal that the upc_sync has successfully completed."

    This implies that the non-blocking forms, if synced before any fences, are NOT equivalent to the "blocking" forms. Threads can touch memory involved in upc_mem{cpy|get|put}() during the operation, and there are well-defined, though very loose, rules regarding the results. This should also be true of the non-blocking variants.

    "In particular, there is no guarantee that two different threads should read the same undefined values, or that anything read should look anything like the original or final values."

    The memory model already explicitly permits this: 'Furthermore, because the shared writes implied by the library call have unspecified size, thread 1 may even read intermediate values into local_z0 and local_z1 which correspond to neither the initial nor the final values for those shared objects.'

    "This philosophy gives implementations the maximal freedom within the transfer interval to perform the transfer via the most efficient means possible, with guaranteed freedom from conflicting accesses by perverse programs. We want the destination buffer to remain completely "off-limits" until the explicit sync succeeds. This flexibility enables a number of optimizations that would not otherwise be possible. We expect anything less than this to have a measurable performance impact on the common case for platforms of interest, and therefore unacceptable."

    As defined, it permits using a (non-UPC) thread to do the transfer asynchronously. It permits use of RDMA hardware. It permits memory reordering on SMPs. It permits caching/aggregation, assuming flushing occurs on fences, as is required for regular relaxed accesses. What non-blocking optimizations does it prohibit?

    "That being said, there are various potential approaches to debugging the use of this library - the simplest being to replace non-blocking calls back to their blocking equivalents and see what changes. Another approach, taken by dynamic race condition detectors, would be to explicitly write "trap" values into the destination buffers during the transfer interval and signal when threads touch them."

    Neither of which can conclusively answer the question "Is my program a valid UPC program?" in all cases. No other memory operation in UPC places USER restrictions on other threads' memory operations. There are restrictions placed on the implementation, but never the user. The code may not behave the way the user expects or intends, but it is still a valid UPC program. Prohibiting the calling thread from accessing the memory is ok, as it is relatively easy to prove that the user's code is valid or invalid with that restriction. ```

    Reported by `sdvormwa@cray.com` on 2012-08-16 15:23:23

  82. Former user Account Deleted

    ``` "This implies that the non-blocking forms, if synced before any fences, are NOT equivalent to the "blocking" forms. Threads can touch memory involved in upc_mem{cpy|get|put}() during the operation, and there are well-defined, though very loose, rules regarding the results. This should also be true of the non-blocking variants."

    I disagree. B.3.2.1 specifies that blocking upc_mem* updates the destination buffer with "relaxed shared writes of unspecified size and ordering", which implies that threads concurrently reading the destination buffer without synchronization are effectively reading garbage (it may not look anything like the initial or final values). This is exactly the same as a Berkeley upc_sync(memput_async()), ie an empty transfer interval. In both cases the destination values read by other unsynchronized threads are completely indeterminate. Equivalence of undefined values is meaningless.

    "As defined, it permits using a (non-UPC) thread to do the transfer asynchronously. It permits use of RDMA hardware. It permits memory reordering on SMPs. It permits caching/aggregation, assuming flushing occurs on fences, as is required for regular relaxed accesses. What non-blocking optimizations does it prohibit?"

    It suffers from the same performance pitfall as the original proposal - namely that random strict operations issued by the caller for unrelated computations during the transfer interval globally destroy the overlap potential. This is especially a problem in multi-module/multi-author/multi-domain applications where one module wants to initiate some communication in preparation for the next time step, then call a different, independent module to compute on its own independent data in the overlap interval. If the programmer has explicitly asserted a given transfer can proceed independently of everything until the explicit sync and has promised not to touch the transfer buffers until then, we should not stall on the first random strict operation in the transfer interval to await completion (thereby silently defeating the programmer's attempt at overlap).

    This pitfall creates the possibility of "performance bugs" whenever the program wants to call a function they did not write during the transfer interval, if that function happens to include a strict operation. The user has taken pains to arrange an overlapped transfer and everything in his code is kosher, but the unrelated library call is stalling to complete his overlapped transfer for no reason. This is a "global" kind of bug that crosses all module boundaries, and may be impossible for the programmer to diagnose in the case of separate compilation. This seems far more surprising and problematic to me than programs which "break the contract" of the non-blocking transfer interval and issue unsynchronized reads to the destination that result in undefined values; that is usually a "local" kind of bug because escape analysis can at least tell you which modules may possibly be touching the data buffers, and in common cases it should only be the current module.
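    (A sketch of the modularity hazard described above; lib_step and its contents are hypothetical:)

    #include <upc.h>
    #include <upc_nb.h>

    /* third-party module, separately compiled; its author knows nothing of our transfer */
    void lib_step(void) {
      /* ... independent computation on its own data ... */
      upc_fence;   /* internal fence for its own purposes */
    }

    void caller(shared void *dst, shared const void *src, size_t n) {
      upc_handle_t h = upc_memcpy_nb( dst, src, n );
      lib_step();   /* under fence-terminating semantics this call silently stalls
                       until the transfer completes, defeating the intended overlap;
                       under the Berkeley semantics the transfer remains in flight */
      upc_sync( h );
    }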

    "Neither of which can conclusively answer the question "Is my program a valid UPC program?" in all cases."

    UPC is not a type-safe language. That property is impossible to prove for most interesting programs, even without this library. That's not even a goal of C-based languages. Programmers who want/need that type of guarantee should be using a type-safe, high-level language. Otherwise, they need to fully understand the semantics of their language/library and use their brains to avoid straying into undefined behavior.

    "No other memory operation in UPC places USER restrictions on other threads' memory operations. There are restrictions placed on the implementation, but never the user. The code may not behave the way the user expects or intends, but it is still a valid UPC program. Prohibiting the calling thread from accessing the memory is ok, as it is relatively easy to prove that the user's code is valid or invalid with that restriction."

    Disagree again. Even in plain 1.2, any program which contains data races can result in reading indeterminate values. Every read-after-write data race between threads (ie without intervening synchronization) results in indeterminate values being read.

    It's important to note the bulk of the formal memory model is concerned with individual reads and writes of MATCHING scalar size, where word-tearing is implicitly assumed to not occur (this is a weakness of the current model which remains to be solved - issue 61). Under those conditions and assumptions the "indeterminate" values returned by those racing reads are guaranteed to return either the initial or final value. All of the memory model examples in the appendix work under that regime. However once you throw byte-oriented bulk transfers into a race, that small assurance also goes out the window and racing reads may observe values which are neither the initial nor final value, and are just plain undefined. This is why B.3.2.1 specifies that byte-oriented library calls make relaxed read/writes of "unspecified SIZE and order", to make it clear that conflicting reads (of even blocking upc_mem* calls) will get undefined values (this should perhaps be made more explicit). This is ugly but it reflects the reality of implementation - it may be possible (in most cases) to prevent word-tearing for scalar read/writes, but it's unavoidable for untyped bulk transfers.

    It's impossible in general to prove a lack of data races, even using runtime techniques. This is just another form of data race - the calling thread is performing writes of undefined size and order throughout the entire transfer interval, and if another thread reads the destination during that interval (ie without synchronization protecting it from doing so) then it gets undefined garbage values. Even if we adopted the Cray proposal and forced strict operations to terminate the transfer interval, you can still easily construct data races where unsynchronized threads read garbage from the destination before the caller issues a strict op. This potential for data races is a fundamental property of UPC 1.2 (and most shared-memory programming), and something the programmer must already keenly avoid. The only difference with the new library is the destination buffer contains undefined values for the several lines of code until the sync, rather than the duration of the single line of code of the blocking upc_mem*.

    I think we're just avoiding the core difference here, which is when the transfer interval ends. Arguments about races and consistency apply equally to conflicting unsynchronized accesses in the transfer interval regardless of where it ends. Berkeley wants the transfer interval to terminate only at the explicit sync. Cray wants it to additionally terminate at any earlier random strict operation issued by the calling thread during the transfer interval. Performance and modularity concerns both argue for the Berkeley version. The Cray version seems to primarily be motivated by perverse programs that issue an explicitly async transfer and then want to read guaranteed results before it has been explicitly synced. I have yet to see a realistic example of a program that needs this additional guarantee, that cannot be rewritten to explicitly sync the transfer before the synchronizing strict op (or use a signalling memput instead, for that usage case). The secondary Cray motivation seems to be programs that "accidentally" violate the contract of the library interface and want stronger semantics to aid in debugging, but Berkeley is not willing to sacrifice the performance of correctly-synchronized production codes to accommodate the behavior of insufficiently-synchronized buggy programs.

    ```

    Reported by `danbonachea` on 2012-08-16 22:29:57

  83. Former user Account Deleted

    ``` "It suffers from the same performance pitfall as the original proposal - namely that random strict operations issued by the caller for unrelated computations during the transfer interval globally destroy the overlap potential."

    I think you're misunderstanding what I was proposing, which is probably my fault ;). I'm proposing that the relaxed writes involved in a non-blocking transfer may occur at any time between the initiation and the sync, but that all threads must agree on when any given individual relaxed write occurs relative to fences on the calling thread. The writes are still completely independent and of unspecified size, so all this really says is that if you observe the final result at a particular location before a fence, then observe the fence, you are guaranteed that any relaxed writes you issue TO THAT LOCATION after observing the fence are "correctly" ordered on all other threads, even if the sync hasn't actually occurred yet. It doesn't tell you anything about any other memory locations involved, as they are independent of one another.

    "Disagree again. Even in plain 1.2, any program which contains data races can result in reading indeterminate values. Every read-after-write data race between threads (ie without intervening synchronization) results in indeterminate values being read."

    UPC still permits valid programs to contain data races though. Our point of contention was that we understood your proposal to say that programs with data races on memory involved in non-blocking transfers are not valid UPC programs. We have no problems with the results of the race being indeterminate, but we do not want to say that such programs are no longer UPC programs merely because of the race condition.

    "Even if we adopted the Cray proposal and forced strict operations to terminate the transfer interval"

    This is completely misunderstanding what I was proposing. All I intended was that once a thread (other than the calling thread) observes the final result at a particular memory location AND a fence on the calling thread that provably occurred after it, then all threads are guaranteed to observe the final result at that memory location IF they have also observed that fence. ```

    Reported by `sdvormwa@cray.com` on 2012-08-17 03:38:07

  84. Former user Account Deleted

    ``` "I'm proposing that the relaxed writes involved in a non-blocking transfer may occur at any time between the initialization and sync, but that all threads must agree on when any given individual relaxed write occurs relative to fences on the calling thread.... if you observe the final result at a particular location before a fence, then observe the fence, you are guaranteed that that any relaxed writes you issue TO THAT LOCATION after observing the fence are guaranteed to be "correctly" ordered on all other threads"

    There are several implicit assumptions in your statements above that I think are problematic:

    1. It assumes the initial and final values of the given memory location differ, so there's a change to notice at all. I think you'd agree that when they are the same value, no conclusion can be drawn.

    2. It assumes lack of word-tearing, which can easily occur between the byte-oriented library writes and any reads larger than a byte. Consider this case of a 4-byte memory "location" that appears somewhere in the middle of the destination buffer:

       Byte address:         0  1  2  3
       initial value (hex): 00 00 00 00
       final value (hex):   00 77 00 00

    If a conflicting 4-byte read from a remote thread gets back "0x00770000", that LOOKS like the final value has been written, but the library may in fact still be writing the last two bytes (as a separate "relaxed write of unspecified size"). If the thread mistakenly assumes the transfer is "done" and issues a concurrent write to the word, then the value it writes may later get clobbered when the library writes the second half of the bytes (see the sketch after this list).

    3. It implicitly assumes that any given byte in the destination buffer is only written once, and that there aren't multiple "relaxed writes of unspecified size and order" that affect the same byte.

    4. Finally it assumes the destination bytes are only written with the "new" data and doesn't temporarily hold "something else entirely"
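    (The word-tearing hazard from point 2, as a minimal sketch; word_in_dst, FINAL, and my_update are illustrative names, not part of any proposal:)

    relaxed shared int word_in_dst;  /* a 4-byte word inside the destination buffer */

    void racy_peek(int FINAL, int my_update) {
      int v = word_in_dst;          /* relaxed 4-byte read: may return a torn value
                                       such as 0x00770000 that merely LOOKS final */
      if ( v == FINAL )
        word_in_dst = my_update;    /* unsafe: the library may still write the
                                       remaining bytes and clobber this update */
    }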

    I see no reason to restrict implementations to require 3 and 4, simply to accommodate "perverse" programs that want to try and infer transfer completion using data races, rather than using the explicit sync call and proper, guaranteed-correct synchronization. We have a real example of at least one network that can accelerate transfers by using the destination buffer as "scratch space" for storing temporary metadata while the transfer is in-flight.

    Looking forward to the non-blocking collectives, it will likely be standard practice to utilize destination buffers as scratch space during the transfer interval, and programs attempting the type of inference you describe by peeking at the destination will do so at their own peril. As mentioned in an earlier comment, we should at least attempt to devise a general semantic philosophy that will also make sense for the closely-related non-blocking collectives.

    For all these reasons I continue to argue that the contents of the destination buffer should remain fully and truly undefined until after the successful explicit upc_sync (and for other threads, until they've used strict operations or other provided synchronization features to guarantee upc_sync has returned on the caller). The same goes for the restriction that the source buffer remain unmodified until the transfer is upc_sync'd. Bulk synchronous programs will probably accomplish such synchronization trivially at the next barrier. Programs that need more fine-grained point-to-point data delivery with synchronization should be using the signalling put library. We should stop worrying about the behavior of programs that read and act upon undefined values.

    "We have no problems with the results of the race being indeterminate, but we do not want to say that such programs are no longer UPC programs merely because of the race condition."

    Programs that read undefined values and then act upon them in any non-trivial way will exhibit undefined behavior (garbage in, garbage out). When we're dealing with untyped bulk library transfers racing with conflicting unsynchronized scalar accesses, the indeterminacy goes way beyond "initial-value or final-value", the contents of the destination buffer are completely undefined.

    For those who like analogies, my best "simple analogy" is a program which does something like:

    char *garbage = malloc(20);
    printf("%s", garbage);

    The contents of that uninitialized memory are technically "indeterminate" (ISO C99 7.20.3.3), but many/most interesting actions consuming those values will lead to undefined behavior (in this case, anything from zero output to a segmentation fault).

    ```

    Reported by `danbonachea` on 2012-08-17 06:15:50

  85. Former user Account Deleted

    ``` You're still misunderstanding my intention. When I said observed the final value, I meant the final value, not some intermediate value that happens to be the same. More formally, all I'm proposing is that the set StrictOnThreads(M), which enforces the ordering of relaxed operations issued by a thread relative to strict operations on that thread, must be globally consistent and complete, just like it currently is, rather than allowing threads to define it differently or excluding some pairs from it. Thus, from the perspective of the memory model, for each relaxed operation involved in the transfer Rop_nb, for every strict operation Sop_ct on the calling thread between the initiation and the sync, either (Rop_nb, Sop_ct) is a member of StrictOnThreads(M) OR (Sop_ct, Rop_nb) is a member of StrictOnThreads(M), and it is undefined which holds (as long as it is consistent with the other pairs in StrictOnThreads(M)).
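    (Stated compactly, with Rops_nb denoting the relaxed operations comprising the transfer and Sops_ct the strict operations issued by the calling thread between the initiation and the sync, the condition is:)

    \forall\, R \in Rops_{nb},\ \forall\, S \in Sops_{ct}:\quad (R,S) \in StrictOnThreads(M) \;\lor\; (S,R) \in StrictOnThreads(M)

    (with the choice of disjunct unspecified, subject to consistency with the other pairs in StrictOnThreads(M))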

    "Programs that read undefined values and then act upon them in any non-trivial way will exhibit undefined behavior (garbage in, garbage out). When we're dealing with untyped bulk library transfers racing with conflicting unsynchronized scalar accesses, the indeterminacy goes way beyond "initial-value or final-value", the contents of the destination buffer are completely undefined."

    Yes, we agree on this completely. All we're asking is that it not be ERRONEOUS for UPC programs to access the memory before the sync as you said in comment 30, but rather that it results in undefined values read/written (note, not completely undefined behavior). If those undefined values are used later in the program, as in your printf example, it may or may not lead to undefined behavior, depending on how they are later used. ```

    Reported by `sdvormwa@cray.com` on 2012-08-17 13:41:33

  86. Former user Account Deleted

    ``` "from the perspective of the memory model, for each relaxed operation involved in the transfer Rop_nb, for every strict operation Sop_ct on the calling thread between the initiation and the sync, either (Rop_nb, Sop_ct) is a member of StrictOnThreads(M) OR (Sop_ct, Rop_nb) is a member of StrictOnThreads(M) and it is undefined which is (as long as it is consistent with the other pairs in StrictOnThreads(M))."

    I agree with the statement above.

    However, because the relaxed operations that comprise Rop_nb are of unspecified size (leading to word tearing), can occur anywhere in the transfer interval, might overlap or conflict with each other (writing the same location multiple times), or even temporarily write a completely unrelated value to the destination, there is effectively no useful property that can be inferred by reading the undefined values in the destination buffer during the transfer interval.

    I want to avoid giving programmers the impression they can accomplish anything useful by reading the undefined values during the transfer interval. I'll admit my language in comment 30 is a bit too strong - the implementation shouldn't crash the program for simply issuing these "sketchy" reads (or otherwise invalidate the unrelated parts of the program), but the observed values are effectively meaningless. That being said, I'd like to specify that conflicting writes in the transfer interval that overwrite the destination buffer lead to "undefined behavior" - not only "undefined results" in the destination, because implementations that use it as scratch space might easily crash or do other nasty things if their metadata is suddenly overwritten by random code while the library "owns" the buffer.

    The only point in specifying the transfer in terms of relaxed ops is to provide well-defined semantics to code OUTSIDE the transfer interval; so the calling thread is guaranteed to observe updated values via relaxed reads immediately after returning from upc_sync, and if it signals a second thread using a strict write issued AFTER the upc_sync, that thread can observe the fence using a strict read and subsequently also see guaranteed updated values in the destination via relaxed reads. Converse statements can also be made wrt relaxed writes of the source before the initiation. ```

    Reported by `danbonachea` on 2012-08-17 15:08:19

  87. Former user Account Deleted

    ``` "However, because the relaxed operations that comprise Rop_nb are of unspecified size (leading to word tearing), can occur anywhere in the transfer interval, might overlap or conflict with each other (writing the same location multiple times), or even temporarily write a completely unrelated value to the destination, there is effectively no useful property that can be inferred by reading the undefined values in the destination buffer during the transfer interval."

    Exactly. All I was trying to get at in comment 82 was that StrictOnThreads(M) should still be globally consistent and complete (and that we'll probably have to patch up the definition of Precedes() to work this in). I admittedly wasn't very clear about that, but at the time, I was still going through the memory model trying to figure out exactly how to say what I meant. B.3.2.1 would still prevent the programmer from making use of this fact in any meaningful way, since as you pointed out in comment 88, they could observe an intermediate value that happens to match the final value. ```

    Reported by `sdvormwa@cray.com` on 2012-08-17 15:24:04

  88. Former user Account Deleted

    ``` "That being said, I'd like to specify that conflicting writes in the transfer interval that overwrite the destination buffer lead to "undefined behavior" - not only "undefined results" in the destination, because implementations who use it as scratch space might easily crash or do other nasty things if their meta data is suddenly overwritten by random code while the library "owns" the buffer."

    Consider the following program:

    #include <upc.h>

    relaxed shared int array[ELEMS_PER_THREAD*THREADS];
    strict shared int phase = 0;

    int main() {
      upc_forall( int i=0; i<ELEMS_PER_THREAD*THREADS; i++ ; &array[i] )
        array[i] = MYTHREAD;
      upc_barrier;
      upc_memcpy( &array[1], &array[MYTHREAD], ELEMS_PER_THREAD * sizeof(int) );
      upc_barrier;
      return 0;
    }

    This is a legal UPC program, even if it is completely useless, and has no undefined behavior. However, the final values of the elements of array with affinity to thread 1 are undefined because the relaxed operations involved in the upc_memcpy() are of indeterminate size and could be arbitrarily reordered.

    If I change the upc_memcpy() to a non-blocking form with an immediate sync:

    #include <upc.h>

    relaxed shared int array[ELEMS_PER_THREAD*THREADS];
    strict shared int phase = 0;

    int main() {
      upc_forall( int i=0; i<ELEMS_PER_THREAD*THREADS; i++ ; &array[i] )
        array[i] = MYTHREAD;
      upc_barrier;
      upc_handle_t handle = upc_memcpy_nb( &array[1], &array[MYTHREAD], ELEMS_PER_THREAD * sizeof(int) );
      upc_sync( handle );
      upc_barrier;
      return 0;
    }

    It should be equivalent to the first program. With your undefined behavior clause though, this program is not equivalent to the first, because it now has undefined behavior and could crash or initiate global thermonuclear war(*). That is unacceptable. ```

    Reported by `sdvormwa@cray.com` on 2012-08-17 17:37:32

  89. Former user Account Deleted

    ``` "It should be equivalent to the first program. With your undefined behavior clause though, this program is not equivalent to the first, because it now has undefined behavior and could crash"

    I agree this would be an additional small usage restriction, and figured you probably wouldn't agree (which is why I prefaced it with "I'd LIKE to specify"). If we were to add this restriction on conflicting, unsynchronized writes to the destination it would make sense to also add a similar restriction to the blocking variants for the same reason, but I'm guessing this will be a hard sell.

    This corner case unfortunately probably prohibits implementations that use the destination buffer for scratch space, unless they are Very Careful to prevent corrupted metadata from spreading garbage beyond the destination buffer. But this is not game-breaking for us, I think we can agree to work without that restriction. The main reason I brought it up was to expose another way in which the destination buffer contents are completely undefined during the transfer interval.

    ```

    Reported by `danbonachea` on 2012-08-17 17:47:32

  90. Former user Account Deleted

    ``` All "brand new" library proposals are targeted for starting in the "Optional" library document. Promotion to the "Required" document comes later after at least 6 months residence in the ratified Optional document, and other conditions described in the Appendix A spec process. ```

    Reported by `danbonachea` on 2012-08-17 17:53:59 - Labels added: Type-Lib-Opt - Labels removed: Type-Lib-Required

  91. Former user Account Deleted

    ``` Set default Consensus to "Low". ```

    Reported by `gary.funck` on 2012-08-19 23:26:19 - Labels added: Consensus-Low

  92. Former user Account Deleted

    ``` Official change proposal mailed to the lists on 10/4/2012.

    Description:
    ------------------
    Attached is an updated proposal for the non-blocking library spec, also committed in the proposal area of SVN. I believe it represents our current semantic consensus, based on the issue thread and my telecon notes. I've spent many hours polishing the verbiage and I believe it is complete, but it's entirely possible I've missed something or left an ambiguity - constructive comments are welcome. I fully expect it will undergo minor revisions before being merged into the optional library document, but in the interests of expediency and in accordance with our official procedures I'm starting the mandatory 4-week comment period now.

    Some highlights:

    • The bulk of the semantics are specified under the common requirement section, to avoid tedious duplication. It defines several important terms used throughout. Each function section merely includes semantics specific to that function and a reference to the common requirements.
    • I've used the term "transfer" throughout to denote the non-blocking operations, since "copy" could be misread to exclude put and get, and upc_memset is in no way a "copy".
    • There are currently two code examples - providing an example for every function seemed like overkill and didn't add anything useful. This is consistent with library examples in the C99 spec (which are somewhat rare), eg see fprintf and fscanf.
    • Tweaked the spelling of the header file (upc_nb.h) and the feature macro (UPC_NB)
    • Section organization and synopses updated to match the core lib spec.

    ```

    Reported by `danbonachea` on 2012-10-04 10:43:21 - Status changed: `PendingApproval` - Labels added: Consensus-Medium - Labels removed: Consensus-Low

    <hr>

  93. Former user Account Deleted

    ``` Thanks for all of the hard work that went into this proposal. It looks great!

    Here's some feedback:

    Pg. 5, #10: The semantics with respect to source memory should also be specified. It seems like the intent is to allow concurrent reads from source memory -- for comparison concurrent reads from source memory are erroneous in MPI (e.g. for Isend/Irecv or RMA operations).

    Pg. 5, #12: Still trying to process this paragraph.

    Pg. 6, #4: Suggest s/shall/must/ to strengthen this statement.

    Pg. 6, #4: What happens if a UPC thread exits (e.g. returns from main) without completing explicit or implicit transfers? Given the current text, this is an error and anything can happen (hang, crash, catch fire, etc).

    Pg. 11, 7.4.5.3, #2: Args are missing in upc_memput_nbi in this paragraph. Should be upc_memput_nbi(dst, src, n).

    Sec. 7.4.6: The handle is an IN parameter. It might be helpful if this handle were INOUT and were reset to UPC_COMPLETE_HANDLE once it has been synchronized (since the old value is invalid). This is helpful if the user is trying to complete an array of handles or is polling on a handle location that gets reused. The user can, of course, implement this behavior with the current interface, so this is not a strong suggestion.
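    (Such a wrapper might look like the following sketch; upc_sync_reset is a hypothetical user-level name, not part of the proposal:)

    #include <upc_nb.h>

    void upc_sync_reset( upc_handle_t *h ) {
      upc_sync( *h );              /* block until the transfer completes */
      *h = UPC_COMPLETE_HANDLE;    /* overwrite the now-invalid handle */
    }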

    Sec. 7.4.6: Error handling -- what happens if the user passes an invalid handle? If we return a complete/incomplete flag, there's no opportunity to return an error. Have we adequately defined what happens if the handle is invalid?

    Sec. 7.4.7.1/2 #4: This statement suggests that implicit handle synchronization functions are forbidden from progressing explicit handle operations; I don't think this is the intended meaning. A clearer statement might be:

    The upc_synci(_attempt) function does not complete explicit-handle transfers. ```

    Reported by `james.dinan` on 2012-10-05 21:18:31

  94. Former user Account Deleted

    ``` I'm leery of 7.4.2.10 and 7.4.2.11. These to me seem to conflict with the relationship of the non-blocking routines to the existing routines in the event that a remote thread attempts to access memory involved in the transfer during the transfer interval and there are no fences in the transfer interval. I thought we'd agreed that we would simply note that the relaxed operations making up the transfer (as defined in B.3.2.1 of the 1.2 spec) occur at an unspecified time in the transfer interval specifically so that the non-blocking routines would match the existing routines' behavior in this case. ```

    Reported by `sdvormwa@cray.com` on 2012-10-05 21:36:30

  95. Former user Account Deleted

    ``` Responses to Jim and Steve:

    "Pg. 5, #10: The semantics with respect to source memory should also be specified. It seems like the intent is to allow concurrent reads from source memory -- for comparison concurrent reads from source memory are erroneous in MPI (e.g. for Isend/Irecv or RMA operations)."

    Regarding MPI, their stated rationale for prohibiting concurrent reads of source memory is to support RDMA on non cache-coherent systems, which I believe are firmly outside the scope of UPC.

    The source buffer is const-qualified, and concurrent reads of it are implicitly permitted. I'm adding the following paragraph to explicitly state this for clarity, and also clarify what happens when multiple concurrent transfers overlap in memory:

    \np The source memory specified in a transfer is not modified by the transfer. Concurrent reads of source memory areas by any thread are permitted and behave as usual. Multiple concurrent transfers initiated by any thread are permitted to specify overlapping source memory areas. If a transfer specifies destination memory which overlaps its own source, or the source or destination of a concurrent transfer initiated by any thread, the resulting values in all destination memory specified by the affected transfers are indeterminate.
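    (For instance, in this hypothetical fragment with shared int arrays A, B, and C, the second transfer's destination overlaps the first's, so under the paragraph above the values in both destination areas are indeterminate:)

    relaxed shared int A[200*THREADS], B[200*THREADS], C[200*THREADS];

    void overlapping_transfers(void) {
      upc_handle_t h1 = upc_memcpy_nb( &A[0],  &B[0], 100 * sizeof(int) );
      upc_handle_t h2 = upc_memcpy_nb( &A[50], &C[0], 100 * sizeof(int) ); /* destination overlaps h1's */
      upc_sync( h1 );
      upc_sync( h2 );  /* both complete, but the values in A[0..149] are indeterminate */
    }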

    "Pg. 6, #4: Suggest s/shall/must/ to strengthen this statement."

    This sentence appears very early in the semantic descriptions while definitions are still being established. I intentionally prefaced the sentence with "Generally" and did not use "shall", because it's not a binding restriction - specifically in the case when the explicit-handle initiation returns UPC_COMPLETE_HANDLE, the operation is already complete and no sync call is required. However this is an unusual corner case and I wanted to provide a conceptual overview paragraph to familiarize the reader with the broad form of the interface, unclouded by such corner-cases, before getting into the actual nitty-gritty of requirements.

    "Pg. 6, #4: What happens if a UPC thread exits (e.g. returns from main) without completing explicit or implicit transfers? Given the current text, this is an error and anything can happen (hang, crash, catch fire, etc)."

    This is deliberate. As an implementor I don't have a strong feeling about this, but I wanted to allow other implementations freedom on this point. I can imagine implementations on loosely-coupled systems that might produce an ugly error if an RDMA packet arrives for a non-existing process. This is analogous to MPI_Finalize, probably for similar reasons. I think the alternative would be to augment exit() and friends with clauses similar to the current ones about flushing all open file streams, but that seems like it might be nasty to implement and could impact the performance of the common case (which I would consider unacceptable for this library).

    "Pg. 11, 7.4.5.3, #2: Args are missing in upc_memput_nbi in this paragraph. Should be upc_memput_nbi(dst, src, n)."

    Fixed - thanks for catching that!

    "The handle is an IN parameter. It might be helpful if this handle were INOUT and were reset to UPC_COMPETE_HANDLE once it has been synchronized (since the old value is invalid). This is helpful if the user is trying to complete an array of handles or is polling on a handle location that gets reused. The user can, of course, implement this behavior with the current interface, so this is not a strong suggestion."

    We've gone back and forth on this detail in various drafts. As you say the user can easily write a wrapper to provide one interface in terms of the other, so it's mostly a matter of taste. I prefer this interface for reasons of uniformity - the init functions return the handle by value, so the sync function also takes it by value. This is directly analogous to the by-value interface of malloc() and free() (and other UPC interfaces that mimic those). The C99 libraries seem to generally favor by-value calling conventions for types that are logically values, and there are very few examples of by-reference interfaces for logical values (excluding types like FILE and strings that are naturally pointers). If we ever add functions for syncing an entire array of handles as a group, we would use by-reference calling conventions - that's one of the enhancements we're considering for a future version of this interface.

    "what happens if the user passes an invalid handle? If we return a complete/incomplete flag, there's no opportunity to return an error. Have we adequately defined what happens if the handle is invalid?"

    This is a violation of a "shall" requirement, which by C99 Sec. 4 means behavior is undefined. That in turn implies by C99 3.4.3 that the implementation is permitted to do anything from ignoring it, to crashing the program and launching the nukes.

    "The upc_synci(_attempt) function does not complete explicit-handle transfers."

    I like this suggested wording and will add it.

    Steve wrote: "I'm leery of 7.4.2.10 and 7.4.2.11. These to me seem to conflict with the relationship of the non-blocking routines to the existing routines in the even that a remote thread attempts to access memory involved in the transfer during the transfer interval and there are no fences in the transfer interval. I thought we'd agreed that we would simply note that the relaxed operations making up the transfer (as defined in B.3.2.1 of the 1.2 spec) occur at an unspecified time in the transfer interval specifically so that the non-blocking routines would match the existing routines behavior in this case."

    I don't understand your objection. If the source or destination is modified by any thread during the non-blocking transfer interval, the resulting value in the destination is indeterminate. Similarly, reads of the destination memory during the transfer interval will get indeterminate results until the operation is synchronized. Note this is stronger than "undefined behavior" - the implementation is not permitted to crash if the program violates these rules, but the result is garbage. In comments 88-90 I believe we agreed upon these properties, which are also true of every non-blocking interface I've ever read and should be familiar to advanced users. 7.4.2.10-11 state these properties directly and unambiguously, rather than leaving it to be inferred by the reader. Unlike the blocking interface, it's very easy to write a program with the nb library that violates these rules, and I want to make it very clear that such a program gets indeterminate results. The reader does not need a deep understanding of the memory model and word tearing to understand the straightforward and important restrictions stated in 7.4.2.10-11. These paragraphs make it unambiguously clear that programs like the one in comment 83 read undefined values when they violate the rules.

    Paragraph 12 defines the effects of the transfer interval as a whole with respect to surrounding operations that are guaranteed to be ordered outside the interval. Specifically, it explains that on-thread data dependencies are preserved for conflicting relaxed accesses issued by the initiating thread before the init or after the sync. Also, it defines that the effects of the transfer interval as a whole behave like relaxed accesses when there are strict operations outside the transfer interval that synchronize with other threads who then touch the buffers. Realistically this paragraph is mostly targeted at compiler/optimizer writers, showing how the transfers fit into the memory model to prevent ambiguities. Footnote 2 summarizes the most important implications of this paragraph for end users.

    ```

    Reported by `danbonachea` on 2012-10-06 19:11:08

  96. Former user Account Deleted

    ``` "I don't understand your objection."

    My objection is merely that there are no equivalent statements to 7.4.2.10 and 7.4.2.11 for the "blocking" library routines. Instead, this behavior is described in the memory model. I therefore believe these statements are unnecessary for the non-blocking routines, and that the memory model (specifically B.3.2.1) should simply be updated to note that relaxed operations that make up a non-blocking routine occur at an unspecified time during the transfer interval. Otherwise, it may appear that the non-blocking routines have stronger restrictions than the "blocking" routines do. ```

    Reported by `sdvormwa@cray.com` on 2012-10-08 22:23:24

  97. Former user Account Deleted

    ``` Alternatively, we could simply make 7.4.2.10 and 7.4.2.11 apply to the existing "blocking" routines as well. ```

    Reported by `sdvormwa@cray.com` on 2012-10-08 22:28:58

  98. Former user Account Deleted

    ``` "My objection is merely that there are no equivalent statements to 7.4.2.10 and 7.4.2.11 for the "blocking" library routines...Alternatively, we could simply make 7.4.2.10 and 7.4.2.11 apply to the existing "blocking" routines as well."

    I don't object to adding some clarifying paragraphs to B.3.2.1, however I think it's important that these properties are directly stated in the nb library section. In the blocking case, there are by definition no conflicting accesses from the initiating thread, which automatically eliminates the easiest way for a programmer to "mess up". Programmers familiar with shared-memory programming already understand that synchronization is required when multiple threads touch the same data, so the data races that can arise when using the blocking library should be less surprising. Non-blocking transfers introduce new ways you can create a subtle data race and end up with indeterminate values, so I think it makes sense to be very clear about when that occurs.

    Also as Jim rightly pointed out it's worth clarifying that concurrent reads of source memory are permitted, since MPI's NB transfers notably prohibit that. Together these paragraphs neatly summarize the conditions under which conflicting operations are permitted and when they lead to indeterminate values. This provides all the information needed by the average user of this library, who will not need to consult the memory model and puzzle out the implications to decide if his program is correct.

    ```

    Reported by `danbonachea` on 2012-10-08 23:49:44

  99. Former user Account Deleted

    ``` I conferred with Pavan and he confirms my recollection that the MPI standard changed w.r.t. ISEND buffer reads. In MPI-1 and MPI-2.0, the user was not allowed to touch the ISEND buffer before the request was completed. However, because many users violated this prohibition and no implementation changed the send buffer before the request was completed, MPI-2.1 and later standards no longer have this prohibition; it is now instead a restriction on the implementation, which must not modify the ISEND buffer before the request is completed.

    Jim is, of course, still correct about users' expectations based upon MPI-1, which is obviously the one that is most widely known. ```

    Reported by `jeff.science` on 2012-10-09 02:35:45

  100. Former user Account Deleted

    ``` Responding to comment 99, from Dan:

    --quote--

    "Pg. 6, #4: Suggest s/shall/must/ to strengthen this statement."

    This sentence appears very early in the semantic descriptions while definitions are still being established. I intentionally prefaced the sentence with "Generally" and did not use "shall", because it's not a binding restriction - specifically in the case when the explicit-handle initiation returns UPC_COMPLETE_HANDLE, the operation is already complete and no sync call is required. However this is an unusual corner case and I wanted to provide a conceptual overview paragraph to familiarize the reader with the broad form of the interface, unclouded by such corner-cases, before getting into the actual nitty-gritty of requirements.

    --quote--

    I think we might not have lined up on the text to which I was referring. I was looking at 7.4.3 #4 and I think you were looking at 7.4.2 #4. Shall is proper legalese in 7.4.3 #4, so you should ignore my suggestion. 7.4.2 #4 looks fine. ```

    Reported by `james.dinan` on 2012-10-09 03:45:41

  101. Former user Account Deleted

    ``` "I don't object to adding some clarifying paragraphs to B.3.2.1, however I think it's important that these properties are directly stated in the nb library section."

    I just want it to be clear that the blocking and non-blocking routines have exactly the same semantics regarding remote threads touching the buffers during the transfer interval. Perhaps a footnote could be added to these paragraphs indicating that this is a direct consequence of the memory model (B.3.2.1) that also applies to the blocking routines, but is explicitly called out here because of the split nature of the transfer interval?

    "In the blocking case, there are by definition no conflicting accesses from the initiating thread, which automatically eliminates the easiest way for a programmer to "mess up"."

    This is only true if there is no threading layer (OpenMP, OpenACC, pthreads, etc) underneath UPC threads. While that is outside the scope of the UPC spec, it is important to keep in mind as mixing programming models is quite common in HPC. ```

    Reported by `sdvormwa@cray.com` on 2012-10-09 15:17:35

  102. Former user Account Deleted

    ``` "This is only true if there is no threading layer (OpenMP, OpenACC, pthreads, etc) underneath UPC threads. While that is outside the scope of the UPC spec, it is important to keep in mind as mixing programming models is quite common in HPC."

    I agree with this. I think it's important to allow UPC threads, which may be mapped to OS processes, to interoperate nicely with OS threads (e.g., pthreads) whenever possible. We have several applications using UPC+OpenMP/Pthreads, which is the most scalable way to use a NUMA multi-core cluster in our experiments so far.

    ```

    Reported by `yzheng@lbl.gov` on 2012-10-09 16:47:21

  103. Former user Account Deleted
    I've not heard anyone in HPC talk about OpenMP or OpenACC as compilation targets, except
    perhaps from DSLs.  However, I think more explicit APIs like Pthreads and OpenCL are
    relevant.  It may also be prudent to think about user-level threads, e.g. Qthreads,
    as possible back-end components for UPC.  Does Kyle Wheeler follow the UPC spec discussion?
    

    Reported by jeff.science on 2012-10-11 13:32:11

  104. Former user Account Deleted
    Just added the footnote suggested by Steve in comment 105, as SVN r174:
    
    --- upc-lib-nb-mem-ops.tex      (revision 173)
    +++ upc-lib-nb-mem-ops.tex      (working copy)
    @@ -131,7 +131,10 @@
     performed by a set of relaxed shared reads and relaxed shared writes of
     unspecified size and order, issued at unspecified times anywhere within the transfer
     interval by the initiating thread. Conflicting accesses {\em inside} the transfer interval
    -have undefined results, as specified in the preceding paragraphs.
    +have undefined results, as specified in the preceding paragraphs.~%
    +\footnote{The restrictions described in the three preceding paragraphs are a direct consequence of
    +[UPC Language Specifications, Section B.3.2.1], and also apply to the blocking \memstar functions.
    +They are explicitly stated here for clarity.}
     Here {\em inside} and {\em outside} are defined by the {\tt Precedes()} program order for
     accesses issued by the initiating thread; accesses issued by other threads are considered {\em inside}
     unless every possible and valid $<_{strict}$ relationship orders them outside the transfer interval.~%

    Reported by danbonachea on 2012-10-18 21:57:53

  105. Former user Account Deleted
    FYI, in MPI-3 One-sided communication, non-blocking puts and gets do NOT pass synchronization
    points.
    
    Quoted from mpi30-report.pdf from the MPI Forum (www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf)
    
    Page 431, Line 9-14:
    
    "The end of the epoch, or explicit bulk synchronization using
    MPI_WIN_FLUSH, MPI_WIN_FLUSH_ALL, MPI_WIN_FLUSH_LOCAL, or MPI_WIN_FLUSH_LOCAL_ALL,
    also indicates completion of the RMA operations. However, users must still wait or
    test on the request handle to allow the MPI implementation to clean up any resources
    associated with these requests; in such cases the wait operation will complete locally.
    "
    
    For comparison, MPI_Win_flush_all is roughly the same as upc_fence and MPI_Rput/MPI_Rget
    are the counterparts of upc_memput_nb/upc_memget_nb.
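    
    To make the comparison concrete, here is a rough C sketch (illustrative only; it assumes
    a window win already in a passive-target access epoch, with src, n, target_rank, and disp
    declared elsewhere; error handling omitted):
    
        MPI_Request req;
        /* roughly: h = upc_memput_nb(dst, src, n); */
        MPI_Rput( src, n, MPI_BYTE, target_rank, disp, n, MPI_BYTE, win, &req );
        /* ... overlapped computation ... */
        MPI_Wait( &req, MPI_STATUS_IGNORE );  /* roughly: upc_sync(h) */
    
        /* roughly: upc_fence -- completes all outstanding RMA on win, though each
           request must still be waited/tested to release its resources */
        MPI_Win_flush_all( win );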
    

    Reported by yzheng@lbl.gov on 2012-11-29 05:02:01

  106. Former user Account Deleted
    Re: Comment 109
    
    It's worth adding that /all/ one-sided operations in MPI are non-blocking and all outstanding
    operations are completed by passive target flush/lock operations at the target that
    is synchronized.  Request-generating operations (added in MPI-3) are not an exception,
    however the user is still required to clean up the request object that was returned
    by MPI.
    

    Reported by james.dinan on 2012-11-29 05:42:30

  107. Former user Account Deleted
    "all outstanding operations are completed by passive target flush/lock operations" should
    say "can be completed by passive...".  Obviously, they can also be completed by active
    target operations.
    

    Reported by jeff.science on 2012-11-29 06:14:27

  108. Former user Account Deleted
    Jim and Jeff: thanks for the clarification.
    This means that MPI_Put and MPI_Get actually behave like upc_memput_nbi and upc_memget_nbi,
    which are non-blocking memcpy operations without explicit handles. 
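    For illustration, the implicit-handle style looks like this sketch (using the proposal's
    upc_nb.h names; dst1/src1/n1 etc. stand for shared buffers and sizes declared elsewhere):
    
        upc_memput_nbi( dst1, src1, n1 );  /* no handle returned */
        upc_memput_nbi( dst2, src2, n2 );
        /* ... overlapped computation ... */
        upc_synci();  /* completes all implicit-handle transfers of the calling thread */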
    

    Reported by yzheng@lbl.gov on 2012-11-29 18:39:01

  109. Former user Account Deleted
    This PendingApproval change appeared in the SC12 Draft 3 release.
    It was officially ratified at the 11/29 telecon.
    

    Reported by danbonachea on 2012-11-29 20:03:22 - Status changed: Ratified
