Progress guarantee of upc_notify and upc_wait

Issue #4 new
Former user created an issue

Originally reported on Google Code with ID 4 ``` UPC 1.2 6.6.1 defines upc_notify and upc_wait as collective statements, but does not specify whether upc_notify is permitted to wait on other threads. For example, an implementation of upc_notify for a barrier tree may wait on all child threads to check in before notifying the parent thread. The main GWU Unified Testing Suite contains a test named compound_test1 with the following pattern:

upc_lock( shared_lock ); upc_notify; upc_unlock( shared_lock ); upc_wait;

This code may deadlock with such a upc_notify implementation. If the intent of the specification is that upc_notify shall not wait for any reason, then that point should be clarified so that users will be assured that this code is legal. ```

Reported by `johnson.troy.a` on 2012-03-13 16:40:42

Comments (14)

  1. Former user Account Deleted

    ``` My opinion is that this code should not be legal, as I can't think of a case where this serves a useful purpose.

    As for the fix, my suggestion is an explicit statement in section 3.7 that states implementations may presume all threads enter each collective operation before any must leave. I think this fixes it for all collectives, not just upc_notify. I would not object to a statement in upc_lock/upc_unlock that notes that because of this a lock/unlock pair may not contain a collective operation.

    Any other views on this?

    ```

    Reported by `wwc@uuuuc.us` on 2012-03-16 10:30:31

  2. Former user Account Deleted

    ``` I propose adding to the descriptions of upc_lock and upc_lock_attempt (7.2.4.5 and 7.2.4.6 in UPC 1.2):

    7 The calling thread may not enter a collective operation while holding a lock.

    I propose modifying the collectives description (7.3 in UPC 1.2):

    5 Collective functions may not be called between upc_notify and the corresponding upc_wait, or while holding a lock. ```

    Reported by `johnson.troy.a` on 2012-03-20 19:54:09

  3. Former user Account Deleted

    ``` Some of the comments in issue 12 (http://code.google.com/p/upc-specification/issues/detail?id=12) are relevant here.

    In general I think we need to "fix" this entirely within the definition of upc_notify/upc_wait and the unspecified synchronization that may occur within any operation our document specifies as "collective". Relying upon spec language added to upc_lock and friends just means the same issue arises when the user arranges for any other kind of inter-thread synchronization around a collective op (eg using strict operations). ```

    Reported by `danbonachea` on 2012-06-15 08:46:31

  4. Former user Account Deleted

    ``` OK. Same problem as in issue 12. We don't control the people who write things like non-blocking barriers, and therefore it is wise not to expose UPC to the vagaries of any particular implementation of barrier.

    With my IBM hat on, I would have liked to continue the crusade, and could have used a stricter UPC language standard as a stick to beat up our PAMI guys. But I accept that this will not happen everywhere.

    ```

    Reported by `ga10502` on 2012-06-15 14:47:45

  5. Former user Account Deleted

    ``` Hoping for 1.3. This issue keeps coming up for us. ```

    Reported by `johnson.troy.a` on 2012-06-15 17:59:46 - Labels added: Milestone-Spec-1.3

  6. Former user Account Deleted

    ``` I believe this case is already covered by the definition of "collective", although in order to make it clearer I propose the following clarification to that definition:

    --- upc-terms-and-defs.tex (revision 80) +++ upc-terms-and-defs.tex (working copy) @@ -87,14 +87,16 @@ \sterm{collective}% constraint placed on some language operations which requires evaluation of such operations to be - matched\footnote{A collective operation need not provide any + matched across all threads.% + \footnote{A collective operation need not provide any actual synchronization between threads, unless otherwise noted. The collective requirement simply states a relative ordering property of calls to collective operations that must be maintained in the parallel execution trace for all executions of any legal program. Some implementations may include unspecified synchronization between threads within collective operations, but programs must not rely upon - such unspecified synchronization for correctness.} across all threads. + the presence or absence of + such unspecified synchronization for correctness.} The behavior of collective operations is undefined unless all threads execute the same sequence of collective operations.

    upc_notify is elsewhere defined as a "collective" operation, so the program in comment 0 is invalid because its correctness relies upon "the absence of such unspecified synchronization" in the upc_notify call, and hence the program may deadlock on some implementations (and that deadlock is the programmer's fault).

    As a closing note, the spec does NOT prohibit holding a lock while executing a collective operation, because that would be too strong a condition. For example, this code similar to comment 0 holds a lock during upc_notify:

    upc_lock_t *shared_lock = upc_all_lock_alloc(); if (MYTHREAD == 0) { upc_lock( shared_lock ); } upc_notify; if (MYTHREAD == 0) { upc_unlock( shared_lock ); }

    this code will not deadlock because it does not rely on the absence of synchronization in upc_notify for correctness - only one thread is acquiring the lock before entering the collective, and there is no cycle of dependencies preventing other threads from reaching the collective call.

    ```

    Reported by `danbonachea` on 2012-08-13 02:55:42

  7. Former user Account Deleted

    ``` Here's a more realistic example of using a lock across a barrier, to dynamically "elect" a single representative thread who arrives at the barrier early to do some additional work:

    while (1) { int first_thread = upc_lock_attempt( shared_lock ); if (first_thread) { I reached the barrier first, do some serialized work to prepare for the next phase eg read in file data for the next iteration } upc_notify; upc_wait; if (first_thread) { upc_unlock( shared_lock ); } some work upc_barrier; some more work }

    ```

    Reported by `danbonachea` on 2012-08-13 03:14:11

  8. Former user Account Deleted

    ``` I am okay with the language in Comment #6 (as opposed to Comment #2) based on the rationale in Comment #3. ```

    Reported by `johnson.troy.a` on 2012-08-13 14:46:48

  9. Former user Account Deleted

    ``` Mass-edit to mark all issues with an officially-announced resolution diff as "PendingApproval" ```

    Reported by `danbonachea` on 2012-08-17 15:25:01 - Status changed: `PendingApproval`

  10. Former user Account Deleted

    ``` May or may not be directly relevant, but during a meeting yesterday I confronted Pavan Balaji (him of MPI3 fame) with this exact problem - MPI formulation - and he was fairly sure that MPI3 would survive MPI barrier forced through a critical section.

    IIRC Thorsten Hoefler, and the IBM MPI guys, said the exact opposite :) ```

    Reported by `ga10502` on 2012-08-17 17:56:06

  11. Former user Account Deleted

    ``` Set Consensus:High on Pending Approval issues. ```

    Reported by `gary.funck` on 2012-08-19 23:31:54 - Labels added: Consensus-High

  12. Former user Account Deleted

    ``` Change proposed in comment #6 has been applied in SVN r103. ```

    Reported by `danbonachea` on 2012-09-13 22:25:03 - Status changed: `Fixed`

  13. Former user Account Deleted

    ``` For the record: Public comment period for this change was 8/14/2012 - 9/14/2012 No substantial objections were raised or recorded during that period. At the 9/7/2012 telecon, it was announced this change was imminent, feedback was explicitly solicited, and none was received. The change was integrated into a working draft distributed on 9/13/2012 for consideration and draft ratification at the 9/14/2012 telecon. ```

    Reported by `danbonachea` on 2012-09-14 07:26:14

  14. Former user Account Deleted

    ``` Changes distributed as Draft 1.1 were ratified during the 9/14/2012 teleconference, for inclusion in the next public draft. ```

    Reported by `danbonachea` on 2012-09-14 21:46:01 - Status changed: `Ratified`

  15. Log in to comment