Clarification: data tearing and read/write ordering

Issue #61 new
Former user created an issue

Originally reported on Google Code with ID 61

What guarantees should/can the UPC language specification offer with respect to "data
tearing", or the reading/writing of data that may be implemented as more than one aligned
read/write to main memory?

(The following notes are transcribed from some suggestions made by Steve Watanabe [Boostpro].)

- a normal scalar access must resolve to a single
memory operation.
- an unaligned scalar access may create multiple
memory operations.
- a bitfield access may create multiple memory operations.
- a bitfield write may read and write adjacent bitfields.
- an aggregate is accessed member-wise.
- operators that both read and write the same scalar value,
such as the increment operator, create both a read
and a write memory operation.

The aggregate case might be relaxed further:

- The order in which the members are accessed
is unspecified and need not be consistent
across threads.
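
The bitfield and increment bullets above can be made concrete with a small sketch in plain C. The 4-bit field layout and the helper names `write_low_field` and `increment` are hypothetical, chosen only to show which memory operations are involved; a real compiler picks its own storage unit and masks.

```c
#include <stdint.h>

/* Hypothetical layout: two 4-bit fields packed into one 8-bit storage unit.
   Writing the low field means: read the whole unit, keep the neighbouring
   high field, merge in the new value, and write the whole unit back. */
uint8_t write_low_field(uint8_t unit, uint8_t value) {
    return (uint8_t)((unit & 0xF0u) | (value & 0x0Fu));
}

/* What `(*p)++` amounts to: one read operation and one write operation,
   with nothing preventing another thread from acting in between. */
int increment(int *p) {
    int tmp = *p;   /* read memory operation */
    tmp = tmp + 1;
    *p = tmp;       /* write memory operation */
    return tmp;
}
```

Because the bitfield write stores the entire unit, a concurrent update to the neighbouring field between the read and the write would be lost, which is why the proposal lets a bitfield write "read and write adjacent bitfields".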

The problem case is shared scalars whose size
is greater than what the underlying hardware
supports, e.g. __int128_t on a 64-bit system
or long long on a 32-bit system. From a language
consistency point of view I'd like it to be 
a single memory operation. Having something 
like the rules for signal handling in C would
be a real nightmare. (Only variables of type
volatile sig_atomic_t are guaranteed to be
valid in a signal handler.)

On the other hand, we'd also like to avoid the overhead
of implementing atomic operations for large
scalar types. At the very least, some base
set of arithmetic types needs to have atomic
load/store guaranteed.
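
As a point of comparison only, C11 lets a program ask which arithmetic types have lock-free atomic load/store. The sketch below assumes a C11 compiler with `<stdatomic.h>`; UPC is based on C99, so none of this is part of UPC, and the function names are hypothetical.

```c
#include <stdatomic.h>

/* The ATOMIC_*_LOCK_FREE macros expand to 0 (never lock-free),
   1 (sometimes lock-free) or 2 (always lock-free). */
int int_lock_free_level(void)   { return ATOMIC_INT_LOCK_FREE; }
int llong_lock_free_level(void) { return ATOMIC_LLONG_LOCK_FREE; }

/* An atomic store followed by an atomic load of a long long. The value is
   never torn, but on a target where the level above is not 2 the
   implementation may have to fall back to a lock to guarantee that. */
long long roundtrip_llong(long long v) {
    _Atomic long long cell;
    atomic_init(&cell, 0);
    atomic_store_explicit(&cell, v, memory_order_relaxed);
    return atomic_load_explicit(&cell, memory_order_relaxed);
}
```

On a 32-bit target `llong_lock_free_level()` may report less than 2, which is exactly the overhead concern raised above for large scalar types.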

Reported by gary.funck on 2012-07-17 17:52:22

Comments (7)

  1. Former user Account Deleted

    ``` As an implementer I loathe the idea of, for instance, making access to 64-bit "double" atomic on a 32-bit platform. I also agree, however, that the C signal handling idea that ONLY one specific type is atomic is pretty much useless for any concurrent programming, including not just UPC but also pthreads, etc.

    So, I am fine with the proposal *IF* the first bullet is changed from "a normal scalar access must resolve to a single memory operation" to "a scalar access up to an implementation-defined size and with implementation-defined alignment must resolve to a single memory operation".

    Note 1: "implementation-defined" means the implementation is required to DOCUMENT the size and alignment restrictions

    Note 2: there are "broken" ABIs, such as for PPC64 on AIX, where the CPU word size is 64 bits, but 64-bit "double" and "long long" are given only 4-byte alignment! This is a platform where the "implementation-defined alignment" would be used to state what might otherwise not be obvious to the user. ```

    Reported by `phhargrove@lbl.gov` on 2012-07-17 19:36:59

  2. Former user Account Deleted

    ``` Set default Consensus to "Low". ```

    Reported by `gary.funck` on 2012-08-19 23:26:19 - Labels added: Consensus-Low

  3. Former user Account Deleted

    ``` Change Status to New: Requires review. ```

    Reported by `gary.funck` on 2012-08-19 23:37:41 - Status changed: `New`

  4. Former user Account Deleted

    ``` I will retain ownership of this issue. ```

    Reported by `gary.funck` on 2012-09-19 17:04:31

  5. Former user Account Deleted

    ``` Note that bit-fields are technically scalars (like all integer types), so it'd be nice to qualify that a bit more. I don't know that it's reasonable to require bit-field updates to be tear-free. ```

    Reported by `sdvormwa@cray.com` on 2012-09-21 19:59:36

  6. Former user Account Deleted

    ``` "The problem case is shared scalars whose size is greater than what the underlying hardware supports."

    The problem is actually worse than stated in comment 0. There are also architectures that can data tear in the opposite direction. Specifically, when performing a write of size SMALLER than the hardware word size, they do a read-modify-write of a larger size (word or even cache line), and the writeback can therefore clobber concurrent writes to the data surrounding the small write performed at the language level. This affects bitfield writes on almost every architecture, but can also affect byte writes on certain systems. Most architectures include a byte mask in the writeback so the memory controller only writes the actual dirty bytes, but I'm not sure we should assume that's universally available.

    Because of these competing tensions, some architectures may support atomic, tear-free writes of only a single data size, and only when aligned. This is why C99 only requires implementations to provide tear-free updates of a single type (sig_atomic_t, sec 7.14). UPC technically inherits sig_atomic_t, but C99 explicitly allows this type to be volatile-qualified (read "completely unoptimized"). Also there is no guarantee on the range of values this type can hold (read "portability problem"), and in any case it's definitely an integer type, which rules out floating-point values. Overall, this is probably not a type we should be teaching HPC users to use for their main data structures.

    I agree with Paul that we should not attempt to provide a universal guarantee of tear-free memory operations - such a guarantee could make UPC unimplementable on many architectures of interest. I think the best we can universally require is a single "implementation-defined" type that will be tear-free - but this basically brings us back to sig_atomic_t, which is already available.

    Overall I prefer the model of encouraging users to write programs that are properly synchronized (without data races that can expose tearing). Alternatively, if they insist upon including data races in their program, then encourage them to use the AMO interface, where the effects of tearing can be prevented by handling concurrent accesses in a principled manner within the library. This seems far preferable to specifying something about all concurrent accesses anywhere in the program (even of a certain size), which seems likely to imply new implementation headaches, subtle implementation bugs, and possibly global negative performance impacts. I suspect the standardization and wide availability of an AMO library will help to reduce the importance of this issue for many users.

    I move that we postpone this issue to 1.4 or later, and re-consider the issue once the standardized AMO library reaches widespread acceptance.

    ```

    Reported by `danbonachea` on 2012-09-25 00:08:48
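
    The "opposite direction" tearing described in this comment can be sketched as a deterministic, single-threaded simulation in plain C. It assumes a hypothetical machine with no byte-granular stores, so every byte write becomes a word-sized read-modify-write; the names `patch_byte` and `racy_byte_writes` are illustrative only.

    ```c
    #include <stdint.h>

    /* Replace byte `index` (0 = least significant) of a 32-bit word.
       This stands in for the patch step of a word-sized read-modify-write. */
    uint32_t patch_byte(uint32_t word, unsigned index, uint8_t value) {
        unsigned shift = index * 8u;
        return (word & ~(UINT32_C(0xFF) << shift)) | ((uint32_t)value << shift);
    }

    /* Two writers update different bytes of the same word. The interleaving
       is fixed so that both load the word before either one stores it. */
    uint32_t racy_byte_writes(uint32_t memory) {
        uint32_t a = memory;          /* writer A loads the word        */
        uint32_t b = memory;          /* writer B loads the same word   */
        a = patch_byte(a, 0, 0xAA);   /* A patches byte 0 in its copy   */
        b = patch_byte(b, 1, 0xBB);   /* B patches byte 1 in its copy   */
        memory = a;                   /* A writes the whole word back   */
        memory = b;                   /* B's writeback clobbers byte 0  */
        return memory;
    }
    ```

    Starting from zero, the result keeps only writer B's byte, even though the two writers never touched the same byte at the language level.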

  7. Former user Account Deleted
    deferred to 1.4 at the 11/29 telecon
    

    Reported by danbonachea on 2012-11-29 19:35:14 - Labels added: Milestone-Spec-1.4 - Labels removed: Milestone-Spec-1.3
