Improve UPC data layout options

Issue #8 new

Former user created an issue 2012-03-19

Originally reported on Google Code with ID 8 ``` Some have observed that there seems to be little use for block sizes other than 0 and 1. Should the language be simplified by disallowing or deprecating all other block sizes?

Block size 1 is the default block size and is used widely. Block size 0 ("indefinite") frequently is used for a pointer-to-shared so that it can point to data with affinity to a single thread. The other block sizes can be emulated by using a struct with block size 1, for the common case when the block size evenly divides the array extent. For example,

shared [2] int X[10*THREADS];

versus

struct S { int data[2]; };
shared struct S Y[5*THREADS];

The first declaration is smaller and permits direct access to the elements of X. It requires understanding what block size 2 means in terms of data distribution. Changing either the block size or the array extent may require considering how they affect each other, such as whether the block size will continue to evenly divide the array extent.

The second declaration is more verbose and requires accessing the elements via two subscripts (e.g., Y[i].data[j]), but it uses the familiar default distribution. Members can be added to the struct or elements can be added to the array without as much consideration for how one will affect the other.

Additionally, moving to only block sizes 0 and 1 would have the following implementation benefits:
+ Elimination of block size [*] and every issue associated with it.
+ Zero becomes the only valid phase, so pointer-to-shared arithmetic and representation are simplified. ```
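Editorial sketch, not part of the original report: the claim that the struct form reproduces the block size 2 distribution can be checked with a small plain-C model. It assumes the usual round-robin rule, where element i of an array with block size B has affinity to thread (i / B) % THREADS. The function names `affinity_blocked` and `count_affinity_mismatches` are hypothetical, and the thread count is passed as an ordinary parameter so the model runs without a UPC compiler.

```c
#include <assert.h>

/* Thread with affinity to element i of an array declared with the given
   block size: blocks of block_size consecutive elements are dealt
   round-robin across the threads (assumed layout rule). */
static int affinity_blocked(int i, int block_size, int threads) {
    assert(i >= 0 && block_size > 0 && threads > 0);
    return (i / block_size) % threads;
}

/* Count the indices i at which X[i] (block size 2) and Y[i/2].data[i%2]
   (struct array with block size 1) would land on different threads. */
static int count_affinity_mismatches(int threads) {
    int mismatches = 0;
    int extent = 10 * threads;  /* extent of X in the example above */
    for (int i = 0; i < extent; i++) {
        int x_thread = affinity_blocked(i, 2, threads);     /* X[i] */
        int y_thread = affinity_blocked(i / 2, 1, threads); /* Y[i/2] */
        if (x_thread != y_thread) {
            mismatches++;
        }
    }
    return mismatches;
}
```

Under that assumption the two declarations place every element on the same thread, so the choice between them is about notation and maintainability rather than distribution.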

Reported by `johnson.troy.a` on 2012-03-19 16:28:28

Comments (9)

Former user Account Deleted
``` Regarding the potential performance benefits of dropping the layout qualifier (block size specifier), some implementations (BUPC, for example) use two differing internal representations for pointers-to-shared. When the block size is 0 or 1, the phase is known to be zero, therefore no phase field is allocated. The larger internal representation with the phase field is reserved for pointers-to-shared that have a block size > 1.

The GUPC compiler always allocates the space for the phase field, but will not use the phase value for shared types with block size <= 1. Although there is some storage overhead and slight inefficiency due to ensuring that the stored phase value is zero, there is no additional computational overhead for shared pointer arithmetic involving types with block size <= 1.

Based on the above observations, although there are some definite language simplifications derived from removing block sizes > 1, there need not be a storage efficiency or performance impact for block sizes <= 1.

```

Reported by `gary.funck` on 2012-03-19 19:56:31
- 2012-03-19T19:56:31+00:00
Former user Account Deleted
``` Cray also always allocates space for the phase, but does not consider the phase in any code generated for block sizes <= 1.

When I wrote "implementation benefits," I was not really speaking about performance, but rather the complexity of the compiler or run-time code that handles the pointer-to-shared arithmetic. The code is rather simple for block sizes <= 1, but gets to be tricky for block sizes > 1, especially when one must factor in unknown signs (at compile time) and the UPC division and mod operations that don't match the standard C ops. Compiler-generated code for ptr + n where n has an unknown sign and ptr has a block size > 1 is ugly; the code inside a compiler to implement it isn't much fun either. I suppose some implementations may push off the work to a run-time call, but using a function call for something as "simple" as pointer addition really feels wrong to me. The Cray compiler generates inline code for pointer-to-shared addition and we're interested in keeping it simple. ```
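Editorial sketch, not taken from any compiler: the arithmetic described above can be modelled in plain C. The model assumes a pointer-to-shared is a (thread, phase, block) triple, where block counts whole blocks already passed on that thread. The names `pts_model`, `pts_add`, and `pts_index` are hypothetical, as is the triple representation itself. The point it illustrates is that a signed n needs division and modulo that round toward negative infinity, which C's `/` and `%` do not provide.

```c
#include <assert.h>

/* Hypothetical model of a pointer-to-shared with block size B > 1. */
typedef struct {
    long thread; /* thread with affinity to the element */
    long phase;  /* position inside the current block, 0 <= phase < B */
    long block;  /* index of the current block on that thread */
} pts_model;

/* Division rounding toward negative infinity (C's / truncates toward 0). */
static long floor_div(long a, long b) {
    assert(b > 0);
    long q = a / b;
    if (a % b != 0 && a < 0) {
        q--;
    }
    return q;
}

/* Matching modulo: result is always in [0, b). */
static long floor_mod(long a, long b) {
    return a - b * floor_div(a, b);
}

/* Global element index of p for block size B and T threads. */
static long pts_index(pts_model p, long B, long T) {
    return (p.block * T + p.thread) * B + p.phase;
}

/* p + n where the sign of n is not known in advance. */
static pts_model pts_add(pts_model p, long n, long B, long T) {
    long phase_sum = p.phase + n;
    long thread_sum = p.thread + floor_div(phase_sum, B);
    pts_model r;
    r.phase = floor_mod(phase_sum, B);
    r.thread = floor_mod(thread_sum, T);
    r.block = p.block + floor_div(thread_sum, T);
    return r;
}
```

With B fixed at 1 the phase is always zero and the first division and modulo drop out, which is the simplification being argued for.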

Reported by `johnson.troy.a` on 2012-03-20 15:03:52
- 2012-03-20T15:03:52+00:00
Former user Account Deleted
``` If anything is done in our current round of spec changes with respect to blocksize > 1, then I would say "deprecate" is the strongest action we can take. To remove blocksize > 1 completely would break too many applications.

The idea even of deprecating them bothers me significantly. As an implementer I can agree w/ Troy's desire to keep the PTS arithmetic code as simple as possible. However, the proposed alternative seems to be structs or user-provided-arithmetic (via macros perhaps). Not to be insulting to Troy or to our user base, but the idea that we are going to get higher performance/quality pointer arithmetic from a UPC end-user than from the UPC compiler seems ridiculous to me.

So, I vote to "Allow". ```

Reported by `phhargrove@lbl.gov` on 2012-05-22 00:29:08
- 2012-05-22T00:29:08+00:00
Former user Account Deleted
``` I can see both sides of the issue, but I would still like to see blocking factors gone. I have two major arguments for simplification, and a potential way to deal with Paul's argument.

Arguments for restricting blocking factors

(1) Language clarity benefit. Maybe you don't appreciate how much simpler UPC would become:
- Cleaner syntax, obviously. Well, maybe except for [0].
- No more trouble with the [*] blocking factor, thread-dependent blocking factors, the maximum blocking factor and so on. The UPC type system compresses to something that is essentially C's own type system.
- The concept of "phase" disappears from the language, including upc_phaseof.
- All the funky special cases in the collective definitions: gone.
- Type casts become simpler to behold. The old rule of "phase shall be zero after cast" can go. No more trouble with actual to formal parameter translations in function calls. No more trouble with writing functions that hard-code the blocking factor.
(2) Implementation benefits. What Troy said :) In addition [pure selfish thought], on the PowerPC architecture, getting rid of a modulo/integer division pair is no mean feat.

How to deal with the backwards compatibility issue

Paul rightly feels that the suggested change is drastic and will result in at least some code that will not work anymore. Oh, and he doesn't like deprecation either. Darn.

So how about a source-to-source translator that transforms fixed blocking factor code into BF==1 code? Gary's original message has almost the complete blueprint for the transformation.

For codes with array indices the transformation would be fairly trivial. For codes with pointers-to-shared the transformation would have to generate a "pointer increment" function to allow pointer arithmetic to happen according to the original program's notions. This pointer increment function would then be inlined, essentially re-adding the complexity that Troy saved by simplifying the runtime. Thus, the runtime would be clean and high performance, but if the programmer wants to keep their hairy old code they can do that at a cost.

The source-to-source translator could transparently deal with casts to local, since the actual layout of data in memory would not have changed - only the indexing functions using pointers-to-shared would have been modified.

```
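Editorial sketch of what such a translator might emit, offered only as an illustration: for a fixed blocking factor BF, an access X[i] is rewritten to Y[i / BF].data[i % BF], and pointer arithmetic is replaced by a generated increment helper. Everything here is hypothetical (`BF`, `translated_pos`, `from_index`, `pos_add`); no existing tool is being described, and plain C stands in for UPC so the rewrite rule can be run.

```c
#include <assert.h>

/* Blocking factor of the original declaration, e.g. shared [2] int X[...]. */
#define BF 2

/* Position inside the translated array: Y[elem].data[member]. */
typedef struct {
    long elem;   /* index into the struct array Y (block size 1) */
    long member; /* index into the data[] member, 0 <= member < BF */
} translated_pos;

/* Index rewrite: X[i] becomes Y[i / BF].data[i % BF]. */
static translated_pos from_index(long i) {
    assert(i >= 0);
    translated_pos p = { i / BF, i % BF };
    return p;
}

/* Generated "pointer increment": move n elements in the original X order.
   Working on the flattened index keeps the operands non-negative, so
   ordinary C division and modulo are sufficient here. */
static translated_pos pos_add(translated_pos p, long n) {
    long flat = p.elem * BF + p.member + n;
    assert(flat >= 0); /* the result must stay inside the array */
    return from_index(flat);
}
```

The helper carries the complexity that the runtime no longer has to, which matches the trade-off described above: code that keeps its old blocking factor pays for it in generated arithmetic.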

Reported by `ga10502` on 2012-05-24 03:29:16
- 2012-05-24T03:29:16+00:00
Former user Account Deleted
``` This seems the appropriate place to make this point: I think we are missing the real issue here.

The way UPC handles distributed arrays is awkward at best. This proposal and several other proposals are tinkering around the edges, rather than proposing any sort of wholesale change that actually improves the expressibility of the language. Many of the proposals are aimed at eliminating some implementation challenges, or perhaps adding restrictions to eliminate confusing cases. I agree with Paul that these do not seem to be of significant benefit, particularly to users of the language.

Perhaps, rather than fiddling with block size related changes, people could propose new ways of specifying array geometry, perhaps ones that build on cyclic and indefinite pointer arithmetic, perhaps not. That might be of more benefit to users than removing existing functionality.

For the question here, I'll vote "Allow". ```

Reported by `brian.wibecan` on 2012-05-25 21:46:15
- 2012-05-25T21:46:15+00:00
Former user Account Deleted
``` Brian wrote:

> I agree with Paul that these do not seem to be of significant benefit, particularly to users of the language.

I would go so far as to say that dropping the current distributed array layouts would be creating a NEW language. What would your response be if I asked that arrays be removed from C entirely, since users can achieve the same things using only pointers? It is not a perfect analogy, of course, but my point is that distributed array layouts in UPC are too fundamental a feature of the language to remove.

I second Brian's interest in perhaps ADDING a mechanism for better controlling/using array layouts. ```

Reported by `phhargrove@lbl.gov` on 2012-05-25 22:11:52
- 2012-05-25T22:11:52+00:00
Former user Account Deleted
``` I also vote for maintaining the status quo with respect to block sizes > 1. I also support Brian's suggestion: an "out of the box" proposal that supersedes and generalizes layout qualifiers, and that overcomes the limitations of block sizes (whatever they may be), might be a more productive avenue of inquiry.

A few things in this regard:

1) I have heard comments to the effect that "if you're developing a library, you can't use block sizes other than 0, because block sizes are constrained to be compile-time constants". Perhaps issue #40 ("block sizes as an attribute of a VLA") would provide sufficient generality to meet that objection.

2) Although there has been a rather persistent stated concern that block sizes > 1 are both confusing and not very useful, apart from the stated compiler/runtime implementation issues and the library development limitation mentioned in 1 above, I am not aware of any further elucidation of why eliminating block sizes > 1 would be a good thing. If there are issues other than implementation issues related to block sizes > 1, I'd suggest that they be added to this issue as comments, so that we can better understand the problem.

3) Given that Co-array Fortran programs also provide distributed arrays, is there anything about UPC's block sizes (layout qualifiers) that either improves the fit between UPC and Co-array Fortran, or limits inter-operability? (This topic might be worth a separate issue to track the discussion.)

Disclaimer: I happen to like UPC array blocking factors, and think they should be used more, not less. If there are UPC language issues that limit their use, I'd prefer to see those limitations addressed rather than throwing out block sizes. That said, there is also a great deal of appeal to the minimalist argument of simplifying the language where possible/practical.

```

Reported by `gary.funck` on 2012-05-25 22:45:32
- 2012-05-25T22:45:32+00:00
Former user Account Deleted
``` Regarding Comment #7:

3) Fortran has fewer (basically one) data distribution options than UPC, so mapping a distribution from Fortran->UPC is easier than UPC->Fortran. Fewer options can be viewed as a weakness or a strength. I believe that it is a strength, partially because I think too many options are confusing and partially because of what Brian wrote in Comment #5 (i.e., the distribution options are fine but their presentation could be improved). ```

Reported by `johnson.troy.a` on 2012-06-05 20:30:06
- 2012-06-05T20:30:06+00:00
Former user Account Deleted
``` Marking as 2.0 and Usability, and changing the title to better reflect the issues being discussed. This is an issue where everyone seems to agree that something should be done, but we need more time to form better proposals. ```

Reported by `johnson.troy.a` on 2012-06-15 18:09:16 - Labels added: Milestone-Spec-2.0, Usability
- 2012-06-15T18:09:16+00:00

Assignee: –

Type: bug

Priority: minor

Status: new

Votes: 0

Watchers: 0