Global pointer representation and the Bootstrapping Problem

Issue #16 new
BrianS created an issue

I was looking over the code and figuring out what shared_array is doing. As pitched in the UPC world it is your working array for real data to be operated on in parallel. In that sense, it is a simple version of Global Arrays.

In UPC++ it rarely serves that purpose. Instead it is used as a mechanism to create global_ptr objects for remote data locations for one-sided communication. That is a more sophisticated use case than typical UPC user code, but I realize now that it is essential for UPC++. In UPC, global-space objects are named at global scope. Issuing a remote operation can be resolved because the UPC compiler goes through extra effort to pick up where the linker leaves off:

http://upc.lbl.gov/docs/system/runtime_notes/static_data.shtml

The proxy unshared approach is scary to behold. But I realize that we could achieve this kind of global-space binding in C++ without a special C compiler... I think. So far we have shunned file-scope variables and the global-scope requirement of UPC, but the shared_array bootstrap is awkward to program too.

I think it would be nice to have a way for every rank, or every team, to register virtual addresses with names, or a key that is application tailored, so that ranks can rapidly build the global_ptrs they want.

In AMReX, the application rank knows that it wants to do a put to a specific rank in a specific FArrayBox on that rank. The key would be two integers (which MultiFab, which FArrayBox in that MultiFab) and the target rank.

We could also have "static" destinations. Currently we lean on rpc to talk to remote static data structures. I'm thinking about how we create the global_ptr that we are going to target with atomic operations. We can't use rpc to get atomic behavior, at least not the kind that direct hardware support will give us. I don't think we want to be issuing compare-and-swap inside an rpc. So we need a way to create a global_ptr that gets correctly virtualized on the receiver.
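
To make the kind of registration I have in mind concrete, here is a rough sketch (all names here are hypothetical, not an existing API): each rank registers local addresses under an application-tailored key, and some bootstrap step (a collective exchange, or a lookup service) lets a peer resolve (target rank, key) into the global_ptr it wants to hit with one-sided operations.

    #include <cstdint>
    #include <map>
    #include <utility>

    // app-tailored key: (which MultiFab, which FArrayBox within it)
    using fab_key = std::pair<int, int>;

    // Per-rank registry: key -> local virtual address. Each rank fills this in for the
    // data it owns; the bootstrap step then lets a peer turn (target rank, key) into a
    // global_ptr it can target with one-sided puts/gets or atomics.
    std::map<fab_key, std::uintptr_t> my_registrations;

    void register_target(fab_key k, void *addr) {
      my_registrations[k] = reinterpret_cast<std::uintptr_t>(addr);
    }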

Comments (31)

  1. Former user Account Deleted

    I often use the pattern of creating file-scoped registries (unordered_map's from an app-specific key to hunks of data) and then using rpc's which close over a key and a payload to deliver remotely.

    unordered_map<int,farraybox> _block_data;
    
    void deliver_to(int block_id, const farraybox &data) {
      upcxx::send(
        block_to_rank(block_id),
        [=](const farraybox &data) {
          _block_data[block_id].copy_from(data);
          // now we should also leave a cookie somewhere so this rank
          // knows this data has been delivered
        }, data
      );
    }
    

    The drawback here is that we are doing packing/unpacking implicitly in the lambda's serialization. A true PGAS approach would set up contiguous buffers ahead of time and do signalling puts. But that's only a win if the data being moved is naturally contiguous and we're writing directly into its native storage. Setting up buffers to do puts that you then unpack manually saves you nothing over the rpc approach. So apps like AMR which move non-contiguous subarrays pretty much have no use for global_ptr (hence the failure of UPC to penetrate).

    A missed performance opportunity exists with this rpc example because it serializes the entire farraybox, transmits it, and deserializes it when it could have pipelined those operations by chunking up the payload. This achieves overlap of all three operations (each progressed by cpu1, network, cpu2) and uses less buffer space at the receiver side since you only need enough "chunk" buffers to cover the latency*bandwidth product as opposed to one big one for the entire payload.

    The solution is a more complicated form of rpc (built on regular rpc) that accepts a payload and a chunk-received lambda, and I do all the pipelining internally. It's a complicated API, but not unnecessarily so.
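
    For concreteness, a rough sketch (names hypothetical, mirroring the rpc-style send used above) of what the chunked form could look like:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Cut the payload into chunks and ship each one with an ordinary rpc-style send, so
    // serialization, transmission, and deserialization of successive chunks can overlap.
    // on_chunk runs at the receiver once per arriving chunk.
    template<typename OnChunk>
    void send_chunked(int target_rank, const char *payload, std::size_t total_bytes,
                      std::size_t chunk_bytes, OnChunk on_chunk) {
      for (std::size_t off = 0; off < total_bytes; off += chunk_bytes) {
        std::size_t n = std::min(chunk_bytes, total_bytes - off);
        upcxx::send(
          target_rank,
          [off, on_chunk](const std::vector<char> &chunk) {
            on_chunk(off, chunk.data(), chunk.size());  // receiver consumes this chunk
          },
          std::vector<char>(payload + off, payload + off + n)
        );
      }
    }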

  2. BrianS reporter

    I think the VIS interface is the natural way to express moving data to remote FArrayBox objects. But then you need the g_array_ref to make the put or get call, hence the need for a bootstrapping round.

  3. Former user Account Deleted

    Please back up the claim that VIS is more natural. It requires an extra synchronized setup phase to coordinate all the garrayrefs. RPCs just use the natural metadata. And the complications I described with the pipelined interface would only affect the template specializations needed for farraybox to work that way. Once farraybox implements these, it's just a matter of send(rank, farraybox, receivinglambda).

  4. BrianS reporter

    If VIS is not suitable for moving data around in a MultiFab, then I don't think we have a very good tool. It is probably a good idea to think of things that aren't all pushed into rpc. rpc makes "global address" pointless, and then we are just an extension of existing runtimes. Remote memory access is the point of global addresses, so we need to think about how to make global addresses useful.

    We have structured data on multiple distributed addresses. A good VIS interface should make that a natural data structure to work with.

  5. Amir Kamil

    I'm no UPC expert, but my understanding is that shared arrays are often used for bootstrapping there as well, especially since shared arrays are limited to block/cyclic distributions.

    Here is a summary of the way bootstrapping can be done in UPC/UPC++/Titanium, and the tradeoffs:

    1. In UPC, use a shared array directly for communication. A shared array can be represented in a scalable way due to UPC's symmetric heap. Global pointer representation is more complicated, however. The main drawback to this approach is the limited set of distributions supported by shared arrays.

    2. In UPC, use a shared array as a directory to each rank's piece of the global data structure. This requires both allocating a shared array and a separate synchronization to ensure that the array has been filled. The representation can be scalable.

    3. In UPC++, use a shared array as a directory. A synchronization is required. The current representation is not scalable because UPC++ does not have a symmetric heap. A symmetric heap is not feasible in the presence of dynamically created teams and dynamically created team-scope shared arrays.

    4. In UPC++ and Titanium, do an all-to-all to exchange pointers to each rank's piece of the data structure. This includes the required synchronization. The representation is not scalable.

    While a symmetric heap is not feasible for UPC++, a symmetric data segment is (and I think is actually currently required by the implementation). So we could introduce some sort of shared object with static storage duration that does not require replication of the metadata. Something like a coarray of size 1 on each rank. A template would allow users to include arbitrary data. The bootstrapping process would still require a synchronization, but no communication would be necessary to exchange addresses. Thoughts on this?
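
    As a very rough sketch of what I mean (names are hypothetical; assume the caller can obtain the local and remote segment base addresses, which coincide when segments are aligned):

    #include <cstdint>

    // minimal stand-in for a global pointer: owning rank plus raw address bits
    template<typename T>
    struct global_ptr { int rank; std::uintptr_t raw; };

    // A per-rank object with static storage duration, like a coarray of size 1. With a
    // symmetric data segment, `local` sits at the same offset from the segment base on
    // every rank, so a remote global_ptr can be computed without exchanging addresses;
    // only a synchronization is needed to ensure the remote copy has been initialized.
    template<typename T>
    struct shared_object {
      T local;  // this rank's instance, placed in the symmetric data segment

      global_ptr<T> on(int rank, std::uintptr_t my_base, std::uintptr_t remote_base) const {
        std::uintptr_t offset = reinterpret_cast<std::uintptr_t>(&local) - my_base;
        return {rank, remote_base + offset};  // same offset, remote rank's segment base
      }
    };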

  6. Former user Account Deleted

    @akamil, the shared data-segment approach still doesn't work with dynamic teams (hence static), right? I think that might be a show stopper. Also, unless the underlying platform supports true virtual-address symmetry of the segment, we're still going to need at least one unscalable array per rank (or possibly just node, less unscalable by a constant) of segment addresses. That's two strikes. Do you have any knowledge about how common the second strike is?

    If we want to avoid these issues, the most scalable approach that can be used with dynamic teams would be a lazily populated table mapping rank number to global_ptr, a cache in effect. When a rank looks up the global_ptr of another rank for the first time, under the hood we send an rpc to do the lookup, and we always return a future<global_ptr> to the user since this lookup may be asynchronous. We can bound the size of this cache and use an LRU policy for temporal locality.

    The big drawback is that the rpc's require receiver attentiveness. Async runtimes can tolerate this well since they tend to keep threads polling for comm just for the sake of attentiveness. Legacy bulk-sync apps with long sprints of inattentive compute may struggle. They'll either need to delimit these lookups into regions fenced by a barrier, or we provide them a collective version of a multi-lookup.

    The API for the multi-lookup would take a list of rank-ids (keys) and return a list of global_ptrs (values); we should abstract this to a general key-value store since our underlying cache is going to be sparse anyway. We internally intersect the user's list against the cache, and only cache misses get sent out as rpc's. Well-behaved apps would experience few misses. The returned list of values could be wrapped in a future or not, depending on whether we present an async semantics (we could do both).

    I think this lazily cached sparse map could be our one catch-all for distributed scalable metadata. We would support both the point-to-point rpc based lookup, and the collective multi-lookup. Users will just have to know which is the best form to use depending on their runtime's quality of attentiveness. And, this gives us a great place to stash a node-shared memory optimization. Just put the cache in node-shared memory and use a thread-safe hashtable optimized for fast reads. The catch-all gets even stronger! Notice that once again, global_ptr's are out of the picture. IMO, PGAS is not a general purpose thing and we should only be pitching it for niche apps like HipMer.

    void foo(upcxx::team &my_team) {
      typedef <some type> Key;
      typedef <some type> Value;
    
      // We require collective construction of dist_map's. Internally we just increment a team-local id-counter to generate a "name" for this
      // instance which gets shipped in the rpc's. So it's collective, but it doesn't communicate.
      upcxx::dist_map<Key, Value> my_map{my_team};
    
      my_map.put(Key{...}, Value{...}); // populate local datapoint
    
      /* We don't need a synchronization between puts and gets. If a get arrives and no matching key is found, we assume
         the user just hasn't populated it yet and wait. Since the asker is getting back a future we can just hide this extra
         latency in there. This means we don't support a key-existence query, since non-existence is indistinguishable from
         doesn't exist yet. The user would have to encode non-existence as a special value. */
    
      upcxx::future<Value> val = my_map.get(some_other_rank_number, Key{...}); // either cached and fast, or a slow p2p rpc
    
      // OR
    
      vector<Value> vals = my_map.get(vector<Key>{...}); // collective, synchronous, multi-lookup
    
      // Collective destruction of my_map for free, thank you c++. Tricky detail: does the destructor just free the local map and continue,
      // or does it synchronize with a barrier to ensure all outstanding lookups get serviced. Usability vs performance. I think the right
      // solution is the fast and unsafe one. We don't want to penalize users with multiple barriers when they want multiple tables in scope.
      // They'll just have to know about this issue and place a barrier before exiting a block of code with tables destructing in it, in many
      // cases they'll need a barrier there for other reasons. Also, dist_map's need not be stack alloced, in which case the construction
      // order does not determine the destruction order (users do it explicitly with delete). Hiding a barrier in destructors could deadlock
      // the user in cases where they call deletes in non-deterministic order (think std::shared_ptr<upcxx::dist_map>). And we can detect
      // cases where get's arrive after destruction in debug-mode if we never recycle the id's and keep track of the living vs dead.
    }
    
  7. BrianS reporter

    To clarify terminology: symmetric heap means that a remote rank has the same base address for the heap as the other ranks, and hence can compute a global_ptr on its own? My only experience with the term is from shmem, where it really means the collective heap.

  8. Former user Account Deleted

    I'm not sure about the true terminology, but I think symmetric-heap means that the heap's layout is the same on all ranks. If the shared segments share the same base pointer (let's call that aligned-segments), then the segment's beginning virtual address is bitwise identical on all ranks. If the segments have different base pointers, then each rank needs access to a table of base pointers, and offset-from-base becomes an object's identity.

    1. Having aligned-segments and symmetric-heaps means we can locate a distributed object's address without communication or unscalable datastructures.

    2. Symmetric-heaps without alignment means there is only one place in which (unscalable data or slow communication) is needed.

    3. I think aligned-segments without utilizing it for symmetric-heaps buys us nothing?

    4. And having neither means we need the (unscalable data or slow comm) per object.

    Amir's proposal was like the symmetric-heap, except only for statically declared objects. Segment alignment would make it fast and scalable.

    I think there is a way to extend Amir's proposal if we restrict teams to be properly nested and scoped. We can treat the front of the segment as symmetric and LIFO allocated (stacked), and use the back of the segment for non-symmetric heap stuff (hope they never cross). Then each time a distributed object is constructed it allocates aligned storage by pushing the max memory needed for that object by any rank in the current team onto the symmetric-stack. This requires the following restrictions:

    1. Teams must be nested and constructed/destructed collectively.

    2. Dist objects must be constructed/destructed collectively.

    3. Dist objects must be constructed while the proper team is the most recently constructed and still living one.

    So if we impose all these restrictions, we can have team-local distributed objects with potentially fast global_ptr lookup. But they are only actually advantageous when on a system with aligned-segments. On unaligned-segments, I doubt the benefit of containing slow/unscalable translation to just segment translation vs per-object translation would be worthwhile. And all of this only benefits the translation of global_ptr's, global_ptr are still really slow to dereference! The remedy for slow/unscalable will tend to look like the rpc solution I outlined previously. If we just go all-in with rpc, then the rpc can deliver (and hence cache) the dereference too. Caching the dereference does impose a single-assignment nature to the datastructure, but I think this aligns well with how they are typically used.
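
    For concreteness, here is a minimal sketch of the symmetric-stack idea above (names hypothetical; team_max_reduce stands in for a team-wide max all-reduce):

    #include <cstddef>

    // stand-in for a collective max all-reduce over the current team
    std::size_t team_max_reduce(std::size_t my_value);

    // The front of the shared segment treated as a symmetric, LIFO-allocated stack: a
    // collective push reserves the team-wide maximum size, so every rank ends up with
    // the same offset for the same distributed object.
    struct symmetric_stack {
      char       *base;     // this rank's segment base
      std::size_t top = 0;  // bytes pushed so far (identical across the team)

      void *push(std::size_t my_size) {   // collective over the team
        std::size_t sz = team_max_reduce(my_size);
        void *p = base + top;
        top += sz;
        return p;
      }
      void pop(std::size_t my_size) {     // collective, strict LIFO order required
        top -= team_max_reduce(my_size);
      }
    };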

  9. Amir Kamil

    @jbachan Yes, the aligned data segment approach would not work for team-scoped data structures. This is why I think it should be a distinct concept from shared arrays.

    UPC++ currently requires aligned code segments in order for RPC to work. The current implementation requires address-space randomization to be turned off. There isn't an easy way around this, though the Titanium implementation did have some sort of introspection to try to determine the base address of each segment, so it may be solvable. If we stick to aligned code segments, however, we'll have aligned data segments.

    Relying on RPC is not a complete solution on its own, without support from the GASNet end. RPC uses AMs, which currently use an unscalable set of buffers. This should be fixed in GASNetEx, which then essentially makes the bootstrapping problem GASNet's to solve rather than UPC++'s.

    So maybe the solution is a shared array that uses RPC+caching, as John suggested. It means we won't be able to do a shared array to global pointer conversion, which is probably a good thing. I think the two concepts should remain distinct.

  10. Former user Account Deleted

    @akamil those are good points, but I think they have solutions. I know of two ways to make RPC's work without aligned code segments. Neither solution is guaranteed to work by C++, but I think either one is preferable to what we do now.

    Number 1. Encode code addresses as offsets from some fixed code location:

    // forward declare our fixed point in the code address space
    int main(int, char**);
    
    typedef void(*rpc_t)();
    
    uintptr_t encode(rpc_t f) {
      // reinterpret_cast between function pointers and integers won't actually get through the compiler, so we'll have to cheat even harder
      return reinterpret_cast<uintptr_t>(f) - reinterpret_cast<uintptr_t>(main);
    }
    
    rpc_t decode(uintptr_t x) {
      return reinterpret_cast<rpc_t>(x + reinterpret_cast<uintptr_t>(main));
    }
    

    This defeats address randomization so long as we have one big statically linked executable.

    Number 2. We can actually use templates and static initialization to tag lambdas with integer ids. It just requires that the compiler use the same order of initialization of statics for each executable instance. The language won't guarantee this, but it works.
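
    A minimal sketch of that static-initialization idea (shown with a plain function rather than a lambda; the assumption, as stated, is that every executable instance runs its static initializers in the same order):

    #include <cstdint>
    #include <vector>

    typedef void(*rpc_t)();

    // id -> function table; identical on every rank if static initialization order matches
    std::vector<rpc_t> &rpc_registry() {
      static std::vector<rpc_t> table;
      return table;
    }

    std::uint32_t register_rpc(rpc_t f) {
      rpc_registry().push_back(f);
      return static_cast<std::uint32_t>(rpc_registry().size() - 1);
    }

    void say_hello() { /* ... body to run remotely ... */ }

    // runs at static initialization, assigning say_hello the same id on every rank
    static const std::uint32_t say_hello_id = register_rpc(&say_hello);

    // the integer id travels on the wire; the receiver decodes it back to a function pointer
    rpc_t decode(std::uint32_t id) { return rpc_registry()[id]; }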

    I think that relying on gasnet to support AM's well is fine. They know they have to fix that. But even if AM mediums of decent size don't scale, sending only really small AM's (like just a few words) would still work. RPC's that don't fit into these small AM's can be sent with a rendezvous protocol. Let's consider RPC implementation not an issue.

  11. BrianS reporter

    Here is a possible set of compromises:

    1. symmetric data segment assumption. This might require a restriction on memory Kind (we cannot likely assume the GPU has the same data segment layout as a CPU). The tricky part is that we would need the UPC compiler technology to register a location in the data segment with GASNet. This can be the "lazy proxy" as outlined in the earlier link I provided, but that might need compiler help. Perhaps there is a way we can have the C++ compiler perform the requisite processing to land a data segment in the GASNet data region. That would give us the static holders and a very scalable global pointer building capability.

    2. For heap objects, we can use lazy evaluation of remote addresses. Instead of strict team scoping to determine the context we can try different designs. I'd like a richer space of bootstrapping mechanisms. Some codes are pretty static, and simple codes might like a "Hello World" based on file-scope data (not sure this looks pretty under the covers though). But we could go with the library maintaining an explicit cache for users to register names with objects when they allocate them, then using that string or other key in the global_ptr constructor. We need to provide enough machinery to overcome the lack of compiler/linker tricks to get unique global variable symbols.

  12. Former user Account Deleted

    @bvstraalen, I would like to see the opposite to a richer space of bootstrapping mechanisms. The single-assignment, distributed, sparse, cached, key-value store based on RPC I named dist_map above should cover all needs. The keys can be meaningful application specific identifiers, and the values can be global_ptr's or any other locally generated content.

    // In this example we aren't going to use the key of the map since we'll only have one item per rank.
    // So arbitrarily let's let the key type be int, and the agreed value zero. We would probably introduce
    // another data structure for this case that drops the key, assuming exactly one item per rank.
    dist_map<int,global_ptr<char>> my_map;
    
    // everyone allocates a buffer in their segment and shares it
    my_map.put(/*key*/0, upcxx::allocate<char>(1<<20/*1MB*/));
    
    // go fetch my neighbor's buffer (wrapping around at the last rank)
    future<global_ptr<char>> p = my_map.get((upcxx::rank_me()+1) % upcxx::rank_n(), /*key*/0);
    

    That's pretty straightforward! Enhancing the global_ptr constructor to do interesting lookups via strings or otherwise sounds really gross.

    The whole point of my previous post was to steer us away from tackling symmetric anything.

    1. It's a lot of work. Needs a custom allocator for the two-sided stack/heap segment (if we care about teams).

    2. It benefits us only on platforms with aligned segments (how common is that?).

    3. It requires semantic restrictions on how we present teams.

    4. For the case when the user wants to immediately dereference the global_ptr on a machine that isn't aligned, it can't merge the global_ptr lookup with the dereference like the RPC can.

    That's a lot of strikes against pursuing symmetry.

    The UPC proxy pointers page you shared roughly describes what UPC++ is doing with shared_var and shared_array during upcxx::init(). This is the cause for their restriction to file-scope declarations. I think we should start with something untethered by restrictions and see if we need more later. dist_map is unrestricted and easy to use.

  13. Amir Kamil

    @jbachan That's essentially what Titanium does, though IIRC the problem was trickier since it wasn't just for the code segment. Also, shouldn't static_cast be sufficient to convert between a function pointer and uintptr_t?

  14. BrianS reporter

    The one downside I can think of is debugging, but that is already a pain in UPC or Titanium. At least this method, if dist_map is not too complicated internally, would let you attach a debugger like gdb or ddt to a parallel job and reason about local pointers and remote keys.

    When the future<global_ptr> is ready, would it then hold the remote virtual address?

    We still have the problem of safe upcast. A user creates a data structure with our team allocator (the T* inside a BaseFab, or the elements inside a std::vector). These then need to become valid targets of a remote put or get call.

  15. Dan Bonachea

    I'm chiming in to advocate tabling the node segment-table scalability issue (John's dist_map) for now.

    First I should note that the aligned-segments assumption from GASNet-1 is effectively dead (for data segments, and even code segments on modern kernels). By which I mean we should really design for the assumption that code and data segment base addresses provided by the OS in virtual memory are different on every node. I think John's proposal for encoding/decoding lambda function pointers as offsets into the code segment makes sense and should be sufficient for our purposes (however, I'll note that scheme will probably break if the lambda code can appear in dynamically-loaded libraries, but we should probably prohibit that for many reasons). Note that GASNet AM handlers are already specified using an integer index (not an address), so they have never suffered from this issue.

    From here on I'm talking about handling shared data segment addresses.

    GASNet-EX will introduce the ability to perform communication where you specify the remote memory by providing an offset into a remote segment (similar to MPI-RMA on a window created with MPI_Win_allocate). So for example you can call GASNet-EX put with a local source virtual address, and a remote endpoint index and size_t offset into the remote segment. Assuming UPC++ goes with a scheme where shared arrays are placed in a symmetric segment, this allows the runtime to access remote copies using only the local offset + remote rank id. Carving up the memory within the shared segment is still the client's business, so John's proposal for managing a symmetric data segment (similar to how UPCR handles the same problem) is still entirely UPC++'s purview (however as he also noted, this may have some negative implications when combined with team allocation).

    GASNet clients of course are not forced to use offset-based addressing, but there are several motivations for using them to handle the node segment-table scalability problem within GASNet:

    1. We solve the problem once and potentially benefit multiple GASNet clients.
    2. We centralize the algorithm so that any cross-over between a dense and sparse representation can be made in one place, with full knowledge of the system and conduit infrastructure.
    3. Some GASNet conduits will internally need some representation of the node segment-table, and letting the client use GASNet's copy avoids duplicating the table in the client runtime.
    4. Some conduits will be capable of transmitting offsets directly on the wire, potentially avoiding the need to ever instantiate the remote virtual address on the initiator (ie no table needed at all).

    Of course this only solves your problem if you go with a symmetric heap, at least for storing shared arrays. Also if you do allow construction of global_ptr to a shared array element, then you may need one bit in the representation to encode whether the embedded address is a segment offset (ie into a shared array) or a virtual address (pointing at shared memory in a non-symmetric segment) - however, I believe you will probably need the offset representation for efficient communication using GPU memory anyhow (and probably also a field for device id).
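
    To make that last point concrete, a rough sketch of such a representation (field names and widths are purely illustrative):

    #include <cstdint>

    template<typename T>
    struct global_ptr {
      std::uint32_t  rank;           // owning rank / endpoint index
      std::uint16_t  device;         // device id for GPU memory kinds (0 = host)
      std::uint16_t  is_offset : 1;  // 1: `raw` is an offset into the remote segment
                                     // 0: `raw` is a remote virtual address
      std::uintptr_t raw;            // offset or virtual address, per is_offset
    };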

  16. BrianS reporter

    If a global_ptr is represented as (rank, endpoint, offset) then we would map to GASNet-EX well, but we still have the bootstrapping problem. You need a way to turn a local symbol or key into a remote endpoint and offset.

    Perhaps there is only one GASNet segment, so the number of endpoints = 1 in UPC++? Then perhaps the offset can be calculated instead of looked up? The endpoints proposal in GASNet-EX seems to refer to a "team" as the natural holder of an endpoint, so perhaps the user can specify what "team" a shared_array or remote object is associated with, and thus provide a unique endpoint.

    If the user can specify the [rank, team] of their target global_ptr, then there is the question of whether we accept an offset they compute for us (I can see cases where an application would know the remote offset by how its algorithm is constructed), or whether UPC++ provides the dist_map to do the one-time lookup, probably batched.

    How do GPUs figure into endpoints?

  17. Former user Account Deleted

    I don't think it's wise for us to use the offset feature of GASNetEx. Conduits tend to support virtual addresses natively, not offsets. GASNetEx is giving us the offset abstraction at the expense of the additional overhead of maintaining the unscalable segment address table (only on segment-aligned machines will this overhead disappear). Only programming models leveraging data-structure symmetry have a need for offsets, in which case it makes sense for GASNetEx to support them since it can leverage conduit-specific optimizations in implementing the unscalable parts. We aren't pursuing symmetry now, and virtual addresses are generally faster, so our global_ptrs should use those. Can a GASNet'er (@PHHargrove, @bonachea) confirm my suspicion that virtual addresses are indeed the more natural choice on most conduits?

    I suggest we move the discussion of multiple segments and endpoints to its own thread. That is completely new territory for upc++ and will require a lengthy discussion.

  18. Paul Hargrove

    There are some conduits where the native interfaces use offsets and others that use virtual addresses, so GASNet-EX will sometimes need to maintain tables on some systems regardless of which you choose. For a language w/ symmetric allocation (where global pointers would use offsets naturally), one can use GASNet-EX's offset-based addressing to avoid the round-trip offset->va->offset on offset-based networks (e.g. Berkeley UPC on BG/Q is doing the round-trip right now w/ GASNet-1). The hope is that at scale GASNet-EX will do this scalably (on demand the first time it is needed), though for small runs we'd likely build the tables at initialization.

  19. Former user Account Deleted

    Thanks Paul. I assume gasnetex provides a macro telling us which addressing mode the underlying conduit uses. With that, we can make global_ptr always use either offsets or virtaddrs (no need for a bit to indicate which) internally, and then ifdef the corresponding put/get API calls for that addressing mode. From the outside, global_ptr's can always be constructed (upcast) from user-specified virtaddrs (in which case we'll subtract off the local segment base if the conduit is offset-based), or returned by our in-segment allocator, which will just know to produce addresses or offsets in global_ptrs per the conduit. If we allow users to ask a global_ptr for its virtual address, then we may need gasnet's help on offset-based conduits to do the translation.
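
    A small sketch of that upcast (the macro and helper functions are hypothetical):

    #include <cstdint>

    std::uintptr_t local_segment_base();  // hypothetical: this rank's shared segment base
    int rank_me();                        // hypothetical

    template<typename T>
    struct global_ptr { int rank; std::uintptr_t raw; };  // raw is an offset or an address, per the conduit

    template<typename T>
    global_ptr<T> to_global_ptr(T *p) {
      std::uintptr_t bits = reinterpret_cast<std::uintptr_t>(p);
    #if defined(UPCXX_OFFSET_ADDRESSING)  // hypothetical macro set on offset-based conduits
      bits -= local_segment_base();       // store the offset into this rank's segment
    #endif
      return {rank_me(), bits};
    }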

  20. BrianS reporter

    Let's take few simple cases and see what interface would make sense.

    1. Each rank has a Vector<T> that it is working on. Vector<T> is using an allocator from the global team. How do we put and get values in other teams? What if the Vector is file-scoped? If the Vector is a static member of a class? If the Vector is on the stack? If the Vector is a member of an object on the stack? If the Vector is a member of a heap-allocated object that used the allocator for the Vector but was not itself allocated with our allocator?

    2. A team of ranks creates a Vector<T>. The Vector itself is placement-built in the shared allocator's memory. Other teams wish to issue put/get operations into this Vector with VIS interface calls. The shared-memory allocator should allow the ranks in the team to use load/store semantics if possible.

    3. Each rank has a Vector<array_ref<T>> and a Vector<shared_ptr<T>>, where array_ref describes the shape of the data, the shared_ptr is the actual memory allocated with the global team allocator, and the shared_ptrs have been given a Deleter that returns the memory to our library. Perhaps the Vector<array_ref<T>> is known to remote ranks by construction of the algorithm, or they need some form of scatter or lazy scatter. For now, assume the dimensions of the global_array_ref are knowable, as well as where they would land in each rank's Vector.

    I can map most of this to use cases in AMReX

  21. Former user Account Deleted

    I don't know what "an allocator from the global team" is. There is the rank-local allocator that allocates in this rank's segment, and there is our brand-new node-local allocator that pulls from a shared memory window between on-node processes. Note that the node-local allocator returns a special pointer type distinct from vanilla T* or global_ptr<T>, let's call this new pointer type shm_ptr<T>. It must be this way since global_ptr does not conform to the "pointer" concept needed by std algos, and T* don't work in shm windows. global_ptr<T> will be upcastable from both T* and shm_ptr<T>.

    Also, the rank-local allocator might need two interfaces: one that returns T* and another returning global_ptr<T>. The reason for the T* flavor is so that it can be used by std containers (which can't use global_ptr) such that the user is allowed to share the addresses of container elements with remote peers for puts/gets.

    1. The Vector is using the rank-local T* allocator, so the user may take the address of any element in the vector, upcast it to a global_ptr, and put that in a dist_map named by some key. Peers fetch that global_ptr via a matching key. (A compact sketch of this recipe follows this list.)

    2. A node-local team creates a vector backed by the node-local allocator. Taking the address of a vector element will return shm_ptr<T>, which is upcastable to global_ptr. Again the user shares the global_ptr in a dist_map.

    3. Same basic recipe. Allocate buffers in the segment. If remote peers know how to derive strided views of those buffers then we can just share the global_ptr<T> of the buffer via a dist_map of type dist_map<Key,global_ptr<T>>. If peers don't know how to set up the array view, then the producer creates a global_array_ref<T> and puts that as the value in a dist_map<Key,global_array_ref<T,Dim>>. Again some application-specific key is always needed for matching.
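
    A compact sketch of recipe 1 (the allocator below is just a stand-in for the rank-local, T*-returning allocator; dist_map usage as sketched earlier):

    #include <cstddef>
    #include <vector>

    // Stand-in: a real version would carve T's out of this rank's registered shared segment.
    template<typename T>
    struct segment_allocator {
      using value_type = T;
      T *allocate(std::size_t n) { return static_cast<T*>(::operator new(n * sizeof(T))); }
      void deallocate(T *p, std::size_t) { ::operator delete(p); }
      bool operator==(const segment_allocator &) const { return true; }
      bool operator!=(const segment_allocator &) const { return false; }
    };

    template<typename T>
    struct global_ptr { int rank; T *addr; };  // minimal stand-in

    int main() {
      std::vector<double, segment_allocator<double>> v(1000);

      // take the address of any element and upcast it to a global_ptr...
      global_ptr<double> gp{/*rank_me()*/0, &v[42]};

      // ...then publish it under an app-specific key so peers can target it with
      // one-sided puts/gets, e.g.:  my_map.put(Key{...}, gp);
    }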

  22. Dan Bonachea

    I think the question of the right internal representation for global_ptr<T> to maximize performance and scalability (possibly not both at once) is still very much open, and the topic of much of the discussion above. We might need to rename this issue.

  23. Paul Hargrove

    Since the Memory Kinds work to be delivered in Mar 2019 will almost certainly involve some changes to global_ptr<T>, I believe that is the most practical time frame to resolve this issue. It is also possible (but not certain) that GASNet-EX will begin to provide offset-based addressing in the same timeframe.
