UPC++ team object creation

Issue #27 resolved
Mathias Jacquelin created an issue

We are going to support team shared memory allocation. I am wondering whether we would like to provide a utility function for doing a placement new on a single rank of a team for shared memory object creation.

The tricky parts of placement new or destruction in a team are progress as well as first touch.

A possibility is to make the allocate collective… I don't see a strong case against having a collective allocate within a team.

Another option is that, for the single-item new, you could specify which rank does the allocation and executes the new. But then that rank might need to call delete, which also means delete needs to be collective, or we need to provide a way to synchronize (a future).

As we are trying to minimize blocking operations, and even barrier returns a future in the spec, I think that in practice it should be OK for any member of the team to perform the constructor or destructor… as long as just one rank does it.

What do you guys think?

Comments (7)

  1. Former user Account Deleted

    From Team.tex, under shared memory:

    global_ptr<T> allocate(team&);
    global_ptr<T[]> allocate(size_t, team&);
    

    This first form has a straightforward implementation. Each call round-robins over team ranks to pick which segment to allocate from. Each segment allocator is protected by a pthread mutex in the shared memory window. How portable is that?
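
    As a hedged sketch only (the struct layout and names below are illustrative, not the actual runtime), such a segment allocator could be a bump pointer guarded by a process-shared pthread mutex living at the base of each cross-mapped segment; PTHREAD_PROCESS_SHARED is precisely where the portability question lies:

    #include <pthread.h>
    #include <cstddef>

    struct segment_header {   // hypothetically placed at the base of each shared segment
      pthread_mutex_t lock;   // must be PTHREAD_PROCESS_SHARED to work across ranks
      std::size_t     used;   // bump-allocation offset from the segment base
      std::size_t     capacity;
    };

    void segment_header_init(segment_header *h, std::size_t capacity) {
      pthread_mutexattr_t attr;
      pthread_mutexattr_init(&attr);
      pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
      pthread_mutex_init(&h->lock, &attr);
      pthread_mutexattr_destroy(&attr);
      h->used = sizeof(segment_header);
      h->capacity = capacity;
    }

    // Carve nbytes out of the segment headed by h; any rank that has this
    // segment mapped may call it. Alignment handling omitted for brevity.
    void *segment_alloc(segment_header *h, std::size_t nbytes) {
      void *result = nullptr;
      pthread_mutex_lock(&h->lock);
      if (h->used + nbytes <= h->capacity) {
        result = reinterpret_cast<char*>(h) + h->used;
        h->used += nbytes;
      }
      pthread_mutex_unlock(&h->lock);
      return result;
    }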

    The second form is dangerous to expose. The user likely wants the array elements to be evenly scattered over the segments in the team, but we can't do that since arrays have to be contiguous. The array will have to be allocated all in one segment. I think many users will be misled and get hard-to-debug allocation failures because rank 0's segment is full. We could fix this misunderstanding by making the peer segment explicit:

    global_ptr<T> allocate(team&, intrank_t peer);
    global_ptr<T[]> allocate(size_t n, team&, intrank_t peer);
    

    This forces the round-robin on the user but is free of the misinterpretation that arrays are magically scattered. Given the questionable portability of locks in shm and the fact that it can't just do the most useful thing in an easy way, I'm beginning to dislike this idea.

    We've been dodging spec'ing dist_array since its operator[] isn't scalable without incurring comm, but I think that's the right tool here. We could make the dist_array implementation know to just eagerly cache the base-address global_ptr's of on-node peers. But this still incurs runtime checks for each element lookup of whether it's node-local or remote. Given that truly global block-cyclic arrays aren't really that great, we should make this class work for node-local teams only. Now we can omit the checks.

    #include <vector>

    template<typename T>
    struct dist_local_array {
    private:
     team &_team;            // the node-local team this array is spread over
     std::vector<T*> _bases; // base pointer of each peer's chunk, indexed by team rank
    public:
     // Truly collective: upon return every rank has created this array and
     // all elements are accessible through plain T& references.
     dist_local_array(team &t, size_t n): _team(t) {
      // precondition: the team must be entirely on-node
      // 1. every rank allocates its chunk from its own shared segment
      // 2. shares its base pointer with its peers as a global_ptr
      // 3. converts the peers' global_ptr's to T* and stores them in _bases
     }
    
     // Plain cyclic distribution (should this be block-cyclic instead?):
     // element i lives in the chunk of team rank i % rank_n.
     T& operator[](size_t index) {
      intrank_t rank_n = _team.rank_n();
      return _bases[index % rank_n][index / rank_n];
     }
    
     T const& operator[](size_t index) const {
      intrank_t rank_n = _team.rank_n();
      return _bases[index % rank_n][index / rank_n];
     }
    };
    

    And when I say on-node or node-local, I really mean sets of ranks that have shared memory access to each other's segments. On conduits that don't support this, the only supported team becomes the singleton team of self.

  2. BrianS

    We do the check for the allocation at allocate time and fail if it can't be done; after that there is no runtime checking. If every member of the team can downcast, then allocation works. In my version, the call is not collective and the user decides the placement. If it were collective, then I would still ask for a designated rank to be the allocator. This is also why I do not provide new/delete: I do not want to control who calls constructors, since that likely determines first touch and placement.

    dist_array is expected to work across nodes. It is distributed. dist_local_array mixes the two concepts; in fact, the terms are antonyms.

  3. BrianS

    This allocate is just granting global_ptr the same powers as a Boost shared-memory pointer. If people make heavy use of this feature and the global_ptr augmentation needed to make ::local() work impedes performance for critical applications, then we have a case for GASNet to provide a common mmap address for our ranks. Without this feature in play, we will remain without either chicken or egg.

  4. Dan Bonachea

    "If people make heavy use of this feature and global_ptr augmentation to make ::local() impedes performance for critical applications then we have a case for gasnet to provide common mmap address for our ranks"

    Cross-mapped PSHM segments are a case where GASNet fundamentally cannot provide aligned virtual memory segment addresses for each peer; i.e., virtual address X can only be the segment base pointer for one of the PSHM peers, and the others have to be mapped in at different VM locations.

    However, I'll once again reiterate that GASNet-EX will provide offset-based addressing for naming payload addresses (e.g. in put/get/etc.), so if distributed arrays are allocated from a symmetric shared-memory heap (as in UPC) then a single offset can represent the location of the array in every segment, regardless of where those segments are mapped in VM space.

    However, this approach probably requires a global_ptr representation that can store an offset instead of a raw address (which I've been arguing for all along, as it will likely be mandatory for GPU support). The main benefit of this design is that it ensures there is at most one potentially-nonscalable baseptr table per segment (instead of one per object as suggested above), and it lives inside GASNet, which might already need it for other purposes anyhow. On some platforms that can operate on offsets natively, there may be no table required anywhere.
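
    As a hedged illustration only (this is not the actual GASNet-EX or UPC++ representation; all names below are made up), an offset-based global pointer could carry a (rank, segment offset) pair and be localized through a single per-segment base-address table, rather than one base-pointer table per distributed object:

    #include <cstdint>
    #include <vector>

    using intrank_t = int;

    // Hypothetical runtime state: base virtual address of each segment this
    // process has mapped (e.g. its cross-mapped PSHM peers); nullptr if a
    // rank's segment is not mapped locally. One table per process, shared by
    // every pointer and every object.
    std::vector<char*> segment_base;

    template<typename T>
    struct offset_global_ptr {
      intrank_t      rank; // which rank's segment holds the object
      std::uintptr_t off;  // byte offset of the object within that segment

      // Localize: valid only if the target segment is mapped in this process,
      // regardless of the virtual address it happens to be mapped at.
      T* local() const {
        char *base = segment_base[rank];
        return base ? reinterpret_cast<T*>(base + off) : nullptr;
      }
    };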

  5. Former user Account Deleted

    @bvstraalen

    Was my guess at your intended implementation correct, that it is non-collective and internally round-robins over which of the peer segments to victimize?

    The intended use case for team allocation is unifying replicated metadata, isn't it? If so, do you agree that the array form of allocate is misleading, since it can't spread the elements over the ranks' segments?

    Your version does no runtime checking after allocate because you only went as far as spec'ing allocation. We have to consider that what the user wants is memory savings for replicated data, so they would appreciate it if we actually provided them a whole data structure. A regular dist_array would be a pain to work with because its operator[] has to return a global_ptr. The good thing about my dist_local_array is that it presents T&'s, so users get element access just like with a regular array. It's also built on top of core upcxx, so it doesn't require fancy support from our segment allocator. Win and win.
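
    As a hypothetical usage sketch, assuming the constructor steps of the dist_local_array above were actually implemented (the element type, team, and function name here are illustrative):

    void fill_metadata(team &node_team, size_t n) {
      dist_local_array<double> meta(node_team, n); // collective over the node-local team
      if (node_team.rank_me() == 0)
        meta[0] = 3.14;                            // plain T& element access, no global_ptr
    }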

  6. BrianS

    The array form can be removed. It makes construction ambiguous.

    dist_local_array would be a form of shared vector; we could just call it upcxx::shared_vector. It would need a way to fail at construction if the team cannot provide a T& to every team member (no implicit communication).

  7. Dan Bonachea

    I believe this issue is resolved in Spec 1.0 draft 2.

    Shared memory allocation is performed explicitly and non-collectively, and the allocation is placed in the shared segment with affinity to the calling rank. This memory is shared with all ranks and is load/store accessible by members of upcxx::local_team(). upcxx::broadcast, upcxx::dist_object, and other similar mechanisms are available to conveniently publish a global_ptr to the new object across the members of any team.
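
    A minimal sketch of that pattern against the UPC++ 1.0 API (the int payload and the choice of local_team() as the publishing team are just for illustration): rank 0 of local_team() creates the object non-collectively in its own segment, publishes the global_ptr with a broadcast over the team, and each member downcasts it with .local():

    #include <upcxx/upcxx.hpp>
    #include <cassert>

    int main() {
      upcxx::init();

      upcxx::global_ptr<int> gp;
      if (upcxx::local_team().rank_me() == 0)
        gp = upcxx::new_<int>(42);       // placed with affinity to the calling rank

      // Publish the pointer to every member of the local team.
      gp = upcxx::broadcast(gp, 0, upcxx::local_team()).wait();

      int *p = gp.local();               // load/store access within local_team()
      assert(*p == 42);

      upcxx::barrier(upcxx::local_team());
      if (upcxx::local_team().rank_me() == 0)
        upcxx::delete_(gp);              // the creating rank frees it exactly once

      upcxx::finalize();
    }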
