Memory kinds (and device memory) support for UPC++

Issue #30 resolved
Khaled Ibrahim created an issue

Memory kind support is still under discussion. In the following text, I will try to summarize the possible scenarios that were discussed by Paul, Dan, Bryce, and Khaled. I hope we can reach agreement on these issues so that the APIs can be crafted accordingly.

A- For shared data:

Who allocates, and how is data managed?

• Option 1: UPCXX manages device memory, allocating a memory arena from which smaller allocations can be serviced. This is consistent with the current main memory support, where the shared state allocation is solely managed by the UPCXX runtime. The allocation function would need to be overloaded with versions that take additional arguments for the memory kind and possibly the domain, if device memory is split between multiple domains.

• Option 2: The user allocates device memory, then uses an up-cast function to create a global pointer to the device memory. This avoids adding complexity to the current allocation function. It may create a challenge if the application stresses the UPCXX runtime by making many small device allocations and trying to upcast all of them. Each of these allocations could potentially be treated as an individual segment, right?

• Option 3: Host data allocation is handled differently from other memory kinds (devices), e.g., UPC++ manages only one kind of shared memory (host) and the application manages the rest. This option is equivalent to no support for other memory kinds, which is the original UPC++ support level.

In all cases, the shared segment metadata could carry the kind information, which may be expensive to query using, for instance, current-generation NVIDIA system calls (up to multiple microseconds on the summitdev POWER8/NVIDIA GPU system under high concurrency).

The use of the upcast method is restricted to UPCXX predefined/supported memory kinds.

B- For private data:

This is a bit more complicated because typically the application is free to use any allocator. The question is how private data are presented to communication APIs.

• Option 1: A private pointer must first be upcast to a global pointer before being passed to the data transfer APIs. In this case both the src and destination of put/get are global pointers. The current APIs do not follow such a signature, probably to avoid confusing users about third-party initiation of data transfers.

• Option 2: An additional kind argument will be added to communication APIs put/get to disambiguate private pointers. This approach is likely to impact the signatures of all APIs accepting private pointers.

• Option 3: Private pointers are required to satisfy UVA (unified virtual addressing), relying on a system facility for disambiguation. The UPCXX API could internally query the private pointer type, with the possibility of caching per-page data to avoid the query cost for non-OS-managed (migratory) pages. In such a case, if a device does not support UVA, it will not be supported! (This solution is not favored by multiple people in the group, although it is implicitly assumed in the MPI case.)

The choice between these alternatives has multiple consequences for performance, consistency with the rest of the design, and complexity of the API. Please feel free to add options and to clarify the weaknesses or strengths of the various options.

Comments (11)

  1. Dan Bonachea

    My current thoughts on UPC++-level memory kinds semantic design:

    Goals:

    • independent of CUDA - extends naturally to other devices (eg memory-mapped files)
    • independent of UVA (eg upcast passes a device handle)

    Approach:

    • global_ptr is the main data abstraction for users to deal with memory kinds in UPC++
    • With this change, global_ptr gains new responsibilities. Any global_ptrs can now be:
      • global_ptr to main memory shared heap (as currently)
      • global_ptr to device memory, with caveats ….
        • TBD: how the device memory is allocated by the software layers (needs to include device identification)
        • TBD: how to “bless” regions of device memory as “shared” (possibly part of allocation)
    • Implementation: new field in global_ptr contains device id

    Casting

    • there are downcasts to convert global_ptr -> (device_ptr, device_id)
      • this is a new method on global_ptr
      • device_ptr is a void * with device-specific encoding (eg CUDA ptr)
    • there are upcasts to convert (device_ptr, device_id) -> global_ptr
      • probably a new global_ptr constructor
      • possibly allow global_ptr.local() for UVA CUDA devices
    • global_ptrs to non-UVA devices would return !is_local(), even for a “local” GPU, because the calling process lacks load/store access to the memory

    Data transfers

    • These are the workhorse for moving data across spaces that may include devices
    • They transparently invoke device-specific transfer mechanisms on one or both sides as appropriate
      • ie cudaMemcpy() for the immediate milestone
      • eventually GPUDirect once we have GEX-level support

    RMA-Like Data transfers (design 1):

    template <typename T, typename Completions = decltype(operation_cx::as_future())>
    RType upcxx::rput(global_ptr<T const> src, global_ptr<T> dest, std::size_t count, Completions cx=Completions{});
        Precondition: local_team_contains(src.where())

    template <typename T, typename Completions = decltype(operation_cx::as_future())>
    RType upcxx::rget(global_ptr<T const> src, global_ptr<T> dest, std::size_t count, Completions cx=Completions{});
        Precondition: local_team_contains(dest.where())
    
    • Preconditions above require affinity to local team
    • These could possibly be strengthened to require affinity to calling rank:
      • ie ptr.where() == upcxx::rank_me()
    • PROBLEM: Applications need to encode site-specific information about the process-to-device connectivity within a node
      • ie the restrictions on what constitutes a valid "local side" of the RMA
      • some systems may give all cpus uniformly efficient access to all devices on the node, others may not

    Shared-Memory-Like Data transfers (design 2):

    template <typename T, typename Completions = decltype(operation_cx::as_future())>
    RType upcxx::copy(global_ptr<T const> src, global_ptr<T> dest, std::size_t count, Completions cx=Completions{});
        Precondition: none!
    
    • Unified function allows copying between arbitrary devices anywhere on the system
    • Obviously would specialize the implementation for various cases of locality
    • Also provide overloads where src or dest is a void * that corresponds to main memory that may be in the private heap
    • PRO: user needn’t worry about the details of device connection for correctness
      • Implementation stores the device connectivity map for the local team and uses it to make decisions
  2. Dan Bonachea

    Introduction

    Below is a strawman spec proposal for UPC++ handling of GPU memory allocation, based on the high-level consensus from our Monday discussion (Dan, Paul, John, Max). I've also transcribed some of the meeting notes here for context.

    Overview

    This is a conservative initial design. Our goal is to deliver something correct and NON-performant in six weeks, that can be seamlessly improved in future releases without breaking user code.

    Design Goals include:

    • Correctness
    • Portability across devices and networks
    • Don’t rely upon UVA, Managed Memory, or network-specific features like ODP

    Primary restriction:

    User needs to create “one big segment” on the GPU at startup

    • Either the user tells us how big and we cudaMalloc(big_sz)
    • or the app does cudaMalloc(big_sz) and gives us the base address and size at startup

    This is primarily to ensure that GASNet-EX has the information necessary to establish efficient offload access to objects in this GPU segment.

    Other details

    • All mem copies that might involve GPU memory use a separate upcxx::copy() call
      • Automatically handles all cases of local-vs-remote and host-vs-device for both sides of the copy
      • Might need to include some query overheads
      • but hopefully all the relevant device info is cached in the global_ptr representation

    Future releases

    • Will use GASNet-EX memory kind support to offload copies involving devices and the network to GPUDirect and similar offload features
    • Maybe someday:
      • Relax the primary restriction and allow any cudaMalloc’d object to be "blessed" to a global_ptr without any pre-registration
      • This relaxed approach would probably only be “fast” for:
        • new Mellanox InfiniBand cards with ODP support and GPUs with UVA support
        • Notably does NOT cover current Cray Aries systems, and possibly future ones (no info yet about Slingshot support)
        • Probably also excludes ECP Aurora - Intel network, so probably also not Mellanox compatible

    Strawman spec proposal for UPC++ handling of GPU memory allocation

    namespace upcxx {
    
    class MemoryKind {
      // abstract base class
      protected:
      virtual void *allocate_segment(std::size_t sz_in_bytes) = 0;
      virtual void release_segment(void *base_ptr, std::size_t sz_in_bytes) = 0;
      // possibly other stuff..
    };
    
    class CUDAKind : public MemoryKind {
      // DefaultConstructible, MoveConstructible, Destructible
    
    public:
      CUDAKind();
      // default constructor for an invalid object
    
      CUDAKind( int device_id ); // possibly other args TBD
      // This is a collective call over world() that establishes UPC++ connection 
      // from the calling process to a given CUDA device.
      // Usually called once per device at startup.
      // Probably include special arguments to allow a given process to "opt-out",
      // if it does not want to open any GPUs (eg if processes exceed devices), 
      // resulting in an invalid CUDAKind object on that process.
    
      void destroy(); // collectively destroy this MemoryKind object,
                      // invalidating any connected objects
    
      ~CUDAKind(); // prereq: destroy has been called or the library is uninit
    
      friend class DeviceAllocator;
    };
    
    class DeviceAllocator {
      // DefaultConstructible, MoveConstructible, Destructible
    public:
      DeviceAllocator();
      // default constructor for an invalid object
    
      template<typename MemoryKind>
      DeviceAllocator(MemoryKind &kind, std::size_t sz_in_bytes); 
        // allocate a sz_in_bytes segment on the device and
        // create an allocator object to service requests for space in that segment
        // destructor will free the device segment
    
      template<typename MemoryKind>
      DeviceAllocator(MemoryKind &kind, void *device_ptr, std::size_t sz_in_bytes);
        // as above, but accepts a pointer to a device segment allocated by the user
        // This object "owns" the entire contents of that segment until destruction,
        // when ownership returns to the user.
    
      ~DeviceAllocator();
        // works even after library has been uninit
    
      template<typename T, typename ...Args>
      global_ptr<T> new_(Args &&...args);
      template<typename T, typename ...Args>
      global_ptr<T> new_(const std::nothrow_t &tag, Args &&...args);
      template<typename T>
      global_ptr<T> new_array(std::size_t n);
      template<typename T>
      global_ptr<T> new_array(std::size_t n, const std::nothrow_t &tag);
      template<typename T>
      void delete_(global_ptr<T> g);
      template<typename T>
      void delete_array(global_ptr<T> g);
      template<typename T, std::size_t alignment = alignof(T)>
      global_ptr<T> allocate(std::size_t n=1);
      template<typename T>
      void deallocate(global_ptr<T> g);
      // These work exactly analogously to their shared-segment counterparts,
      // except they allocate device memory from the segment managed by this object.
      // Note all variants operate on global_ptr (no void * variants).
      // The returned global_ptrs to device memory have .where() equal to upcxx::rank_me(),
      // but .is_local() is not guaranteed to return true on ANY process.
    
    };
    } // namespace
    

    Example use:

    #include <upcxx/upcxx.h>
    
    upcxx::CUDAKind gpu_device;
    upcxx::DeviceAllocator gpu_alloc;
    
    int main() {
       upcxx::init();
    
       gpu_device = upcxx::CUDAKind( 0 ); // open device 0 (or other args TBD)
       std::size_t segsize = 256*1024*1024;
       gpu_alloc = upcxx::DeviceAllocator(gpu_device, segsize); // allocate a 256MB GPU segment
    
       // alloc an array of 1024 doubles on GPU and host
       upcxx::global_ptr<double> gpu_array = gpu_alloc.new_array<double>(1024);
       upcxx::global_ptr<double> host_array = upcxx::new_array<double>(1024);
    
       // copy data between them
       upcxx::copy(host_array, gpu_array, 1024).wait();
    
       upcxx::finalize();
    }  
    
  3. Amir Kamil

    We discussed Thrust compatibility on Slack, and I'm adding some relevant details here.

    Thrust currently provides a single STL-like container: device_vector<T, Alloc>. While we don't intend to write our own adapter layer for interoperating with device_vector at this time, we also don't want to come up with a design that precludes someone else from writing an adapter. However, there is a fundamental limitation on Thrust's vectors: while they are parameterized on allocator type, they do not provide constructors that take in allocator objects. This requires the allocator type to be DefaultConstructible and effectively stateless, which isn't something we can support.

    Thrust's vectors are implemented as deriving from thrust::detail::vector_base<T, Alloc>, which uses thrust::detail::contiguous_storage<T, Alloc> as its underlying storage abstraction. The latter does have constructors that optionally take in allocator objects. Thus, I believe it would only require straightforward modifications to device_vector and vector_base to support allocator objects: constructors that take in an allocator object and pass it to the constructor for contiguous_storage.

  4. Dan Bonachea

    @akamil wrote:

    there is a fundamental limitation on Thrust's vectors: while they are parameterized on allocator type, they do not provide constructors that take in allocator objects. This requires the allocator type to be DefaultConstructible and effectively stateless, which isn't something we can support.

    I agree this seems like an oversight in the design of Thrust device_vector constructors that should probably be fixed on their end.

    On a related topic, how does thrust::device_vector select the device for placement in multi-device processes? I assume it's relying (perhaps implicitly) on the CUDA context provided by cudaGetDevice for the constructing thread.

    Given we only plan to support one upcxx::DeviceAllocator per device (per process), someone writing an adapter could potentially workaround the "stateless" allocator problem you mention by writing a DefaultConstructible adapter type whose allocation method calls cudaGetDevice to retrieve an index used to lookup the real upcxx::DeviceAllocator object in a process-wide table.

  5. Amir Kamil

    I was unable to find any Thrust documentation on multiple devices. However, I did find an old StackOverflow post that says you can call cudaSetDevice to set the device before making Thrust calls, which implies that it does use cudaGetDevice under the hood.
