device_allocator constructor spec has several problems

Issue #190 resolved
Dan Bonachea created an issue

There are several problems with the current spec of the two non-default device_allocator<Device> constructors:

Dualing constructors

The most fundamental problem is we specify there are two non-default constructors - one that allocates a segment ("allocating constructor") and the other that accepts a client-allocated segment ("baseptr constructor"). Both of these are specified as collective. By a strict interpretation of collective this means it's erroneous for some ranks to call the allocating constructor while others call the baseptr constructor.

This is a constraint on the caller that seems unnecessary, and furthermore is not required by the current implementation. In reality the current implementation has one internal constructor that is used to implement both with an algorithm sufficiently general to handle either behavior, and allows freely mixing allocating vs baseptr inputs across ranks.

The difference we're discussing here is whether the UPC++ runtime allocates a segment from the device API or accepts an existing one from the calling application. For GPU devices this comes down to a process-local decision during local segment construction, and afterwards has no effect whatsoever on the operation of the collective device. The corresponding gex_Segment_Create() that locally registers the device segment with GASNet is not even a collective operation (and although it also provides a GASNet-level "allocation" option we don't even use that currently).

I'm confident this freedom will remain for our implementation for other GPU devices. It's harder to speculate about future non-GPU devices; but assuming the devices are still some kind of node-local hardware, it's difficult to imagine why all the processes in a distributed job would need to agree about which software layer performed the device API allocation call. If somehow that ever arises we always have the option to later add a kind-specific constraint on the device_allocator specialization for such a device when we add it.

Exception behavior

As mentioned above the implementation allows "mixing" calls to the allocating vs baseptr constructors. Only the allocating constructor has the possibility to initiate a upcxx::bad_segment_alloc exception. However the exception throw is guaranteed to be single-valued, so in a "mixed" call this means callers to the baseptr constructor can actually end up throwing upcxx::bad_segment_alloc due to an allocation failure in the allocating constructor called by a different rank.

Assuming we decide to specify that we allow "mixing" both modes of construction, that will need to be cleaned up.

Handling of pre-existing device data

The baseptr constructor allows the client to build a device_allocator around provided device memory. One of the main use cases for this is to allow UPC++ to "bless" device memory allocated by a different library. However an important sub-case of this is where that device memory already contains valid objects and data that the client wishes to communicate using UPC++. It would be very nice to gracefully handle that case, and I don't currently see a reason not to.

What's missing:

  1. We should specify some guarantee that when using the baseptr constructor, at least construct/destroy/destruct don't overwrite any preexisting data in the client's provided device segment (a property provided by the current implementation). We could optionally extend that guarantee to allocate/deallocate (a property also maintained by the current implementation), but I don't see a use case for that scenario and it's a freedom I'd rather reserve for the implementation absent a compelling motivation.

  2. The current spec language in device_allocator::local(gp) and device_allocator::device_id(gp) are very "allocator-centric", requiring gp to be the result of a call to allocate(). However in the proposed use case one never calls allocate() - you instead call to_global_ptr() on the device pointers to the preexisting objects in the provided device segment. That text should be massaged to allow this (which the implementation already allows).

Comments (14)

  1. Paul Hargrove

    Dueling Constructors

    I think that simplification to one constructor with two behaviors (based on polymorphism, or nullptr address) sounds like an improvement. IF that is reasonable, then the two current ones can become "deprecated spellings" for use of the new unified one.

    Exception Behavior

    While I find it odd to think about an application which would need to "mix", I cannot see any reason to prohibit it uniformly (though a given device specialization might still do so if for some reason only one construction mode is supported for some reason I cannot think of). I agree collective exception behavior is required for sanity, and the spec will need to clarify that the exception (which may already be due to a non-local condition) can occur even if the local call did not request allocation. Merging the two constructor into one would make this specification more natural.

    Handling of pre-existing device data

    I agree that we really want to support wrapping our RMA around another library's allocation (and initialization). So, what is missing (IMO) is a guarantee that in the absence of a call to allocate() we will not perturb the data. IOW: we create an allocator which permits, but does not require, its use for managing the memory.

    While it may be true that the current allocate/deallocate do not perturb the memory content, I don't see sufficient value to the end user of knowing that memory at an unpredictable location is preserved by a given allocate call. So, lets preserve the implementation freedom to someday have an allocator which would not preserve the contents.

  2. Dan Bonachea reporter

    Current dueling constructor signatures:

    template <typename Device>
    device_allocator <Device>::device_allocator(Device &device, size_t sz_in_bytes);
    
    template <typename Device>
    device_allocator <Device>::device_allocator(Device &device, Device::pointer<void> device_memory, size_t sz_in_bytes);
    

    I don't recall exactly why we specified this argument order for the second overload, but in retrospect I think that was a design mistake; sz_in_bytes is mandatory and we've stuck the optional argument "in the middle" where it isn't directly amenable to function argument defaulting.

    The unified approach I've proposed for the related new make_gpu_allocator() factory in PR 82:

    template<typename Device = gpu_default_device>
    device_allocator<Device> make_gpu_allocator(size_t sz_in_bytes, 
                Device::id_type device_id = Device::auto_device_id,
                void *device_memory = nullptr);
    

    puts the device_memory argument last, since it only makes sense to provide in cases that explicitly provide everything else.

    Two design approaches to solve the "dueling constructors" and exception behavior problems:

    1. Leave the existing collective constructor signatures, possibly merge the specification sections, and add verbiage to clarify that both can be intermixed across ranks in a single collective device creation. That verbage would clarify that in "mixed" calls the baseptr constructor has the potential to throw the collective exception generated by a matched allocating constructor.

    2. Merge the two collective constructors into a single function serving both purposes, and remove the old allocating overload and deprecate the old baseptr overload. The new merged signature would look something like the following. The new constructor would need to describe all the behaviors, but the question of "can I mix collective constructors" would no longer arise since there would be only one collective constructor. This deprecation would affect codes using the old baseptr overload (none that I'm aware of?), and become a "breaking change" as soon as we actually remove the deprecated signature, which would become mandatory as soon as we specify a Device::pointer that is ambiguous with size_t.

    template <typename Device>
    device_allocator <Device>::device_allocator(Device &device, size_t sz_in_bytes, Device::pointer <void> device_memory = Device::null_pointer());
    

    Both approaches permit the desired "mixing", but also preserve the freedom to restore the "no mixing" restriction for some future Device specialization. Under option 2 this would be as simple as requiring a single-valued device_memory argument for certain Devices, under option 1 it would probably require messier verbiage.

    I'm currently favoring approach 2 because it incurs some short-term upheaval (with probably negligible impact on real codes), but leaves us with a cleaner design long-term. There's also a good chance this entire interface eventually becomes deprecated in the future in favor of the simpler make_gpu_allocator() approach that subsumes it (so far only for GPUs).

    Thoughts @Amir Kamil @Colin MacLean ?

  3. Colin MacLean

    Approach 2 works for me, although a more generic version of make_gpu_allocator() as I described in a comment on PR 82 would be ideal.

  4. Dan Bonachea reporter

    I now have spec and implementation work that resolves this issue along the lines agreed-upon above.

    PRs coming soon..

  5. Dan Bonachea reporter

    Here is my current working spec for the new "merged" device_allocator constructor, which is the most important part of my work to resolve the "dueling constructors" and "exception behavior" problems discussed in the original post.

    device_alloc-merge.png

  6. Colin MacLean

    Do we really need to have a Device::null_pointer? Why not something like

    template<typename Device>
    template<typename DevicePointer>
    device_allocator<Device>::device_allocator(Device& device, size_t sz_in_bytes, DevicePointer device_memory = nullptr)
    

    This would allow for a uniform constructor in the spec, but in implementation it would allow us to handle the case of nullptr_t statically.

  7. Dan Bonachea reporter

    Do we really need to have a Device::null_pointer?

    What you're proposing sounds like an orthogonal change to a design decision we made years ago, namely to not assume that nullptr_t is convertible to the representation of raw Device pointers (these happen to be T* for GPUs, but we expect they will be offsets or possibly something more complicated for other future devices).

    If you want to pursue a discussion on this orthogonal topic, then please open a NEW issue for that proposal.
    FWIW minor serial performance overheads in device_allocator construction will never be relevant in practice, because the constructor runtime will always dominated by the globally collective communication.

  8. Colin MacLean

    I was assuming nullptr_t wouldn’t be converted to a Device::pointer and treated as a type that’s explicitly null in the type system, since that’s unambiguously null no matter what the representation. That’s why I specified DevicePointer ptr = nullptr as a template rather than doing Device::pointer ptr = nullptr, since this keeps the type as nullptr_t. I was thinking that Device::null_pointer was more for going the other way around and checking if a variable of Device::pointer type is null.

    I was considering this as more of a means of keeping a clean separation in the source code between the two modes of operation for maintainability, not for performance. The allocating constructor would still be separate in our source code as a template specialization of the constructor rather than combined into one big one.

  9. Dan Bonachea reporter

    I was considering this as more of a means of keeping a clean separation in the source code between the two modes of operation for maintainability

    Ok, I misunderstood your goal.

    However if you look at the actual implementation you'll find our implementation is (and has always been) dispatch to a single collective constructor (even while the public API had two separate constructors). The detail of whether or not the constructor is allocating the segment is actually a very insignificant part of all the other crap the constructor has to do, so regardless of the public API it's much better for maintainability to keep the code merged internally, with a small branch in the middle to allocate as needed.

  10. Amir Kamil

    I’m not in favor of templating the pointer argument. I think there is sufficient possibility of someone passing the wrong pointer type, and I’d prefer if that were caught with a type error rather than a static assert or something else.

    if segment allocation fails for any callers participating in this collective with an actual device and device_memory == null

    I think we should replace device_memory == null with prose, as it isn’t valid code.

  11. Log in to comment