device_allocator constructor spec has several problems
There are several problems with the current spec of the two non-default device_allocator<Device>
constructors:
Dualing constructors
The most fundamental problem is we specify there are two non-default constructors - one that allocates a segment ("allocating constructor") and the other that accepts a client-allocated segment ("baseptr constructor"). Both of these are specified as collective. By a strict interpretation of collective this means it's erroneous for some ranks to call the allocating constructor while others call the baseptr constructor.
This is a constraint on the caller that seems unnecessary, and furthermore is not required by the current implementation. In reality the current implementation has one internal constructor that is used to implement both with an algorithm sufficiently general to handle either behavior, and allows freely mixing allocating vs baseptr inputs across ranks.
The difference we're discussing here is whether the UPC++ runtime allocates a segment from the device API or accepts an existing one from the calling application. For GPU devices this comes down to a process-local decision during local segment construction, and afterwards has no effect whatsoever on the operation of the collective device. The corresponding gex_Segment_Create()
that locally registers the device segment with GASNet is not even a collective operation (and although it also provides a GASNet-level "allocation" option we don't even use that currently).
I'm confident this freedom will remain for our implementation for other GPU devices. It's harder to speculate about future non-GPU devices; but assuming the devices are still some kind of node-local hardware, it's difficult to imagine why all the processes in a distributed job would need to agree about which software layer performed the device API allocation call. If somehow that ever arises we always have the option to later add a kind-specific constraint on the device_allocator
specialization for such a device when we add it.
Exception behavior
As mentioned above the implementation allows "mixing" calls to the allocating vs baseptr constructors. Only the allocating constructor has the possibility to initiate a upcxx::bad_segment_alloc
exception. However the exception throw is guaranteed to be single-valued, so in a "mixed" call this means callers to the baseptr constructor can actually end up throwing upcxx::bad_segment_alloc
due to an allocation failure in the allocating constructor called by a different rank.
Assuming we decide to specify that we allow "mixing" both modes of construction, that will need to be cleaned up.
Handling of pre-existing device data
The baseptr constructor allows the client to build a device_allocator
around provided device memory. One of the main use cases for this is to allow UPC++ to "bless" device memory allocated by a different library. However an important sub-case of this is where that device memory already contains valid objects and data that the client wishes to communicate using UPC++. It would be very nice to gracefully handle that case, and I don't currently see a reason not to.
What's missing:
-
We should specify some guarantee that when using the baseptr constructor, at least construct/destroy/destruct don't overwrite any preexisting data in the client's provided device segment (a property provided by the current implementation). We could optionally extend that guarantee to allocate/deallocate (a property also maintained by the current implementation), but I don't see a use case for that scenario and it's a freedom I'd rather reserve for the implementation absent a compelling motivation.
-
The current spec language in
device_allocator::local(gp)
anddevice_allocator::device_id(gp)
are very "allocator-centric", requiringgp
to be the result of a call toallocate()
. However in the proposed use case one never callsallocate()
- you instead callto_global_ptr()
on the device pointers to the preexisting objects in the provided device segment. That text should be massaged to allow this (which the implementation already allows).
Comments (14)
-
-
reporter Current dueling constructor signatures:
template <typename Device> device_allocator <Device>::device_allocator(Device &device, size_t sz_in_bytes); template <typename Device> device_allocator <Device>::device_allocator(Device &device, Device::pointer<void> device_memory, size_t sz_in_bytes);
I don't recall exactly why we specified this argument order for the second overload, but in retrospect I think that was a design mistake;
sz_in_bytes
is mandatory and we've stuck the optional argument "in the middle" where it isn't directly amenable to function argument defaulting.The unified approach I've proposed for the related new
make_gpu_allocator()
factory in PR 82:template<typename Device = gpu_default_device> device_allocator<Device> make_gpu_allocator(size_t sz_in_bytes, Device::id_type device_id = Device::auto_device_id, void *device_memory = nullptr);
puts the
device_memory
argument last, since it only makes sense to provide in cases that explicitly provide everything else.Two design approaches to solve the "dueling constructors" and exception behavior problems:
-
Leave the existing collective constructor signatures, possibly merge the specification sections, and add verbiage to clarify that both can be intermixed across ranks in a single collective device creation. That verbage would clarify that in "mixed" calls the baseptr constructor has the potential to throw the collective exception generated by a matched allocating constructor.
-
Merge the two collective constructors into a single function serving both purposes, and remove the old allocating overload and deprecate the old baseptr overload. The new merged signature would look something like the following. The new constructor would need to describe all the behaviors, but the question of "can I mix collective constructors" would no longer arise since there would be only one collective constructor. This deprecation would affect codes using the old baseptr overload (none that I'm aware of?), and become a "breaking change" as soon as we actually remove the deprecated signature, which would become mandatory as soon as we specify a
Device::pointer
that is ambiguous withsize_t
.
template <typename Device> device_allocator <Device>::device_allocator(Device &device, size_t sz_in_bytes, Device::pointer <void> device_memory = Device::null_pointer());
Both approaches permit the desired "mixing", but also preserve the freedom to restore the "no mixing" restriction for some future Device specialization. Under option 2 this would be as simple as requiring a single-valued
device_memory
argument for certain Devices, under option 1 it would probably require messier verbiage.I'm currently favoring approach 2 because it incurs some short-term upheaval (with probably negligible impact on real codes), but leaves us with a cleaner design long-term. There's also a good chance this entire interface eventually becomes deprecated in the future in favor of the simpler
make_gpu_allocator()
approach that subsumes it (so far only for GPUs).Thoughts @Amir Kamil @Colin MacLean ?
-
-
I’m in favor of approach 2.
-
Approach 2 works for me, although a more generic version of
make_gpu_allocator()
as I described in a comment on PR 82 would be ideal. -
reporter -
assigned issue to
Option 2 appears to be the consensus, so I'll work on prototyping the implementation side of the proposed change.
-
assigned issue to
-
reporter I now have spec and implementation work that resolves this issue along the lines agreed-upon above.
PRs coming soon..
-
reporter Here is my current working spec for the new "merged"
device_allocator
constructor, which is the most important part of my work to resolve the "dueling constructors" and "exception behavior" problems discussed in the original post. -
Do we really need to have a
Device::null_pointer
? Why not something like
template<typename Device> template<typename DevicePointer> device_allocator<Device>::device_allocator(Device& device, size_t sz_in_bytes, DevicePointer device_memory = nullptr)
This would allow for a uniform constructor in the spec, but in implementation it would allow us to handle the case of
nullptr_t
statically. -
reporter Do we really need to have a Device::null_pointer?
What you're proposing sounds like an orthogonal change to a design decision we made years ago, namely to not assume that
nullptr_t
is convertible to the representation of raw Device pointers (these happen to beT*
for GPUs, but we expect they will be offsets or possibly something more complicated for other future devices).If you want to pursue a discussion on this orthogonal topic, then please open a NEW issue for that proposal.
FWIW minor serial performance overheads indevice_allocator
construction will never be relevant in practice, because the constructor runtime will always dominated by the globally collective communication. -
I was assuming
nullptr_t
wouldn’t be converted to aDevice::pointer
and treated as a type that’s explicitly null in the type system, since that’s unambiguously null no matter what the representation. That’s why I specifiedDevicePointer ptr = nullptr
as a template rather than doingDevice::pointer ptr = nullptr
, since this keeps the type asnullptr_t
. I was thinking thatDevice::null_pointer
was more for going the other way around and checking if a variable ofDevice::pointer
type is null.I was considering this as more of a means of keeping a clean separation in the source code between the two modes of operation for maintainability, not for performance. The allocating constructor would still be separate in our source code as a template specialization of the constructor rather than combined into one big one.
-
reporter I was considering this as more of a means of keeping a clean separation in the source code between the two modes of operation for maintainability
Ok, I misunderstood your goal.
However if you look at the actual implementation you'll find our implementation is (and has always been) dispatch to a single collective constructor (even while the public API had two separate constructors). The detail of whether or not the constructor is allocating the segment is actually a very insignificant part of all the other crap the constructor has to do, so regardless of the public API it's much better for maintainability to keep the code merged internally, with a small branch in the middle to allocate as needed.
-
I’m not in favor of templating the pointer argument. I think there is sufficient possibility of someone passing the wrong pointer type, and I’d prefer if that were caught with a type error rather than a static assert or something else.
if segment allocation fails for any callers participating in this collective with an actual device and
device_memory == null
…I think we should replace
device_memory == null
with prose, as it isn’t valid code.
-
reporter Proposed resolution now in spec PR 84 and impl PR 405
-
reporter - changed status to resolved
Resolved via merge of spec PR 84 and impl PR 405
- Log in to comment
Dueling Constructors
I think that simplification to one constructor with two behaviors (based on polymorphism, or
nullptr
address) sounds like an improvement. IF that is reasonable, then the two current ones can become "deprecated spellings" for use of the new unified one.Exception Behavior
While I find it odd to think about an application which would need to "mix", I cannot see any reason to prohibit it uniformly (though a given device specialization might still do so if for some reason only one construction mode is supported for some reason I cannot think of). I agree collective exception behavior is required for sanity, and the spec will need to clarify that the exception (which may already be due to a non-local condition) can occur even if the local call did not request allocation. Merging the two constructor into one would make this specification more natural.
Handling of pre-existing device data
I agree that we really want to support wrapping our RMA around another library's allocation (and initialization). So, what is missing (IMO) is a guarantee that in the absence of a call to
allocate()
we will not perturb the data. IOW: we create an allocator which permits, but does not require, its use for managing the memory.While it may be true that the current allocate/deallocate do not perturb the memory content, I don't see sufficient value to the end user of knowing that memory at an unpredictable location is preserved by a given allocate call. So, lets preserve the implementation freedom to someday have an allocator which would not preserve the contents.