Destruction of atomic_domains

Issue #133 resolved
Amir Kamil created an issue

The spec currently states the following for the destructor of atomic_domain:

Precondition: All operations initiated on the atomic domain must be complete prior to any process making this call (or the behavior is undefined). In practice, this means
completing (syncing) all atomic operation at their initiators, followed by a barrier prior to calling this function.

Destructs this atomic domain object.

This destructor is a collective function over the team used in the constructor.

I think that some of the same concerns in Issue #71 (collective destruction of teams) also apply here.

Can we get away with non-collective destruction of atomic_domains? If not, then I suggest we add an explicit destroy() function like on teams.

Comments (8)

  1. Dan Bonachea

    Amir - Thanks for pointing this out. You're correct that the same issues apply to collective destruction of all types of distributed resources, so we should probably be consistent with their semantic treatment (and spec wording), or have a good motivation for divergence.

    I see at least two important semantic decisions:

    1. Destruction mechanism:
      • Implicit collective destructor, versus
      • Explicit collective destroy() and non-collective implicit destructor
    2. Quiescence preconditions and barrier guarantees on entry to destroy/destructor.

    I think the distributed resource destructions we currently specify include :

    1. user-created team destruction - in pull request #16
    2. user-created atomic_domain destruction - currently prerequires global quiescence and collective destructor
    3. upcxx library finalize - currently requires local injection quiescence and guarantees an entry barrier
    4. dist_object destructor - currently non-collective destruction

    The AD text already in place has taken the "more restrictive" quiescence semantics, ie requiring global quiescence of the AD before any rank calls. However we could weaken this to only require local injection quiescence and guarantee an entry barrier (matching upcxx::finalize()) if we want that to be our general approach.

    Note that dist_object is a special case with non-collective destruction, because it has no dedicated underlying GEX object, and hence no associated network-level resources, so it's possible to destroy non-collectively (although use-after-free at any process still yields undefined behavior).

  2. john bachan

    Do we agree dist_object has the right destruction semantics? I think so, since it lends to them being dirt cheap.

    Otherwise, the rest which are all "gex" wrappers, should have the same destruction semantics/signature. I would be fine with the heavier weight but friendlier "locally quiesced" kind and explicit destruction. But I would really like to see ".destroy()" be optional. If a ninja programmer knows they can align implicit destructor calls collectively, I wouldn't want to stop them. We can make sure all of our prog guides use explicit destroys and never mention that its optional, but I would like it if the spec implied that aligned implicit destructors is safe.

  3. Dan Bonachea

    I'm not advocating any change to dist_object destruction semantics. I think the difference there is well-justified.

    FWIW we might also be able to loosen the global quiescence precondition and collective destruction semantics on gex_AD_Destroy, but I'd want @phargrove to weigh in on that potential change. This particular issue is closely tied up with the one of atomic phasing, which has not yet been full resolved, so that's why we initially went with the maximally restrictive semantics. My guess is we'll at least want to see native implementation on a second network (IB) before committing to semantic weakening for AD destruction.

    We can make sure all of our prog guides use explicit destroys and never mention that its optional, but I would like it if the spec implied that aligned implicit destructors is safe.

    I'm opposed to this semantic wishy-washyness. One of the primary reasons for explicit destroy() to exist is to make it a required precondition for the object destructor that is enforced by runtime checks in DEBUG mode. This allows us to catch the subtle mistakes/deadlocks that become possible when implicit collective destruction is permitted, and issue a helpful diagnostic for that case. Anything weaker significantly dilutes the point of this semantic.

    If a ninja programmer knows they can align implicit destructor calls collectively, I wouldn't want to stop them.

    Can't such a "ninja programmer" write their own trivial wrapper class (possibly even as a subclass) that automatically calls .destroy() in the destructor? That accomplishes the reduction in keystrokes (which is presumably your goal?) while still following the rules of required explicit destroy()..

  4. john bachan

    I see artificial constraints on an API just to make mistakes easier to diagnose as a last resort to be reserved when more robust diagnostics are unattainable. I would be in favor of enhancing our implementation's collective alignment checking in the general case. Something that failed like:

    UPC++ collective alignment failure:
      2 ranks calling: upcxx::barrier()
      3 ranks calling: upcxx::brodcast(root=5)
      1 ranks calling: upcxx::broadcast(root=9)
      1 ranks calling: upcxx::team::~team()
    

    That wouldn't catch everything, but it would catch a lot more than an API with split destroy and destructors.

    There is an argument that a constrained API makes buggy code harder to write by forcing users to think about when to destroy. I buy that, which is why I still think having a destroy method is a good thing, and its all we should talk about in the guides.

  5. Amir Kamil reporter

    My preference is to require an explicit destroy() call. I think we should be catering to most users, not ninja programmers.

    I'll also point out that collective alignment can only be checked when the code is run in parallel, while a missing destroy() can be detected with a single rank.

  6. john bachan

    I will concede since this is the sort of thing that could always be relaxed (though realistically it won't).

    Amir, in response to your second point: the reason for the split destroy() is to make collective alignment less of an issue. Cases (non-parallel cases) where collective alignment is of no consequence don't strengthen the argument for destroy()'s existence since then it's really contributing nothing but keystrokes.

  7. Amir Kamil reporter

    I disagree that an explicit destroy() is useless in sequential mode. I think it's useful to be be able to detect errors in sequential mode, since I expect most users to develop sequentially before they test in parallel.

  8. Log in to comment