finalize can complete without invoking progress, leading to obscure leaks

Issue #384 resolved
Dan Bonachea created an issue

The runtime often defers actions until the next user-level progress. This is normal and expected, and usually there are associated user completions which ensure a correct program cannot reach local quiescence and enter finalize() until such actions have been drained from the queues.

However valgrind has discovered at least one situation where this property does not hold. Construction of a upcxx::dist_object internally registers/creates a non-ready promise which is used to service calls to create a future via the corresponding dist_id::when_here(). This non-ready promise is scheduled for fulfillment during the next user-level progress, ensuring that incoming rpc's referencing that dist_object (including those which may have arrived previously) are scheduled to run at the correct time. In normal programs this is not a problem, as normal programs will eventually do something to trigger user-level progress.

However if there are no such rpcs or calls to dist_id::when_here(), then the program can be technically quiesced and free to enter finalize() without any further user-level progress. Consider the following degenerate program (based on issue312.cpp):

#include <upcxx/upcxx.hpp>
#include <iostream>
#include <assert.h>

int main() {
  upcxx::init();

  upcxx::dist_object<int> dobj(1);
  upcxx::dist_id<int> di = dobj.id();

#if SHOW // activates non-compliant behavior
  upcxx::future<upcxx::dist_object<int>&> f = di.when_here();
  assert(!f.ready());
  f.then([](upcxx::dist_object<int> const &r) { std::cout << upcxx::rank_me() << ": callback!"
 << std::endl; });
  // future deliberately dropped here
#endif

  upcxx::finalize();
  return 0;
}

When run on a single rank (and intermittently for higher ranks) with no defines, valgrind confirms this program leaks the internal promise object associated with the dist_object mentioned above, eg:

 72 bytes in 1 blocks are possibly lost in loss record 81 of 107
    at 0x4C2A553: operator new(unsigned long) (vg_replace_malloc.c:342)
    by 0x405ADB: upcxx::detail::future_header_promise<upcxx::dist_object<int>&>::operator new(unsigned long) (core.hpp:578)
    by 0x40540A: upcxx::detail::future_header_promise<upcxx::dist_object<int>&>* upcxx::detail::registered_promise<upcxx::dist_object<int>&>(upcxx::digest, int) (team.hpp:20)
    by 0x404FC9: upcxx::dist_object<int>::dist_object<int>(upcxx::team const&, int&&) (dist_object.hpp:92)
    by 0x404E0E: upcxx::dist_object<int>::dist_object(int) (dist_object.hpp:113)
    by 0x4047F0: main (issue312b.cpp:8)

This occurs because the deferred activation that would lead to reaping the orphaned object is still sitting in the queue when the runtime shuts down (and the queues destroyed). The problem is that while upcxx::finalize() is permitted to invoke user-level progress, the current implementation does not guarantee it's actually invoked. In cases where the exit-barrier is satisfied locally without polling (always the case for single-rank and probabilistically in multi-rank runs), finalize() can currently exit on some ranks without ever invoking user-level progress, leading to this leak.

Compiling with -DSHOW causes the program to break the quiescence rules, but can be used to demonstrate that user-level progress is sometimes entirely skipped by finalize.

I believe the correct fix here is a small tweak to finalize() to ensure it always invokes user-level progress, allowing the runtime the opportunity to execute any pending cleanups that were deferred to the next user-level progress, most importantly those that happen to have no associated client-level dependency.

This is technically a memory leak, but is NOT expected to arise to any noticable extent in normal programs. However it can cause intermittent valgrind warnings for programs that deviate from normal use of dist_object.

Comments (3)

  1. Dan Bonachea reporter

    fix issue #384: finalize can complete without invoking progress, leading to obscure leaks

    Ensure that any call to finalize() that tears down the library invokes user-level progress at least once.

    → <<cset b5e8d96fd3ea>>

  2. Log in to comment