Unfulfilled promise leaks memory if it has a dependent future created by then() or when_all()

Issue #536 open
Dan Bonachea created an issue

Test code, based on issue478.cpp:

#include "../util.hpp"

using namespace upcxx;

int main(int argc, char **argv) {
  upcxx::init();
  print_test_header();

  future<> ready_empty = make_future();
  future<int> ready_nonempty = make_future(3);

  int N = 1;
  if (argc > 1) N = std::atoi(argv[1]);

  for (int i=0; i<N; i++) {
    promise<> prom1;
    future<> nonready_empty = prom1.get_future();
    promise<int> prom2;
    future<int> nonready_nonempty = prom2.get_future();
    promise<int,int> prom3;
    future<int,int> nonready_nonempty2 = prom3.get_future();

    #if !SKIP_WHENALL
      // nonready + nonready
      future<> f1 = when_all(nonready_empty, nonready_empty);
      future<int> f2 = when_all(nonready_empty, nonready_nonempty);
      future<int> f3 = when_all(nonready_nonempty, nonready_empty);
      future<int,int> f4 = when_all(nonready_nonempty, nonready_nonempty);
      future<int,int,int> f5 = when_all(nonready_nonempty, nonready_nonempty2);

      // two ready, one nonready
      future<int> f6 = when_all(ready_empty, ready_empty, nonready_nonempty);
      future<int> f7 = when_all(nonready_empty, ready_empty, ready_nonempty);
      future<int> f8 = when_all(ready_empty, ready_nonempty, nonready_empty);
      future<int> f9 = when_all(ready_nonempty, nonready_empty, ready_empty);
    #endif
  }

  print_test_success();
  upcxx::finalize();
}

The program above, run with develop, leaks about 1KiB of memory every loop iteration, as reported by valgrind 3.17.0. The leak only occurs for runs executing when_all, and appears to correspond to the underlying promise objects. Even without valgrind, one can see the problem by running the program above with a command-line argument like ./a.out 1000000 and the leak is obvious via ps/top.

when_all was recently optimized in pull request 345 (which also added the test that initially detected this leak). However based on manual testing it appears this leak was introduced somewhere between 2020.3.2 and 2020.3.8, most likely during pull request 241.

My guess is we're missing a dropref() somewhere.

Official response

  • Amir Kamil

    Further investigation reveals that 1) this only affects promises whose dependency count never reaches zero, and 2) this is not limited to when_all().

    More specifically, the following does not leak memory:

    promise<> p1;
    future<> f1 = p1.get_future();
    future<> f2 = when_all(f1,f1);
    p1.fulfill_anonymous(1);
    

    I’ve verified that putting in fulfill_anonymous()/fulfill_result() calls also fixes the leak in the original example.

    The following that uses then() rather than when_all() leaks memory:

    promise<> p1;
    future<> f1 = p1.get_future();
    future<> f2 = f1.then([](){});
    

    Again, adding p1.fulfill_anonymous(1) fixes the leak.

    My guess is that the following change to core.hpp in commit 27b7692 is what results in the problem:

           future_header_dependent() {
    -        this->ref_n_ = 1;
    +        // dependents carry an extra reference until they become ready
    +        this->ref_n_ = 2;
             this->status_ = future_header::status_active;
             this->sucs_head_ = nullptr;
           }
    

    I suspect that this extra reference in a non-ready future_header_dependent keeps the cell on which it depends alive, even when there are no longer any user-level references to it. However, I have not yet confirmed this, nor do I have a fix yet.

    I’ll keep digging.

Comments (11)

  1. Dan Bonachea reporter

    Further poking with Valgrind reveals this simpler code is sufficient to demonstrate the leak:

    promise<> p1;
    future<> f1 = p1.get_future();
    future<> f2 = when_all(f1,f1);
    

    whereas the following does NOT leak: (the only difference being the declared vs inferred type of f2)

    promise<> p1;
    future<> f1 = p1.get_future();
    auto f2 = when_all(f1,f1);
    

    According to the spec these two pieces of code are equivalent, but due to issue 288 the implementation diverges, and apparently the leak arises within that divergence. This strongly suggests the leak arises during the conversion of when_all's "spooky future" return into future<>.

  2. Amir Kamil

    I’ve traced this regression to commit 27b7692, which was part of PR 178. Unfortunately, it’s in a part of the logic that I don’t have any experience with, so I haven’t made any headway after poking at it for a few hours. I suspect that the statement in the commit message, “Most importantly there are no longer any possible cases requiring polymorphic destruction of future state,” might be incorrect, in which case this won’t be an easy fix.

  3. Amir Kamil

    Further investigation reveals that 1) this only affects promises whose dependency count never reaches zero, and 2) this is not limited to when_all().

    More specifically, the following does not leak memory:

    promise<> p1;
    future<> f1 = p1.get_future();
    future<> f2 = when_all(f1,f1);
    p1.fulfill_anonymous(1);
    

    I’ve verified that putting in fulfill_anonymous()/fulfill_result() calls also fixes the leak in the original example.

    The following that uses then() rather than when_all() leaks memory:

    promise<> p1;
    future<> f1 = p1.get_future();
    future<> f2 = f1.then([](){});
    

    Again, adding p1.fulfill_anonymous(1) fixes the leak.

    My guess is that the following change to core.hpp in commit 27b7692 is what results in the problem:

           future_header_dependent() {
    -        this->ref_n_ = 1;
    +        // dependents carry an extra reference until they become ready
    +        this->ref_n_ = 2;
             this->status_ = future_header::status_active;
             this->sucs_head_ = nullptr;
           }
    

    I suspect that this extra reference in a non-ready future_header_dependent keeps the cell on which it depends alive, even when there are no longer any user-level references to it. However, I have not yet confirmed this, nor do I have a fix yet.

    I’ll keep digging.

  4. Paul Hargrove

    NOTE: due to this leak, the issue478 test is currently ban-listed in the dirac-gcc_valgrind GitLab CI job. This is a reminder to remove that when this issue is resolved.

  5. Amir Kamil

    Summary of what I’ve determined so far from poking at this:

    How Dependencies Work

    The following describes at a high level how dependencies are tracked and satisfied. If future B depends on future A, we refer to B as the "successor" and A as the "predecessor".

    A future_header has several pieces of state that are relevant to dependency tracking:

    • The ref_n_ field keeps track of how many references there are to the header.
    • The status_ field is a count of how many unfulfilled dependencies the header currently is waiting on.
    • The sucs_ field is a singly-linked list of successor future_header_dependents that depend on the header.

    A dependent (successor) future has a future_body that keeps track of its dependencies. Each dependency is tracked via a future_dependency parameterized by the kind of future the predecessor is. For a dependency on a ref-counted future cell, the corresponding future_dependency holds a reference to the predecessor's header. It is through this future_dependency that the predecessor is kept alive and its result made available to the successor.

    In the other direction, a predecessor keeps track of its successors through the sucs_ list, which contains references to the successor future_header_dependents. These do not individually increment the ref-count of the successor; rather, the predecessors of an individual future_header_dependent hold a "shared" reference to that header, so that its ref-count is higher by one no matter how many dependencies it is waiting on. When a predecessor becomes ready, it first decrements the status_ of each successor -- recall that the status_ field tracks how many unfulfilled dependencies a header is waiting on. The predecessor's successor list is also cleared out, as it's no longer needed. If the status_ count of a successor reaches zero, then the successor is itself ready to fire, so it is added to an "active" list. The active list is then processed, with leave_active() invoked on each future_header_dependent on the active list. This decrements the ref-count of the header, eliminating the "shared" reference that the header's predecessors held. leave_active() fires the action of the header if there is one (e.g. for a then()) and cleans up whatever state is no longer needed. This includes deleting the header itself if its ref-count has reached zero.

    The Memory Leak

    As can be deduced from the above, as long as a predecessor header has not reached the ready state, there is a circular chain of references between the predecessor and each of its successors. The predecessor holds part of a reference that is "shared" between all the predecessors of a successor, and the successor holds a real reference to the predecessor through a future_dependency.

    For a header corresponding to a promise, if the user drops all references to the promise without fulfilling the initial dependency, then there is no way for the promise to ever become ready. If the header has any successors, then it and all its successors are leaked -- due to the cyclical reference chain, none of their ref-counts will reach zero, so they will all be leaked (issue 536).

    Note that it is valid for a user to drop all references to a successor -- it's perfectly okay to chain some work on a dependency via then() and drop the resulting header. A fix for issue 536 needs to ensure that this continues to work as expected.

    The prototype solution in branch issue536b adds a successor count (sucs_n_) to a header. When the reference count matches the successor count, then all references to the header are through a future_dependency. If the header is not ready, then it would be impossible for it to become so, as all references are held by successors of the header. The prototype places the header into a new "canceled" state and recursively cancels the successors. This process cleans up the internals of the successor that are no longer needed:

    • Any successors of the successor header are in turn canceled.
    • The future_body of the dependent is destroyed, including any future_dependency contained within. This eliminates the references that the successor holds on any of its predecessors.
    • The future_header_dependent corresponding to the successor is removed from all the sucs_ lists of its predecessors. The "shared" reference held by the predecessors is eliminated. If the ref-count reaches zero, the future_header_dependent is deleted.

    This cleanup process causes the ref-count of the predecessor to go down to zero, allowing it to be deleted.

    While the prototype does appear to fix the memory leak, it raises semantic issues with respect to canceled headers. Previously, the only canceled header was the "nil" header used for default-constructed futures. We have now introduced a canceled state for other headers.

    • What happens if a user chains work on a header that has already been canceled? (The current implementation has an assertion to detect if the user chains something on the "nil" header.)
    • Should we expose the canceled state through a query of a future?
    • Should we have an API for canceling a promise?
    • Related to the previous item, should we require explicit cancellation of a promise rather then implicitly doing so when the ref-count drops? Should we then make it an error if the ref-count drops to the successor count without the header having been canceled?

    Due to these semantic issues, we are deferring this issue until after the 2022.3.0 release. Spec issue 191 has been opened to discuss these semantic questions.

  6. Log in to comment