provide support for linking in different allocators (e.g. tcmalloc, etc.)

Issue #257 on hold
Steven Hofmeyr created an issue

With certain orderings of allocations and deallocations, global shared memory is not freed when it should be. Here’s an example:

#include <upcxx/upcxx.hpp>
using namespace upcxx;

int main(int argc, char **argv) {
  init();
  global_ptr<int> p1 = new_array<int>(500*1024*1024/sizeof(int));
  //delete_array(p1);                                                                                                                                                                 
  global_ptr<int> p2 = new_array<int>(10);
  delete_array(p1);
  global_ptr<int> p3 = new_array<int>(10);
  //delete_array(p2);                                                                                                                                                                 
  //delete_array(p3);                                                                                                                                                                 
  global_ptr<int> p4 = new_array<int>(500*1024*1024/sizeof(int));
  delete_array(p2);
  delete_array(p3);
  delete_array(p4);
  finalize();
}

The above code demonstrates the problem when run with upcxx-run -shared-heap 650M -n1 on a Linux server (g++ (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0).

The issue is that after the deletion of the large p1 array, there should be ample space for allocating the large p4 array. But in the code layout above, it fails. There are several ways in which it can be made to succeed:

  1. free p1 before allocating p2 (line 7 instead of line 9; see the sketch below)
  2. free either p2 or p3 before allocating p4 (lines 11 or 12)
  3. don’t allocate either p2 or p3 (lines 8 or 10)

Note that both p2 and p3 are tiny allocations.
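
For reference, option 1 above corresponds to the following reordering of the same program; as described, this variant allocates p4 successfully with the same 650M shared heap:

#include <upcxx/upcxx.hpp>
using namespace upcxx;

int main(int argc, char **argv) {
  init();
  global_ptr<int> p1 = new_array<int>(500*1024*1024/sizeof(int));
  delete_array(p1);                      // option 1: free p1 before any other allocation
  global_ptr<int> p2 = new_array<int>(10);
  global_ptr<int> p3 = new_array<int>(10);
  global_ptr<int> p4 = new_array<int>(500*1024*1024/sizeof(int));   // now succeeds
  delete_array(p2);
  delete_array(p3);
  delete_array(p4);
  finalize();
}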

We suspect (although we are not totally sure) that this issue is causing memory use to explode in the metahipmer code.

Comments (18)

  1. Dan Bonachea

    Steve - this looks to me like a classic fragmentation problem.

    To summarize the code, you allocate 75% of the heap in a big object, then allocate small object A, free the big object, then allocate small object B, then fail to reallocate a new big object of 75% size.

    The fact that there is over 75% free space available in the heap overall doesn't help if the location of the two allocated small objects fragments that free space into contiguous chunks that are all smaller than 75%. In this case, the allocator would need future knowledge in order to know it needs to place the second small allocation B in such a way as to leave a 75%-size free chunk available. Simple first-fit allocation (a common heuristic) can, and probably does, break the freed 75% block into smaller chunks to accommodate the second small object.
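
    To make the heuristic concrete, here is a toy first-fit free-list simulation (illustrative C++ only, not the actual dlmalloc code; sizes are notional MB mirroring Steve's example). It reproduces the observed behavior: the second small object is placed at the front of the freed 500 MB hole, leaving no contiguous 500 MB extent for the final request.

    #include <cstddef>
    #include <cstdio>
    #include <iterator>
    #include <list>

    struct Block { std::size_t off, len; };        // a free extent [off, off+len)

    struct FirstFitArena {
      std::list<Block> free_list{{0, 650}};        // 650 "MB" arena, one free extent

      // First fit: take the first free extent large enough, split off its front.
      long alloc(std::size_t len) {
        for (auto it = free_list.begin(); it != free_list.end(); ++it) {
          if (it->len >= len) {
            long off = it->off;
            it->off += len; it->len -= len;
            if (it->len == 0) free_list.erase(it);
            return off;
          }
        }
        return -1;                                 // no contiguous extent big enough
      }

      // Free: re-insert the extent in address order and coalesce with neighbors.
      void free_at(std::size_t off, std::size_t len) {
        auto it = free_list.begin();
        while (it != free_list.end() && it->off < off) ++it;
        it = free_list.insert(it, {off, len});
        auto nxt = std::next(it);
        if (nxt != free_list.end() && it->off + it->len == nxt->off) {
          it->len += nxt->len; free_list.erase(nxt);
        }
        if (it != free_list.begin()) {
          auto prv = std::prev(it);
          if (prv->off + prv->len == it->off) { prv->len += it->len; free_list.erase(it); }
        }
      }
    };

    int main() {
      FirstFitArena a;
      long p1 = a.alloc(500);   // big object
      long p2 = a.alloc(1);     // small object A, placed right after p1
      a.free_at(p1, 500);       // big object freed: a 500 MB hole before p2
      long p3 = a.alloc(1);     // small object B: first fit places it in that hole
      long p4 = a.alloc(500);   // largest remaining hole is ~499 MB -> fails
      std::printf("p1=%ld p2=%ld p3=%ld p4=%ld\n", p1, p2, p3, p4);  // p4 == -1
    }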

  2. Paul Hargrove

    Printing each global_ptr immediately following allocation confirms Dan's belief that this is a simple matter of fragmentation:

    p1(gp: 0, 0x119c473c8, dev=-1)
    p2(gp: 0, 0x1390473d8, dev=-1)
    p3(gp: 0, 0x119c473c8, dev=-1)
    

    As you can see, p1 was definitely freed, because p3 is allocated at the same address p1 had occupied.
    Consequently, as Dan suspected, there is no longer any contiguous block of sufficient length for p4.
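
    For reference, a minimal sketch of the diagnostic described above: global_ptr has an ostream inserter, so printing each pointer immediately after allocation shows where the allocator placed it (same output format as the trace above).

    #include <iostream>
    #include <upcxx/upcxx.hpp>
    using namespace upcxx;

    int main() {
      init();
      global_ptr<int> p1 = new_array<int>(500*1024*1024/sizeof(int));
      global_ptr<int> p2 = new_array<int>(10);
      delete_array(p1);
      global_ptr<int> p3 = new_array<int>(10);   // lands at p1's old address
      std::cout << "p1" << p1 << "\np2" << p2 << "\np3" << p3 << '\n';
      finalize();
    }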

  3. Steven Hofmeyr reporter

    What concerns me is that we don’t see this kind of fragmentation in glibc malloc (using ulimit to constrain available memory to the process), nor do we see it in upc (upc_alloc). It seems a pretty poor memory manager that can’t reuse free chunks of the same size. I’m worried that repeated iterations of small alternating size allocations could cause severe fragmentation because existing freed chunks of adequate size are never reused. This could be a real problem with rpc using the rendezvous protocol, right?

    How does this allocator differ from the upc allocator? Would it be possible to make the upcxx allocator behave more like the upc allocator? Even better, would it be possible to enable it so that a third party allocator (tcmalloc/jemalloc) could be used?

  4. Dan Bonachea

    @Steven Hofmeyr : I don't think you're understanding the behavior we're describing - fragmentation is an effect which degrades the efficiency of all allocators to some extent in a fixed-size arena with a stream of alloc/free calls of varying size. Allocators lack knowledge of future allocation requests, so they use heuristics that try to improve the chances they can service any future request that would have fit with an optimal bin-packing strategy, but they have to balance that against the computation used to service the current request. Even an allocator with perfect future knowledge cannot guarantee 100% efficiency for an arbitrary set of requests of differing size in a fixed arena.

    Comparing against the system allocator with a ulimit is not a fair comparison, because there it's not a fixed-size arena - for sufficiently large objects glibc is mapping fresh virtual memory out of the effectively infinite 64-bit VM space and the OS scavenges whatever fixed-size physical pages are available to back it, with no requirement of contiguity. This is fine for an allocator that is only registering new VM mappings with the local page table, but is problematic when mapping changes affect RDMA behavior of remote processes (which is why we use a fixed-size arena that is pre-registered with the NIC).

    Regarding your questions about the UPC allocator: When you run a UPC/UPC++ hybrid app with the default UPCXX_USE_UPC_ALLOC=yes mode, then UPC++ shared heap alloc/free are implemented directly using UPCR's "umalloc" allocator (i.e. in that case they are both the same UPC shared memory allocator, in the local partition). In other words, you can get the UPC++ allocator to behave exactly like the UPC allocator by simply linking with upcc. However note that if you switch to non-default UPCXX_USE_UPC_ALLOC=no mode, then the UPC++ shared heap is carved from the UPCR shared heap as one large object and then the normal UPC++ "dlmalloc" allocator manages that arena.

  5. Paul Hargrove

    It seems a pretty poor memory manager that can’t reuse free chunks of the same size.

    At the time p4 is allocated there is no block "of the same size", since the one released when p1 was freed has been split to satisfy another request (p3). Without knowledge of the future allocation patterns, the implementation cannot know that any allocation of that same size will ever occur.

    Would it be possible to make the upcxx allocator behave more like the upc allocator?

    The dl_malloc implementation currently used is well known for its speed, and for having no license encumbrances.
    I seem to recall its speed being one of the strongest selling points for John.

    I am not familiar enough with it to answer the question of whether/how it might be made to act more like the one in Berkeley UPC (a heavily modified version of umalloc from an old gdb).

    Even better, would it be possible to enable it so that a third party allocator (tcmalloc/jemalloc) could be used?

    To be realistic, there is no perfect memory allocator.
    If you require one with specific properties (perhaps those in tcmalloc/jemalloc) then perhaps you should allocate a large pool of memory from UPC++ and use the allocator of your choice to manage that chunk (with appropriate application of to_global_ptr()). That cannot help with memory used by the internals, but John is already working to reduce the amount of shared memory the runtime uses.
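
    A hedged sketch of that suggestion (the Pool name and the trivial bump-pointer policy below are purely illustrative, not an existing API): reserve one large chunk from the UPC++ shared heap after init(), manage it yourself, and publish pieces with to_global_ptr().

    #include <upcxx/upcxx.hpp>
    #include <cstddef>

    struct Pool {
      char *base; std::size_t cap, used = 0;
      // Must be constructed after upcxx::init(); upcxx::allocate() returns a
      // pointer into this process's shared segment (or null if it can't be met).
      explicit Pool(std::size_t cap)
        : base(upcxx::allocate<char>(cap)), cap(cap) {}
      ~Pool() { upcxx::deallocate(base); }

      // Hand out aligned pieces; a real pool would also support free and reuse.
      template<typename T>
      upcxx::global_ptr<T> get(std::size_t n) {
        std::size_t align = alignof(T);
        used = (used + align - 1) / align * align;
        if (base == nullptr || used + n*sizeof(T) > cap) return nullptr;
        T *p = reinterpret_cast<T*>(base + used);
        used += n*sizeof(T);
        return upcxx::to_global_ptr(p);   // p lies in the shared segment
      }
    };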

    I don't intend my responses above to be "dismissive" of your concerns, though I can clearly see how they could be interpreted that way. They are just constrained by the limits of what I know.

    When @john bachan has time, he may have additional thoughts on this issue, including configurability of dl_malloc or the possibility of replacing it.

  6. Steven Hofmeyr reporter

    Paul, you are of course correct - if p3 is allocated in the space freed by p1, then there is not enough space for p4. All allocators suffer from fragmentation, but how much depends on the allocator and the application’s memory usage pattern. So I do believe the best thing would be for the user to be able to choose the allocator that is used. I’ll change this issue to an enhancement request, and maybe John can comment on it when he gets a chance.

  7. Paul Hargrove

    @Steven Hofmeyr Please pick a more accurate title for this issue, since we have established that memory is freed.

  8. Paul Hargrove

    @Steven Hofmeyr
    Regarding your statement "we don’t see this kind of fragmentation in glibc malloc (using ulimit to constrain available memory to the process)":

    As Dan indicated, using ulimit alone does not yield an apples-to-apples comparison because the glibc malloc() has access to multiple discontiguous "arenas" of virtual memory. However, if we eliminate that difference by using mallopt() to instruct glibc not to mmap() additional arenas, then we have a fair playing field. By that I mean that the allocators in both glibc and UPC++ will be constrained to management of a single virtually-contiguous memory region.

    As shown below, under these conditions (on Ubuntu 18.04 with the same compiler you report using) a simple C++ code shows the same behaviors as UPC++:

    • p3 is allocated the address previously occupied by p1
    • With a 650MB data size limit (or even 900MB to be generous), allocation of p4 fails
    phargrov@ftg4:~$ lsb_release -a
    No LSB modules are available.
    Distributor ID: Ubuntu
    Description:    Ubuntu 18.04.2 LTS
    Release:        18.04
    Codename:       bionic
    
    phargrov@ftg4:~$ g++ --version
    g++ (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
    Copyright (C) 2017 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.  There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
    
    phargrov@ftg4:~$ cat oom.cpp
    #include <iostream>
    #include <malloc.h>
    int main(int argc, char **argv) {
      mallopt(M_MMAP_MAX, 0);
      int *p1 = new int[500*1024*1024/sizeof(int)];   std::cout << "p1 = " << p1 << '\n';
      int *p2 = new int[10];                          std::cout << "p2 = " << p2 << '\n';
      free(p1);
      int *p3 = new int[10];                          std::cout << "p3 = " << p3 << '\n';
      int *p4 = new int[500*1024*1024/sizeof(int)];   std::cout << "p4 = " << p4 << '\n';
      return 0;
    }
    
    phargrov@ftg4:~$ g++ oom.cpp -o oom
    
    phargrov@ftg4:~$ ./oom
    p1 = 0x556dc6b42e70
    p2 = 0x556de5f43290
    p3 = 0x556dc6b42e70
    p4 = 0x556de5f432c0
    
    phargrov@ftg4:~$ ( ulimit -d $((650*1024)); ./oom )
    p1 = 0x55e7f8c50e70
    p2 = 0x55e818051290
    p3 = 0x55e7f8c50e70
    terminate called after throwing an instance of 'std::bad_alloc'
      what():  std::bad_alloc
    
    phargrov@ftg4:~$ ( ulimit -d $((900*1024)); ./oom )
    p1 = 0x563bf4e53e70
    p2 = 0x563c14254290
    p3 = 0x563bf4e53e70
    terminate called after throwing an instance of 'std::bad_alloc'
      what():  std::bad_alloc
    
  9. john bachan

    In defense of dlmalloc, from dlmalloc's own self description:

    This is not the fastest, most space-conserving, most portable, or
    most tunable malloc ever written. However it is among the fastest
    while also being among the most space-conserving, portable and tunable.
    Consistent balance across these factors results in a good general-purpose
    allocator for malloc-intensive programs.
    

    Generally when allocators tout being "fast" they refer to small object allocations (< ~10KB, especially < ~1KB), since with bigger objects, whatever work you're going to do with them will put the allocation overhead in the noise. In our case, allocations serving RDMA tend to be big, since you can only perform small RDMAs with small objects, and that isn't the best usage of the hardware. So in choosing a good allocator to service our audience, dlmalloc sells itself well. tcmalloc and jemalloc are not appropriate for us; they waste gobs of address space to be fast in multithreaded execution. Our address space is not virtual and is therefore precious to us, so an allocator that contends for a lock slowly but keeps space tight is much more attractive, I think.

    Allocator pluggability is a nice idea; unfortunately there isn't a standard API I'm aware of for our restrictive scenario where we need to embed within a given address range. dlmalloc has this with its "mspace_" API, and the existence of that is actually probably the biggest motivator for us using it. tcmalloc/jemalloc just assume their role is to replace the system malloc.
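
    For readers unfamiliar with it, this is roughly what embedding via dlmalloc's "mspace_" API looks like (a sketch assuming dlmalloc built with MSPACES; the declarations normally come from dlmalloc's own header rather than being spelled out like this):

    #include <cstddef>
    #include <cstdlib>

    extern "C" {
      typedef void* mspace;
      mspace create_mspace_with_base(void *base, size_t capacity, int locked);
      void*  mspace_malloc(mspace msp, size_t bytes);
      void   mspace_free(mspace msp, void *mem);
      size_t destroy_mspace(mspace msp);
    }

    int main() {
      const size_t cap = 64 * 1024 * 1024;
      void *arena = std::malloc(cap);          // stand-in for a pinned, registered segment
      mspace ms = create_mspace_with_base(arena, cap, /*locked=*/1);
      void *obj = mspace_malloc(ms, 1024);     // every allocation stays inside the arena
      mspace_free(ms, obj);
      destroy_mspace(ms);
      std::free(arena);
    }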

    So far I'm not seeing a lot of places this issue can go except offering alternative allocators, with the implementation of each a hard-won achievement. It's worth noting that I believe a purely best-fit allocator would not have failed Steve's test, and that we have a hand-rolled one in the repo to service CUDA allocations. It's conceivable I could throw that under upcxx::allocate to let interested parties try it out.
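
    As a toy illustration of the best-fit remark (not dlmalloc or UPC++ code): with a best-fit policy the second small request goes into the smallest free extent that fits, so the 500 MB hole freed by p1 survives and the final large request succeeds. The starting free-list state below mirrors the pointer trace above, in notional MB.

    #include <cstddef>
    #include <cstdio>
    #include <list>

    struct Extent { std::size_t off, len; };

    long best_fit_alloc(std::list<Extent> &fl, std::size_t len) {
      auto best = fl.end();
      for (auto it = fl.begin(); it != fl.end(); ++it)
        if (it->len >= len && (best == fl.end() || it->len < best->len))
          best = it;                        // smallest free extent that still fits
      if (best == fl.end()) return -1;
      long off = best->off;
      best->off += len; best->len -= len;
      if (best->len == 0) fl.erase(best);
      return off;
    }

    int main() {
      // Free list just after p1 (500 MB at offset 0) is freed while p2 (a tiny
      // object at offset 500) is still live, as in the trace above.
      std::list<Extent> fl = {{0, 500}, {501, 149}};
      long p3 = best_fit_alloc(fl, 1);      // placed at 501, not inside the big hole
      long p4 = best_fit_alloc(fl, 500);    // now succeeds at offset 0
      std::printf("p3=%ld p4=%ld\n", p3, p4);
    }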

  10. Dan Bonachea

    I agree with all of @john bachan's points, except this one:

    In our case, allocations serving RDMA tend to be big, since you can only perform small RDMA's with small objects, and that isn't the best usage of the hardware.

    For an irregular application, the ability to perform small RMAs and AMOs without remote attentiveness can be crucial, which is another important use case. These are often performed on elements of a larger object, but one can also build graphs of small objects - we should not assume that objects in the shared heap tend to be big for all applications.
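
    For concreteness, a minimal sketch of the kind of operation being described (illustrative only): a remote fetch-and-add on a single 64-bit counter, which completes without any attentiveness from the target process.

    #include <atomic>
    #include <cstdint>
    #include <upcxx/upcxx.hpp>

    int main() {
      upcxx::init();
      // One small shared counter per rank; everyone increments rank 0's copy.
      upcxx::global_ptr<std::int64_t> mine = upcxx::new_<std::int64_t>(0);
      upcxx::global_ptr<std::int64_t> target = upcxx::broadcast(mine, 0).wait();
      upcxx::atomic_domain<std::int64_t> ad({upcxx::atomic_op::fetch_add});
      // The returned future carries the previous value; no rpc, no remote progress needed.
      ad.fetch_add(target, 1, std::memory_order_relaxed).wait();
      upcxx::barrier();
      ad.destroy();
      upcxx::delete_(mine);
      upcxx::finalize();
    }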

    wrt: "API ... for our restrictive scenario where we need to embed within a given address range"

    UPCR defines such an API (which is in the same spirit as dlmalloc's mspace API), and uses it to interface with its embedded umalloc allocator - see the "Interface to Memory Allocators" section of this design document. This was originally done to allow UPCR to plug in different allocators, although we never exercised that flexibility in production (and consequently over the years I think the interface has crept a bit beyond this strict API, but it's still close to this). However, as I mentioned in this comment, UPC++ already has the ability to use the UPCR allocator (directly) via linking with upcc, so there's not a strong argument for us to embed umalloc until/unless that behavior proves crucial to a customer. I'm still waiting to hear clarification from @Steven Hofmeyr regarding how linking UPC's allocator into UPC++ affects behavior in his real application.

  11. john bachan

    I disagree with @Dan Bonachea disagreeing with me. I only described a tendency of RDMA apps. What he says is true: there (may) exist irregular applications that want to do lots of small allocations. But I assert that those are a minority of HPC/PGAS applications, therefore my description of the trend remains unimpeachable.

  12. Dan Bonachea

    What he says is true: there (may) exist irregular applications that want to do lots of small allocations. But I assert that those are a minority of HPC/PGAS applications, therefore my description of the trend remains unimpeachable.

    @john bachan : Speculation on hypothetical applications isn't really relevant.

    My understanding is that Steve's new kcount code in HipMer exhibits exactly this behavior for storing kmers in the shared heap with the DHT. So not only does the application exist, it's also one of our most important UPC++ applications. The code in their master branch is currently using the UPC allocator due to being a hybrid app, but if/when they excise the UPC code from the rest of the application, this would presumably revert to the UPC++ shared heap allocator. Of course they still have the option to continue linking UPCR to get umalloc to manage the heap, but it seems cumbersome to continue requiring their users to install all of Berkeley UPC just for that.

  13. Steven Hofmeyr reporter

    Yep, our communication pattern involves lots of random rpcs of varying size (some large), so given the nature of the rendezvous protocol, fragmentation is a real concern. But maybe a different allocator is not really the best fix. Probably what we'll do is keep the rpcs very small and use global memory to explicitly transfer the larger amounts of data. That way we can restrict and reuse the global memory allocated, which should help a lot with fragmentation. The only downside is that this is more complicated code, compared to the elegance of pure rpcs.
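
    A hedged sketch of that plan (names and structure are illustrative, not from HipMer): the rpc payload carries only a global_ptr and a length, and the target pulls the bulk data with rget into a buffer it allocates once and reuses, so messaging causes little shared-heap churn.

    #include <cstddef>
    #include <upcxx/upcxx.hpp>

    // One reusable landing buffer per process; point it at MAX_MSG bytes after init().
    constexpr std::size_t MAX_MSG = 4*1024*1024;
    char *landing = nullptr;

    // Sender: stage the bulk data in its own shared heap, then send only a handle.
    void send_bulk(upcxx::intrank_t target,
                   upcxx::global_ptr<char> staged, std::size_t len) {
      upcxx::rpc_ff(target, [](upcxx::global_ptr<char> src, std::size_t n) {
        // Pull the payload; the rendezvous protocol never has to carve a fresh
        // shared-heap buffer for it. (A real version would check n <= MAX_MSG,
        // track completion, and guard the shared landing buffer against
        // concurrent messages.)
        upcxx::rget(src, landing, n).then([n]() {
          // ... process landing[0..n) and notify the sender so `staged` can be reused ...
        });
      }, staged, len);
    }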

    Dan: we will definitely end up with pure upcxx at some point, maybe even this year.

  14. Dan Bonachea

    This was discussed in the 2020-02-12 meeting and deferred to next release milestone.

    It was noted that we could potentially meet HipMer's requested behavior with something less ambitious than a fully general allocator interface. In particular, we could import umalloc into the UPC++ tree and deploy it as an alternative to dl_malloc (selected either at configure time or at init via envvar). This would provide allocation behavior identical to UPCR's current local shared heap, without requiring them to link in all of UPCR to get that. The availability of an alternative independent allocator implementation might also be useful for debugging certain types of problems.

  15. Paul Hargrove

    The HipMer team has indicated that their approach to memory management has changed such that their requirements no longer provide us with motivation for this feature request.

    So, this is "on hold" until/unless there is sufficient motivation to pursue it.
