Unnecessary (and leaked?) memory usage of deleted documents

Issue #492 new
James Howe
created an issue

I started investigating this when a single search on my 250MB index used 14GB of memory and didn't release it afterwards.

Line in question: https://bitbucket.org/mchaput/whoosh/src/a16ebac/src/whoosh/reading.py#lines-819

Every leaf node in a query's matcher tree is allocated its own copy of the set of deleted doc ids for that segment. With many nodes (e.g. under the current implementation of prefix search, which expands the prefix into many leaf matchers) and many deletions, this grows rapidly. There should instead be a single copy of the set, shared between all nodes.
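A hypothetical sketch of the proposed fix (class and method names here are illustrative, not Whoosh's actual API): build the frozenset of deleted docnums once per segment reader and hand the same object to every leaf matcher, instead of each matcher calling frozenset(...) on its own.

```python
class SegmentReader:
    """Stand-in for a per-segment reader (illustrative, not Whoosh's class)."""

    def __init__(self, deleted_docnums):
        self._deleted = set(deleted_docnums)
        self._deleted_frozen = None  # built lazily, shared by all matchers

    def deleted_set(self):
        # One frozenset per segment, however many matchers ask for it.
        if self._deleted_frozen is None:
            self._deleted_frozen = frozenset(self._deleted)
        return self._deleted_frozen


class LeafMatcher:
    """Stand-in for a leaf matcher that must skip deleted documents."""

    def __init__(self, reader):
        # Reference the shared frozenset instead of taking a private copy.
        self.excluded = reader.deleted_set()


reader = SegmentReader(range(1_000_000))
matchers = [LeafMatcher(reader) for _ in range(200)]

# All 200 matchers reference the *same* frozenset object, so memory cost
# is one set per segment rather than one set per leaf node.
assert all(m.excluded is matchers[0].excluded for m in matchers)
```

With a per-node copy, the same query would allocate 200 million-entry frozensets instead of one.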

The memory backing these frozensets appears to be leaked, even though the set objects themselves are not. Using various instrumentation techniques (gc, resource, weakref, pympler) I determined that this line was responsible for allocating multiple GB of RSS, and that the memory was not freed after the search, even though the sets themselves (and every other likely object I could find) had been collected. I reproduced this with multiple versions of CPython 3.5 and 3.6. Even more confusingly, the memory was freed when running under valgrind.
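A minimal sketch of the measurement approach, using two of the stdlib tools named above (resource is Unix-only); the allocation sizes are illustrative, not taken from the real index:

```python
import gc
import resource


def peak_rss():
    # ru_maxrss is reported in KiB on Linux and bytes on macOS; the units
    # only matter when comparing deltas on the same platform.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


before = peak_rss()

# Simulate many per-matcher copies of a large deleted-docnums set.
copies = [frozenset(range(100_000)) for _ in range(50)]
during = peak_rss()

del copies
collected = gc.collect()  # force a full collection after dropping them
after = peak_rss()

# Peak RSS is monotonic, so `after >= before` always holds; the point of
# the original investigation was that *current* RSS also failed to drop
# even after the frozensets had been collected.
```

Combining this with weakref callbacks on the suspect objects confirms whether the objects were collected while the process's resident memory stayed high.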

Some example output (from different queries, hence the different leak sizes):

pympler.tracker.ObjectTracker:

RSS: +3541640 KiB
Added objects:
                                        types |   # objects |   total size
============================================= | =========== | ============
                          <class '_io.BytesIO |          30 |    270.82 MB
                                  <class 'set |          13 |    153.84 KB
                                  <class 'int |        4567 |    124.88 KB
                          <class 'array.array |           9 |    116.53 KB
                                 <class 'dict |         230 |     59.33 KB
                                 <class 'list |          72 |     36.38 KB
                               <class 'method |         226 |     14.12 KB
                                  <class 'str |         203 |     12.17 KB
                           <class 'memoryview |          28 |      5.25 KB
                      frame (codename: inner) |           9 |      4.92 KB
                   frame (codename: __call__) |           9 |      4.06 KB
                        <class 'managedbuffer |          28 |      3.72 KB
              <class 'collections.defaultdict |          14 |      3.39 KB
                                <class 'tuple |          52 |      3.14 KB
  <class 'whoosh.filedb.structfile.BufferFile |          28 |      1.53 KB

valgrind massif:

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 26 157,971,851,075    4,444,310,584    4,443,650,050       660,534            0
99.99% (4,443,650,050B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->99.94% (4,441,529,249B) 0x4214BE: _PyMem_RawMalloc (obmalloc.c:73)
| ->99.94% (4,441,528,413B) 0x4213B7: _PyMem_DebugRawAlloc (obmalloc.c:1856)
| | ->99.94% (4,441,528,413B) 0x4213F1: _PyMem_DebugRawMalloc (obmalloc.c:1879)
| | | ->99.94% (4,441,528,413B) 0x4221B1: PyMem_RawMalloc (obmalloc.c:386)
| | |   ->99.94% (4,441,516,507B) 0x4234CE: _PyObject_Alloc (obmalloc.c:1427)
| | |   | ->99.94% (4,441,429,932B) 0x423608: _PyObject_Malloc (obmalloc.c:1437)
| | |   | | ->99.94% (4,441,429,932B) 0x4213B7: _PyMem_DebugRawAlloc (obmalloc.c:1856)
| | |   | |   ->99.94% (4,441,429,932B) 0x4213F1: _PyMem_DebugRawMalloc (obmalloc.c:1879)
| | |   | |   | ->99.94% (4,441,429,932B) 0x421522: _PyMem_DebugMalloc (obmalloc.c:1972)
| | |   | |   |   ->93.27% (4,145,121,592B) 0x4226A1: PyMem_Malloc (obmalloc.c:418)
| | |   | |   |   | ->93.27% (4,145,055,040B) 0x4A4CE0: set_table_resize (setobject.c:344)
| | |   | |   |   | | ->93.27% (4,145,033,472B) 0x4A52BC: set_add_entry (setobject.c:239)
| | |   | |   |   | | | ->93.27% (4,145,032,896B) 0x4A5389: set_add_key (setobject.c:425)
| | |   | |   |   | | | | ->93.27% (4,145,002,176B) 0x4A582C: set_update_internal (setobject.c:1011)
| | |   | |   |   | | | | | ->93.26% (4,144,653,120B) 0x4A591E: make_new_set (setobject.c:1064)
| | |   | |   |   | | | | | | ->93.26% (4,144,651,968B) 0x4A6607: frozenset_new (setobject.c:1109)
| | |   | |   |   | | | | | | | ->93.26% (4,144,651,968B) 0x4B4147: type_call (typeobject.c:895)
| | |   | |   |   | | | | | | |   ->93.26% (4,144,651,968B) 0x44A490: _PyObject_FastCallDict (abstract.c:2331)
| | |   | |   |   | | | | | | |     ->93.26% (4,144,651,968B) 0x44AB92: _PyObject_FastCallKeywords (abstract.c:2496)
| | |   | |   |   | | | | | | |       ->93.26% (4,144,651,968B) 0x52E122: call_function (ceval.c:4854)
| | |   | |   |   | | | | | | |         ->93.26% (4,144,651,968B) 0x5381DD: _PyEval_EvalFrameDefault (ceval.c:3328)
| | |   | |   |   | | | | | | |           ->93.26% (4,144,651,968B) 0x52D37C: PyEval_EvalFrameEx (ceval.c:754)
| | |   | |   |   | | | | | | |             ->93.26% (4,144,651,968B) 0x52DE2B: _PyEval_EvalCodeWithName (ceval.c:4159)
| | |   | |   |   | | | | | | |               ->93.26% (4,144,626,048B) 0x52E070: fast_function (ceval.c:4971)
| | |   | |   |   | | | | | | |               | ->93.26% (4,144,626,048B) 0x52E34E: call_function (ceval.c:4851)
| | |   | |   |   | | | | | | |               |   ->93.26% (4,144,626,048B) 0x5382E8: _PyEval_EvalFrameDefault (ceval.c:3344)
| | |   | |   |   | | | | | | |               |     ->93.26% (4,144,626,048B) 0x52D37C: PyEval_EvalFrameEx (ceval.c:754)
| | |   | |   |   | | | | | | |               |       ->93.26% (4,144,626,048B) 0x52DE2B: _PyEval_EvalCodeWithName (ceval.c:4159)
| | |   | |   |   | | | | | | |               |         ->93.26% (4,144,626,048B) 0x52E070: fast_function (ceval.c:4971)
| | |   | |   |   | | | | | | |               |           ->93.26% (4,144,626,048B) 0x52E34E: call_function (ceval.c:4851)
| | |   | |   |   ->06.67% (296,308,340B) 0x4227C5: PyObject_Malloc (obmalloc.c:479)

Comments (9)

  1. James Howe reporter

    Nope, I gave up investigating the leak after three days. I now make sure I optimise the index frequently to remove the deleted doc ids.

    The optimisation of having only a single copy of the set would reduce the impact. Redesigning how prefix queries are implemented would as well.

  2. James Howe reporter

No it isn't; there are too many variables: how many documents, how many terms per document, how each document field is configured, how many terms in each field, how many unique terms, how big each term is, how often documents are updated, what types of queries are used, how many queries are run, memory usage, runtime, results accuracy, and so on.
