Issue #225 open

Smarter merge policies

gcc111
created an issue

It's possible to have just one segment with fewer than fib(i+5) documents in it, where i is the segment's position when segments are sorted by increasing size. Such a segment can still be quite large: for example, my 20th segment, in order of increasing size, had 21224 documents in it, while fib(25) is 121393.

MERGE_SMALL would always rewrite this segment, and only this segment, which is a complete waste of time.
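To make the failure mode concrete, here is a minimal sketch of the selection rule as described above. It is a simplified model based only on this report, not a copy of Whoosh's code: the helper `segments_to_rewrite` is hypothetical, segments are represented by bare document counts, and `fib` is assumed to start 1, 2, 3, 5, ... so that fib(25) matches the 121393 quoted above.

```python
def fib(n):
    """Fibonacci numbering assumed here: fib(1) == 1, fib(2) == 2,
    which gives fib(25) == 121393 as quoted in the report."""
    a, b = 1, 2
    for _ in range(n - 1):
        a, b = b, a + b
    return a


def segments_to_rewrite(doc_counts):
    """Hypothetical model of the rule: sort segments by size and select
    every non-empty segment whose count is below fib(i + 5), where i is
    its position in the sorted order. Returns (position, count) pairs."""
    ordered = sorted(doc_counts)
    return [(i, count) for i, count in enumerate(ordered)
            if 0 < count < fib(i + 5)]


# Two segments: 10 is not below fib(5) == 8, but 12 is below fib(6) == 13,
# so exactly one segment is selected and would be rewritten on its own.
selected = segments_to_rewrite([12, 10])
```

When the selection contains a single segment, the "merge" copies that segment into a new one of the same size, so the segment count does not go down and the same segment qualifies again on the next commit.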

I decided to implement a merge policy which combines all segments that are less than a certain size, unless there's only one of them. This seems to perform quite well:

{{{
#!python

def CUSTOM_MERGE_SMALL(writer, segments):
    """This policy merges small segments, where "small" is defined using a
    fixed number of documents. Unlike whoosh.filedb.filewriting.MERGE_SMALL,
    this one does nothing unless there's more than one segment to merge.
    """

    from whoosh.filedb.filereading import SegmentReader
    unchanged_segments = []
    segments_to_merge = []

    # Split segments into "small" (merge candidates) and everything else
    for segment in segments:
        if segment.doc_count_all() < 10000:
            segments_to_merge.append(segment)
        else:
            unchanged_segments.append(segment)

    if len(segments_to_merge) > 1:
        # Feed every small segment to the writer so they end up in one new segment
        for segment in segments_to_merge:
            with SegmentReader(writer.storage, writer.schema, segment) as reader:
                writer.add_reader(reader)
    else:
        # don't bother merging a single segment
        unchanged_segments.extend(segments_to_merge)

    return unchanged_segments
}}}

Comments (4)

  1. gcc111 reporter

    Sorry, in case it's not clear to the reader: rewriting just one segment creates a new segment of the same size, at great cost, which will simply be rewritten again on the next commit(), and again, and again... until someone adds another segment (or six in this case?) which is also too "small", with which this one can be merged.

  2. Matt Chaput repo owner

    Currently the merge policy naively uses the number of documents in a segment as the measure of how large it is, based on the assumption that individual documents are of roughly equal, reasonable size. Very large documents break the policy.

    1. Change or add new merge policies to use segment file size instead of number of documents as a more accurate proxy for segment size.
    2. Add a safeguard against remerging the same segment over and over again (as in Lucy's merge code).
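The two suggestions above could be combined along the following lines. This is only a sketch under stated assumptions: `plan_merge`, its `small_limit` default, and the name-to-bytes mapping are all hypothetical and not part of Whoosh, and the "at least two candidates" check is just one simple safeguard, not necessarily the one Lucy uses.

```python
def plan_merge(segment_bytes, small_limit=64 * 1024 * 1024):
    """Decide which segments to merge using on-disk size, not doc count.

    segment_bytes: dict mapping a segment name to its total size in bytes
                   (hypothetical input; the caller would measure the files).
    small_limit:   segments strictly below this many bytes count as "small"
                   (the default is an arbitrary illustrative value).

    Returns (to_merge, unchanged) as two sorted lists of segment names.
    """
    small = sorted(n for n, size in segment_bytes.items() if size < small_limit)
    large = sorted(n for n, size in segment_bytes.items() if size >= small_limit)

    if len(small) < 2:
        # Safeguard: rewriting a lone segment only reproduces it at full
        # cost, so leave everything untouched until a second small
        # segment shows up.
        return [], sorted(segment_bytes)

    return small, large


# Two small segments are merged; the large one is left alone.
merged, kept = plan_merge({"a": 10, "b": 20, "c": 100}, small_limit=50)
```

With a single small segment, for instance `plan_merge({"a": 10, "c": 100}, small_limit=50)`, nothing is selected for merging, which avoids rewriting the same segment on every commit.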