Reduce index size

Issue #47 resolved
Matt Chaput (repo owner) created an issue:
  • Compress ids, weights, values in posting block using zlib <<changeset 715eb93e2663>>.
  • Encode term keys using a short code for the field name, saving the fieldname-code map with the term file.
  • Encode term info using bytes for tf and df and int for offset when possible. When tf == df == 1 just store the offset.
  • Encode stored fields as a list instead of a dict, storing the fieldname-pos map with the file.
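As a rough illustration of the first bullet (not the actual code from changeset 715eb93e2663), a posting block's doc-id array can be serialized and run through zlib before being written, then decompressed and rebuilt on the read side:

```python
# Illustrative sketch only: compress a posting block's doc-id array with
# zlib before writing, and reverse the process when reading. (The real
# changeset compresses ids, weights and values per block.)
import zlib
from array import array

ids = array("I", range(0, 100000, 7))   # doc ids in one posting block
raw = ids.tobytes()
compressed = zlib.compress(raw, 9)      # level 9: smallest output

# Reading side: decompress and rebuild the array
restored = array("I")
restored.frombytes(zlib.decompress(compressed))

assert restored == ids
assert len(compressed) < len(raw)       # regular id sequences compress well
```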

Comments (5)

  1. Matt Chaput reporter
    from struct import Struct

    # pack_int/unpack_int, encode_terminfo/decode_terminfo and mb4
    # (presumably the largest offset that fits in four bytes) are
    # defined elsewhere in the codebase.

    s1 = Struct("!BBI")
    s1pack, s1unpack = s1.pack, s1.unpack

    def enc1(w_off_count):
        w, offset, postcount = w_off_count
        if offset < mb4:
            if w == 1 and postcount == 1:
                # Most common case (tf == df == 1): store just the offset
                return pack_int(offset)
            elif w < 256 and postcount < 256:
                # Small tf and df: one byte each plus a 4-byte offset
                return s1pack(w, postcount, offset)
        # Fall back to the full-width encoding
        return encode_terminfo(w_off_count)

    def dec1(v):
        if len(v) == 4:
            return (1, unpack_int(v), 1)
        elif len(v) == 6:
            # Reorder to match the (w, offset, postcount) tuple above
            w, postcount, offset = s1unpack(v)
            return (w, offset, postcount)
        else:
            return decode_terminfo(v)

    This also involves switching the first argument back to tf instead of the sum of weights.
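    A self-contained sketch of the same scheme, with stand-ins for the helpers the snippet references (mb4, pack_int/unpack_int and encode_terminfo/decode_terminfo are not shown in the comment, so these definitions are illustrative guesses):

```python
# Stand-alone sketch of the compact term-info encoding above.
# mb4 and the full-width fallback format are assumptions, not
# Whoosh's actual definitions.
from struct import Struct

mb4 = 2 ** 32                 # assumed: smallest offset NOT fitting in 4 bytes
_int = Struct("!I")           # 4 bytes: offset only (tf == df == 1)
_s1 = Struct("!BBI")          # 6 bytes: small tf, small df, 4-byte offset
_full = Struct("!IQI")        # 16 bytes: assumed full-width fallback

def encode(w, offset, postcount):
    if offset < mb4:
        if w == 1 and postcount == 1:
            return _int.pack(offset)
        if w < 256 and postcount < 256:
            return _s1.pack(w, postcount, offset)
    return _full.pack(w, offset, postcount)

def decode(v):
    if len(v) == 4:
        return (1, _int.unpack(v)[0], 1)
    if len(v) == 6:
        w, postcount, offset = _s1.unpack(v)
        return (w, offset, postcount)
    w, offset, postcount = _full.unpack(v)
    return (w, offset, postcount)

assert decode(encode(1, 1000, 1)) == (1, 1000, 1)        # 4-byte case
assert decode(encode(3, 1000, 5)) == (3, 1000, 5)        # 6-byte case
assert decode(encode(300, 2 ** 33, 5)) == (300, 2 ** 33, 5)  # fallback
```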

  2. Matt Chaput reporter

    Changes to reduce index size, see issue #47:
      • Miscellaneous fixes and improvements.
      • Fixes to posting compression.
      • More space-efficient encoding of term info.
      • Write stored fields as a list instead of a dictionary.
      • Fixed speed of StructFile.write_array() on little-endian machines.
      • Added Reuters 21578 benchmark.
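    The write_array() fix likely comes down to byte order: array.tobytes()/tofile() emit native byte order, so if the on-disk format is big-endian, a little-endian machine has to byteswap once per array rather than packing values one at a time. A hedged sketch of that idea (not Whoosh's actual StructFile code):

```python
# Illustrative sketch: write an array big-endian regardless of the
# machine's native byte order (not Whoosh's actual implementation).
import io
import sys
from array import array

def write_array_be(f, arr):
    if sys.byteorder == "little":
        arr = array(arr.typecode, arr)  # copy so the caller's array is untouched
        arr.byteswap()                  # swap to big-endian in one C-level call
    f.write(arr.tobytes())

buf = io.BytesIO()
write_array_be(buf, array("i", [1, 2]))
assert buf.getvalue() == b"\x00\x00\x00\x01\x00\x00\x00\x02"
```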

