Reduce index size

Matt Chaput created an issue
  • Compress ids, weights, values in posting block using zlib (715eb93e2663).
  • Encode term keys using a short code for the field name, saving the fieldname-code map with the term file.
  • Encode term info using single bytes for tf and df and a 4-byte int for the offset when possible. When tf == df == 1, store just the offset.
  • Encode stored fields as a list instead of a dict, storing the fieldname-pos map with the file.
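
The second and fourth points share one idea: write each field name once in a per-file map and refer to it by a small integer everywhere else. A minimal sketch of that idea follows. All function names here are hypothetical and are not taken from the Whoosh code base, and the one-byte field code and the use of None for an absent field are assumptions made for illustration.

```python
def build_field_map(fieldnames):
    """Assign each field name a small integer code, in sorted order."""
    return {name: i for i, name in enumerate(sorted(fieldnames))}


def names_by_code(field_map):
    """Invert the map into a list whose index is the field code."""
    return sorted(field_map, key=field_map.get)


def encode_term_key(field_map, fieldname, text):
    """Term key = one-byte field code followed by the UTF-8 term text."""
    code = field_map[fieldname]
    if code > 255:
        raise ValueError("this sketch supports at most 256 fields")
    return bytes([code]) + text.encode("utf-8")


def decode_term_key(names, key):
    """Recover (fieldname, text) from a key built by encode_term_key."""
    return names[key[0]], key[1:].decode("utf-8")


def stored_to_list(field_map, doc):
    """Turn a stored-fields dict into a positional list.

    Assumption: None marks an absent field, so a stored value of None
    cannot be distinguished from a missing one in this sketch.
    """
    values = [None] * len(field_map)
    for name, value in doc.items():
        values[field_map[name]] = value
    return values


def list_to_stored(names, values):
    """Rebuild the stored-fields dict from the positional list."""
    return {names[i]: v for i, v in enumerate(values) if v is not None}
```

The map itself would be written once alongside the term file or the stored-fields file, so each key or document record carries only the integer codes or positions.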

Comments (5)

  1. Matt Chaput
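    from struct import Struct

    # mb4, pack_int, unpack_int, encode_terminfo and decode_terminfo
    # are defined elsewhere.
    s1 = Struct("!BBI")  # tf (1 byte), postcount (1 byte), offset (4 bytes)
    s1pack, s1unpack = s1.pack, s1.unpack

    def enc1(w_off_count):
        w, offset, postcount = w_off_count
        if offset < mb4:
            if w == 1 and postcount == 1:
                # Most compact form: the offset alone (4 bytes)
                return pack_int(offset)
            elif w < 256 and postcount < 256:
                return s1pack(w, postcount, offset)
        # Fall back to the general encoding
        return encode_terminfo(w_off_count)

    def dec1(v):
        if len(v) == 4:
            return (1, unpack_int(v), 1)
        elif len(v) == 6:
            # Packed as (w, postcount, offset); return in enc1's input order
            w, postcount, offset = s1unpack(v)
            return (w, offset, postcount)
        else:
            return decode_terminfo(v)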

    This also involves switching the first argument back to tf instead of the sum of weights.
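
For experimenting with the length-based scheme outside the code base, here is a self-contained sketch. The stand-ins are assumptions: the 2**32 limit is a guess at what mb4 represents, and the 16-byte general form (`_FULL`) is hypothetical and only serves as a fallback whose length differs from 4 and 6 bytes. It also assumes tf is a non-negative integer, as the note about switching back to tf suggests.

```python
from struct import Struct

_UINT = Struct("!I")     # offset only: 4 bytes
_SHORT = Struct("!BBI")  # tf, df, offset: 6 bytes
_FULL = Struct("!IQI")   # hypothetical general form (tf, offset, df): 16 bytes
LIMIT = 2 ** 32          # assumed bound for an offset that fits in 4 bytes


def encode_terminfo_compact(info):
    """Encode (tf, offset, df) in the shortest of three forms."""
    tf, offset, df = info
    if offset < LIMIT:
        if tf == 1 and df == 1:
            return _UINT.pack(offset)
        if tf < 256 and df < 256:
            return _SHORT.pack(tf, df, offset)
    return _FULL.pack(tf, offset, df)


def decode_terminfo_compact(data):
    """Pick the decoder from the byte length; always return (tf, offset, df)."""
    if len(data) == _UINT.size:
        return (1, _UINT.unpack(data)[0], 1)
    if len(data) == _SHORT.size:
        tf, df, offset = _SHORT.unpack(data)
        return (tf, offset, df)
    return _FULL.unpack(data)
```

Decoding by length only works while the three forms have distinct sizes, so any real fallback encoding must never produce a 4-byte or 6-byte value.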

  2. Matt Chaput

    Changes to reduce index size, see issue #47. Miscellaneous fixes and improvements. Fixes to posting compression. More space-efficient coding of term info. Write stored fields as a list instead of a dictionary. Fixed speed of StructFile.write_array() on little-endian machines. Added Reuters 21578 benchmark.

