Clone wiki

saturnalia / Tile_Storage_Design

Tile Storage

The initial version of Saturnalia tile storage is designed to implement most of the low hanging fruit for tile compression.

While we feel there are is a significant amount of future optimization that can be done in this area, there are some easy optimizations that we can perform today that are still within 10-15x better compression than existing systems such as RRD or in-house systems based on SQL/NoSQL and/or traditional databases.

General Design

All integers in tiles are signed and represented as ZigZag encoding. This encoding has a number of advantages including easy representation and easy parsing by the protocol buffer library (and anyone using varint parsers).

With ZigZag encoding we can represent integers with:

 1 byte  = [-64,63]  2^6
 2 bytes = [-16384,16383] 2^14
 3 bytes = [-4194304,4194303] 2^22
 4 bytes = [-2147483648,2147483647]

Tiles are scaled down so that the min() value is subtracted from each slot value. The slots are then gap/delta encoded as derivatives.

For example, if we had the following values:

100, 101, 102, 103

This would be first reduced to:

0, 1, 2, 3.

We then pick a reference point within this set, in this case the median, and make all values stored from this reference point.

reference_point: 1
-1, 0, 1, 2

We pick a median reference point so that we can represent these efficiently using zigzag encoding. If the value were to become to large, it would require more byte storage but negative values are also represented in the same byte range in ZigZag encoding.

Forward Extension

All tiles in Saturnalia have a type associated with them. Rigth now we support dense and sparse metrics.

{{{ saturnalia_pb2.EventTile.DENSE_METRIC saturnalia_pb2.EventTile.SPARSE_METRIC }}

Future Optimizations

Fixed width binary integers

Right now we use varints to specify all integers but there might be options to specify integers as EXACTLY the set number of bits. For example, instead of just 8 or 16 bits, we could use 14 bits.

The protocol buffer would have a field which specifies the number of bits used to represents integers.

The one downside to this right now is that the gzip/zlib/bzip2 run length encoding options provide a significant space savings for us and if we use a custom length integer these algorithms wouldn't see repeated runs.

We could implement this but we would probably have to implement our own form of run length encoding.

Region Encoding

In practice we might see some contiguous ranges of positive integers (including zero) and then large regions of negative integers and then some regions of mixed. In these situations it might be possible to encode these with the exact storage we we require (in bits) but note that the set of integers are all positive or all negative. I would want to see this in production though on production values to see what they look like and this requires a large test suite of data to play with.

Eight Byte (long) DEFLATE Algorithm

In the following example we store two positive signed integers as zigzag encoded varints:

0101 0000 0001 1000 = 769 0001 0000 = 1

There are three unique bytes here:

0101 0000 0001 1000 0001 0000

A huffman tree would need two bits to encode these messages:

00 = 0101 0000 01 = 0001 1000 11 = 0001 0000

We would then end up with a bit stream of:


However, if we were to use a DEFLATE implementation which used integers instead of bytes we could end up with a huffman tree of:

1 = 769 0 = 1

And only need to bits to represent these items:


An we would also use the run length encoding present in DEFLATE which means common runs would be removed.

Memory Usage

For normal documents, using only byte encoding is the best way to go since the document could be hundreds of gigs in length and could comprise billions of unique 8 byte integers/longs.

The memory storage requirements for this would be insane and potentially require hundreds of gigs of memory.

However, in our usage we would only need to store 1024 (and certanly less than 100k in the worse case scenario).

It would be best to store a huffman tree based on integers (not bytes) and encode this directly which could avoid the varint penalty and get tighter compression.