
Motivation

The main idea of Memoria was inspired by Ulrich Drepper's article What Every Programmer Should Know About Memory. Due to physical limitations (the finite speed of light), random memory access will always be 1-2+ orders of magnitude slower than ordered access. So, to achieve the highest performance, data has to be organized in memory in a predictable way. This principle lies behind the Data-Oriented Design paradigm.

There are two main ways to optimally organize data in memory: cache-friendly data structures exploiting explicit or hidden spatio-temporal locality of the data, and succinct or compressed data structures.

Cache friendliness is a generic property of a data layout that improves the effective throughput of the CPU cache. Array of Structures vs. Structure of Arrays is an example of this kind of data design.
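As a minimal illustration of the difference (the Particle type and field names are made up for this example, not taken from Memoria), summing one field over many records touches far less memory in the Structure-of-Arrays layout:

```cpp
#include <vector>

// Array of Structures: each Particle is one contiguous record, so scanning
// only the x field still pulls y, z and mass into the cache.
struct Particle { double x, y, z, mass; };

double sum_x_aos(const std::vector<Particle>& particles) {
    double sum = 0.0;
    for (const Particle& p : particles) sum += p.x;  // 8 useful bytes out of every 32 loaded
    return sum;
}

// Structure of Arrays: each field is stored densely, so the same scan
// reads only the bytes it actually needs and streams through the cache.
struct Particles {
    std::vector<double> x, y, z, mass;
};

double sum_x_soa(const Particles& particles) {
    double sum = 0.0;
    for (double v : particles.x) sum += v;           // every loaded byte is useful
    return sum;
}
```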

Succinct data structures try to achieve the lowest possible overhead in data representation. For example, a succinct representation of an ordinary tree spends only 2 bits per node on the tree structure. Compare that with about 20-40 bytes per node for linked-list based trees. As a result, much more application data can be packed into the same memory block with a succinct design.
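A minimal sketch of how 2 bits per node is possible (balanced-parentheses encoding, one of several such representations; this is an illustration, not Memoria code):

```cpp
#include <vector>

// Balanced-parentheses encoding of a tree given as an adjacency list:
// append 1 ('(') on entering a node and 0 (')') on leaving it, so the
// structure of an n-node tree costs exactly 2n bits.
void encode(int node, const std::vector<std::vector<int>>& children,
            std::vector<bool>& bits) {
    bits.push_back(true);                        // '(' - node opens
    for (int child : children[node]) encode(child, children, bits);
    bits.push_back(false);                       // ')' - node closes
}

// Example: a root (node 0) with three leaf children (1, 2, 3) encodes as
// "(()()())" - 8 bits for 4 nodes, i.e. 2 bits per node, versus
// three 8-byte child pointers per node in a linked representation.
```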

Of course, space efficiency comes at a cost. A succinct design uses the CPU intensively to pack and unpack application data. But with modern memory hierarchies this is no longer the limiting factor: in the case of a memory cache miss, hundreds of CPU cycles are lost. These cycles are the hidden computational resource that can be exploited to improve the performance of succinct and compressed data structures. And if data volumes grow faster than storage and processing capabilities, succinct and compressed designs become all the more relevant.

Today memory is relatively cheap, but fast memory will always be expensive, because memory and processing units must be placed as close together as possible to achieve the highest performance. Hence the amount of fast memory is limited by the volume (or area, for 2D structures) around the processing unit. So the more memory we want, the slower it is. This is the hierarchical memory model.

Hierarchical Computer Memory

Just a few numbers. The top level of the memory hierarchy is the processor registers: a few hundred bytes in size, but operating at the full speed of the processor, on the picosecond scale. The L1 cache, the next level of memory, operates on the nanosecond scale. L2 and the other levels of the memory hierarchy are each about an order of magnitude slower than the previous one.

But these numbers describe random access, the most universal access pattern. Ordered access can be much faster than random access. If access proceeds in a predictable way, the memory subsystem can prefetch the data that is expected to be requested in the near future. And for deterministic ordered access, the probability of this expectation being true is very high. Just a few numbers...

Memory Operation Latency Gnuplot script, Random access benchmark, Sequential access benchmark, LMbench

Our benchmark PC box has an Intel Q9400 CPU @ 3 GHz with 32K of L1D cache, 2x3M of L2 cache, and 8G of DDR2 RAM @ 750 MHz, running Fedora Core 16 with GCC 4.6.3.

The CPU cache boundaries (32K and 3M) are clearly marked on this plot. In the ordered case an L1 access takes 0.5 ns and a DRAM access takes about 1.2 ns - not much more. Accurate measurement of random access memory latency with LMbench3 gives us 1 ns, 5 ns and 80 ns for L1, L2 and RAM respectively. Pseudo-random access latency (red line) lies somewhere in between the ordered and the random one. Pseudo-random access is faster because of memory prefetching: thanks to hidden regularities in the pseudo-random access pattern, the memory controller is able to predict the next memory access and preload the data. For this test, memory prefetching gives us a significant improvement in memory latency.

Of course, most access patterns are random or unpredictable, but exploiting hidden regularities is the only way to achieve high performance in a distributed environment. This exploitation can be implicit, explicit or combined. An example of implicit exploitation is hardware memory prefetching in the CPU memory hierarchy. It works as an invisible background process in hardware and usually does not require any changes in software.

Explicit exploitation of access pattern regularities implies that program data is organized in memory in a certain way. Memory is read in relatively big blocks (32-128 bytes), and if the 'next data' is in the 'current' block then it will be accessed at full speed. Even if this access requires some computational work (for example, a linear search within the block), it can still be faster than a direct random access to RAM. The time the CPU spends waiting for memory is our hidden reserve of performance.
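A minimal sketch of this trade-off (block size and names are illustrative, not from the Memoria benchmarks): once a cache-line-sized block is loaded, scanning all of it costs a handful of cheap comparisons, while following a pointer to a random location costs a full memory round trip.

```cpp
#include <cstdint>
#include <cstddef>

// A 64-byte block holding eight sorted 64-bit keys - exactly one cache line.
constexpr std::size_t kKeysPerBlock = 8;

struct Block {
    std::int64_t keys[kKeysPerBlock];
};

// Linear search inside an already-loaded block: once the cache line is in L1,
// all eight comparisons together cost far less than a single DRAM access.
std::size_t find_in_block(const Block& block, std::int64_t key) {
    std::size_t i = 0;
    while (i < kKeysPerBlock && block.keys[i] < key) ++i;
    return i;   // position of the first key >= the search key
}
```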

Today two kinds of such algorithms are known: cache-aware algorithms and cache-oblivious algorithms (MIT lectures). Simply speaking, cache-aware algorithms explicitly exploit knowledge about the cache structure, such as its total size and cache line size. Cache-oblivious algorithms don't depend on this knowledge. Usually cache-aware algorithms are faster than cache-oblivious ones, but they are not portable between architectures with completely different memory structures, or even between different levels of the memory hierarchy. Cache-oblivious algorithms don't have these limitations, which is especially important for the external memory level of the hierarchy.

Although these kinds of algorithms have been known for a relatively long time, they are not in wide use. This can be partially explained by the rapid growth of memory capacity and CPU performance over the last decade: regular hardware performance improvements masked the weaknesses of software. But once the resource of implicit parallelism is exhausted, we will have to work with the memory hierarchy explicitly.

Weakness of Linked List Based Data Structures

The data structures currently in wide use are based on linked lists. Linked lists were pretty attractive many decades ago, when memory was faster than the CPU. Pointer access operations take O(1) time and are very fast if random memory access is fast. However, in modern hierarchical memory this is no longer true once the data structure doesn't fit into fast memory. Moreover, pointers are quite large (4-8 bytes). Such important data structures as trees have significant memory overhead when represented with linked lists. For example, std::set<Int64Bit> takes 40 bytes per RB-tree node (Linux, GCC 4.6), and the whole data structure takes 5 times the raw data size. If your data is 1M in size then it will take about 5M in memory. The situation is even worse for in-memory representations of structured documents (XML, HTML, ODF etc.), where the document representation can consume 100+ times more memory than the raw data (check your Firefox memory usage).
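A rough sketch of where those 40 bytes go (the field names are illustrative; the real type is libstdc++'s internal _Rb_tree_node):

```cpp
#include <cstdint>

// Approximate per-node layout behind std::set<std::int64_t> in libstdc++
// on x86-64 (a sketch, not the library's actual declaration).
struct RbTreeNode {
    std::uint32_t color;     // red/black flag, padded to 8 bytes
    RbTreeNode*   parent;    // 8 bytes
    RbTreeNode*   left;      // 8 bytes
    RbTreeNode*   right;     // 8 bytes
    std::int64_t  key;       // 8 bytes of actual payload
};
// sizeof(RbTreeNode) == 40: only 8 of the 40 bytes are user data, which is
// the 5x overhead mentioned above (before counting allocator headers).
```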

Linked-list based data structures are simple and flexible; they just do their job. But how well do they perform on modern hardware? Let's check the performance of two different implementations of a balanced search tree – one of the most fundamental data structures.

Let's consider a balanced partial sums tree representing an ordered sequence of numbers (it is used in Memoria).

An Array-packed Partial Sums Tree

And instead of implementing it with well-known linked lists, let's pack it into an array: PackedTree and PackedSet. We are not going to limit ourselves to the binary tree case: PackedTree is a multiary tree. Let's check its performance under various conditions and compare it with std::set<> (which is based on a binary RB-tree).

PackedTree Random Performance Gnuplot script, PackedTree benchmark, std::set benchmark

PackedTree is an example of a *cache-aware* data structure. When the packed tree fits into the CPU cache, trees with low fanout perform better. But when the data is in main memory, trees with high fanout perform better (except for the 64-children one). This is because a balanced tree with high fanout has fewer levels, which means fewer random access operations, while the search for the next child inside a node is sequential and quick. The benchmark results show that the PackedTree with 32 children performs better than the others. Each node takes 32*8 = 256 bytes of memory, which is about 8 cache lines. Our CPU can linearly scan 16 (32/2) values on average within the duration of a single average cache miss.

The linked-list based STL std::set<> is much faster than PackedTree when the tree fits into the CPU cache, because random access is fast there. However, when the data doesn't fit into L2, std::set<> is up to 3 times slower.

As you can see, this benchmark compares PackedTree and std::set at the same memory size. The PackedTree layout doesn't use pointers to children; they are provided implicitly by simple index arithmetic. Only the data values of internal nodes need to be stored explicitly. For a large fanout the PackedTree memory layout therefore has much smaller overhead than a linked-list based one with explicit pointers. This means that for the same number of leaves (elements in the data structure) PackedTree takes less memory.
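A minimal sketch of the idea (not the actual PackedTree code): nodes are laid out level by level in one flat array, child positions follow from index arithmetic, and a search descends by consuming partial sums with a short linear scan per level.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// A toy array-packed K-ary partial sums tree. The children of node i live at
// positions K*i + 1 ... K*i + K, so no child pointers are stored at all.
template <std::size_t K>
class PackedSumTree {
public:
    explicit PackedSumTree(const std::vector<std::int64_t>& leaves) {
        // Pad the leaf level up to a full power of K and compute how many
        // internal nodes a complete K-ary tree with that many leaves has.
        std::size_t width = 1;
        while (width < leaves.size()) width *= K;
        std::size_t internal = (width - 1) / (K - 1);
        tree_.assign(internal + width, 0);
        leaf_base_ = internal;
        for (std::size_t i = 0; i < leaves.size(); ++i)
            tree_[leaf_base_ + i] = leaves[i];
        // Fill internal nodes bottom-up with the sums of their children.
        for (std::size_t i = internal; i-- > 0; )
            for (std::size_t c = 1; c <= K; ++c)
                tree_[i] += tree_[K * i + c];
    }

    // Find the leaf at which the running prefix sum exceeds `target`
    // (target must be less than the total sum stored in the root).
    // Each level is a short linear scan over K values lying in a few
    // consecutive cache lines - the cache-friendly part of the design.
    std::size_t find(std::int64_t target) const {
        std::size_t i = 0;
        while (i < leaf_base_) {
            std::size_t c = K * i + 1;
            while (target >= tree_[c]) { target -= tree_[c]; ++c; }
            i = c;
        }
        return i - leaf_base_;   // index of the leaf that was found
    }

private:
    std::vector<std::int64_t> tree_;
    std::size_t leaf_base_ = 0;
};
```

With K = 32 each node occupies 256 bytes, matching the fanout that wins in the benchmark above.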

PackedTree is faster and smaller. So why don't we use it instead of std::set? First of all, it is static: insertion of a new element requires a full rebuild of the tree, which may be very expensive. Second, std::set is much faster for 'small' data sets, which is an important real-life use case. But as real-life data grows, we need new data structures that perform well in hierarchical memory, in addition to the classic ones.

The situation is even worse with external memory, which is so slow at random access that each external IO operation has to be taken into account. No single size fits all here: efficient data structures for external memory are always application-specific or workload-specific (think NoSQL). And this is true not only for external memory. Intelligent data processing requires complex data structures such as searchable sequences and bit vectors, and giant volumes of hot structured data need succinct and compressed versions of these structures. This is why application-specific data structures are the black magic of computer science: each of them incorporates a significant amount of hardcore knowledge.

The only way to accelerate progress in this area is solution sharing. To make this possible we need a development framework, or toolkit, for data structures in generalized hierarchical memory (from CPU registers to data grids).

What is Memoria?

While statistically good data layouts (cache-aware, cache-oblivious) are known, no single data layout in hierarchical memory is sufficient to achieve the highest performance for all workloads. This is the same effect that triggered the rapid evolution of virtual machines with run-time code profiling. Eventually we will see 'virtual machines' capable of inferring and customizing optimal data layouts for the current workload.

Memoria is for data structures what Qt is for GUIs. It provides basic building blocks which are common across different designs and types. The fundamental data structure of Memoria is a balanced search tree with variable-sized nodes. It is something like a B-Tree, but with many differences. The balanced tree of Memoria is generalized by design and can be customized into various types of balanced search trees.

The word "memoria" is Latin, and can be translated as "memory". Memoria was the term for aspects involving memory in Western classical rhetoric discourse.

Memoria is written in C++ with C++11 extensions and relies heavily on template metaprogramming to construct data structures from basic building blocks. It provides an STL-like API for client code. GCC 4.6 and Clang 3.0 are supported. It uses a modular design with separation of concerns: physical memory management is separated from the data processing logic and isolated behind the Allocator interface. The default implementation of Allocator (SmallInMemoryAllocator) provides serialization of the allocator's state to a stream and copy-on-write based transactions.

The project core provides a templated balanced search tree, as well as several basic data structures built on top of it: Map, Set, Vector and VectorMap.

Framework Overhead

At first sight, Memoria's core data structure is a balanced dynamic search tree with an array-packed static search tree in its nodes (yes, trees in a tree). Node sizes are not limited by some structural constant, so nodes can contain arbitrarily complex data structures. The main balanced search tree is transactional through a copy-on-write algorithm. So internal memory management is quite complex and data access is indirect. How well does it perform?

We compared Memoria Set with the lightweight array-packed PackedSet and with the already well-known std::set on data sets of the same size.

Memoria Set Random Read Performance Gnuplot script, PackedTree benchmark, std::set benchmark

As we can see, Memoria Set is much slower than std::set for small data sets. But for large enough data sets, Memoria Set becomes several times faster than std::set. Moreover, Memoria Set is only about 3 times slower than the lightweight PackedSet on large data sets. This is how the hidden reserve of performance can be exploited: Memoria is a quite complex framework, but it does not introduce significant run-time overhead for large data sets. And there is room for further optimizations and improvements.

Memoria Update Performance

A cool feature of the Memoria balanced search tree is batch update operations. If multiple updates are performed in a batch, some amount of computational work can be shared between the individual update operations. There are two kinds of batching: batch insertions/deletions and transactional grouping. There is no upper limit on batch or transaction size.
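A minimal sketch of why batching helps (generic code, not Memoria's API): inserting a pre-sorted run into a sorted block with a single shift does the bookkeeping once, while individual inserts repeat it for every element. The sketch assumes the whole run falls between two adjacent existing keys.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One element at a time: every insert shifts the tail of the block and
// (in a real tree) walks the search path and updates parent sums again.
void insert_one_by_one(std::vector<std::int64_t>& block,
                       const std::vector<std::int64_t>& batch) {
    for (std::int64_t v : batch) {
        auto pos = std::upper_bound(block.begin(), block.end(), v);
        block.insert(pos, v);                     // one tail shift per element
    }
}

// Batched: find the insertion point once, shift the tail once, and copy the
// whole pre-sorted run in a single pass, sharing the per-insert bookkeeping.
// Assumes the entire run belongs at one insertion point in the block.
void insert_batch(std::vector<std::int64_t>& block,
                  const std::vector<std::int64_t>& sorted_batch) {
    auto pos = std::upper_bound(block.begin(), block.end(), sorted_batch.front());
    block.insert(pos, sorted_batch.begin(), sorted_batch.end());
}
```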

In our benchmarks, the random insertion rate went from about 450K/sec for single inserts to 20M/sec for 128-element batches, and finally reached more than 80M/sec for batches of 10K+ elements.

Placing several updates in one transaction does not bring such a giant performance improvement - only 3-5 times. Check it...

Memoria Set Sequential Read Performance Gnuplot script, Memoria Set benchmark, std::set benchmark

Memoria Set Batch Insert Performance Gnuplot script, random insert benchmark, sequential append benchmark

Memoria Set Commit Performance Gnuplot script, random insert benchmark, sequential append benchmark

Update Rate Limits

How fast can updates be? Memoria uses 4K memory blocks for search tree nodes by default, and these benchmarks use that default.

Let's memmove() data within randomly selected 4K blocks at randomly selected indexes (emulating insertions into blocks). For arrays that don't fit into the CPU cache we get about 2M moves/sec and about 4 GB/sec of memory throughput (on our Q9400 PC box). So 2M insertions per second is an upper bound on the random individual insertion rate of our balanced tree.
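A minimal sketch of such a measurement (names, sizes and the random source are illustrative, not the original benchmark code):

```cpp
#include <cstdint>
#include <cstring>
#include <random>
#include <vector>

// Emulate a single insertion into a 4K block: shift the tail of the block
// one element to the right, as a real in-block insert would.
constexpr std::size_t kBlockSize = 4096;
constexpr std::size_t kElemsPerBlock = kBlockSize / sizeof(std::int64_t);  // 512

void emulate_insert(std::int64_t* block, std::size_t index) {
    std::memmove(block + index + 1, block + index,
                 (kElemsPerBlock - index - 1) * sizeof(std::int64_t));
}

int main() {
    // A working set much larger than the CPU cache, split into 4K blocks
    // (~256 MiB here).
    const std::size_t num_blocks = 1u << 16;
    std::vector<std::int64_t> data(num_blocks * kElemsPerBlock, 0);

    std::mt19937_64 rng(42);
    std::uniform_int_distribution<std::size_t> pick_block(0, num_blocks - 1);
    std::uniform_int_distribution<std::size_t> pick_index(0, kElemsPerBlock - 2);

    // Time this loop to get moves/sec and derive the memory throughput.
    for (std::size_t i = 0; i < 10000000; ++i) {
        std::int64_t* block = data.data() + pick_block(rng) * kElemsPerBlock;
        emulate_insert(block, pick_index(rng));
    }
}
```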

The 450K insertions/sec from the previous benchmark is not very far from this limit, so the Memoria core data structure does not introduce significant overhead over the hardware memory level. And there is still room for improvement.

memmove() performance Gnuplot script, memmove() benchmark

Dynamic Vector

Dynamic Vector is like the well-known std::vector, but it supports generic insert/remove operations with O(Log N) worst-case time complexity.

Suppose we have a partial sums tree based mapping from an integer key to a value. It is much like the well-known std::map, but it can shift all keys by a specified distance with O(Log N) worst-case time complexity. Dynamic Vector is built on such a partial sums key-value map, where the key is the offset of a fixed-size data block in the vector's index space and the value is the ID of this block.

Dynamic Vector
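A minimal conceptual sketch of the lookup (not Memoria's implementation): the partial sums over block sizes convert a global position into a (block ID, offset within block) pair.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Conceptual model of a dynamic vector: an ordered list of data blocks;
// the partial sums of their sizes play the role of the key space.
struct BlockRef {
    std::int64_t block_id;   // ID of the data block (resolved via the allocator)
    std::int64_t size;       // number of bytes currently stored in the block
};

// Translate a global position into (block ID, offset inside that block).
// A real implementation descends a balanced partial sums tree in O(log N);
// a linear scan is enough to show the idea.
std::pair<std::int64_t, std::int64_t>
locate(const std::vector<BlockRef>& blocks, std::int64_t position) {
    for (const BlockRef& b : blocks) {
        if (position < b.size) return {b.block_id, position};
        position -= b.size;      // skip this block's contribution to the prefix sum
    }
    return {-1, -1};             // past the end of the vector
}
```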

It is not a replacement for std::vector<> when fast random access is required. But its sequential read throughput is limited only by main memory, and the benchmark shows that it comes quite close to that limit (4 GB/sec) even for 4K blocks. Random access throughput for 4K blocks comes close to 1.5 GB/sec, which is very high in absolute numbers :) Random access performance for 128-byte blocks is about 850K reads/sec. Check it:

Memoria Vector Read Performance Gnuplot script, Vector random read benchmark, Vector sequential read benchmark

Vector Update Performance

Memoria Vector Insert Performance Gnuplot script, Vector random insert benchmark, Vector append benchmark

In Dynamic Vector insertions are slower than reads, mainly because it is necessary to move data inside the data blocks and to update the index tree for the new data. In absolute numbers, while sequential read comes close to 4 GB/sec, sequential append reaches only 1.2 GB/sec on our PC box.

Random insert memory throughput for 16K blocks comes close to 600 MB/sec, which is much more than current HDDs/SSDs are able to consume.

Random insert performance for 128-byte blocks is about 550K writes/sec, which is not far from the 2M/sec practical limit.

Dynamic Vector does not introduce significant overhead over hardware memory for random insertions.

VectorMap

VectorMap is a mapping from BigInt (64 bits) to a dynamic vector. It can be viewed as a file system without a directory structure and with integer numbers as file names. But unlike most file systems, VectorMap allows inserting/removing data in the middle of a 'file'. It is used to associate raw binary data with integer keys.

It is a combination of a dynamic vector and a set of (Key, Offset) pairs represented with a two-key partial sums tree.

VectorMap Internal Structure
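A minimal conceptual sketch (not the actual internal structure): each entry pairs a 64-bit key with the size of its data run, and the offset of a 'file' inside the shared dynamic vector is the prefix sum of the sizes of the preceding entries.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Conceptual model of a VectorMap entry: a 64-bit key plus the length of
// the data run owned by that key inside one shared dynamic vector.
struct Entry {
    std::int64_t key;    // the "file name"
    std::int64_t size;   // length of this key's data, in bytes
};

// Resolve a key into (offset, size) within the shared data vector.
// In the real structure both coordinates are found by one balanced
// two-key partial sums tree in O(log N); a linear scan shows the idea.
std::pair<std::int64_t, std::int64_t>
resolve(const std::vector<Entry>& entries, std::int64_t key) {
    std::int64_t offset = 0;
    for (const Entry& e : entries) {
        if (e.key == key) return {offset, e.size};
        offset += e.size;             // prefix sum of preceding entry sizes
    }
    return {-1, 0};                   // key not found
}
```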

It is relatively succinct – only 8 or 16 bytes per entry (this value doesn't include internal search tree nodes). For 256-byte values the total overhead is less than 10%. It achieves 300K random reads and 200K random writes per second for 128-byte values, and 3.8/1.2 GB/sec memory read/write throughput for 256K-byte values.

Up to 2/0.3 GB/sec sequential read/write throughput for 256 byte values.

And it also supports batch updates.

Memoria VectorMap Random Performance Gnuplot script, VectorMap random insert benchmark, VectorMap random read benchmark

Memoria VectorMap Sequential Performance Gnuplot script, VectorMap append benchmark, VectorMap sequential read benchmark
