Persistent Data Structures

Persistent data structures (PDS) introduce concurrent point-in-time semantics: each update operation produces a new version of the data structure, and versions may coexist and be used concurrently and independently. Point-in-time semantics simplify algorithm design because data is "frozen": ongoing update operations do not change already established facts. Iteration over a table, for instance, becomes repeatable.
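
As a minimal illustration (a sketch of the general technique, not Memoria's API), the classic path-copying approach shows how each update yields a new version while old versions stay intact:

```cpp
// A persistent singly-linked list: push_front returns a brand-new version;
// old versions stay valid, so iteration over any version is repeatable
// even while new versions appear.
#include <cstdio>
#include <memory>

struct Node {
    int value;
    std::shared_ptr<const Node> next;
};
using Version = std::shared_ptr<const Node>;  // a version is just a root pointer

// Produce a new version that structurally shares the whole old list.
Version push_front(const Version& v, int value) {
    return std::make_shared<const Node>(Node{value, v});
}

void print(const Version& v) {
    for (auto n = v.get(); n; n = n->next.get()) std::printf("%d ", n->value);
    std::printf("\n");
}

int main() {
    Version v1 = push_front(nullptr, 1);   // version 1: [1]
    Version v2 = push_front(v1, 2);        // version 2: [2 1]
    print(v1);  // still [1] -- established facts never change
    print(v2);  // [2 1]
}
```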

Concurrent semantics, implemented through persistent versioning, also simplifies the implementation of replicated state machines on top of persistent data structures, making strong data consistency in distributed environments much easier to achieve than with classic data structure designs.

Copy-on-write-based (CoW) persistent data structures have a very attractive property: wait-free data access, where writers never interfere with readers and vice versa on concurrent access. CoW effectively reduces a persistent data structure to messaging: updates are just specifically organized messages monotonically accumulating in a key-value store, enabling a limited form of snapshot consistency over eventually consistent memory.
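
A sketch of the publication pattern this implies, under assumed names (g_root, View): a writer builds a new immutable version off to the side and publishes it with a single root-pointer swap, so readers are never blocked or invalidated. Note that std::atomic<std::shared_ptr> may lock internally; a production wait-free design would use hazard pointers or epoch-based reclamation:

```cpp
// Requires C++20 for std::atomic<std::shared_ptr<T>>.
#include <atomic>
#include <map>
#include <memory>

using View = std::map<int, int>;  // one immutable version of the data
std::atomic<std::shared_ptr<const View>> g_root{
    std::make_shared<const View>()};

// Writer: copy, modify, publish. Readers holding the old root are unaffected.
// (A full copy here; a real structure would share unchanged parts.)
void put(int key, int value) {
    auto next = std::make_shared<View>(*g_root.load());
    (*next)[key] = value;
    g_root.store(std::move(next));  // the "message" is the new root pointer
}

// Reader: one atomic load pins a consistent point-in-time view.
std::shared_ptr<const View> read() { return g_root.load(); }
```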

Because already established facts in persistent data structures are immutable, efficient distributed caching becomes possible for such data structures. Once some Key->Value mapping is established within such an immutable data structure, it will never change, so every read either gets the correct latest Value or no value at all. Pairs that are no longer in use will eventually be evicted from all caches automatically; no special cache management is necessary.
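
A small sketch of what this buys a cache, with hypothetical names: since a mapping can never change once written, a miss is the only failure mode and no invalidation protocol is needed:

```cpp
#include <optional>
#include <string>
#include <unordered_map>

class ImmutableCache {
    std::unordered_map<std::string, std::string> entries_;
public:
    // Either the correct (and only possible) value, or nothing yet.
    std::optional<std::string> get(const std::string& key) const {
        auto it = entries_.find(key);
        if (it == entries_.end()) return std::nullopt;
        return it->second;
    }
    // First write wins; a later write for the same key must carry the
    // same value by construction, so it can be ignored safely.
    void put(const std::string& key, const std::string& value) {
        entries_.emplace(key, value);
    }
};
```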

Despite these appealing properties, there are not that many practical examples of persistent data structures, and for a reason. Some notable examples are:

  • ZFS/Btrfs file systems provide a snapshotting feature that can freeze certain files and directories in time, with on-demand rollback. Certain applications like relational databases can use this feature for managing complex updates. Ideally, with such a feature it is not necessary to make a very expensive database dump and restore in case of a complex schema change: if the change fails for some reason, the database state can be reverted by returning the filesystem to a previously made snapshot. Unfortunately, this requires some level of support from applications like RDBMSes, which is not necessarily available. Changesets over snapshots are transferable, making some form of filesystem replication possible.

  • The Git distributed version control system is used to manage change-sets made concurrently and independently on text files, most commonly program source code. Perforce may be used for versioning large binary files.

  • Immutable.js is used in React.js for UI change management, optimizing HTML page rendering with transaction-like semantics. Persistent data structures make it easy to determine what changed in the page layout and to apply those changes selectively.

  • Noms is a rare example of a decentralized database based on persistent data structures.

Complexities

Concurrent semantics comes at a cost. There are three main complexities usually associated with persistent data structures:

  1. CPU efficiency. CoW-based structures are usually backed by persistent search trees, either binary trees or b-trees. For a classical data structure like an array, which provides fast O(1)-bound read access, the corresponding persistent data structure provides only O(log N)-bound read access, which can be much slower in read-heavy applications. On the other hand, dynamic vectors, already based on search trees, keep the same asymptotic bounds for their operations.

  2. Memory efficiency. Keeping versions consumes memory, sometimes much more than intuitively expected. This memory overhead is an inevitable price we must pay for automatic concurrency.

  3. Implementation complexity. Persistent data structures can be much more complicated than their classical counterparts from the engineering perspective. Some operations on PDS can't be implemented at all, or their implementation will be much more complicated. For example, persistent search trees can have only child links, but no parent or sibling links. Without parent/sibling links, leaf iteration is still possible by keeping the full path from the root to the current leaf, which adds complexity to iterators (see the sketch after this list).
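
The following sketch (simplified, with hypothetical names) illustrates the bookkeeping from item 3: nodes carry only child links, so the iterator must maintain the whole root-to-current path itself to find the next node:

```cpp
#include <cstdio>
#include <memory>
#include <vector>

struct TNode {
    int key;
    std::shared_ptr<const TNode> left, right;  // child links only: no parent
};

class TreeIterator {
    std::vector<const TNode*> path_;           // pending ancestors + current node
    void descend_left(const TNode* n) {
        for (; n; n = n->left.get()) path_.push_back(n);
    }
public:
    explicit TreeIterator(const TNode* root) { descend_left(root); }
    const TNode* current() const { return path_.empty() ? nullptr : path_.back(); }
    // In-order successor: exactly the walk a parent link would make trivial.
    void next() {
        if (path_.empty()) return;
        const TNode* n = path_.back();
        path_.pop_back();
        if (n->right) descend_left(n->right.get());
    }
};

int main() {
    auto leaf = [](int k) {
        return std::make_shared<const TNode>(TNode{k, nullptr, nullptr});
    };
    auto root = std::make_shared<const TNode>(TNode{2, leaf(1), leaf(3)});
    for (TreeIterator it(root.get()); it.current(); it.next())
        std::printf("%d ", it.current()->key);  // prints: 1 2 3
}
```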

As a result, we don't have many PDS frameworks around. Functional programming languages contain them as first-class citizens, mostly because it is the only way to implement dynamic data structures in immutable memory. Because of implementation complexity, they are otherwise mostly used in narrow, highly concurrent parallel environments, where locking is inefficient but point-in-time semantics is required.

Memoria's approach to PDS

There are two main differences between Memoria and other PDS frameworks:

  1. Memoria is focused on data structures over distributed memory abstracted through an eventually-consistent Key/Value model, which is easily virtualizable. Because of the many additional indirection layers, Memoria's containers will be somewhat slower than their main-memory-only counterparts.

  2. Instead of implementing each data structure as a persistent one over Key-Value memory, Memoria provides a persistent indirection layer, efficiently resolving a pair of (Snapshot_ID, Key) to some Global_Key for a given Value. This mapping is done with a regular CoW-based b-tree, which is actually the only persistent data structure in Memoria.
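
A sketch of what such an indirection layer might look like (the real mapping is a CoW-based b-tree; here a std::map and a parent link stand in, and all names are illustrative):

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <optional>

using Key       = std::uint64_t;
using GlobalKey = std::uint64_t;  // final address of the Value in the store

struct Snapshot {
    std::shared_ptr<const Snapshot> parent;  // previous version, if any
    std::map<Key, GlobalKey> delta;          // keys (re)mapped in this snapshot
};

// Resolve (snapshot, key) -> GlobalKey: the nearest snapshot in the chain
// wins, so older versions keep seeing their own mappings untouched.
std::optional<GlobalKey> resolve(const Snapshot* s, Key key) {
    for (; s; s = s->parent.get()) {
        auto it = s->delta.find(key);
        if (it != s->delta.end()) return it->second;
    }
    return std::nullopt;
}
```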

Unlike other PDS frameworks, where each update makes a new separate version of every data structure, in Memoria all data structures behave like imperative ones within a snapshot, which is the unit of persistence. It is possible, however, to wrap each update in a separate snapshot, resulting in exactly the same behavior we have with classical PDS.
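
A hypothetical usage sketch of this snapshot model (Store, branch, commit and the container shape are illustrative stand-ins, not Memoria's actual API):

```cpp
#include <map>
#include <memory>
#include <string>

struct SnapshotHandle {
    std::map<std::string, std::string> data;  // stand-in for real containers
    std::shared_ptr<const SnapshotHandle> parent;
    void commit() { /* freeze: from here on this version never changes */ }
};

struct Store {
    std::shared_ptr<SnapshotHandle> head;
    // Branch a new writable snapshot off the latest committed version.
    std::shared_ptr<SnapshotHandle> branch() {
        auto s = std::make_shared<SnapshotHandle>();
        if (head) { s->data = head->data; s->parent = head; }
        return head = s;
    }
};

int main() {
    Store store;
    auto s1 = store.branch();
    s1->data["k"] = "v1";      // imperative updates within the snapshot
    s1->data["k"] = "v2";      // overwrite in place: no new version yet
    s1->commit();              // the snapshot is the unit of persistence
    auto s2 = store.branch();  // one update per snapshot = classical PDS behavior
    s2->data["k"] = "v3";
    s2->commit();
}
```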

Memoria's basic approach to PDS trades runtime efficiency for implementation simplicity. By introducing a persistent block address (key) translation layer, Memoria enables much simpler designs of imperative data structures. The cost is an additional B-tree lookup, which can be amortized with caches in some use cases.

Architecture of Memoria

All data structures in Memoria are block-based, which means they can be represented as a graph of arrays of fixed size. Blocks are identified with arbitrary integers, which can be the block's fixed address in memory, a UUID or a hash code. Block size is also arbitrary but limited, and different applications work best with different block sizes: write-heavy applications like OLTP databases prefer small blocks around 8KB, while read-heavy applications like analytical databases prefer larger blocks, say, 128KB in size. Given this, Memoria can operate on any key/value store, embedded or cloud. The core data structures are powerful enough to build such stores right on top of raw block devices; Memoria doesn't even need a filesystem to store data.
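
A sketch of the block model this describes, with an assumed interface (not Memoria's actual headers): blocks are opaque, bounded byte arrays identified by an arbitrary integer ID, so any key/value backend can serve them:

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

using BlockID = std::uint64_t;  // could equally be a UUID or a hash code
using Block   = std::vector<std::uint8_t>;

// Any backend implementing this interface can host Memoria-style containers:
// an embedded store, a cloud KV service, or a raw block device.
struct BlockStore {
    virtual ~BlockStore() = default;
    virtual std::optional<Block> get(BlockID id) const = 0;
    virtual void put(BlockID id, Block block) = 0;
    // Block size is arbitrary but bounded: e.g. ~8KB suits write-heavy OLTP
    // loads, ~128KB suits read-heavy analytical scans.
    virtual std::size_t max_block_size() const = 0;
};
```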

Such a data model is easily virtualizable, and the first step is to build a persistent key-value translation layer, or Allocator in local terminology. As already said, persistent data structures have very appealing properties, enabling concurrency, parallelism, fault tolerance, and operation-timeline semantics.

Memoria's data architecture

Other Design Goals of Memoria

Memoria provides both decent CoW-based persistent data structures and an execution environment for them, which has three main components:

  1. A fiber-based AIO subsystem with one worker thread per CPU core and lock-free P2P messaging between workers, scaling well to tens of cores (a minimal messaging sketch follows this list).

  2. Persistent data structures enabling efficient wait-free data sharing between workers.

  3. Heterogeneous actor-like programming environment unifying different technologies like C++, Java, JavaScript, CUDA, OpenCL, CPU, GPU and ASICs through messaging.
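
For the messaging in item 1, a minimal sketch of a lock-free single-producer/single-consumer ring, one per ordered worker pair, is a common way to realize such P2P messaging (illustrative only, not Memoria's actual queue implementation):

```cpp
#include <atomic>
#include <cstddef>
#include <optional>

template <typename T, std::size_t Capacity>  // Capacity must be a power of two
class SpscRing {
    T buf_[Capacity];
    std::atomic<std::size_t> head_{0};  // advanced by the consumer
    std::atomic<std::size_t> tail_{0};  // advanced by the producer
public:
    bool push(T v) {                    // called by the producing worker only
        auto t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == Capacity)
            return false;               // ring full
        buf_[t % Capacity] = std::move(v);
        tail_.store(t + 1, std::memory_order_release);
        return true;
    }
    std::optional<T> pop() {            // called by the consuming worker only
        auto h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return std::nullopt;        // ring empty
        T v = std::move(buf_[h % Capacity]);
        head_.store(h + 1, std::memory_order_release);
        return v;
    }
};
```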

Reconfigurable Memoria Node
