The system architecture of Riak
Simple at the Core
At its heart, Riak is a decentralized key/value store, strongly influenced by Amazon's Dynamo and lessons learned from real-world application of the CAP Theorem to other distributed systems. It supports high availability at low cost by allowing applications to tune their relative needs for durability, partition-tolerance, and other business constraints.
A Riak cluster is generally run on a set of well-connected physical hosts. Each host in the cluster runs one Riak node. Each Riak node runs a set of virtual nodes, or "vnodes", that are each responsible for storing a separate portion of the key space.
Nodes are not clones of each other, nor do they all participate in fulfilling every request. The extent to which data is replicated, and when, and with what merge strategy and failure model, is configurable at runtime and flexible to meet the needs of many different applications.
one ring to find them
Riak uses the technique of consistent hashing to organize data storage. Central to any Riak cluster is a 160-bit integer space that is divided into equally-sized partitions. Each vnode is responsible for one of these partitions, and each document is stored in a set of partitions that can be computed directly from its key. This allows a client node to determine the "owners" of a given piece of data locally, without having to ask any central authority.
In the default configuration, each physical node of a Riak cluster will attempt to run roughly an equal number of vnodes. In the general case, this means that each node in the cluster is responsible for 1/(number of nodes) of the ring, or (number of partitions)/(number of nodes) vnodes. For example, if two nodes define a 1024-partition cluster, then each node will run 512 vnodes. By default, nodes claim their partitions at random intervals around the ring, which usually provides a sufficiently even distribution.
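The owner computation described above can be sketched in a few lines of Python. This is an illustration only, not Riak's actual Erlang implementation: the SHA-1 hash, the 64-partition ring, and the function names are assumptions made for the example.

```python
import hashlib

RING_SIZE = 2 ** 160    # the 160-bit integer space
NUM_PARTITIONS = 64     # a power of two; small to keep the example readable

def key_to_partition(bucket: str, key: str) -> int:
    """Hash a bucket/key pair onto the ring and return its partition index."""
    digest = hashlib.sha1(f"{bucket}/{key}".encode()).digest()
    point = int.from_bytes(digest, "big")       # a point on the 160-bit ring
    partition_size = RING_SIZE // NUM_PARTITIONS
    return point // partition_size

def preference_list(bucket: str, key: str, n: int = 3) -> list[int]:
    """The owning partition plus the next n-1 partitions around the ring."""
    first = key_to_partition(bucket, key)
    return [(first + i) % NUM_PARTITIONS for i in range(n)]
```

Because the hash and the partition layout are the same everywhere, any node running this computation arrives at the same owners for a given key, which is what makes the lookup local.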
coordination and gossip
When a value is being stored in (or retrieved from) the cluster, any node may participate as the coordinator for the request. The coordinating node consults the ring state to determine which vnode owns the partition in which the value's key belongs, then sends the request to that vnode, as well as to the vnodes responsible for the next N-1 partitions in the ring, where N is a bucket-configurable parameter that describes how many copies of the value to store. A put request may specify that at least W (W ≤ N) of those vnodes reply with success, and that DW (DW ≤ W) of them reply with success only after durably storing the value. The request is reported successful to the client only once both W and DW have been satisfied by the nodes in question. (A get request is similar, except that it has only one such value, called R.)
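The W/DW accounting can be illustrated with a small sketch. The `put_succeeded` helper and its reply format are hypothetical simplifications for this example, not Riak's actual request-handling code.

```python
def put_succeeded(replies, w, dw):
    """Decide whether a put met its quorum.

    `replies` is one (ok, durable) pair per vnode that answered: `ok` means
    the vnode accepted the write, `durable` means it had also written the
    value to stable storage before replying.
    """
    acks = sum(1 for ok, _ in replies if ok)
    durable_acks = sum(1 for ok, durable in replies if ok and durable)
    # Both thresholds must be met for the client to see success.
    return acks >= w and durable_acks >= dw
```

With N=3, W=2, DW=1, for instance, the put succeeds as soon as two vnodes acknowledge and at least one of those acknowledgements reflects a durable write.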
The ring state is shared around the cluster by means of a gossip protocol. Whenever a node changes its claim on the ring, it announces its change via this protocol. Each node also periodically sends its current view of the ring state to a randomly-selected peer, in case any nodes missed previous updates.
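The periodic peer exchange can be mimicked with a toy simulation. The integer "ring versions" and function names below are simplifying assumptions; real ring states are richer structures that are merged, not merely compared.

```python
import random

def gossip_round(views):
    """One gossip step: a random node sends its ring view to a random peer,
    which adopts it if it is newer (versions are plain integers here)."""
    sender, receiver = random.sample(range(len(views)), 2)
    if views[sender] > views[receiver]:
        views[receiver] = views[sender]

def gossip_until_converged(views, max_rounds=10_000):
    """Run gossip rounds until every node holds the same view."""
    rounds = 0
    while len(set(views)) > 1 and rounds < max_rounds:
        gossip_round(views)
        rounds += 1
    return rounds
```

Even when only one node has seen a ring change, repeated random pairwise exchange spreads the update to the whole cluster quickly, which is why missed announcements are eventually repaired.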
causality and versioning
With any node able to drive any request, and not all nodes needing to participate in each request, it is necessary to have a method for keeping track of which version of a value is current. This is where vector clocks ("vclocks") come in.
When a value is stored in Riak, it is tagged with a vclock, establishing its initial version. When a value is updated in Riak, the client provides the vclock of the object being modified so that this vclock can be extended to reflect the update. Riak can compare vclocks on different versions of the object and determine:
- Whether one object is a direct descendant of the other.
- Whether the objects are direct descendants of a common parent.
- Whether the objects are unrelated in recent heritage.
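The three outcomes above can be sketched with vclocks modeled as per-actor counters, a common simplification. The `descends` and `compare` helpers are illustrative only, not Riak's vclock module.

```python
def descends(a: dict, b: dict) -> bool:
    """True if vclock `a` has seen every event `b` has seen."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())

def compare(a: dict, b: dict) -> str:
    """Classify the relationship between two versions of an object."""
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a descends from b"
    if descends(b, a):
        return "b descends from a"
    return "concurrent"   # divergent updates; someone must reconcile
```

The "concurrent" case is the interesting one: neither version supersedes the other, so the store cannot safely pick a winner on its own.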
Using this knowledge, Riak can auto-repair out-of-sync data, and in harder cases can provide a client with an opportunity to reconcile divergent changesets in an application-specific manner.
Riak attempts to move data toward a consistent state across nodes, but it doesn't do so by comparing each and every object on each node. Instead, nodes that may need to update many values exchange a Merkle tree, which allows them to quickly decide which values need comparing.
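A minimal sketch of the idea follows. The binary tree shape, SHA-1 hashes, and the assumption that both sides hold the same key set are all simplifications for illustration.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def build(items):
    """Build a Merkle tree over a sorted list of (key, value-hash) leaves.
    Internal nodes are (hash, left, right); leaves are (hash, key)."""
    if len(items) == 1:
        key, vhash = items[0]
        return (h(key.encode() + vhash), key)
    mid = len(items) // 2
    left, right = build(items[:mid]), build(items[mid:])
    return (h(left[0] + right[0]), left, right)

def diff(a, b):
    """Keys whose values differ, descending only into unequal subtrees."""
    if a[0] == b[0]:
        return []                      # identical hashes: skip whole subtree
    if len(a) == 2 and len(b) == 2:    # both are leaves
        return [a[1]]
    return diff(a[1], b[1]) + diff(a[2], b[2])
```

If the root hashes match, the exchange stops immediately; otherwise only the subtrees that disagree are walked, so the cost is proportional to the number of differences rather than the number of keys.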
pluggable data backends
Sharing data among nodes, on rings, etc. is all well and good, but at some point, it has to actually be stored somewhere - like on disk! Because Riak is relevant to a wide variety of applications, its "backend" storage system is a pluggable one.
Each node may be configured with a different module for managing local storage. This module only needs to define "get", "put", "delete", and "list keys" functions that operate on binary blobs. The backend can consider these binaries completely opaque data, or examine them to make decisions about how best to store them.
Four backends come pre-packaged with Riak:
- riak_fs_backend, which stores data directly to files in a nested directory structure on disk
- riak_ets_backend, which stores data in ETS tables (which makes it volatile storage, but great for debugging)
- riak_dets_backend, which stores data on-disk in DETS tables
- riak_osmos_backend, which stores data in Osmos tables
It is easy to create additional backends to suit application needs.
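Real backends are Erlang modules, but the shape of the interface can be sketched in Python. The class and method names here are illustrative only.

```python
class MemoryBackend:
    """Toy in-memory backend exposing the four operations a storage
    backend must provide; values are treated as opaque bytes."""

    def __init__(self):
        self._store = {}

    def get(self, key: bytes):
        return self._store.get(key)        # None if absent

    def put(self, key: bytes, value: bytes) -> None:
        self._store[key] = value

    def delete(self, key: bytes) -> None:
        self._store.pop(key, None)

    def list_keys(self):
        return list(self._store)
```

Anything that can honor this small contract — a filesystem, an embedded table store, a remote service — can serve as local storage for a node.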
building on the Web
Riak provides its primary programming interface over RESTful HTTP, in JSON encoding. This is enabled by embedding the Webmachine server, and has two major benefits:
- Ease of use for developers in any programming language
- Taking advantage of the Web's architecture for caching, validation and more
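As a rough illustration of what such an interface might look like from a client, the snippet below builds (but does not send) a JSON PUT request. The `/riak/<bucket>/<key>` path and port 8098 are assumptions for the example; consult Riak's own documentation for the actual endpoints.

```python
import json
from urllib.request import Request

# Hypothetical endpoint layout, used only to show the shape of a request.
BASE = "http://localhost:8098/riak"

def put_request(bucket: str, key: str, value: dict) -> Request:
    """Construct an HTTP PUT that would store `value` as JSON."""
    body = json.dumps(value).encode()
    return Request(f"{BASE}/{bucket}/{key}", data=body,
                   headers={"Content-Type": "application/json"},
                   method="PUT")
```

Because the interface is plain HTTP, any language with an HTTP client can talk to the store, and standard Web machinery (proxies, caches, conditional requests) applies without special client libraries.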