1. Justin Sheehy
  2. riak


riak / www / arch.html

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
	<meta name="author" content="Basho Technologies" />
	<meta name="description" content="riak - a decentralized key value store - basho technologies" />
	<meta name="keywords" content="riak nosql decentralized distributed key value store" />
    <meta http-equiv="content-type" content="text/html;charset=utf-8" />
	<link rel="stylesheet" href="css/style.css" type="text/css" />
	<title>Riak - System Architecture</title>
	<div id="content">
		<h1><span class="hr"></span><a href="/">riak</a></h1>
		<ul id="top">
			<li><a href="/">Home</a></li>
			<li><a href="http://bitbucket.org/justin/riak/">Source Code</a></li>
                        <li><a href="edoc/index.html">API Docs</a></li>
			<li><a href="faq.html">FAQ</a></li>
			<li><a href="contact.html">Contact</a></li>
		<div id="intro">
			<p>The system architecture of Riak</p>
		<div id="left">
			<h3>Simple at the Core</h3>
			<p>At its heart, Riak is a decentralized key/value store, strongly influenced by <a href="http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html">Amazon's Dynamo</a> and lessons learned from real-world application of the <a href="cap.html">CAP Theorem</a> to other distributed systems.  It supports high availability at low cost by allowing applications to tune their relative needs for durability, partition-tolerance, and other business constraints.</p>

A Riak cluster is generally run on a set of well-connected physical
hosts.  Each host in the cluster runs one Riak node.  Each Riak node
runs a set of virtual nodes, or "vnodes", that are each responsible
for storing a separate portion of the key space.

Nodes are not clones of each other, nor do they all participate in
fulfilling every request.  The extent to which data is replicated, and
when, and with what merge strategy and failure model, is configurable
at runtime and flexible to meet the needs of many different applications.

<h3>one ring to find them</h3>
Riak uses the technique of <a href="http://portal.acm.org/citation.cfm?id=258660">consistent hashing</a> to organize data storage.  Central to any Riak cluster is a 160-bit integer space which is divided into equally-sized partitions.  Each vnode is responsible for one of these partitions, and each document is stored in a set of partitions that can be determined statically depending on its key.  This allows a client node to determine the "owners" of a given piece of data locally, without having to ask any central authority.

In the default configuration, each physical node of a Riak cluster will
attempt to run roughly an equal 
number of vnodes.  In the general case, this means that each node in
the cluster is responsible for 1/(number of nodes) of the ring, or
(number of partitions)/(number of nodes) vnodes.  For example, if two
nodes define a 1024-partition cluster, then each node will run 512 vnodes.
By default, nodes claim their partitions at random intervals around the
ring, which usually provides a sufficiently even distribution.

<h3>coordination and gossip</h3>
When a value is being stored in (or retrieved from) the cluster,
any node may participate
as the coordinator for the request.  The coordinating node consults
the ring state to determine which vnode owns the partition in which
the value's key belongs, then sends the request to that vnode,
as well as the vnodes responsible for the next N-1 partitions in the
ring, where N is a bucket-configurable parameter that describes how
many copies of the value to store.  A put request may specify
that at least W (=&lt; N) of those vnodes reply with success, and that DW
(=&lt; W) reply with success only after durably storing the value.  The
request will only be considered successful to the client when both
W and DW have been satisfied by the nodes in question.
(A get request is similar except that it only has one such value, called R.)

The ring state is shared around the cluster by means of a gossip
protocol.  Whenever a node changes its claim on the ring, it
announces its change via this protocol.  Each node also periodically
sends its current view of the ring state to a randomly-selected
peer, in case any nodes missed previous updates.

<h3>causality and versioning</h3>
With any node able to drive any request, and not all nodes needing to
participate in each request, it is necessary to have a method for
keeping track of which version of a value is current.  This is where
<a href="http://portal.acm.org/citation.cfm?id=359563">vector clocks</a>
("vclocks") come in.

When a value is stored in Riak, it is tagged with a vclock,
establishing its initial version.  When a value is updated in Riak,
the client provides the vclock of the object being modified so that
this vclock can be extended to reflect the update.  Riak can compare
vclocks on different versions of the object and determine:

 <li> Whether one object is a direct descendant of the other. </li>
 <li> Whether the objects are direct descendants of a common parent. </li>
 <li> Whether the objects are unrelated in recent heritage. </li>

Using this knowledge, Riak can auto-repair out-of-sync data,
and in worse cases can provide a client with an opportunity to reconcile
divergent changesets in an application specific manner.

Riak attempts to move data toward a consistent state across nodes,
but it doesn't do so by comparing each and every object on each node.
Instead, nodes needing to possibly update many values will exchange a
<a href="http://portal.acm.org/citation.cfm?id=704751">merkle tree</a>,
which allows them to quickly decide which values need comparing.

<h3>pluggable data backends</h3>
Sharing data among nodes, on rings, etc. is all well and good, but at
some point, it has to actually be stored somewhere - like on disk!
Because Riak is relevant to a wide variety of applications, its
"backend" storage system is a pluggable one.

Each node may be configured with a different module for managing local
storage.  This module only needs to define "get", "put", "delete", and
"list keys" functions that operate on binary blobs.  The backend can
consider these binaries completely opaque data, or examine them to
make decisions about how best to store them.
Four backends come pre-packaged with Riak:

 <li> riak_fs_backend, which stores data directly to files in a nested
    directory structure on disk</li>
 <li> riak_ets_backend, which stores data in ETS tables (which makes it
    volatile storage, but great for debugging)</li>
 <li> riak_dets_backend, which stores data on-disk in DETS tables</li>
 <li> riak_osmos_backend, which stores data in
               <a href="http://code.google.com/p/osmos/">Osmos</a> tables</li>

It is easy to create additional backends to suit application needs.
<h3>building on the Web</h3>
Riak provides its primary programming interface over RESTful HTTP, in JSON encoding.  This is enabled by embedding the 
<a href="http://bitbucket.org/justin/webmachine">Webmachine</a> server, and has two major benefits:

<li>Ease of use for developers in any programming language</li>
<li>Taking advantage of the Web's architecture for caching, validation and more</li>

			<br />
		<div id="right">
        <img src="images/splash250.gif" alt="Riak" />
        <img src="images/halfblankbox.gif" alt="" />
        <img src="images/chash.gif" alt="consistent hashing" />
        <img src="images/halfblankbox.gif" alt="" />
        <img src="images/gossip4.gif" alt="gossip" />
        <img src="images/halfblankbox.gif" alt="" />
        <img src="images/vclock.gif" alt="vclocks" />
		<div id="footer">

<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-10051263-1");
} catch(err) {}</script>