Source

riak / www / mapreduce.html

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
	<meta name="author" content="Basho Technologies" />
	<meta name="description" content="riak - a decentralized key value store - basho technologies" />
	<meta name="keywords" content="riak nosql decentralized distributed key value store" />
    <meta http-equiv="content-type" content="text/html;charset=utf-8" />
	<link rel="stylesheet" href="css/style-1c.css" type="text/css" />
	<title>Riak - Map/Reduce brings computation to you data</title>
</head>
<body>
	<div id="content">
		<h1><span class="hr"></span><a href="/">riak</a></h1>
		<ul id="top">
			<li><a href="/">Home</a></li>
			<li><a href="http://bitbucket.org/justin/riak/">Source Code</a></li>
                        <li><a href="edoc/index.html">API Docs</a></li>
			<li><a href="faq.html">FAQ</a></li>
			<li><a href="contact.html">Contact</a></li>
		</ul>
		
		<div id="intro">
			<p>Map/Reduce in Riak</p>
		</div>
		<div id="left">
			
			<h3>The map/reduce programming model</h3>
<p>Riak provides a data processing implementation based on the
<a href="http://labs.google.com/papers/mapreduce.html">MapReduce</a>
model popularized by <a href="http://www.google.com/">Google</a> and
since adapted by <a href="http://hadoop.apache.org/">Hadoop</a> as
well as others.  Like every known map/reduce implementation, ours is shaped to best take advantage of our own approach to distributed data organization.
</p>

<h3>bring the computation to the data</h3>
<p>
One of the main reasons to use a map/reduce style of programming is to exploit data locality.  For data processing in a networked environment, it is generally understood that making the computation run where the data is already located will often perform much better (and be much more practical to manage) than moving all the data to the systems that will perform the computation.  Since Riak is <a href="arch.html">designed</a> around algorithms to manage and find data efficiently, the shape of our map/reduce implementation follows naturally from the rest of the system.

</p>

<p>
A map/reduce query or "flow" is simply a sequence of map and reduce phases, each feeding the next, and together providing an aggregate result.
</p>

<p>
A "map phase" is essentially just a function ("F") and an argument ("A") that is defined as part of a series of phases making up a given map/reduce query.  The phase will receive a stream of inputs ("I"), each of which consists of a key identifying a Riak object and an optional additional data element to accompany that object.  As each input is received by the phase, a node that already contains the document ("D") corresponding to "I" will run <code>F(D,A)</code> and stream along the results to the next phase.  The point here is that your function can be executed over many data items, but instead of collecting all of the data items in one place it will execute wherever the data is already placed.
</p>

<p>
A "reduce phase" is conceptually simpler.  As it receives inputs from the preceding phase, it collates those inputs along with the ones already received and continually "reduces" the input set until it receives notification that it will receive no further data, at which point the entire reduced set will be streamed to the next phase. Note that this makes a reduce phase a concurrency barrier, as opposed to map phases which can be processing in parallel.  In order for this process to make any sense, a reduce phase's function must be commutative, associative, and idempotent.  Good examples are <code>sum</code> and <code>set-union</code>.  As Riak's core focus is on decentralized data storage and not on compute farming, reduce phases are generally run on a single node -- there is no data-locality gain to be had in reduce.
</p>

<h3>A perfect fit for the Web</h3>
<p>
Since the original published use of map/reduce was for processing Web search indices, it should come as no surprise that this model is a great fit for the general problem of processing linked data.  That is, if your data set consists of many mostly-independent documents, loosely coupled by way of links inside the documents, then map/reduce is very likely to be a good approach to querying that data collection.  To make this even easier, we have added a superficial third type of phase to the model, the "link" phase.  In fact, link phases are just map phases, parameterized ahead of time such that the map function will be a function that knows enough about your document types to extract links matching a given pattern or tag.  While such map phases can of course be written manually, we saw this operation so frequently that we made a shorthand for it -- and now those phases are by far the most common use of our map/reduce engine.
</p>

<h3>not just for bulk processing</h3>
<p>
While the most popular map/reduce systems out there are generally used to apply a map/reduce flow over an entire body of data, Riak provides a more focused approach.  A query is instead "seeded", or provided with the explicit list of inputs that will be used by the first phase in the query.  This approach, combined with the link-following convention, allows for an entirely new set of uses for the map/reduce programming paradigm.
</p>

<h3>the gory details</h3>
<p>
A map/reduce query is initiated with two arguments.  The first is simply the list of values (usually bucket/key pairs as the first phase is almost always a map phase) that will be sent to the first phase in the flow.  The second argument is a list of terms declaring the flow of phases for this query.
</p>

<p>
A map phase is declared as:
</p>

<pre>
{map, FunTerm, Arg, Accumulate}
</pre>

<p>
<code>FunTerm</code> is a reference to the function that will compute the map of
    each value.  A function referenced by a FunTerm must be arity-3,
    accepting the arguments <code>(Value, Data, Arg)</code> as follows:
</p>
<p>
<code>Value</code> is the value found at a key.  This will either be a Riak object structure (accessed via the riak_object module) or else the tuple <code>{error, notfound}</code>.
</p>
<p>
<code>Data</code> is an optional piece of data attached to the bucket/key pair that initiated this execution.  If instead of <code>{Bucket, Key}</code>, <code>{{Bucket, Key}, Data}</code> is passed as an input to a map step, that <code>Data</code> will be passed to the map function in this argument.  <code>Data</code> will be the atom <code>'undefined'</code> if the former form is used.
</p>
<p>
<code>Arg</code> is the argument by the same name that was passed to the overall map phase declaration.
</p>
<p>
The <code>FunTerm</code> may take one of two forms: Either <code>{modfun, Module, Function}</code> where <code>Module</code> and <code>Function</code> are atoms that name an Erlang function in a specific module, or <code>{qfun, Function}</code> where <code>Function</code> is a callable fun term.
</p>
<p>
<code>Accumulate</code> should be set to <code>true</code> for all phases whose output is desired in the final result of the map/reduce execution, and <code>false</code> for all others.  The most common pattern is to set this to <code>true</code> in only the very last phase, but some interesting queries can be produced by setting it earlier as well.
</p>
<p>
Note that a map function must return a <strong>list</strong> of values, each of which will be an input to the next phase.
</p>
<p>
A reduce phase is declared as:
</p>

<pre>
{reduce, FunTerm, Arg, Accumulate}
</pre>

<p>
Where the terms are essentially the same as for map, with the exception that the function referenced by <code>FunTerm</code> must be arity 2.  It takes <code>(ValueList, Arg)</code> with <code>Arg</code> playing the same role as before and <code>ValueList</code> being a round of (possibly already processed) inputs to reduce.
</p>
<p>
Much like with map, a reduce function must return a list of values.  This list will be combined with the next input list the next time the reduce function is called, which is why the reduce function must be commutative, associative, and idempotent.
</p>
<p>
The third and final type of phase is a link phase, declared as <code>{link, Bucket, Tag, Accumulate}</code>.  For this kind of phase to work, there must already be a <code>linkfun</code> property set on <code>Bucket</code> in the cluster, which must return a <code>FunTerm</code>.  This will be translated into a map phase with that term, looking like: <code>{map, FunTerm, {Bucket,Tag}, Accumulate}</code>.
</p>
			
			
		</div>

		<div id="footer">

		</div>
	</div>
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-10051263-1");
pageTracker._trackPageview();
} catch(err) {}</script>
</body>
</html>
Tip: Filter by directory path e.g. /media app.js to search for public/media/app.js.
Tip: Use camelCasing e.g. ProjME to search for ProjectModifiedEvent.java.
Tip: Filter by extension type e.g. /repo .js to search for all .js files in the /repo directory.
Tip: Separate your search with spaces e.g. /ssh pom.xml to search for src/ssh/pom.xml.
Tip: Use ↑ and ↓ arrow keys to navigate and return to view the file.
Tip: You can also navigate files with Ctrl+j (next) and Ctrl+k (previous) and view the file with Ctrl+o.
Tip: You can also navigate files with Alt+j (next) and Alt+k (previous) and view the file with Alt+o.