Bob Ippolito committed b138414

Backed out changeset d6798e2fcd1c

  • Parent commits d6798e2

Files changed (12)

File .hgignore

-testdata.db
 
 slides.html: slides.txt includes/*.html
 	rst2s5.py --theme-url ui/mochikit slides.txt $@
-	./bin/fixup_definitions.py $@
-
-testdata.db: bin/make_testdata.py
-	./bin/make_testdata.py $@
 
 .PHONY: all clean

File bin/fixup_definitions.py

-#!/usr/bin/env python
-import sys
-
-def main():
-    fn = sys.argv[1]
-    f = open(fn, 'rb')
-    txt = f.read()
-    f.close()
-    txt = txt.replace(
-        '<th class="field-name" colspan="2">',
-        '<th class="field-name">').replace(
-        '</tr>\n<tr><td>&nbsp;</td><td class="field-body"></td>',
-        '<td class="field-body"></td>')
-    f = open(fn, 'wb')
-    f.write(txt)
-    f.close()
-
-
-if __name__ == '__main__':
-    main()
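
Note on the removed script: it reads the file in binary mode and then calls replace() with str arguments, which works on Python 2 but raises TypeError on Python 3. The sketch below is one possible Python 3 equivalent, not part of this commit. The UTF-8 encoding is an assumption taken from the charset declared in the generated HTML, and the names REPLACEMENTS and fixup are illustrative only.

```python
#!/usr/bin/env python3
"""Python 3 sketch of the removed fixup step (assumes UTF-8 HTML)."""
import sys
from pathlib import Path

# The same two rewrites the removed script performed:
# 1. drop colspan="2" from field-name header cells;
# 2. fold the filler row that follows such a header back into the header row.
REPLACEMENTS = (
    ('<th class="field-name" colspan="2">',
     '<th class="field-name">'),
    ('</tr>\n<tr><td>&nbsp;</td><td class="field-body"></td>',
     '<td class="field-body"></td>'),
)


def fixup(text):
    """Apply both rewrites, in order, and return the new text."""
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return text


def main(path):
    """Rewrite the file at `path` in place."""
    target = Path(path)
    target.write_text(fixup(target.read_text(encoding='utf-8')),
                      encoding='utf-8')


# Only touch a file when one is named on the command line.
if __name__ == '__main__' and len(sys.argv) > 1:
    main(sys.argv[1])
```

Applied to one of the colspan="2" header rows that this backout restores in the slides, fixup() yields the single-row form shown on the corresponding removed line.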

File bin/make_testdata.py

-#!/usr/bin/env python
-import sys
-import random
-import sqlite3
-
-def main():
-    db = sqlite3.connect(sys.argv[1])
-    cur = db.cursor()
-    COLS = 'timestamp', 'user_id', 'bucket', 'dollars'
-    cur.execute('CREATE TABLE testdata(' +
-                ','.join(COLS) + ')')
-    EPOCH = 1262322000
-    for i in xrange(1000000):
-        bucket = random.randrange(2)
-        row = [
-            EPOCH + random.randrange(14 * 86400),
-            (0,10000)[bucket] + random.randrange(10000),
-            ('control','test')[bucket],
-            random.randrange(10 * (1 + bucket)),
-        ]
-        cur.execute('INSERT INTO testdata VALUES(?,?,?,?)',
-                    row)
-    db.commit()
-
-if __name__ == '__main__':
-    main()
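
Note on the removed script: it depends on xrange, so it runs only on Python 2, and it issues one execute() call per row. The sketch below is one possible Python 3 equivalent, not part of this commit. It keeps the same columns, epoch and value ranges. The n and seed parameters and the names make_rows and build are assumptions added so that a small, repeatable data set can be generated.

```python
#!/usr/bin/env python3
"""Python 3 sketch of the removed test-data generator."""
import random
import sqlite3
import sys

COLS = ('timestamp', 'user_id', 'bucket', 'dollars')
EPOCH = 1262322000  # same starting timestamp as the removed script


def make_rows(n, rng):
    """Yield n random rows spread over a 14-day window after EPOCH."""
    for _ in range(n):
        bucket = rng.randrange(2)  # 0 = control, 1 = test
        yield (
            EPOCH + rng.randrange(14 * 86400),
            # control users get ids 0..9999, test users 10000..19999
            (0, 10000)[bucket] + rng.randrange(10000),
            ('control', 'test')[bucket],
            # test rows draw dollars from a range twice as wide
            rng.randrange(10 * (1 + bucket)),
        )


def build(db, n=1000000, seed=None):
    """Create the testdata table in `db` and fill it with n rows."""
    rng = random.Random(seed)  # seed is a hypothetical knob for repeatability
    db.execute('CREATE TABLE testdata(' + ','.join(COLS) + ')')
    db.executemany('INSERT INTO testdata VALUES(?,?,?,?)', make_rows(n, rng))
    db.commit()


# Only write a database when a filename is given on the command line.
if __name__ == '__main__' and len(sys.argv) > 1:
    conn = sqlite3.connect(sys.argv[1])
    build(conn)
    conn.close()
```

Passing the generator to executemany() avoids building a million-element list in memory. The default n matches the row count of the removed script.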

File images/649px-Bloom_filter.png

Added
New image

File images/i-has-minions.jpg

Added
New image

File images/mochi_ad_sales.jpg

Added
New image

File images/sc4_pub_ss_cassandra003_copy.jpg

Added
New image

File images/we-await-ur-instrucsions.jpg

Added
New image
 <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
 <meta name="generator" content="Docutils 0.5: http://docutils.sourceforge.net/" />
 <meta name="version" content="S5 1.1" />
-<title>Analysis: The Other Kind of Testing</title>
-<meta name="author" content="Bob Ippolito (&#64;etrepum)" />
-<meta name="date" content="February 2010" />
+<title>Drop ACID and think about data</title>
+<meta name="author" content="Bob Ippolito" />
+<meta name="date" content="March 2009" />
 <style type="text/css">
 
 /*
 
 </div>
 <div id="footer">
-<h1>Analysis: The Other Kind of Testing</h1>
+<h1>Drop ACID and think about data</h1>
 
 </div>
 </div>
 <div class="presentation">
 <div class="slide" id="slide0">
-<h1 class="title">Analysis: The Other Kind of Testing</h1>
+<h1 class="title">Drop ACID and think about data</h1>
 <table class="docinfo" frame="void" rules="none">
 <col class="docinfo-name" />
 <col class="docinfo-content" />
 <tbody valign="top">
 <tr><th class="docinfo-name">Author:</th>
-<td>Bob Ippolito (&#64;etrepum)</td></tr>
+<td>Bob Ippolito</td></tr>
 <tr><th class="docinfo-name">Date:</th>
-<td>February 2010</td></tr>
-<tr class="field"><th class="docinfo-name">Venue:</th><td class="field-body">PyCon 2010</td>
+<td>March 2009</td></tr>
+<tr class="field"><th class="docinfo-name">Venue:</th><td class="field-body">PyCon 2009</td>
 </tr>
 </tbody>
 </table>
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name">Mochi Media:</th><td class="field-body"></td>
+<tr class="field"><th class="field-name" colspan="2">Startup with lots of data:</th></tr>
+<tr><td>&nbsp;</td><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
 <ul class="simple">
-<li>CTO/Cofounder of Mochi Media</li>
-<li>Platform for Flash games</li>
-<li>Virtual currency and game discovery for gamers</li>
+<li>Cofounded Mochi Media in 2005</li>
+<li>MochiBot analytics platform (for Flash)</li>
+<li>MochiAds ad serving platform (for Flash games)</li>
+<li>Other cool services for game developers</li>
 </ul>
 </div>
-<div class="slide" id="analysis">
-<h1>Analysis</h1>
+<div class="slide" id="mochi-ad-sales">
+<h1>Mochi Ad Sales</h1>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name">Definition:</th><td class="field-body"></td>
+<tr class="field"><th class="field-name">Hard Sell:</th><td class="field-body"></td>
+</tr>
+</tbody>
+</table>
+<img alt="images/mochi_ad_sales.jpg" src="images/mochi_ad_sales.jpg" />
+</div>
+<div class="slide" id="what-s-acid">
+<h1>What's ACID?</h1>
+<table class="docutils field-list" frame="void" rules="none">
+<col class="field-name" />
+<col class="field-body" />
+<tbody valign="top">
+<tr class="field"><th class="field-name" colspan="2">A promise ring your DBMS wears:</th></tr>
+<tr><td>&nbsp;</td><td class="field-body"></td>
+</tr>
+<tr class="field"><th class="field-name">Atomicity:</th><td class="field-body">all or nothing</td>
+</tr>
+<tr class="field"><th class="field-name">Consistency:</th><td class="field-body">no explosions</td>
+</tr>
+<tr class="field"><th class="field-name">Isolation:</th><td class="field-body">no fights</td>
+</tr>
+<tr class="field"><th class="field-name">Durability:</th><td class="field-body">no lying</td>
+</tr>
+</tbody>
+</table>
+</div>
+<div class="slide" id="acid-trips">
+<h1>ACID Trips</h1>
+<table class="docutils field-list" frame="void" rules="none">
+<col class="field-name" />
+<col class="field-body" />
+<tbody valign="top">
+<tr class="field"><th class="field-name" colspan="2">Scalability and reliability:</th></tr>
+<tr><td>&nbsp;</td><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
 <ul class="simple">
-<li>Separation of a whole into its component parts</li>
-<li>An examination of a complex, its elements, and their relations</li>
+<li>Downtime is unacceptable</li>
+<li>Reliable is &gt;= 2 nodes</li>
+<li>Scalable is ... more</li>
+<li>Networks make it hard</li>
+<li>Networks make it hard</li>
+<li>Networks make it hard</li>
 </ul>
 </div>
-<div class="slide" id="why-do-i-care">
-<h1>Why do I care?</h1>
+<div class="slide" id="what-can-i-have">
+<h1>What can I have?</h1>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name">SCIENCE:</th><td class="field-body"></td>
+<tr class="field"><th class="field-name" colspan="2">CAP theorem says pick two:</th></tr>
+<tr><td>&nbsp;</td><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
 <ul class="simple">
-<li>Be creative when exploring new ideas</li>
-<li>Not when measuring them</li>
+<li>Consistency</li>
+<li>Availability</li>
+<li>Partition tolerance</li>
 </ul>
 </div>
-<div class="slide" id="who-does-this">
-<h1>Who does this?</h1>
+<div class="slide" id="turn-up-the-base">
+<h1>Turn up the BASE</h1>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name">A-Z of online:</th><td class="field-body"></td>
+<tr class="field"><th class="field-name" colspan="2">Write smarter applications:</th></tr>
+<tr><td>&nbsp;</td><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
 <ul class="simple">
-<li>Amazon</li>
-<li>Facebook</li>
-<li>Google</li>
-<li>Microsoft</li>
-<li>Zynga</li>
+<li>Basically Available</li>
+<li>Soft state</li>
+<li>Eventually consistent</li>
 </ul>
 </div>
-<div class="slide" id="how-it-works">
-<h1>How it works</h1>
+<div class="slide" id="base-jumping">
+<h1>BASE jumping</h1>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name">Scientific Method:</th><td class="field-body"></td>
+<tr class="field"><th class="field-name" colspan="2">Everyone else is doing it:</th></tr>
+<tr><td>&nbsp;</td><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
 <ul class="simple">
-<li>Create a hypothesis</li>
-<li>Design an experiment</li>
-<li>Run experiment</li>
-<li>Analyze the results</li>
+<li>Google</li>
+<li>Amazon</li>
+<li>eBay</li>
+<li>Yahoo!</li>
+<li>Facebook</li>
+<li>...</li>
 </ul>
 </div>
-<div class="slide" id="hypothesis">
-<h1>Hypothesis</h1>
+<div class="slide" id="bigtable">
+<h1>BigTable</h1>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name">Your Great Idea:</th><td class="field-body"></td>
+<tr class="field"><th class="field-name">Google:</th><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
 <ul class="simple">
-<li>Divergent from what you have</li>
-<li>Simple</li>
-<li>Testable</li>
+<li>Paxos (Chubby)</li>
+<li>Single-master</li>
+<li>Distributed tablets via GFS</li>
+<li>Row/Column db hybrid</li>
+<li>Compression (BMDiff, Zippy)</li>
+<li>Versioned (Row, Column, Timestamp)</li>
+<li>Bloom filters</li>
 </ul>
 </div>
-<div class="slide" id="testable">
-<h1>Testable</h1>
+<div class="slide" id="bigtable-pros">
+<h1>BigTable Pros</h1>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name">Fitness Metric:</th><td class="field-body"></td>
+<tr class="field"><th class="field-name">Pros:</th><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
 <ul class="simple">
-<li>Quantifiable</li>
-<li>Business relevant</li>
-<li>Key Performance Indicator</li>
-<li>Overall Evaluation Criteria</li>
+<li>Compression = Awesome</li>
+<li>Clients are probably simple</li>
+<li>Integrates with map/reduce</li>
 </ul>
 </div>
-<div class="slide" id="fitness-metric-examples">
-<h1>Fitness Metric Examples</h1>
+<div class="slide" id="bigtable-cons">
+<h1>BigTable Cons</h1>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name">Your Metric May Vary:</th><td class="field-body"></td>
+<tr class="field"><th class="field-name">Cons:</th><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
 <ul class="simple">
-<li>Click-through rate (CTR)</li>
-<li>Cost Per Acquisition (CPA)</li>
-<li>Avg $ per user (ARPU)</li>
-<li>% of gamers who beat the last level</li>
+<li>Proprietary to Google</li>
+<li>Single-master</li>
 </ul>
 </div>
-<div class="slide" id="experiment">
-<h1>Experiment</h1>
+<div class="slide" id="bigtable-diagram">
+<h1>BigTable Diagram</h1>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name">&quot;To Try Out&quot;:</th><td class="field-body"></td>
+<tr class="field"><th class="field-name">Single-master:</th><td class="field-body"></td>
+</tr>
+</tbody>
+</table>
+<img alt="images/i-has-minions.jpg" src="images/i-has-minions.jpg" />
+</div>
+<div class="slide" id="dynamo">
+<h1>Dynamo</h1>
+<table class="docutils field-list" frame="void" rules="none">
+<col class="field-name" />
+<col class="field-body" />
+<tbody valign="top">
+<tr class="field"><th class="field-name">Amazon:</th><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
 <ul class="simple">
-<li>Method of investigation</li>
-<li>Many ways to do this</li>
-<li>Easiest to follow a template</li>
+<li>Key/Value store</li>
+<li>Consistent hashing</li>
+<li>Vector clocks</li>
+<li>Read repair</li>
 </ul>
 </div>
-<div class="slide" id="split-testing">
-<h1>Split Testing</h1>
+<div class="slide" id="dynamo-pros">
+<h1>Dynamo Pros</h1>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name">A/B Testing:</th><td class="field-body"></td>
+<tr class="field"><th class="field-name">Pros:</th><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
 <ul class="simple">
-<li>50% control group gets no changes</li>
-<li>50% test group sees changes</li>
-<li>Group selection method is important</li>
+<li>No master</li>
+<li>Highly available for write</li>
+<li>Knobs to make it fast to read</li>
+<li>&quot;Simple&quot; (lots of half-baked clones!)</li>
 </ul>
 </div>
-<div class="slide" id="group-selection">
-<h1>Group Selection</h1>
+<div class="slide" id="dynamo-cons">
+<h1>Dynamo Cons</h1>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name">random.choice:</th><td class="field-body"></td>
+<tr class="field"><th class="field-name">Cons:</th><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
 <ul class="simple">
-<li>Group should be randomly selected</li>
-<li>Pin user to group during experiment</li>
-<li>Easiest with logins, but can be done without (cookies, etc.)</li>
+<li>Proprietary to Amazon</li>
+<li>Clients need to be smart</li>
+<li>No compression</li>
+<li>Not suitable for column-like workloads</li>
+<li>Just a Key/Value store</li>
 </ul>
 </div>
-<div class="slide" id="selection-pseudocode">
-<h1>Selection Pseudocode</h1>
-<pre class="literal-block">
-from random import choice
-
-def view_page():
-    if get_bucket(user) == 'test':
-        test_group_page()
-    else:
-        control_group_page()
-
-def get_bucket(user):
-    if user.bucket is None:
-        user.bucket = choice(('test', 'control'))
-        user.save()
-    return user.bucket
-</pre>
-</div>
-<div class="slide" id="logging">
-<h1>Logging</h1>
+<div class="slide" id="dynamo-diagram">
+<h1>Dynamo Diagram</h1>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name">Log Everything:</th><td class="field-body"></td>
+<tr class="field"><th class="field-name">Smart client:</th><td class="field-body"></td>
+</tr>
+</tbody>
+</table>
+<img alt="images/we-await-ur-instrucsions.jpg" src="images/we-await-ur-instrucsions.jpg" />
+</div>
+<div class="slide" id="cassandra">
+<h1>Cassandra</h1>
+<table class="docutils field-list" frame="void" rules="none">
+<col class="field-name" />
+<col class="field-body" />
+<tbody valign="top">
+<tr class="field"><th class="field-name" colspan="2">Facebook -&gt; Apache:</th></tr>
+<tr><td>&nbsp;</td><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
 <ul class="simple">
-<li>Log everything you can</li>
-<li>Recording the bucket is very important</li>
-<li>Flat files are easy</li>
-<li>JSON is convenient (can be newline delimited)</li>
+<li>Open source!</li>
+<li>No master like Dynamo</li>
+<li>Storage model more like BigTable</li>
 </ul>
 </div>
-<div class="slide" id="logging-pseudocode">
-<h1>Logging Pseudocode</h1>
+<div class="slide" id="cassandra-pros">
+<h1>Cassandra Pros</h1>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name">Just Kidding:</th><td class="field-body"></td>
+<tr class="field"><th class="field-name">Pros:</th><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
 <ul class="simple">
-<li>It's not actually easy</li>
-<li>Multiple threads/processes/machines -&gt; :(</li>
-<li>Options include database, syslog or message queue</li>
-<li>Gets harder at scale (e.g. Facebook's Scribe)</li>
+<li>OPEN SOURCE</li>
+<li>Incrementally scalable</li>
+<li>Minimal administration</li>
 </ul>
 </div>
-<div class="slide" id="log-processing">
-<h1>Log Processing</h1>
+<div class="slide" id="cassandra-cons">
+<h1>Cassandra Cons</h1>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name">ETL:</th><td class="field-body"></td>
+<tr class="field"><th class="field-name">Cons:</th><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
 <ul class="simple">
-<li>Extract the log data</li>
-<li>Transform it into a useful format</li>
-<li>Load the data for analysis</li>
-<li>Typically done in batch, e.g. daily or hourly</li>
+<li>Not polished</li>
+<li>No compression yet</li>
 </ul>
 </div>
-<div class="slide" id="log-processing-example">
-<h1>Log Processing Example</h1>
-<pre class="literal-block">
-db = sqlite3.connect('testdata.db')
-cur = db.cursor()
-COLS = 'timestamp', 'user_id', 'bucket', 'dollars'
-cur.execute('CREATE TABLE testdata(' +
-            ','.join(COLS) + ')')
-
-for line in log_lines:
-    dct = json.loads(line)
-    row = [dct[col] for col in COLS]
-    cur.execute(
-        'INSERT INTO testdata VALUES(?,?,?,?)',
-        row)
-db.commit()
-</pre>
-</div>
-<div class="slide" id="id1">
-<h1>Analysis</h1>
+<div class="slide" id="cassandra-diagram">
+<h1>Cassandra Diagram</h1>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name">Math time:</th><td class="field-body"></td>
+<tr class="field"><th class="field-name">Soul Calibur:</th><td class="field-body"></td>
+</tr>
+</tbody>
+</table>
+<img alt="images/sc4_pub_ss_cassandra003_copy.jpg" src="images/sc4_pub_ss_cassandra003_copy.jpg" />
+</div>
+<div class="slide" id="distributed-musings">
+<h1>Distributed Musings</h1>
+<table class="docutils field-list" frame="void" rules="none">
+<col class="field-name" />
+<col class="field-body" />
+<tbody valign="top">
+<tr class="field"><th class="field-name">New Hotness:</th><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
 <ul class="simple">
-<li>Plot data over time to see difference visually</li>
-<li>Make sure to compare sample size</li>
-<li>Check your work</li>
+<li>Distributed databases are the new web framework</li>
+<li>... except none of them are awesome yet</li>
+<li>I don't think we need another half-baked Dynamo clone</li>
 </ul>
 </div>
-<div class="slide" id="overall-query">
-<h1>Overall Query</h1>
+<div class="slide" id="key-value-stores">
+<h1>Key-Value Stores</h1>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name">Overall Performance:</th><td class="field-body"></td>
+<tr class="field"><th class="field-name" colspan="2">Simple and Fast:</th></tr>
+<tr><td>&nbsp;</td><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
-<pre class="literal-block">
-SELECT
-    bucket,
-    COUNT(DISTINCT user_id) AS sample_size,
-    SUM(dollars)*1.0/COUNT(DISTINCT user_id) AS arpu
-FROM testdata
-GROUP BY bucket;
-</pre>
+<ul class="simple">
+<li>Similar to a Python dict</li>
+<li>Keys usually bytes, probably limited</li>
+<li>Values usually bytes, often have fewer limits</li>
+<li>Extremely fast, simple</li>
+</ul>
 </div>
-<div class="slide" id="daily-query">
-<h1>Daily Query</h1>
+<div class="slide" id="memcached">
+<h1>Memcached</h1>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name">Plot This:</th><td class="field-body"></td>
+<tr class="field"><th class="field-name" colspan="2">Key/Value store as cache:</th></tr>
+<tr><td>&nbsp;</td><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
-<pre class="literal-block">
-SELECT
-    date(timestamp, 'unixepoch') as day,
-    bucket,
-    COUNT(DISTINCT user_id) AS sample_size,
-    SUM(dollars)*1.0/COUNT(DISTINCT user_id) AS arpu
-FROM testdata
-GROUP BY day, bucket ORDER BY day, bucket;
-</pre>
+<ul class="simple">
+<li>No persistence</li>
+<li>RAM only</li>
+<li>Throws data away (on purpose)</li>
+<li>Lightning fast</li>
+<li>&quot;Everyone&quot; uses it</li>
+</ul>
 </div>
-<div class="slide" id="sanity-check-query">
-<h1>Sanity Check Query</h1>
+<div class="slide" id="caching-immutable-data">
+<h1>Caching Immutable Data</h1>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name">No Results is Good:</th><td class="field-body"></td>
+<tr class="field"><th class="field-name" colspan="2">If only data never changed:</th></tr>
+<tr><td>&nbsp;</td><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
-<pre class="literal-block">
-SELECT user_id
-FROM testdata
-GROUP BY user_id HAVING COUNT(DISTINCT bucket)&gt;1;
-</pre>
+<ul class="simple">
+<li>Immutable is easy, do that</li>
+</ul>
 </div>
-<div class="slide" id="simpson-s-paradox">
-<h1>Simpson's Paradox</h1>
+<div class="slide" id="caching-mutable-data">
+<h1>Caching Mutable Data</h1>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name">When Good is Bad:</th><td class="field-body"></td>
+<tr class="field"><th class="field-name" colspan="2">Invalidation sucks:</th></tr>
+<tr><td>&nbsp;</td><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
 <ul class="simple">
-<li>Hidden variables can confuse results</li>
-<li>Sample size is important</li>
-<li>Random selection and 50/50 split helps avoid this issue</li>
+<li>Mutable is hard</li>
+<li>Failed transactions?</li>
+<li>Concurrent writers?</li>
+<li>Dependent cache keys?</li>
+<li>You will get it wrong and it will be hard to debug</li>
 </ul>
 </div>
-<div class="slide" id="paradox-example">
-<h1>Paradox Example</h1>
+<div class="slide" id="tokyo-cabinet-tyrant">
+<h1>Tokyo Cabinet/Tyrant</h1>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name">Grad School Acceptance:</th><td class="field-body"></td>
+<tr class="field"><th class="field-name" colspan="2">Not your mom's BerkeleyDB:</th></tr>
+<tr><td>&nbsp;</td><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
-<table border="1" class="docutils">
-<colgroup>
-<col width="21%" />
-<col width="40%" />
-<col width="38%" />
-</colgroup>
-<tbody valign="top">
-<tr><td>&nbsp;</td>
-<td>Men</td>
-<td>Women</td>
-</tr>
-<tr><td>Arts</td>
-<td>3/4 (75%)</td>
-<td>1/1 <strong>(100%)</strong></td>
-</tr>
-<tr><td>Science</td>
-<td>0/1 (0%)</td>
-<td>1/4 <strong>(25%)</strong></td>
-</tr>
-<tr><td>Totals</td>
-<td>3/5 <strong>(60%)</strong></td>
-<td>2/5 (40%)</td>
-</tr>
-</tbody>
-</table>
+<ul class="simple">
+<li>Disk persistent</li>
+<li>Very performant</li>
+<li>Actively developed</li>
+<li>Similar replication strategy to MySQL</li>
+</ul>
 </div>
-<div class="slide" id="ramp-up">
-<h1>Ramp-up</h1>
+<div class="slide" id="redis">
+<h1>Redis</h1>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name">Start Slow:</th><td class="field-body"></td>
+<tr class="field"><th class="field-name">Still very new:</th><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
 <ul class="simple">
-<li>New features are scary, might be broken</li>
-<li>Run a staged experiment, e.g. 1% -&gt; 5% -&gt; 20% -&gt; 50%</li>
-<li>If it is obviously broken, abort!</li>
+<li>Not just a Key/Value store</li>
+<li>Matching on key spaces</li>
+<li>Values can be bytes, lists or sets</li>
+<li>Requires full store in RAM</li>
+<li>Might be a nice cache server?</li>
 </ul>
 </div>
-<div class="slide" id="ramp-up-implementation">
-<h1>Ramp-up Implementation</h1>
+<div class="slide" id="document-databases">
+<h1>Document Databases</h1>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name">Buckets For Your Buckets:</th><td class="field-body"></td>
+<tr class="field"><th class="field-name">Schema-free:</th><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
 <ul class="simple">
-<li>Segment users randomly into range(100) buckets</li>
-<li>During 1% period include range(1) in test group</li>
-<li>During 5% period include range(5) in test group</li>
-<li>...</li>
+<li>Very easy to use</li>
+<li>Document Versioning</li>
+<li>Great for storing documents</li>
 </ul>
 </div>
-<div class="slide" id="ramp-up-analysis">
-<h1>Ramp-up Analysis</h1>
+<div class="slide" id="couchdb">
+<h1>CouchDB</h1>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name">More Math:</th><td class="field-body"></td>
+<tr class="field"><th class="field-name" colspan="2">Document DB Poster Child:</th></tr>
+<tr><td>&nbsp;</td><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
 <ul class="simple">
-<li>Users will cross from control to test group during experiment</li>
-<li>Overlap is still bad</li>
-<li>Easiest to only look at single test period</li>
-<li>Double-check queries to make sure there is no overlap</li>
+<li>Apache project</li>
+<li>Asynchronous replication</li>
+<li>JSON based</li>
+<li>Views materialized on demand (not indexes)</li>
+<li>Neat admin UI</li>
 </ul>
 </div>
-<div class="slide" id="parallel-tests">
-<h1>Parallel Tests</h1>
+<div class="slide" id="mongodb">
+<h1>MongoDB</h1>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name">Even More Math:</th><td class="field-body"></td>
+<tr class="field"><th class="field-name">C++'s revenge:</th><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
 <ul class="simple">
-<li>Multivariate Testing</li>
-<li>Test multiple changes in parallel</li>
-<li>Changes can interfere, reinforce, or have no effect on each other</li>
-<li>Beyond scope of this talk</li>
+<li>Fast</li>
+<li>JSON and BSON (binary JSON-ish)</li>
+<li>Asynchronous replication with auto-sharding &quot;soon&quot;</li>
+<li>Index support</li>
+<li>Nested documents</li>
+<li>Advanced queries</li>
 </ul>
 </div>
-<div class="slide" id="group-selection-pitfalls">
-<h1>Group Selection Pitfalls</h1>
+<div class="slide" id="column-databases">
+<h1>Column Databases</h1>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name">DANGER!:</th><td class="field-body"></td>
+<tr class="field"><th class="field-name" colspan="2">Data Warehousing:</th></tr>
+<tr><td>&nbsp;</td><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
 <ul class="simple">
-<li>Each experiment should have independent group selection</li>
-<li>If you re-use buckets across experiments your data is invalid</li>
-<li>Important even if not running parallel tests!</li>
+<li>Sequential reads are awesome</li>
+<li>Columns compress better than rows</li>
+<li>Doesn't waste I/O on uninteresting columns</li>
 </ul>
 </div>
-<div class="slide" id="confidence">
-<h1>Confidence</h1>
+<div class="slide" id="monetdb">
+<h1>MonetDB</h1>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name">More Data:</th><td class="field-body"></td>
+<tr class="field"><th class="field-name" colspan="2">Research project:</th></tr>
+<tr><td>&nbsp;</td><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
 <ul class="simple">
-<li>Confidence goes up with lots of samples</li>
-<li>Chi-squared or Student's t-test good candidates</li>
-<li>Not my expertise, find a statistician :)</li>
+<li>Tried really hard to get it to work</li>
+<li>Crashes a lot and corrupts your data</li>
+<li>Do not waste your time</li>
 </ul>
 </div>
-<div class="slide" id="free-tools">
-<h1>Free Tools</h1>
-<dl class="docutils">
-<dt>Django Lean</dt>
-<dd><a class="reference external" href="http://bitbucket.org/akoha/django-lean/">http://bitbucket.org/akoha/django-lean/</a></dd>
-<dt>Google Website Optimizer</dt>
-<dd><a class="reference external" href="http://services.google.com/websiteoptimizer/">http://services.google.com/websiteoptimizer/</a></dd>
-</dl>
+<div class="slide" id="luciddb">
+<h1>LucidDB</h1>
+<table class="docutils field-list" frame="void" rules="none">
+<col class="field-name" />
+<col class="field-body" />
+<tbody valign="top">
+<tr class="field"><th class="field-name" colspan="2">Sounds interesting:</th></tr>
+<tr><td>&nbsp;</td><td class="field-body"></td>
+</tr>
+</tbody>
+</table>
+<ul class="simple">
+<li>Java/C++ open source data warehouse</li>
+<li>No clustering</li>
+<li>No experience yet</li>
+</ul>
 </div>
-<div class="slide" id="more-info">
-<h1>More Info</h1>
-<dl class="docutils">
-<dt>Microsoft's Experimentation Platform</dt>
-<dd><a class="reference external" href="http://exp-platform.com/">http://exp-platform.com/</a></dd>
-<dt>Effective A/B Testing</dt>
-<dd><a class="reference external" href="http://elem.com/~btilly/effective-ab-testing/">http://elem.com/~btilly/effective-ab-testing/</a></dd>
-<dt>Startup Lessons Learned</dt>
-<dd><a class="reference external" href="http://www.startuplessonslearned.com/search/label/split-test">http://www.startuplessonslearned.com/search/label/split-test</a></dd>
-<dt>Andrew Chen's Blog</dt>
-<dd><a class="reference external" href="http://andrewchenblog.com/">http://andrewchenblog.com/</a></dd>
-</dl>
+<div class="slide" id="vertica">
+<h1>Vertica</h1>
+<table class="docutils field-list" frame="void" rules="none">
+<col class="field-name" />
+<col class="field-body" />
+<tbody valign="top">
+<tr class="field"><th class="field-name">We paid for it:</th><td class="field-body"></td>
+</tr>
+</tbody>
+</table>
+<ul class="simple">
+<li>Commercial (based on C-Store)</li>
+<li>Clustered</li>
+<li>Would still prefer open source</li>
+</ul>
+</div>
+<div class="slide" id="bitmap-indexes">
+<h1>Bitmap Indexes</h1>
+<table class="docutils field-list" frame="void" rules="none">
+<col class="field-name" />
+<col class="field-body" />
+<tbody valign="top">
+<tr class="field"><th class="field-name" colspan="2">Sequential Scans can be fast:</th></tr>
+<tr><td>&nbsp;</td><td class="field-body"></td>
+</tr>
+</tbody>
+</table>
+<ul class="simple">
+<li>1-N bits per row of data</li>
+<li>Can apply logical operations across indexes</li>
+<li>Can be compressed (BBC, WAH)</li>
+<li>FastBit is a good implementation</li>
+</ul>
+</div>
+<div class="slide" id="bitmap-index-uses">
+<h1>Bitmap Index Uses</h1>
+<table class="docutils field-list" frame="void" rules="none">
+<col class="field-name" />
+<col class="field-body" />
+<tbody valign="top">
+<tr class="field"><th class="field-name">Big Queries:</th><td class="field-body"></td>
+</tr>
+</tbody>
+</table>
+<ul class="simple">
+<li>PostgreSQL 8.1+ in-memory for some queries</li>
+<li>Almost a requirement for column stores</li>
+<li>FastBit is a great implementation (WAH)</li>
+</ul>
+</div>
+<div class="slide" id="bloom-filters-are-neat">
+<h1>Bloom Filters are Neat</h1>
+<table class="docutils field-list" frame="void" rules="none">
+<col class="field-name" />
+<col class="field-body" />
+<tbody valign="top">
+<tr class="field"><th class="field-name" colspan="2">But our Princess is in another castle:</th></tr>
+<tr><td>&nbsp;</td><td class="field-body"></td>
+</tr>
+</tbody>
+</table>
+<ul class="simple">
+<li>Probabilistic data structure</li>
+<li>False positives at a known error</li>
+<li>Constant space</li>
+<li>I won't bore you with the math</li>
+</ul>
+</div>
+<div class="slide" id="bloom-filter-diagram">
+<h1>Bloom Filter Diagram</h1>
+<table class="docutils field-list" frame="void" rules="none">
+<col class="field-name" />
+<col class="field-body" />
+<tbody valign="top">
+<tr class="field"><th class="field-name" colspan="2">Actually Relevant:</th></tr>
+<tr><td>&nbsp;</td><td class="field-body"></td>
+</tr>
+</tbody>
+</table>
+<img alt="images/649px-Bloom_filter.png" src="images/649px-Bloom_filter.png" />
+</div>
+<div class="slide" id="bloom-filter-uses">
+<h1>Bloom Filter Uses</h1>
+<table class="docutils field-list" frame="void" rules="none">
+<col class="field-name" />
+<col class="field-body" />
+<tbody valign="top">
+<tr class="field"><th class="field-name" colspan="2">Find stuff, maybe:</th></tr>
+<tr><td>&nbsp;</td><td class="field-body"></td>
+</tr>
+</tbody>
+</table>
+<ul class="simple">
+<li>Approximate counting of a large set (e.g. unique IPs from logs)</li>
+<li>Knowing that data is definitely NOT stored somewhere, e.g. remote cache</li>
+<li>Several variants (Counting Bloom Filter, Scalable Bloom Filter, ...)</li>
+</ul>
 </div>
 <div class="slide" id="questions">
 <h1>Questions?</h1>
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name">Twitter:</th><td class="field-body">&#64;etrepum</td>
-</tr>
-<tr class="field"><th class="field-name">Blog:</th><td class="field-body"><a class="reference external" href="http://bob.pythonmac.org/">http://bob.pythonmac.org/</a></td>
-</tr>
-<tr class="field"><th class="field-name">Mochi Media:</th><td class="field-body"><a class="reference external" href="http://www.mochimedia.com/">http://www.mochimedia.com/</a></td>
+<tr class="field"><th class="field-name">Open Space:</th><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
-<p>Mochi is hiring:</p>
 <ul class="simple">
-<li><a class="reference external" href="http://www.mochimedia.com/jobs.html">http://www.mochimedia.com/jobs.html</a></li>
+<li>Open Space TODAY &#64; 5pm, Lambert. See Jonathan Ellis</li>
 </ul>
 </div>
 </div>

File slides.pdf

Binary file added.

File slides.txt

 .. raw:: html
     :file: includes/logo.html
 
-=====================================
- Analysis: The Other Kind of Testing
-=====================================
+================================
+ Drop ACID and think about data
+================================
 
 :Author:
-    Bob Ippolito (@etrepum)
+    Bob Ippolito
 :Date:
-    February 2010
+    March 2009
 :Venue:
-    PyCon 2010
+    PyCon 2009
 
 Bob's Perspective
 =================
 
-:Mochi Media:
+:Startup with lots of data:
 
-* CTO/Cofounder of Mochi Media
-* Platform for Flash games
-* Virtual currency and game discovery for gamers
+* Cofounded Mochi Media in 2005
+* MochiBot analytics platform (for Flash)
+* MochiAds ad serving platform (for Flash games)
+* Other cool services for game developers
 
-Analysis
+Mochi Ad Sales
+==============
+
+:Hard Sell:
+
+.. image:: images/mochi_ad_sales.jpg
+
+What's ACID?
+============
+
+:A promise ring your DBMS wears:
+
+:Atomicity:
+    all or nothing
+:Consistency:
+    no explosions
+:Isolation:
+    no fights
+:Durability:
+    no lying
+
+ACID Trips
+==========
+
+:Scalability and reliability:
+
+* Downtime is unacceptable
+* Reliable is >= 2 nodes
+* Scalable is ... more
+* Networks make it hard
+* Networks make it hard
+* Networks make it hard
+
+What can I have?
+================
+
+:CAP theorem says pick two:
+
+* Consistency
+* Availability
+* Partition tolerance
+
+Turn up the BASE
+================
+
+:Write smarter applications:
+
+* Basically Available
+* Soft state
+* Eventually consistent
+
+BASE jumping
+============
+
+:Everyone else is doing it:
+
+* Google
+* Amazon
+* eBay
+* Yahoo!
+* Facebook
+* ...
+
+BigTable
 ========
 
-:Definition:
+:Google:
 
-* Separation of a whole into its component parts
-* An examination of a complex, its elements, and their relations
+* Paxos (Chubby)
+* Single-master
+* Distributed tablets via GFS
+* Row/Column db hybrid
+* Compression (BMDiff, Zippy)
+* Versioned (Row, Column, Timestamp)
+* Bloom filters
 
-Why do I care?
+BigTable Pros
+=============
+
+:Pros:
+
+* Compression = Awesome
+* Clients are probably simple
+* Integrates with map/reduce
+
+BigTable Cons
+=============
+
+:Cons:
+
+* Proprietary to Google
+* Single-master
+
+BigTable Diagram
+================
+
+:Single-master:
+
+.. image:: images/i-has-minions.jpg
+
+Dynamo
+======
+
+:Amazon:
+
+* Key/Value store
+* Consistent hashing
+* Vector clocks
+* Read repair
+
+Dynamo Pros
+===========
+
+:Pros:
+
+* No master
+* Highly available for write
+* Knobs to make it fast to read
+* "Simple" (lots of half-baked clones!)
+
+Dynamo Cons
+===========
+
+:Cons:
+
+* Proprietary to Amazon
+* Clients need to be smart
+* No compression
+* Not suitable for column-like workloads
+* Just a Key/Value store
+
+Dynamo Diagram
 ==============
 
-:SCIENCE:
+:Smart client:
 
-* Be creative when exploring new ideas
-* Not when measuring them
+.. image:: images/we-await-ur-instrucsions.jpg
 
-Who does this?
+Cassandra
+=========
+
+:Facebook -> Apache:
+
+* Open source!
+* No master like Dynamo
+* Storage model more like BigTable
+
+Cassandra Pros
 ==============
 
-:A-Z of online:
+:Pros:
 
-* Amazon
-* Facebook
-* Google
-* Microsoft
-* Zynga
+* OPEN SOURCE
+* Incrementally scalable
+* Minimal administration
 
-How it works
-============
+Cassandra Cons
+==============
 
-:Scientific Method:
+:Cons:
 
-* Create a hypothesis
-* Design an experiment
-* Run experiment
-* Analyze the results
+* Not polished
+* No compression yet
 
-Hypothesis
-==========
+Cassandra Diagram
+=================
 
-:Your Great Idea:
+:Soul Calibur:
 
-* Divergent from what you have
-* Simple
-* Testable
+.. image:: images/sc4_pub_ss_cassandra003_copy.jpg
 
-Testable
-========
+Distributed Musings
+===================
 
-:Fitness Metric:
+:New Hotness:
 
-* Quantifiable
-* Business relevant
-* Key Performance Indicator
-* Overall Evaluation Criteria
+* Distributed databases are the new web framework
+* ... except none of them are awesome yet
+* I don't think we need another half-baked Dynamo clone
 
-Fitness Metric Examples
-=======================
+Key-Value Stores
+================
 
-:Your Metric May Vary:
+:Simple and Fast:
 
-* Click-through rate (CTR)
-* Cost Per Acquisition (CPA)
-* Avg $ per user (ARPU)
-* % of gamers who beat the last level
+* Similar to a Python dict
+* Keys usually bytes, probably limited
+* Values usually bytes, often have fewer limits
+* Extremely fast, simple
 
-Experiment
-==========
+Memcached
+=========
 
-:"To Try Out":
+:Key/Value store as cache:
 
-* Method of investigation
-* Many ways to do this
-* Easiest to follow a template
+* No persistence
+* RAM only
+* Throws data away (on purpose)
+* Lightning fast
+* "Everyone" uses it
 
-Split Testing
-=============
+Caching Immutable Data
+======================
 
-:A/B Testing:
+:If only data never changed:
 
-* 50% control group gets no changes
-* 50% test group sees changes
-* Group selection method is important
+* Immutable is easy, do that
 
-Group Selection
-===============
-
-:random.choice:
-
-* Group should be randomly selected
-* Pin user to group during experiment
-* Easiest with logins, but can be done without (cookies, etc.)
-
-Selection Pseudocode
+Caching Mutable Data
 ====================
 
-::
+:Invalidation sucks:
 
-    from random import choice
+* Mutable is hard
+* Failed transactions?
+* Concurrent writers?
+* Dependent cache keys?
+* You will get it wrong and it will be hard to debug
 
-    def view_page():
-        if get_bucket(user) == 'test':
-            test_group_page()
-        else:
-            control_group_page()
+Tokyo Cabinet/Tyrant
+====================
 
-    def get_bucket(user):
-        if user.bucket is None:
-            user.bucket = choice(('test', 'control'))
-            user.save()
-        return user.bucket
-    
-Logging
+:Not your mom's BerkeleyDB:
+
+* Disk persistent
+* Very performant
+* Actively developed
+* Similar replication strategy to MySQL
+
+Redis
+=====
+
+:Still very new:
+
+* Not just a Key/Value store
+* Matching on key spaces
+* Values can be bytes, lists or sets
+* Requires full store in RAM
+* Might be a nice cache server?
+
+Document Databases
+==================
+
+:Schema-free:
+
+* Very easy to use
+* Document Versioning
+* Great for storing documents
+
+CouchDB
 =======
 
-:Log Everything:
+:Document DB Poster Child:
 
-* Log everything you can
-* Recording the bucket is very important
-* Flat files are easy
-* JSON is convenient (can be newline delimited)
+* Apache project
+* Asynchronous replication
+* JSON based
+* Views materialized on demand (not indexes)
+* Neat admin UI
 
-Logging Pseudocode
-==================
+MongoDB
+=======
 
-:Just Kidding:
+:C++'s revenge:
 
-* It's not actually easy
-* Multiple threads/processes/machines -> :(
-* Options include database, syslog or message queue
-* Gets harder at scale (e.g. Facebook's Scribe)
+* Fast
+* JSON and BSON (binary JSON-ish)
+* Asynchronous replication with auto-sharding "soon"
+* Index support
+* Nested documents
+* Advanced queries
 
-Log Processing
+Column Databases
+================
+
+:Data Warehousing:
+
+* Sequential reads are awesome
+* Columns compress better than rows
+* Doesn't waste I/O on uninteresting columns
+
+MonetDB
+=======
+
+:Research project:
+
+* Tried really hard to get it to work
+* Crashes a lot and corrupts your data
+* Do not waste your time
+
+LucidDB
+=======
+
+:Sounds interesting:
+
+* Java/C++ open source data warehouse
+* No clustering
+* No experience yet
+
+Vertica
+=======
+
+:We paid for it:
+
+* Commercial (based on C-Store)
+* Clustered
+* Would still prefer open source
+
+Bitmap Indexes
 ==============
 
-:ETL:
+:Sequential Scans can be fast:
 
-* Extract the log data
-* Transform it into a useful format
-* Load the data for analysis
-* Typically done in batch, e.g. daily or hourly
+* 1-N bits per row of data
+* Can apply logical operations across indexes
+* Can be compressed (BBC, WAH)
+* FastBit is a good implementation
 
-Log Processing Example
+Bitmap Index Uses
+=================
+
+:Big Queries:
+
+* PostgreSQL 8.1+ builds in-memory bitmaps for some queries
+* Almost a requirement for column stores
+* FastBit is a great implementation (WAH)
+
+Bloom Filters are Neat
 ======================
 
-::
+:But our Princess is in another castle:
 
-    db = sqlite3.connect('testdata.db')
-    cur = db.cursor()
-    COLS = 'timestamp', 'user_id', 'bucket', 'dollars'
-    cur.execute('CREATE TABLE testdata(' +
-                ','.join(COLS) + ')')
+* Probabilistic data structure
+* False positives at a known error rate
+* Constant space
+* I won't bore you with the math
 
-    for line in log_lines:
-        dct = json.loads(line)
-        row = [dct[col] for col in COLS]
-        cur.execute(
-            'INSERT INTO testdata VALUES(?,?,?,?)',
-            row)
-    db.commit()
+Bloom Filter Diagram
+====================
 
-Analysis
-========
+:Actually Relevant:
 
-:Math time:
+.. image:: images/649px-Bloom_filter.png 
 
-* Plot data over time to see difference visually
-* Make sure to compare sample size
-* Check your work
-
-Overall Query
-=============
-
-:Overall Performance:
-
-::
-
-    SELECT
-        bucket,
-        COUNT(DISTINCT user_id) AS sample_size,
-        SUM(dollars)*1.0/COUNT(DISTINCT user_id) AS arpu
-    FROM testdata
-    GROUP BY bucket;
-
-Daily Query
-===========
-
-:Plot This:
-
-::
-
-    SELECT
-        date(timestamp, 'unixepoch') as day,
-        bucket,
-        COUNT(DISTINCT user_id) AS sample_size,
-        SUM(dollars)*1.0/COUNT(DISTINCT user_id) AS arpu
-    FROM testdata
-    GROUP BY day, bucket ORDER BY day, bucket;
-
-Sanity Check Query
-==================
-
-:No Results is Good:
-
-::
-
-    SELECT user_id
-    FROM testdata
-    GROUP BY user_id HAVING COUNT(DISTINCT bucket)>1;
-
-Simpson's Paradox
+Bloom Filter Uses
 =================
 
-:When Good is Bad:
+:Find stuff, maybe:
 
-* Hidden variables can confuse results
-* Sample size is important
-* Random selection and 50/50 split helps avoid this issue
-
-Paradox Example
-===============
-
-:Grad School Acceptance:
-
-+---------+-----------------+----------------+
-|         | Men             | Women          |
-+---------+-----------------+----------------+
-| Arts    | 3/4 (75%)       | 1/1 **(100%)** |
-+---------+-----------------+----------------+
-| Science | 0/1 (0%)        | 1/4 **(25%)**  |
-+---------+-----------------+----------------+
-| Totals  | 3/5 **(60%)**   | 2/5 (40%)      |
-+---------+-----------------+----------------+
-
-Ramp-up
-=======
-
-:Start Slow:
-
-* New features are scary, might be broken
-* Run a staged experiment, e.g. 1% -> 5% -> 20% -> 50%
-* If it is obviously broken, abort!
-
-Ramp-up Implementation
-======================
-
-:Buckets For Your Buckets:
-
-* Segment users randomly into range(100) buckets
-* During 1% period include range(1) in test group
-* During 5% period include range(5) in test group
-* ...
-
-Ramp-up Analysis
-================
-
-:More Math:
-
-* Users will cross from control to test group during experiment
-* Overlap is still bad
-* Easiest to only look at single test period
-* Double-check queries to make sure there is no overlap
-
-Parallel Tests
-==============
-
-:Even More Math:
-
-* Multivariate Testing
-* Test multiple changes in parallel
-* Changes can interfere, reinforce, or have no effect on each other
-* Beyond scope of this talk
-
-Group Selection Pitfalls
-========================
-
-:DANGER!:
-
-* Each experiment should have independent group selection
-* If you re-use buckets across experiments your data is invalid
-* Important even if not running parallel tests!
-
-Confidence
-==========
-
-:More Data:
-
-* Confidence goes up with lots of samples
-* Chi-squared or Student's t-test good candidates
-* Not my expertise, find a statistician :)
-
-Free Tools
-==========
-
-Django Lean
-    http://bitbucket.org/akoha/django-lean/
-Google Website Optimizer
-    http://services.google.com/websiteoptimizer/
-
-More Info
-=========
-
-Microsoft's Experimentation Platform
-    http://exp-platform.com/
-Effective A/B Testing
-    http://elem.com/~btilly/effective-ab-testing/
-Startup Lessons Learned
-    http://www.startuplessonslearned.com/search/label/split-test
-Andrew Chen's Blog
-    http://andrewchenblog.com/
+* Approximate counting of a large set (e.g. unique IPs from logs)
+* Knowing that data is definitely NOT stored somewhere, e.g. remote cache
+* Several variants (Counting Bloom Filter, Scalable Bloom Filter, ...)
 
 Questions?
 ==========
 
-:Twitter:
-  @etrepum
-:Blog:
-  http://bob.pythonmac.org/
-:Mochi Media:
-  http://www.mochimedia.com/
-  
-Mochi is hiring:
+:Open Space:
 
-* http://www.mochimedia.com/jobs.html
+* Open Space TODAY @ 5pm, Lambert. See Jonathan Ellis
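
Note on the Bloom filter slides restored by this backout: the slides skip the math, so here is a minimal Python sketch of the idea, not code from the talk or the repository. The class name `BloomFilter`, its parameters, and the use of SHA-256 with double hashing are all assumptions chosen for illustration. It shows the three properties the slides list: the structure is probabilistic, its size is fixed up front, and the false positive rate is set by the sizing formulas.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size Bloom filter (illustrative sketch, hypothetical API)."""

    def __init__(self, expected_items, error_rate):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(1, math.ceil(
            -expected_items * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(
            self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two 64-bit
        # halves of a single SHA-256 digest of the item.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        # Set every bit position derived from the item.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "maybe present"; False means "definitely not present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Usage: track which IPs have been seen, sized for 1000 items at 1% error.
seen = BloomFilter(expected_items=1000, error_rate=0.01)
seen.add("10.0.0.1")
print("10.0.0.1" in seen)  # an added item is always reported as present
print("10.0.0.2" in seen)  # usually False; True here would be a false positive
```

The "definitely NOT stored" use from the slides falls out of `__contains__`: a single unset bit proves the item was never added, so a remote cache lookup can be skipped. Memory stays at `num_bits` no matter how many items are added, but the false positive rate climbs once the count exceeds `expected_items`.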