Bob Ippolito committed d6798e2

checkpoint

Files changed (12)

+testdata.db
 
 slides.html: slides.txt includes/*.html
 	rst2s5.py --theme-url ui/mochikit slides.txt $@
+	./bin/fixup_definitions.py $@
+
+testdata.db: bin/make_testdata.py
+	./bin/make_testdata.py $@
 
 .PHONY: all clean

bin/fixup_definitions.py

+#!/usr/bin/env python
+"""Collapse two-row docutils field-list headers into a single row, in place."""
+import io
+import sys
+
+def main():
+    fn = sys.argv[1]
+    # io.open in text mode so the str patterns below work on Python 2 and 3.
+    with io.open(fn, 'r', encoding='utf-8') as f:
+        txt = f.read()
+    txt = txt.replace(
+        '<th class="field-name" colspan="2">',
+        '<th class="field-name">').replace(
+        '</tr>\n<tr><td>&nbsp;</td><td class="field-body"></td>',
+        '<td class="field-body"></td>')
+    with io.open(fn, 'w', encoding='utf-8') as f:
+        f.write(txt)
+
+
+if __name__ == '__main__':
+    main()

bin/make_testdata.py

+#!/usr/bin/env python
+import sys
+import random
+import sqlite3
+
+def main():
+    db = sqlite3.connect(sys.argv[1])
+    cur = db.cursor()
+    COLS = 'timestamp', 'user_id', 'bucket', 'dollars'
+    cur.execute('CREATE TABLE testdata(' +
+                ','.join(COLS) + ')')
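+    # Start of the simulated two-week window: 2010-01-01 05:00:00 UTC.
+    EPOCH = 1262322000
+    # range() rather than xrange() so the script also runs on Python 3.
+    for i in range(1000000):
+        # 0 = control, 1 = test
+        bucket = random.randrange(2)
+        row = [
+            EPOCH + random.randrange(14 * 86400),
+            # user_id ranges are disjoint, so no user lands in both buckets
+            (0,10000)[bucket] + random.randrange(10000),
+            ('control','test')[bucket],
+            # dollars: 0-9 for control, 0-19 for test
+            random.randrange(10 * (1 + bucket)),
+        ]
+        cur.execute('INSERT INTO testdata VALUES(?,?,?,?)',
+                    row)
+    db.commit()
+    db.close()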
+
+if __name__ == '__main__':
+    main()

images/649px-Bloom_filter.png

Removed

images/i-has-minions.jpg

Removed

images/mochi_ad_sales.jpg

Removed

images/sc4_pub_ss_cassandra003_copy.jpg

Removed

images/we-await-ur-instrucsions.jpg

Removed
 <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
 <meta name="generator" content="Docutils 0.5: http://docutils.sourceforge.net/" />
 <meta name="version" content="S5 1.1" />
-<title>Drop ACID and think about data</title>
-<meta name="author" content="Bob Ippolito" />
-<meta name="date" content="March 2009" />
+<title>Analysis: The Other Kind of Testing</title>
+<meta name="author" content="Bob Ippolito (&#64;etrepum)" />
+<meta name="date" content="February 2010" />
 <style type="text/css">
 
 /*
 
 </div>
 <div id="footer">
-<h1>Drop ACID and think about data</h1>
+<h1>Analysis: The Other Kind of Testing</h1>
 
 </div>
 </div>
 <div class="presentation">
 <div class="slide" id="slide0">
-<h1 class="title">Drop ACID and think about data</h1>
+<h1 class="title">Analysis: The Other Kind of Testing</h1>
 <table class="docinfo" frame="void" rules="none">
 <col class="docinfo-name" />
 <col class="docinfo-content" />
 <tbody valign="top">
 <tr><th class="docinfo-name">Author:</th>
-<td>Bob Ippolito</td></tr>
+<td>Bob Ippolito (&#64;etrepum)</td></tr>
 <tr><th class="docinfo-name">Date:</th>
-<td>March 2009</td></tr>
-<tr class="field"><th class="docinfo-name">Venue:</th><td class="field-body">PyCon 2009</td>
+<td>February 2010</td></tr>
+<tr class="field"><th class="docinfo-name">Venue:</th><td class="field-body">PyCon 2010</td>
 </tr>
 </tbody>
 </table>
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name" colspan="2">Startup with lots of data:</th></tr>
-<tr><td>&nbsp;</td><td class="field-body"></td>
+<tr class="field"><th class="field-name">Mochi Media:</th><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
 <ul class="simple">
-<li>Cofounded Mochi Media in 2005</li>
-<li>MochiBot analytics platform (for Flash)</li>
-<li>MochiAds ad serving platform (for Flash games)</li>
-<li>Other cool services for game developers</li>
+<li>CTO/Cofounder of Mochi Media</li>
+<li>Platform for Flash games</li>
+<li>Virtual currency and game discovery for gamers</li>
 </ul>
 </div>
-<div class="slide" id="mochi-ad-sales">
-<h1>Mochi Ad Sales</h1>
+<div class="slide" id="analysis">
+<h1>Analysis</h1>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name">Hard Sell:</th><td class="field-body"></td>
+<tr class="field"><th class="field-name">Definition:</th><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
-<img alt="images/mochi_ad_sales.jpg" src="images/mochi_ad_sales.jpg" />
+<ul class="simple">
+<li>Separation of a whole into its component parts</li>
+<li>An examination of a complex, its elements, and their relations</li>
+</ul>
 </div>
-<div class="slide" id="what-s-acid">
-<h1>What's ACID?</h1>
+<div class="slide" id="why-do-i-care">
+<h1>Why do I care?</h1>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name" colspan="2">A promise ring your DBMS wears:</th></tr>
-<tr><td>&nbsp;</td><td class="field-body"></td>
+<tr class="field"><th class="field-name">SCIENCE:</th><td class="field-body"></td>
 </tr>
-<tr class="field"><th class="field-name">Atomicity:</th><td class="field-body">all or nothing</td>
+</tbody>
+</table>
+<ul class="simple">
+<li>Be creative when exploring new ideas</li>
+<li>Not when measuring them</li>
+</ul>
+</div>
+<div class="slide" id="who-does-this">
+<h1>Who does this?</h1>
+<table class="docutils field-list" frame="void" rules="none">
+<col class="field-name" />
+<col class="field-body" />
+<tbody valign="top">
+<tr class="field"><th class="field-name">A-Z of online:</th><td class="field-body"></td>
 </tr>
-<tr class="field"><th class="field-name">Consistency:</th><td class="field-body">no explosions</td>
+</tbody>
+</table>
+<ul class="simple">
+<li>Amazon</li>
+<li>Facebook</li>
+<li>Google</li>
+<li>Microsoft</li>
+<li>Zynga</li>
+</ul>
+</div>
+<div class="slide" id="how-it-works">
+<h1>How it works</h1>
+<table class="docutils field-list" frame="void" rules="none">
+<col class="field-name" />
+<col class="field-body" />
+<tbody valign="top">
+<tr class="field"><th class="field-name">Scientific Method:</th><td class="field-body"></td>
 </tr>
-<tr class="field"><th class="field-name">Isolation:</th><td class="field-body">no fights</td>
+</tbody>
+</table>
+<ul class="simple">
+<li>Create a hypothesis</li>
+<li>Design an experiment</li>
+<li>Run experiment</li>
+<li>Analyze the results</li>
+</ul>
+</div>
+<div class="slide" id="hypothesis">
+<h1>Hypothesis</h1>
+<table class="docutils field-list" frame="void" rules="none">
+<col class="field-name" />
+<col class="field-body" />
+<tbody valign="top">
+<tr class="field"><th class="field-name">Your Great Idea:</th><td class="field-body"></td>
 </tr>
-<tr class="field"><th class="field-name">Durability:</th><td class="field-body">no lying</td>
+</tbody>
+</table>
+<ul class="simple">
+<li>Divergent from what you have</li>
+<li>Simple</li>
+<li>Testable</li>
+</ul>
+</div>
+<div class="slide" id="testable">
+<h1>Testable</h1>
+<table class="docutils field-list" frame="void" rules="none">
+<col class="field-name" />
+<col class="field-body" />
+<tbody valign="top">
+<tr class="field"><th class="field-name">Fitness Metric:</th><td class="field-body"></td>
+</tr>
+</tbody>
+</table>
+<ul class="simple">
+<li>Quantifiable</li>
+<li>Business relevant</li>
+<li>Key Performance Indicator</li>
+<li>Overall Evaluation Criterion (OEC)</li>
+</ul>
+</div>
+<div class="slide" id="fitness-metric-examples">
+<h1>Fitness Metric Examples</h1>
+<table class="docutils field-list" frame="void" rules="none">
+<col class="field-name" />
+<col class="field-body" />
+<tbody valign="top">
+<tr class="field"><th class="field-name">Your Metric May Vary:</th><td class="field-body"></td>
+</tr>
+</tbody>
+</table>
+<ul class="simple">
+<li>Click-through rate (CTR)</li>
+<li>Cost Per Acquisition (CPA)</li>
+<li>Average revenue per user (ARPU)</li>
+<li>% of gamers who beat the last level</li>
+</ul>
+</div>
+<div class="slide" id="experiment">
+<h1>Experiment</h1>
+<table class="docutils field-list" frame="void" rules="none">
+<col class="field-name" />
+<col class="field-body" />
+<tbody valign="top">
+<tr class="field"><th class="field-name">&quot;To Try Out&quot;:</th><td class="field-body"></td>
+</tr>
+</tbody>
+</table>
+<ul class="simple">
+<li>Method of investigation</li>
+<li>Many ways to do this</li>
+<li>Easiest to follow a template</li>
+</ul>
+</div>
+<div class="slide" id="split-testing">
+<h1>Split Testing</h1>
+<table class="docutils field-list" frame="void" rules="none">
+<col class="field-name" />
+<col class="field-body" />
+<tbody valign="top">
+<tr class="field"><th class="field-name">A/B Testing:</th><td class="field-body"></td>
+</tr>
+</tbody>
+</table>
+<ul class="simple">
+<li>50% control group gets no changes</li>
+<li>50% test group sees changes</li>
+<li>Group selection method is important</li>
+</ul>
+</div>
+<div class="slide" id="group-selection">
+<h1>Group Selection</h1>
+<table class="docutils field-list" frame="void" rules="none">
+<col class="field-name" />
+<col class="field-body" />
+<tbody valign="top">
+<tr class="field"><th class="field-name">random.choice:</th><td class="field-body"></td>
+</tr>
+</tbody>
+</table>
+<ul class="simple">
+<li>Group should be randomly selected</li>
+<li>Pin user to group during experiment</li>
+<li>Easiest with logins, but can be done without (cookies, etc.)</li>
+</ul>
+</div>
+<div class="slide" id="selection-pseudocode">
+<h1>Selection Pseudocode</h1>
+<pre class="literal-block">
+from random import choice
+
+def view_page(user):
+    if get_bucket(user) == 'test':
+        test_group_page()
+    else:
+        control_group_page()
+
+def get_bucket(user):
+    if user.bucket is None:
+        user.bucket = choice(('test', 'control'))
+        user.save()
+    return user.bucket
+</pre>
+</div>
+<div class="slide" id="logging">
+<h1>Logging</h1>
+<table class="docutils field-list" frame="void" rules="none">
+<col class="field-name" />
+<col class="field-body" />
+<tbody valign="top">
+<tr class="field"><th class="field-name">Log Everything:</th><td class="field-body"></td>
+</tr>
+</tbody>
+</table>
+<ul class="simple">
+<li>Log everything you can</li>
+<li>Recording the bucket is very important</li>
+<li>Flat files are easy</li>
+<li>JSON is convenient (can be newline delimited)</li>
+</ul>
+</div>
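A minimal sketch of the newline-delimited JSON logging mentioned on the Logging slide, assuming a single writer and an in-memory stream standing in for a flat file. The helper names `log_event` and `read_events` are hypothetical and do not appear in the talk.

```python
import io
import json

def log_event(stream, **fields):
    """Append one event as a single JSON line."""
    # json.dumps escapes newlines inside strings, so each event
    # always occupies exactly one line of the log.
    stream.write(json.dumps(fields, sort_keys=True) + "\n")

def read_events(stream):
    """Yield one dict per non-empty line of the log."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

# Usage: record the bucket with every event, then read the log back.
log = io.StringIO()
log_event(log, timestamp=1262322000, user_id=42, bucket="test", dollars=3)
log_event(log, timestamp=1262322060, user_id=7, bucket="control", dollars=0)
log.seek(0)
events = list(read_events(log))
```

With several threads, processes, or machines writing at once, a shared flat file like this is no longer safe, which is the point of the "Just Kidding" slide that follows.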
+<div class="slide" id="logging-pseudocode">
+<h1>Logging Pseudocode</h1>
+<table class="docutils field-list" frame="void" rules="none">
+<col class="field-name" />
+<col class="field-body" />
+<tbody valign="top">
+<tr class="field"><th class="field-name">Just Kidding:</th><td class="field-body"></td>
+</tr>
+</tbody>
+</table>
+<ul class="simple">
+<li>It's not actually easy</li>
+<li>Multiple threads/processes/machines -&gt; :(</li>
+<li>Options include database, syslog or message queue</li>
+<li>Gets harder at scale (e.g. Facebook's Scribe)</li>
+</ul>
+</div>
+<div class="slide" id="log-processing">
+<h1>Log Processing</h1>
+<table class="docutils field-list" frame="void" rules="none">
+<col class="field-name" />
+<col class="field-body" />
+<tbody valign="top">
+<tr class="field"><th class="field-name">ETL:</th><td class="field-body"></td>
+</tr>
+</tbody>
+</table>
+<ul class="simple">
+<li>Extract the log data</li>
+<li>Transform it into a useful format</li>
+<li>Load the data for analysis</li>
+<li>Typically done in batch, e.g. daily or hourly</li>
+</ul>
+</div>
+<div class="slide" id="log-processing-example">
+<h1>Log Processing Example</h1>
+<pre class="literal-block">
+import json
+import sqlite3
+
+db = sqlite3.connect('testdata.db')
+cur = db.cursor()
+COLS = 'timestamp', 'user_id', 'bucket', 'dollars'
+cur.execute('CREATE TABLE testdata(' +
+            ','.join(COLS) + ')')
+
+# log_lines: any iterable of newline-delimited JSON records
+for line in log_lines:
+    dct = json.loads(line)
+    row = [dct[col] for col in COLS]
+    cur.execute(
+        'INSERT INTO testdata VALUES(?,?,?,?)',
+        row)
+db.commit()
+</pre>
+</div>
+<div class="slide" id="id1">
+<h1>Analysis</h1>
+<table class="docutils field-list" frame="void" rules="none">
+<col class="field-name" />
+<col class="field-body" />
+<tbody valign="top">
+<tr class="field"><th class="field-name">Math time:</th><td class="field-body"></td>
+</tr>
+</tbody>
+</table>
+<ul class="simple">
+<li>Plot data over time to see difference visually</li>
+<li>Make sure to compare sample size</li>
+<li>Check your work</li>
+</ul>
+</div>
+<div class="slide" id="overall-query">
+<h1>Overall Query</h1>
+<table class="docutils field-list" frame="void" rules="none">
+<col class="field-name" />
+<col class="field-body" />
+<tbody valign="top">
+<tr class="field"><th class="field-name">Overall Performance:</th><td class="field-body"></td>
+</tr>
+</tbody>
+</table>
+<pre class="literal-block">
+SELECT
+    bucket,
+    COUNT(DISTINCT user_id) AS sample_size,
+    SUM(dollars)*1.0/COUNT(DISTINCT user_id) AS arpu
+FROM testdata
+GROUP BY bucket;
+</pre>
+</div>
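The Overall Query slide can be tried without the million-row `testdata.db`. The sketch below runs the same query against an in-memory SQLite table holding five hand-made rows. The rows and the function name `overall_arpu` are illustrative assumptions, and `ORDER BY bucket` is added only to make the row order predictable.

```python
import sqlite3

OVERALL_QUERY = """
SELECT
    bucket,
    COUNT(DISTINCT user_id) AS sample_size,
    SUM(dollars)*1.0/COUNT(DISTINCT user_id) AS arpu
FROM testdata
GROUP BY bucket
ORDER BY bucket;
"""

def overall_arpu(db):
    """Return (bucket, sample_size, arpu) rows, one per bucket."""
    return db.execute(OVERALL_QUERY).fetchall()

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE testdata(timestamp, user_id, bucket, dollars)")
db.executemany(
    "INSERT INTO testdata VALUES(?,?,?,?)",
    [
        # user 1 appears twice: COUNT(DISTINCT user_id) counts them once
        (1262322000, 1, "control", 2),
        (1262322100, 1, "control", 4),
        (1262322200, 2, "control", 0),
        (1262322300, 10001, "test", 9),
        (1262322400, 10002, "test", 3),
    ],
)
result = overall_arpu(db)
```

Control has two distinct users and 6 dollars in total, so its ARPU is 3.0. Test has two users and 12 dollars, so its ARPU is 6.0.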
+<div class="slide" id="daily-query">
+<h1>Daily Query</h1>
+<table class="docutils field-list" frame="void" rules="none">
+<col class="field-name" />
+<col class="field-body" />
+<tbody valign="top">
+<tr class="field"><th class="field-name">Plot This:</th><td class="field-body"></td>
+</tr>
+</tbody>
+</table>
+<pre class="literal-block">
+SELECT
+    date(timestamp, 'unixepoch') AS day,
+    bucket,
+    COUNT(DISTINCT user_id) AS sample_size,
+    SUM(dollars)*1.0/COUNT(DISTINCT user_id) AS arpu
+FROM testdata
+GROUP BY day, bucket ORDER BY day, bucket;
+</pre>
+</div>
+<div class="slide" id="sanity-check-query">
+<h1>Sanity Check Query</h1>
+<table class="docutils field-list" frame="void" rules="none">
+<col class="field-name" />
+<col class="field-body" />
+<tbody valign="top">
+<tr class="field"><th class="field-name">No Results is Good:</th><td class="field-body"></td>
+</tr>
+</tbody>
+</table>
+<pre class="literal-block">
+SELECT user_id
+FROM testdata
+GROUP BY user_id HAVING COUNT(DISTINCT bucket)&gt;1;
+</pre>
+</div>
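To see what a failing sanity check looks like, the sketch below runs the slide's query against an in-memory table in which one user has deliberately been logged in both buckets. The sample rows and the function name `users_in_multiple_buckets` are assumptions made for illustration.

```python
import sqlite3

SANITY_QUERY = """
SELECT user_id
FROM testdata
GROUP BY user_id HAVING COUNT(DISTINCT bucket)>1;
"""

def users_in_multiple_buckets(db):
    """Return the user_ids that were logged under more than one bucket."""
    return [user_id for (user_id,) in db.execute(SANITY_QUERY)]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE testdata(timestamp, user_id, bucket, dollars)")
db.executemany(
    "INSERT INTO testdata VALUES(?,?,?,?)",
    [
        (1262322000, 1, "control", 2),
        (1262322100, 2, "test", 5),
        (1262322200, 3, "control", 1),
        (1262322300, 3, "test", 4),  # user 3 switched buckets: a bug
    ],
)
offenders = users_in_multiple_buckets(db)
```

An empty list is the good outcome. Here `offenders` contains user 3, which would point to a problem in how users are pinned to a group.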
+<div class="slide" id="simpson-s-paradox">
+<h1>Simpson's Paradox</h1>
+<table class="docutils field-list" frame="void" rules="none">
+<col class="field-name" />
+<col class="field-body" />
+<tbody valign="top">
+<tr class="field"><th class="field-name">When Good is Bad:</th><td class="field-body"></td>
+</tr>
+</tbody>
+</table>
+<ul class="simple">
+<li>Hidden variables can confuse results</li>
+<li>Sample size is important</li>
+<li>Random selection and a 50/50 split help avoid this issue</li>
+</ul>
+</div>
+<div class="slide" id="paradox-example">
+<h1>Paradox Example</h1>
+<table class="docutils field-list" frame="void" rules="none">
+<col class="field-name" />
+<col class="field-body" />
+<tbody valign="top">
+<tr class="field"><th class="field-name">Grad School Acceptance:</th><td class="field-body"></td>
+</tr>
+</tbody>
+</table>
+<table border="1" class="docutils">
+<colgroup>
+<col width="21%" />
+<col width="40%" />
+<col width="38%" />
+</colgroup>
+<tbody valign="top">
+<tr><td>&nbsp;</td>
+<td>Men</td>
+<td>Women</td>
+</tr>
+<tr><td>Arts</td>
+<td>3/4 (75%)</td>
+<td>1/1 <strong>(100%)</strong></td>
+</tr>
+<tr><td>Science</td>
+<td>0/1 (0%)</td>
+<td>1/4 <strong>(25%)</strong></td>
+</tr>
+<tr><td>Totals</td>
+<td>3/5 <strong>(60%)</strong></td>
+<td>2/5 (40%)</td>
 </tr>
 </tbody>
 </table>
 </div>
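The arithmetic behind the grad school table can be checked directly. This sketch copies the accepted/applied counts from the table and confirms that women have the higher acceptance rate in each department while men have the higher rate overall. The function names are illustrative.

```python
from fractions import Fraction

# (accepted, applied) per department, copied from the table above
men = {"Arts": (3, 4), "Science": (0, 1)}
women = {"Arts": (1, 1), "Science": (1, 4)}

def rate(accepted, applied):
    """Acceptance rate as an exact fraction."""
    return Fraction(accepted, applied)

def total_rate(groups):
    """Pooled acceptance rate across all departments."""
    accepted = sum(a for a, _ in groups.values())
    applied = sum(n for _, n in groups.values())
    return Fraction(accepted, applied)

# Women do better within every department...
women_win_each_dept = all(rate(*women[d]) > rate(*men[d]) for d in men)
# ...yet men do better once the departments are pooled.
men_win_overall = total_rate(men) > total_rate(women)
```

The reversal comes from the uneven split: most men applied to Arts, where acceptance is high, and most women applied to Science, where it is low.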
-<div class="slide" id="acid-trips">
-<h1>ACID Trips</h1>
+<div class="slide" id="ramp-up">
+<h1>Ramp-up</h1>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name" colspan="2">Scalability and reliability:</th></tr>
-<tr><td>&nbsp;</td><td class="field-body"></td>
+<tr class="field"><th class="field-name">Start Slow:</th><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
 <ul class="simple">
-<li>Downtime is unacceptable</li>
-<li>Reliable is &gt;= 2 nodes</li>
-<li>Scalable is ... more</li>
-<li>Networks make it hard</li>
-<li>Networks make it hard</li>
-<li>Networks make it hard</li>
+<li>New features are scary, might be broken</li>
+<li>Run a staged experiment, e.g. 1% -&gt; 5% -&gt; 20% -&gt; 50%</li>
+<li>If it is obviously broken, abort!</li>
 </ul>
 </div>
-<div class="slide" id="what-can-i-have">
-<h1>What can I have?</h1>
+<div class="slide" id="ramp-up-implementation">
+<h1>Ramp-up Implementation</h1>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name" colspan="2">CAP theorem says pick two:</th></tr>
-<tr><td>&nbsp;</td><td class="field-body"></td>
+<tr class="field"><th class="field-name">Buckets For Your Buckets:</th><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
 <ul class="simple">
-<li>Consistency</li>
-<li>Availability</li>
-<li>Partition tolerance</li>
+<li>Segment users randomly into range(100) buckets</li>
+<li>During 1% period include range(1) in test group</li>
+<li>During 5% period include range(5) in test group</li>
+<li>...</li>
 </ul>
 </div>
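One possible way to implement the range(100) segments is sketched below. Instead of storing a random draw per user, it hashes the user id together with an experiment name, which gives a stable, pseudo-random segment without extra storage. The hashing approach, the function names, and the `experiment` parameter are assumptions, not something the slides specify.

```python
import hashlib

def user_segment(experiment, user_id):
    """Map a user to a stable segment in range(100) for one experiment."""
    # Including the experiment name gives each experiment its own,
    # independent assignment of users to segments.
    key = "{}:{}".format(experiment, user_id).encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return int(digest, 16) % 100

def in_test_group(experiment, user_id, percent):
    """True if the user is in the test group at the given ramp-up stage."""
    # At 1% only segment 0 qualifies, at 5% segments 0-4, and so on,
    # so anyone in the test group at one stage stays in it at later stages.
    return user_segment(experiment, user_id) < percent

# Usage: the same user checked at two ramp-up stages.
stage_1 = in_test_group("new-checkout", 12345, 1)
stage_50 = in_test_group("new-checkout", 12345, 50)
```

Because the test group only grows, some users move from control to test as the percentage rises, which is why the Ramp-up Analysis slide recommends analysing a single period at a time.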
-<div class="slide" id="turn-up-the-base">
-<h1>Turn up the BASE</h1>
+<div class="slide" id="ramp-up-analysis">
+<h1>Ramp-up Analysis</h1>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name" colspan="2">Write smarter applications:</th></tr>
-<tr><td>&nbsp;</td><td class="field-body"></td>
+<tr class="field"><th class="field-name">More Math:</th><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
 <ul class="simple">
-<li>Basically Available</li>
-<li>Soft state</li>
-<li>Eventually consistent</li>
+<li>Users will cross from control to test group during experiment</li>
+<li>Overlap is still bad</li>
+<li>Easiest to look at only a single test period</li>
+<li>Double-check queries to make sure there is no overlap</li>
 </ul>
 </div>
-<div class="slide" id="base-jumping">
-<h1>BASE jumping</h1>
+<div class="slide" id="parallel-tests">
+<h1>Parallel Tests</h1>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name" colspan="2">Everyone else is doing it:</th></tr>
-<tr><td>&nbsp;</td><td class="field-body"></td>
+<tr class="field"><th class="field-name">Even More Math:</th><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
 <ul class="simple">
-<li>Google</li>
-<li>Amazon</li>
-<li>eBay</li>
-<li>Yahoo!</li>
-<li>Facebook</li>
-<li>...</li>
+<li>Multivariate Testing</li>
+<li>Test multiple changes in parallel</li>
+<li>Changes can interfere, reinforce, or have no effect on each other</li>
+<li>Beyond scope of this talk</li>
 </ul>
 </div>
-<div class="slide" id="bigtable">
-<h1>BigTable</h1>
+<div class="slide" id="group-selection-pitfalls">
+<h1>Group Selection Pitfalls</h1>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name">Google:</th><td class="field-body"></td>
+<tr class="field"><th class="field-name">DANGER!:</th><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
 <ul class="simple">
-<li>Paxos (Chubby)</li>
-<li>Single-master</li>
-<li>Distributed tablets via GFS</li>
-<li>Row/Column db hybrid</li>
-<li>Compression (BMDiff, Zippy)</li>
-<li>Versioned (Row, Column, Timestamp)</li>
-<li>Bloom filters</li>
+<li>Each experiment should have independent group selection</li>
+<li>If you re-use buckets across experiments your data is invalid</li>
+<li>Important even if not running parallel tests!</li>
 </ul>
 </div>
-<div class="slide" id="bigtable-pros">
-<h1>BigTable Pros</h1>
+<div class="slide" id="confidence">
+<h1>Confidence</h1>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name">Pros:</th><td class="field-body"></td>
+<tr class="field"><th class="field-name">More Data:</th><td class="field-body"></td>
 </tr>
 </tbody>
 </table>
 <ul class="simple">
-<li>Compression = Awesome</li>
-<li>Clients are probably simple</li>
-<li>Integrates with map/reduce</li>
+<li>Confidence goes up with lots of samples</li>
+<li>Chi-squared or Student's t-test are good candidates</li>
+<li>Not my expertise, find a statistician :)</li>
 </ul>
 </div>
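As a rough illustration of the t-test mentioned on the Confidence slide, the sketch below computes a t statistic for two lists of per-user dollar totals. It uses Welch's variant, which does not assume equal variances, rather than the classic Student's form, and it stops at the statistic. Turning it into a p-value needs the t distribution from a statistics package (for example `scipy.stats.ttest_ind` with `equal_var=False`). The function name and sample numbers are made up.

```python
import math
import statistics

def welch_t(control, test):
    """Welch's t statistic for the difference in means (test - control)."""
    mean_c, mean_t = statistics.mean(control), statistics.mean(test)
    # statistics.variance is the sample variance (n - 1 denominator)
    var_c, var_t = statistics.variance(control), statistics.variance(test)
    standard_error = math.sqrt(var_c / len(control) + var_t / len(test))
    return (mean_t - mean_c) / standard_error

# Usage with made-up per-user totals (one number per user, not per log row).
control = [1, 2, 3, 4, 5]
test = [3, 4, 5, 6, 7]
t = welch_t(control, test)
```

Here both samples have variance 2.5 and five users each, so the standard error is 1 and `t` equals the difference in means, 2.0. With samples this small that is weak evidence, which fits the slide's point that confidence goes up with more samples.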
-<div class="slide" id="bigtable-cons">
-<h1>BigTable Cons</h1>
-<table class="docutils field-list" frame="void" rules="none">
-<col class="field-name" />
-<col class="field-body" />
-<tbody valign="top">
-<tr class="field"><th class="field-name">Cons:</th><td class="field-body"></td>
-</tr>
-</tbody>
-</table>
-<ul class="simple">
-<li>Proprietary to Google</li>
-<li>Single-master</li>
-</ul>
+<div class="slide" id="free-tools">
+<h1>Free Tools</h1>
+<dl class="docutils">
+<dt>Django Lean</dt>
+<dd><a class="reference external" href="http://bitbucket.org/akoha/django-lean/">http://bitbucket.org/akoha/django-lean/</a></dd>
+<dt>Google Website Optimizer</dt>
+<dd><a class="reference external" href="http://services.google.com/websiteoptimizer/">http://services.google.com/websiteoptimizer/</a></dd>
+</dl>
 </div>
-<div class="slide" id="bigtable-diagram">
-<h1>BigTable Diagram</h1>
-<table class="docutils field-list" frame="void" rules="none">
-<col class="field-name" />
-<col class="field-body" />
-<tbody valign="top">
-<tr class="field"><th class="field-name">Single-master:</th><td class="field-body"></td>
-</tr>
-</tbody>
-</table>
-<img alt="images/i-has-minions.jpg" src="images/i-has-minions.jpg" />
-</div>
-<div class="slide" id="dynamo">
-<h1>Dynamo</h1>
-<table class="docutils field-list" frame="void" rules="none">
-<col class="field-name" />
-<col class="field-body" />
-<tbody valign="top">
-<tr class="field"><th class="field-name">Amazon:</th><td class="field-body"></td>
-</tr>
-</tbody>
-</table>
-<ul class="simple">
-<li>Key/Value store</li>
-<li>Consistent hashing</li>
-<li>Vector clocks</li>
-<li>Read repair</li>
-</ul>
-</div>
-<div class="slide" id="dynamo-pros">
-<h1>Dynamo Pros</h1>
-<table class="docutils field-list" frame="void" rules="none">
-<col class="field-name" />
-<col class="field-body" />
-<tbody valign="top">
-<tr class="field"><th class="field-name">Pros:</th><td class="field-body"></td>
-</tr>
-</tbody>
-</table>
-<ul class="simple">
-<li>No master</li>
-<li>Highly available for write</li>
-<li>Knobs to make it fast to read</li>
-<li>&quot;Simple&quot; (lots of half-baked clones!)</li>
-</ul>
-</div>
-<div class="slide" id="dynamo-cons">
-<h1>Dynamo Cons</h1>
-<table class="docutils field-list" frame="void" rules="none">
-<col class="field-name" />
-<col class="field-body" />
-<tbody valign="top">
-<tr class="field"><th class="field-name">Cons:</th><td class="field-body"></td>
-</tr>
-</tbody>
-</table>
-<ul class="simple">
-<li>Proprietary to Amazon</li>
-<li>Clients need to be smart</li>
-<li>No compression</li>
-<li>Not suitable for column-like workloads</li>
-<li>Just a Key/Value store</li>
-</ul>
-</div>
-<div class="slide" id="dynamo-diagram">
-<h1>Dynamo Diagram</h1>
-<table class="docutils field-list" frame="void" rules="none">
-<col class="field-name" />
-<col class="field-body" />
-<tbody valign="top">
-<tr class="field"><th class="field-name">Smart client:</th><td class="field-body"></td>
-</tr>
-</tbody>
-</table>
-<img alt="images/we-await-ur-instrucsions.jpg" src="images/we-await-ur-instrucsions.jpg" />
-</div>
-<div class="slide" id="cassandra">
-<h1>Cassandra</h1>
-<table class="docutils field-list" frame="void" rules="none">
-<col class="field-name" />
-<col class="field-body" />
-<tbody valign="top">
-<tr class="field"><th class="field-name" colspan="2">Facebook -&gt; Apache:</th></tr>
-<tr><td>&nbsp;</td><td class="field-body"></td>
-</tr>
-</tbody>
-</table>
-<ul class="simple">
-<li>Open source!</li>
-<li>No master like Dynamo</li>
-<li>Storage model more like BigTable</li>
-</ul>
-</div>
-<div class="slide" id="cassandra-pros">
-<h1>Cassandra Pros</h1>
-<table class="docutils field-list" frame="void" rules="none">
-<col class="field-name" />
-<col class="field-body" />
-<tbody valign="top">
-<tr class="field"><th class="field-name">Pros:</th><td class="field-body"></td>
-</tr>
-</tbody>
-</table>
-<ul class="simple">
-<li>OPEN SOURCE</li>
-<li>Incrementally scalable</li>
-<li>Minimal administration</li>
-</ul>
-</div>
-<div class="slide" id="cassandra-cons">
-<h1>Cassandra Cons</h1>
-<table class="docutils field-list" frame="void" rules="none">
-<col class="field-name" />
-<col class="field-body" />
-<tbody valign="top">
-<tr class="field"><th class="field-name">Cons:</th><td class="field-body"></td>
-</tr>
-</tbody>
-</table>
-<ul class="simple">
-<li>Not polished</li>
-<li>No compression yet</li>
-</ul>
-</div>
-<div class="slide" id="cassandra-diagram">
-<h1>Cassandra Diagram</h1>
-<table class="docutils field-list" frame="void" rules="none">
-<col class="field-name" />
-<col class="field-body" />
-<tbody valign="top">
-<tr class="field"><th class="field-name">Soul Calibur:</th><td class="field-body"></td>
-</tr>
-</tbody>
-</table>
-<img alt="images/sc4_pub_ss_cassandra003_copy.jpg" src="images/sc4_pub_ss_cassandra003_copy.jpg" />
-</div>
-<div class="slide" id="distributed-musings">
-<h1>Distributed Musings</h1>
-<table class="docutils field-list" frame="void" rules="none">
-<col class="field-name" />
-<col class="field-body" />
-<tbody valign="top">
-<tr class="field"><th class="field-name">New Hotness:</th><td class="field-body"></td>
-</tr>
-</tbody>
-</table>
-<ul class="simple">
-<li>Distributed databases are the new web framework</li>
-<li>... except none of them are awesome yet</li>
-<li>I don't think we need another half-baked Dynamo clone</li>
-</ul>
-</div>
-<div class="slide" id="key-value-stores">
-<h1>Key-Value Stores</h1>
-<table class="docutils field-list" frame="void" rules="none">
-<col class="field-name" />
-<col class="field-body" />
-<tbody valign="top">
-<tr class="field"><th class="field-name" colspan="2">Simple and Fast:</th></tr>
-<tr><td>&nbsp;</td><td class="field-body"></td>
-</tr>
-</tbody>
-</table>
-<ul class="simple">
-<li>Similar to a Python dict</li>
-<li>Keys usually bytes, probably limited</li>
-<li>Values usually bytes, often have fewer limits</li>
-<li>Extremely fast, simple</li>
-</ul>
-</div>
-<div class="slide" id="memcached">
-<h1>Memcached</h1>
-<table class="docutils field-list" frame="void" rules="none">
-<col class="field-name" />
-<col class="field-body" />
-<tbody valign="top">
-<tr class="field"><th class="field-name" colspan="2">Key/Value store as cache:</th></tr>
-<tr><td>&nbsp;</td><td class="field-body"></td>
-</tr>
-</tbody>
-</table>
-<ul class="simple">
-<li>No persistence</li>
-<li>RAM only</li>
-<li>Throws data away (on purpose)</li>
-<li>Lightning fast</li>
-<li>&quot;Everyone&quot; uses it</li>
-</ul>
-</div>
-<div class="slide" id="caching-immutable-data">
-<h1>Caching Immutable Data</h1>
-<table class="docutils field-list" frame="void" rules="none">
-<col class="field-name" />
-<col class="field-body" />
-<tbody valign="top">
-<tr class="field"><th class="field-name" colspan="2">If only data never changed:</th></tr>
-<tr><td>&nbsp;</td><td class="field-body"></td>
-</tr>
-</tbody>
-</table>
-<ul class="simple">
-<li>Immutable is easy, do that</li>
-</ul>
-</div>
-<div class="slide" id="caching-mutable-data">
-<h1>Caching Mutable Data</h1>
-<table class="docutils field-list" frame="void" rules="none">
-<col class="field-name" />
-<col class="field-body" />
-<tbody valign="top">
-<tr class="field"><th class="field-name" colspan="2">Invalidation sucks:</th></tr>
-<tr><td>&nbsp;</td><td class="field-body"></td>
-</tr>
-</tbody>
-</table>
-<ul class="simple">
-<li>Mutable is hard</li>
-<li>Failed transactions?</li>
-<li>Concurrent writers?</li>
-<li>Dependent cache keys?</li>
-<li>You will get it wrong and it will be hard to debug</li>
-</ul>
-</div>
-<div class="slide" id="tokyo-cabinet-tyrant">
-<h1>Tokyo Cabinet/Tyrant</h1>
-<table class="docutils field-list" frame="void" rules="none">
-<col class="field-name" />
-<col class="field-body" />
-<tbody valign="top">
-<tr class="field"><th class="field-name" colspan="2">Not your mom's BerkeleyDB:</th></tr>
-<tr><td>&nbsp;</td><td class="field-body"></td>
-</tr>
-</tbody>
-</table>
-<ul class="simple">
-<li>Disk persistent</li>
-<li>Very performant</li>
-<li>Actively developed</li>
-<li>Similar replication strategy to MySQL</li>
-</ul>
-</div>
-<div class="slide" id="redis">
-<h1>Redis</h1>
-<table class="docutils field-list" frame="void" rules="none">
-<col class="field-name" />
-<col class="field-body" />
-<tbody valign="top">
-<tr class="field"><th class="field-name">Still very new:</th><td class="field-body"></td>
-</tr>
-</tbody>
-</table>
-<ul class="simple">
-<li>Not just a Key/Value store</li>
-<li>Matching on key spaces</li>
-<li>Values can be bytes, lists or sets</li>
-<li>Requires full store in RAM</li>
-<li>Might be a nice cache server?</li>
-</ul>
-</div>
-<div class="slide" id="document-databases">
-<h1>Document Databases</h1>
-<table class="docutils field-list" frame="void" rules="none">
-<col class="field-name" />
-<col class="field-body" />
-<tbody valign="top">
-<tr class="field"><th class="field-name">Schema-free:</th><td class="field-body"></td>
-</tr>
-</tbody>
-</table>
-<ul class="simple">
-<li>Very easy to use</li>
-<li>Document Versioning</li>
-<li>Great for storing documents</li>
-</ul>
-</div>
-<div class="slide" id="couchdb">
-<h1>CouchDB</h1>
-<table class="docutils field-list" frame="void" rules="none">
-<col class="field-name" />
-<col class="field-body" />
-<tbody valign="top">
-<tr class="field"><th class="field-name" colspan="2">Document DB Poster Child:</th></tr>
-<tr><td>&nbsp;</td><td class="field-body"></td>
-</tr>
-</tbody>
-</table>
-<ul class="simple">
-<li>Apache project</li>
-<li>Asynchronous replication</li>
-<li>JSON based</li>
-<li>Views materialized on demand (not indexes)</li>
-<li>Neat admin UI</li>
-</ul>
-</div>
-<div class="slide" id="mongodb">
-<h1>MongoDB</h1>
-<table class="docutils field-list" frame="void" rules="none">
-<col class="field-name" />
-<col class="field-body" />
-<tbody valign="top">
-<tr class="field"><th class="field-name">C++'s revenge:</th><td class="field-body"></td>
-</tr>
-</tbody>
-</table>
-<ul class="simple">
-<li>Fast</li>
-<li>JSON and BSON (binary JSON-ish)</li>
-<li>Asynchronous replication with auto-sharding &quot;soon&quot;</li>
-<li>Index support</li>
-<li>Nested documents</li>
-<li>Advanced queries</li>
-</ul>
-</div>
-<div class="slide" id="column-databases">
-<h1>Column Databases</h1>
-<table class="docutils field-list" frame="void" rules="none">
-<col class="field-name" />
-<col class="field-body" />
-<tbody valign="top">
-<tr class="field"><th class="field-name" colspan="2">Data Warehousing:</th></tr>
-<tr><td>&nbsp;</td><td class="field-body"></td>
-</tr>
-</tbody>
-</table>
-<ul class="simple">
-<li>Sequential reads are awesome</li>
-<li>Columns compress better than rows</li>
-<li>Doesn't waste I/O on uninteresting columns</li>
-</ul>
-</div>
-<div class="slide" id="monetdb">
-<h1>MonetDB</h1>
-<table class="docutils field-list" frame="void" rules="none">
-<col class="field-name" />
-<col class="field-body" />
-<tbody valign="top">
-<tr class="field"><th class="field-name" colspan="2">Research project:</th></tr>
-<tr><td>&nbsp;</td><td class="field-body"></td>
-</tr>
-</tbody>
-</table>
-<ul class="simple">
-<li>Tried really hard to get it to work</li>
-<li>Crashes a lot and corrupts your data</li>
-<li>Do not waste your time</li>
-</ul>
-</div>
-<div class="slide" id="luciddb">
-<h1>LucidDB</h1>
-<table class="docutils field-list" frame="void" rules="none">
-<col class="field-name" />
-<col class="field-body" />
-<tbody valign="top">
-<tr class="field"><th class="field-name" colspan="2">Sounds interesting:</th></tr>
-<tr><td>&nbsp;</td><td class="field-body"></td>
-</tr>
-</tbody>
-</table>
-<ul class="simple">
-<li>Java/C++ open source data warehouse</li>
-<li>No clustering</li>
-<li>No experience yet</li>
-</ul>
-</div>
-<div class="slide" id="vertica">
-<h1>Vertica</h1>
-<table class="docutils field-list" frame="void" rules="none">
-<col class="field-name" />
-<col class="field-body" />
-<tbody valign="top">
-<tr class="field"><th class="field-name">We paid for it:</th><td class="field-body"></td>
-</tr>
-</tbody>
-</table>
-<ul class="simple">
-<li>Commercial (based on C-Store)</li>
-<li>Clustered</li>
-<li>Would still prefer open source</li>
-</ul>
-</div>
-<div class="slide" id="bitmap-indexes">
-<h1>Bitmap Indexes</h1>
-<table class="docutils field-list" frame="void" rules="none">
-<col class="field-name" />
-<col class="field-body" />
-<tbody valign="top">
-<tr class="field"><th class="field-name" colspan="2">Sequential Scans can be fast:</th></tr>
-<tr><td>&nbsp;</td><td class="field-body"></td>
-</tr>
-</tbody>
-</table>
-<ul class="simple">
-<li>1-N bits per row of data</li>
-<li>Can apply logical operations across indexes</li>
-<li>Can be compressed (BBC, WAH)</li>
-<li>FastBit is a good implementation</li>
-</ul>
-</div>
-<div class="slide" id="bitmap-index-uses">
-<h1>Bitmap Index Uses</h1>
-<table class="docutils field-list" frame="void" rules="none">
-<col class="field-name" />
-<col class="field-body" />
-<tbody valign="top">
-<tr class="field"><th class="field-name">Big Queries:</th><td class="field-body"></td>
-</tr>
-</tbody>
-</table>
-<ul class="simple">
-<li>PostgreSQL 8.1+ in-memory for some queries</li>
-<li>Almost a requirement for column stores</li>
-<li>FastBit is a great implementation (WAH)</li>
-</ul>
-</div>
-<div class="slide" id="bloom-filters-are-neat">
-<h1>Bloom Filters are Neat</h1>
-<table class="docutils field-list" frame="void" rules="none">
-<col class="field-name" />
-<col class="field-body" />
-<tbody valign="top">
-<tr class="field"><th class="field-name" colspan="2">But our Princess is in another castle:</th></tr>
-<tr><td>&nbsp;</td><td class="field-body"></td>
-</tr>
-</tbody>
-</table>
-<ul class="simple">
-<li>Probabilistic data structure</li>
-<li>False positives at a known error</li>
-<li>Constant space</li>
-<li>I won't bore you with the math</li>
-</ul>
-</div>
-<div class="slide" id="bloom-filter-diagram">
-<h1>Bloom Filter Diagram</h1>
-<table class="docutils field-list" frame="void" rules="none">
-<col class="field-name" />
-<col class="field-body" />
-<tbody valign="top">
-<tr class="field"><th class="field-name" colspan="2">Actually Relevant:</th></tr>
-<tr><td>&nbsp;</td><td class="field-body"></td>
-</tr>
-</tbody>
-</table>
-<img alt="images/649px-Bloom_filter.png" src="images/649px-Bloom_filter.png" />
-</div>
-<div class="slide" id="bloom-filter-uses">
-<h1>Bloom Filter Uses</h1>
-<table class="docutils field-list" frame="void" rules="none">
-<col class="field-name" />
-<col class="field-body" />
-<tbody valign="top">
-<tr class="field"><th class="field-name" colspan="2">Find stuff, maybe:</th></tr>
-<tr><td>&nbsp;</td><td class="field-body"></td>
-</tr>
-</tbody>
-</table>
-<ul class="simple">
-<li>Approximate counting of a large set (e.g. unique IPs from logs)</li>
-<li>Knowing that data is definitely NOT stored somewhere, e.g. remote cache</li>
-<li>Several variants (Counting Bloom Filter, Scalable Bloom Filter, ...)</li>
-</ul>
+<div class="slide" id="more-info">
+<h1>More Info</h1>
+<dl class="docutils">
+<dt>Microsoft's Experimentation Platform</dt>
+<dd><a class="reference external" href="http://exp-platform.com/">http://exp-platform.com/</a></dd>
+<dt>Effective A/B Testing</dt>
+<dd><a class="reference external" href="http://elem.com/~btilly/effective-ab-testing/">http://elem.com/~btilly/effective-ab-testing/</a></dd>
+<dt>Startup Lessons Learned</dt>
+<dd><a class="reference external" href="http://www.startuplessonslearned.com/search/label/split-test">http://www.startuplessonslearned.com/search/label/split-test</a></dd>
+<dt>Andrew Chen's Blog</dt>
+<dd><a class="reference external" href="http://andrewchenblog.com/">http://andrewchenblog.com/</a></dd>
+</dl>
 </div>
 <div class="slide" id="questions">
 <h1>Questions?</h1>
 <col class="field-name" />
 <col class="field-body" />
 <tbody valign="top">
-<tr class="field"><th class="field-name">Open Space:</th><td class="field-body"></td>
+<tr class="field"><th class="field-name">Twitter:</th><td class="field-body">&#64;etrepum</td>
+</tr>
+<tr class="field"><th class="field-name">Blog:</th><td class="field-body"><a class="reference external" href="http://bob.pythonmac.org/">http://bob.pythonmac.org/</a></td>
+</tr>
+<tr class="field"><th class="field-name">Mochi Media:</th><td class="field-body"><a class="reference external" href="http://www.mochimedia.com/">http://www.mochimedia.com/</a></td>
 </tr>
 </tbody>
 </table>
+<p>Mochi is hiring:</p>
 <ul class="simple">
-<li>Open Space TODAY &#64; 5pm, Lambert. See Jonathan Ellis</li>
+<li><a class="reference external" href="http://www.mochimedia.com/jobs.html">http://www.mochimedia.com/jobs.html</a></li>
 </ul>
 </div>
 </div>

Binary file removed.

 .. raw:: html
     :file: includes/logo.html
 
-================================
- Drop ACID and think about data
-================================
+=====================================
+ Analysis: The Other Kind of Testing
+=====================================
 
 :Author:
-    Bob Ippolito
+    Bob Ippolito (@etrepum)
 :Date:
-    March 2009
+    February 2010
 :Venue:
-    PyCon 2009
+    PyCon 2010
 
 Bob's Perspective
 =================
 
-:Startup with lots of data:
+:Mochi Media:
 
-* Cofounded Mochi Media in 2005
-* MochiBot analytics platform (for Flash)
-* MochiAds ad serving platform (for Flash games)
-* Other cool services for game developers
+* CTO/Cofounder of Mochi Media
+* Platform for Flash games
+* Virtual currency and game discovery for gamers
 
-Mochi Ad Sales
+Analysis
+========
+
+:Definition:
+
+* Separation of a whole into its component parts
+* An examination of a complex, its elements, and their relations
+
+Why do I care?
 ==============
 
-:Hard Sell:
+:SCIENCE:
 
-.. image:: images/mochi_ad_sales.jpg
+* Be creative when exploring new ideas
+* Not when measuring them
 
-What's ACID?
+Who does this?
+==============
+
+:A-Z of online:
+
+* Amazon
+* Facebook
+* Google
+* Microsoft
+* Zynga
+
+How it works
 ============
 
-:A promise ring your DBMS wears:
+:Scientific Method:
 
-:Atomicity:
-    all or nothing
-:Consistency:
-    no explosions
-:Isolation:
-    no fights
-:Durability:
-    no lying
+* Create a hypothesis
+* Design an experiment
+* Run experiment
+* Analyze the results
 
-ACID Trips
+Hypothesis
 ==========
 
-:Scalability and reliability:
+:Your Great Idea:
 
-* Downtime is unacceptable
-* Reliable is >= 2 nodes
-* Scalable is ... more
-* Networks make it hard
-* Networks make it hard
-* Networks make it hard
+* Divergent from what you have
+* Simple
+* Testable
 
-What can I have?
+Testable
+========
+
+:Fitness Metric:
+
+* Quantifiable
+* Business relevant
+* Key Performance Indicator
+* Overall Evaluation Criteria
+
+Fitness Metric Examples
+=======================
+
+:Your Metric May Vary:
+
+* Click-through rate (CTR)
+* Cost Per Acquisition (CPA)
+* Average revenue per user (ARPU)
+* % of gamers who beat the last level
+
+Experiment
+==========
+
+:"To Try Out":
+
+* Method of investigation
+* Many ways to do this
+* Easiest to follow a template
+
+Split Testing
+=============
+
+:A/B Testing:
+
+* 50% control group gets no changes
+* 50% test group sees changes
+* Group selection method is important
+
+Group Selection
+===============
+
+:random.choice:
+
+* Group should be randomly selected
+* Pin user to group during experiment
+* Easiest with logins, but can be done without (cookies, etc.)
+
+Selection Pseudocode
+====================
+
+::
+
+    from random import choice
+
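+    def view_page(user):
+        # Serve the variant for the bucket this user is pinned to
+        if get_bucket(user) == 'test':
+            test_group_page()
+        else:
+            control_group_page()
+
+    def get_bucket(user):
+        # Assign once at random, then persist so the user never switches
+        if user.bucket is None:
+            user.bucket = choice(('test', 'control'))
+            user.save()
+        return user.bucket
+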
+Logging
+=======
+
+:Log Everything:
+
+* Log everything you can
+* Recording the bucket is very important
+* Flat files are easy
+* JSON is convenient (can be newline delimited)
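A minimal sketch of the newline-delimited JSON logging this slide describes, assuming a single writer. The `log_event` helper is hypothetical, and its field names are borrowed from the `COLS` tuple used later in the deck. It is written for Python 3 and sidesteps the multi-process problems raised on the next slide.

```python
import io
import json
import time

def log_event(fp, user_id, bucket, dollars, timestamp=None):
    """Append one event to fp as a single JSON line (hypothetical helper)."""
    record = {
        'timestamp': int(time.time()) if timestamp is None else timestamp,
        'user_id': user_id,
        'bucket': bucket,      # always record which group the user was in
        'dollars': dollars,
    }
    # One record per line keeps the file trivially splittable later.
    fp.write(json.dumps(record, sort_keys=True) + '\n')

# An in-memory buffer stands in for a real log file here.
buf = io.StringIO()
log_event(buf, 42, 'test', 3, timestamp=1262322000)
log_event(buf, 7, 'control', 0, timestamp=1262322060)
log_lines = buf.getvalue().splitlines()
```

Each element of `log_lines` can be parsed independently with `json.loads`, which is the property the ETL step relies on.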
+
+Logging Pseudocode
+==================
+
+:Just Kidding:
+
+* It's not actually easy
+* Multiple threads/processes/machines -> :(
+* Options include database, syslog or message queue
+* Gets harder at scale (e.g. Facebook's Scribe)
+
+Log Processing
+==============
+
+:ETL:
+
+* Extract the log data
+* Transform it into a useful format
+* Load the data for analysis
+* Typically done in batch, e.g. daily or hourly
+
+Log Processing Example
+======================
+
+::
+
+    import json
+    import sqlite3
+
+    db = sqlite3.connect('testdata.db')
+    cur = db.cursor()
+    COLS = 'timestamp', 'user_id', 'bucket', 'dollars'
+    cur.execute('CREATE TABLE testdata(' +
+                ','.join(COLS) + ')')
+
+    # log_lines: an iterable of newline-delimited JSON records
+    for line in log_lines:
+        dct = json.loads(line)
+        row = [dct[col] for col in COLS]
+        cur.execute(
+            'INSERT INTO testdata VALUES(?,?,?,?)',
+            row)
+    db.commit()
+
+Analysis
+========
+
+:Math time:
+
+* Plot data over time to see difference visually
+* Make sure to compare sample size
+* Check your work
+
+Overall Query
+=============
+
+:Overall Performance:
+
+::
+
+    SELECT
+        bucket,
+        COUNT(DISTINCT user_id) AS sample_size,
+        SUM(dollars)*1.0/COUNT(DISTINCT user_id) AS arpu
+    FROM testdata
+    GROUP BY bucket;
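To make the ARPU arithmetic concrete, here is a small sketch that runs the same query against an in-memory SQLite database. The five rows are made-up illustration data, not results from the talk, and the `ORDER BY bucket` clause is an addition so the output order is predictable.

```python
import sqlite3

db = sqlite3.connect(':memory:')
cur = db.cursor()
cur.execute('CREATE TABLE testdata(timestamp, user_id, bucket, dollars)')

# Hypothetical rows: user 1 appears twice, so DISTINCT matters.
rows = [
    (1262322000, 1, 'control', 2),
    (1262322100, 1, 'control', 4),
    (1262322200, 2, 'control', 0),
    (1262322300, 10001, 'test', 10),
    (1262322400, 10002, 'test', 5),
]
cur.executemany('INSERT INTO testdata VALUES(?,?,?,?)', rows)
db.commit()

QUERY = """
SELECT bucket,
       COUNT(DISTINCT user_id) AS sample_size,
       SUM(dollars)*1.0/COUNT(DISTINCT user_id) AS arpu
FROM testdata
GROUP BY bucket
ORDER BY bucket
"""
results = cur.execute(QUERY).fetchall()
for bucket, sample_size, arpu in results:
    print(bucket, sample_size, arpu)
```

The control group has two distinct users and 6 dollars in total, so its ARPU is 3.0. Counting rows instead of distinct users would have divided by three and understated it.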
+
+Daily Query
+===========
+
+:Plot This:
+
+::
+
+    SELECT
+        date(timestamp, 'unixepoch') AS day,
+        bucket,
+        COUNT(DISTINCT user_id) AS sample_size,
+        SUM(dollars)*1.0/COUNT(DISTINCT user_id) AS arpu
+    FROM testdata
+    GROUP BY day, bucket ORDER BY day, bucket;
+
+Sanity Check Query
+==================
+
+:No Results is Good:
+
+::
+
+    SELECT user_id
+    FROM testdata
+    GROUP BY user_id HAVING COUNT(DISTINCT bucket)>1;
+
+Simpson's Paradox
+=================
+
+:When Good is Bad:
+
+* Hidden variables can confuse results
+* Sample size is important
+* Random selection and a 50/50 split help avoid this issue
+
+Paradox Example
+===============
+
+:Grad School Acceptance:
+
++---------+-----------------+----------------+
+|         | Men             | Women          |
++---------+-----------------+----------------+
+| Arts    | 3/4 (75%)       | 1/1 **(100%)** |
++---------+-----------------+----------------+
+| Science | 0/1 (0%)        | 1/4 **(25%)**  |
++---------+-----------------+----------------+
+| Totals  | 3/5 **(60%)**   | 2/5 (40%)      |
++---------+-----------------+----------------+
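The reversal in the table can be checked mechanically. This sketch recomputes the acceptance rates from the counts shown above; the `rate` helper and the dictionary layout are illustrative choices, not part of the talk.

```python
from fractions import Fraction

# (accepted, applied) counts copied from the table above.
accepted = {
    ('Arts', 'Men'): (3, 4),
    ('Arts', 'Women'): (1, 1),
    ('Science', 'Men'): (0, 1),
    ('Science', 'Women'): (1, 4),
}

def rate(group, dept=None):
    """Acceptance rate for a group, within one department or overall."""
    pairs = [counts for (d, g), counts in accepted.items()
             if g == group and (dept is None or d == dept)]
    return Fraction(sum(a for a, _ in pairs), sum(n for _, n in pairs))

for dept in ('Arts', 'Science'):
    print(dept, rate('Men', dept), rate('Women', dept))
print('Totals', rate('Men'), rate('Women'))
```

Women have the higher rate in each department, yet men have the higher rate overall, because most men applied to Arts and most women applied to Science. Department is the hidden variable.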
+
+Ramp-up
+=======
+
+:Start Slow:
+
+* New features are scary, might be broken
+* Run a staged experiment, e.g. 1% -> 5% -> 20% -> 50%
+* If it is obviously broken, abort!
+
+Ramp-up Implementation
+======================
+
+:Buckets For Your Buckets:
+
+* Segment users randomly into range(100) buckets
+* During 1% period include range(1) in test group
+* During 5% period include range(5) in test group
+* ...
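One way the range(100) scheme could look in code, as a sketch only: each user is pinned to a random slot once, and the test group is whichever slots fall below the current ramp-up percentage. The `_slots` dictionary stands in for persistent per-user storage, and both function names are hypothetical.

```python
import random

_rng = random.Random(0)   # seeded only so this sketch is repeatable
_slots = {}               # stand-in for persistent per-user storage

def get_slot(user_id):
    """Pin the user to one of 100 slots the first time they are seen."""
    if user_id not in _slots:
        _slots[user_id] = _rng.randrange(100)
    return _slots[user_id]

def get_bucket(user_id, test_percent):
    """Slots in range(test_percent) form the test group."""
    return 'test' if get_slot(user_id) < test_percent else 'control'

users = range(1000)
at_5 = {u for u in users if get_bucket(u, 5) == 'test'}
at_20 = {u for u in users if get_bucket(u, 20) == 'test'}
```

Because a slot never changes, anyone in the test group at 5% is still in it at 20%. Users only ever move from control to test as the percentage grows, which is the crossover the Ramp-up Analysis slide warns about.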
+
+Ramp-up Analysis
 ================
 
-:CAP theorem says pick two:
+:More Math:
 
-* Consistency
-* Availability
-* Partition tolerance
+* Users will cross from control to test group during experiment
+* Overlap is still bad
+* Easiest to only look at single test period
+* Double-check queries to make sure there is no overlap
 
-Turn up the BASE
-================
-
-:Write smarter applications:
-
-* Basically Available
-* Soft state
-* Eventually consistent
-
-BASE jumping
-============
-
-:Everyone else is doing it:
-
-* Google
-* Amazon
-* eBay
-* Yahoo!
-* Facebook
-* ...
-
-BigTable
-========
-
-:Google:
-
-* Paxos (Chubby)
-* Single-master
-* Distributed tablets via GFS
-* Row/Column db hybrid
-* Compression (BMDiff, Zippy)
-* Versioned (Row, Column, Timestamp)
-* Bloom filters
-
-BigTable Pros
-=============
-
-:Pros:
-
-* Compression = Awesome
-* Clients are probably simple
-* Integrates with map/reduce
-
-BigTable Cons
-=============
-
-:Cons:
-
-* Proprietary to Google
-* Single-master
-
-BigTable Diagram
-================
-
-:Single-master:
-
-.. image:: images/i-has-minions.jpg
-
-Dynamo
-======
-
-:Amazon:
-
-* Key/Value store
-* Consistent hashing
-* Vector clocks
-* Read repair
-
-Dynamo Pros
-===========
-
-:Pros:
-
-* No master
-* Highly available for write
-* Knobs to make it fast to read
-* "Simple" (lots of half-baked clones!)
-
-Dynamo Cons
-===========
-
-:Cons:
-
-* Proprietary to Amazon
-* Clients need to be smart
-* No compression
-* Not suitable for column-like workloads
-* Just a Key/Value store
-
-Dynamo Diagram
+Parallel Tests
 ==============
 
-:Smart client:
+:Even More Math:
 
-.. image:: images/we-await-ur-instrucsions.jpg
+* Multivariate Testing
+* Test multiple changes in parallel
+* Changes can interfere, reinforce, or have no effect on each other
+* Beyond scope of this talk
 
-Cassandra
+Group Selection Pitfalls
+========================
+
+:DANGER!:
+
+* Each experiment should have independent group selection
+* If you re-use buckets across experiments, your data is invalid
+* Important even if not running parallel tests!
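A sketch of one way to keep group selection independent per experiment. Note that this swaps in a different technique from the `random.choice` pseudocode earlier in the deck: the slot is derived by hashing the experiment name together with the user id, so nothing needs to be stored. The function names and the experiment names are hypothetical.

```python
import hashlib

def get_slot(experiment, user_id):
    """Map (experiment, user) to one of 100 slots, stable across calls."""
    key = ('%s:%s' % (experiment, user_id)).encode('utf-8')
    digest = hashlib.sha256(key).hexdigest()
    return int(digest, 16) % 100

def get_bucket(experiment, user_id, test_percent=50):
    return 'test' if get_slot(experiment, user_id) < test_percent else 'control'

# The same user generally lands in unrelated slots for different experiments.
slots_a = [get_slot('new-checkout', u) for u in range(1000)]
slots_b = [get_slot('blue-button', u) for u in range(1000)]
```

Including the experiment name in the hash input is what prevents one experiment's test group from being the same set of users as another's. Hashing the user id alone would recreate the re-used buckets problem.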
+
+Confidence
+==========
+
+:More Data:
+
+* Confidence goes up with lots of samples
+* Chi-squared or Student's t-test are good candidates
+* Not my expertise, find a statistician :)
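For a yes/no metric such as click-through, a chi-squared test on a 2x2 table can be sketched in a few lines. This assumes Python 3, applies no continuity correction, and requires every expected cell count to be non-zero. The counts in the usage lines are invented for illustration. It is a starting point for a conversation with a statistician, not a replacement for one.

```python
import math

def chi_squared_2x2(conv_a, total_a, conv_b, total_b):
    """Pearson chi-squared statistic and p-value (1 degree of freedom)."""
    observed = [
        [conv_a, total_a - conv_a],
        [conv_b, total_b - conv_b],
    ]
    grand = total_a + total_b
    row_totals = [total_a, total_b]
    col_totals = [conv_a + conv_b, grand - conv_a - conv_b]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom the survival function reduces to erfc.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts: 100/1000 conversions in control, 130/1000 in test.
chi2, p_value = chi_squared_2x2(100, 1000, 130, 1000)
print(chi2, p_value)
```

A smaller p-value means the difference between buckets is less likely to be chance alone. With the same conversion rates but a tenth of the users, the statistic drops by a factor of ten, which is the "more data" point of the slide.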
+
+Free Tools
+==========
+
+Django Lean
+    http://bitbucket.org/akoha/django-lean/
+Google Website Optimizer
+    http://services.google.com/websiteoptimizer/
+
+More Info
 =========
 
-:Facebook -> Apache:
-
-* Open source!
-* No master like Dynamo
-* Storage model more like BigTable
-
-Cassandra Pros
-==============
-
-:Pros:
-
-* OPEN SOURCE
-* Incrementally scalable
-* Minimal administration
-
-Cassandra Cons
-==============
-
-:Cons:
-
-* Not polished
-* No compression yet
-
-Cassandra Diagram
-=================
-
-:Soul Calibur:
-
-.. image:: images/sc4_pub_ss_cassandra003_copy.jpg
-
-Distributed Musings
-===================
-
-:New Hotness:
-
-* Distributed databases are the new web framework
-* ... except none of them are awesome yet
-* I don't think we need another half-baked Dynamo clone
-
-Key-Value Stores
-================
-
-:Simple and Fast:
-
-* Similar to a Python dict
-* Keys usually bytes, probably limited
-* Values usually bytes, often have fewer limits
-* Extremely fast, simple
-
-Memcached
-=========
-
-:Key/Value store as cache:
-
-* No persistence
-* RAM only
-* Throws data away (on purpose)
-* Lightning fast
-* "Everyone" uses it
-
-Caching Immutable Data
-======================
-
-:If only data never changed:
-
-* Immutable is easy, do that
-
-Caching Mutable Data
-====================
-
-:Invalidation sucks:
-
-* Mutable is hard
-* Failed transactions?
-* Concurrent writers?
-* Dependent cache keys?
-* You will get it wrong and it will be hard to debug
-
-Tokyo Cabinet/Tyrant
-====================
-
-:Not your mom's BerkeleyDB:
-
-* Disk persistent
-* Very performant
-* Actively developed
-* Similar replication strategy to MySQL
-
-Redis
-=====
-
-:Still very new:
-
-* Not just a Key/Value store
-* Matching on key spaces
-* Values can be bytes, lists or sets
-* Requires full store in RAM
-* Might be a nice cache server?
-
-Document Databases
-==================
-
-:Schema-free:
-
-* Very easy to use
-* Document Versioning
-* Great for storing documents
-
-CouchDB
-=======
-
-:Document DB Poster Child:
-
-* Apache project
-* Asynchronous replication
-* JSON based
-* Views materialized on demand (not indexes)
-* Neat admin UI
-
-MongoDB
-=======
-
-:C++'s revenge:
-
-* Fast
-* JSON and BSON (binary JSON-ish)
-* Asynchronous replication with auto-sharding "soon"
-* Index support
-* Nested documents
-* Advanced queries
-
-Column Databases
-================
-
-:Data Warehousing:
-
-* Sequential reads are awesome
-* Columns compress better than rows
-* Doesn't waste I/O on uninteresting columns
-
-MonetDB
-=======
-
-:Research project:
-
-* Tried really hard to get it to work
-* Crashes a lot and corrupts your data
-* Do not waste your time
-
-LucidDB
-=======
-
-:Sounds interesting:
-
-* Java/C++ open source data warehouse
-* No clustering
-* No experience yet
-
-Vertica
-=======
-
-:We paid for it:
-
-* Commercial (based on C-Store)
-* Clustered
-* Would still prefer open source
-
-Bitmap Indexes
-==============
-
-:Sequential Scans can be fast:
-
-* 1-N bits per row of data
-* Can apply logical operations across indexes
-* Can be compressed (BBC, WAH)
-* FastBit is a good implementation
-
-Bitmap Index Uses
-=================
-
-:Big Queries:
-
-* PostgreSQL 8.1+ in-memory for some queries
-* Almost a requirement for column stores
-* FastBit is a great implementation (WAH)
-
-Bloom Filters are Neat
-======================
-
-:But our Princess is in another castle:
-
-* Probabilistic data structure
-* False positives at a known error
-* Constant space
-* I won't bore you with the math
-
-Bloom Filter Diagram
-====================
-
-:Actually Relevant:
-
-.. image:: images/649px-Bloom_filter.png 
-
-Bloom Filter Uses
-=================
-
-:Find stuff, maybe:
-
-* Approximate counting of a large set (e.g. unique IPs from logs)
-* Knowing that data is definitely NOT stored somewhere, e.g. remote cache
-* Several variants (Counting Bloom Filter, Scalable Bloom Filter, ...)
+Microsoft's Experimentation Platform
+    http://exp-platform.com/
+Effective A/B Testing
+    http://elem.com/~btilly/effective-ab-testing/
+Startup Lessons Learned
+    http://www.startuplessonslearned.com/search/label/split-test
+Andrew Chen's Blog
+    http://andrewchenblog.com/
 
 Questions?
 ==========
 
-:Open Space:
+:Twitter:
+  @etrepum
+:Blog:
+  http://bob.pythonmac.org/
+:Mochi Media:
+  http://www.mochimedia.com/
+  
+Mochi is hiring:
 
-* Open Space TODAY @ 5pm, Lambert. See Jonathan Ellis
+* http://www.mochimedia.com/jobs.html