Bryan O'Sullivan  committed c4a665b

More docs

  • Participants
  • Parent commits 03663d9

Comments (0)

Files changed (1)

File www/

 linear regression generated from this data.  Ideally, all measurements
 will be on (or very near) this line.
 ## Understanding the data under a chart
 Underneath the chart for each benchmark is a small table of
   the accuracy of the linear model.  A value below 0.9 is outright
-* "**Mean execution time**" and "**standard deviation**" are
+* "**Mean execution time**" and "**Standard deviation**" are
   statistics calculated (more or less) from execution time divided by
   number of iterations.
 of good quality, the lower and upper bounds will be close to its
-# How to write a benchmark
+# Reading command line output
+Before you look at HTML reports, you'll probably start by inspecting
+the report that criterion prints in your terminal window.
+benchmarking ByteString/HashMap/random
+time                 4.046 ms   (4.020 ms .. 4.072 ms)
+                     1.000 R²   (1.000 R² .. 1.000 R²)
+mean                 4.017 ms   (4.010 ms .. 4.027 ms)
+std dev              27.12 μs   (20.45 μs .. 38.17 μs)
+The first column is a name; the second is an estimate. The third and
+fourth, in parentheses, are the 95% lower and upper bounds on the
+* `time` corresponds to the "OLS regression" field in the HTML table
+  above.
+* `R²` is the goodness-of-fit metric for `time`.
+* `mean` and `std dev` have the same meanings as "Mean execution time"
+  and "Standard deviation" in the HTML table.
+# How to write a benchmark suite
 A criterion benchmark suite consists of a series of
 We use
 to specify that after we run the `IO` action, its result must be
-evaluated to **normal form**, i.e. so that all of its internal
-constructors are fully evaluated, and it contains no thunks.
+evaluated to <a name="normal-form">**normal form**</a>, i.e. so that
+all of its internal constructors are fully evaluated, and it contains
+no thunks.
 ~~~~ {.haskell}
 nfIO :: NFData a => IO a -> IO ()
 In addition to `nfIO`, criterion provides a
-function that evaluates the result of an action only to the point that
-the outermost constructor is known (using `seq`).
+function that evaluates the result of an action only deep enough for
+the outermost constructor to be known (using `seq`).  This is known as
+<a name="weak-head-normal-form">**weak head normal form** (WHNF)</a>.
 ~~~~ {.haskell}
 whnfIO :: IO a -> IO ()
 40,000 times faster than they "should".
 As always, if you see numbers that look wildly out of whack, you
-shouldn't rejoice that you have magic fast performance---be skeptical
-and investigate!
+shouldn't rejoice that you have magically achieved fast
+performance---be skeptical and investigate!
 <div class="bs-callout bs-callout-info">
 #### Defeating let-floating
 every GHC-and-criterion command line, as let-floating helps with
 performance more often than it hurts with benchmarking.
+# Benchmarking pure functions
+Lazy evaluation makes it tricky to benchmark pure code. If we tried to
+saturate a function with all of its arguments and evaluate it
+repeatedly, laziness would ensure that we'd only do "real work" the
+first time through our benchmarking loop.  The expression would be
+overwritten with that result, and no further work would happen on
+subsequent loops through our benchmarking harness.
+We can defeat laziness by benchmarking an *unsaturated* function---one
+that has been given *all but one* of its arguments.
+This is why the
+function accepts two arguments: the first is the almost-saturated
+function we want to benchmark, and the second is the final argument to
+give it.
+~~~~ {.haskell}
+nf :: NFData b => (a -> b) -> a -> Benchmarkable
+As the
+constraint suggests, `nf` applies the argument to the function, then
+evaluates the result to <a href="#normal-form">normal form</a>.
+function evaluates the result of a function only to <a
+href="#weak-head-normal-form">weak head normal form</a> (WHNF).
+~~~~ {.haskell}
+whnf :: (a -> b) -> a -> Benchmarkable
+If we go back to our first example, we can now fully understand what's
+going on.
+~~~~ {.haskell}
+main = defaultMain [
+  bgroup "fib" [ bench "1"  $ whnf fib 1
+               , bench "5"  $ whnf fib 5
+               , bench "9"  $ whnf fib 9
+               , bench "11" $ whnf fib 11
+               ]
+  ]
+We can get away with using `whnf` here because we know that an
+`Integer` has only one constructor, so there's no deeper buried
+structure that we'd have to reach using `nf`.
+As with benchmarking `IO` actions, there's no clear-cut case for when
+to use `whfn` versus `nf`, especially when a result may be lazily
+Guidelines for thinking about when to use `nf` or `whnf`:
+* If a result is a lazy structure (or a mix of strict and lazy, such
+  as a balanced tree with lazy leaves), how much of it would a
+  real-world caller use?  You should be trying to evaluate as much of
+  the result as a realistic consumer would.  Blindly using `nf` could
+  cause way too much unnecessary computation.
+* If a result is something simple like an `Int`, you're probably safe
+  using `whnf`---but then again, there should be no additional cost to
+  using `nf` in these cases.
+# Tips, tricks, and pitfalls
+While criterion tries hard to automate as much of the benchmarking
+process as possible, there are some things you will want to pay
+attention to.
+* Measurements are only as good as the environment in which they're
+  gathered.  Try to make sure your computer is quiet when measuring
+  data.
+* Be judicious in when you choose `nf` and `whnf`.  Always think about
+  what the result of a function is, and how much of it you want to
+  evaluate.
+* Simply rerunning a benchmark can lead to variations of a few percent
+  in numbers.  This variation can have many causes, including address
+  space layout randomization, recompilation between runs, cache
+  effects, CPU thermal throttling, and the phase of the moon.  Don't
+  treat your first measurement as golden!
+* Keep an eye out for completely bogus numbers, as in the case of
+  `-fno-full-laziness` above.
+## How to sniff out bogus results
+If some external factors are making your measurements noisy, criterion
+tries to make it easy to tell.  At the level of raw data, noisy
+measurements will show up as "outliers", but you shouldn't need to
+inspect the raw data directly.
+The easiest yellow flag to spot is the R² goodness-of-fit measure
+dropping below 0.9.  If this happens, scrutinise your data carefully.
+Another easy pattern to look for is severe outliers in the raw
+measurement chart when you're using `--output`.  These should be easy
+to spot: they'll be points sitting far from the linear regression line
+(usually above it).
+If the lower and upper bounds on an estimate aren't "tight" (close to
+the estimate), this suggests that noise might be having some kind of
+negative effect.
+A warning about "variance introduced by outliers" may be printed.
+This indicates the degree to which the standard deviation is inflated
+by outlying measurements, as in the following snippet (notice that the
+lower and upper bounds aren't all that tight, too).
+std dev              652.0 ps   (507.7 ps .. 942.1 ps)
+variance introduced by outliers: 91% (severely inflated)