# HG changeset patch
# User Bryan O'Sullivan
# Date 1378934332 25200
# Node ID d4a608f362989efefe720a8fab1f57e1a8524535
# Parent 143713886d046cd9ce9c6dcd5adf02a655f8a60c
Correctly handle samples with heavily clustered values (gh-11)

If a sample contains many values that are clustered around a single
value, the interquartile range can collapse to zero, so the low and
high thresholds coincide. This throws off classifyOutliers.

Before this change, we could easily end up considering a value as
both high and low, thereby counting it in more than one bucket at
a time (which should not happen). As a result, we would sometimes
report more outliers in a data set than it had sample values.

With this change, every outlier should be classified into a single
bucket. Our estimate-based weighted average method can still lead
to the wrong bucket being chosen, but at least there should be only
one bucket now!
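
The double counting can be reproduced with a minimal standalone
sketch. It is not the criterion code: the Outliers record is a
simplified stand-in, the quartiles are passed in directly instead of
being estimated, the "fixed" flag is a hypothetical switch between
the old and new guards, and hiM/hiS are assumed to be the usual
Tukey fences (q3 + 1.5 * iqr and q3 + 3 * iqr), which this hunk does
not show.

```haskell
import Data.List (foldl')

-- Simplified stand-in for the Outliers record (assumption: only the
-- fields visible in the diff are modelled).
data Outliers = Outliers
  { samplesSeen :: Int
  , lowSevere   :: Int
  , lowMild     :: Int
  , highMild    :: Int
  , highSevere  :: Int
  } deriving (Eq, Show)

-- Counts are combined field by field.
instance Semigroup Outliers where
  Outliers a b c d e <> Outliers a' b' c' d' e' =
    Outliers (a + a') (b + b') (c + c') (d + d') (e + e')

instance Monoid Outliers where
  mempty = Outliers 0 0 0 0 0

-- Classify a sample given explicit quartiles q1 and q3.
-- fixed = False uses the old guards, fixed = True the new ones.
classify :: Bool -> Double -> Double -> [Double] -> Outliers
classify fixed q1 q3 = foldl' (\acc e -> acc <> outlier e) mempty
  where
    outlier e = Outliers
      { samplesSeen = 1
      , lowSevere   = count (e <= loS && (not fixed || e < hiM))
      , lowMild     = count (e > loS && e <= loM)
      , highMild    = count (e >= hiM && e < hiS)
      , highSevere  = count (e >= hiS && (not fixed || e > loM))
      }
    count b = if b then 1 else 0
    iqr = q3 - q1
    loS = q1 - iqr * 3
    loM = q1 - iqr * 1.5
    hiM = q3 + iqr * 1.5  -- assumed, not shown in this hunk
    hiS = q3 + iqr * 3    -- assumed, not shown in this hunk

-- Sum of all four outlier buckets.
totalOutliers :: Outliers -> Int
totalOutliers o = lowSevere o + lowMild o + highMild o + highSevere o

main :: IO ()
main = do
  -- Five identical values: q1 == q3, so iqr == 0 and all four
  -- thresholds equal 1.0.
  let xs = replicate 5 1.0
  print (totalOutliers (classify False 1 1 xs))  -- old guards: prints 10
  print (totalOutliers (classify True 1 1 xs))   -- new guards: prints 0
```

With the old guards each of the five values lands in both lowSevere
and highSevere, giving ten outliers for five samples. With the extra
comparisons neither severe bucket matches when the thresholds
coincide.
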
diff --git a/Criterion/Analysis.hs b/Criterion/Analysis.hs
--- a/Criterion/Analysis.hs
+++ b/Criterion/Analysis.hs
@@ -46,10 +46,10 @@
classifyOutliers sa = U.foldl' ((. outlier) . mappend) mempty ssa
where outlier e = Outliers {
samplesSeen = 1
- , lowSevere = if e <= loS then 1 else 0
+ , lowSevere = if e <= loS && e < hiM then 1 else 0
, lowMild = if e > loS && e <= loM then 1 else 0
, highMild = if e >= hiM && e < hiS then 1 else 0
- , highSevere = if e >= hiS then 1 else 0
+ , highSevere = if e >= hiS && e > loM then 1 else 0
}
loS = q1 - (iqr * 3)
loM = q1 - (iqr * 1.5)