Bryan O'Sullivan committed d4a608f

Correctly handle samples with heavily clustered values (gh-11)

If many of a sample's values are clustered around a single value,
classifyOutliers produces inconsistent counts.

Before this change, we could easily end up considering a value as
both high and low, thereby counting it in more than one bucket at
a time (which should not happen). As a result, we would sometimes
report more outliers in a data set than sample values.

With this change, every outlier should be classified into a single
bucket. Our estimate-based weighted average method can still lead
to the wrong bucket being chosen, but at least there should be only
one bucket now!
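
The double counting is easiest to see when the interquartile range is
zero, so that all four fences collapse onto the same value. The
following is a minimal sketch, not the criterion source: Buckets,
Fences, fences, flag, classifyOld, classifyNew and bucketCount are
hypothetical names, and the high fences hiM and hiS are assumed to
mirror the low fences in the diff (q3 + iqr * 1.5 and q3 + iqr * 3).

```haskell
-- Hypothetical, simplified stand-in for the Outliers record.
data Buckets = Buckets
  { lowSevere, lowMild, highMild, highSevere :: Int }
  deriving (Eq, Show)

-- The four fences used for classification.
data Fences = Fences { loS, loM, hiM, hiS :: Double }

-- Build fences from the first and third quartiles.
-- loS and loM follow the diff; hiM and hiS are ASSUMED to mirror them.
fences :: Double -> Double -> Fences
fences q1 q3 = Fences
  { loS = q1 - iqr * 3
  , loM = q1 - iqr * 1.5
  , hiM = q3 + iqr * 1.5
  , hiS = q3 + iqr * 3
  }
  where iqr = q3 - q1

flag :: Bool -> Int
flag b = if b then 1 else 0

-- Conditions as they were before the change.
classifyOld :: Fences -> Double -> Buckets
classifyOld f e = Buckets
  { lowSevere  = flag (e <= loS f)
  , lowMild    = flag (e > loS f && e <= loM f)
  , highMild   = flag (e >= hiM f && e < hiS f)
  , highSevere = flag (e >= hiS f)
  }

-- Conditions after the change: the severe buckets get an extra guard.
classifyNew :: Fences -> Double -> Buckets
classifyNew f e = Buckets
  { lowSevere  = flag (e <= loS f && e < hiM f)
  , lowMild    = flag (e > loS f && e <= loM f)
  , highMild   = flag (e >= hiM f && e < hiS f)
  , highSevere = flag (e >= hiS f && e > loM f)
  }

-- Number of buckets a single value was counted in.
bucketCount :: Buckets -> Int
bucketCount b = lowSevere b + lowMild b + highMild b + highSevere b

main :: IO ()
main = do
  let f = fences 5 5  -- fully clustered sample: q1 == q3, so iqr == 0
  print (bucketCount (classifyOld f 5))  -- prints 2
  print (bucketCount (classifyNew f 5))  -- prints 0
```

Under these assumptions, the old conditions count the clustered value
5 as both lowSevere and highSevere, while the new conditions place it
in no bucket at all. A value well outside the cluster, such as 100,
lands in exactly one bucket (highSevere) under both versions.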

  • Parent commits 1437138

Files changed (1)

File Criterion/Analysis.hs

 classifyOutliers sa = U.foldl' ((. outlier) . mappend) mempty ssa
     where outlier e = Outliers {
                         samplesSeen = 1
-                      , lowSevere = if e <= loS then 1 else 0
+                      , lowSevere = if e <= loS && e < hiM then 1 else 0
                       , lowMild = if e > loS && e <= loM then 1 else 0
                       , highMild = if e >= hiM && e < hiS then 1 else 0
-                      , highSevere = if e >= hiS then 1 else 0
+                      , highSevere = if e >= hiS && e > loM then 1 else 0
                       }
           loS = q1 - (iqr * 3)
           loM = q1 - (iqr * 1.5)