OYSTER / Index_Stats

The Index Stats section contains 17 statistics, all relating to the indexes defined by the user.

  • Keys - The number of hash keys in the indices.
  • Total tokens - The total number of RefIDs held in the indices.
  • Unique tokens - The number of distinct RefIDs held in the indices.
    • Since a single record can match multiple indexes, each RefID is counted only once.
  • Max tokens per key - The maximum number of RefIDs (tokens) associated with a single key.
  • Min tokens per key - The minimum number of RefIDs (tokens) associated with a single key.
  • Min tokens > 1 per key - The minimum number of RefIDs (tokens) associated with a single key, considering only keys that have more than one token.
  • Total tokens per key - The average number of RefIDs per hash key.
  • Unique tokens per key - The average number of unique RefIDs per hash key.
  • Total per Unique tokens - The total number of RefIDs divided by the unique number of RefIDs.
  • Unique per Total tokens - The unique number of RefIDs divided by the total number of RefIDs.
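
As a rough illustration of how the counts above relate to one another, the following Python sketch computes them for a toy index. It is not OYSTER's implementation: the function name `index_token_stats`, the dictionary layout (hash key mapped to a list of RefIDs), and the sample keys are all assumptions made for this example. In particular, "Unique tokens per key" is assumed here to mean unique tokens divided by the number of keys.

```python
from collections.abc import Mapping, Sequence


def index_token_stats(index: Mapping[str, Sequence[str]]) -> dict[str, float]:
    """Summarise a toy index that maps each hash key to its list of RefIDs.

    Hypothetical helper for illustration only; not part of OYSTER.
    """
    if not index:
        raise ValueError("index must contain at least one key")

    sizes = [len(refs) for refs in index.values()]  # tokens per key
    keys = len(sizes)
    total = sum(sizes)
    # A RefID that appears under several keys is counted only once.
    unique = len({ref for refs in index.values() for ref in refs})
    multi = [s for s in sizes if s > 1]  # keys holding more than one token

    return {
        "keys": keys,
        "total_tokens": total,
        "unique_tokens": unique,
        "max_tokens_per_key": max(sizes),
        "min_tokens_per_key": min(sizes),
        "min_tokens_gt1_per_key": min(multi) if multi else 0,
        "total_tokens_per_key": total / keys,
        # Assumption: unique tokens divided by the number of keys.
        "unique_tokens_per_key": unique / keys,
        "total_per_unique": total / unique,
        "unique_per_total": unique / total,
    }


# Made-up sample data: R1 and R2 are indexed under two different keys.
example = {
    "SMITHJ": ["R1", "R2", "R3"],
    "SMITH1990": ["R1", "R2"],
    "DOEJ": ["R4"],
}
stats = index_token_stats(example)
```

In this sample there are 6 total tokens but only 4 unique ones, so "Total per Unique tokens" is 1.5: on average each record is indexed under one and a half keys.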

The next three statistics (Skewness, Kurtosis, and Excess) describe the shape of a distribution and cannot be explained concisely here. For details, refer to: http://www.tc3.edu/instruct/sbrown/stat/shape.htm

  • Skewness

    • If skewness is less than -1 or greater than +1, the distribution is highly skewed.
    • If skewness is between -1 and -½ or between +½ and +1, the distribution is moderately skewed.
    • If skewness is between -½ and +½, the distribution is approximately symmetric.
    • Source for this rule of thumb: Bulmer, M. G., Principles of Statistics (Dover, 1979)
  • Kurtosis

    • A normal distribution has kurtosis exactly 3 (excess kurtosis exactly 0). Any distribution with kurtosis ≈3 (excess ≈0) is called mesokurtic.
    • A distribution with kurtosis <3 (excess kurtosis <0) is called platykurtic. Compared to a normal distribution, its central peak is lower and broader, and its tails are shorter and thinner.
    • A distribution with kurtosis >3 (excess kurtosis >0) is called leptokurtic. Compared to a normal distribution, its central peak is higher and sharper, and its tails are longer and fatter.
  • Excess - The Kurtosis value minus 3

    • The reference standard is a normal distribution, which has a kurtosis of 3. For this reason, the excess kurtosis is often reported instead: excess kurtosis is simply kurtosis - 3. For example, the "kurtosis" reported by Excel is actually the excess kurtosis.
  • Max key - This is the top index key in terms of size.

  • Top 10 keys - The top ten index keys by descending size.
  • Frequency of the Index Candidates - The frequency distribution of index candidate sizes. Candidates of size zero are records that did not match anything in the index, i.e. the first record in a cluster. This statistic is presented in three corresponding columns:
    • Candidate Size
    • # of Candidates
    • # of Records

NOTE: These counts can be used to help fine-tune the index. If there are a large number of records in the zero bin, your rules may be missing some candidates. If there are some large groups, i.e. candidate sizes > 50, then your rules are not granular enough.
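
The tuning check described in the note can be sketched in Python as follows. This is only an illustration under stated assumptions: the input is taken to be a list holding the candidate-set size for each record, the helper names `candidate_size_frequency` and `tuning_hints` are hypothetical, and the `zero_share_limit` of 0.5 is an arbitrary stand-in for "a large number of records in the zero bin". Only the threshold of 50 comes from the note itself.

```python
from collections import Counter


def candidate_size_frequency(candidate_sizes):
    """Return {candidate size: number of records with that many candidates}."""
    return dict(sorted(Counter(candidate_sizes).items()))


def tuning_hints(freq, zero_share_limit=0.5, large_size=50):
    """Flag the two situations described in the note.

    zero_share_limit is a hypothetical cut-off chosen for this sketch;
    large_size=50 mirrors the "candidate sizes > 50" guidance.
    """
    total = sum(freq.values())
    hints = []
    if total and freq.get(0, 0) / total > zero_share_limit:
        hints.append("many records in the zero bin: rules may be missing candidates")
    if any(size > large_size for size in freq):
        hints.append(f"candidate sizes above {large_size}: rules may not be granular enough")
    return hints


# Made-up per-record candidate sizes.
sizes = [0, 0, 0, 1, 2, 2, 75]
freq = candidate_size_frequency(sizes)
hints = tuning_hints(freq)
```

With this sample, three of seven records fall in the zero bin, which stays under the assumed 0.5 limit, while the single record with 75 candidates triggers the granularity hint.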

  • Frequency of the Index Groups by Size - The frequency of the index groups, sorted by size. This statistic is presented in three corresponding columns:
    • Index Group
    • Index Size
    • # of Records
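
Returning to the Skewness, Kurtosis, Excess, Max key, and Top 10 keys entries above, the sketch below shows one way such values could be derived. The source does not state which distribution OYSTER measures or which estimator it uses, so this example assumes the distribution is the number of tokens per key and uses the simple population-moment formulas (skewness = m3 / m2^1.5, kurtosis = m4 / m2^2). The names `shape_stats` and `top_keys` and the sample index are hypothetical.

```python
import heapq


def shape_stats(sizes):
    """Return (skewness, kurtosis, excess) using population moments.

    Assumption: sizes holds the number of tokens for each index key.
    """
    n = len(sizes)
    mean = sum(sizes) / n
    m2 = sum((x - mean) ** 2 for x in sizes) / n  # variance
    m3 = sum((x - mean) ** 3 for x in sizes) / n
    m4 = sum((x - mean) ** 4 for x in sizes) / n
    if m2 == 0:
        raise ValueError("all sizes are equal; shape statistics are undefined")
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return skewness, kurtosis, kurtosis - 3


def top_keys(index, n=10):
    """Return up to n keys ordered by descending number of tokens."""
    return heapq.nlargest(n, index, key=lambda k: len(index[k]))


# Made-up index: four keys with one token each and one key with six.
index = {
    "K1": ["R1"],
    "K2": ["R2"],
    "K3": ["R3"],
    "K4": ["R4"],
    "K5": ["R5", "R6", "R7", "R8", "R9", "R10"],
}
sizes = [len(refs) for refs in index.values()]
skewness, kurtosis, excess = shape_stats(sizes)
max_key = top_keys(index, 1)[0]
top_two = top_keys(index, 2)
```

For this sample the skewness is 1.5, which the rule of thumb above classes as highly skewed, and the kurtosis is 3.25 (excess 0.25), which is slightly leptokurtic. The single oversized key `K5` is what drives both values.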

An example of the Index Stats can be seen in Figure 17 and Figure 18.

[Figure 17: Index Stats screenshot]

[Figure 18: Index Stats screenshot]
