OYSTER / Cluster_Stats

The Cluster Stats section contains 17 statistics. The first statistic is:

  • Cluster Size Distribution - Provides a frequency listing of all the clusters (identities) in a knowledgebase, grouped by the number of linked references in each cluster.

This is represented by 3 columns:

  • Cluster Size - The number of linked references in the cluster.
  • # of Clusters - The number of clusters that contain the number of linked references specified in the corresponding "Cluster Size".
  • # of Records - The number of linked references that exist in clusters of a particular size. Calculated as: "Cluster Size" * "# of Clusters"

Note: Exceedingly large clusters are outliers and should be investigated, as should large numbers of singleton clusters (clusters of size 1).
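The three columns can be derived from a plain list of cluster sizes. The following is a minimal sketch in Python, not OYSTER's own code; the function name `cluster_size_distribution` and the sample `sizes` list are hypothetical and exist only for illustration.

```python
from collections import Counter

def cluster_size_distribution(cluster_sizes):
    """Return rows of (Cluster Size, # of Clusters, # of Records), sorted by size."""
    counts = Counter(cluster_sizes)  # cluster size -> number of clusters of that size
    return [(size, n, size * n) for size, n in sorted(counts.items())]

# Hypothetical knowledgebase: one entry per cluster, giving its reference count.
sizes = [1, 1, 1, 2, 2, 3, 5]
rows = cluster_size_distribution(sizes)

print(f"{'Cluster Size':>12} {'# of Clusters':>14} {'# of Records':>13}")
for size, n_clusters, n_records in rows:
    print(f"{size:>12} {n_clusters:>14} {n_records:>13}")
```

The "# of Records" values always sum to the total number of references, which makes a quick sanity check on the listing.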

The next three statistics in this section provide information about the references and clusters located in and loaded from an identity input file.

  • Clusters loaded - Total number of clusters loaded from the identity input file.
    • Shows a value of 0 if no identity input file was loaded.
  • References loaded - Total number of references that comprise the loaded clusters.
    • Shows a value of 0 if no identity input file was loaded.
  • Avg # of Refs/Cluster - The average number of references that were loaded per cluster. Calculated as "References loaded" / "Clusters loaded"
    • Shows a value of "NaN" if no identity input file was loaded. "NaN" means Not a Number and is generated when dividing 0 by 0.
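The "NaN" case can be illustrated with a short Python sketch. It assumes the average is the number of references loaded divided by the number of clusters loaded, as the statistic's name suggests; the function name `avg_refs_per_cluster` is hypothetical. Unlike floating-point division in some other languages, Python raises an error on 0 / 0, so the sketch returns NaN explicitly.

```python
import math

def avg_refs_per_cluster(references_loaded, clusters_loaded):
    """Average references per loaded cluster; NaN when nothing was loaded."""
    if clusters_loaded == 0:
        # Python raises ZeroDivisionError for 0 / 0, so return NaN explicitly
        # to mirror the "NaN" shown when no identity input file was loaded.
        return math.nan
    return references_loaded / clusters_loaded

print(avg_refs_per_cluster(10, 4))  # 10 references spread over 4 clusters
print(avg_refs_per_cluster(0, 0))   # no identity input file loaded
```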

The next group of statistics in this section relies on the values presented in the Cluster Size, # of Clusters, and # of Records columns displayed in the Cluster Size Distribution statistic.

  • Average Cluster Grouping - The ACG is the average of the Cluster Size column. This is found by summing all the unique Cluster Sizes and then dividing by the count of the unique Cluster Sizes. For example, (1+2+3+4+5+6+7+8+9) / 9 = 5.
  • Average Cluster by Count - The ACC is the average of the # of Clusters column. This is found by summing all the # of Clusters values and dividing by the count of the # of Clusters values.
  • Average Cluster Size - The ACS is the average cluster size for the run. This is found by summing the # of Records values and dividing by the sum of the # of Clusters values.
  • Number of Duplicate Recs - The number of duplicate records found while processing the input references. This is found by multiplying each Cluster Size minus 1 by the corresponding # of Clusters and summing the results: ∑ (Cluster Size - 1) * # of Clusters. This equals Total records - Total clusters.
  • Duplication Rate - The duplication rate is the percentage of the references that are found to be duplicates based on the identity rule set. The calculation is: 1 - (Total clusters / Total records)
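The five statistics in this group can be sketched in Python from the distribution rows. This is an illustration under stated assumptions, not OYSTER's implementation: the function name `cluster_summary` and the sample rows are hypothetical, and the duplicate count multiplies (Cluster Size - 1) by # of Clusters, an assumption chosen so that the count equals total records minus total clusters and agrees with the Duplication Rate formula.

```python
import math

def cluster_summary(rows):
    """rows: list of (cluster_size, n_clusters, n_records) tuples."""
    sizes = [size for size, _, _ in rows]
    cluster_counts = [n for _, n, _ in rows]
    total_clusters = sum(cluster_counts)
    total_records = sum(rec for _, _, rec in rows)
    return {
        # Average of the unique Cluster Size values.
        "acg": sum(sizes) / len(sizes),
        # Average of the # of Clusters column.
        "acc": total_clusters / len(cluster_counts),
        # Sum of # of Records divided by sum of # of Clusters.
        "acs": total_records / total_clusters,
        # Assumption: every reference beyond the first in a cluster is a duplicate.
        "duplicate_recs": sum((size - 1) * n for size, n, _ in rows),
        "duplication_rate": 1 - total_clusters / total_records,
    }

# Hypothetical distribution: (Cluster Size, # of Clusters, # of Records).
rows = [(1, 3, 3), (2, 2, 4), (3, 1, 3), (5, 1, 5)]
stats = cluster_summary(rows)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

With these sample rows there are 15 records in 7 clusters, so 8 records count as duplicates and the duplication rate is 8 / 15.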

The remaining 8 statistics provided in this section are focused on match candidates.

  • Total Candidates Size - The total number of Candidates that were returned by the index based on the input record set and the indexing rules.
  • Total DeDup Candidates Size - The count of unique candidates within the Total Candidates Size. The two counts can differ because a cluster can have multiple refIDs (a many-to-one relationship).
  • Total # Candidates - The number of input records for which at least one candidate was found. For example, if input #25 returns 3 candidate records, it is counted one time.
  • Avg Candidates per Input - The average number of candidates per input record. Calculated as Total Candidates Size / Total # Candidates
  • Total Matched Count - Represents a count of matches that occurred between references and candidates.
  • Matches per Candidates Size - Represents the percentage of matches per the full Candidates Size. Calculated as Total Matched Count / Total Candidates Size
  • Matches per DeDup Candidates Size - Represents the percentage of matches per the Total DeDup Candidates Size. Calculated as Total Matched Count / Total DeDup Candidates Size
  • Matches per Candidates - Represents the percentage of matches per the Total # Candidates. Calculated as Total Matched Count / Total # Candidates
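The relationships between the candidate statistics can be sketched in Python. This is a hedged illustration, not OYSTER's code: the input structure (a dict mapping each input record ID to the candidate refIDs returned for it), the function names, and the sample data are all hypothetical, and the de-duplicated count is assumed to be the number of distinct candidate refIDs across all inputs.

```python
import math

def ratio(numerator, denominator):
    """Divide, returning NaN instead of raising when the denominator is 0."""
    return numerator / denominator if denominator else math.nan

def candidate_stats(candidates_by_input, total_matched_count):
    """candidates_by_input: hypothetical dict of input record ID -> candidate refIDs."""
    all_candidates = [c for cands in candidates_by_input.values() for c in cands]
    total_size = len(all_candidates)        # every returned candidate, repeats included
    dedup_size = len(set(all_candidates))   # assumption: distinct candidate refIDs
    # Each input record that returned at least one candidate is counted once.
    total_num = sum(1 for cands in candidates_by_input.values() if cands)
    return {
        "Total Candidates Size": total_size,
        "Total DeDup Candidates Size": dedup_size,
        "Total # Candidates": total_num,
        "Avg Candidates per Input": ratio(total_size, total_num),
        "Total Matched Count": total_matched_count,
        "Matches per Candidates Size": ratio(total_matched_count, total_size),
        "Matches per DeDup Candidates Size": ratio(total_matched_count, dedup_size),
        "Matches per Candidates": ratio(total_matched_count, total_num),
    }

# Hypothetical run: input 25 returns 3 candidates, input 26 returns 1, input 27 none.
sample = {25: ["c1", "c2", "c3"], 26: ["c2"], 27: []}
cand = candidate_stats(sample, total_matched_count=2)
for name, value in cand.items():
    print(f"{name}: {value}")
```

In the sample, candidate "c2" is returned for two inputs, so Total Candidates Size is 4 while Total DeDup Candidates Size is 3, and input 27 does not contribute to Total # Candidates.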

An example of the Cluster Stats section can be seen in Figure 14 and Figure 15.

[Figure: screenshot of the Cluster Stats section]

