OYSTER / MATRIXCOMPARATOR

Overview

The matrix comparator lets users apply a single comparator (e.g. LED, EXACT) to the parts of one identity attribute, arranged in a matrix. The identity attribute of each reference is split into parts by the ListParser hash function. Each part of the attribute from one reference is then compared with each part from the other reference. Every comparison yields a score between 0 and 1, which represents the probabilistic match between the two parts. A column of the matrix counts as a match if at least one score in it equals 1. Finally, the matched columns are counted and their proportion is calculated. If the proportion is larger than a predefined threshold, the rule signals a match condition; otherwise it signals a no-match condition.

Semantics

The comparison of two references is determined in five steps:

1. The value of each identity attribute is separated into properties by the ListParser hash function.

2. For each identity attribute, the properties of the two values are arranged in a comparison matrix and compared by a designated OYSTER similarity function.

3. Each comparison yields a score between 0 and 1, which represents the probabilistic match between two properties. A column of the matrix counts as a match if at least one score in it equals 1.

4. After all the properties are compared, the columns that count as a match are tallied and their proportion is calculated. If the proportion is larger than the threshold, a match condition exists between the two values; otherwise a no-match condition exists.

5. The process is repeated until all the attribute values are compared.
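
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not OYSTER's implementation: the names `exact` and `matrix_match` are hypothetical, a plain whitespace split stands in for the ListParser hash function, a case-insensitive exact comparison stands in for the similarity function, and the threshold of 0.5 is invented for the example (the real threshold is built in and not stated here).

```python
from typing import Callable, List


def exact(a: str, b: str) -> float:
    """Stand-in comparator: 1.0 on case-insensitive equality, else 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0


def matrix_match(
    value_a: str,
    value_b: str,
    comparator: Callable[[str, str], float] = exact,
    threshold: float = 0.5,  # hypothetical value; the real threshold is built in
) -> bool:
    # Step 1: split each value into properties (whitespace split is an assumption).
    rows: List[str] = value_a.split()
    cols: List[str] = value_b.split()
    if not rows or not cols:
        return False

    # Steps 2-3: score every pair; matrix[i][j] compares rows[i] with cols[j].
    matrix = [[comparator(r, c) for c in cols] for r in rows]

    # A column counts as a match if at least one score in it equals 1.
    matched = 0
    for j in range(len(cols)):
        if any(matrix[i][j] == 1.0 for i in range(len(rows))):
            matched += 1

    # Step 4: match only if the proportion is strictly larger than the threshold.
    return matched / len(cols) > threshold
```

With these assumptions, `matrix_match("123 Main St", "Main St 123")` signals a match because every column contains a score of 1, regardless of token order, while `matrix_match("Main Street", "Main St")` does not, because only one of two columns matches and 0.5 is not larger than the threshold.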

Syntax

There is no syntax change in the Attributes Script. The predefined threshold is built in.

MC Requirements

The following diagram illustrates the concept of a matrix comparator using string tokenization. The attribute being compared is an unstructured "address" field. The two values of the address field have not been parsed into more granular components.

[Diagram: Screen Shot 2019-10-09 at 3.01.35 PM.png]

In this simple example, the fields are tokenized into whitespace- or punctuation-delimited substrings (tokens). The tokens are used as labels for the rows and columns of a matrix. Each cell of the matrix contains a value representing the similarity between the tokens labeling its row and column. In the example shown, the similarities are given as the normalized Levenshtein Edit Distance between the two strings. For visual clarity, cells with a similarity of 0.00 are left blank. The basic scheme is to select the highest similarities in each row and column and use these values to calculate an overall score, in this example simply the unweighted average.

However, this example only illustrates the basic technique, and many variations and enhancements are possible. One is stacking comparators so that each pair of tokens is compared in several different ways, e.g. Levenshtein, SOUNDEX, Nickname, etc. Another is to discount (reduce) the scores for matches between short tokens or between high-frequency strings, similar to the method used in the standard scoring rule. For example, a match on the token "ST" in an address may be discounted because it is such a common address token.
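
The basic scheme (tokenize, fill the similarity matrix, keep the best score in each row and column, then average) might look like the following Python sketch. Several details are assumptions made for illustration: the function names are hypothetical, the similarity is normalized as 1 minus the edit distance divided by the longer token's length (one common convention, which may differ from the one used in the diagram), tokens are upper-cased before comparison, and the overall score averages the row maxima and column maxima together.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Split on whitespace and punctuation; upper-casing is an assumption."""
    return re.findall(r"[A-Za-z0-9]+", text.upper())


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; this normalization is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def matrix_score(value_a: str, value_b: str) -> float:
    rows, cols = tokenize(value_a), tokenize(value_b)
    if not rows or not cols:
        return 0.0
    # matrix[i][j] holds the similarity between rows[i] and cols[j].
    matrix = [[similarity(r, c) for c in cols] for r in rows]
    row_best = [max(row) for row in matrix]
    col_best = [max(matrix[i][j] for i in range(len(rows))) for j in range(len(cols))]
    # Unweighted average over the best score of every row and every column.
    best = row_best + col_best
    return sum(best) / len(best)
```

A discounting variant could multiply each cell by a per-token weight before the maxima are taken, for example a lower weight for "ST"; the weights themselves would have to come from token frequencies that are not given here.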
