
Demo 22 - ScoringTest

The scoring rule configuration in OYSTER was created to support probabilistic matching. The scoring rule is similar to the Boolean rule in that you can specify a similarity function and an optional data preparation function for comparing the values of identity attributes between two entity references. The primary difference is that instead of the similarity resulting in a True or False decision, the decision is whether the identity attribute values should contribute an “agreement” or “disagreement” weight to an overall match score for the pair of references being compared.

Important Note: The logic of the OYSTER scoring rule described in this document only applies to OYSTER Version 3.6.2 and higher.

In the scoring rule, the agree and disagree decisions are both associated with a numerical value called a “weight”. Hence, for each identity attribute comparison there is an agreement weight and a disagreement weight. In some cases, there can also be a third weight, called a “missing weight”, to be used instead of the agreement or disagreement weight when either or both values of the identity attribute are missing.

Depending upon the outcome of the comparison for each identity attribute, either the agreement weight, the disagreement weight, or the missing weight is added into a total score. The total score is then compared to a predefined “match score.” If the total score is greater than or equal to the match score, then the overall decision is to link the references, otherwise the references are not linked.

An additional feature to support accuracy analysis is the ability to also specify an optional “review score.” If a review score is specified, then when the total score falls below the match score, but is greater than or equal to the review score, the pair of references and their total score are written to a “clerical review” file for post-processing analysis.
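
To make these mechanics concrete, here is a short Python sketch of the term-by-term weight selection and the match/review decision. The thresholds, weights, and function names are illustrative assumptions for this walkthrough, not OYSTER's internal code.

    # Illustrative sketch of the scoring decision described above.
    # The thresholds, weights, and helper names are assumptions for
    # this walkthrough, not OYSTER's internal implementation.

    MATCH_SCORE = 12.0     # assumed predefined match score
    REVIEW_SCORE = 8.0     # assumed optional review score

    def term_weight(value_a, value_b, agree_wgt, disagree_wgt, missing_wgt, similar):
        """Pick the weight one identity attribute contributes to the total."""
        if value_a is None or value_b is None:
            return missing_wgt                # either or both values missing
        return agree_wgt if similar(value_a, value_b) else disagree_wgt

    def decide(pair_terms):
        """Sum the term weights and compare the total to the thresholds."""
        total = sum(term_weight(*t) for t in pair_terms)
        if total >= MATCH_SCORE:
            return total, "link"
        if total >= REVIEW_SCORE:
            return total, "clerical review"   # written to the review file
        return total, "no link"

    # Example pair: Last Name agrees (+10.5), First Name disagrees (-5.0)
    exact = lambda a, b: a.upper() == b.upper()
    terms = [("SMITH", "SMITH", 10.5, -5.5, 0.0, exact),
             ("JOHN",  "MARY",  10.0, -5.0, 0.0, exact)]
    print(decide(terms))   # -> (5.5, 'no link') under these assumed thresholds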

This run will use the test data file named ‘TestInput.txt’, illustrated in Figure 1. This data consists of four references composed of three attributes. The first attribute is the RecID, a unique identifier associated with each record (which must be explicitly identified in the source descriptor for the run). The other two attributes are First Name and Last Name.

ScoringRuleInputFile.PNG

Figure 1: Scoring Rule Source Input File

Configuration of Scoring Rule in OYSTER

Just as with the Boolean identity rules, the scoring rule is configured using XML elements in the OYSTER Attributes Script. Figure 2 shows an example of a scoring rule defined on the two identity attributes “Fname” and “Lname”.

捕获.JPG

Figure 2: Example of Scoring Rule

The scoring rule is controlled by six parameter codes. The syntax for writing the parameter codes is shown in Figure 2 (ScoringRule Ident="$BSSTTB"), where each character after the “$” corresponds to one parameter.

The full explanation of all six parameter codes is given below (a short decoding sketch follows the list). Their actions are as follows:

parm1: Controls use of table entries
* B = use both agree and disagree weights from table
* A = use only agree weights from table
* X = do not use weights from table

parm2: Treatment of agreement weights
* L = use the larger (maximum) weight
* S = use the smaller (minimum) weight
* A = use the average weight

parm3: Treatment of disagreement weights
* L = use the larger (maximum) weight
* S = use the smaller (minimum) weight
* A = use the average weight

parm4: Priority of table weights on agreement
* T = always use weight from table
* E = no priority to table weights

parm5: Priority of table weights on disagreement
* T = always use weight from table
* E = no priority to table weights

parm6: Treatment of missing values
* N = never use missing weight
* B = use missing weight only when both values are missing
* E = use missing weight when either or both values are missing
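
To make the encoding concrete, the sketch below decodes an Ident string such as “$BSSTTB” from Figure 2 into the six settings above. The one-character-per-parameter format follows from the list above; the decoder itself is illustrative Python, not OYSTER source code.

    # Illustrative decoder for the scoring rule parameter code
    # (assumed format: "$" followed by one character per parameter).
    PARM_MEANINGS = [
        ("use of table entries",         {"B": "use both agree and disagree weights",
                                          "A": "use only agree weights",
                                          "X": "do not use weights from table"}),
        ("agreement weights",            {"L": "larger", "S": "smaller", "A": "average"}),
        ("disagreement weights",         {"L": "larger", "S": "smaller", "A": "average"}),
        ("table priority, agreement",    {"T": "always use table weight",
                                          "E": "no priority to table weights"}),
        ("table priority, disagreement", {"T": "always use table weight",
                                          "E": "no priority to table weights"}),
        ("missing values",               {"N": "never use missing weight",
                                          "B": "only when both are missing",
                                          "E": "when either or both are missing"}),
    ]

    def decode_ident(ident):
        """Map each character after '$' to its parm meaning."""
        return {"parm%d (%s)" % (i, name): options[code]
                for i, ((name, options), code)
                in enumerate(zip(PARM_MEANINGS, ident.lstrip("$")), start=1)}

    for parm, action in decode_ident("$BSSTTB").items():
        print(parm, "->", action)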

The example in Figure 2 shows how agreement weights can be associated with identity attributes at either the attribute level or the value level. For example, the rule term for the identity attribute Lname only defines weights at the attribute level. When two Last Names are compared, if they agree, there is a fixed agreement weight of “10.5”; if they disagree, there is a fixed disagreement weight of “-5.5”; and if either or both values are missing, a weight of “0” is used. The same weights apply regardless of the actual Last Name value.

On the other hand, the identity attribute Fname references the weight table “FnameWeights.txt” in the first rule <Term>. The weight table contains a set of key-value pairs, as shown in Figure 3. The key is an attribute value, and the value is the corresponding agreement weight. In the table, the key and weight are separated by a single tab character.

FirstName_WeightTable.PNG

Figure 3: Screenshot of the Weight Table for attribute First Name

An important note about the use of weight tables: the lookup key for the table will be the output of the DataPrep function. If a DataPrep function is not used, then the lookup key will be the attribute value from the input file, ignoring case. Note in Figure 2 that the Similarity is by SOUNDEX, but there is no DataPrep function. This means that if the two values being compared are “John” and “Jon”, they will agree by SOUNDEX, but the lookup keys for the table will be “JOHN” and “JON”. A better configuration for this term would be DataPrep=”SOUNDEX” and Similarity=”EXACT”. In this way, the lookup key for the weight table would be “J500”, the SOUNDEX hash for both “John” and “Jon”, and the table would only need one entry for “J500”. In the configuration given, you would need both “JOHN” and “JON” in the weight table to obtain consistent results.
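
The Python sketch below illustrates this lookup-key behavior: loading a tab-separated weight table like the one in Figure 3, and deriving the key either from the raw upper-cased value or from a DataPrep output such as SOUNDEX. The soundex function is a simplified stand-in so the example runs; OYSTER's own SOUNDEX implementation may differ in edge cases.

    # Sketch of the weight table lookup described above. The soundex()
    # here is a simplified textbook implementation included so the
    # example runs; OYSTER's own SOUNDEX may differ in edge cases.

    def soundex(name):
        codes = {c: d for d, grp in enumerate(
            ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1) for c in grp}
        name = name.upper()
        out, prev = name[0], codes.get(name[0])
        for ch in name[1:]:
            code = codes.get(ch)
            if code and code != prev:
                out += str(code)
            if ch not in "HW":        # H and W do not reset the previous code
                prev = code
        return (out + "000")[:4]

    def load_weight_table(text):
        """Parse lines of 'KEY<tab>agreement weight' as in Figure 3."""
        return {k: float(w) for k, w in
                (line.split("\t") for line in text.splitlines() if line)}

    def lookup_key(value, dataprep=None):
        """DataPrep output if one is configured, else the value ignoring case."""
        return dataprep(value) if dataprep else value.upper()

    table = load_weight_table("J500\t7.5")          # assumed example entry
    print(lookup_key("John", dataprep=soundex))     # -> J500, found in table
    print(lookup_key("John"))                       # -> JOHN, not in this table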

Weight Table Logic

To properly use the scoring rule, it is important to fully understand the details of the logic used to determine the agreement or disagreement weight added to the overall match score. This is especially true for the weight used by a rule term when a weight table is given. Two very important rules to remember are:

Rule 1: The agreement or disagreement is determined by the SIMILARITY function

Rule 2: If a weight table is used AND if

Rule 2A: DataPrep IS NOT USED, then the system will use the value of the attribute from the input file to look up the weight. However, the weight lookup is NOT CASE SENSITIVE. If the weight table key is “MAX” and the input source has “Max”, then the system will still find “MAX” in the weight table and assign it the proper agreement weight.

Rule 2B: DataPrep IS USED, then the system will use the DataPrep value as the lookup value for the weight table. So if the input value is “John” and the DataPrep is SOUNDEX, then the system will try to look up “J500”, the SOUNDEX hash of “John”, in the table. On the other hand, if DataPrep is SCAN(Letters to uppercase), then the system would try to look up “JOHN” in the weight table.

Examples

Suppose the two values for Fname being compared are “John” and “Jon”.

Example 1: <Term Item=”Fname” Similarity=”Exact” AgreeWgt=”10” Disagree=”-5” WgtTable=”MyTable”/>

Because “John” and “Jon” disagree by Exact match, the rule will use the overall disagreement weight of “-5”.

Example 2: <Term Item=”Fname” Similarity=”SOUNDEX” AgreeWgt=”10” Disagree=”-5” WgtTable=”MyTable”/>

Because “John” and “Jon” agree by SOUNDEX and because DataPrep is not used, the rule will try to look up both source values, “John” and “Jon”, in the weight table.
* Case 1: Neither “John” nor “Jon” is found in the table; the rule will select the overall agreement weight of “10”.
* Case 2: One of the names is in the table, but the other is not; the rule will use the agreement weight for the name found in the table.
* Case 3: Both “John” and “Jon” are found in the table; the rule will use the minimum of the two weights found in the table.

Example 3: <Term Item=”Fname” DataPrep=”SOUNDEX” Similarity=”Exact” AgreeWgt=”10” Disagree=”-5” WgtTable=”MyTable”/>

“John” and “Jon” are first converted to the SOUNDEX hash “J500”. Because both generate the same SOUNDEX hash, the Exact match is true. Because DataPrep is used, the system will look up “J500” in the weight table. If “J500” is found in the weight table, then the agreement weight from the table is selected. If “J500” is not found, then the overall agreement weight of “10” is used.
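
Taken together, the three examples suggest the following selection logic for the agreement weight when a weight table is present. This Python sketch is one plausible reading (assuming parm2 = S, the minimum rule), not OYSTER's actual implementation.

    # One plausible reading of the agreement-weight selection in the
    # examples above, assuming parm2 = S (use the smaller weight). The
    # fallback to the term's overall weight mirrors Case 1.

    def agreement_weight(key_a, key_b, table, overall_agree_wgt):
        found = [table[k] for k in (key_a, key_b) if k in table]
        if not found:
            return overall_agree_wgt    # Case 1: neither key in the table
        return min(found)               # Case 2: single entry; Case 3: minimum

    table = {"JOHN": 8.0}               # assumed example table entry
    print(agreement_weight("JOHN", "JON", table, 10.0))            # -> 8.0 (Case 2)
    print(agreement_weight("J500", "J500", {"J500": 7.5}, 10.0))   # -> 7.5 (Example 3)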

1. At the prompt opened earlier, enter 'ScoringTestRunScript.xml' and press Enter to perform the run, as shown in Figure 4.

cmd1.PNG

Figure 4: Running Scoring Rule Run Script

2. Information about the run will be displayed in the Command Prompt. For this run, 6 references were processed and grouped into 3 identities. The OYSTER Run Statistics are shown in Figures 5-6.

cmd2.PNG cmd3.PNG

Figure 5-6: Scoring Rule OYSTER Run Statistics

3. After the run finishes, the Output folder will contain the ScoringTestIndex.link file along with some other auto-generated files, as shown in Figure 7.

cmd4.PNG

Figure 7: Scoring Rule Run Output Folder

4. OYSTER creates the persistent identifiers for identities and stores them in the ScoringTestIndex.link file. The ScoringTestIndex.link file is shown in Figure 8.

linkfile.PNG

Figure 8: ScoringTestIndex.link file

Back to OYSTER Demonstration Run page
