OYSTER / LISTOVERLAPPV

Overview

The purpose of the Property-Value List Comparator ("ListOverlapPV") is to allow users to determine whether two strings representing property-value pair lists reach a given level of similarity. ListOverlapPV is a similarity function and requires two inputs. Both inputs are assumed to be character strings comprising a list of items, where each item is a pair of values. The first value of the pair is the Property Name and the second value is the Property Value. ListOverlapPV assumes that the property-value pairs in the list are consistently separated by a single character (the list delimiter) and that, within each pair, the property name is consistently separated from its property value by a different character (the pair delimiter).

ListOverlapPV also takes four control parameters. The first control parameter is the minimum degree of overlap between the two lists required to signal a "True" match condition. The second control parameter is the list delimiter character, and the third control parameter is the pair delimiter character. The optional fourth control parameter is a string containing a list of placeholder property values. After both input lists have been separated into pairs, ListOverlapPV first removes any pairs where the property value is empty or where the property value is in the list of placeholder values. After these pairs are removed, any duplicate property-value pairs within the same input list are removed. Finally, the comparator calculates the degree of overlap between the two lists as the number of pairs the lists have in common divided by the total number of pairs in the longer list. If the degree of overlap is greater than or equal to the threshold given as the first control parameter, then the ListOverlapPV comparator signals a "True" match condition; otherwise it signals a "False" match condition.

Semantics

The similarity between two property-value pair lists is determined as follows:

1. Each input string is separated into a list of property-value pairs based on the list delimiter character given in the second control parameter.

2. A property-value pair is deemed "empty" if it contains no characters or only blank characters. Empty property-value pairs are dropped from the list.

3. Next, each property-value pair is separated into its property name and corresponding property value based on the pair delimiter character given in the third control parameter. If either the property name or the property value is empty, the property-value pair is dropped from the list.

4. Next, each property value is compared to the list of placeholder property values given in the fourth control parameter. If the property value of a property-value pair matches one of the placeholder values, the property-value pair is dropped from the list.

5. Next, all duplicate property-value pairs within the same input string are dropped.

6. If the final list of property-value pairs extracted from either input string is empty, then the comparator signals a "False" match condition.

7. Otherwise, the pairs in both lists are compared for matching items. The number of matching items between the pair lists is divided by the number of items in the longer of the two lists. If this ratio is greater than or equal to the first control parameter, then the ListOverlapPV comparator signals a "True" match condition; otherwise the ListOverlapPV comparator signals a "False" match condition.
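
The seven steps above can be sketched in a few lines of Python. This is only an illustration, not the OYSTER implementation: the function names parse_pairs and list_overlap_pv are hypothetical, the control parameters are passed as plain arguments, and the sketch assumes that a pair is split at the first occurrence of the pair delimiter. Upper-casing before comparison follows the PVC requirements rather than the step list itself.

```python
def parse_pairs(text, list_delim=",", pair_delim=":", placeholders=()):
    """Return the set of cleaned (name, value) pairs found in text."""
    skip = {p.strip().upper() for p in placeholders}
    pairs = set()  # a set drops duplicate pairs automatically (step 5)
    for item in text.split(list_delim):  # step 1: split on the list delimiter
        item = item.strip()
        if not item:  # step 2: drop empty pairs
            continue
        # Step 3: split at the first pair delimiter (an assumption of this sketch).
        name, _, value = item.partition(pair_delim)
        name, value = name.strip().upper(), value.strip().upper()
        if not name or not value:
            continue
        if value in skip:  # step 4: drop placeholder values
            continue
        pairs.add((name, value))
    return pairs


def list_overlap_pv(a, b, threshold=0.80, list_delim=",", pair_delim=":",
                    placeholders=()):
    """Return True when the degree of overlap reaches the threshold."""
    left = parse_pairs(a, list_delim, pair_delim, placeholders)
    right = parse_pairs(b, list_delim, pair_delim, placeholders)
    if not left or not right:  # step 6: an empty list never matches
        return False
    # Step 7: pairs in common divided by the size of the longer list.
    overlap = len(left & right) / max(len(left), len(right))
    return overlap >= threshold
```

For example, with placeholders "UNK" and "?", the inputs "color:red, size:L, brand:UNK, color:RED" and "size:l, color:red, weight:?" each reduce to the same two pairs, so the sketch reports a match.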

Syntax

The syntax for ListOverlapPV is "ListOverlapPV(C1, C2, C3, C4)" where

C1: The similarity threshold given as a decimal value between 0.00 and 1.00.

C2: The list delimiter given as a single character enclosed in apostrophes.

C3: The pair delimiter given as a single character enclosed in apostrophes.

C4: The list of placeholder values given as a string enclosed in apostrophes. The placeholder values in the list must be separated from each other by the pipe (|) character.

For example, if the threshold is 80%, the list delimiter is a comma, the pair delimiter is a colon (:), and there are two placeholder values "UNK" and "?", then the OYSTER call to the Property-Value List Comparator would be encoded in a match rule as

Similarity = "ListOverlapPV(0.80, ',', ':', 'UNK|?')"
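
To make the call syntax concrete, the following sketch pulls C1 through C4 out of the quoted value of a Similarity setting using a regular expression. It is a hypothetical helper (parse_rule is not an OYSTER function) and assumes all four parameters are present and well formed; it does not apply any defaults.

```python
import re

# Verbose pattern: whitespace and "#" comments inside it are ignored.
RULE = re.compile(
    r"""^\s*ListOverlapPV\s*\(\s*
        ([0-9.]+)\s*,\s*     # C1: similarity threshold
        '(.)'\s*,\s*         # C2: list delimiter
        '(.)'\s*,\s*         # C3: pair delimiter
        '([^']*)'\s*         # C4: pipe-separated placeholder values
        \)\s*$""",
    re.IGNORECASE | re.VERBOSE,  # the comparator name is not case sensitive
)


def parse_rule(text):
    """Split a ListOverlapPV call into (threshold, list_delim, pair_delim, placeholders)."""
    match = RULE.match(text)
    if match is None:
        raise ValueError("not a well-formed ListOverlapPV call: %r" % text)
    threshold, list_delim, pair_delim, raw = match.groups()
    placeholders = raw.split("|") if raw else []
    return float(threshold), list_delim, pair_delim, placeholders
```

Applied to the call shown above, the sketch yields a threshold of 0.8, a comma list delimiter, a colon pair delimiter, and the placeholder values "UNK" and "?".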

PVC Requirements

PVC.1 Syntax for the Property-Value List Comparator

PVC.1.1 The name of the comparator shall be "ListOverlapPV"

PVC.1.2 The name shall not be case sensitive

PVC.2 Control parameters for ListOverlapPV

PVC.2.1 The comparator shall have 4 control parameters (arity of 4)

PVC.2.2 The first control parameter shall be the degree of overlap threshold value.

PVC.2.2.1 The degree of overlap threshold value shall be given as a numeric decimal value between 0.00 and 1.00.

PVC.2.2.2 If the degree of overlap threshold value is not given or is not in the proper format, the system shall default the value to 0.80.

PVC.2.3 The second control parameter shall be the list delimiter character

PVC.2.3.1 The list delimiter character shall be given as a single character enclosed by apostrophe characters

PVC.2.3.2 The list delimiter character shall be any character except it shall not be an ampersand (&), less-than (<), greater-than (>), quotes ("), apostrophe ('), or pipe (|) character.

PVC.2.3.3 If the list delimiter character is not given or not in the proper format, the system shall default its value to the comma (,) character.

PVC.2.4 The third control parameter shall be the pair delimiter character.

PVC.2.4.1 The pair delimiter character shall be given as a single character enclosed by apostrophe characters.

PVC.2.4.2 The pair delimiter character shall be any character except it shall not be an ampersand (&), less-than (<), greater-than (>), quotes ("), apostrophe ('), or pipe (|) character, and it shall not be the same as the list delimiter character.

PVC.2.4.3 If the pair delimiter character is not given or not in the proper format, the system shall default its value to the colon (:) character.

PVC.2.5 The fourth control parameter shall be the list of property placeholder values.

PVC.2.5.1 The list of placeholder values shall be given as a string of characters enclosed by apostrophe characters.

PVC.2.5.2 If there is more than one placeholder value in list, consecutive values shall be separated from each other by the pipe (|) character.

PVC.2.5.3 Placeholder values shall be comprised of any characters except they shall not contain an ampersand (&), less-than (<), greater-than (>), quotes ("), apostrophe ('), or pipe (|) character.
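
The defaulting rules in PVC.2 could be sketched as follows. The helper names are hypothetical, and the inputs are assumed to be the raw parameter text after the enclosing apostrophes have been removed. Treating an out-of-range threshold as "not in the proper format", trimming blanks around placeholder values, and upper-casing them ahead of comparison are assumptions of this sketch. The requirements do not say what should happen when the list delimiter is itself a colon and the pair delimiter falls back to its colon default, so the sketch leaves that case unresolved.

```python
FORBIDDEN = set("&<>\"'|")  # characters excluded by PVC.2.3.2 and PVC.2.4.2


def normalize_threshold(raw, default=0.80):
    """Return raw as a float in [0, 1], or the default (PVC.2.2.2)."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return default
    return value if 0.0 <= value <= 1.0 else default


def normalize_delimiter(raw, default, exclude=None):
    """Return raw if it is one permitted character, else the default."""
    ok = (
        isinstance(raw, str)
        and len(raw) == 1
        and raw not in FORBIDDEN
        and raw != exclude  # the pair delimiter must differ from the list delimiter
    )
    return raw if ok else default


def normalize_placeholders(raw):
    """Split a pipe-separated placeholder string into upper-cased values."""
    if not isinstance(raw, str):
        return []
    return [p.strip().upper() for p in raw.split("|") if p.strip()]
```

A caller would normalize the list delimiter first (default ",") and then pass it as exclude when normalizing the pair delimiter (default ":").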

PVC.3 Creating the lists of property value pairs

PVC.3.1 The two inputs for ListOverlapPV shall both be character strings

PVC.3.2 Each string shall be interpreted as a list of property-value pairs where the pairs are separated from each other by the list delimiter character given in the second control parameter.

PVC.3.3 Each property-value pair shall be trimmed of leading and trailing blanks

PVC.3.4 If the result of trimming blanks is an empty string, then the property-value pair shall be dropped from the list

PVC.3.5 If dropping empty property-value pairs leaves an empty list, then the ListOverlapPV comparator shall signal a "False" match condition. No further processing is required.

PVC.3.6 Otherwise, if there are items remaining in both lists of property-value pairs, each pair in each list shall be separated into a property name and corresponding property value according to the pair delimiter character given as the third control parameter.

PVC.3.7 Each property name and each property value shall be trimmed of leading and trailing blanks.

PVC.3.8 If either the property name or the property value of a property-value pair is empty, then the property-value pair shall be dropped from the list.

PVC.3.9 Each property value shall be compared to the list of placeholder property values given in the fourth control parameter. For purposes of comparison, all letter characters shall be converted to upper case.

PVC.3.10 If the property value of a property-value pair matches one of the placeholder values, the property-value pair shall be dropped from the list.

PVC.3.11 After dropping empty pairs and pairs with placeholder values, each list shall be checked for duplicate property value pairs within the same list. For purposes of comparison, all letter characters shall be converted to upper case. If a pair in the list is a duplicate of another pair in the same list, then the duplicate pair shall be dropped.

PVC.4 Determination of True/False Match

PVC.4.1 In the case that either or both of the final lists of property-value pairs are empty, the comparator shall signal a "False" match and no further processing is necessary.

PVC.4.2 Otherwise, all of the pairs in the first list shall be compared to all of the pairs in the second list to determine the number of duplicate pairs between the two lists. For purposes of comparison, all letter characters shall be converted to upper case.

PVC.4.3 After all comparisons are made, the degree of overlap shall be calculated as Degree of Overlap = (Number of Pairs in Common)/(Length of the Longer List)

If the Degree of Overlap is greater than or equal to the match threshold given as the first control parameter, then ListOverlapPV shall signal a "True" match, else ListOverlapPV shall signal a "False" match.
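
As a worked example of the PVC.4.3 calculation, the sketch below computes the degree of overlap for two already-cleaned pair lists, represented here as Python sets of upper-cased (name, value) tuples. The function name degree_of_overlap and the sample data are illustrative only.

```python
def degree_of_overlap(left, right):
    """Pairs in common divided by the length of the longer list."""
    if not left or not right:
        return 0.0  # PVC.4.1: an empty list can never produce a match
    return len(left & right) / max(len(left), len(right))


left = {("COLOR", "RED"), ("SIZE", "L"), ("BRAND", "ACME")}
right = {("COLOR", "RED"), ("SIZE", "M"), ("BRAND", "ACME"), ("WEIGHT", "2KG")}

degree = degree_of_overlap(left, right)  # 2 pairs in common / 4 pairs = 0.5
is_match = degree >= 0.80                # False at the default threshold
```

Because the denominator is the longer list, a short list that is fully contained in a much longer one can still fall below the threshold.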
