OYSTER / LISTOVERLAP

Overview

The List Overlap Comparator ("ListOverlap") lets users determine whether two strings representing item lists reach a given level of similarity. ListOverlap is a similarity function that takes two inputs and two control parameters. Both inputs are assumed to be character strings containing a list of items separated by a single delimiter character. The first control parameter is the minimum percentage of overlap between the two lists required to signal a "True" match condition. The second control parameter is the list delimiter character. After both input strings have been separated into items, ListOverlap calculates the similarity as the number of non-empty items the two lists have in common divided by the number of non-empty items in the longer list. If this percentage of overlap is greater than or equal to the threshold given as the first control parameter, ListOverlap signals a "True" match condition; otherwise it signals a "False" match condition.

Semantics

The comparison of two references is determined as follows:

  1. Each input string is separated into a list of items based on the list delimiter character provided as the second control parameter to ListOverlap. An item is deemed "empty" if it contains no characters or only blank characters. Empty items are dropped from the list. If, after dropping empty items, one or both lists are empty, then the ListOverlap comparator signals a "False" match.

  2. After empty list items are dropped from the list, all duplicate items within the same list are dropped.

  3. Given that both lists have at least one item, the ListOverlap comparator determines the degree of overlap between the two lists. The Degree of Overlap is the number of items shared between the two lists divided by the number of items in the longer list.

  4. If the Degree of Overlap is greater than or equal to the predefined threshold given as the first control parameter, then the ListOverlap comparator signals a "True" match condition; otherwise the ListOverlap comparator signals a "False" match condition.
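The four steps above can be sketched in a few lines of Python. This is only an illustration, not the OYSTER implementation: the function name list_overlap and its argument names are hypothetical, "blank" is assumed to mean any whitespace, and items are upper-cased before comparison as LOC.3.6 and LOC.4.2 require.

```python
def list_overlap(first, second, threshold=0.80, delimiter=","):
    """Return True when the two delimited lists overlap enough to match."""

    def to_items(text):
        # Steps 1 and 2: split on the delimiter, trim blanks, drop empty
        # items, and upper-case each item. Building a set removes
        # duplicates within the same list.
        return {item.strip().upper() for item in text.split(delimiter) if item.strip()}

    items_a = to_items(first)
    items_b = to_items(second)

    # Step 1: an empty list on either side is always a "False" match.
    if not items_a or not items_b:
        return False

    # Step 3: shared items divided by the size of the longer list.
    shared = len(items_a & items_b)
    longer = max(len(items_a), len(items_b))

    # Step 4: compare against the threshold (first control parameter).
    return shared / longer >= threshold
```

With these assumptions, list_overlap("red, green, blue", "Blue,RED,yellow", 0.60) is True because two of three items are shared (about 0.67), while the same call with a threshold of 0.70 is False.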

Syntax

The syntax for ListOverlap is "ListOverlap(C1, C2)" where the two parameters C1 and C2 are as follows:

C1 is the similarity threshold given as a decimal value between 0.00 and 1.00.

C2 is the list delimiter, given as a single character enclosed in apostrophes, e.g. ','

For example, if the threshold is 70% and the list delimiter is a comma, then the List Overlap Comparator would be encoded in a match rule as

Similarity = "ListOverlap(0.70, ',')"

If only one parameter is given, it is assumed to be the threshold value, and the delimiter defaults to a comma. For example,

Similarity = "ListOverlap(0.90)"

This indicates a similarity threshold of 90%, with the list delimiter defaulting to a comma character. Another option is to omit the parameters entirely. For example,

Similarity = "ListOverlap()"

In this case the threshold defaults to 80% and the list delimiter defaults to a comma character.
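As a rough illustration of the three forms above, the following Python sketch pulls the raw C1 and C2 values out of the text assigned to Similarity. The function name parse_rule and the regular expression are hypothetical and are not taken from OYSTER. The sketch only handles syntax: an omitted parameter comes back as None, and applying defaults or validating values is left to the caller.

```python
import re

# Hypothetical pattern: the comparator name (case-insensitive), then an
# optional first argument, then an optional second argument written as
# one character between apostrophes.
_RULE = re.compile(
    r"^\s*ListOverlap\s*\(\s*(?P<c1>[^,()']*?)\s*(?:,\s*'(?P<c2>.)'\s*)?\)\s*$",
    re.IGNORECASE,
)


def parse_rule(rule):
    """Return (raw_c1, raw_c2) as strings, with None for omitted parameters."""
    match = _RULE.match(rule)
    if match is None:
        raise ValueError(f"not a ListOverlap rule: {rule!r}")
    raw_c1 = match.group("c1") or None  # an empty string means C1 was omitted
    raw_c2 = match.group("c2")          # None when C2 was omitted
    return raw_c1, raw_c2
```

Under these assumptions, parse_rule("ListOverlap(0.70, ',')") returns ('0.70', ','), parse_rule("ListOverlap(0.90)") returns ('0.90', None), and parse_rule("ListOverlap()") returns (None, None).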

LOC Requirements

LOC.1 Syntax for the List Overlap Comparator

LOC.1.1 The name of the comparator shall be "ListOverlap"

LOC.1.2 The name shall not be case sensitive

LOC.2 Control parameters for ListOverlap

LOC.2.1 The comparator shall have 2 control parameters C1 and C2 represented as "ListOverlap(C1, C2)"

LOC.2.2 The first control parameter C1 is the match threshold value, and it shall be represented as a numeric decimal value between 0.00 and 1.00

LOC.2.3 If the first control parameter is not a numeric decimal value between 0.00 and 1.00, then the system shall default to a value of 0.80.

LOC.2.4 The second control parameter C2 is the list delimiter character, and it shall be a single character enclosed in apostrophes.

LOC.2.5 If the second control parameter is not a character, then the system shall default to a comma character for a list delimiter.

LOC.2.6 The list delimiter character may be any character except an ampersand (&), less-than (<), greater-than (>), quotation mark ("), apostrophe ('), or pipe (|) character.

LOC.2.7 If a valid character is not given, the system shall default to a comma character as the list delimiter.
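The defaulting rules in LOC.2.3 through LOC.2.7 could be applied along the following lines. This is a sketch under stated assumptions: the names resolve_threshold and resolve_delimiter are hypothetical, the raw values are assumed to arrive as strings (or None when omitted), and the bounds 0.00 and 1.00 are assumed to be inclusive, which the requirements do not state explicitly.

```python
DEFAULT_THRESHOLD = 0.80
DEFAULT_DELIMITER = ","
# Characters excluded by LOC.2.6.
FORBIDDEN_DELIMITERS = set("&<>\"'|")


def resolve_threshold(raw):
    """Return raw as a float in [0.00, 1.00], else the default (LOC.2.3)."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        # Not numeric, or not supplied at all.
        return DEFAULT_THRESHOLD
    # Out-of-range values (and NaN) fail this check and fall back.
    return value if 0.0 <= value <= 1.0 else DEFAULT_THRESHOLD


def resolve_delimiter(raw):
    """Return raw if it is one permitted character, else a comma (LOC.2.5, LOC.2.7)."""
    if isinstance(raw, str) and len(raw) == 1 and raw not in FORBIDDEN_DELIMITERS:
        return raw
    return DEFAULT_DELIMITER
```

For example, resolve_threshold("1.5") and resolve_threshold("abc") both fall back to 0.80, and resolve_delimiter("|") falls back to a comma because the pipe character is excluded.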

LOC.3 Inputs for ListOverlap

LOC.3.1 The two inputs for ListOverlap shall both be character strings

LOC.3.2 Each string shall be interpreted as a list of items where the items are separated by a list delimiter character given as the second control parameter.

LOC.3.3 Each list item shall be trimmed of leading and trailing blanks

LOC.3.4 If the result of trimming blanks is an empty string, then the item shall be dropped from the list

LOC.3.5 If the result of dropping empty items results in an empty list, then the ListOverlap comparator shall signal a "False" match condition. No further processing is required

LOC.3.6 After dropping empty items from each list, each list shall be checked for duplicate items within the same list. For purposes of comparison, all letter characters shall be converted to upper case. If an item in the list is a duplicate of another item in the same list, then the duplicate item shall be dropped.

LOC.4 Determination of True/False Match

LOC.4.1 In the case that either or both of the final lists are empty, then the comparator shall signal a "False" match and no further processing is necessary.

LOC.4.2 Otherwise, all of the items in the first list shall be compared to all of the items in the second list to determine the number of duplicate items between the two lists. For purposes of comparison, all letter characters shall be converted to upper case.

LOC.4.3 After all comparisons are made, the degree of overlap shall be calculated as Degree of Overlap = (Number of Duplicates)/(Length of the Longer List)

LOC.4.4 If the Degree of Overlap is greater than or equal to the match threshold given as the first control parameter, then ListOverlap shall signal a "True" match, else ListOverlap shall signal a "False" match.
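A small worked example of the Degree of Overlap calculation may help. The two lists below are made-up sample data, assumed to be already trimmed, upper-cased, and de-duplicated. The point it illustrates is that the denominator is the length of the longer list, so a short list fully contained in a longer one does not score 1.0.

```python
# Hypothetical sample lists, already cleaned as described in LOC.3.
list_a = ["APPLE", "PEAR", "PLUM"]
list_b = ["PLUM", "FIG", "APPLE", "KIWI", "PEAR"]

# Number of items that appear in both lists.
duplicates = len(set(list_a) & set(list_b))   # 3

# Length of the longer list.
longer = max(len(list_a), len(list_b))        # 5

degree_of_overlap = duplicates / longer       # 3 / 5 = 0.6

# The comparison is "greater than or equal to", so a degree equal to
# the threshold is a match.
matches_at_60 = degree_of_overlap >= 0.60     # True
matches_at_80 = degree_of_overlap >= 0.80     # False
```

Every item of the first list appears in the second, yet the degree of overlap is 0.6, so the pair matches at a threshold of 0.60 but not at the default of 0.80.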
