Wiki

Clone wiki

OYSTER / Index_Design

The goal of index design is two-fold

  1. (Alignment) Define one or more indices so that whenever two records match by one the rules, both records with generate the same match key by at least one index
  2. (Reduction) Define each index in such a way that a minimal number of records will generate the same match key.

The ideal situation would be to define one index for each rule that only generates the same match key for records that match by that rule. For certain rule sets this is possible by simply translating each rule term-by-term into a corresponding index. For example, suppose that a system only used Rules 1 and 2 shown in Table 3. Since Rule 1 requires and exact match on Attributes A1, A3, and A4, the index for this rule would simply be generate a match key by concatenating these three values, i.e. the match key would be A1+A3+A4. Similarly, the index for Rule 2 would only need to concatenate Attributes A1, A3, and A5, i.e. the match key would be A1+A3+A5. In this case, for each rule, every record that matches by the rule would create the same match key, and any two records creating the match key would match by the rule. An OYSTER implementation of these two rules and indices is shown in the Figure 30. Screen Shot 2019-09-10 at 5.25.01 PM.png

Note that the hash generator SCAN produces the equivalent of an EXACT match with the settings shown in Figure 1Figure 30. Here the SCAN is set to scan the string from left-to-right (Direction parameter set to "LR"), extract all of the characters in string (Character Type parameter set to "All"), scan the entire length of the string (Length set to "0"), and keep the characters in the same order (Ordering parameter set to "SameOrder").

However, for many rule sets is not possible because of comparators that perform approximate matching. Most comparators are built to allow for some level of variation between the values being compared yet still considering the values to be (approximately) the same. In many cases this is done by reducing both values the common third value. Even for exact comparison between strings, there is often consideration for ignoring differences between corresponding letters due to case, i.e. upper case versus lower case. In OYSTER this corresponds to the "ExactIgnoreCase" comparator or the "SCAN" comparator using all upper case conversion. For example, this would map the strings "John", "john", and "JOHN" all to the common value string of "JOHN" before making the comparison.

6.2.1 Match Key Comparators versus Similarity Comparators

Each hash algorithm in a match key generator is designed as a function which takes as its argument a single attribute value and maps it (transforms it) to a single output value. On the other hand, rule comparators take a pair of attribute values and map it a Boolean value, either True or False. However, as described earlier, many rule comparators such as the "ExactIgnoreCase" comparator work on the same principle as a match key function. When a pair of attribute values is input to the comparator, it first operates separately on each value to produce two outputs. If the two outputs are the same, then the comparator value for the pair is True, else the comparator value is False. For this reason, comparators of this type are called match key comparators.

The simplest type of match key comparator is the EXACT each value of the pair is mapped to itself. Another type of match key comparator is the Soundex comparator primarily used for strings represented a person's name. The Soundex comparator maps each name string to a 5-character string that starts with the first letter of the input string and is followed by four decimal digits. The four digits are derived according to the Soundex Algorithm. If both input name strings produce the same 5-character output string, then the Soundex comparator produces a True value. For example, the names "Phillip" and "Philip" product the same Soundex value of "P410".

Matching rules that use only match key comparators can be translated directly in an index by a series of hash functions that correspond to the match key comparator as was illustrated in Figure X. Unfortunately not all comparators operate in this way. Many comparators are a type called similarity comparators. Similarity comparators generate a numerical value that represents some measure of the similarity between two attribute values. These comparators cannot operate on each attribute value independently, but must consider both values to product a similarity measure. A primary example is the Levenshtein Edit Distance (LED) algorithm that measures the similarity between two character strings. Unlike Soundex, it does not make sense to talk about the LED of a single string. The LED can only be generated between two strings where it is based on the number simple character transformations that are required to convert one string into the other string. For example if one value is "John" then it could be transformed into each of the values "Jon", "Jhn", or "ohn" with a single deletion transformation. This makes "John" 75% similar to each of these three other strings (one change out of four characters). However, there is no function that can operate on each of these strings independently and produce a single value that will predict that it will be 75% LED similar to each of the other strings. The similarity can only be known when both strings are present and are processed together by the LED algorithm.

6.2.2 Balancing Alignment with Reduction

As long as each rule has at least one match key comparator, then it will always be possible to create a set of indices that have proper alignment. For example, each of the 17 rules in Table 3 uses at least one Exact comparator. In fact, Table 3 shows that every rule requires either an exact match on A1 or an exact match on A3. Therefore one possible design to index the rules in Table 3 would be two indices where the first index uses only the value of A1 and the second index uses only the value A3. This would satisfy the alignment condition in that every pair of records that matches by one of the rules, but also have to have either the same value for A1 or the same value for A3.

Whether or not this would be a practical index design will depend on the number of records being processed. The second constraint of index design is reduction. For small sets of records, this may not be a problem, but for large datasets there may be so many records that have the same value for A1 or the same value for A3, that the number of comparisons performed by the rules will be too large, and system performance will be affected.

For example, suppose that the system implementing the 17 rules of Table 3 is able to make 200,000 comparisons per second for each rule. Assuming the match rate is low then a worst case estimate is that each input record has to be compared to every candidate record. If this system runs without an index then each input record will have to compared to each previously processed record (assuming the One-Pass algorithm) by each of the 17 rules, or 17xNx(N-1)/2 for N input records. If N is relatively small, say 10,000 records, then the total run-time, even without an index will only be about 142 minutes or 2 hours and 22 minutes.

Now suppose that the two index design (A1 and A3) is implemented in the system. The reduction in comparisons will depend on the average number of records that generate the same match keys for these indices. With the index in place, each input record will only be compared to all previously processed records that generate the same match key as the input record by either of the two indices. If a set of indices on average returns M candidates for comparison, then the total comparison effort for the system with 17 rules to process N input records is approximately 17xNxM comparisons.

Even though there can be some overlap between the records that generate the same match key for both by both indices, for worst case assumption that can be ignored. Therefore assuming on that on average the A1 index returns 200 candidate records and the A3 index returns 400 records, or 600 candidates combined, the total run-time for 10,000 input records can be dramatically reduced to approximately 8.5 minutes.

However, if the number of input records is increased to 1,000,000 records, then the run-time using the two-index design will increase to more than 14 hours. In this case the reduction afforded by the simple two-index design may no longer be acceptable. One analysis technique that can be used to estimate the level of index reduction is to load in the input data into a database, then translate each index into an SQL query to profile the counts of duplicate match keys that each index will generate. In addition, some systems may provide the user with information on index performance. For example, OYSTER provides information in its standard log file about the average number of tokens (records) per match key, the 10 match keys with the largest number of tokens, and other helpful index statistics.

In general, the more attributes that are concatenated to produce a match key, the fewer records that produce the same match key. At the same time, when there are many rules such as in Table 3, designing indices that use more attributes per index may require adding new indices in order to maintain proper alignment.

Screen Shot 2019-09-10 at 5.26.01 PM.png

Table 4 shows an alternative 4-Index design that aligns with the 17 rules of Table 1. An "X" in a cell of Table 4 means that the rule indicated by the cell row is in alignment with the index given by the cell column. Three of the four indices each hash two of the attributes, A1+A4, A1+A5, and A3+A4. These three indices are aligned with all of the rules except for Rule 10. A fourth index based on A1+A2+A7 is added to cover this rule and assure rule-to-index alignment. Although there are more indices in this scheme, the combined number of candidates returned by this design is approximately 25 records per match key.
Applying the same calculation as before, the run-time for an input file of 1,000,000 records will be reduced from 14 hours using the previous design to approximately 35 minutes using the design shown in Table 4.

Another factor to consider is that more indices will increase the storage requirements plus some additional overhead in performance for multiple lookups. However, in general these are minor in consideration of the much larger reduction that can be obtained.

Given the consideration of alignment and reduction, index design may have an impact on rule design. For example, if a proposed rule uses only similarity comparator such as LED or Q-gram, then it may not be possible to design and index that assures alignment. In this case it may be prudent to break the single rule into multiple rules where each rule has at least one match key comparator that can be index. Another strategy is to experiment and see if similarity comparators can be replaced by match key comparators. For example for name values, it may be that the Levenshtein comparator can be replaced with the Soundex, NYSIIS, or other match key name comparators without a significant loss in accuracy.

In some situations performance considerations may dictate that 100% alignment is not possible or practical. Nevertheless, every attempt should be made to attain alignment, or at least to have the highest level of alignment possible.

Back to OYSTER Reference Guide page

Click Prev Rule Analysis page

Click Next Alignment Validation page

Updated