Demo 9 - BlockingNoIndex

OYSTER uses indices to build the candidate lists used when matching is performed.

In the original design of OYSTER, the attribute values of incoming references were inserted into an inverted index as a way to find the most probable match candidates for newly input references. However, this method of building the inverted index was found to have three major drawbacks.

1. It is predicated on the idea that each identity rule will have at least one exact match term. System performance tends to degrade dramatically when rules have one or more inexact match terms (LED, Scan, etc.).
2. The index key is always a value from a single attribute. To find candidates for multi-term identity rules, the system has to perform multiple lookups, one for each rule term with an exact match, before reducing the candidate set to a manageable size.
3. The index logic is fixed. Therefore, to gain maximum performance, users must tailor identity rules to fit the logic of the index scheme rather than using an index logic that best fits the identity rules being used.
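The second drawback can be illustrated with a minimal Python sketch of a single-attribute inverted index. This is not OYSTER's actual code: the function names, the attribute names, and the choice to reduce the candidate set by intersecting the per-term lookups are all assumptions made for illustration.

```python
from collections import defaultdict

def build_inverted_index(records, attributes):
    """Map each (attribute, value) pair to the ids of records holding it.

    Hypothetical helper: every key comes from a single attribute.
    """
    index = defaultdict(set)
    for rec_id, rec in records.items():
        for attr in attributes:
            value = rec.get(attr)
            if value:
                index[(attr, value)].add(rec_id)
    return index

def find_candidates(index, reference, exact_terms):
    """One lookup per exact-match rule term, then intersect the hits.

    Intersection is an assumed reduction strategy, chosen for simplicity.
    """
    result = None
    for attr in exact_terms:
        hits = index.get((attr, reference.get(attr)), set())
        result = set(hits) if result is None else result & hits
    return result if result is not None else set()

# Hypothetical records and attribute names, not taken from the demo input.
records = {
    "r1": {"FirstName": "JOHN", "LastName": "SMITH"},
    "r2": {"FirstName": "JANE", "LastName": "SMITH"},
    "r3": {"FirstName": "JOHN", "LastName": "JONES"},
}
index = build_inverted_index(records, ["FirstName", "LastName"])
incoming = {"FirstName": "JOHN", "LastName": "SMITH"}
print(find_candidates(index, incoming, ["FirstName", "LastName"]))
```

A rule with two exact-match terms costs two lookups here, and a rule with no exact-match term has nothing to look up at all, which is the situation the first drawback describes.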

The new user-defined index scheme will allow the user to direct OYSTER to create a single index value that represents multiple terms (attributes) in a rule. It will also allow the user to define more than one index, so that an index can be customized for specific rules. Each index defined by the user will be a single value formed by concatenating a series of “hash values.” Each hash value is created by applying a pre-defined transformation to the value of an attribute. Many of the hash algorithms may be the same as, or directly related to, a particular similarity function (Soundex, Scan, etc.).
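The idea of one index value built from several hash values can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names (`soundex`, `left_chars`, `build_index_key`) and the attribute names are hypothetical, the Soundex routine is a plain American Soundex written for this example, and none of it reflects how OYSTER names or configures its hash functions.

```python
from collections import defaultdict

# Letter-to-digit table for American Soundex.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit

def soundex(value):
    """Return the 4-character Soundex code of value ('' if no letters)."""
    letters = [c for c in value.upper() if c.isalpha()]
    if not letters:
        return ""
    digits = []
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (letters[0] + "".join(digits) + "000")[:4]

def left_chars(value, n):
    """Return the first n alphanumeric characters of value, upper-cased."""
    return "".join(c for c in value.upper() if c.isalnum())[:n]

def build_index_key(record, terms):
    """Concatenate one hash value per (attribute, hash function) term."""
    return "".join(hash_fn(record.get(attr, "")) for attr, hash_fn in terms)

# Hypothetical index definition: Soundex of last name + first initial.
terms = [("LastName", soundex), ("FirstName", lambda v: left_chars(v, 1))]

records = [
    {"FirstName": "Jon", "LastName": "Smith"},
    {"FirstName": "John", "LastName": "Smyth"},
    {"FirstName": "Mary", "LastName": "Jones"},
]
blocks = defaultdict(list)
for rec in records:
    blocks[build_index_key(rec, terms)].append(rec)

print(sorted(blocks))  # ['J520M', 'S530J']
```

Records that share a key land in the same block, so only those records need to be compared by the matching rules. Choosing hash functions that mirror the rule's similarity functions keeps true matches in the same block.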

This ability to build custom indices provides drastic improvements in runtimes. The tradeoff is that the user must have an intimate knowledge of the rules and data to build optimized indices. When indices are designed correctly, the OYSTER runtime can decrease from hours, and possibly days, to a matter of minutes. It is important to note that with the introduction of user-defined indices (UDI) into the OYSTER system, the default inverted index has been removed; if no UDI is defined, OYSTER will do brute-force comparisons (compare every record with every other record).
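To give a sense of scale, the following Python sketch counts pairwise comparisons with and without blocking. It is simple arithmetic, not a model of OYSTER's internal comparison loop: the function names are hypothetical, the block sizes are invented for illustration, and the brute-force figure is an upper bound that assumes every record really is compared with every other record.

```python
def pair_count(n):
    """Number of distinct unordered pairs among n records: n(n-1)/2."""
    return n * (n - 1) // 2

def blocked_pair_count(block_sizes):
    """Comparisons needed when records are compared only within a block."""
    return sum(pair_count(size) for size in block_sizes)

# Brute force over 1001 records, the size of this demo's input file.
print(pair_count(1001))  # 500500

# Hypothetical blocking: the same 1001 records split into 91 blocks of 11.
print(blocked_pair_count([11] * 91))  # 5005
```

Because the brute-force count grows quadratically with the number of records, a run that is tolerable at 1001 records becomes far more expensive on larger inputs.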

This demo shows the Identity Capture configuration without a blocking index.

The input file contains 1001 records with 15 attributes. The details of the input data can be seen in the screenshot below:

Capture.PNG

Without an index, the attributes script lists two Boolean matching rules.

Capture.PNG

To run OYSTER from the command line, enter ‘BlockingNoIndexRunScript.xml’ and press Enter to perform the run, as shown in the screenshot below.

Capture.PNG

Information about the run will be displayed in the Command Prompt. For this run, 1001 references were processed and grouped into 833 identities. The running time was 16 seconds. The OYSTER run statistics for this run are shown in the screenshots below.

Capture.PNG Capture1.PNG Capture2.PNG

After the run finishes, the Output folder will contain several files, which are shown in the screenshot below.

Capture.PNG

The link file output follows the same logic as in the Identity Capture configuration. OYSTER creates persistent identifiers for the identities and stores them in the link file. Because they are persistent, these IDs are the same as those generated in the previous MergePurge run, and the matches were obtained by the same method described previously.

11.PNG
