Demo 1 - MergePurge

Merge-purge is a form of Entity Resolution (ER) in which entity references are systematically compared with each other and separated into clusters (subsets) of equivalent records. It is the most common form of ER and is also known as record linkage. A merge-purge run looks for equivalent records within the source input file in order to group them, and it uses no previously defined Identity input file.

This run uses the test data file named ‘MergePurgeTest.txt’, illustrated in Figure 3. The data consists of six references, each composed of five attributes. The first attribute, IdentityID, is a unique identifier associated with each record (it must be explicitly identified in the source descriptor for the run). The other attributes are FirstName, LastName, SchoolCode, and DOB. Together, these attributes define a set of sample student references.

Figure 3: Merge-Purge Source Input

This run uses the set of matching rules defined in Figure 4.

Figure 4: Merge-purge Match Rules

1. At the prompt opened earlier, enter 'MergePurgeRunScript.xml' and press Enter to perform the run, as shown in Figure 5.

Figure 5: Running MergePurge Run Script

2. Information about the run is displayed in the Command Prompt. For this run, 6 references are processed and grouped into 3 identities. The OYSTER run statistics are shown in Figures 6-9.


Figure 6-9: Merge Purge OYSTER Run Statistics

3. After the run finishes, the Output folder contains the MergePurgeIndex.link file along with some other auto-generated files, as shown in Figure 10.

Figure 10: MergePurge Run Output Folder

4. OYSTER creates the persistent identifiers for identities and stores them in the MergePurgeIndex.link file. The MergePurgeIndex.link file is shown in Figure 11.

In this run, records 1, 3, and 5 are assigned the OysterID XVI8NV5E03OWX86Y. These records were identified as a single entity through a combination of Rule 1, Rule 2, and transitive closure. First, records 3 and 5 were matched by Rule 1 because their FirstName, LastName, and DOB match exactly. Next, record 1 was matched with record 5 by Rule 2 because their LastName, SchoolCode, and DOB match exactly. Lastly, through transitive closure, record 1 was found to match record 3. Records 2 and 4 are assigned the OysterID MW9AGFLZ2A1ENXZ5. These records were matched by Rule 1 because their FirstName, LastName, and DOB match exactly. Record 6 is assigned the OysterID FYONETPU881DH2L0 by itself because no other record matches it under any of the specified rules.

Figure 11: MergePurgeIndex.link file
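The grouping described for this run can be illustrated with a small Python sketch. It is not OYSTER's implementation; it only mimics the two exact-match rules as they are described above (Rule 1: FirstName, LastName, DOB; Rule 2: LastName, SchoolCode, DOB) and applies transitive closure with a union-find structure. The record values are hypothetical and are not the contents of MergePurgeTest.txt; they were chosen only so that the same grouping pattern appears. The names `RULES`, `matches`, and `cluster` are likewise assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample records keyed by record ID (NOT the real MergePurgeTest.txt data).
records = {
    1: {"FirstName": "Liz", "LastName": "Smith", "SchoolCode": "S100", "DOB": "2001-04-12"},
    2: {"FirstName": "John", "LastName": "Doe", "SchoolCode": "S200", "DOB": "2000-09-30"},
    3: {"FirstName": "Elizabeth", "LastName": "Smith", "SchoolCode": "S300", "DOB": "2001-04-12"},
    4: {"FirstName": "John", "LastName": "Doe", "SchoolCode": "S250", "DOB": "2000-09-30"},
    5: {"FirstName": "Elizabeth", "LastName": "Smith", "SchoolCode": "S100", "DOB": "2001-04-12"},
    6: {"FirstName": "Mary", "LastName": "Jones", "SchoolCode": "S100", "DOB": "1999-01-05"},
}

# Each rule lists the attributes that must all match exactly.
RULES = {
    "Rule 1": ("FirstName", "LastName", "DOB"),
    "Rule 2": ("LastName", "SchoolCode", "DOB"),
}


def matches(a, b, rules):
    """Return True if records a and b agree exactly on every attribute of at least one rule."""
    return any(all(a[field] == b[field] for field in fields) for fields in rules.values())


def cluster(records, rules):
    """Group record IDs into clusters: pairwise rule matches plus transitive closure."""
    parent = {rid: rid for rid in records}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[max(rx, ry)] = min(rx, ry)

    # Compare every pair of records; merging clusters gives transitive closure.
    for x, y in combinations(sorted(records), 2):
        if matches(records[x], records[y], rules):
            union(x, y)

    groups = defaultdict(list)
    for rid in sorted(records):
        groups[find(rid)].append(rid)
    return sorted(groups.values())


if __name__ == "__main__":
    for members in cluster(records, RULES):
        print(members)
```

In this made-up data, records 1 and 3 satisfy neither rule directly; they end up in the same cluster only because each of them matches record 5, which is the effect transitive closure has in the run described above.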

You may replace the input data in the MergePurgeTest.txt file with your data, and edit the MergePurgeSourceDescriptor.xml, MergePurgeAttributes.xml, and MergePurgeRunScript.xml files to correspond to your new data. Detailed information for each of the XML configurations can be found in the OYSTER Reference Guide.

