Demo 2 - IdentityCapture

Identity Capture is a form of entity resolution in which the system builds (learns) a set of identities from the references it processes rather than starting with a known set of identities.

This run uses the test source file named ‘IdentityCaptureTest.txt’. The file contains the same six references that were used in the previous Merge-purge example; they are shown in Figure 1.

1.JPG

Figure 1: Identity Capture Source Input

The Match Rules defined for this run are identical to the Match Rules used in the Merge-purge run. This was done to show that the IDs produced are consistent across the different types of runs. The rules are shown in Figure 2.

2.JPG

Figure 2: Identity Capture Match Rules

The difference between the previous Merge-purge configuration and this Identity Capture configuration is that Identity Capture creates an identity file. This file acts as a knowledgebase containing all the entity identity structures (EIS) constructed from the source references during the run, and it will be used as input for future OYSTER runs in this guide. In other words, this run configuration constructs an initial knowledgebase that can be updated and maintained with future runs.
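The idea of building identities from the references themselves can be illustrated with a minimal Python sketch. It is not OYSTER's implementation: the function names, the `same_name_and_phone` rule, the sample references, and the `ID1`-style identifiers are all hypothetical and chosen only for illustration. The sketch starts with no known identities, links every pair of references that satisfies the match rule, and returns a dictionary that plays the role of the knowledgebase.

```python
from itertools import combinations


def capture_identities(references, match):
    """Cluster references into identities, starting from no known identities.

    `references` maps a reference id to its attributes; `match` is a
    hypothetical pairwise rule that returns True when two references
    should belong to the same identity.
    """
    parent = {ref_id: ref_id for ref_id in references}

    def find(x):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Link every pair of references that the rule says match.
    for a, b in combinations(references, 2):
        if match(references[a], references[b]):
            parent[find(a)] = find(b)

    clusters = {}
    for ref_id in references:
        clusters.setdefault(find(ref_id), []).append(ref_id)

    # Illustrative identifiers only; real OYSTER IDs have a different form.
    return {
        f"ID{n}": members
        for n, members in enumerate(sorted(clusters.values()), start=1)
    }


def same_name_and_phone(r1, r2):
    """Hypothetical match rule: case-insensitive name plus exact phone."""
    return r1["name"].lower() == r2["name"].lower() and r1["phone"] == r2["phone"]


# Made-up sample references, not the contents of IdentityCaptureTest.txt.
references = {
    "r1": {"name": "Ann Lee", "phone": "555-0101"},
    "r2": {"name": "ann lee", "phone": "555-0101"},
    "r3": {"name": "Bob Ray", "phone": "555-0102"},
}
knowledge_base = capture_identities(references, same_name_and_phone)
```

Here `knowledge_base` groups `r1` and `r2` under one identity and leaves `r3` on its own, which mirrors how an identity file keeps the references of each identity together for later runs.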

  1. Run OYSTER
  2. Enter ‘IdentityCaptureRunScript.xml’ and press Enter to perform the run as shown in Figure 3.

3.JPG

Figure 3: Running Identity Capture Run Script

  3. Information about the run is displayed in the Command Prompt. For this run, 6 references are processed and grouped into 3 identities. The OYSTER run statistics for this run are shown in Figures 4 through 7.

4.JPG 5.JPG 6.JPG 7.JPG

Figures 4-7: Identity Capture OYSTER Run Statistics

  4. After the run finishes, the Output folder contains the IdentityCaptureIndex.link, IdentityCaptureOutput.idty, Identity Change Report.txt, Identity Merge Map.csv, IdentityCaptureOutput.idty.emap, and IdentityCaptureOutput.indx files, as shown in Figure 8. The .emap and .indx files are generated because the Explanation and Debug attributes in the RunScript are set to “On”.

8.JPG

Figure 8: Identity Capture Run Output Folder

  5. OYSTER creates the persistent identifiers for the identities and stores them in the IdentityCaptureIndex.link file, shown in Figure 9. Because the identifiers are persistent, these IDs are the same ones that were generated in the previous Merge-purge run, and the matches were found by the same method described previously.

9.JPG

Figure 9: IdentityCaptureIndex.link file
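The list of output files from the steps above can be checked with a short script. This is only a convenience sketch: the file names come from this page, but the helper `missing_outputs` is hypothetical and the folder path `Output` is an assumption about where your OYSTER installation writes its results.

```python
from pathlib import Path

# File names as listed for the Identity Capture run on this page.
EXPECTED_OUTPUTS = [
    "IdentityCaptureIndex.link",
    "IdentityCaptureOutput.idty",
    "Identity Change Report.txt",
    "Identity Merge Map.csv",
    "IdentityCaptureOutput.idty.emap",
    "IdentityCaptureOutput.indx",
]


def missing_outputs(output_dir):
    """Return the expected output files that are not present in output_dir."""
    folder = Path(output_dir)
    return [name for name in EXPECTED_OUTPUTS if not (folder / name).is_file()]


if __name__ == "__main__":
    # "Output" is an assumed relative path; adjust it to your installation.
    missing = missing_outputs("Output")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected Identity Capture outputs are present.")
```

If the .emap or .indx files are reported missing, the Explanation and Debug attributes in the RunScript are worth checking first, since those two files are only generated when the attributes are set to “On”.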

Because this is an IdentityCapture run, OYSTER built the identity file and stored it as IdentityCaptureOutput.idty. This file is the Identity Knowledge Base that can be updated and maintained in future runs. Its contents are shown in Figures 10 and 11. As you can see, the references with the same OYSTER ID are grouped together in the .idty output file. The Trace values attached to each Reference allow it to be traced back to its origin later, even after many updates to this knowledge base.

10.JPG 11.JPG

Figures 10-11: IdentityCaptureOutput.idty File

Figure 12 shows the Identity Change Report for this run. The report shows that the run identified three identities and that three new identities were created. This is because an Identity Capture run retains the identities it finds and stores them in the .idty file shown above in Figures 10 and 11.

12.JPG

Figure 12: Identity Change Report for Identity Capture

You may replace the input data in the IdentityCaptureTest.txt file with your data, and edit the IdentityCaptureSourceDescriptor.xml, IdentityCaptureAttributes.xml, and IdentityCaptureRunScript.xml files to correspond to your new data. Detailed information for each of the XML configurations can be found in the OYSTER Reference Guide.

The identity file created in this run will act as the input for future runs that update and maintain the knowledgebase.

Back to OYSTER Demonstration Run page
