Example - Identity_Capture

This run will use the test source file named ‘IdentityCaptureTest.txt’, shown in Figure 47. This data consists of the same six references that were used for the previous Merge-purge example. Note that the source file from the Merge-purge example could have been reused without creating a copy under a new name, by placing the path to the Merge-purgeTest.txt file in the OysterSourceDescriptor.xml defined for this example. A separate copy was made instead so that this example is self-contained.

Figure 47.PNG

After analyzing the source data, the source descriptor file can be created. For this example it is named ‘IdentityCaptureSourceDescriptor.xml’ and is shown in Figure 48.

Figure 48.PNG

Following the same process that was used when setting up the merge-purge example, once the source descriptor is defined the source attributes file must also be defined. This file is stored in the Source folder along with the Source Descriptor file. The attributes file defines the attributes in the source, the algorithm used to compare each attribute, and the matching (identity) rules used when ER is performed. For this sample run two identity rules will be used. The first rule states that two references are considered equivalent if the FirstName, LastName, and DOB attributes match. The second rule states that two references are equivalent if the LastName, DOB, and SchoolCode (LEA) attributes match. These are the same rules that were used for the merge-purge example. The source attributes file is named ‘IdentityCaptureAttributes.xml’ and is depicted in Figure 49.

Figure 49.PNG

The attributes file may look familiar. Because the same source records and the same identity rules are used for both the merge-purge example and this identity capture example, the two attributes files are defined identically.
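The logic of the two identity rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not OYSTER's implementation: the sample references are hypothetical (they are not the contents of IdentityCaptureTest.txt), the comparison is assumed to be an exact string match, and the rule names "1" and "2" and the helper functions `matching_rule` and `resolve` are invented for this sketch.

```python
from collections import defaultdict

# Hypothetical references for illustration only; NOT the rows of IdentityCaptureTest.txt.
references = [
    {"RefID": "R1", "FirstName": "MARY", "LastName": "SMITH", "DOB": "2001-03-04", "SchoolCode": "100"},
    {"RefID": "R2", "FirstName": "MARIE", "LastName": "SMITH", "DOB": "2001-03-04", "SchoolCode": "100"},
    {"RefID": "R3", "FirstName": "MARY", "LastName": "SMITH", "DOB": "2001-03-04", "SchoolCode": "200"},
    {"RefID": "R4", "FirstName": "JOHN", "LastName": "DOE", "DOB": "2000-01-15", "SchoolCode": "100"},
]

# The two identity rules described in the text (rule names are assumptions).
RULES = {
    "1": ("FirstName", "LastName", "DOB"),
    "2": ("LastName", "DOB", "SchoolCode"),
}


def matching_rule(a, b):
    """Return the name of the first rule whose attributes all agree exactly, else None."""
    for name, attrs in RULES.items():
        # Empty values never count as a match.
        if all(a[k] and a[k] == b[k] for k in attrs):
            return name
    return None


def resolve(refs):
    """Cluster references with union-find so that matches chain transitively."""
    parent = {r["RefID"]: r["RefID"] for r in refs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, a in enumerate(refs):
        for b in refs[i + 1:]:
            if matching_rule(a, b):
                parent[find(b["RefID"])] = find(a["RefID"])

    clusters = defaultdict(list)
    for r in refs:
        clusters[find(r["RefID"])].append(r["RefID"])
    return sorted(clusters.values())


clusters = resolve(references)
print(clusters)
```

In this made-up data, R1 and R2 agree only under the second rule, R1 and R3 agree only under the first rule, and R2 and R3 match under neither rule directly, yet all three end up in one cluster because the matches are chained through R1.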

As with the merge-purge example, the last file that needs to be created is the Run Script for this example. For this identity capture example, no input identity file should be specified in the Run Script, but both the output identity file and the link file should be specified. The Run Script should again be stored in the root OYSTER folder, as this is where the OYSTER program expects the file to reside. The file for this sample is named ‘IdentityCaptureRunScript.xml’ and is shown in Figure 50.

Figure 50.PNG

Now that all the scripts for the identity capture example have been created, OYSTER can be run. This process is depicted in Figure 37, Figure 38, and Figure 39 and described in their surrounding text in the Example section.

Once the run is complete, OYSTER writes the output for the run to the command box. This output is shown in Figure 51 and Figure 52:

Figure 51.PNG

Figure 52.png

Figure 52: Output written to command box by OYSTER run - 2

By examining the output you can see that OYSTER processed 6 references and found that these 6 references belong to 3 real-world identities (groups). These results are identical to the results from the merge-purge example. This was expected, since the same source records were used in both examples and the same rules were used to match those records. The difference between the two examples is that the identity capture run stores the identities resulting from the ER process in an output identity file, in addition to creating a link file identical to the one seen in the merge-purge example. Both of these output files can be seen in the Z:\Oyster\Run002\Output folder, as shown in Figure 53.

Figure 53.PNG

The link file, shown in Figure 54, lists every reference that was read in from the source by its source ID. It also lists the OYSTER ID that was assigned to each reference and the rule, if any, that was used to match the reference to another reference. References that share the same OYSTER ID are linked records, meaning they refer to the same real-world entity.

Figure 54.PNG
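Grouping link records by OYSTER ID is a simple way to recover the clusters from a link file. The sketch below is illustrative only: the tab-delimited layout, the column names `RefID`, `OysterID`, and `Rule`, the sample values, and the helper `group_links` are all assumptions made for this example and are not taken from Figure 54.

```python
import csv
import io
from collections import defaultdict

# Hypothetical link data; the layout, column names, and values are assumptions.
LINK_TEXT = (
    "RefID\tOysterID\tRule\n"
    "S.1\tA1\t\n"
    "S.2\tA1\t2\n"
    "S.3\tB7\t\n"
    "S.4\tA1\t1\n"
)


def group_links(text):
    """Map each OYSTER ID to the list of source reference IDs linked to it."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        groups[row["OysterID"]].append(row["RefID"])
    return dict(groups)


groups = group_links(LINK_TEXT)
for oyster_id, ref_ids in groups.items():
    print(oyster_id, ref_ids)

# Number of references read and number of identities (groups) they resolve to.
print(sum(len(ids) for ids in groups.values()), "references,", len(groups), "identities")
```

To apply the same grouping to a real link file, replace `io.StringIO(text)` with an opened file handle and adjust the delimiter and column names to match the actual file.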

As discussed previously, identity capture is a form of entity resolution in which the system builds (learns) a set of identities from the references it processes rather than starting with a known set of identities. This set of identities is stored in the IdentityCaptureOutput.idty file, shown in Figure 55. The file was generated by this run and preserves the information derived from the three clusters of references.

Figure 55.PNG

By examining the identity structures created in the identity capture configuration you can see how each one directly corresponds to one of the Link Indexes shown in Figure 54. The OYSTER IDs in Figure 54 correspond to the Identity Identifiers in Figure 55.

As with the merge-purge example, this sample run was done using a delimited text file. Examples of how to connect to a fixed-width text file, a Microsoft Access database, MySQL, and Microsoft SQL Server can be found in the OYSTER Reference Guide.

Previous to Configuration - Identity Capture Page | Next to 7 - Identity Build from Assertions Page

Back to OYSTER User Guide Page
