Example

This run uses the test data file named ‘Merge-purgeTest.txt’, illustrated in Figure 33. The data consists of six references, each composed of five attributes. The first attribute, IdentityID, is a unique identifier associated with each record. The other attributes are FirstName, LastName, SchoolCode, and DOB. Combined as they appear in the source file, these attributes define a set of sample student references.

Figure 33.PNG
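
To make the layout of the source data concrete, the following minimal Python sketch reads a delimited text file with the five attributes named above. It is only an illustration and is not part of OYSTER. The tab delimiter, the presence of a header row, and the two sample rows are assumptions made for this example; the real contents of ‘Merge-purgeTest.txt’ are the ones shown in Figure 33.

```python
import csv
import io

# The five attributes named in the text above.
FIELDS = ["IdentityID", "FirstName", "LastName", "SchoolCode", "DOB"]

# Hypothetical sample rows (NOT the real test data); tab delimiter and
# header row are assumptions made for this sketch.
SAMPLE = (
    "IdentityID\tFirstName\tLastName\tSchoolCode\tDOB\n"
    "R1\tMary\tSmith\tS01\t2001-04-12\n"
    "R2\tMary\tSmith\tS02\t2001-04-12\n"
)


def read_references(text, delimiter="\t"):
    """Parse delimited text into a list of dicts, one per reference."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    missing = [f for f in FIELDS if f not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing attributes: {missing}")
    # Strip surrounding whitespace from every value.
    return [{k: (v or "").strip() for k, v in row.items()} for row in reader]


refs = read_references(SAMPLE)
for ref in refs:
    print(ref["IdentityID"], ref["LastName"], ref["DOB"])
```

To read an actual file, the text could be loaded with `open(path, encoding="utf-8").read()` and passed to `read_references` with whatever delimiter the file really uses.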

After analyzing the source data, the source descriptor file can be created. This file specifies the location of the source data, the attributes used to define each reference in the source, and how to connect to the source. The contents of this file are illustrated in Figure 34.

Figure 34.PNG

Once the source descriptor is defined, the source attributes file must be defined. This file is stored in the Source folder along with the source descriptor file above. The attributes file defines the attributes in the source, the algorithm used to compare each attribute, and the matching (identity) rules used to perform ER. For this sample run, two identity rules are used. The first rule states that two references are considered equivalent if their FirstName, LastName, and DOB attributes match. The second rule states that two references are equivalent if their LastName, DOB, and SchoolCode attributes match. The source attributes file is named ‘MergePurgeAttributes.xml’ and is shown in Figure 35.

Figure 35.PNG
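
The effect of the two identity rules can be sketched in a few lines of Python. This is a simplified illustration, not OYSTER's implementation: it assumes exact string equality as the comparison for every attribute, assumes that links are combined transitively (if A matches B and B matches C, all three share a cluster), and uses six hypothetical references rather than the real test data. The names `REFS`, `RULES`, and `cluster` are invented for this example.

```python
from collections import defaultdict

# Six hypothetical references (NOT the contents of Merge-purgeTest.txt).
REFS = [
    {"id": "R1", "FirstName": "Mary", "LastName": "Smith", "SchoolCode": "S01", "DOB": "2001-04-12"},
    {"id": "R2", "FirstName": "Mary", "LastName": "Smith", "SchoolCode": "S02", "DOB": "2001-04-12"},
    {"id": "R3", "FirstName": "M.", "LastName": "Smith", "SchoolCode": "S02", "DOB": "2001-04-12"},
    {"id": "R4", "FirstName": "John", "LastName": "Doe", "SchoolCode": "S01", "DOB": "2000-09-30"},
    {"id": "R5", "FirstName": "Jon", "LastName": "Doe", "SchoolCode": "S01", "DOB": "2000-09-30"},
    {"id": "R6", "FirstName": "Ann", "LastName": "Lee", "SchoolCode": "S03", "DOB": "2002-01-05"},
]

# The two identity rules from the text: every listed attribute must match.
RULES = [
    ("FirstName", "LastName", "DOB"),
    ("LastName", "DOB", "SchoolCode"),
]


def cluster(refs, rules):
    """Group references that match under any rule (assumed transitive)."""
    parent = {r["id"]: r["id"] for r in refs}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for rule in rules:
        first_seen = {}  # attribute values -> first reference carrying them
        for r in refs:
            key = tuple(r[attr] for attr in rule)
            if any(value == "" for value in key):
                continue  # assumption: blank values never count as a match
            if key in first_seen:
                union(first_seen[key], r["id"])
            else:
                first_seen[key] = r["id"]

    groups = defaultdict(list)
    for r in refs:
        groups[find(r["id"])].append(r["id"])
    return sorted(groups.values())


clusters = cluster(REFS, RULES)
print(clusters)
```

In this hypothetical data, R1 and R2 agree under the first rule, R2 and R3 agree under the second, and R4 and R5 agree under the second, so the six references fall into three clusters.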

Once the source data has been obtained and the source descriptor and attributes files have been created, the last step is to configure the run script, the controlling XML file that tells OYSTER where to find all the other files. Because this is a merge-purge ER run, no input identities or output identities files should be specified in the run script. As mentioned above, the run script should be stored in the root OYSTER folder, as this is where the OYSTER program expects the file to reside. The file for this sample is named ‘MergePurgeRunScript.xml’ and is shown in Figure 36.

Figure 36.PNG

Now that all the scripts for the merge-purge sample have been created, we can run OYSTER. To run OYSTER, double-click the Oyster.bat file (highlighted in Figure 37) that was described earlier in the Launching OYSTER section of this document.

Figure 37.PNG

This action opens a command prompt and runs the Oyster.jar file from the command line. The command prompt asks you to type the name of the run script that was created, as shown in Figure 38. Be sure to include the .xml extension along with the file name. The file name is not case sensitive.

Figure 38.PNG

Once the name of the run script has been specified, as shown in Figure 39, press enter. (Please see the Invoking the OYSTER Run Script section for more information.)

Figure 39.PNG

Once the run is complete, the results of the run are written to the command prompt window.

Figure 40.PNG

Figure 41.png

Figure 41: Information generated by the OYSTER run in the command box (part 2)

By examining the output, as shown in Figure 40 and Figure 41, you can see that OYSTER processed 6 references and found that these 6 references belong to 3 real-world identities (Clusters).

Although multiple output files were created, the only output file needed for a merge-purge run is the link file. This file can be found in the Output folder, shown in Figure 42.

Figure 42.png

When the MergePurgeIndex.link file is opened, as shown in Figure 43, it lists every reference that was read in from the source by its source ID. It also lists the OYSTER ID that was assigned to each reference and the rule, if any, that was used to match the reference to another reference. References that share the same OYSTER ID are linked records, meaning that they represent the same real-world entity as discovered by performing the merge-purge with OYSTER.

Figure 43.PNG
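
Grouping the link file's rows by OYSTER ID recovers the clusters directly. The Python sketch below shows the idea under stated assumptions: the tab-delimited layout, the column names `RefID`, `OysterID`, and `Rule`, and every value in `LINK_TEXT` are hypothetical placeholders chosen for illustration. The actual layout of MergePurgeIndex.link is the one shown in Figure 43 and should be checked before adapting this.

```python
import csv
import io
from collections import defaultdict

# Hypothetical link records: the column names, delimiter, IDs, and rule
# labels are all assumptions, not the real MergePurgeIndex.link contents.
LINK_TEXT = (
    "RefID\tOysterID\tRule\n"
    "R1\tX1\t\n"
    "R2\tX1\t1\n"
    "R3\tX1\t2\n"
    "R4\tX2\t\n"
    "R5\tX2\t2\n"
    "R6\tX3\t\n"
)


def group_by_oyster_id(text, delimiter="\t"):
    """Map each OYSTER ID to the list of reference IDs that carry it."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        groups[row["OysterID"]].append(row["RefID"])
    return dict(groups)


linked = group_by_oyster_id(LINK_TEXT)
for oyster_id, ref_ids in linked.items():
    print(oyster_id, ref_ids)
```

Each key in `linked` corresponds to one real-world identity, and the length of the dictionary is the number of clusters found by the run.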

This sample run was done using a delimited text file. Examples of how to connect to a fixed-width text file, a Microsoft Access database, MySQL, and Microsoft SQL Server can be found in the OYSTER Reference Guide.
