Example Data Handler Two

Author: Daniel Needham (daniel.needham@manchester.ac.uk)

Overview

This example data handler shows how the names disambiguation library can be used to identify the unique individuals within a source data set. It then shows how the names database manager library can be used to match those individuals against existing names records and to add to or update a names database instance as necessary.

Within the Names context, a data handler is a component that retrieves metadata about individuals from a source data set and uses the disambiguation library first to normalise that data and then to compare the resulting records, producing a match score for each comparison. These match scores can then be used to determine candidate matches.
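As a rough illustration of the normalise-then-compare idea described above, the following minimal sketch cleans two raw names and derives a toy match score. It does not use the names-disambiguator library: the class `MatchScoreSketch`, the `Name` type, the methods `clean`, `normalise`, `matchScore` and `isCandidate`, and the 0.75 / 0.25 weights and threshold are all assumptions made for this example only.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical sketch: none of these names come from the names-disambiguator library.
class MatchScoreSketch {

    // Assumed cut-off above which a comparison counts as a candidate match.
    static final double CANDIDATE_THRESHOLD = 0.75;

    // A normalised record: cleaned family name plus the first initial of the given name.
    static final class Name {
        final String familyName;
        final String initial;

        Name(String familyName, String initial) {
            this.familyName = familyName;
            this.initial = initial;
        }
    }

    // Strip accents, punctuation and case so that variant spellings compare equal.
    static String clean(String raw) {
        String decomposed = Normalizer.normalize(raw, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "")   // drop combining accent marks
                .replaceAll("[^\\p{L} ]", "")        // keep letters and spaces only
                .trim()
                .toLowerCase(Locale.ROOT);
    }

    static Name normalise(String givenName, String familyName) {
        String given = clean(givenName);
        String initial = given.isEmpty() ? "" : given.substring(0, 1);
        return new Name(clean(familyName), initial);
    }

    // Toy score in [0, 1]: 0.75 for an equal family name, 0.25 for an equal initial.
    static double matchScore(Name a, Name b) {
        double score = 0.0;
        if (a.familyName.equals(b.familyName)) {
            score += 0.75;
        }
        if (!a.initial.isEmpty() && a.initial.equals(b.initial)) {
            score += 0.25;
        }
        return score;
    }

    static boolean isCandidate(double score) {
        return score >= CANDIDATE_THRESHOLD;
    }

    public static void main(String[] args) {
        Name source = normalise("Zo\u00eb", "O'Brien");
        System.out.println(matchScore(source, normalise("Z.", "OBRIEN")));    // 1.0
        System.out.println(matchScore(source, normalise("Anna", "O'Brien"))); // 0.75
        System.out.println(matchScore(source, normalise("Zoe", "Smith")));    // 0.25
    }
}
```

A real scorer would weigh many more fields than a family name and an initial; the point here is only that normalisation happens first, so that the comparison works on consistent values.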

Once the unique individuals within a data set have been identified, they can be added to an instance of the names database.

Dependencies

The example data handler is a Maven-managed Java application. Its dependencies are:

  1. Log4J
    • This should be picked up from Maven's central repository.
  2. names-disambiguator
    • This currently needs to be downloaded from here and added to your local Maven repository.
  3. names-database-manager
    • This currently needs to be downloaded from here and added to your local Maven repository.
  4. mysql-connector-java
    • This should be picked up from Maven's central repository.

The example performs the following steps:

  1. Build an example data source in memory.
  2. Iterate through the data source in batches, transforming each record into a normalised names record.
  3. Use the names disambiguator to derive match scores for comparisons between each record.
  4. Use the names database manager to find candidate matches to the source record within the existing names records.
  5. Add to or update the names database as required.
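
The five steps above can be sketched end to end as follows. This is a hedged illustration only: it uses neither the names-disambiguator nor the names-database-manager API. The class `DataHandlerSketch`, its methods, the "Family, Given" input format, the exact-equality score and the in-memory map standing in for the names database are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of the five steps; it does not call the real Names libraries.
class DataHandlerSketch {

    static final double THRESHOLD = 1.0; // assumed cut-off for a candidate match

    // Step 2 helper: reduce a raw "Family, Given" string to a normalised key.
    static String normalise(String raw) {
        String[] parts = raw.split(",", 2);
        String family = parts[0].trim().toLowerCase(Locale.ROOT);
        String given = parts.length > 1 ? parts[1].trim().toLowerCase(Locale.ROOT) : "";
        String initial = given.isEmpty() ? "" : given.substring(0, 1);
        return family + "|" + initial;
    }

    // Toy match score: 1.0 for identical normalised keys, otherwise 0.0.
    static double matchScore(String a, String b) {
        return a.equals(b) ? 1.0 : 0.0;
    }

    // Return the first candidate whose score reaches the threshold, or null.
    static String findMatch(String record, Iterable<String> candidates) {
        for (String candidate : candidates) {
            if (matchScore(record, candidate) >= THRESHOLD) {
                return candidate;
            }
        }
        return null;
    }

    // Runs steps 2 to 5. The returned map stands in for the names database:
    // normalised record -> number of source records merged into it.
    static Map<String, Integer> process(List<String> source, int batchSize) {
        Map<String, Integer> database = new LinkedHashMap<>();
        for (int start = 0; start < source.size(); start += batchSize) {
            int end = Math.min(start + batchSize, source.size());

            // Step 2: transform each record in the batch into a normalised record.
            List<String> normalised = new ArrayList<>();
            for (String raw : source.subList(start, end)) {
                normalised.add(normalise(raw));
            }

            // Step 3: compare records within the batch to find unique individuals.
            Map<String, Integer> unique = new LinkedHashMap<>();
            for (String record : normalised) {
                String match = findMatch(record, unique.keySet());
                unique.merge(match != null ? match : record, 1, Integer::sum);
            }

            // Steps 4 and 5: match against existing records, then add or update.
            for (Map.Entry<String, Integer> entry : unique.entrySet()) {
                String match = findMatch(entry.getKey(), database.keySet());
                String key = match != null ? match : entry.getKey();
                database.merge(key, entry.getValue(), Integer::sum);
            }
        }
        return database;
    }

    public static void main(String[] args) {
        // Step 1: build an example data source in memory.
        List<String> source = Arrays.asList(
                "Smith, Jane", "smith, J.", "Jones, Ann", "Smith, John", "Lee, Kim");
        System.out.println(process(source, 2)); // {smith|j=3, jones|a=1, lee|k=1}
    }
}
```

Note that this toy key merges "Smith, Jane" and "Smith, John" into one individual, which is exactly the kind of false match that graded match scores over more fields are meant to avoid.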