Names Disambiguator

Author: Daniel Needham (daniel.needham@manchester.ac.uk)

Overview

The Names Disambiguator library compares disparate records about individuals held in external data sources, in an effort to determine the unique individuals that they describe.

This library is separate from the underlying data architecture of Names. It is intended only to provide a score for the level of likeness between a number of source records, so it is entirely flexible in how it is used and for what end purpose.

The main entry class provides three methods:

  • disambiguate

    Takes an ArrayList of normalised records and iteratively calls for a match comparison to be performed between each record and every other record in the list.

    The result of each match comparison is added to the respective normalised record's list of match scores.

    After disambiguate has been run on a list of normalised records, each normalised record will hold a match score (available through getMatchResults()) for every other record in the collection. For example:

    Record-A {Record-B (75.0), Record-C (33.3), Record-D (0.0)}
    Record-B {Record-A (75.0), Record-C (12.0), Record-D (0.0)}
    Record-C {Record-A (33.3), Record-B (12.0), Record-D (80.0)}
    Record-D {Record-A (0.0), Record-B (0.0), Record-C (80.0)}

    These match scores can then be used however they are needed.

  • match

    Compares the attributes of two normalised records and assigns a match score between 0 and 100 accordingly. The weighting of each attribute can be adjusted within the Disambiguator class.

  • merge

    Takes a list of normalised records and, where a match score above the specified threshold is found, merges those records. The first record examined is considered the 'root' record: any attributes in the second record that don't exist in the first are added to the first, and the second is then ignored. If there are no records to merge for a specific record, it is added to the returned list as is. If no comparison has been done beforehand, with the results put into each record's matchResult attribute, then no merges will occur.
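The scoring behaviour described for match and disambiguate can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the DisambiguatorSketch and SketchRecord classes, the attribute names, the weightings and the exact-match comparison rule are hypothetical and are not the library's own classes or API. Only the getMatchResults() name and the 0 to 100 score range come from the description above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DisambiguatorSketch {
    // Hypothetical stand-in for a normalised record; not the library's class.
    static class SketchRecord {
        final String id;
        final Map<String, String> attributes = new LinkedHashMap<>();
        // Match scores keyed by the id of the record compared against.
        final Map<String, Double> matchResults = new LinkedHashMap<>();

        SketchRecord(String id) { this.id = id; }

        SketchRecord with(String name, String value) {
            attributes.put(name, value);
            return this;
        }

        Map<String, Double> getMatchResults() { return matchResults; }
    }

    // Assumed weightings; they sum to 100, so a match on every
    // weighted attribute scores 100.
    static final Map<String, Double> WEIGHTS = new LinkedHashMap<>();
    static {
        WEIGHTS.put("familyName", 50.0);
        WEIGHTS.put("givenName", 25.0);
        WEIGHTS.put("affiliation", 25.0);
    }

    // Adds the weight of each attribute that both records hold with the
    // same value (ignoring case). A missing attribute contributes nothing.
    static double match(SketchRecord a, SketchRecord b) {
        double score = 0.0;
        for (Map.Entry<String, Double> weight : WEIGHTS.entrySet()) {
            String left = a.attributes.get(weight.getKey());
            String right = b.attributes.get(weight.getKey());
            if (left != null && left.equalsIgnoreCase(right)) {
                score += weight.getValue();
            }
        }
        return score;
    }

    // Compares every pair of records once and stores the score on both.
    static void disambiguate(List<SketchRecord> records) {
        for (int i = 0; i < records.size(); i++) {
            for (int j = i + 1; j < records.size(); j++) {
                SketchRecord a = records.get(i);
                SketchRecord b = records.get(j);
                double score = match(a, b);
                a.getMatchResults().put(b.id, score);
                b.getMatchResults().put(a.id, score);
            }
        }
    }

    public static void main(String[] args) {
        List<SketchRecord> records = new ArrayList<>();
        records.add(new SketchRecord("Record-A")
                .with("familyName", "Needham")
                .with("givenName", "Daniel")
                .with("affiliation", "Manchester"));
        records.add(new SketchRecord("Record-B")
                .with("familyName", "Needham")
                .with("givenName", "D.")
                .with("affiliation", "Manchester"));
        records.add(new SketchRecord("Record-C")
                .with("familyName", "Smith")
                .with("givenName", "Jane"));

        disambiguate(records);
        for (SketchRecord r : records) {
            // First line printed: Record-A {Record-B=75.0, Record-C=0.0}
            System.out.println(r.id + " " + r.getMatchResults());
        }
    }
}
```

Storing each score on both records mirrors the symmetric result table shown for disambiguate. A real comparison would probably need something softer than exact string equality, such as handling initials or spelling variants.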

The above methods can be used by an external class to obtain match scores for a number of internal source records, and to merge the records whose match scores meet a certain threshold. To do this, the external class must first pull in the source data from wherever it resides. NamesDisambiguator then provides a number of normalised types that can be used to transform source records into normalised records, which allows general comparison across different data sets. This transformation is also necessary where comparison with existing Names records (described elsewhere) needs to happen.
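The workflow above (pull in source data, normalise it, then merge records whose scores pass a threshold) might be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the MergeSketch class, its Rec type, the source column names (surname, forename, org) and the normalised attribute names are all hypothetical. The match scores are set by hand here to stand in for a prior comparison step.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class MergeSketch {
    // Hypothetical normalised record: attributes plus match scores by record id.
    static class Rec {
        final String id;
        final Map<String, String> attributes = new LinkedHashMap<>();
        final Map<String, Double> matchResults = new LinkedHashMap<>();
        Rec(String id) { this.id = id; }
    }

    // Maps one source row onto shared attribute names (assumed column names).
    static Rec normalise(String id, Map<String, String> sourceRow) {
        Rec rec = new Rec(id);
        copyIfPresent(sourceRow, "surname", rec, "familyName");
        copyIfPresent(sourceRow, "forename", rec, "givenName");
        copyIfPresent(sourceRow, "org", rec, "affiliation");
        return rec;
    }

    static void copyIfPresent(Map<String, String> row, String from, Rec to, String as) {
        String value = row.get(from);
        if (value != null && !value.trim().isEmpty()) {
            to.attributes.put(as, value.trim());
        }
    }

    // Folds each record scoring above the threshold into the first ("root")
    // record examined; records with nothing to merge are returned as they are.
    static List<Rec> merge(List<Rec> records, double threshold) {
        Map<String, Rec> byId = new LinkedHashMap<>();
        for (Rec r : records) byId.put(r.id, r);
        Set<String> done = new HashSet<>(); // ids already emitted or absorbed
        List<Rec> merged = new ArrayList<>();
        for (Rec root : records) {
            if (!done.add(root.id)) continue; // absorbed by an earlier root
            for (Map.Entry<String, Double> score : root.matchResults.entrySet()) {
                Rec other = byId.get(score.getKey());
                if (other == null || done.contains(other.id)
                        || score.getValue() <= threshold) continue;
                // Only attributes missing from the root are copied across.
                for (Map.Entry<String, String> attr : other.attributes.entrySet()) {
                    root.attributes.putIfAbsent(attr.getKey(), attr.getValue());
                }
                done.add(other.id);
            }
            merged.add(root);
        }
        return merged;
    }

    static Map<String, String> row(String... keyValues) {
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) row.put(keyValues[i], keyValues[i + 1]);
        return row;
    }

    public static void main(String[] args) {
        Rec a = normalise("Record-A", row("surname", "Needham", "forename", "Daniel"));
        Rec b = normalise("Record-B", row("surname", "Needham", "org", "Manchester"));
        Rec c = normalise("Record-C", row("surname", "Smith"));

        // Scores set by hand in place of a real comparison step.
        a.matchResults.put("Record-B", 75.0); a.matchResults.put("Record-C", 0.0);
        b.matchResults.put("Record-A", 75.0); b.matchResults.put("Record-C", 0.0);
        c.matchResults.put("Record-A", 0.0);  c.matchResults.put("Record-B", 0.0);

        List<Rec> records = new ArrayList<>();
        records.add(a); records.add(b); records.add(c);
        for (Rec r : merge(records, 70.0)) {
            System.out.println(r.id + " " + r.attributes);
        }
    }
}
```

With a threshold of 70, Record-B is folded into Record-A, which gains the affiliation it lacked, while Record-C is returned unchanged. The sketch does not chain merges: a record that matches only an absorbed record, and not the root itself, stays separate. Whether the library behaves the same way is not stated in this document.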