Hashes

Motivation

Detecting duplicate posts is a particular problem for literature references, because users vary widely in how they enter fields such as the journal name or the author. On the one hand, a user should be allowed to keep several posts that differ only slightly. On the other hand, one might want to find other users' posts that refer to the same paper or book even if they are not completely identical.

To fulfill both goals we implemented two hashes for comparing publication posts: one for comparing the posts of a single user (intra hash) and one for comparing the posts of different users (inter hash). A hash is computed by normalizing and concatenating publication fields and hashing the result with the MD5 message digest algorithm; two posts are considered equal if their hashes are equal. MD5 is used for efficiency reasons only, since it yields a fixed-length value (128 bits, i.e., 32 hexadecimal characters) that is easy to store in the database. Storing the hashes along with the resources in the posts table enables fast comparison and search of posts.
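The normalize, concatenate, and hash pipeline can be sketched as follows. This is a minimal illustration, not the BibSonomy implementation: the class HashSketch, its method names, and the exact normalization rule (lower-casing, keeping only letters, digits, dots, and whitespace) are assumptions made for the example. Only the JDK's MessageDigest API is used.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Illustrative sketch only; HashSketch and its methods are hypothetical names
// and are not part of the BibSonomy code base.
final class HashSketch {

    // Assumed normalization: lower-case, drop everything except letters,
    // digits, dots and whitespace, then collapse whitespace runs to one space.
    static String normalize(String field) {
        if (field == null) {
            return "";
        }
        return field.toLowerCase(Locale.ROOT)
                    .replaceAll("[^\\p{L}\\p{N}.\\s]", "")
                    .replaceAll("\\s+", " ")
                    .trim();
    }

    // Returns the MD5 digest of the input as 32 lower-case hex characters.
    static String md5Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            // %032x keeps leading zeros so the result always has 32 characters.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Normalizes each field, joins the results with single spaces, and hashes.
    static String hashFields(String... fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(normalize(fields[i]));
        }
        return md5Hex(sb.toString());
    }
}
```

Under these assumptions, HashSketch.hashFields("Example Publication", "2006") and HashSketch.hashFields("example   publication!", "2006") yield the same 32-character string, because both inputs normalize to the same text before hashing.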

The intra hash is relatively strict and takes into account the fields title, author, editor, year, entrytype, journal, booktitle, volume, and number. This lets a user keep separate posts for publications that share title, authors, and year but differ in one of the other fields, such as the volume or the entry type (e.g., a technical report and the corresponding journal article).

In contrast, the inter hash is less specific and includes only the title, the year, and the author, or the editor if the user has entered no author.
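The difference in strictness can be illustrated with a small sketch that builds only the strings that would be hashed and skips the MD5 step. Everything here is hypothetical: the Pub class stands in for a publication post, norm is a simplified normalization, and intraKey uses only a subset of the intra hash fields to keep the example short.

```java
import java.util.Locale;

// Illustrative sketch only; none of these names exist in BibSonomy.
final class HashScopeSketch {

    // Minimal stand-in for a publication post.
    static final class Pub {
        final String title;
        final String author;
        final String year;
        final String entrytype;
        final String volume;

        Pub(String title, String author, String year, String entrytype, String volume) {
            this.title = title;
            this.author = author;
            this.year = year;
            this.entrytype = entrytype;
            this.volume = volume;
        }
    }

    // Simplified normalization (an assumption): lower-case, letters and digits only.
    static String norm(String s) {
        return s == null ? "" : s.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{N}]", "");
    }

    // Strict key for one user's posts: includes entry type and volume.
    static String intraKey(Pub p) {
        return String.join(" ", norm(p.title), norm(p.author), norm(p.year),
                           norm(p.entrytype), norm(p.volume));
    }

    // Loose key for matching across users: title, author and year only.
    static String interKey(Pub p) {
        return String.join(" ", norm(p.title), norm(p.author), norm(p.year));
    }
}
```

With a technical report and a journal article that share title, author, and year, interKey returns the same string for both, so other users' posts of either version are found together, while intraKey differs, so one user can store both.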

In both hashes, all fields that are taken into account are normalized: certain special characters are removed, and whitespace and author/editor names are normalized. A person name is normalized by joining the first letter of the first name and the last name with a dot, both in lower case. Persons are then sorted alphabetically by this string and joined with colons.
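The name normalization described above could look roughly like the following sketch. PersonKeySketch and its methods are hypothetical and are not the PersonNameUtils implementation; each person is passed as a {firstName, lastName} pair, and the handling of a missing first name is an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only; not the BibSonomy PersonNameUtils class.
final class PersonKeySketch {

    // First letter of the first name, a dot, then the last name, all lower case.
    // Assumption: without a first name, only the last name is used.
    static String normalizePerson(String firstName, String lastName) {
        String first = firstName == null ? "" : firstName.trim().toLowerCase(Locale.ROOT);
        String last = lastName == null ? "" : lastName.trim().toLowerCase(Locale.ROOT);
        if (first.isEmpty()) {
            return last;
        }
        return first.charAt(0) + "." + last;
    }

    // Normalizes every {firstName, lastName} pair, sorts the resulting
    // strings alphabetically and joins them with colons.
    static String normalizePersons(List<String[]> persons) {
        List<String> keys = new ArrayList<>();
        for (String[] person : persons) {
            keys.add(normalizePerson(person[0], person[1]));
        }
        Collections.sort(keys);
        return String.join(":", keys);
    }
}
```

For the persons John Doe and Anna Bauer this sketch returns "a.bauer:j.doe" regardless of the order in which they are given, which is why sorting matters: the same set of authors entered in a different order still produces the same hash input.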

More details about the hashing can be found in the following publication: Mapping Bibliographic Records with Bibliographic Hash Keys. Jakob Voss, Andreas Hotho, and Robert Jäschke. Information: Droge, Ware oder Commons?, Hochschulverband Informationswissenschaft, Verlag Werner Hülsbusch, 2009.

Demo

This demonstration allows you to enter publication metadata and see how the hash changes. This example shows the hashes for the publication with the title "Example Publication" by the author "John Doe" from the year "2006".

Code

The source code to compute the hashes is located in the SimHash class in the bibsonomy-model module. It contains the following code to compute the intra hash:

public static String getSimHash2(final BibTex bibtex) {
   return
     StringUtils.getMD5Hash(StringUtils.removeNonNumbersOrLettersOrDotsOrSpace(bibtex.getTitle()) + " " +
     StringUtils.removeNonNumbersOrLettersOrDotsOrSpace(PersonNameUtils.serializePersonNames(bibtex.getAuthor(), false)) + " " +
     StringUtils.removeNonNumbersOrLettersOrDotsOrSpace(PersonNameUtils.serializePersonNames(bibtex.getEditor(), false)) + " " +
     StringUtils.removeNonNumbersOrLettersOrDotsOrSpace(bibtex.getYear()) + " " +
     StringUtils.removeNonNumbersOrLettersOrDotsOrSpace(bibtex.getEntrytype()) + " " +
     StringUtils.removeNonNumbersOrLettersOrDotsOrSpace(bibtex.getJournal()) + " " +
     StringUtils.removeNonNumbersOrLettersOrDotsOrSpace(bibtex.getBooktitle()) + " " +
     StringUtils.removeNonNumbersOrLetters(bibtex.getVolume()) + " " +
     StringUtils.removeNonNumbersOrLetters(bibtex.getNumber()));
}

The following code computes the inter hash:

public static String getSimHash1(final BibTex publication) {
   if (!present(StringUtils.removeNonNumbersOrLetters(PersonNameUtils.serializePersonNames(publication.getAuthor())))) {
      // no author set --> take editor
      return
         StringUtils.getMD5Hash(getNormalizedTitle(publication.getTitle()) + " " +
         PersonNameUtils.getNormalizedPersons(publication.getEditor()) + " " +
         getNormalizedYear(publication.getYear()));
   }
   // author set
   return
      StringUtils.getMD5Hash(getNormalizedTitle(publication.getTitle()) + " " +
      PersonNameUtils.getNormalizedPersons(publication.getAuthor()) + " " +
      getNormalizedYear(publication.getYear()));
}

To see how the helper functions (e.g., removeNonNumbersOrLetters) work, have a look at the StringUtils class.
