Ad Hoc Monitoring of Vocabulary Shifts over Time - ground truth

This repository contains the ground truth material used to obtain the results reported in "Ad Hoc Monitoring of Vocabulary Shifts over Time", Tom Kenter, Melvin Wevers, Pim Huijnen, Maarten de Rijke, CIKM 2015.

If you use this material, please cite the paper:

@inproceedings{kenter2015vocabulary_shifts,
  title={Ad Hoc Monitoring of Vocabulary Shifts over Time},
  author={Kenter, Tom and Wevers, Melvin and Huijnen, Pim and de Rijke, Maarten},
  booktitle={CIKM},
  year={2015}
}

File format

There are 21 files, all of which are in the same format:

seed words<TAB>time period<TAB>candidate word<TAB>annotator 1 name<TAB>annotator 1 score<TAB>annotator 2 name<TAB>annotator 2 score
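
As a minimal sketch of reading this format, the Python snippet below splits each line on tabs into the seven fields listed above. The names `Judgment`, `parse_judgments`, and `read_judgments` are hypothetical and not part of this repository. The sketch assumes nothing about a header row or about the type of the scores, so every field is kept as a string and any line that does not have exactly seven fields is skipped.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Judgment(NamedTuple):
    """One line of a ground truth file (field names are illustrative)."""
    seed_words: str
    time_period: str
    candidate_word: str
    annotator_1_name: str
    annotator_1_score: str  # kept as a string; the score type is not assumed
    annotator_2_name: str
    annotator_2_score: str


def parse_judgments(lines: Iterable[str]) -> Iterator[Judgment]:
    """Yield a Judgment for every line that has exactly seven tab-separated fields."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 7:
            # Skip blank or malformed lines rather than guessing at their meaning.
            continue
        yield Judgment(*fields)


def read_judgments(path: str) -> List[Judgment]:
    """Read one ground truth file, decoding it as UTF-8."""
    with open(path, encoding="utf-8") as handle:
        return list(parse_judgments(handle))
```

Calling `read_judgments` with the path to one of the 21 files returns a list of `Judgment` tuples. If the scores turn out to be numeric, they can be converted after loading.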

Character encoding

The files are UTF-8 encoded. You can check that the encoding is handled correctly by opening, e.g., the efficiency_efficiëntie.txt file. The files nonetheless contain many odd-looking characters, which stem from the original output of the OCR engine.
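
To get a feel for how much OCR noise a file contains, one option is to count its non-ASCII characters. The following is only a sketch: `non_ascii_counts` and `non_ascii_in_file` are hypothetical helper names, and the cutoff at code point 127 is a rough heuristic, since legitimate Dutch letters such as ë are counted alongside genuine OCR debris.

```python
from collections import Counter


def non_ascii_counts(text: str) -> Counter:
    """Count every character outside the ASCII range (code point > 127)."""
    return Counter(ch for ch in text if ord(ch) > 127)


def non_ascii_in_file(path: str) -> Counter:
    """Decode a file as UTF-8 and count its non-ASCII characters."""
    with open(path, encoding="utf-8") as handle:
        return non_ascii_counts(handle.read())
```

Calling `non_ascii_in_file(path).most_common(20)` lists the twenty most frequent non-ASCII characters in a file, which makes it easier to tell expected accented letters apart from OCR artifacts.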