What is it?
The dataset is intended for evaluating toponym extraction and disambiguation approaches on unstructured text. It consists of 152 texts obtained from various web pages. The main text content of each page was extracted, and the locations occurring in it were marked using XML annotations of different types, e.g.:
<UNIT>Florida</UNIT> is a state in the southeast of the <COUNTRY>United States</COUNTRY>.
The following location types were used for annotations:
UNIT (for administrative entities such as federal states, counties or cities' districts),
REGION (areas without political or administrative meaning),
LANDMARK (geographical features such as rivers, lakes, valleys, or mountains),
POI (buildings such as universities, hospitals, etc.).
For all annotated locations, longitude/latitude values were manually assigned.
Only entities that refer directly to a location of one of the types described above are annotated. Adjectives such as "French" or demonyms such as "Frenchwoman" are not considered locations. If a location name is part of another distinct and well-known entity, such as "New York Times", "Voice of Korea", or "Virgin of Lujan", it is not annotated as a location. The annotations are always as specific as possible, e.g. we annotate the complete phrase "University of Kent" as POI, and not only "Kent" as UNIT. This implies that there are no nested annotations in the dataset. Colloquial mentions and short forms, such as "Down Under", "Sunshine State", or "Nam" (for Vietnam), are treated like a mention of the full name if they can be considered common knowledge.
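Because the annotations are never nested, a single regular expression is enough to read them. The following is a minimal sketch, not part of the dataset's tooling: the function name parse_annotated is hypothetical, and the assumption that tag names consist only of uppercase letters is taken from the example above. It returns the clean text together with each annotation's type, surface text, and zero-based character offset in the clean text. Whether this offset matches the offset convention used in coordinates.csv has not been verified.

```python
import re

# Matches one non-nested annotation such as <UNIT>Florida</UNIT>.
# The backreference \1 requires the closing tag to equal the opening tag.
# Assumption: tag names are made up of uppercase letters only.
ANNOTATION = re.compile(r"<([A-Z]+)>(.*?)</\1>", re.DOTALL)


def parse_annotated(text):
    """Return (clean_text, annotations) for one annotated document.

    Each annotation is a dict with the location type, the mention text,
    and the zero-based character offset of the mention in the clean text.
    """
    clean_parts = []
    annotations = []
    clean_len = 0   # length of the clean text produced so far
    last_end = 0    # position in the annotated text after the last match
    for match in ANNOTATION.finditer(text):
        # Copy the untagged text that precedes this annotation.
        before = text[last_end:match.start()]
        clean_parts.append(before)
        clean_len += len(before)
        # Record the annotation at its position in the clean text.
        annotations.append({
            "type": match.group(1),
            "text": match.group(2),
            "offset": clean_len,
        })
        clean_parts.append(match.group(2))
        clean_len += len(match.group(2))
        last_end = match.end()
    # Copy whatever follows the last annotation.
    clean_parts.append(text[last_end:])
    return "".join(clean_parts), annotations


sample = ("<UNIT>Florida</UNIT> is a state in the southeast of the "
          "<COUNTRY>United States</COUNTRY>.")
clean, found = parse_annotated(sample)
for item in found:
    print(item["type"], item["offset"], item["text"])
```

Slicing the clean text at the recorded offset, e.g. clean[item["offset"]:item["offset"] + len(item["text"])], gives back the mention text, which is a simple consistency check when working with the files.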
The dataset contains 152 text files with 3,850 annotations, of which 3,484 (90.49 percent) have been assigned coordinates. The coordinates are stored in a separate coordinates.csv file, which contains the document's filename, the running annotation index and character offset (both zero-based), the latitude and longitude as decimal WGS84 coordinates, and a source ID.
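A coordinates file with the columns described above could be read as sketched below. Several details are assumptions rather than facts about the dataset: the delimiter (a comma here, exposed as a parameter), the possible presence of a header row, and the way annotations without coordinates are represented. The sketch simply skips any row whose numeric fields cannot be parsed. The names Coordinate and read_coordinates are hypothetical, and the sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io
from typing import NamedTuple


class Coordinate(NamedTuple):
    filename: str     # document the annotation belongs to
    index: int        # running annotation index, zero-based
    offset: int       # character offset, zero-based
    latitude: float   # decimal WGS84
    longitude: float  # decimal WGS84
    source_id: str


def read_coordinates(handle, delimiter=","):
    """Parse rows in the column order described for coordinates.csv.

    Rows with fewer than six fields, or with non-numeric index, offset,
    latitude, or longitude (for example a header row), are skipped.
    """
    rows = []
    for record in csv.reader(handle, delimiter=delimiter):
        if len(record) < 6:
            continue
        try:
            index, offset = int(record[1]), int(record[2])
            latitude, longitude = float(record[3]), float(record[4])
        except ValueError:
            continue  # header row or row without usable numbers
        rows.append(Coordinate(record[0], index, offset,
                               latitude, longitude, record[5]))
    return rows


# Invented sample rows, for illustration only.
SAMPLE = (
    "filename,index,offset,latitude,longitude,source\n"
    "text_1.txt,0,0,27.8,-81.7,src-1\n"
    "text_1.txt,1,43,39.8,-98.6,src-2\n"
)

coordinates = read_coordinates(io.StringIO(SAMPLE))
for row in coordinates:
    print(row.filename, row.index, row.latitude, row.longitude)
```

For the real file, pass an open file handle instead of the io.StringIO wrapper, e.g. open("coordinates.csv", newline="", encoding="utf-8"), and adjust the delimiter if the file turns out to use a different one.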
The dataset is pre-split into three disjoint sets, "training", "validation", and "test" (40/20/40 split), to make machine-learning-based results easier to reproduce. The file index.csv in the root directory contains the source URLs from which the texts were obtained. Additionally, the clean, non-annotated texts are included.
The following table gives a summary for the annotations of each type in the dataset.
The map shows the geographic distribution of the annotations in the dataset:
Who made it?
The dataset was created in 2012, 2013, and 2014 at the TU Dresden by Philipp Katz, David Urbansky, and Uliana Andriyeshyna. For questions or feedback, contact Philipp Katz, firstname.lastname@example.org.
For more information, refer to "To Learn or to Rule: Two Approaches for Extracting Geographical Information from Unstructured Text", Philipp Katz and Alexander Schill, Proceedings of the 11th Australasian Data Mining & Analytics Conference (AusDM 2013), Canberra, Australia.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.