
TUD-Loc-2013

What is it?

The dataset is intended for evaluating toponym extraction and disambiguation approaches for unstructured texts. It consists of 152 texts obtained from different web pages. The main text content was extracted, and occurring locations were marked in the text using XML-style annotations of different types, e.g.:

<UNIT>Florida</UNIT> is a state in the southeast of the <COUNTRY>United States</COUNTRY>.
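Since the annotations are plain inline tags without nesting, they can be pulled out with a simple regular expression. The following is a minimal Python sketch, not part of the dataset itself; the assumption that character offsets are counted on the tag-stripped text is noted in the docstring.

    import re

    # Location types occurring in the dataset (see the statistics table below).
    TYPES = ("CONTINENT", "COUNTRY", "CITY", "UNIT", "REGION",
             "LANDMARK", "POI", "STREET", "STREETNR", "ZIP")

    TAG_RE = re.compile(r"<(" + "|".join(TYPES) + r")>(.*?)</\1>")

    def extract_annotations(tagged_text):
        """Yield (type, value, offset) triples from an annotated text.

        The offset is counted on the text with all tags stripped; whether this
        matches the character offsets used in coordinates.csv is an assumption.
        """
        plain_offset = 0  # position in the tag-stripped text
        last_end = 0      # end of the previous match in the tagged text
        for match in TAG_RE.finditer(tagged_text):
            plain_offset += match.start() - last_end
            yield match.group(1), match.group(2), plain_offset
            plain_offset += len(match.group(2))
            last_end = match.end()

    sentence = ("<UNIT>Florida</UNIT> is a state in the southeast of the "
                "<COUNTRY>United States</COUNTRY>.")
    print(list(extract_annotations(sentence)))
    # [('UNIT', 'Florida', 0), ('COUNTRY', 'United States', 43)]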

Annotation guidelines

The following location types were used for annotations: CONTINENT, COUNTRY, CITY, UNIT (for administrative entities such as federal states, counties, or city districts), REGION (areas without political or administrative meaning), LANDMARK (geographical features such as rivers, lakes, valleys, or mountains), and POI (buildings such as universities, hospitals, etc.). The dataset additionally contains STREET, STREETNR, and ZIP annotations (see the statistics below). For the annotated locations, longitude/latitude values were manually assigned where possible.

Entities directly referring to a location of any of the types described above are annotated. Adjectives such as "French" or demonyms such as "Frenchwoman" are not considered locations. If a location name is part of another distinct and well-known entity, such as "New York Times", "Voice of Korea", or "Virgin of Lujan", it is not annotated as a location. The annotations are always as specific as possible, e.g. the complete phrase "University of Kent" is annotated as POI, rather than only "Kent" as UNIT. This implies that there are no nested annotations in the dataset. Colloquial mentions and short forms, such as "Down Under", "Sunshine State", or "Nam" (for Vietnam), are treated like a mention of the full name, provided they can be considered common knowledge.

Statistics

The dataset contains 152 text files and 3,850 annotations, of which 3,484 (90.49 percent) have been assigned coordinates. The coordinates are kept in a separate file, coordinates.csv, whose lines contain the document's filename, the running annotation index and the character offset (both zero-based), latitude and longitude as decimal WGS84 coordinates, and a source ID, e.g.:

text1.txt;0;0;53.00000;-8.00000;geonames:2963597
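A minimal sketch for reading coordinates.csv in Python, assuming a semicolon delimiter as shown above, no header row, and UTF-8 encoding (adjust if the shipped file differs):

    import csv
    from collections import namedtuple

    # One row of coordinates.csv, following the field order described above.
    Coordinate = namedtuple(
        "Coordinate", "filename index offset latitude longitude source")

    def read_coordinates(path="coordinates.csv"):
        """Read coordinates.csv into a list of Coordinate records."""
        records = []
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.reader(f, delimiter=";"):
                filename, index, offset, lat, lon, source = row
                records.append(Coordinate(filename, int(index), int(offset),
                                          float(lat), float(lon), source))
        return records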

The disambiguation to coordinates was done using the GeoNames database and, as a fallback, the Google Geocoding API.

The dataset is pre-split into three disjoint sets, "training", "validation", and "test" (a 40/20/40 split), to allow for better reproducibility of machine-learning-based results. The file index.csv in the root directory contains the source URLs from which the texts were obtained. Additionally, the clean, non-annotated texts are included.

The following table gives a summary for the annotations of each type in the dataset.

Type        Total   Distinct   w/ Coordinates
CONTINENT      72          6               72
COUNTRY      1502        149             1502
CITY         1035        415             1012
UNIT          242        132              233
REGION        141         86              108
LANDMARK      286        187              240
POI           463        362              234
STREET         55         45               39
STREETNR       37         33               28
ZIP            17         17               16
Total        3850       1432             3484
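For reference, such per-type counts can be reproduced with a few lines of Python, reusing the extract_annotations() sketch shown further above; whether the "Distinct" column counts distinct surface forms per type, as done here, is an assumption:

    from collections import Counter

    def annotation_statistics(tagged_texts):
        """Print total and distinct annotation counts per type.

        `tagged_texts` is an iterable of annotated document strings; uses
        extract_annotations() from the sketch above.
        """
        totals = Counter()
        distinct = {}
        for text in tagged_texts:
            for ann_type, value, _ in extract_annotations(text):
                totals[ann_type] += 1
                distinct.setdefault(ann_type, set()).add(value)
        for ann_type, total in totals.most_common():
            print(f"{ann_type:<10} {total:>6} {len(distinct[ann_type]):>9}")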

The map shows the geographic distribution of the annotations in the dataset:

Map: Annotation Distribution

Who made it?

The dataset was created in 2012, 2013, and 2014 at TU Dresden by Philipp Katz, David Urbansky, and Uliana Andriyeshyna. For questions or feedback, contact Philipp Katz (philipp.katz@tu-dresden.de).

For more information, refer to "To Learn or to Rule: Two Approaches for Extracting Geographical Information from Unstructured Text", Philipp Katz and Alexander Schill, Proceedings of the 11th Australasian Data Mining & Analytics Conference (AusDM 2013), Canberra, Australia.

License


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.