GeoText is a corpus of geotagged Twitter messages (tweets) discussed in this readme and in the following paper:
A Latent Variable Model for Geographic Lexical Variation. Jacob Eisenstein, Brendan O'Connor, Noah A. Smith, and Eric P. Xing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Cambridge, MA, 2010.
Initial TextGrounder Setup
If you haven't already, follow steps 1 through 7 on the Home page.
Download and Unpack GeoText
1. Download the following GZipped tar file: http://www.ark.cs.cmu.edu/GeoText/GeoText.2010-10-12.tgz
2. Extract the archive to a desired location. On Mac or Linux, for example:
$ tar -xzvf GeoText.2010-10-12.tgz
On Windows, many archiving tools with free versions, such as WinRAR, can extract .tgz files.
Among other files, the newly created directory should contain the file full_text.txt, which you'll be pointing TextGrounder to in the next step.
Import the GeoText corpus so that it can be read by TextGrounder
3. Run the following command from your TextGrounder directory:
textgrounder --memory 8g import-corpus -i /PATH/TO/GeoText.2010-10-12/full_text.txt -sg geonames.ser.gz -sco geotext.ser.gz -cf geotext
where /PATH/TO/GeoText.2010-10-12/full_text.txt is the full path to full_text.txt as unpacked above, and geonames.ser.gz is the full path to the serialized gazetteer from step 7 of the Home page. The -cf geotext flag tells the importer to treat the corpus as the GeoText corpus rather than as a plain text corpus, which matters for how evaluation is performed (see below).
Run a Resolver on GeoText
4. Run a resolver, similar to step 10 of the Home page. For example:
textgrounder --memory 2g resolve -sci geotext.ser.gz -cf geotext -r WeightedMinDistResolver -it 10 -o resolved-geotext.xml -ok resolved-geotext.kml -sco resolved-geotext.ser.gz
This will generate the resolved corpus in serialized format as resolved-geotext.ser.gz, in XML format as resolved-geotext.xml, and in Google Earth-viewable KML format as resolved-geotext.kml.
It will also report the minimum, mean, median, and maximum error distances in kilometers. The GeoText corpus contains many users, each with multiple tweets, and TextGrounder tries to detect each user's location based on the concatenation of their tweets (one "document" refers to a single user's set of tweets). The error numbers refer to these per-user predictions. Your output should look something like this:
Mean error distance (km): 8603.893292960647
Median error distance (km): 8889.529949759155
Minimum error distance (km): 1.5173467199628332
Maximum error distance (km): 15237.299350752575
Total documents evaluated: 1825
Note that currently only the development set of 1825 users (see the GeoText readme) is resolved and evaluated on.
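The error distances above are great-circle distances between predicted and gold user locations. A minimal sketch of how such distances and their summary statistics can be computed with the haversine formula (an illustration only, not TextGrounder's actual code; the coordinates below are hypothetical):

```python
from math import radians, sin, cos, asin, sqrt
from statistics import mean, median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; assumed constant for this sketch

def error_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance between two (lat, lon) points in km."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Hypothetical (predicted, gold) location pairs for three users
pairs = [((30.27, -97.74), (40.71, -74.01)),
         ((30.27, -97.74), (30.30, -97.70)),
         ((47.61, -122.33), (34.05, -118.24))]
errors = [error_distance_km(*pred, *gold) for pred, gold in pairs]
print("Mean error distance (km):", mean(errors))
print("Median error distance (km):", median(errors))
print("Minimum error distance (km):", min(errors))
print("Maximum error distance (km):", max(errors))
```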
You can also specify a bounding box on the earth; no points outside that box will be considered as possible user locations, though individual toponyms aren't subject to this restriction. For example, try this:
textgrounder --memory 2g resolve -sci geotext.ser.gz -cf geotext -r WeightedMinDistResolver -it 10 -o resolved-geotext.xml -ok resolved-geotext.kml -sco resolved-geotext.ser.gz -minlat 24 -maxlat 48 -minlon n126 -maxlon n65
In the above, the bounding box spans latitudes 24 to 48 and longitudes -126 to -65 (the n prefix denotes a negative value), roughly surrounding the continental United States, where all users in the GeoText corpus are located. This restriction improves the results as follows:
Mean error distance (km): 1577.951214871467
Median error distance (km): 1740.5090113480996
Minimum error distance (km): 1.5173467199628332
Maximum error distance (km): 4122.1972282958595
Total documents evaluated: 1825
Alternatively, you can quickly run or re-run any evaluation via a command like the following, using either a serialized resolved corpus:
textgrounder --memory 2g eval -sci resolved-geotext.ser.gz -cf geotext
or a resolved corpus in XML format:
textgrounder --memory 2g eval -ix resolved-geotext.xml -cf geotext
A Few Caveats
Currently, the unsupervised geolocation algorithms in TextGrounder use OpenNLP's default Named Entity Recognition (NER) system, which is trained on newspaper-style text and is rather poor at detecting toponyms in typical tweet-style text. OpenNLP's default tokenizer is also used and suffers from similar problems, though its adverse effects are less severe. Furthermore, once all detected toponyms have been resolved (this, too, is a very imperfect process), a very simple heuristic is used to pick a resolved-to location as each user's location. All of these are areas we are working to improve. Lastly, both toponym and user location detection are done in a largely unsupervised manner: no gold toponym or user labels are used as training data.
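As an illustration of what such a simple heuristic might look like (an assumption for exposition only, not TextGrounder's actual implementation), one option is to pick the location most frequently resolved to across a user's tweets:

```python
from collections import Counter

def pick_user_location(resolved_locations):
    """resolved_locations: (lat, lon) pairs resolved from one user's tweets.
    Returns the most frequently resolved location, or None if no toponyms
    were detected for this user."""
    if not resolved_locations:
        return None
    return Counter(resolved_locations).most_common(1)[0][0]

# Two of three toponyms resolve to the same (hypothetical) location
print(pick_user_location([(30.27, -97.74), (40.71, -74.01), (30.27, -97.74)]))
```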
We also have a supervised model that achieves state-of-the-art results on GeoText; it is described in Wing and Baldridge (2011), "Simple Supervised Document Geolocation with Geodesic Grids," and is part of the Python portion of the TextGrounder code. The paper and instructions for using these models can be found on the WingBaldridge2011 page.