This page describes how to install TextGrounder and run it on a dataset.
Running TextGrounder for the First Time
1. Install the latest version of Mercurial if you don't already have it.
2. Clone the project repository:
hg clone https://email@example.com/utcompling/textgrounder textgrounder
3. Set the TextGrounder directory in your environment:
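The export commands are not shown on this page; a minimal sketch, assuming the textgrounder launcher script lives in the repository's bin/ directory:

```shell
# Hypothetical clone location; replace /path/to with your own.
export TEXTGROUNDER_DIR=/path/to/textgrounder
# Assumes the textgrounder launcher script is in bin/ under the repo root.
export PATH="$PATH:$TEXTGROUNDER_DIR/bin"
```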
Of course, you should substitute /path/to with the location you cloned the TextGrounder repository to.
4. Build the TextGrounder jar file.
textgrounder build update package
Note: this invokes SBT under the hood to build the system.
5. Obtain the OpenNLP 1.5 models for English. Do the following:
cd $TEXTGROUNDER_DIR/data/models
./getOpenNLPModels.sh
That script simply does a wget on the model files that TextGrounder needs. You can put them elsewhere if you wish, but you'd need to set the OPENNLP_MODEL_DIR environment variable to point to the directory where they reside.
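For example, if you keep the models somewhere other than $TEXTGROUNDER_DIR/data/models (the path below is hypothetical):

```shell
# Point TextGrounder at a custom model directory (hypothetical location).
export OPENNLP_MODEL_DIR="$HOME/opennlp-models"
```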
6. Download the GeoNames gazetteer:
cd $TEXTGROUNDER_DIR/data/gazetteers
./getGeoNames.sh
That script simply does a wget on the gazetteer file.
*Important*: Using this gazetteer requires at least 8GB of memory on your machine. If you have less, download US.zip, the gazetteer file for just the USA, instead. If you use US.zip, make sure to change allCountries.zip to US.zip in the instructions below.
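If you are unsure how much physical memory your machine has, one way to check on Linux:

```shell
# Reports total physical memory in kilobytes (Linux only).
grep MemTotal /proc/meminfo
```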
7. Import the GeoNames gazetteer so that it is in the format TextGrounder requires:
textgrounder --memory 8g import-gazetteer -i data/gazetteers/allCountries.zip -o geonames.ser.gz -dkm
You will only need to do this once. If an output filename without the .gz extension is specified, the resulting serialized gazetteer will not be compressed in the GZIP format, but we recommend doing this compression. The -dkm flag tells the importer to perform K-means clustering on locations in the gazetteer, which allows regions like countries to be represented by multiple points (recommended).
8. Download some data. To try it out quickly, pick any text that is likely to have plenty of toponyms. For example, you might try Memoirs of the Union's Three Great Civil War Generals by David Widger.
cd $TEXTGROUNDER_DIR
mkdir tmp
wget http://www.gutenberg.org/cache/epub/4546/pg4546.txt
If you want to try a much larger amount of toponym-rich text, download the Open American National Corpus (OANC) and unzip it. Then, go inside the OANC directory and choose any subdirectory whose txt files you want textgrounder to use as input.
For directions on how to work with the GeoText corpus of geotagged tweets, head over to the GeoText page.
9. Preprocess your corpus so that it is ready for quick use by the TextGrounder core algorithms:
textgrounder --memory 8g import-corpus -i <PATH-TO-DATA> -sg geonames.ser.gz -sco corpus.ser.gz
where <PATH-TO-DATA> is the path to the directory or file containing the plaintext version of the corpus from step 8, geonames.ser.gz is the serialized gazetteer from step 7, and corpus.ser.gz is the filename to write the preprocessed corpus to. Like step 7, this step requires a large amount of memory, but it avoids long load times and high memory usage in later steps.
You can also use a high-recall named entity recognizer on the corpus by running import-corpus with the following arguments:
textgrounder --memory 8g import-corpus -i <PATH-TO-DATA> -sg geonames.ser.gz -sco corpus.ser.gz -ner 1
10. Detect and resolve toponyms.
To use the basic Minimum Distance algorithm:
textgrounder --memory 2g resolve -sci corpus.ser.gz -r BasicMinDistResolver -o widger.xml -ok widger.kml -sco resolved-corpus.ser.gz -sg $TEXTGROUNDER_DIR/data/gazetteers/geonames.ser.gz
To use the Weighted Minimum Distance algorithm:
textgrounder --memory 2g resolve -sci corpus.ser.gz -r WeightedMinDistResolver -it 10 -o widger.xml -ok widger.kml -sco resolved-corpus.ser.gz -sg $TEXTGROUNDER_DIR/data/gazetteers/geonames.ser.gz
This should give you output that looks like this:
Reading serialized GeoNames gazetteer from /home/01199/jbaldrid/devel/textgrounder/data/gazetteers/geonames.ser.gz ...
Done.
Reading test corpus from pg4546.txt ...done.
Initialization took 3.0412 minutes.
Number of documents: 20165
Number of toponym types: 861
Maximum ambiguity (locations per toponym): 1457
Running WEIGHTED MINIMUM DISTANCE resolver with 10 iteration(s)...
Iteration: 1
Iteration: 2
Iteration: 3
Iteration: 4
Iteration: 5
Iteration: 6
Iteration: 7
Iteration: 8
Iteration: 9
Iteration: 10
done.
Writing resolved corpus in XML format to widger.xml ...done.
Writing visualizable resolved corpus in KML format to widger.kml ...done.
11. Open widger.kml using Google Earth to see the resolved toponyms.
12. Alternatively, you can run the KML file generation separately once a corpus has been resolved with the following command:
textgrounder --memory 2g write-to-kml -sci resolved-corpus.ser.gz -ok widger.kml