Commits

Ben Wing committed 116292f

Numerous fixes for preprocessing

Files changed (9)

bin/README.preprocess

+This file describes how to preprocess the Wikipedia dump to get the various
+necessary files.  The most important script is 'run-processwiki', which
+mostly makes use of 'processwiki.py'.
+
+=========== Quick start ==========
+Use 'preprocess-dump' to run all of the steps below.
+This calls 'run-processwiki'.
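+For example, for the dump prefix 'enwiki-20100905', you might say
+
+preprocess-dump enwiki-20100905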
+
+=========== Introduction ==========
+For 'run-processwiki', output goes to the current directory, and input mostly
+comes from the current directory.  There are several steps to run, which
+are described below.
+
+The original Wikipedia dump file needs to be in the current directory,
+and it's strongly suggested that this script is run in a newly-created
+directory, empty save for the dump file (or a symlink to it), with the
+dump file marked read-only through 'chmod a-w'.
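+
+For example, you might set up such a directory like this (the source path
+here is just a placeholder):
+
+mkdir preprocess-enwiki-20100905
+cd preprocess-enwiki-20100905
+ln -s /some/where/enwiki-20100905-pages-articles.xml.bz2 .
+chmod a-w enwiki-20100905-pages-articles.xml.bz2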
+
+All files other than the original dump file (and the disambig-id file
+mentioned below) are generated by the script.  The original dump file has
+a name like enwiki-20100905-pages-articles.xml.bz2; we also generate a permuted
+dump file with a name like enwiki-20100905-permuted-pages-articles.xml.bz2.
+
+The disambig-id file comes from $IN_DISAMBIG_ID_FILE, which is located
+in $TG_WIKIPEDIA_DIR (the directory where the results of preprocessing
+end up getting stored; see 'config-geolocate' in $TEXTGROUNDER_DIR/bin).
+The reason for the exception regarding this particular file is that it's
+generated not by us but by Wikiprep, which may take several weeks to run.
+This file is also not especially important in the scheme of things --
+and in fact the relevant data is not currently used at all.  When the
+file is present, it lists articles that are identified as "disambiguation"
+pages, and this fact goes into one of the fields of the combined article
+data file.  If not present, all articles will have "no" in this field.
+As just mentioned, no current experiment apps make use of this info.
+
+Other important environment variables (with default settings in
+'config-geolocate', but which you might want to override):
+
+WP_VERSION       Specifies which dump file to use, e.g. "enwiki-20100905".
+USE_PERMUTED     If set to "false", uses the non-permuted version of the
+                 dump file.  If set to "true", always uses the permuted
+                 version.  If unset, attempts to auto-detect the presence of
+                 the permuted version, using it if so, otherwise non-permuted.
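+
+For example, to force use of the permuted dump when generating counts,
+you might say
+
+WP_VERSION=enwiki-20100905 USE_PERMUTED=true run-processwiki coord-counts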
+
+
+============== How to do preprocessing from scratch ===============
+
+The following is a possible set of steps to use to generate the necessary
+data files from scratch.
+
+1. Create a new directory to work in, where you have a lot of free space.
+   (For example, the /scratch dir on Longhorn.) Either download a dump file
+   from Wikipedia, or symlink an existing dump file into the new directory.
+   Let's say the dump file has the dump prefix 'enwiki-20111007' --
+   the English Wikipedia, dump of October 7, 2011.  Also assume that for
+   this and all future commands, we're in the new directory.
+ 
+   If we want to download it, we might say
+
+wget http://dumps.wikimedia.org/enwiki/20111007/enwiki-20111007-pages-articles.xml.bz2
+
+   If we want to symlink from somewhere else, we might say
+
+ln -s ../../somewhere/else/enwiki-20111007-pages-articles.xml.bz2 .
+
+2. Generate the basic and combined article data files for the non-permuted dump
+
+WP_VERSION=enwiki-20111007 USE_PERMUTED=false run-processwiki combined-article-data
+
+3. Generate a permuted dump file; all future commands will operate on the
+   permuted dump file, because we won't specify a value for USE_PERMUTED.
+
+WP_VERSION=enwiki-20111007 run-permute all
+
+4. Generate the basic and combined article data files for the permuted dump
+
+WP_VERSION=enwiki-20111007 run-processwiki combined-article-data
+
+5. Generate the counts file for articles with coordinates -- this is the info
+   needed by most of the Geolocate experiments.
+
+WP_VERSION=enwiki-20111007 run-processwiki coord-counts
+
+6. Generate the counts and words files for all articles, splitting the dump
+   file so we can run in parallel.
+
+WP_VERSION=enwiki-20111007 run-processwiki split-dump
+WP_VERSION=enwiki-20111007 NUM_SIMULTANEOUS=8 run-processwiki all-counts all-words
+
+7. Move all final generated files (i.e. not including intermediate files) into
+   some final directory, e.g. $TG_WIKIPEDIA_DIR.
+
+mv -i *.bz2 *.txt $TG_WIKIPEDIA_DIR
+chmod a-w $TG_WIKIPEDIA_DIR/*
+
+   Note the use of '-i', which will prompt you if you are about to
+   overwrite an existing file.  We also run 'chmod' afterwards to make all
+   the files read-only, to lessen the possibility of accidentally overwriting
+   them later in another preprocessing run.
+
+============== How to rerun a single step ===============
+
+If all the preprocessing has already been done for you, and you simply want
+to run a single step, then you don't need to do all of the above steps.
+However, it's still strongly recommended that you do your work in a fresh
+directory, and symlink the dump file into that directory -- in this case the
+*permuted* dump file.  We use the permuted dump file for experiments because
+the raw dump file has a non-uniform distribution of articles, and so we can't
+e.g. count on our splits being uniformly distributed.  Randomly permuting
+the dump file and article lists takes care of that.  The permuted dump file
+has a name like
+
+enwiki-20111007-permuted-pages-articles.xml.bz2
+
+For example, if you want to change processwiki.py to generate bigrams, and then
+run it to generate the bigram counts, you might do this:
+
+1. Note that there are currently options `output-coord-counts` to output
+   unigram counts only for articles with coordinates (which are the only ones
+   needed for standard document geotagging), and `output-all-counts` to
+   output unigram counts for all articles.  You want to add corresponding
+   options for bigram counts -- either something like
+   `output-coord-bigram-counts` and `output-all-bigram-counts`, or an option
+   `--n-gram` to specify the N-gram size (1 for unigrams, 2 for bigrams,
+   3 for trigrams if that's implemented, etc.).  *DO NOT* under any
+   circumstances simply hack the code so that it automatically outputs
+   bigrams instead of unigrams -- such code CANNOT be incorporated into the
+   repository, which means your mods will become orphaned and unavailable
+   to anyone else.
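+
+   As a very rough sketch (hypothetical function; the real option handling
+   and code structure in processwiki.py will differ), the core
+   bigram-counting logic might look like:
+
+from collections import defaultdict
+
+def count_ngrams(tokens, n):
+  '''Count n-grams in a token list; n=1 gives unigrams, n=2 bigrams.'''
+  counts = defaultdict(int)
+  for i in xrange(len(tokens) - n + 1):
+    counts[tuple(tokens[i:i+n])] += 1
+  return counts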
+
+2. Modify 'config-geolocate' so that it has additional sets of environment
+   variables for bigram counts.  For example, after these lines:
+
+COORD_COUNTS_SUFFIX="counts-only-coord-documents.txt"
+ALL_COUNTS_SUFFIX="counts-all-documents.txt"
+
+   you'd add
+
+COORD_BIGRAM_COUNTS_SUFFIX="bigram-counts-only-coord-documents.txt"
+ALL_BIGRAM_COUNTS_SUFFIX="bigram-counts-all-documents.txt"
+
+   Similarly, after these lines:
+
+OUT_COORD_COUNTS_FILE="$DUMP_PREFIX-$COORD_COUNTS_SUFFIX"
+OUT_ALL_COUNTS_FILE="$DUMP_PREFIX-$ALL_COUNTS_SUFFIX"
+
+   you'd add
+
+OUT_COORD_BIGRAM_COUNTS_FILE="$DUMP_PREFIX-$COORD_BIGRAM_COUNTS_SUFFIX"
+OUT_ALL_BIGRAM_COUNTS_FILE="$DUMP_PREFIX-$ALL_BIGRAM_COUNTS_SUFFIX"
+
+   And then you'd do the same thing for IN_COORD_COUNTS_FILE and
+   IN_ALL_COUNTS_FILE.
+
+3. Modify 'run-processwiki', adding new targets ("steps")
+   'coord-bigram-counts' and 'all-bigram-counts'.  Here, you would just
+   copy the existing lines for 'coord-counts' and 'all-counts' and modify
+   them appropriately.
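+
+   As a sketch, a new stanza might look like this (using the option name
+   from step 1 and the variable from step 2):
+
+elif [ "$step" = coord-bigram-counts ]; then
+
+action="Generating bigram counts for articles with coordinates"
+args="--output-coord-bigram-counts"
+outfile="$OUT_COORD_BIGRAM_COUNTS_FILE"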
+
+4. Now finally you can run it:
+
+WP_VERSION=enwiki-20111007 run-processwiki coord-bigram-counts
+
+   This generates the bigram counts for geotagged articles -- the minimum
+   necessary for document geotagging.
+
+   Actually, since the above might take a while and generate a fair amount
+   of diagnostic output, you might want to run it in the background
+   under nohup, so that it won't die if your terminal connection suddenly
+   dies.  One way to do that is to use the TextGrounder 'run-nohup' script:
+
+WP_VERSION=enwiki-20111007 run-nohup --id do-coord-bigram-counts run-processwiki coord-bigram-counts
+
+   Note that the '--id do-coord-bigram-counts' is optional; all it does is
+   insert the text "do-coord-bigram-counts" into the name of the file where
+   it stores stdout and stderr output.  This file will have a name beginning
+   'run-nohup.' and ending with a timestamp.  The beginning and ending of the
+   file will indicate the starting and ending times, so you can see how long
+   it took.
+
+   If you want to generate bigram counts for all articles, you could use a
+   similar command line, although it might take a couple of days to complete.
+   If you're on Longhorn, where you only have 24-hour time slots, you might
+   consider using the "divide-and-conquer" mode.  The first thing is to
+   split the dump file, like this:
+
+WP_VERSION=enwiki-20111007 run-processwiki split-dump
+
+   This takes maybe 45 mins and splits the whole dump file into 8 pieces.
+   (Controllable through NUM_SPLITS.)
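+
+   For example, to split into 16 pieces instead (assuming NUM_SPLITS can be
+   overridden from the environment like the other variables), you might say
+
+WP_VERSION=enwiki-20111007 NUM_SPLITS=16 run-processwiki split-dump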
+
+   Then, for each operation you want to do in divide-and-conquer mode, run
+   it with NUM_SIMULTANEOUS set to something more than 1, e.g.
+
+WP_VERSION=enwiki-20111007 NUM_SIMULTANEOUS=8 run-processwiki all-bigram-counts
+
+   (although you probably want to wrap it in 'run-nohup').  Essentially,
+   this runs 8 simultaneous run-processwiki processes (which fits well with
+   the workhorse Longhorn machines, since they are 8-core), one on each of
+   the 8 splits, and then concatenates the results together at the end.
+   You can also set NUM_SIMULTANEOUS lower than the number of splits, in
+   which case you get only that much simultaneity.

bin/config-geolocate

 #
 # 3. PCL_TRAVEL_DIR: Optional; location of PCL travel corpus and related files.
 #
-# 4. NO_USE_PERMUTED: Optional; if non-blank, don't use permuted version of
-#                     Wikipedia data files.
+# 4. USE_PERMUTED: Optional; if "false", don't use permuted version of
+#                  Wikipedia data files. If "true", do use it.  If unset,
+#                  use it if it appears to be available.
 #
 # 5. WP_VERSION: Optional; if set, should specify the prefix of a dump file,
 #                which will be used for the dump file and all files computed
 set_wp_version() {
   WP_VERSION="$1"
   DUMP_PREFIX="$1"
-  if [ -z "$NO_USE_PERMUTED" ]; then
+  if [ -z "$USE_PERMUTED" ]; then
     if [ -e "$TG_CORPUS_DIR/wikipedia/$WP_VERSION/$WP_VERSION-permuted-pages-articles.xml.bz2" ]; then
       DUMP_PREFIX="$1-permuted"
     fi
+  elif [ "$USE_PERMUTED" = "true" ]; then
+    permuted_file="$TG_CORPUS_DIR/wikipedia/$WP_VERSION/$WP_VERSION-permuted-pages-articles.xml.bz2"
+    if [ ! -e "$permuted_file" ]; then
+      echo "WARNING: Permuted file $permuted_file does not appear to exist, but using anyway."
+    fi
+    DUMP_PREFIX="$1-permuted"
   fi
 
   WP_VERSION_DIR="$TG_WIKIPEDIA_DIR/$WP_VERSION"

bin/convert-corpus-to-latest

+#!/bin/sh
+
+# USAGE: convert-corpus-to-latest WIKITAG
+#
+# where WIKITAG is something like 'dewiki-20120225'. (Which needs to exist.)
+
+# Process options
+
+#NO_DOWNLOAD=false
+while true; do
+  case "$1" in
+    --no-download ) NO_DOWNLOAD=true; shift ;;
+    -- ) shift; break ;;
+    * ) break ;;
+  esac
+done
+
+wikitag="$1"
+cd $wikitag
+echo "Converting Wikipedia corpus $wikitag to latest format ..."
+mkdir convert
+cd convert
+ln -s .. $wikitag
+run-convert-corpus --steps wiki $wikitag
+mv convert-corpora-3/$wikitag/* $wikitag
+cd ..
+rm -rf convert
+echo "Converting Wikipedia corpus $wikitag to latest format ... done."
+cd ..

bin/download-preprocess-wiki

 #
 # where WIKITAG is something like 'dewiki-20120225'. (Which needs to exist.)
 
+# Process options
+
+NO_DOWNLOAD=false
+PREPROCESS_OPTS=
+while true; do
+  case "$1" in
+    --no-download ) NO_DOWNLOAD=true; shift ;;
+    --no-permute ) PREPROCESS_OPTS="$PREPROCESS_OPTS --no-permute"; shift ;;
+    -- ) shift; break ;;
+    * ) break ;;
+  esac
+done
+
 wikitag="$1"
 mkdir -p $wikitag
 cd $wikitag
-echo "Downloading Wikipedia corpus $wikitag ..."
-wikidir="`echo $wikitag | sed 's/-/\//'`"
-wget -nd http://dumps.wikimedia.org/$wikidir/$wikitag-pages-articles.xml.bz2
-echo "Downloading Wikipedia corpus $wikitag ... done."
+if [ "$NO_DOWNLOAD" != "true" ]; then
+  echo "Downloading Wikipedia corpus $wikitag ..."
+  wikidir="`echo $wikitag | sed 's/-/\//'`"
+  wget -nd http://dumps.wikimedia.org/$wikidir/$wikitag-pages-articles.xml.bz2
+  echo "Downloading Wikipedia corpus $wikitag ... done."
+fi
 echo "Preprocessing Wikipedia corpus $wikitag ..."
-preprocess-dump $wikitag
+preprocess-dump $PREPROCESS_OPTS $wikitag
 echo "Preprocessing Wikipedia corpus $wikitag ... done."
-echo "Converting Wikipedia corpus $wikitag to latest format ..."
-mkdir convert
-cd convert
-ln -s .. $wikitag
-run-convert-corpus --steps wiki $wikitag
-mv convert-corpora-3/$wikitag/* $wikitag
-cd ..
-rm -rf convert
-echo "Converting Wikipedia corpus $wikitag to latest format ... done."
-cd ..
+echo "Converting $wikitag to latest format ..."
+convert-corpus-to-latest $wikitag
+echo "Converting $wikitag to latest format ... done."
+echo "Settings permissions on $wikitag ..."
+# Make tree world-readable
+chmod -R u+w,a+rX .
+echo "Settings permissions on $wikitag ... done."

bin/preprocess-dump

 #!/bin/sh
 
+# Process options
+
+NO_PERMUTE=false
+while true; do
+  case "$1" in
+    --no-permute ) NO_PERMUTE=true; shift ;;
+    -- ) shift; break ;;
+    * ) break ;;
+  esac
+done
+
 if [ -z "$*" ]; then
   cat <<FOO
 Usage: $0 DUMP-PREFIX
 # This needs to be set for all subprocesses we call
 export WP_VERSION="$dumppref"
 
-# Generate article-data file from orginal dump
-NO_USE_PERMUTED=t run-processwiki article-data
+if [ "$NO_PERMUTE" != true ]; then
+  # Generate article-data file from original dump
+  USE_PERMUTED=false run-processwiki article-data
 
-# Generate a permuted dump file; all future commands will operate on the
-# permuted dump file, because we won't use NO_USE_PERMUTED.
-run-permute all
+  # Generate a permuted dump file; all future commands will operate on the
+  # permuted dump file, because we will set USE_PERMUTED appropriately.
+  run-permute all
+fi
 
+# Apparently there's a possible race condition in detection, so forcibly
+# use the permuted file.
+export USE_PERMUTED=true
 # Split the dump so we can go faster afterwards
 run-processwiki split-dump
 
 fi
 
 # Non-standard here: Don't use permuted dumps
-NO_USE_PERMUTED=t
+USE_PERMUTED=false
 . $TEXTGROUNDER_DIR/bin/config-geolocate
 
 TG_PYTHON_DIR="$TEXTGROUNDER_DIR/python"
   steps="$*"
 fi
 
+docmd() {
+  cmd="$*"
+  echo "Executing at `date`: $cmd"
+  sh -c "$cmd"
+  echo "Ending at `date`: $cmd"
+}
+
 echo "Steps are $steps"
 
 for step in $steps; do
 
 if [ "$step" = permute ]; then
 echo "Permuting articles ..."
-$PERMUTE_WIKI --article-data-file $OUT_ORIG_DOCUMENT_DATA_FILE \
-  --mode=permute $OTHEROPTS > $PERMUTED_OUT_ORIG_DOCUMENT_DATA_FILE
+args="--article-data-file $OUT_ORIG_DOCUMENT_DATA_FILE \
+  --mode=permute $OTHEROPTS"
+outfile="$PERMUTED_OUT_ORIG_DOCUMENT_DATA_FILE"
+cmd="$PERMUTE_WIKI $args > $outfile"
+echo "Executing at `date`: $cmd"
+$PERMUTE_WIKI $args > $outfile
+echo "Ending at `date`: $cmd"
 
 elif [ "$step" = split ]; then
 echo "Splitting dump file ..."
 
-bzcat $OUT_DUMP_FILE | $PERMUTE_WIKI --mode=split \
+args="--mode=split \
   --article-data-file $PERMUTED_OUT_ORIG_DOCUMENT_DATA_FILE \
   --split-prefix $SPLIT_PREFIX \
   --number-of-splits $NUM_SPLITS \
-  $OTHEROPTS
+  $OTHEROPTS"
+cmd="bzcat $OUT_DUMP_FILE | $PERMUTE_WIKI $args"
+echo "Executing at `date`: $cmd"
+bzcat $OUT_DUMP_FILE | $PERMUTE_WIKI $args
+echo "Ending at `date`: $cmd"
 
 elif [ "$step" = sort ]; then
 echo "Sorting the split files ..."
   SPLITFILE="$SPLIT_PREFIX.$i"
   SPLITARTS="$SPLITFILE.articles"
   echo "Sorting file $SPLITFILE..."
+  args="-a $SPLITARTS --mode=sort"
+  outfile="$SPLITFILE.sorted"
   if [ "$NUM_SIMULTANEOUS" -eq 1 ]; then
-    < $SPLITFILE $PERMUTE_WIKI -a $SPLITARTS --mode=sort > $SPLITFILE.sorted
+    cmd="< $SPLITFILE $PERMUTE_WIKI $args > $outfile"
+    echo "Executing at `date`: $cmd"
+    < $SPLITFILE $PERMUTE_WIKI $args > $outfile
+    echo "Ending at `date`: $cmd"
   else
     if [ "$numleft" -gt 0 ]; then
-      < $SPLITFILE $PERMUTE_WIKI -a $SPLITARTS --mode=sort > $SPLITFILE.sorted &
+      cmd="< $SPLITFILE $PERMUTE_WIKI $args > $outfile &"
+      echo "Executing at `date`: $cmd"
+      < $SPLITFILE $PERMUTE_WIKI $args > $outfile &
+      echo "Ending at `date`: $cmd"
       numleft=`expr $numleft - 1`
       numrun=`expr $numrun + 1`
     fi
     if [ "$numleft" -eq 0 ]; then
       echo "Waiting for $numrun processes to finish..."
       wait
+      echo "Ending at `date`: Waiting."
       numleft="$NUM_SIMULTANEOUS"
       numrun=0
     fi
 if [ "$numrun" -gt 0 ]; then
   echo "Waiting for $numrun processes to finish..."
   wait
+  echo "Ending at `date`: Waiting."
   numrun=0
 fi
 
 done
 all_files="$SPLIT_PREFIX.prolog $splits $SPLIT_PREFIX.epilog"
 echo "Concatenating $all_files ..."
+cmd="cat $all_files | bzip2 > $PERMUTED_DUMP_FILE"
+echo "Executing at `date`: $cmd"
 cat $all_files | bzip2 > $PERMUTED_DUMP_FILE
+echo "Ending at `date`: $cmd"
 
 else
 echo "Unrecognized step $step"

bin/run-processwiki

 'config-geolocate', but which you might want to override):
 
 WP_VERSION       Specifies which dump file to use, e.g. "enwiki-20100905".
-NO_USE_PERMUTED  If set, uses the non-permuted version of the dump file.
+USE_PERMUTED     If set to "false", uses the non-permuted version of the dump
+                 file.  If set to "true", always tries to use the permuted
+                 version.  If blank, use permuted version if it appears to
+                 exist, non-permuted otherwise.
 
 Output files are in the current directory.
 
 
 2. Generate the basic and combined article data files for the non-permuted dump
 
-WP_VERSION=enwiki-20111007 NO_USE_PERMUTED=t run-processwiki combined-article-data
+WP_VERSION=enwiki-20111007 USE_PERMUTED=false run-processwiki combined-article-data
 
 3. Generate a permuted dump file; all future commands will operate on the
-   permuted dump file, because we won't use NO_USE_PERMUTED.
+   permuted dump file, because we won't specify a value for USE_PERMUTED.
 
 WP_VERSION=enwiki-20111007 run-permute all
 
 
 action=
 cansplit=yes
+outfile=
+args=
 
 if [ "$step" = article-data ]; then
 
 
 action="Generating location-type data"
 args="--output-location-type"
-outfile=
 # Don't split because we output to separate split files (FIXME why?).
 cansplit=no
 
 
 # Uses a different program, not processwiki.
 echo "Combining article data ..."
-echo "Beginning at `date`:"
-echo "Executing: $GENERATE_COMBINED \
-  --links-file $OUT_COORD_LINKS_FILE \
+args="--links-file $OUT_COORD_LINKS_FILE \
   --coords-file $OUT_COORDS_FILE \
-  --article-data-file $OUT_ORIG_DOCUMENT_DATA_FILE \
-  > $OUT_COMBINED_DOCUMENT_DATA_FILE"
-$GENERATE_COMBINED \
-  --links-file $OUT_COORD_LINKS_FILE \
-  --coords-file $OUT_COORDS_FILE \
-  --article-data-file $OUT_ORIG_DOCUMENT_DATA_FILE \
-  > $OUT_COMBINED_DOCUMENT_DATA_FILE
-echo "Ended at `date`."
+  --article-data-file $OUT_ORIG_DOCUMENT_DATA_FILE"
+outfile="$OUT_COMBINED_DOCUMENT_DATA_FILE"
+cmd="$GENERATE_COMBINED $args > $outfile"
+echo "Executing at `date`: $cmd"
+$GENERATE_COMBINED $args > $outfile
+echo "Ended at `date`: $cmd"
 
 elif [ "$step" = split-dump ]; then
 
 
 # Uses a different program, not processwiki.
 echo "Splitting dump file ..."
-echo "Beginning at `date`:"
-echo "Executing: bzcat $OUT_DUMP_FILE | $PERMUTE_WIKI --mode=split \
-  --article-data-file $OUT_ORIG_DOCUMENT_DATA_FILE \
+args="--mode=split --article-data-file $OUT_ORIG_DOCUMENT_DATA_FILE \
   --split-prefix $SPLIT_PREFIX \
   --number-of-splits $NUM_SPLITS $OTHEROPTS"
-bzcat $OUT_DUMP_FILE | $PERMUTE_WIKI --mode=split \
-  --article-data-file $OUT_ORIG_DOCUMENT_DATA_FILE \
-  --split-prefix $SPLIT_PREFIX \
-  --number-of-splits $NUM_SPLITS $OTHEROPTS
-echo "Ended at `date`."
+cmd="bzcat $OUT_DUMP_FILE | $PERMUTE_WIKI $args"
+echo "Executing at `date`: $cmd"
+bzcat $OUT_DUMP_FILE | $PERMUTE_WIKI $args
+echo "Ended at `date`: $cmd"
 
 elif [ "$step" = coord-counts ]; then
 
 
 fi
 
-if [ "$NUM_SIMULTANEOUS" -eq 1 -o -z "$outfile" -o "$cansplit" = "no" ]; then
+if [ -z "$action" ]; then
+  : # do nothing
+elif [ "$NUM_SIMULTANEOUS" -eq 1 -o -z "$outfile" -o "$cansplit" = "no" ]; then
 
   # Operate in non-split mode
-  echo "Beginning at `date`:"
   echo "$action ..."
   if [ -n "$outfile" ]; then
-    echo "Executing: bzcat $OUT_DUMP_FILE | $PROCESSWIKI $args $OTHEROPTS > $outfile"
+    cmd="bzcat $OUT_DUMP_FILE | $PROCESSWIKI $args $OTHEROPTS > $outfile"
+    echo "Executing at `date`: $cmd"
     bzcat $OUT_DUMP_FILE | $PROCESSWIKI $args $OTHEROPTS > $outfile
+    echo "Ended at `date`: $cmd"
   else
-    echo "Executing: bzcat $OUT_DUMP_FILE | $PROCESSWIKI $args $OTHEROPTS"
+    cmd="bzcat $OUT_DUMP_FILE | $PROCESSWIKI $args $OTHEROPTS"
+    echo "Executing at `date`: $cmd"
     bzcat $OUT_DUMP_FILE | $PROCESSWIKI $args $OTHEROPTS
+    echo "Ended at `date`: $cmd"
   fi
   echo "$action ... done."
-  echo "Ended at `date`."
 
 else
 
       split_outfile="$outfile.split-processwiki.$i"
       splits="$splits $split_outfile"
       splits_removable="$splits_removable $split_outfile"
-      echo "Beginning at `date`:"
-      echo "Executing: cat $SPLIT_PREFIX.prolog $SPLITFILE $SPLIT_PREFIX.epilog | $PROCESSWIKI $args $OTHEROPTS > $split_outfile &"
-      cat $SPLIT_PREFIX.prolog $SPLITFILE $SPLIT_PREFIX.epilog | $PROCESSWIKI $args $OTHEROPTS > $split_outfile &
-      echo "Ended at `date`."
+      cat_args="$SPLIT_PREFIX.prolog $SPLITFILE $SPLIT_PREFIX.epilog"
+      cmd="cat $cat_args | $PROCESSWIKI $args $OTHEROPTS > $split_outfile &"
+      echo "Executing at `date`: $cmd"
+      cat $cat_args | $PROCESSWIKI $args $OTHEROPTS > $split_outfile &
+      echo "Ended at at `date`: $cmd"
       numleft=`expr $numleft - 1`
       numrun=`expr $numrun + 1`
     fi
     if [ "$numleft" -eq 0 ]; then
       echo "Waiting for $numrun processes to finish..."
       wait
-      echo "Ended at `date`."
+      echo "Ended at `date`: Waiting."
       numleft="$NUM_SIMULTANEOUS"
       numrun=0
     fi
   if [ "$numrun" -gt 0 ]; then
     echo "Waiting for $numrun processes to finish..."
     wait
-      echo "Ended at `date`."
+    echo "Ended at `date`: Waiting."
     numrun=0
   fi
   echo "$action, combining the files ..."
   all_files="$splits"
   echo "$action, concatenating all files ($all_files) ..."
-  echo "Beginning at `date`:"
-  echo "Executing: cat $all_files > $outfile"
+  cmd="cat $all_files > $outfile"
+  echo "Executing at `date`: $cmd"
   cat $all_files > $outfile
-  echo "Ended at `date`."
+  echo "Ended at `date`: $cmd"
   echo "$action, removing intermediate split files ($splits_removable) ..."
   rm -f $splits_removable
   echo "$action ... done."

python/README.preprocess

-This file describes how to preprocess the Wikipedia dump to get the various
-necessary files.  The most important script is 'run-processwiki', which
-mostly makes use of 'processwiki.py'.
-
-=========== Quick start ==========
-Use 'preprocess-dump' to run all of the steps below.
-This calls 'run-processwiki'.
-
-=========== Introduction ==========
-For 'run-processwiki', output goes to the current directory, and input mostly
-comes from the current directory.  There are several steps to run, which
-are described below.
-
-The original Wikipedia dump file needs to be in the current directory,
-and it's strongly suggested that this script is run in a newly-created
-directory, empty save for the dump file (or a symlink to it), with the
-dump file marked read-only through 'chmod a-w'.
-
-All files other than the original dump file (and the disambig-id file
-mentioned below) are generated by the script.  The original dump file has
-a name like enwiki-20100905-pages-articles.xml.bz2; we also generate a permuted
-dump file with a name like enwiki-20100905-permuted-pages-articles.xml.bz2.
-
-The disambig-id file comes from $IN_DISAMBIG_ID_FILE, which is located
-in $TG_WIKIPEDIA_DIR (the directory where the results of preprocessing
-end up getting stored; set 'config-geolocate' in $TEXTGROUNDIR/bin).
-The reason for the exception regarding this particular file is that it's
-generated not by us but by Wikiprep, which may take several weeks to run.
-This file is also not especially important in the scheme of things --
-and in fact the relevant data is not currently used at all.  When the
-file is present, it lists articles that are identified as "disambiguation"
-pages, and this fact goes into one of the fields of the combined article
-data file.  If not present, all articles will have "no" in this field.
-As just mentioned, no current experiment apps make use of this info.
-
-Other important environment variables (with default settings in
-'config-geolocate', but which you might want to override):
-
-WP_VERSION       Specifies which dump file to use, e.g. "enwiki-20100905".
-NO_USE_PERMUTED  If set, uses the non-permuted version of the dump file.
-
-
-============== How to do preprocessing from scratch ===============
-
-The following is a possible set of steps to use to generate the necessary
-data files from scratch.
-
-1. Create a new directory to work in, where you have a lot of free space.
-   (For example, the /scratch dir on Longhorn.) Either download a dump file
-   from Wikipedia, or symlink an existing dump file into the new directory.
-   Let's say the dump file has the dump prefix 'enwiki-2011007' --
-   the English Wikipedia, dump of October 7, 2011.  Also assume that for
-   this and all future commands, we're in the new directory.
- 
-   If we want to download it, we might say
-
-wget http://dumps.wikimedia.org/enwiki/20111007/enwiki-20111007-pages-articles.xml.bz2
-
-   If we want to symlink from somewhere else, we might say
-
-ln -s ../../somewhere/else/enwiki-20111007-pages-articles.xml.bz2 .
-
-2. Generate the basic and combined article data files for the non-permuted dump
-
-WP_VERSION=enwiki-20111007 NO_USE_PERMUTED=t run-processwiki combined-article-data
-
-3. Generate a permuted dump file; all future commands will operate on the
-   permuted dump file, because we won't use NO_USE_PERMUTED.
-
-WP_VERSION=enwiki-20111007 run-permute all
-
-4. Generate the basic and combined article data files for the permuted dump
-
-WP_VERSION=enwiki-20111007 run-processwiki combined-article-data
-
-5. Generate the counts file for articles with coordinates -- this is the info
-   needed by most of the Geolocate experiments.
-
-WP_VERSION=enwiki-20111007 run-processwiki coord-counts
-
-6. Generate the counts and words files for all articles, splitting the dump
-   file so we can run in parallel.
-
-WP_VERSION=enwiki-20111007 split-dump
-WP_VERSION=enwiki-20111007 NUM_SIMULTANEOUS=8 run-processwiki all-counts all-words
-
-7. Move all final generated files (i.e. not including intermediate files) into
-   some final directory, e.g. $TG_WIKIPEDIA_DIR.
-
-mv -i *.bz2 *.txt $TG_WIKIPEDIA_DIR
-chmod a-w $TG_WIKIPEDIA_DIR/*
-
-   Note the use of '-i', which will query you in case you are trying to
-   overwrite an existing while.  We also run 'chmod' afterwards to make all
-   the files read-only, to lessen the possibility of accidentally overwriting
-   them later in another preprocessing run.
-
-============== How to rerun a single step ===============
-
-If all the preprocessing has already been done for you, and you simply want
-to run a single step, then you don't need to do all of the above steps.
-However, it's still strongly recommended that you do your work in a fresh
-directory, and symlink the dump file into that directory -- in this case the
-*permuted* dump file.  We use the permuted dump file for experiments because
-the raw dump file has a non-uniform distribution of articles, and so we can't
-e.g. count on our splits being uniformly distributed.  Randomly permuting
-the dump file and article lists takes care of that.  The permuted dump file
-has a name like
-
-enwiki-20111007-permuted-pages-articles.xml.bz2
-
-For example, if want to change processwiki.py to generate bigrams, and then
-run it to generate the bigram counts, you might do this:
-
-1. Note that there are currently options `output-coord-counts` to output
-   unigram counts only for articles with coordinates (which are the only ones
-   needed for standard document geotagging), and `output-all-counts` to
-   output unigram counts for all articles.  You want to add corresponding
-   options for bigram counts -- either something like
-   `output-coord-bigram-counts` and `output-all-bigram-counts`, or an option
-   `--n-gram` to specify the N-gram size (1 for unigrams, 2 for bigrams,
-   3 for trigrams if that's implemented, etc.).  *DO NOT* in any circumstance
-   simply hack the code so that it automatically outputs bigrams instead of
-   unigrams -- such code CANNOT be incorporated into the repository, which
-   means your mods will become orphaned and unavailable for anyone else.
-
-2. Modify 'config-geolocate' so that it has additional sets of environment
-   variables for bigram counts.  For example, after these lines:
-
-COORD_COUNTS_SUFFIX="counts-only-coord-documents.txt"
-ALL_COUNTS_SUFFIX="counts-all-documents.txt"
-
-   you'd add
-
-COORD_BIGRAM_COUNTS_SUFFIX="bigram-counts-only-coord-documents.txt"
-ALL_BIGRAM_COUNTS_SUFFIX="bigram-counts-all-documents.txt"
-
-   Similarly, after these lines:
-
-OUT_COORD_COUNTS_FILE="$DUMP_PREFIX-$COORD_COUNTS_SUFFIX"
-OUT_ALL_COUNTS_FILE="$DUMP_PREFIX-$ALL_COUNTS_SUFFIX"
-
-   you'd add
-
-OUT_COORD_BIGRAM_COUNTS_FILE="$DUMP_PREFIX-$COORD_BIGRAM_COUNTS_SUFFIX"
-OUT_ALL_BIGRAM_COUNTS_FILE="$DUMP_PREFIX-$ALL_BIGRAM_COUNTS_SUFFIX"
-
-   And then you'd do the same thing for IN_COORD_COUNTS_FILE and
-   IN_ALL_COUNTS_FILE.
-
-3. Modify 'run-processwiki', adding new targets ("steps")
-   'coord-bigram-counts' and 'all-bigram-counts'.  Here, you would just
-   copy the existing lines for 'coord-counts' and 'all-counts' and modify
-   them appropriately.
-
-4. Now finally you can run it:
-
-WP_VERSION=enwiki-20111007 run-processwiki coord-bigram-counts
-
-   This generates the bigram counts for geotagged articles -- the minimum
-   necessary for document geotagging.
-
-   Actually, since the above might take awhile and generate a fair amount
-   of diagnostic input, you might want to run it in the background
-   under nohup, so that it won't die if your terminal connection suddenly
-   dies.  One way to do that is to use the TextGrounder 'run-nohup' script:
-
-WP_VERSION=enwiki-20111007 run-nohup --id do-coord-bigram-counts run-processwiki coord-bigram-counts
-
-   Note that the '--id do-coord-bigram-counts' is optional; all it does is
-   insert the text "do-coord-bigram-counts" into the file that it stores
-   stdout and stderr output into.  This file will have a name beginning
-   'run-nohup.' and ending with a timestamp.  The beginning and ending of the
-   file will indicate the starting and ending times, so you can see how long
-   it took.
-
-   If you want to generate bigram counts for all articles, you could use a
-   similar command line, although it might take a couple of days to complete.
-   If you're on Longhorn, where you only have 24-hour time slots, you might
-   consider using the "divide-and-conquer" mode.  The first thing is to
-   split the dump file, like this:
-
-WP_VERSION=enwiki-20111007 run-processwiki split-dump
-
-   This takes maybe 45 mins and splits the whole dump file into 8 pieces.
-   (Controllable through NUM_SPLITS.)
-
-   Then, each operation you want to do in divide-and-conquer mode, run it
-   by setting NUM_SIMULTANEOUS to something more than 1, e.g.
-
-WP_VERSION=enwiki-20111007 NUM_SIMULTANEOUS=8 run-processwiki all-bigram-counts
-
-   (although you probably want to wrap it in 'run-nohup').  Essentially,
-   this runs 8 simultaneous run-processwiki processes (which fits well with
-   the workhorse Longhorn machines, since they are 8-core), one on each of
-   the 8 splits, and then concatenates the results together at the end.
-   You can set a NUM_SIMULTANEOUS that's lower than the number of splits,
-   and you get only that much simultaneity.

python/processwiki.py

                       'Help', 'Category', 'Thread', 'Summary', 'Portal',
                       'Book']
 
+article_namespaces = {}
+article_namespaces_lower = {}
+
 article_namespace_aliases = {
   'P':'Portal', 'H':'Help', 'T':'Template',
   'CAT':'Category', 'Cat':'Category', 'C':'Category',
   for arg in args:
     m = re.match(r'(?s)(.*?)=(.*)', arg)
     if m:
-      key = m.group(1).strip().lower()
+      key = m.group(1).strip().lower().replace('_','').replace(' ','')
       value = m.group(2)
       if strip_values:
         value = value.strip()
 the string "empty macro"), so that code that parses templates need
 not worry about crashing on these syntactic errors.'''
 
-  macroargs = [foo for foo in
-              parse_balanced_text(balanced_pipe_re, macro[2:-2])
-              if foo != '|']
+  macroargs1 = [foo for foo in
+               parse_balanced_text(balanced_pipe_re, macro[2:-2])]
+  macroargs2 = []
+  # Concatenate adjacent args if neither one is a |
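+  # E.g. ['a', '|', 'b', 'c'] becomes ['a', '|', 'bc'].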
+  for x in macroargs1:
+    if x == '|' or len(macroargs2) == 0 or macroargs2[-1] == '|':
+      macroargs2 += [x]
+    else:
+      macroargs2[-1] += x
+  macroargs = [x for x in macroargs2 if x != '|']
   if not macroargs:
     wikiwarning("Strange macro with no arguments: %s" % macroargs)
     return ['empty macro']
   if x is None:
     return None
   try:
+    x = x.strip()
+  except:
+    pass
+  try:
     return float(x)
   except:
-    x = x.strip()
+    # In the Portuguese version at least, we have entries like
+    # {{Coor title d|51.30.23|N|0.7.35|O|type:landmark|region:GB}}
+    m = re.match(r'(-?[0-9]+)\.([0-9]+)\.([0-9]+)', x)
+    if m:
+      (deg, min, sec) = m.groups()
+      return convert_dms(1, deg, min, sec)
     if x:
       wikiwarning("Expected number, saw %s" % x)
     return None
 deg/min/sec/DIR indicators like 45/32/30/E.'''
   if arg is None:
     return None
+  arg = arg.lstrip()
   if ' ' in arg:
     arg = re.sub(' .*$', '', arg)
   if '/' in arg:
-    m = re.match('([0-9.]+)/([0-9.]+)?/([0-9.]+)?/([NSEWnsew])', arg)
+    m = re.match('([0-9.]+)/([0-9.]+)?/([0-9.]+)?(?:/([NSEWOnsewo]))?', arg)
     if m:
       (deg, min, sec, offind) = m.groups()
-      offind = offind.upper()
-      if offind in convert_ns:
-        off = convert_ns[offind]
+      if offind:
+        offind = offind.upper()
+        if offind in convert_ns:
+          off = convert_ns[offind]
+        else:
+          off = convert_ew_german[offind]
       else:
-        off = convert_ew[offind]
+        off = 1
       return convert_dms(off, deg, min, sec)
     wikiwarning("Unrecognized DEG/MIN/SEC/HEMIS-style indicator: %s" % arg)
     return None
   else:
     return safe_float(arg)
 
-def convert_dms(nsew, d, m, s):
+def convert_dms(nsew, d, m, s, decimal = False):
   '''Convert a multiplier (1 or N or E, -1 for S or W) and degree/min/sec
 values into a decimal +/- latitude or longitude.'''
   lat = get_german_style_coord(d)
   if lat is None:
     return None
-  return nsew*(lat + safe_float(m, zero_on_error = True)/60. +
-      safe_float(s, zero_on_error = True)/3600.)
+  min = safe_float(m, zero_on_error = True)
+  sec = safe_float(s, zero_on_error = True)
+  if min < 0: min = -min
+  if sec < 0: sec = -sec
+  if min > 60:
+    wikiwarning("Out-of-bounds minutes %s" % min)
+    return None
+  if sec > 60:
+    wikiwarning("Out-of-bounds seconds %s" % sec)
+    return None
+  return nsew*(lat + min/60. + sec/3600.)
 
 convert_ns = {'N':1, 'S':-1}
 convert_ew = {'E':1, 'W':-1, 'L':1, 'O':-1}
+# Blah!! O=Ost="east" in German but O=Oeste="west" in Spanish/Portuguese
+convert_ew_german = {'E':1, 'W':-1, 'O':1}
 
 # Get the default value for the hemisphere, as a multiplier +1 or -1.
-# We need to handle Australian places specially, as S latitude, E longitude.
-# We need to handle Pittsburgh neighborhoods specially, as N latitude, W longitude.
+# We need to handle the following as S latitude, E longitude:
+#   -- Infobox Australia
+#   -- Info/Localidade de Angola
+#   -- Info/Município de Angola
+#   -- Info/Localidade de Moçambique
+
+# We need to handle the following as N latitude, W longitude:
+#   -- Infobox Pittsburgh neighborhood
+#   -- Info/Assentamento/Madeira
+#   -- Info/Localidade da Madeira
+#   -- Info/Assentamento/Marrocos
+#   -- Info/Localidade dos EUA
+#   -- Info/PousadaPC
+#   -- Info/Antigas freguesias de Portugal
 # Otherwise assume +1, so that we leave the values alone.  This is important
 # because some fields may specifically use signed values to indicate the
 # hemisphere directly, or use other methods of indicating hemisphere (e.g.
 # "German"-style "72/50/35/W").
 def get_hemisphere(temptype, is_lat):
-  if temptype.lower().startswith('infobox australia'):
-    if is_lat: return -1
-    else: return 1
-  elif temptype.lower().startswith('infobox pittsburgh neighborhood'):
-    if is_lat: return 1
-    else: return -1
-  else: return 1
+  for x in ('infobox australia', 'info/localidade de angola',
+      u'info/município de angola', u'info/localidade de moçambique'):
+    if temptype.lower().startswith(x):
+      if is_lat: return -1
+      else: return 1
+  for x in ('infobox pittsburgh neighborhood', 'info/assentamento/madeira',
+      'info/assentamento/marrocos', 'info/localidade dos eua', 'info/pousadapc',
+      'info/antigas freguesias de portugal'):
+    if temptype.lower().startswith(x):
+      if is_lat: return 1
+      else: return -1
+  return 1
 
 # Get an argument (ARGSEARCH) by name from a hash table (ARGS).  Multiple
 # synonymous names can be looked up by giving a list or tuple for ARGSEARCH.
       val = args.get(x, None)
       if val is not None:
         return val
-    if warnifnot:
+    if warnifnot or debug['some']:
       wikiwarning("None of params %s seen in template {{%s|%s}}" % (
         ','.join(argsearch), temptype, bound_string_length('|'.join(rawargs))))
   else:
     val = args.get(argsearch, None)
     if val is not None:
       return val
-    if warnifnot:
+    if warnifnot or debug['some']:
       wikiwarning("Param %s not seen in template {{%s|%s}}" % (
         argsearch, temptype, bound_string_length('|'.join(rawargs))))
   return None
       convert = convert_ns
     else:
       convert = convert_ew
-    hemismult = convert.get(hemis, 0)
-    if hemismult == 0:
+    hemismult = convert.get(hemis, None)
+    if hemismult is None:
       wikiwarning("%s for template type %s has bad value: [%s]" %
                (offparam, temptype, hemis))
+      return None
   return convert_dms(hemismult, d, m, s)
 
-latd_arguments = ('latd', 'latg', 'lat_d',
-  'latdeg', 'lat_deg', 'lat_degrees', 'latitudedegrees',
-  'latitudinegradi', 'latitudine_gradi', 'latitudine gradi',
-  'latgradi',
-  'latitudine_d',
-  'latitudegraden',
-  'breitengrad', 'breddegrad', 'bredde_grad')
+latd_arguments = ('latd', 'latg', 'latdeg', 'latdegrees', 'latitudedegrees',
+  'latitudinegradi', 'latgradi', 'latitudined', 'latitudegraden',
+  'breitengrad', 'breddegrad')
 def get_latd_coord(temptype, args, rawargs):
   '''Given a template of type TEMPTYPE with arguments ARGS (converted into
 a hash table; also available in raw form as RAWARGS), assumed to have
 (latitude, longitude) values.'''
   lat = get_lat_long_1(temptype, args, rawargs,
       latd_arguments,
-      ('latm', 'latmin', 'lat_min', 'lat_m', 'lat_minutes', 'latitudeminutes',
-         'latitudineprimi', 'latitudine_primi', 'latitudine primi',
-         'latprimi',
-         'latitudineminuti', 'latitudine_minuti', 'latitudine minuti',
-         'latminuti',
-         'latitudine_m',
-         'latitudeminuten',
-         'breitenminute', 'bredde_min'),
-      ('lats', 'latsec', 'lat_sec', 'lat_s', 'lat_seconds', 'latitudeseconds',
-         'latitudinesecondi', 'latitudine_secondi', 'latitudine secondi',
-         'latsecondi',
-         'latitudine_s',
-         'latitudeseconden',
+      ('latm', 'latmin', 'latminutes', 'latitudeminutes',
+         'latitudineprimi', 'latprimi',
+         'latitudineminuti', 'latminuti', 'latitudinem', 'latitudeminuten',
+         'breitenminute', 'breddemin'),
+      ('lats', 'latsec', 'latseconds', 'latitudeseconds',
+         'latitudinesecondi', 'latsecondi', 'latitudines', 'latitudeseconden',
          'breitensekunde'),
-      ('latns', 'latp', 'lap', 'lat_dir', 'lat_direction',
-         'latitudinens', 'latitudine_ns', 'latitudine ns'),
+      ('latns', 'latp', 'lap', 'latdir', 'latdirection', 'latitudinens'),
       is_lat=True)
   long = get_lat_long_1(temptype, args, rawargs,
       # Typos like Longtitude do occur in the Spanish Wikipedia at least
-      ('longd', 'lond', 'longg', 'long',
-         'londeg', 'lon_deg', 'long_d', 'long_degrees',
-         'longitudinegradi', 'longitudine_gradi', 'longitudine gradi',
-         'longgradi',
-         'longitudine_d',
+      ('longd', 'lond', 'longg', 'long', 'longdeg', 'londeg',
+         'longdegrees', 'londegrees',
+         'longitudinegradi', 'longgradi', 'longitudined',
          'longitudedegrees', 'longtitudedegrees',
          'longitudegraden',
-         u'längengrad', 'laengengrad', 'lengdegrad', u'længde_grad'),
-      ('longm', 'lonm', 'lonmin', 'lon_min', 'long_m', 'long_minutes',
-         'longitudineprimi', 'longitudine_primi', 'longitudine primi',
-         'longprimi',
-         'longitudineminuti', 'longitudine_minuti', 'longitudine minuti',
-         'longminuti',
-         'longitudine_m',
+         u'längengrad', 'laengengrad', 'lengdegrad', u'længdegrad'),
+      ('longm', 'lonm', 'longmin', 'lonmin',
+         'longminutes', 'lonminutes',
+         'longitudineprimi', 'longprimi',
+         'longitudineminuti', 'longminuti', 'longitudinem',
          'longitudeminutes', 'longtitudeminutes',
          'longitudeminuten',
-         u'längenminute', u'længde_min'),
-      ('longs', 'lons', 'lonsec', 'lon_sec', 'long_s', 'long_seconds',
-         'longitudinesecondi', 'longitudine_secondi', 'longitudine secondi',
-         'longsecondi',
-         'longitudine_s',
+         u'längenminute', u'længdemin'),
+      ('longs', 'lons', 'longsec', 'lonsec',
+         'longseconds', 'lonseconds',
+         'longitudinesecondi', 'longsecondi', 'longitudines',
          'longitudeseconds', 'longtitudeseconds',
          'longitudeseconden',
          u'längensekunde'),
-      ('longew', 'longp', 'lonp', 'lon_dir', 'long_direction',
-         'longitudineew', 'longitudine_ew', 'longitudine ew'),
+      ('longew', 'lonew', 'longp', 'lonp', 'longdir', 'londir',
+         'longdirection', 'londirection', 'longitudineew'),
       is_lat=False)
   return (lat, long)
 
 a latitude/longitude specification in it using stopniN/etc. (where the
 direction NSEW is built into the argument name), extract out and return a
 tuple of decimal (latitude, longitude) values.'''
-  if getarg(built_in_latd_north_arguments) is not None:
+  if getarg(built_in_latd_north_arguments, temptype, args, rawargs) is not None:
     mult = 1
-  elif getarg(built_in_latd_south_arguments) is not None:
+  elif getarg(built_in_latd_south_arguments, temptype, args, rawargs) is not None:
     mult = -1
   else:
     wikiwarning("Didn't see any appropriate stopniN/stopniS param")
       ('minutn', 'minuts'),
       ('sekundn', 'sekunds'),
       mult)
-  if getarg(built_in_longd_north_arguments) is not None:
+  if getarg(built_in_longd_north_arguments, temptype, args, rawargs) is not None:
     mult = 1
-  elif getarg(built_in_longd_south_arguments) is not None:
+  elif getarg(built_in_longd_south_arguments, temptype, args, rawargs) is not None:
     mult = -1
   else:
     wikiwarning("Didn't see any appropriate stopniE/stopniW param")
   return (lat, long)
 
 latitude_arguments = ('latitude', 'latitud', 'latitudine',
-    # NOTE: We want to prefer breitengrad over breite because islands may
-    # have both, with breite simply specifying the width while breitengrad
-    # specifies the latitude.  But sometimes breitengrad occurs with
-    # breitenminute, so we list it in the latd arguments as well, which
-    # we check first.
-    'breitengrad', 'breite',
+    'breitengrad',
+    # 'breite', Sometimes used for latitudes but also for other types of width
     #'lat' # Appears in non-article coordinates
-    #'lat_dec' # Appears to be associated with non-Earth coordinates
+    #'latdec' # Appears to be associated with non-Earth coordinates
     )
 longitude_arguments = ('longitude', 'longitud', 'longitudine',
-    u'längengrad', u'laengengrad', u'länge', u'laenge'
+    u'längengrad', u'laengengrad',
+    # u'länge', u'laenge', Sometimes used for longitudes but also for other lengths
     #'long' # Appears in non-article coordinates
-    #'long_dec' # Appears to be associated with non-Earth coordinates
+    #'longdec' # Appears to be associated with non-Earth coordinates
     )
 
 def get_latitude_coord(temptype, args, rawargs):
     temptype, args, rawargs))
   return (lat, long)
 
+def get_infobox_ort_coord(temptype, args, rawargs):
+  '''Given a template 'Infobox Ort' with arguments ARGS, assumed to have
+a latitude/longitude specification in it, extract out and return a tuple of
+decimal (latitude, longitude) values.'''
+  # German-style (e.g. 72/53/15/E) also occurs with 'latitude' and such,
+  # so just check for it everywhere.
+  lat = get_german_style_coord(getarg((u'breite',),
+    temptype, args, rawargs))
+  long = get_german_style_coord(getarg((u'länge', u'laenge'),
+    temptype, args, rawargs))
+  return (lat, long)
+
 # Utility function for get_coord().  Extract out the latitude or longitude
 # values out of a Coord structure.  Return a tuple (OFFSET, VAL) for decimal
 # latitude or longitude VAL and OFFSET indicating the offset of the next
 # argument after the arguments used to produce the value.
-def get_coord_1(args, nsew, convert_nsew):
-  if args[1] in nsew:
+def get_coord_1(args, convert_nsew):
+  if args[1] in convert_nsew:
     d = args[0]; m = 0; s = 0; i = 1
-  elif args[2] in nsew:
+  elif args[2] in convert_nsew:
     d = args[0]; m = args[1]; s = 0; i = 2
-  elif args[3] in nsew:
+  elif args[3] in convert_nsew:
     d = args[0]; m = args[1]; s = args[2]; i = 3
-  else: return (1, args[0])
+  else:
+    # Will happen e.g. in the style where only positive/negative are given
+    return (1, convert_dms(1, args[0], 0, 0))
   return (i+1, convert_dms(convert_nsew[args[i]], d, m, s))
 
 # FIXME!  To be more accurate, we need to look at the template parameters,
         country plus next-level subdivision (state, province, etc.)
 globe: which planet or satellite the coordinate is on (esp. if not the Earth)
 '''
-  if debug['some']: errprint("Passed in args %s" % args)
+  if debug['some']: errprint("Coord: Passed in args %s" % args)
   # Filter out optional "template arguments", add a bunch of blank arguments
   # at the end to make sure we don't get out-of-bounds errors in
   # get_coord_1()
   filtargs = [x for x in args if '=' not in x]
   if filtargs:
     filtargs += ['','','','','','']
-    (i, lat) = get_coord_1(filtargs, ('N','S'), convert_ns)
-    (_, long) = get_coord_1(filtargs[i:], ('E','W'), convert_ew)
+    (i, lat) = get_coord_1(filtargs, convert_ns)
+    (_, long) = get_coord_1(filtargs[i:], convert_ew)
     return (lat, long)
   else:
     (paramshash, _) = find_template_params(args, True)
     long = safe_float(long)
     return (lat, long)
 
-def get_coordinate_coord(temptype, rawargs):
+def check_for_bad_globe(paramshash):
+  if debug['some']: errprint("check_for_bad_globe: Passed in args %s" % paramshash)
+  globe = paramshash.get('globe', "").strip()
+  if globe:
+    if globe == "earth":
+      wikiwarning("Interesting, saw globe=earth")
+    else:
+      wikiwarning("Rejecting as non-earth, in template 'Coordinate/Coord/etc.' saw globe=%s"
+          % globe)
+      return True
+  return False
+
+def get_coordinate_coord(extract_coords_obj, temptype, rawargs):
   '''Parse a Coordinate template and return a tuple (lat,long) for latitude and
 longitude.  TEMPTYPE is the template name.  ARGS is the raw arguments for
 the template.  These templates tend to occur in the German Wikipedia. Examples:
 '''
   if debug['some']: errprint("Passed in args %s" % rawargs)
   (paramshash, _) = find_template_params(rawargs, True)
+  if check_for_bad_globe(paramshash):
+    extract_coords_obj.notearth = True
+    return (None, None)
   lat = get_german_style_coord(getarg('ns', temptype, paramshash, rawargs))
   long = get_german_style_coord(getarg('ew', temptype, paramshash, rawargs))
   return (lat, long)
   if debug['some']: errprint("Passed in args %s" % args)
   # Filter out optional "template arguments"
   filtargs = [x for x in args if '=' not in x]
+  if debug['some']: errprint("get_coord_params: filtargs: %s" % filtargs)
+  hash = {}
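+  # E.g. a final argument "type:landmark_region:GB" yields
+  # {'type': 'landmark', 'region': 'GB'}.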
   if filtargs and ':' in filtargs[-1]:
-    coord_params = [tuple(x.split(':')) for x in filtargs[-1].split('_')]
-    return coord_params
-  else:
-    return []
-
-def get_coord_params(temptype, args):
-  '''Parse a Coord template and return a list of tuples of coordinate
-parameters (see comment under get_coord).'''
-  if debug['some']: errprint("Passed in args %s" % args)
-  # Filter out optional "template arguments"
-  filtargs = [x for x in args if '=' not in x]
-  if filtargs and ':' in filtargs[-1]:
-    coord_params = [tuple(x.split(':')) for x in filtargs[-1].split('_')]
-    return coord_params
-  else:
-    return []
+    for x in filtargs[-1].split('_'):
+      if ':' in x:
+        (key, value) = x.split(':', 1)
+        hash[key] = value
+  return hash
 
 def get_geocoordenadas_coord(temptype, args):
   '''Parse a geocoordenadas template (common in the Portuguese Wikipedia) and
 
   def __init__(self):
     self.coords = []
+    self.notearth = False
 
   def process_template(self, text):
     # Look for a Coord, Infobox, etc. template that may have coordinates in it
     if debug['some']: errprint("Template type: %s" % temptype)
     lowertemp = temptype.lower()
     rawargs = tempargs[1:]
+    if (lowertemp.startswith('info/crater') or
+        lowertemp.endswith(' crater data') or
+        lowertemp.startswith('marsgeo') or
+        lowertemp.startswith('encelgeo') or
+        # All of the following are for heavenly bodies
+        lowertemp.startswith('infobox feature on ') or
+        lowertemp in (u'info/acidente geográfico de vênus',
+                      u'infobox außerirdische region',
+                      'infobox lunar mare', 'encelgeo-crater',
+                      'infobox marskrater', 'infobox mondkrater',
+                      'infobox mondstruktur')):
+        self.notearth = True
+        wikiwarning("Rejecting as not on Earth because saw template %s" % temptype)
+        return []
     # Look for a coordinate template
     if lowertemp in ('coord', 'coordp', 'coords',
                      'koord', #Norwegian
         or lowertemp.startswith('mapit') \
         or lowertemp.startswith('koordynaty'): # Coordinates in Polish:
       (lat, long) = get_coord(temptype, rawargs)
+      coord_params = get_coord_params(temptype, tempargs[1:])
+      if check_for_bad_globe(coord_params):
+        self.notearth = True
+        return []
     elif lowertemp == 'coordinate':
-      (lat, long) = get_coordinate_coord(temptype, rawargs)
+      (lat, long) = get_coordinate_coord(self, temptype, rawargs)
     elif lowertemp in ('geocoordenadas', u'coördinaten'):
       # geocoordenadas is Portuguese, coördinaten is Dutch, and they work
       # the same way
       (paramshash, _) = find_template_params(rawargs, True)
       if getarg(latd_arguments, temptype, paramshash, rawargs, warnifnot=False) is not None:
         #errprint("seen: [%s] in {{%s|%s}}" % (getarg(latd_arguments, temptype, paramshash, rawargs), temptype, rawargs))
-        templates_with_coords[lowertemp] += 1
         (lat, long) = get_latd_coord(temptype, paramshash, rawargs)
       # NOTE: DO NOT CHANGE ORDER.  We want to check latd first and check
       # latitude afterwards for various reasons (e.g. so that cases where
       # suffice.
       elif getarg(latitude_arguments, temptype, paramshash, rawargs, warnifnot=False) is not None:
         #errprint("seen: [%s] in {{%s|%s}}" % (getarg(latitude_arguments, temptype, paramshash, rawargs), temptype, rawargs))
-        templates_with_coords[lowertemp] += 1
         (lat, long) = get_latitude_coord(temptype, paramshash, rawargs)
       elif (getarg(built_in_latd_north_arguments, temptype, paramshash,
                    rawargs, warnifnot=False) is not None or
                    rawargs, warnifnot=False) is not None):
         #errprint("seen: [%s] in {{%s|%s}}" % (getarg(built_in_latd_north_arguments, temptype, paramshash, rawargs), temptype, rawargs))
         #errprint("seen: [%s] in {{%s|%s}}" % (getarg(built_in_latd_south_arguments, temptype, paramshash, rawargs), temptype, rawargs))
-        templates_with_coords[lowertemp] += 1
         (lat, long) = get_built_in_lat_coord(temptype, paramshash, rawargs)
+      elif lowertemp in ('infobox ort', 'infobox verwaltungseinheit'):
+        (lat, long) = get_infobox_ort_coord(temptype, paramshash, rawargs)
 
-    if debug['some']: errprint("Saw coordinate %s,%s in template type %s" %
+    if debug['some']: wikiwarning("Saw coordinate %s,%s in template type %s" %
               (lat, long, temptype))
     if lat is None and long is not None:
-      errprint("Saw longitude %s but no latitude in template: %s" %
+      wikiwarning("Saw longitude %s but no latitude in template: %s" %
           (long, bound_string_length(text)))
     if long is None and lat is not None:
-      errprint("Saw latitude %s but no latitude in template: %s" %
+      wikiwarning("Saw latitude %s but no longitude in template: %s" %
           (lat, bound_string_length(text)))
     if lat is not None and long is not None:
-      self.coords.append((lowertemp,lat,long))
+      if lat == 0.0 and long == 0.0:
+        wikiwarning("Rejecting coordinate because zero latitude and longitude seen")
+      elif lat > 90.0 or lat < -90.0 or long > 180.0 or long < -180.0:
+        wikiwarning("Rejecting coordinate because out of bounds latitude or longitude: (%s,%s)" % (lat, long))
+      else:
+        if lat == 0.0 or long == 0.0:
+          wikiwarning("Zero value in latitude and/or longitude: (%s,%s)" %
+              (lat, long))
+        self.coords.append((lowertemp,lat,long))
+        templates_with_coords[lowertemp] += 1
     # Recursively process the text inside the template in case there are
     # coordinates in it.
     return self.process_source_text(text[2:-2])
         or lowertemp.startswith('mapit'):
       params = get_coord_params(temptype, tempargs[1:])
       if params:
+        # WARNING, this returns a hash table, not a list of tuples
+        # like the others do below.
         self.loctype += [['coord-params', params]]
     else:
       (paramshash, _) = find_template_params(tempargs[1:], True)
       if lowertemp == 'infobox settlement':
         params = []
-        for x in ['settlement_type',
-                  'subdivision_type', 'subdivision_type1', 'subdivision_type2',
-                  'subdivision_name', 'subdivision_name1', 'subdivision_name2',
-                  'coordinates_type', 'coordinates_region']:
+        for x in ['settlementtype',
+                  'subdivisiontype', 'subdivisiontype1', 'subdivisiontype2',
+                  'subdivisionname', 'subdivisionname1', 'subdivisionname2',
+                  'coordinatestype', 'coordinatesregion']:
           val = paramshash.get(x, None)
           if val:
             params += [(x, val)]
         self.loctype += [['infobox-settlement', params]]
-      elif ('latd' in paramshash or 'lat_deg' in paramshash or
+      elif ('latd' in paramshash or 'latdeg' in paramshash or
           'latitude' in paramshash):
         self.loctype += \
             [['other-template-with-coord', [('template', temptype)]]]
   if m:
     # Something like [[Image:...]] or [[wikt:...]] or [[fr:...]]
     namespace = m.group(1).lower()
-    if namespace in ('image', 'file'):
+    namespace = article_namespaces_lower.get(namespace, namespace)
+    if namespace in ('image', 6): # 6 = file
       # For image links, filter out non-interesting args
       for arg in tempargs[1:]:
         # Ignore uninteresting args
       # A fairly arbitrary list of "interesting" parameters.
       if re.match(r'(last|first|authorlink)[1-9]?$', key) or \
          re.match(r'(author|editor)[1-9]?-(last|first|link)$', key) or \
-         key in ('coauthors', 'others', 'title', 'trans_title',
-                 'quote', 'work', 'contribution', 'chapter', 'trans_chapter',
+         key in ('coauthors', 'others', 'title', 'transtitle',
+                 'quote', 'work', 'contribution', 'chapter', 'transchapter',
                  'series', 'volume'):
         yield value
   elif re.match(r'infobox', temptype):
     # Handle Infoboxes.
     for (key,value) in paramhash.items():
       # A fairly arbitrary list of "interesting" parameters.
+      # Remember that _ and space are removed.
       if key in ('name', 'fullname', 'nickname', 'altname', 'former',
-                 'alt', 'caption', 'description', 'title', 'title_orig',
-                 'image_caption', 'imagecaption', 'map_caption', 'mapcaption',
+                 'alt', 'caption', 'description', 'title', 'titleorig',
+                 'imagecaption', 'mapcaption',
                  # Associated with states, etc.
                  'motto', 'mottoenglish', 'slogan', 'demonym', 'capital',
                  # Add more here
 def extract_coordinates_from_article(text):
   handler = ExtractCoordinatesFromSource()
   for foo in handler.process_source_text(text): pass
-  if len(handler.coords) > 0:
+  if handler.notearth:
+    return None
+  elif len(handler.coords) > 0:
     # Prefer a coordinate specified using {{Coord|...}} or similar to
     # a coordinate in an Infobox, because the latter tend to be less
     # accurate.
     yesno = {True:'yes', False:'no'}
     listof = self.title.startswith('List of ')
     disambig = self.id in disambig_pages_by_id
-    list = listof or disambig or namespace in ('Category', 'Book')
+    nskey = article_namespace_aliases.get(namespace, namespace)
+    # Map the (canonicalized) namespace name to its numeric key.
+    nskey = article_namespaces.get(nskey, nskey)
+    list = listof or disambig or nskey in (14, 108) # ('Category', 'Book')
     outprint("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s" %
              (self.id, self.title, cur_split_name, redirtitle, namespace,
               yesno[listof], yesno[disambig], yesno[list]))
   def __init__(self, output_handler):
     errprint("Beginning processing of Wikipedia dump...")
     self.curpath = []
+    self.curattrs = []
     self.curtext = None
     self.output_handler = output_handler
     self.status = StatusMessage('article')
     
   def startElement(self, name, attrs):
     '''Handler for beginning of XML element.'''
-    if debug['sax']: errprint("startElement() saw %s/%s" % (name, attrs))
+    if debug['sax']:
+      errprint("startElement() saw %s/%s" % (name, attrs))
+      for (key,val) in attrs.items(): errprint("  Attribute (%s,%s)" % (key,val))
     # We should never see an element inside of the Wikipedia text.
     if self.curpath:
       assert self.curpath[-1] != 'text'
     self.curpath.append(name)
+    self.curattrs.append(attrs)
     self.curtext = []
     # We care about the title, ID, and redirect status.  Reset them for
     # every page; this is especially important for redirect status.
     eltext = ''.join(self.curtext) if self.curtext else ''
     self.curtext = None # Stop tracking text
     self.curpath.pop()
+    attrs = self.curattrs.pop()
     if name == 'title':
       self.title = eltext
     # ID's occur in three places: the page ID, revision ID and contributor ID.
       self.id = eltext
     elif name == 'redirect':
       self.redirect = True
+    elif name == 'namespace':
+      # Namespace keys in the dump are numeric strings; convert to int so
+      # comparisons against keys like 6 (File) and 14 (Category) work.
+      key = int(attrs.getValue("key"))
+      if debug['sax']: errprint("Saw namespace, key=%s, eltext=%s" %
+          (key, eltext))
+      article_namespaces[eltext] = key
+      article_namespaces_lower[eltext.lower()] = key
     elif name == 'text':
       # If we saw the end of the article text, join all the text chunks
       # together and call process_article_text() on it.