Source

textgrounder/bin/run-processwiki

#!/bin/sh

if [ -z "$TEXTGROUNDER_DIR" ]; then
  echo "Must set TEXTGROUNDER_DIR to top level of TextGrounder distribution"
  exit 1
fi

. $TEXTGROUNDER_DIR/bin/config-geolocate

TG_PYTHON_DIR="$TEXTGROUNDER_DIR/python"

PROCESSWIKI="$TG_PYTHON_DIR/processwiki.py"
GENERATE_COMBINED="$TG_PYTHON_DIR/generate_combined.py"

LOGFILE="generate-all-data.log"

OTHEROPTS="$MAXTIME $DEBUG"

if [ -z "$NUM_SPLITS" ]; then
  NUM_SPLITS=8
  echo "Setting number of splits to default value of $NUM_SPLITS"
else
  echo "Setting number of splits to $NUM_SPLITS, taken from env. var. NUM_SPLITS"
fi

if [ -z "$NUM_SIMULTANEOUS" ]; then
  NUM_SIMULTANEOUS=1
  echo "Setting number of simultaneous processes to default value of $NUM_SIMULTANEOUS"
else
  echo "Setting number of simultaneous processes to $NUM_SIMULTANEOUS, taken from env. var. NUM_SIMULTANEOUS"
fi
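
# Example override (values purely illustrative): split the dump into 16 pieces
# and run at most 4 processwiki jobs at once.  NUM_SPLITS should match the
# value used when the dump was split, since the divide-and-conquer loop below
# looks for exactly that many split files:
#
#   NUM_SPLITS=16 run-processwiki split-dump
#   NUM_SPLITS=16 NUM_SIMULTANEOUS=4 run-processwiki all-counts all-words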

SPLIT_PREFIX="$WP_VERSION-split-processwiki"

if [ -z "$*" ]; then
  cat <<FOO
Usage: $0 [STEPS ...]

Generate the various necessary data files.

Possible steps:

article-data = Generate basic article data file
coords = Generate article coordinates
coord-links = Generate article incoming links, only for articles with
              coordinates or redirects to such articles
combine-article-data = Combine the previous three outputs into a combined
        article data file
split-dump = Split the dump into pieces
coord-counts = Generate counts file, articles with coordinates only
all-counts = Generate counts file, all articles
coord-words = Generate words file (i.e. raw text of articles), articles
               with coordinates only
all-words = Generate words file, all articles
coord-words-untok = Same as 'coord-words' but split only on whitespace;
                     don't attempt further tokenization (e.g. separating out
                     periods that are likely to be end-of-sentence markers).
all-words-untok = Same as 'all-words' but without further tokenization, as in
                  'coord-words-untok'.
toponym-eval = Generate data file for use in toponym evaluation.  The file
               is similar in format to a counts file, but also has internal
               links marked specially, indicating both the surface text of
               the link and the article linked to, providing the article
               linked to has a geotag.  These links can be taken to be
               toponyms to be resolved, particularly when the surface text
               and article name are not the same; e.g. the surface text
               "Georgia" may variously refer to the U.S. state, the country
               in the Caucasus, or various other places.

Also possible are combinations of steps, e.g.

combined-article-data = article-data coords coord-links combine-article-data
all = article-data coords coord-links combine-article-data coord-counts coord-words all-counts all-words
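
For example, the following two invocations request the same set of steps:

run-processwiki combined-article-data
run-processwiki article-data coords coord-links combine-article-data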

Input comes from the current directory, with the single exception of
$IN_DISAMBIG_ID_FILE, which comes from $TG_WIKIPEDIA_DIR (set by the
environment variable TG_WIKIPEDIA_DIR or similar; see 'config-geolocate' in
$TEXTGROUNDER_DIR/bin).  The reason for the exception regarding this particular
file is that it's generated not by us but by Wikiprep, which may take
several weeks to run.  This file is also not especially important in the
scheme of things -- and in fact the relevant data is not currently used at all.
When the file is present, it lists articles that are identified as
"disambiguation" pages, and this fact goes into one of the fields of the
combined article data file.  If not present, all articles will have "no"
in this field.  As just mentioned, no current experiment apps make use of this
info.

All files other than the original dump file (and the disambig-id file
mentioned above) are generated by these scripts.  The original dump file has
a name like enwiki-20100905-pages-articles.xml.bz2; we also generate a permuted
dump file with a name like enwiki-20100905-permuted-pages-articles.xml.bz2.

The original dump file needs to be in the current directory, and it's strongly
suggested that this script be run in a newly-created directory, empty save
for the dump file (or a symlink to it), with the dump file marked read-only
through 'chmod a-w'.
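
For example, once the dump file (or a symlink to it) sits in the otherwise-empty
working directory, it can be write-protected like this (file name purely
illustrative):

chmod a-w enwiki-20100905-pages-articles.xml.bz2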

Other important environment variables (with default settings in
'config-geolocate', but which you might want to override):

WP_VERSION       Specifies which dump file to use, e.g. "enwiki-20100905".
USE_PERMUTED     If set to "false", use the non-permuted version of the dump
                 file.  If set to "true", always try to use the permuted
                 version.  If blank, use the permuted version if it appears
                 to exist, and the non-permuted version otherwise.
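
For example, to run a step against the non-permuted version of a particular
dump (the version string is just an illustration):

WP_VERSION=enwiki-20100905 USE_PERMUTED=false run-processwiki coord-counts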

Output files are in the current directory.


The following is a possible set of steps to use to generate the necessary
data files from scratch.

1. Create a new directory to work in, where you have a lot of free space.
   (For example, the /scratch dir on Longhorn.) Either download a dump file
   from Wikipedia, or symlink an existing dump file into the new directory.
   Let's say the dump file has the dump prefix 'enwiki-20111007' --
   the English Wikipedia, dump of October 7, 2011.  Also assume that for
   this and all future commands, we're in the new directory.
 
   If we want to download it, we might say

wget http://dumps.wikimedia.org/enwiki/20111007/enwiki-20111007-pages-articles.xml.bz2

   If we want to symlink from somewhere else, we might say

ln -s ../../somewhere/else/enwiki-20111007-pages-articles.xml.bz2 .

2. Generate the basic and combined article data files for the non-permuted dump

WP_VERSION=enwiki-20111007 USE_PERMUTED=false run-processwiki combined-article-data

3. Generate a permuted dump file; all future commands will operate on the
   permuted dump file, because we won't specify a value for USE_PERMUTED.

WP_VERSION=enwiki-20111007 run-permute all

4. Generate the basic and combined article data files for the permuted dump

WP_VERSION=enwiki-20111007 run-processwiki combined-article-data

5. Generate the counts file for articles with coordinates -- this is the info
   needed by most of the Geolocate experiments.

WP_VERSION=enwiki-20111007 run-processwiki coord-counts

6. Generate the counts and words files for all articles, splitting the dump
   file so we can run in parallel.

WP_VERSION=enwiki-20111007 run-processwiki split-dump
WP_VERSION=enwiki-20111007 NUM_SIMULTANEOUS=8 run-processwiki all-counts all-words
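
   The number of split pieces is controlled by NUM_SPLITS (default 8), and the
   same value should be used for 'split-dump' and for the later parallel run.
   For example, with 16 pieces and 4 simultaneous processes (values purely
   illustrative):

WP_VERSION=enwiki-20111007 NUM_SPLITS=16 run-processwiki split-dump
WP_VERSION=enwiki-20111007 NUM_SPLITS=16 NUM_SIMULTANEOUS=4 run-processwiki all-counts all-words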

7. Move all final generated files (i.e. not including intermediate files) into
   some final directory, e.g. $TG_WIKIPEDIA_DIR.

mv -i *.bz2 *.txt $TG_WIKIPEDIA_DIR
chmod a-w $TG_WIKIPEDIA_DIR/*

   Note the use of '-i', which will query you in case you are trying to
   overwrite an existing file.  We also run 'chmod' afterwards to make all
   the files read-only, to lessen the possibility of accidentally overwriting
   them later in another preprocessing run.

FOO
  exit 1
fi

if [ "$*" = "all" ]; then
  steps="article-data coords coord-links combine-article-data coord-counts coord-words all-counts all-words"
elif [ "$*" = "combined-article-data" ]; then
  steps="article-data coords coord-links combine-article-data"
else
  steps="$*"
fi

echo "Steps are $steps"
echo "Using dump file $OUT_DUMP_FILE"

for step in $steps; do
echo "Executing step '$step' ..."

# Per-step settings used by the dispatch code after the if/elif chain below:
# action (description; empty means the step ran its own command inline),
# args (options for processwiki.py), outfile (output file; empty forces
# non-split mode), cansplit ("no" disables divide-and-conquer mode).
action=
cansplit=yes
outfile=
args=

if [ "$step" = article-data ]; then

# Use a listing of disambiguation pages if it exists, but not otherwise
if [ -e "$IN_DISAMBIG_ID_FILE" ]; then
  disambig_arg="--disambig-id-file $IN_DISAMBIG_ID_FILE"
else
  disambig_arg=
fi

action="Generating article data"
args="$disambig_arg --split-training-dev-test foobar --generate-article-data"
outfile="$OUT_ORIG_DOCUMENT_DATA_FILE"
# Don't split because there is a prolog line.
cansplit=no

elif [ "$step" = coords ]; then

action="Generating coordinate data"
args="--output-coords"
outfile="$OUT_COORDS_FILE"

elif [ "$step" = location-type ]; then

action="Generating location-type data"
args="--output-location-type"
# Don't split because we output to separate split files (FIXME why?).
cansplit=no

elif [ "$step" = coord-links ]; then

action="Generating link data"
args="--coords-file $OUT_COORDS_FILE \
  --article-data-file $OUT_ORIG_DOCUMENT_DATA_FILE \
  --find-coord-links"
outfile="$OUT_COORD_LINKS_FILE"
# Don't split because we output link info at the very end.
cansplit=no

elif [ "$step" = combine-article-data ]; then

# Uses a different program, not processwiki.
echo "Combining article data ..."
args="--links-file $OUT_COORD_LINKS_FILE \
  --coords-file $OUT_COORDS_FILE \
  --article-data-file $OUT_ORIG_DOCUMENT_DATA_FILE"
outfile="$OUT_COMBINED_DOCUMENT_DATA_FILE"
cmd="$GENERATE_COMBINED $args > $outfile"
echo "Executing at `date`: $cmd"
$GENERATE_COMBINED $args > $outfile
echo "Ended at `date`: $cmd"

elif [ "$step" = split-dump ]; then

PERMUTE_WIKI="$TG_PYTHON_DIR/permute_wiki.py"

# Uses a different program, not processwiki.
echo "Splitting dump file ..."
args="--mode=split --article-data-file $OUT_ORIG_DOCUMENT_DATA_FILE \
  --split-prefix $SPLIT_PREFIX \
  --number-of-splits $NUM_SPLITS $OTHEROPTS"
cmd="bzcat $OUT_DUMP_FILE | $PERMUTE_WIKI $args"
echo "Executing at `date`: $cmd"
bzcat $OUT_DUMP_FILE | $PERMUTE_WIKI $args
echo "Ended at `date`: $cmd"

elif [ "$step" = coord-counts ]; then

action="Generating word count data, coord articles only"
args="--output-coord-counts"
outfile="$OUT_COORD_COUNTS_FILE"

elif [ "$step" = all-counts ]; then

action="Generating word count data, all articles"
args="--output-all-counts"
outfile="$OUT_ALL_COUNTS_FILE"

elif [ "$step" = toponym-eval ]; then

action="Generating toponym eval data"
args="--coords-file $OUT_COORDS_FILE \
  --article-data-file $OUT_ORIG_DOCUMENT_DATA_FILE \
  --generate-toponym-eval"
outfile="$OUT_TOPONYM_EVAL_FILE"

elif [ "$step" = coord-words ]; then

action="Generating raw text, coord articles only"
args="--output-coord-words --raw-text"
outfile="$OUT_COORD_WORDS_FILE"

elif [ "$step" = coord-words-untok ]; then

action="Generating raw text, coord articles only, untokenized"
args="--output-coord-words --raw-text --no-tokenize"
outfile="$OUT_COORD_WORDS_UNTOK_FILE"

elif [ "$step" = all-words ]; then

action="Generating raw text, all articles"
args="--output-all-words --raw-text"
outfile="$OUT_ALL_WORDS_FILE"

elif [ "$step" = all-words-untok ]; then

action="Generating raw text, all articles, untokenized"
args="--output-all-words --raw-text --no-tokenize"
outfile="$OUT_ALL_WORDS_UNTOK_FILE"

else
echo "Unrecognized step $step"

fi

if [ -z "$action" ]; then
  : # do nothing
elif [ "$NUM_SIMULTANEOUS" -eq 1 -o -z "$outfile" -o "$cansplit" = "no" ]; then

  # Operate in non-split mode
  echo "$action ..."
  if [ -n "$outfile" ]; then
    cmd="bzcat $OUT_DUMP_FILE | $PROCESSWIKI $args $OTHEROPTS > $outfile"
    echo "Executing at `date`: $cmd"
    bzcat $OUT_DUMP_FILE | $PROCESSWIKI $args $OTHEROPTS > $outfile
    echo "Ended at `date`: $cmd"
  else
    cmd="bzcat $OUT_DUMP_FILE | $PROCESSWIKI $args $OTHEROPTS"
    echo "Executing at `date`: $cmd"
    bzcat $OUT_DUMP_FILE | $PROCESSWIKI $args $OTHEROPTS
    echo "Ended at `date`: $cmd"
  fi
  echo "$action ... done."

else

  echo "$action ..."
  echo "  ... operating in divide-and-conquer mode!"

  # Operate in split mode (aka divide-and-conquer mode).  Assumes that
  # we previously split the dump using the 'split-dump' step, and that
  # the action is amenable to this kind of processing (basically, it
  # simply outputs some data for each input article).  We run on each
  # split simultaneously (to the limit of NUM_SIMULTANEOUS), then
  # concatenate the results.
  numleft="$NUM_SIMULTANEOUS"
  numrun=0
  i=0
  splits=""
  splits_removable=""
  while [ "$i" -lt "$NUM_SPLITS" ]; do
    SPLITFILE="$SPLIT_PREFIX.$i"
    if [ ! -e "$SPLITFILE" ]; then
      echo "Error: Can't find split file $SPLITFILE" >&2
      exit 1
    fi
    SPLITARTS="$SPLITFILE.articles"
    echo "$action, split #$i ..."
    if [ "$numleft" -gt 0 ]; then
      split_outfile="$outfile.split-processwiki.$i"
      splits="$splits $split_outfile"
      splits_removable="$splits_removable $split_outfile"
      cat_args="$SPLIT_PREFIX.prolog $SPLITFILE $SPLIT_PREFIX.epilog"
      cmd="cat $cat_args | $PROCESSWIKI $args $OTHEROPTS > $split_outfile &"
      echo "Executing at `date`: $cmd"
      cat $cat_args | $PROCESSWIKI $args $OTHEROPTS > $split_outfile &
      echo "Ended at at `date`: $cmd"
      numleft=`expr $numleft - 1`
      numrun=`expr $numrun + 1`
    fi
    if [ "$numleft" -eq 0 ]; then
      echo "Waiting for $numrun processes to finish..."
      wait
      echo "Ended at `date`: Waiting."
      numleft="$NUM_SIMULTANEOUS"
      numrun=0
    fi
    i=`expr $i + 1`
  done
  if [ "$numrun" -gt 0 ]; then
    echo "Waiting for $numrun processes to finish..."
    wait
    echo "Ended at `date`: Waiting."
    numrun=0
  fi
  echo "$action, combining the files ..."
  all_files="$splits"
  echo "$action, concatenating all files ($all_files) ..."
  cmd="cat $all_files > $outfile"
  echo "Executing at `date`: $cmd"
  cat $all_files > $outfile
  echo "Ended at `date`: $cmd"
  echo "$action, removing intermediate split files ($splits_removable) ..."
  rm -f $splits_removable
  echo "$action ... done."

fi

done