Source

twools / twitter-pull / pull-tweets

Full commit
#!/usr/bin/env bash

# We need to use bash so that we can use bash's array features to properly
# handle spaces inside of arguments.

# SETUP:
#
# (1) Create any needed Twitter accounts.
# (2) Create a PRIVATE file 'private.passwords' in this directory, listing
#     the usernames and passwords of all Twitter accounts you want to use,
#     in USERNAME:PASSWORD format (i.e. one per line, with a colon separating
#     user name and password).  This file should have permissions 600 or 400.
# (3) Create a PRIVATE file 'private.usernames' in this directory, listing
#     the usernames to use for each "tweet area", in USERID:TWEETAREA:USERNAME
#     format (i.e. one per line, with colons separating the user ID, tweet area
#     and username, where the user ID is an arbitrary number).

usage()
{
  cat <<'FOO'
Usage:

  pull-tweets [-n|--dry-run] [-u ID|--user ID] [--i TIME|--pull-interval TIME] [--spritzer] [--name NAME] [--area TWEETAREA] [--track TRACKEXPR] [--no-bzip] DESTDIR

This program streams tweets from Twitter using the Streaming API
(https://dev.twitter.com/docs/streaming-apis).  It runs non-stop until
terminated, saving the tweets into a file in DESTDIR, normally bzipped
(although that can be overridden using --no-bzip).  The name of the file
includes a timestamp as well as an identifying "name" that is usually
taken from the --name or --area parameters; see below.  The program will
stream tweets for only a set period of time (determined by --pull-interval,
normally one day) before starting a new file, to make it easier to locate
tweets by time and to avoid creating overly large files.

If an error happens, this program automatically restarts, using exponential
backoff.  This type of backoff is mandated by Twitter, and in fact a program
that does not use it will get locked out until it does start using it.

Streaming from Twitter must be done in the context of a particular Twitter
user.  The user name is usually specified using --user, which specifies an
ID that is mapped to an actual user name and password using local files,
which should definitely be unreadable except by the owner (chmod 600 or
chmod 400).  (An ID is used for privacy, to avoid having the actual user name
appear in the command line, which may be visible to other users on the
system.) An alternative mechanism for locating the user exists using the
value of --area, although this may be removed at some point.  If no
user name can be located, the program will refuse to run.  See below for more
discussion of users.

In addition, some sort of other restriction on tweets must be given, either
using --spritzer, --area or --track. (This is mandated by Twitter.)


More details:

If -u or --user is given, it specifies a numeric ID of a Twitter user to be
used. (User ID's are arbitrary numbers, used for privacy, so other users on
the system don't ever see the actual user names in command-line strings.)
You must use a Twitter user when requesting data from the streaming API,
and only one stream at a time can use a given user.  If no user is given,
it will be taken from TWEETAREA, from pseudo-area "spritzer", or from any
name given using --name.  Note that the same area or name is used as the
prefix of the output files.  The mapping from user and tweet areas to stored
in the file 'private.usernames', with lines of the form
USERID:TWEETAREA:USERNAME.  This file should be unreadable except by the owner
(chmod 600).

Once the user name is found, the associated password is located by
looking in 'private.passwords', with lines of the form USERNAME:PASSWORD.
This file should *DEFINITELY* be unreadable except by the owner
(chmod 600).

If --spritzer is given, the spritzer will be used to retrieve tweets
(sample.json); otherwise the normal filter mechanism will be used
(filter.json).

If --area is given, tweets are restricted by location.  TWEETAREA is an area
of the earth containing locations; the bounding box(es) are retrieved from a
file 'TWEETAREA.locations' in the same dir as this script.

If --track is given, tweets are filtered by the presence of words in the
stream.  The format is one or more "phrases" separated by commas, where each
"phrase" is one or more words separated by spaces.  A tweet will be returned
if any phrase matches; a phrase matches if all words are in the tweet,
regardless of order and ignoring case. NOTE: If you want to include
multiple space-separated words in a phrase, you *must* URL-encode the spaces
as %20.  In fact, in general you should URL-encode punctuation, although
at least for the symbols # and @ (commonly occurring in tweets as hashtags
and reply-tos, respectively), it appears to work currently whether or not
you URL-encode them or leave them as-is.  Conversions for common symbols:

@	%40
#	%23
!	%21
$	%24
%	%25
(	%28
)	%29
/	%2F
=	%3D
?	%3F

If -n or --dry-run is given, the script will output exactly what it
would do, but not do anything. (NOTE NOTE NOTE: When doing a dry run, the
output will include the exact command run by 'curl' for diagnostic purposes,
including the user name and password.  In other circumstances, the user
name and password will be censored so that they don't appear in log files
made of the operation of this script.)

If --no-bzip is given, don't compress output using bzip2.

If -i or --pull-interval is given, it specifies the maximum time that
a single operation of Tweet-pulling will occur.  After that time,
another operation will begin, but saving to a separate file, named by
the then-current date and time.  By using this option, you can get files
containing tweets in more-or-less regularly spaced intervals of time.
Possible values for TIME are e.g. '30m' (30 minutes), '36h' (36 hours),
'2d' (2 days), '3w' (3 weeks).  Fractional values are possible.  If the
unit is unspecified, days are assumed.

DESTDIR is where to save the tweets.
FOO
  exit 1
}

if [ -z "$*" ]; then
  usage
fi

DIR="`dirname $0`"

STREAM='filter.json'

# This sets CMDOPTS to an empty array.  Capsule summary of bash arrays:
# 1. foo=(x y z) sets $foo to be an array of items.
# 2. "${foo[@]}" (quotes necessary) expands to the whole set of items in $foo,
#    with as many words as there are items in foo, with spaces embedded
#    in words handled properly.  No other way handles spaces properly (e.g.
#    leaving the quotes out or using * in place of @).
# 3. foo=("${foo[@]}" q r) adds q and r to $foo while properly preserving
#    previous elements, including spaces (quotes necessary).
# 4. Just plain $foo expands only to the first element, NOT all of them.
CMDOPTS=()

# Parse options
DRYRUN=
TWEETAREA=
USERID=
NAME=
BZIP=bzip2

while true; do
  case "$1" in
    -n | --dry-run ) DRYRUN=yes ; shift ;;
    -i | --pull-interval ) PULL_INTERVAL="$2"; shift 2 ;;
    --spritzer ) STREAM='sample.json'; shift ;;
    --area ) TWEETAREA="$2"; CMDOPTS=("${CMDOPTS[@]}" -d "@$DIR/$2.locations"); shift 2 ;;
    --name ) NAME="$2"; shift 2 ;;
    -u | --user ) USERID="$2"; shift 2 ;;
    --track ) CMDOPTS=("${CMDOPTS[@]}" -d "track=$2"); shift 2 ;;
    --no-bzip ) BZIP="cat"; shift ;;
    -* ) usage ;;
    * ) break ;
  esac
done

# Convert time input as given above, with various units, into seconds.
time_to_sec() {
  time="$1"
  case "$time" in
    *s ) factor='1'          ;;
    *m ) factor='60'         ;;
    *h ) factor='60*60'      ;;
    *d ) factor='60*60*24'   ;;
    *w ) factor='60*60*24*7' ;;
    * )  factor='60*60*24'
         time="${time}d"     ;;
  esac
  if ! echo $time | perl -ne 'chop; chop; if ($_ !~ /^[0-9]+(\.[0-9]*)?$/) { exit(1); }'; then
    echo "Invalid time specification: $time."
    echo ""
    usage
  fi
  echo $time | perl -ne 'chop; chop; print int($_*'"$factor);"
}

find_field() {
  regexp="$1"
  shift
  key=$1
  file=$2
  keyname=$3
  valuename=$4
  wholeline=$5
  regexp="${regexp}${key}:"
  howmany=`grep "$regexp" $file | wc -l`
  if [ "$howmany" -eq 0 ]; then
    echo "Can't find $valuename for $keyname $key in $file" >&2
    exit 1
  fi
  if [ "$howmany" -gt 1 ]; then
    echo "Multiple entries for $keyname $key in $file" >&2
    exit 1
  fi
  if [ -n "$wholeline" ]; then
    grep "$regexp" $file
  else
    grep "$regexp" $file | sed "s/^.*://"
  fi
}

find_first() {
  find_field "^" ${1+"$@"}
}

find_second() {
  find_field "^[^.]*:" ${1+"$@"}
}

# Make sure that errors even in calls to `...` cause things to stop,
# while processing command-line arguments.
set -e

if [ -n "$NAME" ]; then
  :
elif [ -n "$TWEETAREA" ]; then
  NAME="$TWEETAREA"
elif [ "$STREAM" = "sample.json" ]; then
  NAME="spritzer"
else
  NAME="stream"
fi

PULLDIR=$1
if [ -z "$PULLDIR" ]; then
  echo "Need to specify directory to store tweets in as argument" >&2
  exit 1
fi
UNAME=
if [ -n "$USERID" ]; then
  UNAME=`find_first $USERID $DIR/private.usernames key area`
elif [ -n "$NAME" ]; then
  UNAME=`find_second $NAME $DIR/private.usernames key area`
fi
if [ -z "$UNAME" ]; then
  echo "Can't find user name" >&2
  exit 1
fi
USERPASS=`find_first $UNAME $DIR/private.passwords user password wholeline`

# Go back to normal error-handling, since curl itself may exit 1.
set +e

datesuff() {
  date '+%F.%H%M'
}

compute_prefix() {
  DATESUFF=`datesuff`
  echo $PULLDIR/$NAME.tweets.$DATESUFF
}

ORIG_PREFIX=`compute_prefix`

max_time_arg=
if [ -n "$PULL_INTERVAL" ]; then
  secs=`time_to_sec $PULL_INTERVAL`
  max_time_arg="--max-time $secs"
fi
CURL_CMD="curl --silent --show-error $max_time_arg"

ERROR_FILE="$ORIG_PREFIX.errors"

# Minimum successful run time, in seconds
MINIMUM_SUCCESSFUL_RUN_TIME=3600
# Minimum amount to delay after an error, in seconds; we implement an
# exponential back-off algorithm, doubling the delay each time until
# we run at last MINIMUM_SUCCESSFUL_RUN_TIME.
MINIMUM_DELAY_AFTER_ERROR=1
# Most recent delay, in seconds, after error
last_delay=$MINIMUM_DELAY_AFTER_ERROR
# Last start time, in seconds since Epoch
last_start_time=

{
while true; do
  echo "Logging error output to $ERROR_FILE ..."
  PREFIX=`compute_prefix`
  if [ "$BZIP" = "cat" ]; then
    TWEETS_FILE="$PREFIX"
  else
    TWEETS_FILE="$PREFIX.bz2"
  fi
  echo "Sending tweets to $TWEETS_FILE"
  echo "Beginning retrieval of tweets for area $NAME at `date` ..."
  last_start_time=`date +%s`
  cmdline_nopass=($CURL_CMD "${CMDOPTS[@]}" "https://stream.twitter.com/1/statuses/$STREAM")
  cmdline=("${cmdline_nopass[@]}" "-u$USERPASS")
  # Censor the username and password so they don't end up in log files, etc.
  cmdline_censored=("${cmdline_nopass[@]}" "-u<censored>")
  if [ -n "$DRYRUN" ]; then
    echo "${cmdline[@]} |$BZIP >> $TWEETS_FILE"
    #To test that spaces are being properly passed in
    #for x in "${cmdline[@]}"; do echo "Argument: $x"; done
  else
    echo "${cmdline_censored[@]} |$BZIP >> $TWEETS_FILE"
    "${cmdline[@]}" |$BZIP >> $TWEETS_FILE
  fi
  echo "Ending retrieval of tweets for area $NAME at `date` ..."
  last_end_time=`date +%s`
  run_length=`expr $last_end_time - $last_start_time`
  if [ $run_length -lt $MINIMUM_SUCCESSFUL_RUN_TIME ]; then
    echo "Unsuccessful run: $run_length seconds < $MINIMUM_SUCCESSFUL_RUN_TIME seconds"
    last_delay=`expr $last_delay '*' 2`
    echo "Doubling delay to $last_delay seconds"
  else
    echo "Successful run at $run_length seconds, resetting delay to $MINIMUM_DELAY_AFTER_ERROR second(s)"
    last_delay=1
  fi
  sleep $last_delay 
  echo "Trying again after having delayed $last_delay seconds ..."
done
} | tee -a $ERROR_FILE 2>&1