   __    _   _ __
  / /__ (_) (_) /_
 /  '_// / / /  '_/
/_/\_\/_/_/ /_/\_\
       |___/
 ___ ___ ___  ___
/ -_) -_) _ \(_-<
\__/\__/_//_/___/

A Tool for Researchers

Kijkeens is a tool for researchers who want to collect social media data and save it for further analysis. It is designed to run as a background job on a server, continuously polling the platform API for new posts of interest. Posts are either stored in a database or handed off to a queue for delayed processing.

As of version 0.2.0, kijkeens works with Twitter only. Instagram support was dropped after changes to the Instagram platform, effective 1 June 2016, barred noncommercial uses of the API.

Sample Workflow

Jobs are defined in configuration files. Here's an example of a configuration file for a Twitter tracking job:

database: "postgresql://user:password@localhost/tweets"
table: "myfavoritehashtag"
queue: "twitter-myfavoritehashtag"
token: "/home/user/twitter_tokens"
query: "#myfavoritehashtag"

A few things to note:

  • Only Postgres backends are handled at this point.
  • Databases must already exist, but tables are created by kijkeens. Do not specify an existing table: letting kijkeens create the table ensures it is initialized with the correct schema.
  • The token key can either point to a directory that contains API token files, or to a specific token file. Like configuration files, token files use YAML format (see below).
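The two forms the token key can take (a directory of token files, or one specific file) can be illustrated with a short Python sketch. This is not kijkeens's actual code: find_token_files is a hypothetical helper, and the choice to skip subdirectories and sort the results is an assumption made for the example.

```python
import os


def find_token_files(token_path):
    """Return a sorted list of candidate token files for a `token` setting.

    Hypothetical helper: it accepts either a single file or a directory,
    mirroring the two cases described in the notes above.
    """
    if os.path.isdir(token_path):
        entries = sorted(os.listdir(token_path))
        paths = [os.path.join(token_path, name) for name in entries]
        # Keep regular files only; subdirectories are skipped (an assumption).
        return [p for p in paths if os.path.isfile(p)]
    if os.path.isfile(token_path):
        return [token_path]
    raise ValueError("token path does not exist: %s" % token_path)
```

Called with "/home/user/twitter_tokens" from the sample configuration, such a helper would return every file in that directory; called with the path of a single token file, it would return a one-element list.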

Invoking kijkeens like this will track your query (in this case the hashtag #myfavoritehashtag) using the Streaming API:

$ kijkeens twitter myfavoritehashtag.yaml

(Add the -s flag if you'd like to search retrospectively using the Search API instead.)

When you store tweets the moment they are published, you miss metadata that only accumulates later, such as the number of retweets and favorites received. You also end up storing tweets that users delete shortly after posting. For both reasons you might want to delay storing tweets in your database table for some time. kijkeens helps you do so:

$ kijkeens twitter --queue=6hours myfavoritehashtag.yaml

This will place the IDs of the tweets that match your query in a queue. You will then have to run a "worker" to fetch the tweets after the specified delay period has elapsed and store them in your database table:

$ kijkeens twitter worker myfavoritehashtag.yaml
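The queue-then-worker pattern can be sketched in a few lines of Python. This is an in-memory illustration only, not kijkeens's implementation: parse_delay and DelayQueue are hypothetical names, the accepted units are guessed from the "6hours" and "1day" examples in this document, and a real deployment would keep the queue in an external store rather than in a Python list.

```python
import heapq
import re
from datetime import timedelta

# Assumed unit names; this document only shows "6hours" and "1day".
_UNITS = {"minute": "minutes", "hour": "hours", "day": "days"}


def parse_delay(spec):
    """Turn a string such as "6hours" or "1day" into a timedelta."""
    match = re.match(r"^(\d+)\s*(minute|hour|day)s?$", spec)
    if match is None:
        raise ValueError("unrecognized delay: %r" % spec)
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


class DelayQueue(object):
    """In-memory stand-in for the queue that holds tweet IDs."""

    def __init__(self, delay):
        self.delay = delay
        self._heap = []

    def put(self, tweet_id, seen_at):
        # Each ID becomes due once the delay has elapsed.
        heapq.heappush(self._heap, (seen_at + self.delay, tweet_id))

    def pop_due(self, now):
        """Return the IDs whose delay has elapsed, earliest due first."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[1])
        return due
```

In this sketch the tracking job would call put() for every matching tweet ID, and the worker would periodically call pop_due() with the current time, fetch the returned tweets from the API, and store them in the database table.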

Installing

You must have Python (2.7 or 3.x), PostgreSQL (9.4+) and Redis installed. On Debian GNU/Linux, you can install these by running the following command as root:

# apt-get install python2.7 python-pip postgresql redis-server

To install the latest version of kijkeens, run this:

$ virtualenv -p $(which python2.7) kijkeens_env
$ source kijkeens_env/bin/activate
$ pip install https://bitbucket.org/jboy1/kijkeens/get/master.tar.gz

This will install kijkeens along with all Python library dependencies in a virtual environment.

Python 3 Compatibility

As of 0.3.1, everything should work when using Python 3.

Token Files

Template for Twitter tokens:

api_key: xxxxxxxxxxxxxxxxxxxxxxxxx
api_secret: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
token: xxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
token_secret: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
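Before starting a long-running job, it can help to check that a token file defines all four keys from the template. The sketch below assumes the file has already been parsed into a dictionary by a YAML library of your choice; missing_token_keys is a hypothetical helper and not part of kijkeens.

```python
# The four keys shown in the token template above.
REQUIRED_KEYS = ("api_key", "api_secret", "token", "token_secret")


def missing_token_keys(token):
    """Return the required keys that are absent or empty in a token dict."""
    return [key for key in REQUIRED_KEYS if not token.get(key)]


if __name__ == "__main__":
    # Placeholder values, not real credentials.
    example = {"api_key": "xxx", "api_secret": "xxx", "token": "xxx"}
    print(missing_token_keys(example))
```

An empty list means the token dictionary has a non-empty value for every key in the template.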

Ensuring Continuous Execution

Use supervisord. Example configuration:

[program:twitter_tracking]
command=kijkeens twitter -v --queue=1day myjob.yaml
directory=/home/user/data_collection
autorestart=true
user=user

You can then run supervisorctl to check the status of your jobs.

Disclaimer

Because it deals with a moving target (social media platform APIs), this software breaks often and requires frequent updates. The author currently uses it successfully for research, but your mileage may vary.