kijkeens
A Tool for Researchers
Kijkeens is a tool for researchers who want to collect social media data and save it for further analysis. It is designed to run as a background job on a server, continuously polling the platform API for new posts of interest. Posts are either stored in a database or handed off to a queue for delayed processing.
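The core loop described above can be sketched as a single poll step that dispatches each fetched post either to direct storage or to the queue. The function names and callback signature below are illustrative, not kijkeens's actual API:

```python
def poll_once(fetch, store, enqueue, delay_storage):
    """Fetch new posts and either store them directly or hand them to a queue.

    fetch:   callable returning an iterable of posts (hypothetical)
    store:   callable that writes one post to the database (hypothetical)
    enqueue: callable that queues one post for delayed processing (hypothetical)
    """
    count = 0
    for post in fetch():
        if delay_storage:
            enqueue(post)   # hand off for delayed processing
        else:
            store(post)     # write straight to the database
        count += 1
    return count
```

A background job would simply call a step like this in a loop, sleeping between polls.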
As of version 0.2.0, kijkeens works for Twitter only. Instagram support has been dropped following the changes to the Instagram platform, effective 1 June 2016, that bar noncommercial use of the API.
Sample Work Flow
Jobs are defined in configuration files. Here's an example of a configuration file for a Twitter tracking job:
```yaml
database: "postgresql://user:password@localhost/tweets"
table: "myfavoritehashtag"
queue: "twitter-myfavoritehashtag"
token: "/home/user/twitter_tokens"
query: "#myfavoritehashtag"
```
A few things to note:
- Only PostgreSQL backends are handled at this point.
- Databases have to already exist, but tables are created by kijkeens. In fact, you should not specify existing tables, to ensure tables are initialized with the correct schema.
- The token key can either point to a directory that contains API token files, or to a specific token file. Like configuration files, token files use YAML format (see below).
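For illustration, here is a minimal sketch of how such a flat configuration file and the token path might be handled. The parser is a stand-in for a real YAML library (so the example stays self-contained), and the function names are hypothetical, not part of kijkeens:

```python
import os

def parse_flat_yaml(text):
    """Parse simple 'key: "value"' lines (a stand-in for a real YAML parser)."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        config[key.strip()] = value.strip().strip('"')
    return config

def resolve_token_files(token_path):
    """The token key may name a single token file or a directory of them."""
    if os.path.isdir(token_path):
        return [os.path.join(token_path, name)
                for name in sorted(os.listdir(token_path))]
    return [token_path]
```

Note that `partition` splits on the first colon only, so URLs such as the PostgreSQL connection string survive intact.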
Running kijkeens like this will track your query (in this case the hashtag #myfavoritehashtag) using the Streaming API:
$ kijkeens twitter myfavoritehashtag.yaml
Pass the -s flag if you'd like to search retrospectively using the Search API instead.
When you store tweets immediately as they are published, you miss some metadata
you might be interested in, such as the number of retweets and favorites
received. You will also store tweets that may no longer be active sometime
later, because users may have opted to delete them shortly after posting. For
this reason you might want to delay storing tweets in your database table for a while. kijkeens helps you do so:
$ kijkeens twitter --queue=6hours myfavoritehashtag.yaml
This will place the IDs of the tweets that match your query in a queue. You will then have to run a "worker" to fetch the tweets after the specified delay period has elapsed and store them in your database table:
$ kijkeens twitter worker myfavoritehashtag.yaml
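The delay-queue idea can be illustrated in memory. kijkeens itself uses Redis for this; the class below is a simplified, self-contained stand-in, not the actual implementation:

```python
import heapq
import time

class DelayQueue:
    """In-memory sketch of the delayed queue: tweet IDs become available
    only after their delay period has elapsed."""

    def __init__(self, delay_seconds):
        self.delay = delay_seconds
        self._heap = []  # (release_time, tweet_id) pairs

    def put(self, tweet_id, now=None):
        """Queue a tweet ID, to be released after the configured delay."""
        now = time.time() if now is None else now
        heapq.heappush(self._heap, (now + self.delay, tweet_id))

    def get_ready(self, now=None):
        """Pop every tweet ID whose delay period has elapsed."""
        now = time.time() if now is None else now
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[1])
        return ready
```

A worker in this sketch would periodically call `get_ready()`, re-fetch each released tweet from the API (picking up retweet and favorite counts, and skipping deleted tweets), and write the result to the database table.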
Installation

You must have Python (2.7 or 3.x), PostgreSQL (9.4+), and Redis installed. On Debian GNU/Linux, you can install these by running the following command:
# apt-get install python2.7 python-pip postgresql redis-server
To install the latest version of kijkeens, run this:
```shell
$ virtualenv -p $(which python2.7) kijkeens_env
$ source kijkeens_env/bin/activate
$ pip install https://bitbucket.org/jboy1/kijkeens/get/master.tar.gz
```
This will install kijkeens along with all Python library dependencies in a Python virtual environment.
Python 3 Compatibility
As of 0.3.1, everything should work when using Python 3.
Template for Twitter tokens:
```yaml
api_key: xxxxxxxxxxxxxxxxxxxxxxxxx
api_secret: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
token: xxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
token_secret: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```
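A token file carries the four credentials of an OAuth 1.0a Twitter credential set. A sketch of validating a parsed token file might look like this (the helper name is hypothetical):

```python
REQUIRED_KEYS = {"api_key", "api_secret", "token", "token_secret"}

def validate_token_data(data):
    """Check that a parsed token file carries all four OAuth credentials."""
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError("token file is missing: " + ", ".join(sorted(missing)))
    return data
```

Failing fast on an incomplete token file is preferable to discovering the problem when the Twitter API rejects the first request.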
Ensuring Continuous Execution
To keep jobs running continuously, you can use a process manager such as supervisord. Example configuration:
```ini
[program:twitter_tracking]
command=kijkeens twitter -v --queue=1day myjob.yaml
directory=/home/user/data_collection
autorestart=true
user=user
```
You can then run supervisorctl to check the status of your jobs.
Because it deals with a moving target (social media platform APIs), this software breaks often and requires frequent updates. It is currently being used successfully for research by the author, but your mileage may vary.