
Easy to Use Twitter Crawler

This is an easy-to-use Twitter crawler.

Setting up

Make a copy of the tokens.config.template file and name it tokens.config. Then enter each key on its own line, following the format of the existing lines.
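The authoritative key names are those in tokens.config.template; purely as an illustration (the names below are placeholders, not necessarily the ones the template uses), the finished file ends up looking something like this:

# placeholder key names - check tokens.config.template for the real ones
consumerKey=...
consumerSecret=...
accessToken=...
accessTokenSecret=...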

Create a folder data/ next to bin/.
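Starting from the repository root, the setup steps above boil down to:

cp tokens.config.template tokens.config
mkdir data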

Compile

mvn compile assembly:single

Run streaming

./bin/stream.sh streaming.txt data 10
  • streaming.txt contains one keyword per line. The streamer is currently hard-coded to accept only English, German, and Turkish Tweets.
  • The output is written to a file that is generated in the data folder.
  • 10 means the streamer runs for 10 minutes. I suggest running the streamer for close to 23*60 minutes (or similar) with a cron job that starts every day at a specific time (see the example below).
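As a sketch of the suggested cron setup (the repository path is an assumption, adjust it to your installation), a crontab entry that starts the streamer every night at 00:30 and lets it run for 23*60 = 1380 minutes could look like this:

# hypothetical crontab entry: stream daily at 00:30 for 1380 minutes
30 0 * * * cd /path/to/twitter-crawler && ./bin/stream.sh streaming.txt data 1380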

Run the crawler via the REST API

Please check ./bin/crawl.sh as well. The easiest way to start, however, is:

./bin/crawl-by-offset.sh terms.txt -1 20
  • terms.txt is the list of terms; please check the format of the example in this repository. The second column contains the actual search terms (an illustrative sketch follows this list).
  • -1 means the crawler fetches all Tweets posted yesterday; -2 would mean the day before yesterday, and so on. As far as I know, you cannot go back more than 7 days.
  • 20 means that 20 pages of Tweets are crawled.
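The example terms.txt in the repository defines the real format; as a purely hypothetical sketch (the column separator and the contents of the first column are not specified here), a file whose second column holds the search terms might look like:

# hypothetical layout - see the example file in the repository for the real format
1   climate
2   election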