A *very* hacky attempt to grab tweets associated with a hashtag.

Some assumptions:

  - the number of tweets is low, so there is no need to poll
    frequently or worry too much about writing quick code or
    using the streaming API

  - there is a background process running that can run the search

  - dump to ASCII files rather than have a more-robust system such as
    a database, although I'm tempted to try storing them in a Riak
    or some other key-value store instance just because ...

  - there is a rather cavalier approach to error handling, which can
    be summarised as "ignore it and try again later"

There are also additional tools for dealing with the output. The
original versions of these work on the JSON output from grabtweets but
newer versions are being introduced that convert the JSON to RDF and
then work on the RDF. Some of these are now obsolete. I have also
moved to storing the RDF graph in an external store with a SPARQL
interface, using that interface to form queries, rather than using
the Swish module.

# The workflow is something like:

- grabtweets

Create the search*.json files. This also creates the search.log and
search.state files; the latter is used to restart the search.
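The restart behaviour can be sketched as follows. The names and the state-file format here are illustrative, not the tool's actual ones; the idea is simply that the highest tweet id seen so far is persisted and fed back as the `since_id` search parameter on the next run:

```python
# Sketch of the since_id restart logic (hypothetical; the real
# search.state format may differ).

def read_state(text):
    """Parse the previously saved state: the highest tweet id seen."""
    text = text.strip()
    return int(text) if text else None

def search_params(tag, since_id=None):
    """Build the query parameters for the next poll of the search API."""
    params = "q=%23" + tag
    if since_id is not None:
        params += "&since_id=%d" % since_id
    return params

def new_state(old, ids):
    """After a run, the state becomes the largest tweet id seen."""
    ids = list(ids)
    if not ids:
        return old
    top = max(ids)
    return top if old is None else max(old, top)
```

Since the number of tweets is assumed low, a fresh search with `since_id` on each invocation is enough; no streaming connection is needed.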

- tweetstordf

Takes all the search*.json files in the working directory and writes
out the RDF version to a file.

The output will contain blank nodes if there are any tweets identified
as being in response to a user (since we don't necessarily have, at
this stage, the information needed to create the URI that represents
the Tweeter). Is this statement still true?

As of the AAS 219 meeting (version or thereabouts), we now
use the include_entities option in the search API which provides
much, if not all, of the ancillary information previously added
via addusers.
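To sketch why the blank nodes appear: when serialising a reply we may know only the screen name of the target, not the numeric id needed to build a stable user URI. The URI scheme and function below are hypothetical, not the tool's actual vocabulary:

```python
# Illustrative only: serialise a reply target as a URI when the
# numeric user id is known, and as a blank node otherwise.  The
# URI scheme here is made up for the example.

def reply_node(user_id, screen_name):
    if user_id is not None:
        return "<http://twitter.com/user/%d>" % user_id
    return "_:reply_%s" % screen_name
```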

- identifyusers

Queries Twitter for information on users mentioned within a Tweet
- via the @username syntax - for which we have no information
(real name and proper capitalisation of the Twitter handle).

Also looks for users who are missing one of sioc:id, sioc:name,
and foaf:name predicates and tries to fill them in.

The output is an RDF graph of the new information.

The search is done over all named graphs in the SPARQL store
that are of type

(the default graph is used only to find the named graphs).
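A query along these lines can locate users with a missing predicate. The SIOC and FOAF prefixes are the standard namespaces, but the graph-selection clause is elided and the exact shape of the real query is an assumption:

```python
# Sketch of the "missing predicate" part of the identifyusers query
# (the graph-selection clause is elided; the real query differs).
MISSING_NAME_QUERY = """\
PREFIX sioc: <http://rdfs.org/sioc/ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?user WHERE {
  ?user sioc:id ?id .
  OPTIONAL { ?user foaf:name ?name }
  FILTER (!bound(?name))
}
"""
```

The `OPTIONAL`/`!bound` pattern selects users that have a `sioc:id` but no `foaf:name`; the same shape works for the other two predicates.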

- findretweets

Identify those tweets which are actually retweets (the REST API
does not provide any such information and the Streaming API only
matches some tweets that begin with "RT "). This is a multi-step
process (i.e. the code is run multiple times in different modes
to identify matches).
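One of the matching modes can be sketched as a prefix check on the tweet text. This is a simplification for illustration; the real multi-step process also matches the extracted text against known tweets:

```python
import re

# Simplified "manual retweet" matcher: treat "RT @user: text" (or
# "RT @user text") as a retweet and pull out the user and the text.
RT_PATTERN = re.compile(r"^RT @(\w+)[:\s]\s*(.*)$")

def retweet_of(text):
    m = RT_PATTERN.match(text)
    return (m.group(1), m.group(2)) if m else None
```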

- getuserconnections

Given one or more graphs, create a set of

  tweeter references count

lines, which indicate that the given tweeter mentions the user called
references via an explicit "@references" in the text of the Tweet,
where count is the number of times this occurs.

It can also output the data in JSON format as used by d3.js's
force-directed display code.

This has been updated to use a SPARQL query to do all the heavy
lifting.
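The counting itself can be sketched in a few lines. This version works on raw tweet text purely for illustration; as noted above, the tool now does the work with a SPARQL query against the RDF store:

```python
from collections import Counter
import re

MENTION = re.compile(r"@(\w+)")

def connections(tweets):
    """Count (tweeter, mentioned-user) pairs from (who, text) tuples,
    yielding sorted (tweeter, references, count) triples."""
    counts = Counter()
    for who, text in tweets:
        for m in MENTION.findall(text):
            counts[(who, m)] += 1
    return sorted((a, b, n) for (a, b), n in counts.items())
```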

- counttweets

Return a list of

  username count

lines that list the number of tweets made by each user in the input
graph(s). It can also output in JSON, in a format usable by
my AAS219 HTML pages.
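The tally is a simple aggregation; a sketch over a list of tweeter names (the real tool derives these from the input graphs):

```python
from collections import Counter

def count_tweets(tweeters):
    """Return sorted (username, count) pairs, one tweeter name per tweet."""
    return sorted(Counter(tweeters).items())
```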

- calchistogram

Create histograms of the tweet frequency, both raw and "smoothed"
(using a kernel-density estimate), and write the data out in JSON.
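A minimal form of the kernel-density smoothing, assuming a Gaussian kernel; the kernel and bandwidth actually used by calchistogram are not specified here:

```python
import math

def kde(times, bandwidth, x):
    """Gaussian kernel-density estimate at point x over the tweet
    times (units and bandwidth choice are left to the caller)."""
    norm = len(times) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-((x - t) / bandwidth) ** 2 / 2) for t in times) / norm
```

Evaluating this on a regular grid of times gives the "smoothed" histogram; the raw histogram is just a binned count.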

- gettweets

Display the text from the tweets in time order. This is a replacement
for the gettext routine.

# The old set of commands:

These have been removed from the repository.

- gettext

Extract the tweet text from the search*.json files in the working
directory and print to stdout.

- extracttweets

Write the time, username, and text from the tweets into one file, and
the username and logo URL into a second file (the file names are
arguments to the tool). The search*.json files in the working
directory are used.

- cleantweets

Get the tweet text (the output of gettext) and clean up for simple
tag-cloud analysis. The command takes an optional number of words so
that ngrams, rather than a single word, can be used. There is also an
attempt to normalize some common words and phrases (e.g. plural <->
singular forms and people's names).
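The optional word-count argument can be sketched as a sliding window over the cleaned word list:

```python
def ngrams(words, n):
    """All runs of n consecutive words, as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
```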

- getusers

From the search*json files in the working directory, calculate the
number of tweets a user has made and a list of users referenced by
each tweet.  The output is to two separate files.

The user names output here are in lower case (they do not match
the canonical version of the user name), and there may be cases where
an output line has multiple versions of the same username in
ref1 .. refN.

- query

Allow a simple query to be made against one or more RDF graphs. The
query is just the simplest supported by Swish and does not contain
support for useful concepts like negation or optional keywords.

Very experimental.

# Other:

Within Gephi, the algorithm from

Robert Tarjan, Depth-First Search and Linear Graph Algorithms, in SIAM Journal on Computing 1 (2): 146–160 (1972)

is used to determine connected components (both weakly and strongly connected).

For the AAS218 data it gives

  Number of Weakly Connected Components: 3
  Number of Strongly Connected Components: 335

The weakly-connected components are not that useful since essentially all nodes
are linked to each other (so most are in the same component).
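The strongly-connected-component count can be checked outside Gephi with a short implementation of Tarjan's algorithm. A compact recursive sketch, run here on a toy mention graph rather than the AAS218 data (recursion depth makes this unsuitable for very large graphs):

```python
# Recursive Tarjan SCC: graph is a dict mapping node -> successor list.

def tarjan_scc(graph):
    index, lowlink = {}, {}
    stack, on_stack, sccs = [], set(), []

    def visit(v):
        index[v] = lowlink[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:       # v is the root of a component
            comp = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.add(w)
                if w == v:
                    break
            sccs.append(comp)

    for v in graph:
        if v not in index:
            visit(v)
    return sccs
```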

The modularity detection - from

Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, Etienne Lefebvre, Fast unfolding of communities in large networks, in Journal of Statistical Mechanics: Theory and Experiment 2008 (10), P10008

- looks like it may be more useful. With randomize on

  Modularity: 0.581
  Number of Communities: 14