# AstroSearch

Executables to allow me to access the Twitter Streaming API and
look for a search term. At present there's nothing Astronomy specific 
in the code, apart from perhaps the implicit assumption that the 
search term you give isn't going to result in a *large* return rate,
or end up with a particularly large number of stored Tweets.

## TO DO

Given the issues I had with the tweet server during the AAS225 run -
memory issues and possible/probable loss of data - I think it's time
to retire the ACID-based astroserver/astrosearch approach and use
a standard database, such as Postgres.

Work out how to convert the RDF output of tordf into SPARQL update
format (it's not a hard conversion; just need to write something).

Work out how to include friend/follower network into the RDF output.


There is no warranty with this software. It may do stupid things, like
crash your computer or fill your hard drive.

## Building

The file oauth needs to be set up with a line containing the key and
secret that you get by registering your application with Twitter. This
is used to fill in the fields in the file during the
configure stage (very clunky).
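The exact contents are not recorded here, but something along the
following lines (this format is an assumption based on the description
above; replace the placeholders with the key and secret for your
registered application):

```
# hypothetical contents of the 'oauth' file
<consumer-key> <consumer-secret>
```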

You need the twitter-types package which is available on GitHub at

although I am using my fork, at

and I used version 0.4.0 which was downloaded from

Note that my code is now well behind the master branch of
twitter-types; I have not looked to see whether I can just move back
to the official build.

I also use my version of HaSparql-Client, which you can get from

and version is

The latest version was built using ghc versions 7.8.3 and 7.8.4, using
a cabal sandbox.

## Basics

 1 - Start the server (run in screen/tmux)

    % ./astroserver 8123 'aas225' 'aas 225' 'hackaas' 'aasviz'
    Starting server on port 8123
    Search term: aas 225
    Search term: aas225
    Search term: hackaas

  Unlike earlier versions you supply the search terms when starting the
  server, not when you start the search.

  If you are re-starting the server then you must either supply the same
  terms (in any order) or supply no terms. To add or delete terms, stop
  the search and restart the server using either the --add or --delete
  options, along with the terms.

  The data is being stored to disk in the ./tweetstore/ directory.

 2 - Start the search (run in screen/tmux)

  The use of multiple cores is so that logging can run on a separate
  thread from the search logic; probably not needed.

    % ./astrosearch 8123 +RTS -N2
    Port: 8123
    Search: aas 221
    Search: aas221
    Search: hackaas
    There are no existing tweets.
    Wed Jan 02 15:20:39 +0000 2013 time to make my #AAS221 talk slides","source":"\u003ca href=\"http:\/\/\" rel=\"nofollow\"\u003eTweetDeck\u003c\/a\u003e","

  Unlike previous versions you just see portions of the JSON rather
  than the decoded text (user name and text). The chosen sections
  should be the time and the status, but changes to the order that
  Twitter sends the fields will change this.
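  Since the field order is not guaranteed, a name-based extraction is
  more robust than a positional one. As a sketch (the sample line and
  the use of sed are illustrative; the pattern breaks on embedded
  escaped quotes):

```shell
# Hypothetical one-line JSON status; the field order does not matter here.
line='{"text":"time to make my #AAS221 talk slides","created_at":"Wed Jan 02 15:20:39 +0000 2013"}'

# Pull out a named string field with sed.
created=$(printf '%s' "$line" | sed 's/.*"created_at":"\([^"]*\)".*/\1/')
text=$(printf '%s' "$line" | sed 's/.*"text":"\([^"]*\)".*/\1/')
echo "$created | $text"
```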

  A "basic" launch daemon can be created with something like

    set out = restart.log
    touch $out
    while (1)
        echo "# `date`" >> $out
        ./astrosearch 8123 +RTS -N2
    end

  When running on OS X, I used launchctl to make sure that the search
  was re-started on error, so it was more like

    % launchctl load plist/com.dburke.astrosearch.plist


    % cat plist/com.dburke.astrosearch.plist 
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "">
    <plist version="1.0">

  where INSTALL_DIR is replaced by the location of the code/database.

  However, it is now a simple wrapper script:

    TODO: insert details of wrapper

 3 - a quick look at the database

    % ./astroquery
    Usage: astroquery <port number> args..
      where args... is one of
        size            - report size of database
        terms           - what are the search terms
        times           - display the search start/end times
        validate        - report number of unconvertable tweets
        info [[s] n]    - id,name,time,text
        show [[s] n]    - user name + text
        raw [[s] n]     - raw text
        convert [[s] n] - convert to Haskell and dump output
        json [[s] n]    - convert to JSON and dump output
        search <term>   - simple text search (tweet contents, case sensitive)
        isearch <term>  - simple text search (tweet contents, case insensitive)
        checkpoint      - create a checkpoint
        For [[s] n] arguments, no argument means all tweets, one argument
        means the last n tweets and two arguments is a subset, where the first
        argument is the first tweet to display (s=0 means the first tweet) and
        the second argument is the number of tweets.

    % ./astroquery 8123 size
    Number of tweets:    13

  and a dump of the last n tweets

    % ./astroquery 8080 show 2
    # There are 2 out of 2 valid tweets.
    #1 peterdedmonds: I suggest 2 tweet-ups at #AAS221, on Mon &amp; Tue at 5:45pm.  Anyone free at those times? We can meet at registration desk.
    #2 (RT AstroKatie of astrobetter) RT @astrobetter: Looking to share a ride to #AAS221? There's a wiki for that: &amp; a hash tag: #AASride

  The info command outputs a little more information:

    % ./astroquery 8123 info 2
    286589446676164608,242844365,kevinschawinski,Jan 02 21:47:03 +0000 2013,n/a,Need a ride from LAX to the #aas221? @astrobetter wiki has a cab sharing sign-up sheet. #aasride
    286578661438681088,460272173,peterdedmonds,Jan 02 21:04:11 +0000 2013,n/a,I suggest 2 tweet-ups at #AAS221, on Mon &amp; Tue at 5:45pm.  Anyone free at those times? We can meet at registration desk.

  and the raw command displays the full JSON package for each message.

  If you want to access a subset of tweets - e.g. the hundredth to the
  hundred-and-second - use <start number> <n tweets>, where <start
  number> uses 0 for the first tweet - e.g.

    % ./dist/build/astroquery/astroquery 8123 info 99 3
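  The arithmetic behind the two-argument form can be spelt out as
  follows (the variable names are just for illustration):

```shell
s=99   # 0-based index of the first tweet to display
n=3    # number of tweets to display

first=$((s + 1))   # 1-based position of the first tweet shown
last=$((s + n))    # 1-based position of the last tweet shown
echo "tweets $first to $last"
```

  so `info 99 3` displays the hundredth, hundred-and-first, and
  hundred-and-second tweets.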

  The times command lists the start/stop times of the search

    % ./dist/build/astroquery/astroquery 8123 times
    Search start/stop times
      start  2012-12-21 01:54:59.268744 UTC
      stop   2012-12-21 01:54:52.997153 UTC
      start  2012-12-20 21:30:53.198915 UTC

  which will hopefully provide more-accurate information about lost
  data than previously (this relies on the search program exiting in
  such a way that the stop time can be written to the database). In
  the case above I have used launchd to make sure the search is
  re-started in case of error.

  The terms option lists all the search terms, labelling those that
  are currently unused with "closed" (i.e. these have been searched on
  but have since been deleted from the search using the --delete
  option of astroserver).

  Note that astroquery now validates that each tweet matches the
  expected results - i.e. those assumed by the twitter-types package -
  and you will see a note about any items that do not match.
  [TODO: check this statement]

 4 - get avatar/profile images <not tested in a while>

  This could have been included in the search process but is, for now,
  a separate step.

    % ./avatars 8123 10
    Reading last 10 tweets.
    Looking for existing avatars from 10 tweets
    >> Trying to download 8 avatars
    Downloading to avatar-dir/KaytlynMatthews/56bd3c0c970821cdc38f8c82fc12a1e6_normal.jpeg
    Downloading to avatar-dir/saralizabeth07/8CYXNRTk_normal

  If called with only the port number then it will attempt to process
  all the tweets, otherwise it is the last n tweets in the store.

 5 - check for repeated tweets

  As a check to see if the same tweet has been reported multiple times:

    % ./validate 8123
    >> 9 tweets
    Found 9 separate tweets

  If there are repeated tweets then these get reported and a check is
  made to see if the text content is the same.

 5.1 - download the friend/follower network (run in screen/tmux)

  Finding friends and followers of users who tweet is a time-consuming
  task since the number of calls is limited (at present 15 requests
  per 15 minutes, and it can take multiple requests to process a
  single tweeter if they have a large number of friends or followers).

  This is now split up so that it can be run as a server, if required
  (i.e. so that the network results can be queried whilst the search
  is still running). The code has been written to try and handle the
  case where a lot of people suddenly start re-tweeting an account -
  e.g. a NASA tweet - by selecting users who have made a tweet before
  those that have just retweeted.

  If using a server:

    a) in one screen/tmux session

      % ./networkserver 8124

    b) in a second screen/tmux session

      % ./findnetwork 8123 8124 +RTS -N

  If running locally:

    a) in a screen/tmux session

      % ./findnetwork 8123 local +RTS -N

  Either way, it needs access to the astrosearch server; note that
  this can also be set to "local" or "none". If "local", then the
  twitter database is accessed directly; if "none" then no attempt is
  made to look at the twitter database - that is, the code will
  "drain" the friend/follower queues but will not add any more entries
  to these queues.

  The code is multi-threaded, hence the "+RTS -N" option. I don't
  think it really needs to be threaded - i.e. can be run on a single
  core - but in this case it should probably be re-compiled without
  the "-threaded" option. There are threads for logging screen
  messages [*], filling up the queue of users, processing the follower
  network, and processing the friend network (the latter two "drain"
  the queue).

  [*] most of the screen output uses the logging instance, but not the
  actual twitter calling code, which displays the URI fragment it is
  using (which is useful for checking that large networks, requiring
  multiple calls, are being processed).

  Note that the code will refuse to identify networks that cannot be
  queried within a single twitter timeout period (i.e. 15 minutes).

  The data is written to the ./networkstore/ directory.

 5.2 - query the network

  List the size or provide information on users for which the network
  query failed, or the number of people in the network is too large to
  realistically query (i.e. would take more than one 15-minute period
  to run the query), or that are in the "to do" list.

   ./networkquery local|<port> size|failed|large|todo

  The code stores user ids and not names, which makes it a bit awkward
  to manually inspect the results.

 6 - Canonical Links (run in screen/tmux)

    % ./getcanonicallinks <port>
    % ./getcanonicallinks <port> <num>
    % ./getcanonicallinks <port> <start> <num>

  Create and populate the uristore/ database used by tordf to convert
  links into a 'canonical' form. See also section 9.1.

  To view and potentially clean out this store, use the viewuristore tool:

    % ./viewuristore
    Usage: viewuristore args..
      where args... is one of
        size          - report size of database
        dump          - dump all URIs
        failed        - show those URIs which ended up failing to resolve (i.e. timeout)
        delfailed     - delete the failed URIs in the store
        match <frag>  - show all URIs (original) that contain frag
        delete <frag> - delete any URI (original) that matches the fragment
        checkpoint    - create a checkpoint
        <frag> is case sensitive
        You are asked for each URI to delete; answer y or n.

 7 - convert to RDF

  To convert all tweets

    % ./tordf 8123

  To convert the last n tweets

    % ./tordf 8123 n

  To convert a subset of tweets (start=0 is the first tweet)

    % ./tordf 8123 start num

  This will write to stdout a Turtle representation of the data. At
  present this contains no blank nodes.

  The links referenced within a tweet are checked against the uristore
  in the current working directory and - if it exists and contains a
  match - the canonical version from the store is used. To create,
  populate, and update this store use getcanonicallinks. This means
  that tordf should be run from the directory containing the uristore/
  directory; that is, the same directory used to run getcanonicallinks.
  Note: I am currently checking that using a chunked approach - that
  is, making multiple calls to tordf to process all the tweets -
  produces the same RDF graph as the all-in-one approach. It seems to
  (in general) but I have not been able to do a proper comparison due
  to issues retrieving data from the twitter store.

 8 - Start SPARQL store

  I am moving over to using StarDog (version 2.2.4 for AAS 225) since
  it is being developed whereas fourstore has limited updates, in
  particular regarding possible duplicate statements when adding data.

  *) set up for Stardog

    STARDOG_HOME should contain the stardog-license-key.bin file:

    % export STARDOG_HOME=.../stardog_data
    % export PATH=.../stardog-2.2.4/bin:$PATH

  *) set up the stardog server

  The following uses the default settings - e.g. port and passwords,
  so user=admin password=admin or user=anonymous password=anonymous.

    % stardog-admin server start
    This copy of Stardog is licensed to Doug Burke (, Astronomer
    This is a Community license
    This license does not expire.
                                          ;;                   `;`:   
      `'+',    ::                        `++                    `;:`  
     +###++,  ,#+                        `++                    .     
     ##+.,',  '#+                         ++                     +    
    ,##      ####++  ####+:   ##,++` .###+++   .####+    ####++++#    
    `##+     ####+'  ##+#++   ###++``###'+++  `###'+++  ###`,++,:     
     ####+    ##+        ++.  ##:   ###  `++  ###  `++` ##`  ++:      
      ###++,  ##+        ++,  ##`   ##;  `++  ##:   ++; ##,  ++:      
        ;+++  ##+    ####++,  ##`   ##:  `++  ##:   ++' ;##'#++       
         ;++  ##+   ###  ++,  ##`   ##'  `++  ##;   ++:  ####+        
    ,.   +++  ##+   ##:  ++,  ##`   ###  `++  ###  .++  '#;           
    ,####++'  +##++ ###+#+++` ##`   :####+++  `####++'  ;####++`      
    `####+;    ##++  ###+,++` ##`    ;###:++   `###+;   `###++++      
                                                        ##   `++      
                                                       .##   ;++      
    Stardog server 2.2.4 started on Sat Jan 10 15:10:56 EST 2015.
    Stardog server is listening on all network interfaces.
    SNARL server available at snarl://localhost:5820.
    HTTP server available at http://localhost:5820.

  And then to monitor the server:

    % tail -f /home/naridge/code/stardog_data/stardog.log

  *) set up the database 

    This can be done using the server, at http://localhost:5820/
    or on the command-line using the stardog command:

    % stardog-admin db create -n aas225

    Change aas225 to the database name.

  *) Add data to the database: command line

    % stardog data add -g urn:aas225-streaming aas225 file1.ttl ... fileN.ttl

    Note that in Stardog 2.2.4 the import will fail if there is a
    statement including the objects "false, true". The simple fix is
    to manually edit the turtle file to say "false , true" (or perhaps
    to convert to NTriples). This is a bug in the underlying code; see

  *) Add data to the database: SPARQL update

    This requires re-writing the turtle files into SPARQL update format:

      a) change the '@prefix a: <b> .' lines to 'prefix a: <b>'
      b) before the statements, insert 'INSERT DATA { GRAPH <graph-uri> {'
         (with the appropriate graph name)
      c) append to the end '}}'
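    The three steps above can be sketched in a few lines of shell (the
    example file and graph name are placeholders; the sed pattern
    assumes simple one-line '@prefix' declarations):

```shell
# A tiny illustrative Turtle file.
cat > example.ttl <<'EOF'
@prefix sioc: <http://rdfs.org/sioc/ns#> .
<urn:tweet1> a sioc:Post .
EOF

graph="urn:aas225-streaming"
{
  # a) '@prefix a: <b> .' -> 'prefix a: <b>'
  sed -n 's/^@prefix \(.*\) \.$/prefix \1/p' example.ttl
  # b) open the INSERT DATA block for the named graph
  echo "INSERT DATA { GRAPH <$graph> {"
  # c) the statements themselves, then close both braces
  grep -v '^@prefix' example.ttl
  echo "}}"
} > example.ru
```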

    Note that in Stardog 2.2.4 the import will fail if there are strings
    containing certain Unicode content (it is not entirely clear what the
    trigger is); this turns out to be an issue with the underlying code: see

    So I used the command-line version.

 9 - Add to SPARQL store

  Note the named graph storing the data:

    % cat > metadata.ttl
    <urn:aas225-streaming> a <> .

  Note that this metadata document will increase in size later. Now add
  it to the store:

    % stardog data add -g urn:aas225-streaming aas225 metadata.ttl

 9.1 - Create the RDF (incrementally)

  To process the whole store, use

    % ./bin/tordf 8123 > aas.ttl

  but this will become *very* slow once the number of tweets gets large
  (above 10000 or so). The alternative is to use an incremental approach,
  which can also be used to deal with the canonical links

    % ./bin/tordf 8123 0 1000 > aas.0.ttl
    % ./bin/getcanonicallinks 8123 0 1000
    % ./bin/tordf 8123 1000 1000 > aas.1.ttl
    % ./bin/getcanonicallinks 8123 1000 1000
    % ./bin/tordf 8123 2000 1000 > aas.2.ttl
    % ./bin/getcanonicallinks 8123 2000 1000

  and then add these files; note that this doesn't account for the "last"
  file, which will almost-certainly be incomplete and so needs to be
  handled differently in a loop.
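  That loop can be sketched in sh as a dry run that just prints the
  commands for each chunk (the total count is a placeholder; in practice
  it would come from './astroquery 8123 size'):

```shell
ntot=3000    # hypothetical total number of tweets
chunk=1000
i=0
start=0
while [ "$start" -lt "$ntot" ]; do
    # replace the echos with the real calls to process each chunk
    echo "./bin/tordf 8123 $start $chunk > aas.$i.ttl"
    echo "./bin/getcanonicallinks 8123 $start $chunk"
    start=$((start + chunk))
    i=$((i + 1))
done
```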

  The data can be added (note that Stardog does not have the same issues
  as fourstore v1.1.5 with adding duplicate statements) using
  the command-line tool or via SPARQL update (which will require
  editing the turtle file): for Stardog version 2.2.4 both approaches
  have been shown to fail with the AAS225 dataset (but this can be worked
  around for the command-line version).

    % stardog data add -g urn:aas225-streaming aas225 aas.0.ttl aas.1.ttl ...

 10 - Query database (manually)

  The hquery app allows you to write a SPARQL query and send it to the
  server: the user input occurs after the '# ?user ...' line and is
  ended with a control-d character. The query is made and the results
  displayed. Note that the Stardog query URI contains

    - user-name and password
    - the database name

    % ./hquery http://anonymous:anonymous@localhost:5820/aas225/query/
    prefix sioc: <>
    prefix sioct: <>
    prefix foaf: <>
    prefix dcterms: <>
    prefix tw: <>
    prefix rdfs: <>
    prefix lode: <>
    # ?user a sioc:UserAccount . ?tweet a sioct:MicroblogPost ; sioc:has_creator ?user .
    select distinct ?ht { [] lode:illustrate ?ht . }
    *** Running select query: HGET
    *** Results

  You can also specify a format for the output (only really useful for CONSTRUCT
  calls) - e.g. 

      NOTE: this has not been tested with Stardog

    % ./hquery http://anonymous:anonymous@localhost:5820/aas225/query/ raw turtle
    prefix sioc: <>
    prefix sioct: <>
    prefix foaf: <>
    prefix dcterms: <>
    prefix tw: <>
    prefix rdfs: <>
    prefix lode: <>
    # ?user a sioc:UserAccount . ?tweet a sioct:MicroblogPost ; sioc:has_creator ?user .
    CONSTRUCT { ?s ?p ?o } where { ?s ?x "Douglas Burke" ; ?p ?o . }
    *** Running raw query: HGET
    *** Results
    <> <> "doug_burke" .
    <> <> "Astronomer. Apparently not-so reluctant tweeter.\r\n" .
    <> <> "101775511"^^<> .
    <> <> <> .
    <> <> "177"^^<> .
    <> <> <> .
    <> <> "164"^^<> .
    <> <> "165"^^<> .
    <> <> <> .
    <> <> "en" .
    <> <> "Douglas Burke" .
    <> <> "doug_burke" .

 11 - Create data

  At this point the metadata graph should look like

    % cat metadata.ttl
    <urn:aas225-streaming> a <> .
    <urn:aas225-followers> a <> .
    <urn:aas225-retweet-simple> a <> .
    <urn:aas225-retweet-distance> a <> .
    <urn:aas225-retweet-complex> a <> .
    <urn:aas225-retweet-unknown> a <> .

  so that

    % echo "select * { GRAPH <urn:graph-metadata> { ?s ?p ?o . } }" | ./bin/hquery http://anonymous:anonymous@localhost:5820/aas225/query/
    *** Results
    <urn:aas225-retweet-distance> <> <>
    <urn:aas225-streaming> <> <>
    <urn:aas225-retweet-unknown> <> <>
    <urn:aas225-followers> <> <>
    <urn:aas225-retweet-simple> <> <>
    <urn:aas225-retweet-complex> <> <>

  First identify retweets/replies:

    % set aas = aas225
    % set stardog = http://anonymous:anonymous@localhost:5820/${aas}
    % set sparql = ${stardog}/query/
    % set update = ${stardog}/update/

    foreach m ( simple distance complex unknown )
        set g = urn%3A${aas}-retweet-$m
        echo "*** Deleting graph $g"
        curl -d "update=CLEAR+GRAPH+%3C${g}%3E" $update
    end

    foreach m ( simple distance complex unknown )
        echo "*** Finding retweets: $m"
        set o = aas.rt.${m}.ttl
        if ( -e $o ) rm $o
        ./bin/findretweets http://localhost:8001/sparql/ $m > $o
        stardog data add -g urn:${aas}-retweet-$m $aas $o
    end

  and now create the data for the web pages (*NOTE* these programs are
  designed for a small number of tweets, not large or long-running
  searches):

    % set minuser = 25
    % set mintoken1 = 50
    % set mintoken2 = 20
    % set minstats = 5
    % set minbiotoken = 10

    % ./bin/calchistogram json $sparql > ${aas}.freq.json
    % ./bin/countusertweets json $sparql $minuser > ${aas}.user-count.json
    % ./bin/getuserconnections json $sparql > ${aas}.user-conn.json
    % ./bin/tokenize json $sparql $mintoken1 $mintoken2 > ${aas}.word-count.json
    % ./bin/simplestats json $sparql $minstats > ${aas}.overview.json

    % ./bin/hquery $sparql raw myjson < sparql/list-time-userid.sparql > ${aas}.time-cumulative.json

    % set bios = ${aas}.bios
    % if ( -e $bios ) rm $bios
    % ./bin/hquery $sparql raw mytsv < sparql/dump-bios.sparql > $bios
    % ./bin/tokenizetext $bios $minbiotoken > ${aas}.bio-cloud

  Note that simplestats creates a local store (based on the Haskell
  ACID package) to store the JSON for the OEmbed representation of the
  most-popular tweets (since there's a limit to the number of requests
  that can be made). This data is placed in the simplestats/
  directory.  The tool can be run with the --noembed option which just
  returns an empty list for the twitter information (i.e. does not
  query Twitter or require the OEmbed store).

  The above code does *not* add in the friend network that the
  findfriends routine did. This code has been replaced by the
  programs networkserver and findnetwork - which find both the friend
  and follower networks of users - and networkquery, which queries
  the database. There is currently *NO* code to convert the data
  into RDF.

## GrabTweets

I have now included the GrabTweets program that was originally in a
separate repository, namely

The code now uses the v1.1 Twitter API for searching (which means it
needs authentication, just like astrosearch/callauth). This means that
the results are comparable to those returned by astrosearch, although
I have kept the output as separate JSON files rather than an Acid
Store. The conversion to RDF is done by grabtordf.

## Post search

As of version you can run many of the tools without having to
start up the server; this is useful for accessing the data for
previous searches. In this case you use the identifier "local", rather
than a port number. The tweetstore/ directory in the current directory
is used as the store, and it is *strongly* suggested that you do not
try to use the local access whilst a remote server is using the same