# AstroSearch

Executables that connect to the Twitter Streaming API and store the
tweets that match one or more search terms. At present there is nothing
astronomy-specific in the code, apart from the implicit assumption that
the search terms will not produce a *large* return rate, or end up with
a particularly large number of stored tweets.

## Building

The file oauth needs to contain a line with the key and secret that
you get by registering your application with Twitter. These values
are used to fill in the fields of the Common.hs.in file during the
configure stage (very clunky).

You need the twitter-types package which is available on GitHub at

https://github.com/himura/twitter-types.git

and I used version 0.1.20120908 which was downloaded from

https://github.com/himura/twitter-types/commit/475e5f4ee76371ef7d1d510dde12306fdd75a272

## Basics

 1 - Start the server

    % ./astroserver 8123 'aas221' 'aas 221' 'hackaas'
    Starting server on port 8123
    Search term: aas 221
    Search term: aas221
    Search term: hackaas

  Unlike earlier versions, you supply the search terms when starting the server,
  not when you start the search. The program will not let you change the search
  terms once a server has been started, even if the database is empty. To use
  different terms, either run in a different directory or delete the tweetstore
  directory.

  The data is being stored to disk in the ./tweetstore/ directory.

 2 - Start the search

    % ./astrosearch 8123
    Port: 8123
    Search: aas 221
    Search: aas221
    Search: hackaas
    There are no existing tweets.
    Wed Jan 02 15:20:39 +0000 2013 time to make my #AAS221 talk slides","source":"\u003ca href=\"http:\/\/www.tweetdeck.com\" rel=\"nofollow\"\u003eTweetDeck\u003c\/a\u003e","
    ...

  Unlike previous versions, you just see portions of the raw JSON rather
  than the decoded text (user name and text). The displayed sections
  should be the time and the status, but this will change if Twitter
  changes the order in which it sends the fields.
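
  Because the displayed fragment depends on field order, decoding the
  JSON is a more robust way to pull out the time and status. The
  following is only a sketch in Python, not part of this package: it
  assumes each raw message is a single JSON object carrying the
  `created_at`, `text` and `user.screen_name` fields, and the helper
  name `summarise` and the sample message are made up for illustration.

```python
import json

def summarise(raw):
    """Return (time, screen name, text) from one raw status message."""
    status = json.loads(raw)
    user = status.get("user", {})
    return (status.get("created_at"), user.get("screen_name"), status.get("text"))

# A cut-down, hypothetical message; real messages carry many more fields.
sample = json.dumps({
    "created_at": "Wed Jan 02 15:20:39 +0000 2013",
    "text": "time to make my #AAS221 talk slides",
    "user": {"screen_name": "example_user"},
})

print(summarise(sample))
```

  Looking fields up by name means the result does not depend on the
  order in which they arrive.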

  I use launchd (via launchctl) to make sure that the search is restarted
  on error, so in practice the search is started with

    % launchctl load plist/com.dburke.astrosearch.plist

  where

    % cat plist/com.dburke.astrosearch.plist 
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
    <plist version="1.0">
    <dict>
    <key>KeepAlive</key>
    <dict>
    <key>SuccessfulExit</key>
    <false/>
    </dict>
    <key>Label</key>
    <string>com.dburke.astrosearch</string>
    <key>ProgramArguments</key>
    <array>
    <string>INSTALL_DIR/bin/astrosearch</string>
    <string>8123</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>StandardOutPath</key>
    <string>INSTALL_DIR/log/out.astrosearch</string>
    <key>StandardErrorPath</key>
    <string>INSTALL_DIR/log/error.astrosearch</string>
    </dict>
    </plist>

  where INSTALL_DIR is replaced by the location of the code/database.
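
  Rather than editing INSTALL_DIR by hand, the plist can be generated.
  The sketch below uses Python's standard plistlib module and is an
  assumption about how one might do this, not something shipped with
  the code; the function name `make_plist` and the example path are
  hypothetical.

```python
import plistlib

def make_plist(install_dir, port=8123):
    """Build the launchd job description shown above as XML plist bytes."""
    job = {
        "KeepAlive": {"SuccessfulExit": False},
        "Label": "com.dburke.astrosearch",
        "ProgramArguments": [install_dir + "/bin/astrosearch", str(port)],
        "RunAtLoad": True,
        "StandardOutPath": install_dir + "/log/out.astrosearch",
        "StandardErrorPath": install_dir + "/log/error.astrosearch",
    }
    return plistlib.dumps(job)

# "/path/to/astrosearch" is a placeholder for the real install location.
print(make_plist("/path/to/astrosearch").decode("utf-8"))
```

  plistlib writes the keys in sorted order, so the output will not match
  the file above line for line, but the content is the same.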

 3 - A quick look at the database

    % ./astroquery
    Usage: astroquery <port number> args..
      where args... is one of
        size         - report size of database
        terms        - what are the search terms
        info [n]     - report on the last n tweets (name,text,time,...)
        show [n]     - dump n latest tweets (user name + text)
        raw [n]      - dump n latest tweets (raw text)
        validate [n] - can last n tweets be converted?
        times        - display the search start/end times
        checkpoint   - create a checkpoint

    % ./astroquery 8123 size
    Number of tweets:    13

  and a dump of the last n tweets

    % ./astroquery 8123 show 2
    # There are 2 out of 2 valid tweets.
    #1 peterdedmonds: I suggest 2 tweet-ups at #AAS221, on Mon &amp; Tue at 5:45pm.  Anyone free at those times? We can meet at registration desk.
    #2 (RT AstroKatie of astrobetter) RT @astrobetter: Looking to share a ride to #AAS221? There's a wiki for that: http://t.co/ZMqZ9c1W &amp; a hash tag: #AASride

  The info command outputs a little more information:

    % ./astroquery 8123 info 2
    286589446676164608,242844365,kevinschawinski,Jan 02 21:47:03 +0000 2013,n/a,Need a ride from LAX to the #aas221? @astrobetter wiki has a cab sharing sign-up sheet. #aasride http://t.co/xunH9Qli
    286578661438681088,460272173,peterdedmonds,Jan 02 21:04:11 +0000 2013,n/a,I suggest 2 tweet-ups at #AAS221, on Mon &amp; Tue at 5:45pm.  Anyone free at those times? We can meet at registration desk.

  and the raw command displays the full JSON package for each message.
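
  The tweet text in the info output can itself contain commas, so
  splitting a line on every comma gives the wrong number of fields. A
  minimal Python sketch that splits only on the first five commas; the
  field names are my guesses from the example output (the second column
  looks like a user id, and the meaning of the fifth column, "n/a"
  above, is not known), and `parse_info` is a hypothetical helper.

```python
def parse_info(line):
    """Split one line of 'astroquery <port> info' output into named fields."""
    fields = line.rstrip("\n").split(",", 5)
    if len(fields) != 6:
        raise ValueError("expected 6 comma-separated fields: " + line)
    # Names are assumptions; "fifth" is the column shown as n/a above.
    keys = ("tweet_id", "user_id", "user", "time", "fifth", "text")
    return dict(zip(keys, fields))

line = ("286578661438681088,460272173,peterdedmonds,Jan 02 21:04:11 +0000 2013,n/a,"
        "I suggest 2 tweet-ups at #AAS221, on Mon &amp; Tue at 5:45pm.  "
        "Anyone free at those times? We can meet at registration desk.")

record = parse_info(line)
print(record["user"], "|", record["time"], "|", record["text"])
```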

  The times command lists the start/stop times of the search

    % ./dist/build/astroquery/astroquery 8123 times
    Search start/stop times
      start  2012-12-21 01:54:59.268744 UTC
      stop   2012-12-21 01:54:52.997153 UTC
      start  2012-12-20 21:30:53.198915 UTC

  which will hopefully provide more accurate information about lost
  data than previous versions did (this relies on the search program
  exiting in such a way that the stop time can be written to the
  database). In the case above I have used launchd to make sure the
  search is restarted in case of error.
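
  The periods with no search running can be worked out from this
  listing. The Python sketch below is illustrative only: it assumes
  every time is printed with fractional seconds, as in the example, and
  it sorts the entries itself rather than relying on the order in which
  they are listed. The names `parse_times` and `find_gaps` are
  hypothetical.

```python
from datetime import datetime

# Assumed layout of the times, copied from the example listing.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f UTC"

def parse_times(listing):
    """Return a chronologically sorted list of (time, kind) pairs."""
    events = []
    for line in listing.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] in ("start", "stop"):
            when = datetime.strptime(parts[1].strip(), TIME_FORMAT)
            events.append((when, parts[0]))
    return sorted(events)

def find_gaps(events):
    """Return (stop, restart, seconds) for every stop followed by a start."""
    gaps = []
    for (t0, kind0), (t1, kind1) in zip(events, events[1:]):
        if kind0 == "stop" and kind1 == "start":
            gaps.append((t0, t1, (t1 - t0).total_seconds()))
    return gaps

listing = """Search start/stop times
  start  2012-12-21 01:54:59.268744 UTC
  stop   2012-12-21 01:54:52.997153 UTC
  start  2012-12-20 21:30:53.198915 UTC
"""

for stopped, restarted, seconds in find_gaps(parse_times(listing)):
    print("no search between", stopped, "and", restarted, "(%.1f s)" % seconds)
```

  A start with no matching stop (a crash before the stop time could be
  written) is not reported as a gap by this sketch.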

  Note that astroquery now validates that each tweet matches the
  expected results - i.e. those assumed by the twitter-types package -
  and you will see a note about any items that do not match.

 4 - Get avatar/profile images

  This could have been included in the search process, but for now it
  is a separate step.

    % ./avatars 8123 10
    Reading last 10 tweets.
    Looking for existing avatars from 10 tweets
    >> Trying to download 8 avatars
    Downloading to avatar-dir/KaytlynMatthews/56bd3c0c970821cdc38f8c82fc12a1e6_normal.jpeg
      from http://a0.twimg.com/profile_images/2995669282/56bd3c0c970821cdc38f8c82fc12a1e6_normal.jpeg
    Downloading to avatar-dir/saralizabeth07/8CYXNRTk_normal
      from http://a0.twimg.com/profile_images/2590401947/8CYXNRTk_normal
    ...

  If called with only the port number it will attempt to process all
  the tweets; otherwise it processes the last n tweets in the store.
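
  Judging from the log lines above, each avatar is saved under
  avatar-dir/<user name>/<last component of the image URL>. A small
  Python sketch of that mapping, inferred from the output only and not
  taken from the avatars source; `avatar_path` is a hypothetical name.

```python
import posixpath
from urllib.parse import urlparse

def avatar_path(screen_name, url, root="avatar-dir"):
    """Guess the local file name used for a profile image URL."""
    filename = posixpath.basename(urlparse(url).path)
    return posixpath.join(root, screen_name, filename)

print(avatar_path(
    "saralizabeth07",
    "http://a0.twimg.com/profile_images/2590401947/8CYXNRTk_normal",
))
```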

 5 - Check for repeated tweets

  As a check to see if the same tweet has been reported multiple
  times:

    % ./validate 8123
    >> 9 tweets
    Found 9 separate tweets

  If there are repeated tweets then these get reported and a check is
  made to see if the text content is the same.
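
  The idea behind the check can be sketched in a few lines of Python.
  This is not how validate is implemented; it is an illustration that
  assumes the tweets are available as (id, text) pairs, and the name
  `find_repeats` and the sample data are made up.

```python
from collections import defaultdict

def find_repeats(tweets):
    """Map each repeated tweet id to True if all its copies have the same text."""
    texts_by_id = defaultdict(list)
    for tweet_id, text in tweets:
        texts_by_id[tweet_id].append(text)
    return {
        tweet_id: len(set(texts)) == 1
        for tweet_id, texts in texts_by_id.items()
        if len(texts) > 1
    }

# Hypothetical data: id 2 is stored twice with the same text,
# id 3 is stored twice with different text.
tweets = [(1, "a"), (2, "b"), (2, "b"), (3, "c"), (3, "changed")]

for tweet_id, same_text in sorted(find_repeats(tweets).items()):
    print(tweet_id, "repeated;", "same text" if same_text else "text differs")
```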
 
 6 - Convert to RDF

  To convert all tweets

    % ./tordf 8123

  To convert the last n tweets

    % ./tordf 8123 n

  This will write to stdout a Turtle representation of the data. At
  present this contains no blank nodes.

 7 - Start SPARQL store

  Check the database doesn't exist (since the 4s-backend-setup call
  will overwrite any existing data).

    % 4s-backend-info aas221
    metadata.c:73 failed to touch metadata file file:///usr/local/var/fourstore/aas221/metadata.nt: No such file or directory
     1: 0x100007e84 <fs_metadata_open+184> 4s-backend-info
     2: 0x100001e24 <fs_backend_init+85> 4s-backend-info
     3: 0x100000b8f <main+87> 4s-backend-info
    backend.c:90 cannot read metadata file for kb aas221
     1: 0x10000201a <fs_backend_init+587> 4s-backend-info
     2: 0x100000b8f <main+87> 4s-backend-info

  Create the 4store database

    % 4s-backend-setup aas221
    4store[13165]: backend-setup.c:186 erased files for KB aas221
    4store[13165]: backend-setup.c:318 created RDF metadata for KB aas221

  This creates the data in /usr/local/var/fourstore/aas221

  Check on the database:

    % 4s-backend-info aas221
       64M	/usr/loca disk usage

  Start the server (the 4s-info command will time out if there is a
  problem contacting the server):

    % 4s-backend aas221
    % 4s-info aas221 noop
    NO-OP took 0.000062s
    % 4s-httpd -p 8001 -s -1 aas221

  Note that the "-s -1" option is to remove the soft limit for queries.

  You can now query the store using http://localhost:8001/test/, but
  this writes the results to disk rather than giving the option of
  displaying them on screen.
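
  To see results on screen without a browser, the query can be sent to
  the /sparql/ endpoint directly. A minimal Python sketch, assuming the
  endpoint accepts the query as a `query` parameter on a GET request
  (the usual SPARQL protocol form); the function names are hypothetical
  and the response is returned in whatever format the server sends by
  default.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

def query_url(endpoint, query):
    """Build the GET URL for a SPARQL query."""
    return endpoint + "?" + urlencode({"query": query})

def run_query(endpoint, query):
    """Send the query and return the response body as text (needs a running server)."""
    with urlopen(query_url(endpoint, query)) as response:
        return response.read().decode("utf-8")

print(query_url("http://localhost:8001/sparql/", "select * { ?s ?p ?o } limit 5"))
```

  With the server from this step running, calling
  `run_query("http://localhost:8001/sparql/", "select * { ?s ?p ?o } limit 5")`
  and printing the result should show the response on screen.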

  The SPARQL server can be stopped (there should be two processes) with

    % killall 4s-httpd

  The server can be run in the foreground, which may give better diagnostics:

    % 4s-httpd -D -s -1 -p 8001 aas221
    4store[30166]: httpd.c:1849 4store HTTP daemon v1.1.5 started on port 8001
    4store[30168]: httpd.c:112 couldn't open query log '/var/log/4store/query-aas221.log' for appending: No such file or directory

 8 - Add to SPARQL store

  Note the named graph storing the data:

    % cat > metadata.ttl
    <urn:aas221-streaming> a <http://purl.org/net/djburke/demo/twitter#TweetStore> .

    % curl --data-urlencode data@metadata.ttl -d 'graph=urn%3Agraph-metadata' -d 'mime-type=application/x-turtle' http://localhost:8001/data/
    200 added successfully
    This is a 4store SPARQL server v1.1.5

    % ./bin/tordf 8123 > aas.ttl

  I thought the following would only add new statements, but it seems
  to increase the number of triples even if the same file is added:

    % curl --data-urlencode data@aas.ttl -d 'graph=urn%3Aaas221-streaming' -d 'mime-type=application/x-turtle' http://localhost:8001/data/
    200 added successfully
    This is a 4store SPARQL server v1.1.5

  Instead, use a PUT (curl -T) to replace the existing graph:

    % curl -T aas.ttl -H 'Content-Type: application/x-turtle' 'http://localhost:8001/data/urn:aas221-streaming'
    201 imported successfully
    This is a 4store SPARQL server v1.1.5
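
  The same PUT can be made from a script. The Python sketch below only
  builds the request, mirroring the curl -T call above (same URL layout
  and Content-Type header); `replace_graph_request` is a hypothetical
  helper, and passing the result to `urllib.request.urlopen` would
  actually send it to a running server.

```python
from urllib.request import Request

def replace_graph_request(server, graph, turtle_bytes):
    """Build a PUT request that replaces the named graph with the given Turtle."""
    url = server.rstrip("/") + "/data/" + graph
    return Request(
        url,
        data=turtle_bytes,
        headers={"Content-Type": "application/x-turtle"},
        method="PUT",
    )

# A one-triple stand-in for the contents of aas.ttl.
turtle = b"<urn:a> <urn:b> <urn:c> .\n"
request = replace_graph_request("http://localhost:8001", "urn:aas221-streaming", turtle)
print(request.get_method(), request.full_url)
```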

 9 - Query database (manually)

  The hquery app allows you to write a SPARQL query and send it to the
  server: the user input occurs after the '# ?user ...' line and is
  ended with a control-d character. The query is made and the results
  displayed.

    % ./hquery http://localhost:8001/sparql/
    prefix sioc: <http://rdfs.org/sioc/ns#>
    prefix sioct: <http://rdfs.org/sioc/types#>
    prefix foaf: <http://xmlns.com/foaf/0.1/>
    prefix dcterms: <http://purl.org/dc/terms/>
    prefix tw: <http://purl.org/net/djburke/demo/twitter#>
    prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    prefix lode: <http://linkedevents.org/ontology/>
    
    # ?user a sioc:UserAccount . ?tweet a sioct:MicroblogPost ; sioc:has_creator ?user .
    
    select distinct ?ht { [] lode:illustrate ?ht . }
    
    *** Running select query: HGET
    *** Results
    jwst
    notreally
    zooniverse
    cs17
    adsarticle
    camphogg
    aapf13
    aas221
    kidding
    aas
    aasride
    liveandprocrastinate
    exopag7
    hubble
    hackaas
    dotastro
    adsarticleoftheday
    ***

  You can also specify a format for the output (only really useful for CONSTRUCT
  calls) - e.g.

    % ./hquery http://localhost:8001/sparql/ raw turtle
    prefix sioc: <http://rdfs.org/sioc/ns#>
    prefix sioct: <http://rdfs.org/sioc/types#>
    prefix foaf: <http://xmlns.com/foaf/0.1/>
    prefix dcterms: <http://purl.org/dc/terms/>
    prefix tw: <http://purl.org/net/djburke/demo/twitter#>
    prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    prefix lode: <http://linkedevents.org/ontology/>
    
    # ?user a sioc:UserAccount . ?tweet a sioct:MicroblogPost ; sioc:has_creator ?user .
    
    CONSTRUCT { ?s ?p ?o } where { ?s ?x "Douglas Burke" ; ?p ?o . }
    
    *** Running raw query: HGET
    *** Results
    <http://twitter.com/doug_burke> <http://rdfs.org/sioc/ns#name> "doug_burke" .
    <http://twitter.com/doug_burke> <http://xmlns.com/foaf/0.1/status> "Astronomer. Apparently not-so reluctant tweeter.\r\n" .
    <http://twitter.com/doug_burke> <http://rdfs.org/sioc/ns#id> "101775511"^^<http://www.w3.org/2001/XMLSchema#integer> .
    <http://twitter.com/doug_burke> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://rdfs.org/sioc/ns#UserAccount> .
    <http://twitter.com/doug_burke> <http://purl.org/net/djburke/demo/twitter#numFollowers> "177"^^<http://www.w3.org/2001/XMLSchema#integer> .
    <http://twitter.com/doug_burke> <http://xmlns.com/foaf/0.1/homepage> <http://hea-www.harvard.edu/~dburke/> .
    <http://twitter.com/doug_burke> <http://purl.org/net/djburke/demo/twitter#numFriends> "164"^^<http://www.w3.org/2001/XMLSchema#integer> .
    <http://twitter.com/doug_burke> <http://purl.org/net/djburke/demo/twitter#numFriends> "165"^^<http://www.w3.org/2001/XMLSchema#integer> .
    <http://twitter.com/doug_burke> <http://rdfs.org/sioc/ns#avatar> <http://a0.twimg.com/profile_images/609576508/me_normal.png> .
    <http://twitter.com/doug_burke> <http://purl.org/net/djburke/demo/twitter#langCode> "en" .
    <http://twitter.com/doug_burke> <http://xmlns.com/foaf/0.1/name> "Douglas Burke" .
    <http://twitter.com/doug_burke> <http://www.w3.org/2000/01/rdf-schema#label> "doug_burke" .
    
    ***

 10 - Create data

    % ./calchistogram json http://localhost:8001/sparql/ > ~/www/aas221/aas221.freq.json
    % ./countusertweets json http://localhost:8001/sparql/ > ~/www/aas221/aas221.user-count.json

## WARNING

There is no warranty with this software. It may do stupid things, like
crash your computer or fill your hard drive.