BibSonomy <-> Internet Archive Connector

The goal of this project is to archive recent posts from BibSonomy in the Internet Archive.

Design Issues

Web Archive API

The Web Archive save endpoint archives a web page on request, for example https://web.archive.org/save/http://www.uni-hannover.de/

  • check how the "Accept" header influences the behavior
  • check how many redirects must be followed to archive the web page safely
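
To experiment with the two open points above, a request to the save endpoint could be sketched as follows. This is only a sketch: the function names, the default "Accept" value and the timeout are assumptions made for illustration, and nothing here is a confirmed client for the Web Archive.

```python
import urllib.request

# Prefix taken from the example URL above.
SAVE_PREFIX = "https://web.archive.org/save/"


def build_save_request(page_url, accept="text/html"):
    """Build (but do not send) a request asking the Web Archive to save page_url."""
    request = urllib.request.Request(SAVE_PREFIX + page_url)
    # Which Accept value works best is exactly the open question above;
    # "text/html" is an assumed starting point.
    request.add_header("Accept", accept)
    return request


def save_page(page_url, accept="text/html", timeout=60):
    """Send the save request and return (final URL, HTTP status).

    urllib follows redirects on its own, so a final URL that differs
    from the requested one indicates that a redirect took place.
    """
    request = build_save_request(page_url, accept)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.geturl(), response.status
```

Calling save_page with different "Accept" values and comparing the returned URL and status is one way to answer both questions empirically.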

BibSonomy Interaction

<bibsonomy stat="ok">
  <user name="dblp" realname="DBLP Mirror"
    homepage="http://dblp.uni-trier.de/" spammer="false"
    toClassify="0" href="https://www.bibsonomy.org/api/users/dblp">
    <groups start="0" end="0"/>
  </user>
</bibsonomy>
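
A user document like the one above can be read with the Python standard library. The following is a minimal sketch: the function name and the returned dictionary are hypothetical, and the value of "toClassify" is passed through unchanged because its exact meaning is not specified here.

```python
import xml.etree.ElementTree as ET

# The sample response shown above.
SAMPLE = """<bibsonomy stat="ok">
  <user name="dblp" realname="DBLP Mirror"
    homepage="http://dblp.uni-trier.de/" spammer="false"
    toClassify="0" href="https://www.bibsonomy.org/api/users/dblp">
    <groups start="0" end="0"/>
  </user>
</bibsonomy>"""


def parse_user_status(xml_text):
    """Extract the attributes relevant for the archiving decision."""
    root = ET.fromstring(xml_text)
    if root.get("stat") != "ok":
        raise ValueError("unexpected response status: %r" % root.get("stat"))
    user = root.find("user")
    if user is None:
        raise ValueError("response contains no <user> element")
    return {
        "name": user.get("name"),
        "spammer": user.get("spammer") == "true",
        # Raw attribute value; how to interpret it is left to the caller.
        "to_classify": user.get("toClassify"),
    }
```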

Identifying Posts

  • queue URLs of new (recent) posts
  • later extensions could also retroactively archive (old) posts
  • handle changed posts as new posts
  • only archive public posts

Queuing

avoid "congestion", since web archive access might be slow:

  • have a fixed size queue (FIFO?!)
  • have several threads
  • ensure that threads do not run out of control
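
The three points above could be combined as in the following sketch: a bounded FIFO queue feeds a fixed number of worker threads, and each worker catches errors so that a failing request cannot take the thread down. The function name, the parameters and the archive callable are assumptions. Note that the producer blocks here when the queue is full, which is only one possible policy.

```python
import queue
import threading


def run_workers(urls, archive, num_threads=4, max_queue=100):
    """Archive urls with a fixed number of threads fed from a bounded FIFO queue.

    archive is a hypothetical callable that archives a single URL.
    Returns a list of (url, outcome) pairs; outcome is the return value
    of archive(url) or the exception it raised.
    """
    tasks = queue.Queue(maxsize=max_queue)  # put() blocks while the queue is full
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            url = tasks.get()
            if url is None:  # sentinel: no more work for this thread
                return
            try:
                outcome = archive(url)
            except Exception as error:  # keep the thread alive on failures
                outcome = error
            with lock:
                results.append((url, outcome))

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for url in urls:
        tasks.put(url)
    for _ in threads:
        tasks.put(None)  # one sentinel per thread
    for thread in threads:
        thread.join()
    return results
```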

delayed archiving:

  • retrieve entries from the queue after a fixed (or dynamic?) delay
  • query for user status
  • only archive a URL if at least one of the users who posted it has been classified
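
The delayed decision could look like the following sketch. Entries are assumed to be (URL, timestamp, users) tuples, and is_classified is a hypothetical callable standing in for the user status query. The names and the fixed delay are illustrative only.

```python
import time


def select_for_archiving(entries, is_classified, delay_seconds, now=None):
    """Split entries into (to_archive, ignored, waiting).

    An entry is still waiting while it is younger than delay_seconds.
    Once the delay has passed, it is archived if at least one of its
    users is classified and ignored otherwise.
    """
    now = time.time() if now is None else now
    to_archive, ignored, waiting = [], [], []
    for entry in entries:
        url, timestamp, users = entry
        if now - timestamp < delay_seconds:
            waiting.append(entry)
        elif any(is_classified(user) for user in users):
            to_archive.append(entry)
        else:
            ignored.append(entry)
    return to_archive, ignored, waiting
```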

queue design:

  • fixed size
  • entries: URL, timestamp, users
  • sort by timestamp; when the queue is full, the oldest entries are discarded
  • if a URL is in the queue and another user posts it, its timestamp is updated and the user is added
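
The queue design above can be sketched with an ordered dictionary. The class and method names are hypothetical, and the sketch assumes that timestamps never decrease from one call to the next, so insertion order equals timestamp order.

```python
from collections import OrderedDict


class UrlQueue:
    """Fixed-size queue of (URL, timestamp, users) entries, oldest first."""

    def __init__(self, max_size):
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self.max_size = max_size
        self._entries = OrderedDict()  # url -> (timestamp, set of users)

    def add(self, url, user, timestamp):
        """Add a post; a known URL gets a new timestamp and an extra user."""
        if url in self._entries:
            _, users = self._entries.pop(url)
            users.add(user)
        else:
            users = {user}
        # Re-inserting places the entry at the newest end of the queue.
        self._entries[url] = (timestamp, users)
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # discard the oldest entry

    def pop_oldest(self):
        """Remove and return the oldest entry as (url, timestamp, users)."""
        url, (timestamp, users) = self._entries.popitem(last=False)
        return url, timestamp, users

    def __len__(self):
        return len(self._entries)
```

If timestamps can arrive out of order, a heap or a sorted structure keyed by timestamp would be needed instead.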

Monitoring and Control

provide statistics:

  • queue length
  • archived URLs/hour
  • ignored URLs/hour
  • use frameworks for collecting and displaying statistics
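
Until a statistics framework is chosen, the per-hour rates could be prototyped with a sliding-window counter such as the one below. This is a standard-library sketch only, and the class name and its methods are assumptions. One instance would count archived URLs and a second one ignored URLs.

```python
import time
from collections import deque


class HourlyCounter:
    """Counts events that happened within the last window_seconds."""

    def __init__(self, window_seconds=3600):
        self.window = window_seconds
        self._events = deque()  # event timestamps, oldest first

    def record(self, now=None):
        """Register one event (e.g. one archived or one ignored URL)."""
        self._events.append(time.time() if now is None else now)

    def count(self, now=None):
        """Return the number of events inside the current window."""
        now = time.time() if now is None else now
        # Drop events that have fallen out of the window.
        while self._events and self._events[0] <= now - self.window:
            self._events.popleft()
        return len(self._events)
```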