
User-Synchronization

A framework to synchronize user accounts in BibSonomy. For a general description see Sync.

Description

The aim of the framework is the regular synchronization of a BibSonomy user account (server) with another service (client), e.g. a PUMA instance, or of parts of an account, such as bookmarks, with Delicious.

Prior knowledge

Requirements

  • Use the REST API on the BibSonomy side.
  • Is an extension necessary that allows changing / saving multiple posts at once? (Useful in any case!)
  • A general framework that enables the implementation of multiple services.

Services

This list is not complete, nor is it necessary to provide an implementation for every service. The idea is to collect possible services and to note their peculiarities. The order of the services reflects their importance: the synchronization must work primarily with PUMA / BibSonomy. Mozilla Firefox would be a nice-to-have, but it remains to be seen whether it is possible within the framework.

PUMA / BibSonomy

  • Main application, should also consider the PDFs (documents) of the publications to be synchronized

JabRef

  • If the REST API provides the synchronization, any tool can use it: JabRef as well as a PUMA instance.

DBLP

  • There is a rather complex implementation.
  • Places high demands on the implementation due to the large amount of data (millions of posts).

Mozilla Firefox

  • An add-on that supports synchronization already exists, hence this service is currently not so important.
    (possibly has to be implemented in JavaScript)

BibTeX-File

  • ToDo: A small program to sync a BibTeX file with BibSonomy.

Synchronization Sub-Assembly

  • Synchronizes a given user account (one time) with a service
  • Is service-independent: the service is accessed through a well-defined interface, for which there are different implementations depending on the service.
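
The service-independent design described above could be sketched as follows. This is only an illustration: the names `SyncService`, `Post`, `InMemoryService` and `synchronize` are hypothetical and not part of BibSonomy, and the rule "newer update_date wins" is an assumption made for the example. Deletions are deliberately left out, since detecting them would require the date of the last synchronization.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Post:
    """Minimal stand-in for a post; the real model has many more fields."""
    resource_hash: str
    update_date: datetime


class SyncService(ABC):
    """Hypothetical service interface; one implementation per service."""

    @abstractmethod
    def list_posts(self, username: str) -> dict:
        """Return the user's posts, keyed by resource hash."""

    @abstractmethod
    def save_post(self, username: str, post: Post) -> None:
        """Create or update a post."""


class InMemoryService(SyncService):
    """Toy implementation, used here only to exercise the interface."""

    def __init__(self) -> None:
        self._posts = {}

    def list_posts(self, username):
        return dict(self._posts.get(username, {}))

    def save_post(self, username, post):
        self._posts.setdefault(username, {})[post.resource_hash] = post


def synchronize(username: str, server: SyncService, client: SyncService) -> None:
    """Sync one account once: copy missing posts, let the newer update_date win."""
    server_posts = server.list_posts(username)
    client_posts = client.list_posts(username)
    for key in server_posts.keys() | client_posts.keys():
        s, c = server_posts.get(key), client_posts.get(key)
        if c is None or (s is not None and s.update_date > c.update_date):
            client.save_post(username, s)
        elif s is None or c.update_date > s.update_date:
            server.save_post(username, c)
```

Because `synchronize` only talks to the abstract interface, a PUMA, JabRef or BibTeX-file implementation could be plugged in without touching the synchronization logic.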

Error-Handling

A failure occurring during the synchronization process should not corrupt the data consistency of one of the participating stores. A failed attempt of synchronization that is later restarted and completes successfully has to lead to the same state as if it had never failed.
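
One way to obtain this restart behaviour is to derive the pending operations from the current state of both stores on every run, so that a repeated run simply picks up what is still missing. The following is a minimal one-way sketch under that assumption; the stores are plain dictionaries, and `plan`, `apply_ops` and `sync_one_way` are hypothetical names, not existing BibSonomy code.

```python
from typing import Optional


def plan(source: dict, target: dict) -> list:
    """Derive the pending operations from the current state of both stores."""
    ops = [("put", k, v) for k, v in source.items()
           if k not in target or target[k] != v]
    ops += [("delete", k, None) for k in target if k not in source]
    return ops


def apply_ops(ops: list, target: dict, fail_after: Optional[int] = None) -> None:
    """Apply operations one by one; fail_after simulates a crash (testing aid)."""
    for i, (kind, key, value) in enumerate(ops):
        if fail_after is not None and i == fail_after:
            raise RuntimeError("simulated failure during synchronization")
        if kind == "put":
            target[key] = value
        else:
            target.pop(key, None)


def sync_one_way(source: dict, target: dict, fail_after: Optional[int] = None) -> None:
    """Bring target in line with source; safe to call again after a failure."""
    apply_ops(plan(source, target), target, fail_after)
```

Each single operation either happens completely or not at all, so an interrupted run leaves the target in a valid intermediate state, and a later run ends in the same state as an uninterrupted one.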

Further Tasks

Correct use of date/update_date-columns

  • Currently, changing an existing post also changes the "date" column. In the future, only the "update_date" column should change.
  • We must ensure that the values of the two columns are displayed correctly in the object tree and are read correctly (adjust SQL queries if necessary!)
  • The two values must be made available via the REST API (there may be problems with existing clients, so tests are necessary)
  • Currently, posts are always sorted by the "date" column; sorting by the "update_date" column should be possible, too. This would require creating new indexes, which is extremely complicated on a production system. Furthermore, this option should be made available via the REST API or the web interface.

  • Do all synchronized services have an "update" or "create" date?

Implementation for a specific service

  • Ideally, the implementation for a specific service is done in parallel with the development of the entire framework, giving priority to the synchronization with Puma.
  • The various services have very different requirements, which must be taken into account in particular when implementing the synchronization component.

Infrastructure

  • Save the user and service data.
  • Regularly perform the synchronization for all users and services.
  • Store log information (when and which user was synchronized with which service, how many posts have been inserted / deleted / changed, how many / which errors occurred, etc.)
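
The log information listed above could be captured in a record like the following. This is a sketch only: `SyncLogEntry` and its field names are hypothetical and do not describe an existing BibSonomy table or class.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class SyncLogEntry:
    """One synchronization run of one user with one service."""
    username: str
    service: str
    started_at: datetime
    inserted: int = 0
    deleted: int = 0
    changed: int = 0
    errors: list = field(default_factory=list)

    @property
    def succeeded(self) -> bool:
        """A run counts as successful if no error was recorded."""
        return not self.errors

    def summary(self) -> str:
        """One-line summary, e.g. for an admin overview."""
        return (
            f"{self.started_at.isoformat()} {self.username} <-> {self.service}: "
            f"+{self.inserted} -{self.deleted} ~{self.changed}, "
            f"{len(self.errors)} error(s)"
        )
```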

User-Interface

  • /settings
    • User must select a service and enter authentication data
    • Display for the user: last synchronization time (possibly an error, number of posts) and next sync time
  • /admin (bsc?)
    • Admins should be able to see which users are being synchronized with which service
    • Last synchronization time, number of posts, time of the next synchronization, etc.
    • Managing the shared services and their Auth*-key.
  • Users could choose the exact time of their daily sync. Attention: mind the time zones! American users, for example, will probably want to sync during their own night.
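
The time zone issue in the last point can be illustrated with a small sketch that converts a user's preferred local sync time into the next UTC point in time at which the scheduler would have to run. The function name `next_sync_utc` and its parameters are assumptions made for this example.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def next_sync_utc(preferred: time, user_tz: tzinfo, now_utc: datetime) -> datetime:
    """Return the next occurrence of the user's preferred local sync time, in UTC."""
    local_now = now_utc.astimezone(user_tz)
    candidate = datetime.combine(local_now.date(), preferred, tzinfo=user_tz)
    if candidate <= now_utc:
        # Today's slot has already passed in the user's zone; take tomorrow's.
        candidate = datetime.combine(
            local_now.date() + timedelta(days=1), preferred, tzinfo=user_tz
        )
    return candidate.astimezone(timezone.utc)
```

Any `tzinfo` object can be passed in. A fixed offset such as `timezone(timedelta(hours=-5))` is enough for a quick test, while `zoneinfo.ZoneInfo("America/New_York")` from the Python standard library (3.9+) would also take daylight saving time into account.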

Document-Sync

  • Database

    • Documents are uniquely referenced by their filename, (post) ResourceHash and username
    • A new column "change_date" must be added to the "document" table
      • Just like the "tas" implementation
    • The previous SQL query (getSyncBibTex) includes the "tas" table to find the last modified date.
      • The query is expanded to include a JOIN on "documents", so that changes to the documents of a post are also considered
  • Sync

    • Document information (PostResourceHash, username, filename) is appended to the post via "getPostDetails" and transmitted to the client
      • Adjustments to the Renderer (CSL) necessary?
    • The client decides which documents it needs to download and sends the corresponding requests to the server
      • 2 options:
        1. The client deletes all documents of the corresponding posts and requests all documents again (simple, but massive overhead)
        2. The client compares the "change_date" of each document and downloads only the changed ones
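
The variant in which the client compares the "change_date" of each document could look like the following sketch. `DocumentInfo` and `documents_to_download` are hypothetical names; the key (ResourceHash, username, filename) follows the unique reference described in the database notes above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DocumentInfo:
    """Document metadata as it might be transmitted along with a post."""
    resource_hash: str
    username: str
    filename: str
    change_date: date

    @property
    def key(self) -> tuple:
        """Documents are identified by post hash, user name and file name."""
        return (self.resource_hash, self.username, self.filename)


def documents_to_download(server_docs: list, client_docs: list) -> list:
    """Return the server documents that are new or newer than the client's copy."""
    local = {d.key: d for d in client_docs}
    return [
        d for d in server_docs
        if d.key not in local or d.change_date > local[d.key].change_date
    ]
```

Since the file name is part of the key, a renamed document looks like a new one in this sketch and would be downloaded again.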

Scenarios

Initial state: a correctly synced system

Server                           | Client
md5hash  filename  change_date   | md5hash  filename  change_date
111      doc1      1.1.2011      | 111      doc1      1.1.2011

Filename changed

Server                           | Client
md5hash  filename  change_date   | md5hash  filename  change_date
111      doc2      22.1.2011     | 111      doc1      1.1.2011

  • Rename: doc1 --> doc2

Server                           | Client
md5hash  filename  change_date   | md5hash  filename  change_date
111      doc2      22.1.2011     | 111      doc2      22.1.2011

File changed on server- and client-side, name stays the same

Server                           | Client
md5hash  filename  change_date   | md5hash  filename  change_date
333      doc1      26.1.2011     | 222      doc2      22.1.2011

  • Problem:
    • which file will be synced?
  • (suboptimal) solution:
    • "last change_date" wins, changes on the other side are lost

Server                           | Client
md5hash  filename  change_date   | md5hash  filename  change_date
333      doc1      26.1.2011     | 333      doc1      26.1.2011
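
The "last change_date wins" rule from this scenario can be written down as a small sketch. `DocumentVersion` and `resolve_conflict` are hypothetical names, and letting the server win on equal dates is an assumption the original does not specify.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DocumentVersion:
    """One side's view of a document, as in the scenario tables."""
    md5hash: str
    filename: str
    change_date: date


def resolve_conflict(server: DocumentVersion, client: DocumentVersion) -> tuple:
    """Return (winning side, winning version); the later change_date wins.

    On equal dates the server wins (assumption). The losing side's changes
    are discarded, which is why the rule is only a suboptimal solution.
    """
    if client.change_date > server.change_date:
        return ("client", client)
    return ("server", server)
```

With the values from the table, the server version (333, doc1, 26.1.2011) wins over the client version (222, doc2, 22.1.2011).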

File was uploaded on server side once again but with another name

Server                           | Client
md5hash  filename  change_date   | md5hash  filename  change_date
111      doc1      11.1.2011     | 111      doc1      11.1.2011
111      doc2      12.1.2011     | x        x         x

  • Is (still) possible because only the file name is unique per post
  • Uncritical --> File is uploaded once again
    • Ideally, however, md5hashes should be unique per post

Server                           | Client
md5hash  filename  change_date   | md5hash  filename  change_date
111      doc1      11.1.2011     | 111      doc1      11.1.2011
111      doc2      12.1.2011     | 111      doc2      12.1.2011

Further Considerations / Open Questions

  • How do we treat faulty posts (e.g. posts without tags)?
  • We must obtain and save service IDs
    • URIs as service-IDs?
    • Keys for each service should be visible for each service
  • We need to ensure that the client and server have the same system time and that the conversion of time zones works properly
  • Problem: changes made during a synchronization; ideally, the synchronization would be one big transaction

Tests Biblicious / build.puma

Currently, the synchronization between Biblicious and build.puma.uni-kassel.de can be tested. The following conditions must be met:

  • Accounts on both systems
  • The Biblicious user must have administrator rights; for Puma, a normal user account is sufficient
  • The Puma account must be entered in Biblicious:
    • Settings --> Synchronization tab: enter the Puma user name and the related API key

Attention! Sometimes data is lost!
