User-Synchronization
A framework to synchronize user accounts in BibSonomy. For a general description see Sync.
Description
The framework aims to regularly synchronize a BibSonomy user account (server) with another service (client), e.g. a Puma instance, or to synchronize parts of the account, such as bookmarks with Delicious.
Prior knowledge
- http://stackoverflow.com/questions/271610/strategy-for-offline-online-data-synchronization
- http://www.opensync.org/
- http://www.research.ibm.com/sync-msg/RC21774.pdf
- http://de.wikipedia.org/wiki/SyncML
Requirements
- Use the REST API on the BibSonomy side.
- Is an extension for changing / saving multiple posts at once necessary? (Useful in any case!)
- A general framework that enables the implementation of multiple services.
Services
This list is not complete, nor is it necessary to provide an implementation for every service. The idea is to collect possible services and to note their peculiarities. The services are ordered by importance, because the synchronization must primarily work with PUMA / BibSonomy. Mozilla Firefox would be a nice-to-have; whether it is feasible within the framework remains to be seen.
PUMA / BibSonomy
- Main application; the PDFs (documents) of the publications should be synchronized as well
JabRef
- If the REST API provides the synchronization, then any tool can use it: JabRef, but also a Puma instance.
DBLP
- There is a rather complex implementation.
- Has demanding requirements due to the large amount of data (millions of posts).
Mozilla Firefox
- An add-on that supports synchronization already exists, hence this service is not a priority for now (it may have to be implemented in JavaScript).
BibTeX-File
- ToDo: A small program to sync a BibTeX file with BibSonomy.
Synchronization Sub-Assembly
- Synchronizes a given user account (one time) with a service
- Is service-independent: the service is accessed through a well-defined interface, for which there are different implementations depending on the service.
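The service-independent design described above could look roughly like the following sketch. All names (`Post`, `SyncService`, `InMemoryService`, `push_newer`) are hypothetical and not taken from the BibSonomy code base; timestamps are plain ISO 8601 strings so that they compare correctly as text.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    # Hypothetical minimal post: a hash identifying the resource and an
    # ISO 8601 modification timestamp.
    resource_hash: str
    change_date: str


class SyncService(ABC):
    """Hypothetical interface that every service implementation provides."""

    @abstractmethod
    def list_posts(self, user: str) -> dict[str, Post]:
        """Return the user's posts, keyed by resource hash."""

    @abstractmethod
    def put_post(self, user: str, post: Post) -> None:
        """Create or update a post."""


class InMemoryService(SyncService):
    """Toy implementation, used only to exercise the interface."""

    def __init__(self) -> None:
        self._posts: dict[str, dict[str, Post]] = {}

    def list_posts(self, user: str) -> dict[str, Post]:
        return dict(self._posts.get(user, {}))

    def put_post(self, user: str, post: Post) -> None:
        self._posts.setdefault(user, {})[post.resource_hash] = post


def push_newer(user: str, source: SyncService, target: SyncService) -> int:
    """Copy posts that are missing or older on the target; return the count."""
    target_posts = target.list_posts(user)
    copied = 0
    for key, post in source.list_posts(user).items():
        other = target_posts.get(key)
        if other is None or other.change_date < post.change_date:
            target.put_post(user, post)
            copied += 1
    return copied
```

The synchronization routine only talks to the abstract interface, so a PUMA, JabRef or BibTeX-file implementation could be plugged in without changing it.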
Error-Handling
A failure occurring during the synchronization process should not corrupt the data consistency of one of the participating stores. A failed attempt of synchronization that is later restarted and completes successfully has to lead to the same state as if it had never failed.
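One possible way to meet the restart requirement, sketched here under simplifying assumptions (one-directional sync, no deletions, stores modelled as plain dictionaries), is to recompute the work list from the current state on every run and to make each write a complete upsert. The function names and the `fail_after` parameter are hypothetical and exist only to simulate a crash.

```python
def plan(server: dict[str, str], client: dict[str, str]) -> list[tuple[str, str]]:
    """Return the (key, value) upserts needed to make the client match the server."""
    return [(k, v) for k, v in sorted(server.items()) if client.get(k) != v]


def synchronize(server: dict[str, str], client: dict[str, str], fail_after=None) -> None:
    """Apply the plan; optionally raise after `fail_after` writes to simulate a crash."""
    for written, (key, value) in enumerate(plan(server, client)):
        if fail_after is not None and written >= fail_after:
            raise RuntimeError("simulated failure")
        # Each write is a self-contained upsert, so repeating it is harmless.
        client[key] = value
```

Because the plan is derived from the stores themselves rather than from remembered progress, a restarted run simply picks up whatever is still different.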
Further Tasks
Correct use of date/update_date-columns
- Currently, changing an existing post also changes the "date" column. In the future, only the "update_date" column should change.
- We must ensure that the values of the two columns are being displayed correctly in the object tree and also being read correctly (adjust SQL queries if necessary!)
- The two values must be made available via the REST API (there may be problems with existing clients - tests necessary)
- Currently posts are always sorted by the "date" column; sorting by the "update_date" column should be possible, too. This would require creating new indexes, which is extremely complicated on a production system. Furthermore, this option should be made available via the REST API or the web interface.
- Do all synchronized services have an "update" or "create" date?
Implementation for a specific service
- Ideally, the implementation for a specific service is done in parallel with the development of the framework as a whole, giving priority to the synchronization with Puma.
- The various services have very different requirements, which must be taken into account when implementing the synchronization component.
Infrastructure
- Save the user and service data.
- Regularly perform the synchronization for all users and services.
- Store log information (when and which user was synchronized with which service, how many posts were inserted / deleted / changed, how many / which errors occurred, etc.)
User-Interface
- /settings
- User must select a service and enter authentication data
- Display for user: last synchronized time (possibly an error, number of posts) and next sync time
- /admin (bsc?)
- Admins should be able to see which users are being synchronized with which service
- last synchronization time, how many posts, when will be synchronized again, etc.
- Managing the shared services and their Auth*-key.
- Users could choose the exact time of their daily sync. Attention: mind the time zones! American users, for example, will probably want to sync during their night.
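The time zone caveat can be illustrated with a small sketch: the user's preferred time is stored as a local wall-clock time plus a time zone, and the scheduler converts it to UTC. The function name `next_sync_utc` and its parameters are assumptions made for this example only.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def next_sync_utc(now_utc: datetime, local_time: time, tz: tzinfo) -> datetime:
    """Return the next occurrence of `local_time` in zone `tz`, expressed in UTC."""
    now_local = now_utc.astimezone(tz)
    candidate = datetime.combine(now_local.date(), local_time, tzinfo=tz)
    if candidate <= now_local:
        # Today's slot has already passed in the user's zone; use tomorrow's.
        next_day = now_local.date() + timedelta(days=1)
        candidate = datetime.combine(next_day, local_time, tzinfo=tz)
    return candidate.astimezone(timezone.utc)
```

A fixed offset such as `timezone(timedelta(hours=-5))` works as `tz`; a named zone from the standard `zoneinfo` module (e.g. `ZoneInfo("America/New_York")`) could be passed instead to account for daylight saving time.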
Document-Sync
- Database
  - Documents are uniquely referenced by their filename, (post) ResourceHash and username
  - A new column "change_date" must be added to the "document" table
  - Just like the "tas" implementation
  - The previous SQL query (getSyncBibTex) includes the "tas" table to find the last modified date.
  - The query is expanded to include a JOIN on "documents", so that changes to the documents of a post are considered as well
- Sync
  - Document information (PostResourceHash, username, filename) is appended to the post via "getPostDetails" and transmitted to the client
  - Adjustments to Renderer(CSL) necessary?
  - The client decides which documents it needs to download and sends the corresponding requests to the server
  - 2 options:
    - The client deletes all documents of the corresponding posts and requests all documents again
      - Simple, but massive overhead
    - The client compares the "change_date" of each document and downloads only the changed ones
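The second option (download only changed documents) might be sketched as follows. `DocumentInfo` and `documents_to_download` are hypothetical names; the key mirrors the unique reference described above (username, ResourceHash, filename).

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DocumentInfo:
    resource_hash: str
    username: str
    filename: str
    change_date: date

    @property
    def key(self) -> tuple[str, str, str]:
        # A document is identified by user, post hash and filename.
        return (self.username, self.resource_hash, self.filename)


def documents_to_download(server_docs, client_docs):
    """Return the server documents that are missing or outdated on the client."""
    local = {doc.key: doc for doc in client_docs}
    return [
        doc for doc in server_docs
        if doc.key not in local or local[doc.key].change_date < doc.change_date
    ]
```

Compared with deleting and re-requesting everything, this transfers only the documents whose "change_date" on the server is newer than the client's copy.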
Scenarios
Initial state: a correctly synced system
Server md5hash | Server filename | Server change_date | Client md5hash | Client filename | Client change_date |
---|---|---|---|---|---|
111 | doc1 | 1.1.2011 | 111 | doc1 | 1.1.2011 |
Filename changed
Server md5hash | Server filename | Server change_date | Client md5hash | Client filename | Client change_date |
---|---|---|---|---|---|
111 | doc2 | 22.1.2011 | 111 | doc1 | 1.1.2011 |
- Rename: doc1 --> doc2
Server md5hash | Server filename | Server change_date | Client md5hash | Client filename | Client change_date |
---|---|---|---|---|---|
111 | doc2 | 22.1.2011 | 111 | doc2 | 22.1.2011 |
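A rename like the one above could be recognized on the client without downloading the file again, because the md5hash is unchanged. The following sketch assumes each side is given as a mapping from filename to md5hash for a single post; `detect_renames` is a hypothetical helper.

```python
def detect_renames(server: dict[str, str], client: dict[str, str]) -> list[tuple[str, str]]:
    """Return (old_name, new_name) pairs for files renamed on the server."""
    # Client files whose name no longer exists on the server, grouped by hash.
    orphaned: dict[str, list[str]] = {}
    for name, md5 in sorted(client.items()):
        if name not in server:
            orphaned.setdefault(md5, []).append(name)

    renames = []
    for new_name, md5 in sorted(server.items()):
        # A new server name with the same content as an orphaned client file.
        if new_name not in client and orphaned.get(md5):
            renames.append((orphaned[md5].pop(0), new_name))
    return renames
```

If the old name still exists on the server, the new name is treated as an additional upload rather than a rename.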
File changed on server- and client-side, name stays the same
Server md5hash | Server filename | Server change_date | Client md5hash | Client filename | Client change_date |
---|---|---|---|---|---|
333 | doc1 | 26.1.2011 | 222 | doc2 | 22.1.2011 |
- Problem:
- Which file will be synced?
- (Suboptimal) solution:
- The side with the latest "change_date" wins; the changes on the other side are lost
Server md5hash | Server filename | Server change_date | Client md5hash | Client filename | Client change_date |
---|---|---|---|---|---|
333 | doc1 | 26.1.2011 | 333 | doc1 | 26.1.2011 |
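The "latest change_date wins" rule can be written down in a few lines. This is only a sketch: each side is a hypothetical `(md5hash, filename, change_date)` tuple with dates in the `d.m.yyyy` notation used in the tables, and ties are resolved in favour of the server, which is an assumption not stated above.

```python
from datetime import date, datetime


def parse_change_date(text: str) -> date:
    """Parse a date written as day.month.year, e.g. "26.1.2011"."""
    return datetime.strptime(text, "%d.%m.%Y").date()


def resolve_conflict(server: tuple[str, str, str], client: tuple[str, str, str]):
    """Return "server" or "client" for the winning side, or None if in sync."""
    if server[0] == client[0]:
        return None  # identical content, nothing to resolve
    if parse_change_date(client[2]) > parse_change_date(server[2]):
        return "client"
    return "server"  # assumption: the server also wins on equal dates
```

For the conflict shown above, `resolve_conflict(("333", "doc1", "26.1.2011"), ("222", "doc2", "22.1.2011"))` selects the server version, so the client's changes are overwritten.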
File was uploaded again on the server side, but under a different name
Server md5hash | Server filename | Server change_date | Client md5hash | Client filename | Client change_date |
---|---|---|---|---|---|
111 | doc1 | 11.1.2011 | 111 | doc1 | 11.1.2011 |
111 | doc2 | 12.1.2011 | x | x | x |
- This is (still) possible because only the filename is unique per post
- Not critical --> the file is simply uploaded once again
- Ideally, however, md5hashes should be unique per post
Server md5hash | Server filename | Server change_date | Client md5hash | Client filename | Client change_date |
---|---|---|---|---|---|
111 | doc1 | 11.1.2011 | 111 | doc1 | 11.1.2011 |
111 | doc2 | 12.1.2011 | 111 | doc2 | 12.1.2011 |
Further Considerations / Open Questions
- How do we treat faulty posts (e.g. posts without tags)?
- We must obtain and save service IDs
- URIs as service-IDs?
- Keys for each service should be visible for each service
- We need to ensure that the client and server have the same system time and that the conversion of time zones works properly
- Problem: Changes made during a synchronization; actually it should be one big transaction
Tests Biblicious / build.puma
Currently, the synchronization between Biblicious and build.puma.uni-kassel.de can be tested. The following conditions must be met:
- Accounts on both systems
- The Biblicious user must have administrator rights; for Puma, a normal user account suffices
- The Puma account must be entered in Biblicious:
- Settings --> Synchronization tab: enter the Puma user name and the related APIKEY
Attention! Data loss occurs occasionally!