migrate file names of stored documents to MD5 content hashes

Issue #1686 open
Robert Jäschke created an issue

When a user uploads a PDF document to a post, it is stored on disk under a file name that is a random (!) MD5 hash (see FileUtil.getRandomFileHash()). Instead, it would be better to use the MD5 hash of the file's content as the file name (this hash is already stored in the database!). That way, exactly one instance of every document would be stored on our server (instead of one instance per user).
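A minimal sketch of how a content-based file name could be derived, assuming plain JDK classes only. The class `ContentHashNames` and its `md5Hex` methods are hypothetical names for illustration and are not part of the existing FileUtil:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper (not existing BibSonomy code): derives a file name
// from the MD5 hash of a document's content instead of a random hash.
final class ContentHashNames {

    private static MessageDigest newMd5() {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform
            throw new IllegalStateException(e);
        }
    }

    private static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Hash of an in-memory document.
    static String md5Hex(byte[] content) {
        return toHex(newMd5().digest(content));
    }

    // Hash of a stream, read in chunks so large PDFs are not loaded at once.
    static String md5Hex(InputStream in) throws IOException {
        MessageDigest md = newMd5();
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            md.update(buffer, 0, read);
        }
        return toHex(md.digest());
    }

    // Hash of a file on disk.
    static String md5Hex(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return md5Hex(in);
        }
    }
}
```

Two uploads with identical bytes yield the same name, so the second upload could reuse the file that is already on disk.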

benefits:
- uses less disk space
- previews of documents that have already been uploaded would be immediately available
- other tools (e.g., an importer) could skip the upload of files that BibSonomy already knows

drawbacks:
- please let me know! (privacy should not be a problem, since nobody knows who has uploaded a document before)

what must also be changed:
- filenames of the generated QR code PDFs (there must now be a different file for each user; we could append the user's numeric ID)
- filenames of the preview images (which are generated by a separate Perl/Shell script)
- be careful with other files that are stored in the directory structure (IIRC JabRef layouts and maybe user photos); maybe we need to change their code, too.

what must be done:
- a migration script must be written (difficult for QR code files, but we could just delete and regenerate them)
- update database table (the old filename column could possibly be deleted)
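The migration step could be sketched roughly as follows. This is only an illustration under several assumptions: documents lie in one flat directory, the migrated copies go into a separate, already existing target directory (so old random names can never clash with new content-hash names), and the returned map of old name to new name is what a later database update would consume. `DocumentMigrationSketch` and `migrate` are hypothetical names. QR code PDFs and preview images are not handled here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical migration sketch, not existing BibSonomy code.
final class DocumentMigrationSketch {

    static String md5Hex(Path file) throws IOException {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            // readAllBytes keeps the sketch short; stream large files instead
            byte[] digest = md.digest(Files.readAllBytes(file));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Copies every regular file in oldDir to newDir/<md5 of content>.
    // Files with identical content end up as a single copy.
    // Returns old file name -> new file name for the database update.
    static Map<String, String> migrate(Path oldDir, Path newDir) {
        try {
            List<Path> files;
            try (Stream<Path> listing = Files.list(oldDir)) {
                files = listing.filter(Files::isRegularFile)
                               .sorted()
                               .collect(Collectors.toList());
            }
            Map<String, String> renamed = new LinkedHashMap<>();
            for (Path old : files) {
                String hash = md5Hex(old);
                Path target = newDir.resolve(hash);
                if (!Files.exists(target)) {
                    Files.copy(old, target);
                }
                // originals stay untouched; delete them after verification
                renamed.put(old.getFileName().toString(), hash);
            }
            return renamed;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Copying instead of renaming in place keeps the old files available until the database update has been verified, at the cost of temporarily needing more disk space.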

Before starting the implementation, look for all changes that must be made and report them here. It is important not to miss anything; otherwise we can screw things up!

Comments (2)

  1. Daniel Zoller

    I prefer to keep the old filename (aka hash column) because in 2006 the document hash was used as part of the document request URL /documents/<FILEHASH>. To correlate these requests with the corresponding publication, we need this column.

    My suggestion is to rename the column to old_hash. New documents could leave this column blank.
