Databases accessed by multiple processes

Issue #98 new
andrew_peterson repo owner created an issue

We're getting into domains where we almost certainly will have multiple processes accessing the same database files (for fingerprints, etc.) simultaneously. For now, we've been using Python's "shelve" module, which is an easy (pickle-like) way of keeping data both in memory and on disk, but I believe it is not robust to multiple processes accessing the database at the same time. Less severely, I believe it essentially stores pickled objects, which are not a very durable or portable format.

We looked at some other solutions such as SQLite3 and MongoDB, but they are not altogether straightforward either. As a simpler, but perhaps slightly more cluttered, solution, we could consider just having a separate text file for each fingerprint, as in

amp-fingerprints/
    6f0b3324f6001d810afbab9f85a6ea5f
    aeaaa21e5faccc62bae94c5c48b04031
    ...

where each of those hashes is a text file with an image's fingerprint.

Then each process's Data class would behave about as it does now. That is, when a fingerprint is requested: (1) check a dictionary in memory to see if the hash exists; (2) if not, check the file structure above to see if the hash exists, and if so, read it from the file and add it to the memory dictionary; (3) if not, calculate it and add it both to the memory dictionary and as a new file in the directory above. (In all cases, return the requested item from the memory dictionary.)
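
A minimal sketch of that lookup order (the class and method names here are purely illustrative, not the actual Data API):

    import os

    class FileCache:
        """Illustrative sketch of the proposed lookup order, not the real
        Amp Data class."""

        def __init__(self, directory='amp-fingerprints'):
            self.directory = directory
            self.memory = {}  # in-memory dictionary of already-loaded items
            if not os.path.isdir(directory):
                os.makedirs(directory)

        def get(self, image_hash, calculate):
            # (1) Check the in-memory dictionary first.
            if image_hash in self.memory:
                return self.memory[image_hash]
            # (2) Check the per-hash text file; another process may have written it.
            path = os.path.join(self.directory, image_hash)
            if os.path.exists(path):
                with open(path) as f:
                    self.memory[image_hash] = f.read()
                return self.memory[image_hash]
            # (3) Otherwise calculate, then store in memory and as a new file.
            self.memory[image_hash] = calculate()
            with open(path, 'w') as f:
                f.write(self.memory[image_hash])
            return self.memory[image_hash]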

This would create a lot of files, but I don't know if that is inherently any more memory- or process-intensive than creating a database, and they'd be neatly organized in a folder. This is essentially what happens with ASE's bundle trajectory.

@akhorshi, What do you think?

Comments (5)

  1. Alireza Khorshidi

@andrewpeterson I have not worked with ASE's bundle trajectory before. Does it create a folder with many trajectories? I.e., if we have 1000 images, does the folder have 1000 files?

In a previous discussion, we talked about parallelization at the image level or the atom level (which I guess can speed up calculation of fingerprint derivatives). At the current stage we parallelize over images, in which case each text file in the folder is written by only one process. However, if we later decide to parallelize over atoms (and not images), this method would still require multiple processes writing to a single text file.

  2. andrew_peterson reporter

    I tried bundle trajectory, and it creates ~3 files per image (plus some other miscellaneous files), so roughly 3000 files for 1000 images.

Yeah, I guess if we parallelize over atoms we have a few other issues to work through. But I'm not sure we'll prioritize that anytime soon. So I may take this approach to rewrite Data if and when we start having issues with multiple processes.

  3. andrew_peterson reporter

    The sqlitedict approach didn't seem to be working much better than the shelve approach, so I went with a pure document-style approach that saves each item (e.g., a fingerprint) as a file whose filename is the image hash. It is implemented now as FileDatabase in commit 8f6fea5 and a few subsequent commits. This works just fine when there are many processes going, at least so far in my tests.

    The only drawback is that this creates a folder with perhaps thousands of individual files in it; this can take up a lot of disk space and also leaves a lot of files for indexing systems (like Dropbox or a virus scanner) to go through. To solve this, I have made a utility called amp-compress that you can run on the command line. It takes all the individual files and adds them to a file called archive.tar.gz that contains the identical information. The FileDatabase class will continue to work fine, in both read and write mode, when all or some of the entries have been compressed in this manner. For now, you have to tell amp-compress which specific files to compress, but we could also easily change it so that it recursively compresses all data files below the current directory, for example.
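
    As a rough sketch of how a read can fall back from a loose file to the archive (this is illustrative only, assumes entries are stored in the tar under their bare hash names, and is not the actual FileDatabase code):

        import os
        import tarfile

        def read_entry(directory, image_hash):
            """Look for a loose file first, then fall back to archive.tar.gz
            if the entry has already been compressed."""
            path = os.path.join(directory, image_hash)
            if os.path.exists(path):
                with open(path) as f:
                    return f.read()
            archive = os.path.join(directory, 'archive.tar.gz')
            if os.path.exists(archive):
                with tarfile.open(archive, 'r:gz') as tar:
                    if image_hash in tar.getnames():
                        return tar.extractfile(image_hash).read().decode('utf-8')
            raise KeyError(image_hash)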

    It could be possible to automatically make the archive.tar.gz file at the end of a process, but I don't know how safe that would be if multiple processes are running. We could eventually implement that. Perhaps we can even intelligently guess when it's needed: if the user doesn't specify a dblabel, then we presume theirs is the only process using that database and we automatically compress; if they do specify a dblabel, then we leave it alone.
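
    That guess might look something like the following (purely hypothetical names; nothing like this exists in the code yet):

        def maybe_autocompress(dblabel, compress):
            """Hypothetical heuristic: if the user did not give a dblabel,
            presume ours is the only process using the database and compress
            automatically; otherwise leave the loose files alone."""
            if dblabel is None:
                compress()  # e.g., whatever routine amp-compress itself calls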

    I am leaving this ticket open for now as I need to write documentation for this: both how the format works and how to use amp-compress. I will delay doing that until we're confident this format is working.

  4. andrew_peterson reporter

    I have now documented this, as it seems to be functional. Commit f5fe91f.

    However, I'm still leaving this ticket open as amp-compress needs to be made easier to use as described here.

  5. andrew_peterson reporter

    The recursive option has now been added; just the dblabel part needs to be addressed. We should probably wait until after the release for this, so it can be tested in use for a while.
