PRELOAD breaks things like exists()

Issue #106 new
idangazit created an issue

When working with S3Boto as the storage backend for something like django staticfiles, a common workflow breaks when you set AWS_PRELOAD_METADATA = True.

Using something like CachedFilesMixin, collectstatic has the following steps:

Upload files that are present locally and missing on S3

iterate over every file in S3 and post_process it

If you use AWS_PRELOAD_METADATA and have a file that isn't present on S3, you'll get an exception during post_processing, because exists() will return false based on the preloaded data.

Basically, file saves should invalidate / update preload data, so subsequent calls see the changes.
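A minimal self-contained sketch of that idea (illustrative only: `PreloadingStorage` and its attributes are hypothetical stand-ins for S3BotoStorage's preloaded `entries` dict, not the real boto-backed code):

```python
# Sketch of the proposed fix: saving a file updates the preloaded-metadata
# cache, so a later exists() call sees the new file instead of a stale miss.

class PreloadingStorage:
    def __init__(self, remote_files):
        self._remote = dict(remote_files)   # stands in for the S3 bucket
        self._entries = dict(remote_files)  # preloaded metadata snapshot

    def exists(self, name):
        # With AWS_PRELOAD_METADATA = True, only the snapshot is consulted.
        return name in self._entries

    def save(self, name, content):
        self._remote[name] = content
        # The fix: keep the preload cache in sync instead of letting it go
        # stale, so post_process does not hit a false negative.
        self._entries[name] = content


storage = PreloadingStorage({"css/base.css": b"..."})
storage.save("css/new.css", b"body{}")
print(storage.exists("css/new.css"))  # True: cache was updated on save
```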

Comments (8)

  1. Ian Lewis

    I think preload metadata was designed for cases where the files don't change, so the workaround would be to simply turn it off. Invalidating the cache would be OK, but if you write at all often it wouldn't make much sense. Updating seems like it would be a bit much to implement and get right.

    What kind of use case do you have where you need to preload the file list and still be able to update it?

  2. idangazit reporter

    As I wrote above, tools like django-staticfiles with CachedFilesMixin are my scenario. Preloading is a *major* performance improvement: even for the smallest project it can shave minutes off collectstatic, because it avoids making one API call per file in the first (copy-if-missing) phase.

    Without preload, I pay that performance penalty a second time when post_process runs: each file is iterated over once again, resulting in an additional API hit per file.

    This isn't an esoteric scenario. Anybody using django-staticfiles who wants their static media hosted on S3 will be using django-storages and will run into this issue. The CachedFilesMixin is coming to Django in 1.4 and that will only make this setup more popular.

    Just sayin' — at the least, a documentation note might be appropriate.

  3. Bouke Haarsma

    This problem affects not only collectstatic, but every scenario where files can be added to S3 after the entries have been preloaded. In our case this caused serious performance problems (an outage), as a thumbnail plugin falsely regenerated all thumbnails over and over again for newly uploaded images. Switching preloading off was also not an option, due to the large number of round-trips to S3.

    We worked around this bug by first checking the preloaded entries, with a full round-trip to S3 as the fallback. The preloaded-entries cache is invalidated once the round-trip shows that the file did in fact exist on S3, so a subsequent check rebuilds the cache and each thumbnail is generated only once. The workaround below can be used as a drop-in replacement, as it extends the normal `S3BotoStorage` backend.

    from storages.backends.s3boto import S3BotoStorage

    class S3AutoReloadPreloadStorage(S3BotoStorage):
        def exists(self, name):
            """Check whether the file exists in the cache or upstream (patched).

            By default, exists() only checks the preloaded file list when
            preloading is enabled. Files saved after the cache was built are
            missing from it, so exists() returns false negatives. This causes
            major performance issues: all image thumbnails are re-created,
            staticfiles fails to post-process CSS files that reference new
            images, etc. Not good."""
            name = self._normalize_name(self._clean_name(name))
            if self.entries and name in self.entries:
                return True
            k = self.bucket.new_key(self._encode_name(name))
            if k.exists():
                if self.entries:
                    # The cache was wrong; drop it so it is rebuilt on the
                    # next access.
                    self._entries = None
                return True
            return False
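
    A settings snippet showing how such a drop-in subclass could be wired up (the module path `myapp.storage` is hypothetical; point it at wherever the subclass is defined):

```python
# settings.py -- module path 'myapp.storage' is hypothetical
DEFAULT_FILE_STORAGE = 'myapp.storage.S3AutoReloadPreloadStorage'  # media files
STATICFILES_STORAGE = 'myapp.storage.S3AutoReloadPreloadStorage'   # collectstatic
AWS_PRELOAD_METADATA = True
```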
  4. Matt Fisher

    The issue is not limited to adding files; it also occurs when deleting. (This aspect isn't fixed by Bouke's S3AutoReloadPreloadStorage.)

    >>> from django.core.files.storage import default_storage as ds
    >>> ds.preload_metadata
    True
    >>> f = ds.open('storage_test', 'w')
    >>> f.write('storage contents')
    >>> f.close()
    >>> ds.exists('storage_test')
    >>> ds.delete('storage_test')
    >>> ds.exists('storage_test')
    True # !!!

    There at least needs to be a documentation note, because the tests on RTD break when AWS_PRELOAD_METADATA = True.

  5. Bouke Haarsma

    Matt Fisher, you are correct that it only works for detecting false negatives, not false positives. Deleting could be fixed by overriding delete(). However, this will still fail if there are multiple caches set up. So I think the problem needs a bigger approach: a full test suite and documentation about setting up caches.

  6. Matt Fisher

    Yes, overriding delete() to remove the entry from entries would work in combination with your exists().

    How would you set up multiple caches in this context?
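
    A self-contained sketch of that combination (names are hypothetical; the dict here mirrors the backend's preloaded `entries`, and real code would instead override delete() on Bouke's S3AutoReloadPreloadStorage subclass):

```python
class FakePreloadStorage:
    """Stand-in for a preload-enabled backend: exists() consults only the
    preloaded entries dict, as with AWS_PRELOAD_METADATA = True."""

    def __init__(self, remote):
        self._remote = dict(remote)   # stands in for the S3 bucket
        self.entries = dict(remote)   # preloaded metadata snapshot

    def exists(self, name):
        return name in self.entries

    def delete(self, name):
        self._remote.pop(name, None)
        # The fix: evict the name from the cache too, so exists() cannot
        # return a stale True for a file that has just been deleted.
        self.entries.pop(name, None)


s = FakePreloadStorage({'storage_test': b'storage contents'})
s.delete('storage_test')
print(s.exists('storage_test'))  # False once the cache entry is evicted
```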
