[s3boto] Skip save when content is identical to remote content

Issue #209 new
Gordon NA created an issue

It would be much nicer if saving a file could skip early if the same content already exists on S3.

Overwriting each file with collectstatic is particularly painful... especially when a project has thousands of static files.

There are several ways this could be accomplished. My initial thought is to use key metadata and store a content hash that can be checked before the file is pushed to s3.
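The check-before-push idea can be sketched independently of boto. This is a minimal illustration, not the actual backend code: `save_if_changed` and its `upload` callable are hypothetical stand-ins for the S3 metadata lookup and the real upload call, and it uses a hex MD5 digest purely for demonstration.

```python
import hashlib

def save_if_changed(remote_checksum, content, upload):
    """Upload `content` only when its MD5 differs from the stored checksum.

    `remote_checksum` is the hash previously stored as key metadata
    (or None if absent); `upload` is a callable performing the real
    push to S3. Returns the checksum to store as the new metadata.
    """
    our_checksum = "md5;%s" % hashlib.md5(content).hexdigest()
    if remote_checksum == our_checksum:
        # Identical content already exists remotely: skip the upload.
        return our_checksum
    upload(content)
    return our_checksum
```

The same shape drops into `_save_content`: read the stored checksum from key metadata, compare, and only call `set_contents_from_file` on a mismatch.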

I am not sure there is any benefit to re-uploading identical content, but if there is, maybe an in-place copy could be used instead.

S3BotoStorage._save_content seems like a good place for this to happen.

Any thoughts/comments/ideas?

Comments (2)

  1. Gordon NA reporter

    So I couldn't figure out how to access the "Content-MD5" metadata from S3. I guess it isn't stored that way, and the ETag can change depending on the file's size and how it was uploaded. I've put the following together and it seems to work well so far. It would be trivial to make this an opt-in feature.

    def _save_content(self, key, content, headers):
        # Compare the checksum stored as key metadata against the local
        # content's MD5; skip the upload when they match.
        aws_checksum = key.get_metadata('checksum')
        md5_tuple = key.compute_md5(content)
        our_checksum = "md5;%s" % md5_tuple[1]
        if aws_checksum == our_checksum:
            logger.info("Skip uploading '%s' because it already exists "
                        "with identical content remotely." % content.name)
            return
        key.set_metadata('checksum', our_checksum)
        # re-implement the super method, passing along the md5 tuple;
        # only pass backwards-incompatible arguments if they vary from
        # the default
        kwargs = {"md5": md5_tuple}
        if self.encryption:
            kwargs['encrypt_key'] = self.encryption
        key.set_contents_from_file(content, headers=headers,
                                   rewind=True, **kwargs)

    Is this something that would be merged? I think it has a much wider use than just my current project.
