[s3boto] MemoryError: out of memory

Issue #69 new
Former user created an issue

I do not think it is a good idea to load the entire file into memory like the following code from the s3boto backend does:

{{{
#!python
@property
def file(self):
    if self._file is None:
        self._file = StringIO()
        if 'r' in self._mode:
            self._is_dirty = False
            self.key.get_contents_to_file(self._file)
            self._file.seek(0)
    return self._file
}}}

I think it would be better to either use a temporary file on disk or stream the file as it is read. Reading the whole file into memory was causing out of memory errors on my ec2 micro instance until I did some dirty hacks to temporarily get around the issue.

Comments (17)

  1. BeProud

    This will really depend on what you want to do with the file.

    Django pretty much assumes that the file object returned by open() is a Django File object and supports all the methods that Django's File object supports. In order to do that, it's hard not to load the file into memory.

    One way to get around it is to subclass the S3BotoStorage and override the _open() method so that it returns a File object that does exactly what you want.

  2. BeProud

    BTW, I'm not really in a position to look at this bug in depth any time soon. However, I would be willing to review and accept patches that would improve this.

  3. Former user Account Deleted

    I will try to make time to work on it then.

    I think it will be simpler to implement a local temporary file for large keys. I am thinking about adding two config options: one to specify the max in-memory file size, and the other to set the temp file path. Then, if file.size is larger than the configured maximum for in-memory files, the contents would be written to disk instead of memory and removed when file.close() is called.

    Have any ideas that would make it better?
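    A minimal sketch of that policy, standard library only. The names MAX_MEMORY_SIZE and TEMP_DIR stand in for the two proposed settings and are purely illustrative, not actual django-storages settings:

```python
import io
import tempfile

MAX_MEMORY_SIZE = 10 * 1024 * 1024  # hypothetical setting: 10 MB cap for in-memory files
TEMP_DIR = None                     # hypothetical setting: None means the system temp dir

def buffer_for(size):
    """Return an in-memory buffer for small keys, or a disk-backed
    temporary file (deleted automatically on close) for large ones."""
    if size <= MAX_MEMORY_SIZE:
        return io.BytesIO()
    return tempfile.TemporaryFile(dir=TEMP_DIR)
```

    The storage backend would call something like buffer_for(key.size) before get_contents_to_file(), so a multi-gigabyte key never has to fit in RAM.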

  4. BeProud

    The SpooledTemporaryFile is unfortunately something that was added in Python 2.6, so it would be necessary to copy the class from the Python 2.6 tempfile module, or integrate the functionality of SpooledTemporaryFile into the S3BotoStorageFile class.

    I would like to go for the latter: SpooledTemporaryFile depends on nothing but StringIO and TemporaryFile, but using it directly would add another layer of indirection to an API that already wraps file objects multiple times.
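    For reference, this is the behavior being discussed, shown with the standard-library class on a Python that ships it: writes stay in memory until max_size is exceeded, then the data is transparently rolled over to a real temporary file on disk.

```python
from tempfile import SpooledTemporaryFile

# 2048 bytes written against a 1024-byte ceiling: the buffer starts
# in memory and silently spools to disk when the limit is crossed.
with SpooledTemporaryFile(max_size=1024) as buf:
    buf.write(b"a" * 2048)  # exceeds max_size, so it rolls over to disk
    buf.seek(0)
    data = buf.read()       # readers never notice the switch
```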

  5. BeProud

    The proposed fix using SpooledTemporaryFile could still transfer the whole object and create large temporary files on disk. Ideally there would be a way to get a file object that could simply be chunked and streamed to the client in order to avoid both memory and disk costs.

    Maybe something like adding a new S3BotoStreamingFile and a new setting to set the File object class? Obviously things like seek() and "rw" mode would not work with this class. I hate to add another setting though.

    In any case the SpooledTemporaryFile functionality is better than StringIO so I think it will get put in here at some point.

  6. Gordon NA

    An S3BotoStreamingFile would be ideal. But I haven't had time to work on it, and SpooledTemporaryFile worked better than StringIO by itself for now. I would imagine that in most scenarios there will be less memory available than disk space... but yes, that could potentially be problematic.

    I think that a streaming file class could be made to function just like a regular file object, though. For reads, at least, it shouldn't be very difficult, because the boto library supports byte-range requests. You would just need to keep track of the current file position.

    I am currently using the following for streaming large files from s3 through my ec2 instances:

    {{{
    #!python
    if isinstance(file, S3BotoStorageFile):
        headers = {"Range": "bytes=%d-%d" % (start, end)}
        # Open the key with the Range header so only the requested
        # bytes are fetched (the original snippet built headers but
        # never passed them anywhere), then stream the chunks.
        file.key.open_read(headers=headers)
        for chunk in file.key:
            yield chunk
    }}}
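    The position-tracking idea above can be sketched without boto at all. FakeKey stands in for a boto Key; its read_range() method is hypothetical and mimics what an HTTP Range request against S3 returns, so only the bytes actually read ever leave the backend:

```python
class FakeKey:
    """Stand-in for a boto Key; read_range() is a hypothetical method
    mimicking a GET with an inclusive "Range: bytes=start-end" header."""
    def __init__(self, data):
        self.data = data
        self.size = len(data)

    def read_range(self, start, end):
        return self.data[start:end + 1]

class RangeStreamingFile:
    """File-like sequential reads backed by byte-range requests; only
    the current position is kept locally, never the whole object."""
    def __init__(self, key):
        self.key = key
        self.pos = 0

    def read(self, size=-1):
        if size < 0:
            size = self.key.size - self.pos
        if size <= 0 or self.pos >= self.key.size:
            return b""
        chunk = self.key.read_range(self.pos, self.pos + size - 1)
        self.pos += len(chunk)
        return chunk

    def seek(self, offset):
        self.pos = offset
```

    Write support and "rw" mode are a different story, as noted earlier in the thread, but read-only streaming like this covers the serve-a-large-file case.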
  7. Ian Lewis

    Essentially, the best thing to do if you want streaming is to return the boto Key object wrapped in a Django File object, since boto's Key is a file-like object and can be streamed directly to the client. If we add a setting for the returned file class, adding an option to simply return the Key might be another thing to think about.

    Though you might run into problems if you are using ConditionalGetMiddleware, GZipMiddleware, or CommonMiddleware, since they peek at the response content in order to gzip it or compute the content length. That would cause the entire file to be loaded into the response object in memory anyway.

    Anyway, I think the current issue can be solved by using SpooledTemporaryFile. I wanted to use the functionality with boto Key's get_contents_to_file(), so I copied SpooledTemporaryFile from Python 2.7 (to support 2.5). Check out this changeset in my fork and see if it works for you. I'll probably merge this later if it seems good.

  8. Gordon NA

    I just saw this update. I will swap over from my fork and try this next week.

    I think the code should be modified slightly to attempt to import SpooledTemporaryFile from the system's Python installation before using the compatibility fallback.
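    The suggested ordering is the usual try/except import pattern: prefer the stdlib class when the running interpreter provides it, and fall back to the bundled copy only on older Pythons. The fallback module path below is hypothetical:

```python
try:
    # Python 2.6+ ships SpooledTemporaryFile in the stdlib
    from tempfile import SpooledTemporaryFile
except ImportError:
    # hypothetical location for the copied compatibility class
    from storages.compat import SpooledTemporaryFile
```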

    I think for readability it would be a good idea to move stuff like this into a utility submodule, but that is just my preference.

    Thanks for the update.

  9. Luis Alberto Santana

    Is this issue resolved? I'm getting a MemoryError on an EC2 micro instance when trying to upload a big file.

  10. Gordon NA

    I just encountered this again when I updated to master from a fork.

    It doesn't look like the patch made it in. I have updated it so there are no conflicts with master. I will create a pull request soon. I am having some issues with memory due to large files again now that I've updated to master. I want to make sure they are all gone before I create the pull request.
