Image processing in a request handler

Issue #6 invalid
Dinh Pham
created an issue

Hi,

I am working on a WSGI app powered by gunicorn (gevent worker). In a specific URL handler (controller), I need to

  1. Download some images (I/O bound)
  2. Merge them together using PIL (CPU bound)

I understand that gipc is designed to handle Step 2: it converts CPU-bound operations into I/O-bound ones from the event loop's point of view.

To implement Step 1, I have the following code:

    with gevent.Timeout(seconds=2, exception=False):
        # 100 x 100 thumbnail only
        for url in user_photo_urls[:4]:
            app.logger.info('Downloading %s', url)
            pool.spawn(download_file, url.replace('p', 'p100'), tracker, normalized_room_id)
        pool.join()
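
For context, download_file is essentially something like this (a simplified sketch; the tracker bookkeeping and local paths are abbreviated):

    import os
    import urllib.request

    def download_file(url, tracker, normalized_room_id):
        # Cooperative I/O: gunicorn's gevent worker monkey-patches the
        # socket module, so urlopen() yields to the event loop while waiting.
        filename = '%s_%s' % (normalized_room_id, os.path.basename(url))
        local_path = os.path.join('/tmp', filename)
        with urllib.request.urlopen(url) as response, open(local_path, 'wb') as f:
            f.write(response.read())
        tracker.downloaded_files.append(local_path)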

I believe that pool.join() will not block the event loop. The greenlet that handles the request is blocked there, but control is returned to the event loop. Is that correct?

In Step 2 I have the following code:

    create_montage_file(montage_file_path, tracker.downloaded_files)

The function uses PIL to merge images. It is CPU bound, not I/O bound, so while it runs the event loop is blocked: no other greenlet can be scheduled until the function completes. That's not a good thing, and I want to solve it using gipc.

Basically I want to fork a process inside a greenlet and put the blocking code there. The process will be scheduled by the OS and will not prevent the greenlet from returning control to the event loop. Is that correct?

    with gipc.pipe() as (reader, writer):
        p = gipc.start_process(create_montage_file, args=(montage_file_path, tracker.downloaded_files, writer))
        ret = json.loads(reader.get())  # cooperatively blocks this greenlet only
        p.join()  # reap the child; waiting here also yields to the event loop
        print(ret['status'])

    def create_montage_file(montage_file_path, downloaded_files, writer):
        new_file_path = merge_images(montage_file_path, downloaded_files)
        # Do other things here
        writer.put(json.dumps({'path': new_file_path, 'status': 'ok'}))

I understand that gipc supports only text as a data serialization format. Do you think there is a better way to do it?

Thanks

Dinh

Comments (7)

  1. Jan-Philip Gehrcke repo owner

    I believe that pool.join() will not block the event loop. The greenlet that handles the request is blocked there, but control is returned to the event loop. Is that correct?

    Depends on what the pool actually is. If it is a greenlet pool, then yes.
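
    For instance (a minimal sketch):

        import gevent
        import gevent.pool

        pool = gevent.pool.Pool(size=10)  # a pool of greenlets, not threads
        pool.spawn(gevent.sleep, 1)       # each spawn() starts a greenlet
        pool.join()  # blocks this greenlet cooperatively; the hub keeps running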

    Basically I want to fork a process inside a greenlet and put the blocking code there. The process will be scheduled by the OS and will not prevent the greenlet from returning control to the event loop. Is that correct?

    Correct, this is the main motivation of gipc. Your application is a very good example of a situation where it absolutely makes sense to outsource the heavy computation part to another process and let two processes talk to each other gevent-cooperatively.

    I understand that gipc supports only text as a data serialization format. Do you think there is a better way to do it?

    "Text only" is wrong. You can basically pass any Python object through the pipe. The gipc documentation says (put method):

    Pickle object o and write it to the pipe. o: a pickleable Python object.

    Internally, the object is pickled (encoded), sent through the pipe, and unpickled (decoded) on the other side. This happens transparently. Just send a Python object. Of course, you should make sure that it is "small", i.e. that it only contains the minimum amount of information required for the worker to perform the job. As a general rule, the communication time (sending job information, retrieving result information) should be very small compared to the computation time.
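
    Concretely, the json round trip in your snippet is not even necessary; a plain dict can go through the pipe directly (a minimal, self-contained sketch):

        import gipc

        def child(writer):
            # Any pickleable object works; gipc pickles and unpickles it
            # transparently on both ends of the pipe.
            writer.put({'path': '/tmp/montage.jpg', 'status': 'ok'})

        if __name__ == '__main__':
            with gipc.pipe() as (reader, writer):
                p = gipc.start_process(child, args=(writer,))
                ret = reader.get()  # arrives as a dict again
                print(ret['status'])
                p.join()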

    In your case, this rule seems to be very much fulfilled, since you only send a small string containing a file path and retrieve a tiny status code.

    Hope that helps.

  2. Ivan Smirnov

    @Jan-Philip Gehrcke Thanks for your comments on this question, that definitely clears things up.

    I've recently faced a similar kind of problem, though a slightly different one:

    Same as above, there's a gevent-based WSGI server (e.g. gevent-socketio) that manages requests and posts updates back, all done via sockets.

    The difference is that now there is a complex State object for each connected client, which runs CPU-intensive computations when it receives new inputs and posts updates back via callbacks (observers). This state would obviously be too unwieldy to pickle and pass through pipes.

    What I can't figure out is a good way to perform the computations without blocking the greenlet that handles the request. In this case we can't just fork a new process from within the request greenlet, because the computation is no longer stateless (unless, of course, there's something like Redis hanging in the background and providing persistent storage for everyone, but that's another story). Should each computing process be forked upon connection, wrapped in an endless loop of blocking get()'s, with the handle/pipe saved for the entire session?
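
    In code, I imagine something roughly like this (a rough sketch; State and compute are stand-ins for the real per-client objects):

        import gipc

        class State:
            # Stand-in for the heavy per-client state discussed above.
            def __init__(self):
                self.total = 0

            def compute(self, inputs):
                self.total += inputs  # placeholder for the CPU-heavy work
                return self.total

        def worker(child_end):
            state = State()  # lives only in the child process
            while True:
                inputs = child_end.get()  # blocks until the parent sends a job
                if inputs is None:        # sentinel: session over
                    break
                child_end.put(state.compute(inputs))

        def handle_connection():
            # Fork once per connection; keep parent_end for the whole session.
            with gipc.pipe(duplex=True) as (parent_end, child_end):
                p = gipc.start_process(worker, args=(child_end,))
                for inputs in (1, 2, 3):     # stand-in for incoming updates
                    parent_end.put(inputs)
                    print(parent_end.get())  # blocks only this greenlet
                parent_end.put(None)         # shut the worker down
                p.join()

        if __name__ == '__main__':
            handle_connection()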

    Another question is -- is it possible to use some other protocol rather than Python's built-in pickling/unpickling?

    Example: a 1000x1000 pandas DataFrame can be serialized to msgpack in 9ms while pickling takes 500+ms (pickling the string containing the serialized dataframe takes just as long). That's quite a difference.

    This could be a really handy extension and, as I see it, it's quite easy to integrate: the most user-friendly option would be to add optional encoder/decoder arguments to _GIPCWriter and _GIPCReader which they would use instead of pickle.dumps/pickle.loads -- however, that would add the possibility of an exception being thrown in user code. Alternatively, a verbatim mode where you send arbitrary binary data and receive this data unharmed on the other side; the end user would then be responsible for encoding/decoding the data.
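
    In the meantime, the cheap encoding can be done in user code, so that only a flat bytes blob goes through the pipe's pickler (a minimal sketch, assuming the msgpack package; the payload is illustrative):

        import gipc
        import msgpack

        def child(writer):
            payload = {'status': 'ok', 'values': list(range(1000))}
            # Encode with msgpack in user code; gipc then only has to
            # pickle a single flat bytes object, which is cheap.
            writer.put(msgpack.packb(payload))

        if __name__ == '__main__':
            with gipc.pipe() as (reader, writer):
                p = gipc.start_process(child, args=(writer,))
                result = msgpack.unpackb(reader.get())
                p.join()
                print(result['status'])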
