
Pull request #2: A pass-through (byte) mode (Declined)

Source repository: aldanor (branch: default)
Destination repository: jgehrcke (branch: default)

Author: Ivan Smirnov
Description

More a proof of concept than real code, since it's so trivial. We might of course want to add tests and think about string encodings if this (or something like it) is to be merged. API-wise, the boolean flag could be replaced by a mode='bytes' argument, or maybe a completely separate function bytes_pipe()?
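To make the variants concrete, they might look like this (hypothetical signatures only, not actual gipc code):

    # Hypothetical API variants for a pass-through (byte) mode:
    r, w = pipe(raw=True)        # boolean flag on the existing pipe()
    r, w = pipe(mode='bytes')    # a mode argument instead of a flag
    r, w = bytes_pipe()          # a completely separate constructor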

I see it as a very useful feature, as the end user would no longer be tied to Python pickling as the one and only means of serialization -- what if you just need to pass strings (but millions of them), or numbers, or raw buffers?

P.S. I've noticed the gipc author had first started with JSON serialization himself :)

Comments (7)

  1. Jan-Philip Gehrcke repo owner

    Thanks for this suggestion. The main motivation behind such a change would be an increase in throughput performance for certain messaging scenarios, right? I once implemented this (I believe I called the argument raw) and decided against it, because pickle is so damn fast that it did not make a difference in performance. And if it makes no difference there, it's not worth introducing the switch/overhead.

    So my question for you would be: can you come up with an example showing that such a change actually makes a significant difference in performance?

  2. Ivan Smirnov author

    Sure. I think I left a comment in one of the issues that was closed; I'm not sure you've read through it.

    Example: pickling a 1000x1000 pandas DataFrame takes circa 500 ms, while converting it to msgpack takes about 9 ms. Besides, you can just dump a raw numpy memory buffer, which is, well... practically instant. Even pickling a long enough string (say, a million characters) takes an unnecessarily long time.
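    A rough self-contained illustration of the large-payload case (a sketch; timings vary by machine, and the DataFrame/msgpack numbers above are the author's own measurements):

        import pickle
        import timeit

        import numpy as np

        arr = np.random.rand(1000, 1000)  # ~8 MB of float64 data

        # Cost of pickling the array vs. just grabbing its raw buffer.
        t_pickle = timeit.timeit(lambda: pickle.dumps(arr), number=10) / 10
        t_raw = timeit.timeit(lambda: arr.tobytes(), number=10) / 10
        print(f"pickle.dumps: {t_pickle * 1e3:.1f} ms")
        print(f"raw buffer:   {t_raw * 1e3:.1f} ms")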

    That's one side of the story (pickling large / buffer-type objects). The other is pickling millions of small ones, where the flat per-message overhead of pickle adds up and hurts performance. You might want to just pass string messages (i.e. byte arrays) -- why pickle them at all? All this is even more ironic given that gevent is aimed at lightweight threads, fast switching and eliminating all kinds of overhead in general :)
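    The flat per-message cost is easy to see in isolation, too (again just an illustration of the point, not a gipc benchmark):

        import pickle
        import timeit

        msg = b"hello"  # a tiny byte-string message
        n = 1_000_000

        # Round-trip cost that a pass-through mode would avoid entirely.
        t = timeit.timeit(lambda: pickle.loads(pickle.dumps(msg)), number=n)
        print(f"pickle round-trip: {t / n * 1e6:.2f} us per message")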

    And finally, for most serious applications I guess you can safely assume the developers know exactly what they want to serialize and how they want it serialized; obviously, there are also well-established ways of serializing particular kinds of objects, be it numpy memmaps, JSON, raw buffers or whatever else does the job.

    That being said, having an easy option of just passing in a Python object so it "just works" in a safe and consistent way is definitely a must-have, and should probably remain the default.

  3. Jan-Philip Gehrcke repo owner

    In case we use multiple processes and implement IPC, I think we agree that communication overhead must always be small compared to the actual computation time; otherwise the entire approach is dubious. I see that this requirement can still be fulfilled for huge data structures such as large numpy arrays, and -- most importantly -- I entirely agree that developers themselves should be able to decide which encoding scheme is best for their type of application. You have created enough motivation; I'll work on this.

  4. Jan-Philip Gehrcke repo owner

    I want to have this in the next release and have started implementing an API where you can explicitly choose functions for encoding and decoding, for instance r, w = pipe(encoder=foo, decoder=bla). Of course these can be no-ops, so for convenience I added raw_pipe(), along the lines of what you proposed. See https://bitbucket.org/jgehrcke/gipc/commits/b87254801a0b5f839a995f450a46ee0157989f76?at=default
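    A minimal usage sketch of such an API (the encoder/decoder keyword names are taken from the line above; see the linked commit for the authoritative signature):

        import json

        import gipc

        def encode(obj):
            return json.dumps(obj).encode()

        def decode(blob):
            return json.loads(blob.decode())

        def child(reader):
            print(reader.get())  # -> {'answer': 42}

        if __name__ == "__main__":
            # JSON instead of pickle as the transport encoding.
            with gipc.pipe(encoder=encode, decoder=decode) as (r, w):
                p = gipc.start_process(target=child, args=(r,))
                w.put({"answer": 42})
                p.join()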

    If pure performance were the goal, we should save function calls and have distinct handlers for raw messaging. However, in gipc's current form, function calls will never be the bottleneck for messaging performance. Actually, the write loop in the _write method at https://bitbucket.org/jgehrcke/gipc/src/b87254801a0b5f839a995f450a46ee0157989f76/gipc/gipc.py?at=default#cl-668 currently is a severe bottleneck for sending large messages. I am not sure how to improve this significantly other than by outsourcing it to native compiled code.

  5. Jan-Philip Gehrcke repo owner

    Regarding the bottleneck in _write: this function scaled badly with increasing message size, indicating that something was not well implemented. I found the problem to be frequent copying/modification of the to-be-transmitted data in memory. I fixed this by using Python's buffer interface -- message transmission performance is now good even for very large messages (I tried messages of about 1 GB in size). Relevant commit: https://bitbucket.org/jgehrcke/gipc/commits/5aae1fb8cbddd08276fc39b0827a2e88f4db6747?at=default
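    The copy-avoidance pattern described here, in a nutshell (a generic sketch of the buffer-interface idea, not the actual gipc code):

        import os

        def write_all(fd, data):
            # memoryview slices reference the original buffer instead of
            # copying it, so partial writes never duplicate the payload.
            view = memoryview(data)
            while len(view):
                n = os.write(fd, view)
                view = view[n:]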