Issue #598 resolved

socket problems on Mac OS X Tiger

guest
created an issue

There is a big problem with CP in conjunction with Python 2.4.4 on Tiger!

Did you ever try to respond an image (1 Mpx) directly by cherrypy?

I installed Python 2.4.4 Universal Binary Installer, then CherryPy 2.2.1 on Mac OS X Tiger 10.4.8 (Mac Mini Dual Core 2).

My CherryPy App should deliver images (jub -- dynamicly generates ones), but I always receive only the first part of the image in my browser. Cherrypy throws an exception within the write-method, which is delegated to socket-write which ends up in a sendall in socket.py:

error 35: Resource temporarily unavailable.

Seams that this is a problem caused by some changes in the socket api between Python 2.4.3 and 2.4.4. If this is true and I didn't make a mistake, CherryPy can not be used upon P 2.4.4 on Tiger (if you try to respond something bigger than a small HTML page).

I am thinking on Kevin Dangoors nice Turbogears Screencast, running on Max OS X... How should people redo this cool stuff he shows...?

Comments (37)

  1. guest reporter

    thx for help, but it doesn't work. I get still the following:

    10.1.50.222 - - [06/Nov/2006:08:18:58] "GET /png/www.bechtle.de HTTP/1.1" 200 353821 "" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.8.1) Gecko/20061010 Firefox/2.0" Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/cherrypy/_cpwsgiserver.py", line 205, in run request.write(line) File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/cherrypy/_cpwsgiserver.py", line 149, in write self.wfile.write(d) File "/Library/Frameworks/Python.framework/Versions/2.4lib/python2.4/socket.py", line 248, in write self.flush() File "/Library/Frameworks/Python.framework/Versions/2.4lib/python2.4/socket.py", line 235, in flush self._sock.sendall(buffer) error: (35, 'Resource temporarily unavailable')

    I inserted -->

    import errno socket_errors_to_ignore = [] --> socket_errors_to_ignore.append("EAGAIN")

    1. Not all of these names will be defined for every platform. for _ in ("EPIPE", "ETIMEDOUT", "ECONNREFUSED", "ECONNRESET", "EHOSTDOWN", "EHOSTUNREACH", "WSAECONNABORTED", "WSAECONNREFUSED", "WSAECONNRESET", "WSAENETRESET", "WSAETIMEDOUT"): if _ in dir(errno): socket_errors_to_ignore.append(getattr(errno, _))

    at the beginning of _cpwsgiserver.py

    when I call my app from localhost, I get the whole image. when I try to fetch the image with firefox from another machine within the local network, I still get only the first part of the image.

    by the way: we do not have this problem with an old PowerPC-based mac mini, running Python 2.4.2 and CP 2.1.

    we urgently need some help because we expanded a productive system with 3 new mac minis in order the deliver the dynamicly rendered images (using cocoa binding with pyobc 1.4)

  2. Robert Brewer

    Hm, yeah. I wasn't complete enough in my reply; you usually want to '''retry''' the write operation, not discard it, when you get EAGAIN. The attached patch does that. It's quick and dirty, just for testing; if it "fixes" the issue, it probably needs a sleep call and a timeout, and should be applied to all write calls in that module.

  3. guest reporter

    the patch does not work, but I successfully patched it this way:

    def write(self, d): if not self.sent_headers: self.sent_headers = True self.send_headers()

    1. self.wfile.write(d)
    2. self.wfile.flush() chunk_size = 1000*1024 totally_sent = 0 i = 0 while len(d) > 0: c = d[totally_sent:totally_sent+chunk_size] if len(c) < 1: break; try: currently_sent = self.socket.send(d[totally_sent:totally_sent+chunk_size]) except socket.error, e: errno = e.args[0] print "Error", errno, str(e) currently_sent = 0 totally_sent += currently_sent print "%i.: %i, %i" % (i, currently_sent, totally_sent) i += 1

    the debug printings show:

    0.: 65511, 65511 1.: 60049, 125560 2.: 64240, 189800 3.: 11680, 201480 Error 35 (35, 'Resource temporarily unavailable') 4.: 0, 201480 5.: 65700, 267180 6.: 2920, 270100 Error 35 (35, 'Resource temporarily unavailable') 7.: 0, 270100 8.: 59860, 329960 9.: 4380, 334340 10.: 30660, 365000 11.: 30660, 395660 12.: 64240, 459900

    ...

    and so on.

    as you can see, socket.send() throws socket errno 35. seems that socket lib throws wrong error(s) on mac os tiger (errno 35 where obviously socket is currently not able to receive more bytes which is a common response). may be that this is caused by changes in socket module from Python 2.4.2 to 2.4.3 and higher... where socket handling has changed to non-blocking.

    we will take a deeper look to this problem and keep you informed.

    thx for help.

  4. guest reporter

    Well, I am suffering from this. I am trying to transmit the Dojo main file and it is greater than 66608 bytes and it is breaking.

    I'm wondering if the send method is overflowing a buffer by not sending in chunks smaller than system buffer size.

    -- metaperl

  5. Anonymous

    I believe this problem is fixed in CherryPy 3.x. I'm stilling using CherryPy 2.2 and the simple solution appears to be to use a blocking receiving socket. Inside def tick(), comment out these two lines:

        #if hasattr(s, 'settimeout'):
        #    s.settimeout(self.timeout)
    

    And add these two:

        if hasattr(s, 'setblocking'):
            s.setblocking(1)
    

    This appears to be what CherryPy 3.x is doing inside _cphttpserver get_request(). Apparently, using makefile and then calling write() on the file ends up doing a sendall() on the socket. From what I read elsewhere, calling sendall() is not supported for a non-blocking socket. Rather, it is not guaranteed to send all of the data.

    Hope that helps...

    Matt Bendiksen

  6. Robert Brewer

    Matt Haley,

    First, thanks for the info! This has been a very difficult problem to reproduce and debug.

    But according to the Python 2.4.2 docs at least, "s.setblocking(1) is equivalent to s.settimeout(None)". And a value of None implies "block forever". Note that all Python socket objects are blocking by default, and even if they weren't, a call to settimeout calls an internal_setblocking function inside socketmodule.c. So the sendall concerns probably don't apply.

    My guess is changing cherrypy.server.timeout to a value sufficiently larger than the default 10 (seconds) would have the same effect as setblocking(1). If that's not true, then there's probably some platform-specific weirdness going on inside internal_setblocking.

    I don't quite understand your last paragraph, since CP 3 has no _cphttpserver module, and its wsgiserver module has no get_request function...?

  7. Anonymous

    Just ran into this issue using !CherryPy 3.0.2, trying to serve the unpacked version of jQuery 1.1.4 from staticdir. So I think the milestone is wrong here, the problem still exists in CP3.

  8. Christian Wyglendowski

    From http://docs.python.org/lib/socket-objects.html:

    "Some notes on socket blocking and timeouts: A socket object can be in one of three modes: blocking, non-blocking, or timeout. Sockets are always created in blocking mode. In blocking mode, operations block until complete. In non-blocking mode, operations fail (with an error that is unfortunately system-dependent) if they cannot be completed immediately. In timeout mode, operations fail if they cannot be completed within the timeout specified for the socket. The setblocking() method is simply a shorthand for certain settimeout() calls.

    '''Timeout mode internally sets the socket in non-blocking mode.''' The blocking and timeout modes are shared between file descriptors and socket objects that refer to the same network endpoint. A consequence of this is that '''file objects returned by the makefile() method must only be used when the socket is in blocking mode'''; in timeout or non-blocking mode file operations that cannot be completed immediately will fail." (emphasis added)

    Not sure if this helps at all, but it '''does''' sound like we are implicitly using a non-blocking socket in combination with makefile().

  9. Anonymous

    I ran into this issue yesterday while trying to serve a static file of about 120KiB from CP 3.0.2.

    '''My system'''

    • Mac OS X 10.4.10 (Tiger)
    • Mac Python 2.4.4
    • CherryPy 3.0.2

    '''My observed behaviours, known facts and tests:'''

    • CP 3.0.2 logs no errors in error.log and logs 200 OK with the correct number of bytes xfered in access.log: '''this is wrong and the most critical issue -- if its not ok it must not state otherwise'''
    • Error is the same with Mac Python 2.5
    • Error does not show up with Mac OS X bundled Python 2.3
    • Error does not show up against local connections (either loopback or other physical local interface)
    • Setting socket_timeout to None makes things work (which seems to agree with notes above from dowski and fumanchu) - however this leaves us with no way to get out of hanging connections
    • After browsing the code, and with some guidance from dowski on IRC, I changed wsgiserver/init.py at line #367 and wrapped the self.write(chunk) in a try block to catch socket.errors. I managed to get the 35 (EAGAIN and EWOULDBLOCK) mentioned by someone, above -- something I never saw before on the logs as mentioned.
    • Then, in a way similar to the patch above, forced a retry on EAGAIN / EWOULDBLOCK; the result is improved but not perfect:
    • Clients get the file and the connection is not dropped halfway like before (tested with wget)
    • Server seems to hang for a few (1-3 ?) seconds between most requests -- that is: if I script the shell for a wget loop, I observe hangs between most xfers. The client, wget, states "Connection reset by peer" on starting most connections (all but the first in the loop ? not sure)

    Where to go next ? Questions, reasoning, ideas.

    • Not easy. According to python docs mentioned by dowsky, we are effectively working with a non-blocking socket
    • To which, again according to python docs, one should not use makefile() -- which CP uses to readlines() from the socket, if I understand correctly
    • And no code at all seems to be non-blocking socket ready
    • So, again, if I understand correctly, the python docs say one thing that seems to contradict CP's implementation but CP's implementation works everywhere else and only fails on Mac !
    • Something is fishy here, I agree with the above comment by fumanchu
    • Do we have a buggy 2.4.4 python socket implementation on Mac OS ?!
    • I only seem to see three ways of moving forward:
    • If I understand the docs correctly, then wsgiserver socket operations must be changed... forgetting makefile() and readline() and using recv()s and extra buffering/management code; forgetting sendall() and using send() checking both EAGAINs / EWOULDBLOCKS but also the returned number of bytes, etc, etc
    • Somehow fix the "hangs" that the quick fix I tested resulted in... (no idea how!)
    • Write down a minimal pure python socket server test program, with the same socket module calls CP does to take to Mac Python project owners and get their feedback

    Within my limitations (time to spare / tech abilities), I'd like to contribute to fixing this. While not being critical for my work (I develop on Mac but deploy on Linux) it is somewhat annoying and made me really angry after several hours of battling with html / javascript / webservers / browsers / caches / JVMs, etc !!!

    Feel free to get back to me at ex (dot) vitorino (at) gmail (dot) com

  10. Anonymous

    '''More tests and results - Python 2.4.4 vs 2.4.3:'''

    • Installed python 2.4.4 and 2.4.3 from source on Mac OS X 10.4.10
    • Behaviour is good in 2.4.3 and bad in 2.4.4 - consistent with above findings
    • Build python 2.4.4 from source with 2.4.3 socketmodule.[ch] - feeling adventurous...
    • Behaviour is good in 2.4.4 with 2.4.3 socketmodule
    • diffed socketmodule.c between 2.4.4 and 2.4.3
    • diff is 161 lines long - reasonable to evaluate
    • no changes in 2.4.4 internal_setblocking vs 2.4.3 - suspicion by fumanchu not a fact
    • new ifdefs APPLE regarding inet_aton - does not seem related
    • NEW: usage of poll() instead of select() if available -- this might be something !

    '''More tests and results - Pure sockets, no cherrypy:'''

    Created a sample (stupid) http server script based on what I believed were the socket module calls CP does. The objective was to pinpoint the behaviour in the smallest amount of code possible so as to be able to understand and decide what exactly needs to be fixed: CP, Python or Mac OS.

    Unfortunately my creation does not behave as I expected:

    • It works ok if the socket timeout is None in all tested Mac python versions (Apple bundled 2.3, Mac Python's 2.4.4 and 2.5 and the ones I built from source: 2.4.3, 2.4.4 and 2.4.4 with 2.4.3 socket module)
    • And fails with the socket EAGAIN / EWOULDBLOCK (35, 'Resource temporarily unavailable') if the socket timeout is set to anything other than None
    • It works ok on Linux for both cases under Ubuntu 7.04's python 2.4.4 and 2.5.1: socket timeout to None or not

    Here is what I coded:

    import socket
    
    
    # server parameters
    HOST = ''
    PORT = 8088
    SOCKTIMEOUT = 10
    
    # data load to serve: MANY times DATA
    DATA = ''.join(map(chr, xrange(256)))
    MANY = 1024
    
    # setup server socket
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.settimeout(SOCKTIMEOUT)
    s.bind((HOST,PORT))
    s.listen(1)
    
    # wait for connection
    conn, addr = s.accept()
    
    # read HTTP request headers, discard body
    # (good enough for wget)
    f = conn.makefile()
    while True:
            req = f.readline().strip()
            if req=='':
                    break
    
    # send minimal HTTP response headers
    conn.sendall('HTTP/1.0 200 OK\r\n')
    conn.sendall('Server: tigerSockets\r\n')
    conn.sendall('Content-length: %i\r\n' % (len(DATA)*MANY))
    conn.sendall('\r\n')
    
    # send HTTP response content
    for count in xrange(MANY):
            conn.sendall(DATA)
    
    # done, close connection socket
    conn.close()
    
    # close serving socket
    s.close()
    

    In a way, this code's behaviour is more consistent... It always fails on Mac if timeout is not None (socket is in non-blocking mode, according to docs). Questions still to be answered:

    • Why does CP's behaviour varies with python versions ?
    • Why does Linux always work ?
    • Obviously my code is not equivalent to basic CP socket module usage: can anyone help out ?

    Please add comments or reach me at ex (dot) vitorino (at) gmail (dot) com

  11. Anonymous

    '''Extra info regarding pure socket, non CP test code''':

    • Unlike CP, this code also fails on the loopback and local interfaces (however it doesn't fail as often/easily with the 256KiB payload)
    • Its behaviour seems very consistent: fails for socket timeouts other than None on local and remote connections under all python versions tested on Mac
  12. Anonymous

    '''Extra info regarding python 2.4.4:'''

    • Built python 2.4.4 from scratch forcing the usage of select() instead of poll():
    • ./configure --prefix=xxx/python-2.4.4-nopoll
    • set to 0 all the defines in pyconfig.h regarding POLL / POLL_H etc, used by socketmodule.c
    • result is the same for both CP test program and raw socket test program: fails if socket timeout is set to anything other than None

    Note to self: should trace the system calls just to be sure that poll() vs select() are being used under the different test scenarios... *sigh!*

  13. Anonymous

    '''quick fix - different approach - seems ok'''

    Feeling a bit baffled by the python runtime tests and the behaviour of my "no-CP-sockets-only" test script, I decided move forward based on the following assumptions:

    • The python docs are correct - we are dealing with non-blocking sockets: we should handle EAGAINs and EWOULDBLOCKS, we cannot use makefile() and thus we cannot use readline()
    • Somehow the network stacks in different OSs bahave differently thats why the non-blocking nature of the sockets result in different behaviours under different platforms
    • Forget, for now, that the changes python 2.4.4 socket module vs. 2.4.3 have impact in CPs behaviour

    That said, I hacked wsgiserver/init.py in a similar logic to what initially fumanchu did and that I tested a few days ago with half success. This time, however, I decided not to use socket.sendall() but socket.send() instead, which gives more control at the cost of a little more work. First results look promising.

    What I did:

    • Created a new method in HTTPRequest named _good_sendall
    • Changed the self.sendall definition of HTTPRequest in the constructor to point to _good_sendall

    The code for _good_sendall:

        def _good_sendall(self, buf):
            pending = len(buf)
            offset = 0
            while pending:
                try:
                    sent = self.connection.socket.send(buf[offset:])
                    pending -= sent
                    offset += sent
                except socket.error, e:
                    if e[0]!=errno.EAGAIN:
                        raise
    

    That's it. First tests give good results and I no longer experience the 1-3 sec delay between serving consecutive requests.

    '''Next:'''

    Again, according to python docs, I should eliminate the readline() usage... A quick glance at the code shows rfile and readline() at several places. This will need a little more thinking, even as quick hack, as rfile will need to be discarded (need to see if has more implications that readline() by itself).

    Anyway, the ideia will be to create a new method named _good_readline() that instead will use socket.recv() loop and a buffer, returning the line when a '\r\n' is found and keeping the remaining of the buffer until the next call.

  14. Christian Wyglendowski

    Great analysis and progress on this so far! The rfile object is fairly important, so we are going to at least need some sort of file-like interface to the socket. It is passed to cgi.FieldStorage for reading POST request bodies. If there is someway to created a non-blocking-friendly version of makefile(), that would probably be ideal.

  15. Anonymous

    Thanks for the feedback. I actually felt there was something in that rfile saying "don't dismiss me!". Now I'm certain. I'll need to take a look at the issue with some more time. Probable next steps:

    • Read wsgiserver code and improve my overall understanding of it
    • Understand 100% clearly specific usage of rfile:
    • Within wsgiserver and its interfaces/API
    • And according to the WSGI spec - need to check the PEP333
    • Possible approach might be, as suggested by dowski, create a file like object as an interface non-blocking sockets IO; then adapt the current usage of socket.makefile() to something else like _nbsocket_makefile that instantiates one of such objects.

    Week is really busy. I'll probably only get back to this on the weekend.

  16. Anonymous

    quick note:

    • the socket module already implements a file-like object - that's what socket.makefile() returns
    • it's just that the readline() in that "file" is not timeout resistant...
    • ...so it raises socket.timeout exceptions (although this is not clearly documented in python -- is this expected, stable behaviour between python versions ? is it by design or as a side effect ?)
    • socket module file object seems the same code between python 2.3.x and 2.4.x
    • maybe the solution is as simple as ensuring CP keeps trying its readlines() on socket.timeouts() ? maybe it already does this ? (I don't think so)
    • BTW, need to check to which sockets the settimeout is being applied; I'd assume both to the server socket listening to connections and to the connection sockets... maybe this could help understand the different behaviours on different platfroms...

    I'll be back!

  17. Anonymous

    extra note:

    • it seems we will really need to implement a non-blocking aware file-like object for makefile() replacement...
  18. guest reporter

    interesting old thread (raises good questions):

    • some of them shared by me while developing the non-blocking friendly file like object to replace current socket.makefile()

    extra notes/questions:

    • regarding the issue raised in the thread I linked in the previous note (readline buffering in timeout conditions) how should one go about readline() behaviour in non-blocking socket conditions ?
    • behave like blocking version ?
    • or raise timeouts but keep buffering incoming data ?

    todos:

    • need to check what WSGI spec states regarding non-blocking sockets and related behaviours
    • need to check the effective usage of timeouts under CPs wsgiserver... what are they used for ? I'd say the objective is to drop idling connections... Am I correct ?
  19. Anonymous

    reminder / restatement of the original problem:

    • used sockets have a timeout set on them
    • this makes them non-blocking sockets
    • socket.sendall() and eventually other socket calls timeout earlier that the defined timeout when run on Mac OS X
  20. Anonymous

    ...just got a new mac running 10.5.1, with python 2.5.1 bundled.

    It seems to have the same behaviour as in 10.4.10... I just did the pure python, pure socket test dated 09/15/07, 13:03:09, above, and it fails with timeouts. Mac Python 2.4.4 also fails.

    I need a way to get time to fix this once and forever... It's almost becoming a personal issue ! :-)

  21. Robert Brewer

    Okay, I made source:branches/598-sendall/cherrypy to work on the "nonblocking sendall" approach to fixing this (similar to exvito's good_sendall code above). Lakin confirmed that it fixed the original issue (sendall raising EAGAIN) on OS X.

    However, it broke a couple of tests on all platforms, namely, test_conn and test_xmlrpc. These would both hang in the middle of the test. The fix for ''those'' was to explicitly call close() on the kernel socket, since Python's socket module does not do that until the socket is GC'ed.

    However, that breaks a couple of different tests when run with the --ssl option. Still need to track those down and fix them.

  22. Anonymous

    I was able to reproduce the issue using python2.4 and the scripts in the attached zips (598_test.zip). I'm on OSX 10.5, and with the built-in 2.5.1, CP does not seem to have an issue, although the pure socket test _does_ have the issue. As fumanchu mentioned his branch fixes the issue.

    I am going to attach a few scripts. There are two python files, one is a test server which serves up a large response (1MB) on /hello You can test this from across the network using wget. Wget or curl on the local machine doesn't reproduce it. However, the included py_wget.py script does reproduce the problem locally. Interestingly enough the values of 512 and 6.0 for the recv and time.sleep calls are important. Fiddling with them will affect the ability of the script to reproduce the problem. This confirmed that producing it was a timing issue and that it was probably the EAGAIN issue. Because of the extreme specificness of reproducing the error, I'm not bothering with a Unit test, as we would only want to run it on OSX, with python2.4 :)

  23. Anonymous

    Even better news is that the test passes using python2.4 on osx if I use the 598-sendall branch.

  24. Robert Brewer

    OK, Ben Bangert had similar OSX issues with httplib2 and was kind enough to enlighten us about the fix on IRC (with some help from Philip Jenvey).

    Basically, poll() on OSX is broken. Apple's build of Python works around this issue by "undefining" HAVE_POLL; see the first line of http://www.opensource.apple.com/darwinsource/10.5.6/python-30.1.2/fix/pyconfig.ed. That causes Python's `socketmodule.c` to use select() instead of poll() in the `internal_select` function (which is called by `sock_sendall`).

    So. If you're using a build of Python other than Apple's, you probably need to apply the same remedy: make sure HAVE_POLL is not defined.

  25. Log in to comment