Hang with redirection
When redirection fails with an IOError, sarge can hang waiting for spawned processes that will never complete, because they are blocked writing to a full pipe that no one is reading.
A simple test case:
import sarge
pipeline = sarge.run('head -c 65K /dev/zero | cat > /does/not/exist')
The first command hangs on a write to the pipe, which is never closed and sarge hangs on a wait() for this process that will never finish.
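The blocking write can be seen in isolation with a short sketch (a hypothetical demonstration, not sarge code): on Linux a pipe buffer holds roughly 64 KB, so a writer with no reader blocks once the buffer fills. Making the write end non-blocking shows the capacity without actually hanging:

```python
import fcntl
import os

r, w = os.pipe()
# Make the write end non-blocking so filling the buffer raises
# instead of hanging the way `head -c 65K` does in the report.
fcntl.fcntl(w, fcntl.F_SETFL, os.O_NONBLOCK)

total = 0
try:
    while True:
        total += os.write(w, b'\0' * 4096)
except BlockingIOError:
    pass  # pipe buffer is full; a blocking writer would hang here

print(total)  # typically 65536 on Linux
os.close(r)
os.close(w)
```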
-
repo owner -
Account Deleted Right, but even with async I still seem to leak a pipe file descriptor. I can also open the file up-front and pass this to stdout (at least in this example).
It seems like there should be some way to cleanup before sarge blocks on wait(). For instance, this initial approach seems dirty, but seems to resolve the immediate issue (and passes the test suite in a linux environment):
https://bitbucket.org/abg/sarge/commits/3a639f410e1cfd99aa9ad399985629537cb6c67c
-
Account Deleted On further inspection, this is not simply an artifact of failed redirection. Even if redirecting is successful a hang can be observed with this test case:
import logging
import sarge

logging.basicConfig(level=logging.DEBUG)
with open('issue12.input', 'wb') as fileobj:
    fileobj.write(b'\0' * 256 * 1024)
cmd = 'cat issue12.input | head -c 128K | cat > output'
sarge.run(cmd)
This hangs regardless of whether opening the output succeeds or fails. The underlying cause in both cases is the same - dangling pipe descriptors held open by the parent process.
In the successful redirect case, the hang is caused by the first cat blocking on write(stdout, ...): head closes its copy of the pipe and exits, but the parent still holds an open descriptor, so the first cat never sees EPIPE and blocks waiting for someone to read from the other end of the pipe. In the failed redirect case, the hang is on head waiting in write(stdout, ...), and again the parent holds a stale pipe descriptor that no one reads from.
In light of this, my previous fix is obviously insufficient. The closing of descriptors in the parent needs to happen before Command.wait(). This is very similar to the cleanup logic in Pipeline.close(), but I think it needs to happen before waiting on child processes.
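The dangling-descriptor effect can be reproduced with the plain subprocess module (a hypothetical sketch, not sarge internals): a child reading from a pipe only sees EOF once every copy of the write end, including the parent's, has been closed.

```python
import os
import subprocess

def cat_hangs(close_parent_write_end):
    """Return True if `cat` fails to exit within the timeout."""
    r, w = os.pipe()
    # close_fds=True (the default) keeps w out of the child,
    # so only the parent's copy of the write end matters here.
    p = subprocess.Popen(['cat'], stdin=r, stdout=subprocess.DEVNULL)
    os.close(r)  # the child holds its own duplicate of r
    if close_parent_write_end:
        os.close(w)
    try:
        p.wait(timeout=1)
        hung = False
    except subprocess.TimeoutExpired:
        hung = True
        os.close(w)  # now cat finally sees EOF and can exit
        p.wait()
    return hung

print(cat_hangs(True))   # False: cat reads EOF immediately and exits
print(cat_hangs(False))  # True: the stale parent fd keeps the pipe open
```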
-
Account Deleted Backed out the original fix, and rather than attempting cleanup in Command.wait(), I cleanup the pipe descriptors during the setup in Pipeline.run_logical_node(). This avoids the ugly closing logic of stdout/stderr and also removes the issue of a leaked descriptor when run(..., async=True) aborted on an IOError.
https://bitbucket.org/abg/sarge/commits/b69ae0ed37fe98e76a7097fa3b937296d7a31f38
-
repo owner This change seems more reasonable. However, with the "cat issue12.input | head -c 128K" command above, I get a "cat: write error: Broken pipe" error, though there's no hang. I believe it's the first cat which produces the message.
-
Account Deleted cat obviously hits a SIGPIPE - head only reads the first 128KB and closes its stdin, so this error is expected, and stracing the same command line in bash shows the same behavior.
The spurious output is from http://bugs.python.org/issue1652, I think. Restoring the SIGPIPE handler with preexec_fn avoids this message. Python 3.2+'s subprocess also fixes this. I've had to integrate the preexec_fn workaround in some projects in the past. I fixed the sloppy exception handling in my fork so this works on py3 here:
https://bitbucket.org/abg/sarge/commits/4ae0f1a466f2043778f1ea0958113171bcaa603a
Perhaps it makes sense to integrate restore_signals support from Python 3.2 in the sarge Popen subclass for Python 2.
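The SIGPIPE behaviour described here can be demonstrated with the standard subprocess module (a sketch assuming Linux with GNU coreutils): CPython ignores SIGPIPE at interpreter startup, and restore_signals (default True since Python 3.2) resets it to SIG_DFL in the child, which is exactly what the preexec_fn workaround did on Python 2.

```python
import subprocess

cmd = 'cat /dev/zero | head -c 10 > /dev/null'

# Default since Python 3.2: SIGPIPE is reset to SIG_DFL in the child,
# so the first cat dies silently when head closes the pipe.
ok = subprocess.run(cmd, shell=True, stderr=subprocess.PIPE)

# With restore_signals=False the shell and cat inherit SIG_IGN,
# cat's write fails with EPIPE, and it prints a broken-pipe warning
# (the behaviour seen on Python 2 without a preexec_fn workaround).
bad = subprocess.run(cmd, shell=True, stderr=subprocess.PIPE,
                     restore_signals=False)

print(ok.stderr)   # b''
print(bad.stderr)  # mentions "Broken pipe"
```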
-
repo owner The spurious output is from http://bugs.python.org/issue1652, I think.
Good catch - I missed this, seeing only that there was no output when run in bash.
Perhaps it makes sense to integrate restore_signals
I'll look at this.
-
repo owner - changed status to resolved
Closes #12: improved handling of pipes and signals. Thanks to Andrew Garner for the patch and suggestions. → <<cset 2c251c7c03f0>>
-
repo owner - changed status to open
I just noticed that this change doesn't work on PyPy - signal.signal() must be called from the main thread. Strictly it shouldn't work on the other implementations either (though the tests pass), as the signal module docs do warn to call signal.signal() only from the main thread.
-
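The main-thread restriction mentioned above can be checked directly (a minimal sketch; CPython enforces it with a ValueError, even though the tests happened to pass):

```python
import signal
import threading

results = []

def try_install_handler():
    # signal.signal() is documented to work only in the main thread;
    # calling it from any other thread raises ValueError in CPython.
    try:
        signal.signal(signal.SIGPIPE, signal.SIG_DFL)
        results.append('installed')
    except ValueError as exc:
        results.append('error: %s' % exc)

t = threading.Thread(target=try_install_handler)
t.start()
t.join()
print(results[0])  # e.g. "error: signal only works in main thread ..."
```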
repo owner - changed status to resolved
Closes #12: Confined signal handling to main thread. → <<cset 621cf40a6457>>
-
A wait() on a subprocess can't be timed out, unfortunately. If you pass async=True to the run() call, you get an IOError raised from the run().