Issue #147 on hold

PGconn#async_exec loops forever if socket dies.

Kurt Stephens avatarKurt Stephens created an issue

We are losing DB connectivity in PGconn#async_exec during the PQisBusy() loop in pgconn_block().

This condition causes an infinite loop, consuming all CPU resources.

Break-down of the problem:

  • PGconn#async_exec uses pgconn_block() to relinquish control to other Ruby threads while it's waiting for results from the PG connection FD.

  • pgconn_block() loops while PQisBusy() is true, asking rb_thread_select() to allow other Ruby threads to work until there is readable data on the PG connection FD:

    (gdb) do
    #4  0xb6d0ab56 in pgconn_block (argc=0, argv=0x0, self=3023657760) at
    pg.c:2633
    2633            ret = rb_thread_select( sd+1, &sd_rset, NULL, NULL, ptimeout );
    (gdb) l
    2628        PQconsumeInput( conn );
    2629   
    2630        while ( PQisBusy(conn) ) {
    2631            FD_ZERO( &sd_rset );
    2632            FD_SET( sd, &sd_rset );
    2633            ret = rb_thread_select( sd+1, &sd_rset, NULL, NULL,
ptimeout );
    2634   
    2635            /* Return false if there was a timeout argument and the
select() timed out */
    2636            if ( ret == 0 && argc )
    2637                return Qfalse;

    (gdb) p sd_rset
    $2 = {__fds_bits = {16, 0 <repeats 31 times>}}
  • "16" corresponds to the 5th bit, meaning fd 4, which is no longer open:
    kurt@ca10:~$ sudo lsof -p 11892
    ...
    ruby    11892 user    0u   CHR       5,1            536 /dev/console
    ruby    11892 user    1w  FIFO       0,6           5861 pipe
    ruby    11892 user    2u   REG       8,1  158788 144691 __HIDDEN__
    ruby    11892 user    3r   REG       8,1      11   3734 __HIDDEN__
    ruby    11892 user    5u   REG       8,1  158788 144691 __HIDDEN__
    ruby    11892 user    6u  IPv4 327473100            TCP HIDDEN:HIDDEN (CLOSE_WAIT)
  • rb_thread_schedule() is aware of the dead file descriptor (4) because it calls fcntl() on it to check its validity before calling select():
    $ strace -p 11892
    fcntl64(4, F_GETFD)                     = -1 EBADF (Bad file descriptor)
    ...

    (gdb) up
    #1  0xb7d9b17d in fcntl () from /lib/i686/nosegneg/libc.so.6
    (gdb) up
    #2  0x08065d59 in rb_thread_schedule () at eval.c:11480
    11480                        if ((FD_ISSET(fd, &readfds) ||
    (gdb) l
    11475                if (e == EBADF) {
    11476                    int badfd = -1;
    11477                    int fd;
    11478                    int dummy;
    11479                    for (fd = 0; fd <= max; fd++) {
    11480                        if ((FD_ISSET(fd, &readfds) ||
    11481                             FD_ISSET(fd, &writefds) ||
    11482                             FD_ISSET(fd, &exceptfds)) &&
    11483    #ifndef _WIN32
    11484                            fcntl(fd, F_GETFD, &dummy) == -1 &&
  • fcntl() returns -1 and rb_thread_schedule() returns immediately to pgconn_block().

  • PQisBusy() continues to return true; it's unaware that its file descriptor has been closed, thus it hammers the ruby thread scheduler.

  • PGconn#async_exec never returns; it spins forever while polling fcntl().

=====================

This is occurring on CRuby 1.8.7. I have not yet verified on CRuby 1.9+.

See the attached patch against gem pg 0.14.1.

A possible fix is to use fcntl(fd, F_GETFD, &dummy) in pgconn_block() and raise an error if it == -1, but this is not portable to non-*nix.

I did not find a portable libpq function to do it.

Thanks, Kurt Stephens

Comments (7)

  1. larskanis

    Recently I compared PG::Connection#async_exec against PQexec() internal implementation in libpq and noticed that both work somewhat different. It makes sense to align PG::Connection#async_exec to libpq, first. I'm a little bit busy currently, so it might take some weeks...

    How did you trigger the closing of the socket? Does PG::Connection#exec work better with closed socket?

  2. larskanis

    I'm currently not able to reproduce your issue. This is the test case I tried:

        it "should not stumble across closed connections" do
            pid = @conn.async_exec( 'SELECT pg_backend_pid() AS pid' ){|res| res.to_a.first['pid'] }
    
            Thread.new do
                sleep 2
                Process.kill(9, pid.to_i)
            end
            expect do
                begin
                    @conn.async_exec('SELECT pg_sleep(4)')
                rescue PG::Error => err
                    err.result.should be_nil
                    raise
                end
            end.to raise_error(PG::Error)
        end
    

    This test finishes after 2 seconds successfully. The exception raised on the died connection has no result object and no text. Subsequent queries result in exception "PG::Error: socket not open".

    Do I miss-understand your issue? How can your described behavior become reproduced?

    Test-Environment: Ubuntu-12.10, MRI-Ruby-1.8.7-p358 and MRI-Ruby-1.9.3-p125, PostgreSQL-8.3.9 and PostgreSQL-9.2.1, ruby-pg: default-branch

  3. Kurt Stephens

    Sorry for such a delayed response. Our problem is due to pooling and/or network connectivity -- we haven't found the root cause. I have an rspec that does what Frank describes above.

    The patch I supplied has one flaw: the call to PQfinish() prevents the use of PGconn#reset by ActiveRecord, because PQfinish() discards all credentials. We are running my patch in our production env without the call to pgconn_finish().

    I wish libpq had a "PQclose()", that closes the socket and resets the internal PQConnection state, without clearing credentials; it would also improve behavior in forked Ruby processes.

  4. larskanis

    I did some more testing but was still not able to reproduce the behavior you described. What I did is:

    a) Process.kill(9, pid.to_i)

    b) @conn.close

    c) IO.new(@conn.socket).close

    before the pg_sleep-query and while the pg_sleep-query in a second thread.

    a) makes the socket to close remotely. This let rb_thread-select() return with readable fd. Then PQconsumeInput() is called which fails with server closed the connection unexpectedly in any case (1.8.7/1.9.3, before/while pg_sleep).

    b) and c) while #async_exec are IMHO no supported use cases - since the connection is altered from two threads concurrently. A local close may rather indicate a bug in the pooling library.

    b) and c) while #async_exec makes rb_thread_select() block forever on Ruby 1.9.3. This is regular behavior of a local socket close while select(). On Ruby 1.8.7 b) makes rb_thread_select() return with error and so Errno::EBADF: Bad file descriptor - rb_thread_select() is raised. c) lets rb_thread_select() directly raise IOError: stream closed.

    It is essential for me to get a proper test case for your issue.

    Nevertheless I don't believe that the socket is closed remotely. It's more that I guess that the socket is somehow closed locally by the pooling code or whatever. What do you use for pooling?

  5. Michael Granger

    I can't reproduce this either. Putting it on hold pending someone being able to make a minimal test case. Since #block doesn't use rb_thread_select() under 0.15 anyway, this may be moot.

  6. Log in to comment
Tip: Filter by directory path e.g. /media app.js to search for public/media/app.js.
Tip: Use camelCasing e.g. ProjME to search for ProjectModifiedEvent.java.
Tip: Filter by extension type e.g. /repo .js to search for all .js files in the /repo directory.
Tip: Separate your search with spaces e.g. /ssh pom.xml to search for src/ssh/pom.xml.
Tip: Use ↑ and ↓ arrow keys to navigate and return to view the file.
Tip: You can also navigate files with Ctrl+j (next) and Ctrl+k (previous) and view the file with Ctrl+o.
Tip: You can also navigate files with Alt+j (next) and Alt+k (previous) and view the file with Alt+o.