PQsetClientEncoding() called without releasing the GVL during initialization

Issue #245 resolved
danielnc
created an issue

I think I've spotted a bug with 0.18.4

We are scaling out our database and using several sidekiq workers.

We are currently using postgresql 9.4 with pgbouncer to allow more light-weight connections

Rails 3.2 + Latest version of Sidekiq + pg 0.18.4

Whenever we exhaust pgbouncer real connections Sidekiq just hangs and wont even respond to TTIN commands. I have to force kill sidekiq process with -9

​This happens on our production environment whenever we reach our 250 real connections and hit our database really heavy​

An easy way to reproduce locally is to create 1 sidekiq process with 25 threads and 1 real(server connection) pgbouncer connection(50000 client connections allowed)

​Here is my local gdb output​(osx) our production environments(centos) behave the same way

https://gist.github.com/danielnc/81e27b8b4012a41bc795e01a5b36db99

​Here is my pgbouncer.ini config to make it easy to reproduce

https://gist.github.com/danielnc/3ce76c8a8957e8a52f65c9224839e670​

I've created a minimal rails 4.2 app with sidekiq and I'm able to reproduce the same behaviour so this is doesn't seems to be a rails specific version issue

Thread 10 (Thread 0x2103 of process 40990):
0  0x00007fff94b892a2 in poll () from /usr/lib/system/libsystem_kernel.dylib
1  0x00000001035c892d in pqSocketCheck () from /usr/local/opt/postgresql/lib/libpq.5.dylib
2  0x00000001035c87f9 in pqWaitTimed () from /usr/local/opt/postgresql/lib/libpq.5.dylib
3  0x00000001035c5fd8 in PQgetResult () from /usr/local/opt/postgresql/lib/libpq.5.dylib
4  0x00000001035c62be in PQexecFinish () from /usr/local/opt/postgresql/lib/libpq.5.dylib
5  0x00000001035c31bc in PQsetClientEncoding () from /usr/local/opt/postgresql/lib/libpq.5.dylib
6  0x000000010359d2a7 in pgconn_set_default_encoding () from /Users/user/.rbenv/versions/2.2.4/gemsets/caremessage-v1.1/extensions/x86_64-darwin-15/2.2.0-static/pg-0.18.4/pg_ext.bundle
7  0x00000001035990b7 in pgconn_init () from /Users/user/.rbenv/versions/2.2.4/gemsets/caremessage-v1.1/extensions/x86_64-darwin-15/2.2.0-static/pg-0.18.4/pg_ext.bundle
8  0x0000000102daa929 in vm_call0_body ()
9  0x0000000102da9d82 in rb_call0 ()
10 0x0000000102cbc969 in rb_class_new_instance ()

pg is not releasing the GVL before making a network call.

PQsetClientEncoding() is currently called without releasing the GVL while the PG::Connection initialisation. It should either be GVL aware or use the asynchronous libpq interface.

Thanks

Comments (16)

  1. Lars Kanis

    You're right, PQsetClientEncoding() is currently called without releasing the GVL while the PG::Connection initialisation. It should either be GVL aware or use the asynchronous libpq interface.

    I'll fix this.

  2. Lars Kanis

    @danielnc You can use the latest gem prerelease either by:

    gem inst pg --pre
    

    or with bundler:

    gem "pg", "=0.19.0.pre20160817083826"
    

    It's also possible now, to fetch and build the gem from a git repository:

    gem "pg", git: "https://github.com/ged/ruby-pg"
    
  3. Michael Granger repo owner

    @danielnc it looks like it's now giving up the GVL in both threads that are trying to connect, so when you say "Our process is still stuck" do you mean that it still blocks other Ruby threads? If you need the connect to be asynchronous from Ruby, you can use PG::Connection.connect_start + conn.connect_poll instead of PG::Connection.new, but new blocks until it can connect (or until it times out) by design.

    You can also set your own connection timeout using the connect_timeout option so connections fail faster as well.

  4. danielnc reporter

    @Michael Granger when I say our process is still stuck it still block other Ruby Threads.

    We don't need it to be async because we are using PgBouncer as a middleman

    Clients are able to connect but are stuck in the middle of running queries so I'm not 100% if this is another issue with pg

  5. Lars Kanis

    @danielnc I've changed the part for setting the default_encoding to use the asynchronous libpq interface: a646a1394e

    Therefore if a thread opens a new connection and blocks at "SET client_encoding", it will still respond to signals (Process.kill) and exception events (Thread#raise). Previously the blocking threads did not respond to such events (but all other threads did respond).

    You should also make sure, that when working with signals, you're using PG::Connection#async_exec instead of #exec for all queries. As @Chris Bandy pointed out, there is a single call to #exec in your log, which could be the blocking piece in your application. As an alternative you can use the PG::Connection#send/block/get_last_result sequence.

    I've pushed another pre-release on rubygems.org with the latest change. Could you please try it out?

  6. danielnc reporter

    @Lars Kanis this did not fix the issue where our connections just lingers on pgbouncer(although threads are now responsive to signals)

    I've opened an issue(https://github.com/pgbouncer/pgbouncer/issues/143) to try to understand why I'm getting FATAL and WARNING messages. I'm assuming that this might be the cause of clients being stuck while waiting to receive results from pgbouncer

    I think we can close this issue since at least now threads are responsive

    I'll have to try to investigate further since we cannot find a reason why our clients cannot work well with pgbouncer

    Thanks

  7. Log in to comment