Need help debugging a "result has been cleared" exception

Issue #166 resolved
Anonymous created an issue

In one of my projects at work we've set up a load-testing environment for a Rails application. Within this environment I'm seeing frequent (but apparently non-deterministic) PGError (result has been cleared) exceptions. I'm stumped, since I've never seen these before: not in any other environment for this project, nor in any other Postgres-related Ruby project.

The basic project parameters: ruby 2.0.0-p0 (I'll try upgrading to p195 and see if that changes anything), pg 0.15.1, rails 3.2.13, unicorn, postgres 9.2.4 running on 64-bit CentOS 5.8.

Differences between our environments: production uses physical machines, has pg compiled against the libpq from PostgreSQL 9.2.1 (a patch-level mismatch: 9.2.1 client vs 9.2.4 server), and runs Postgres in the default locale. Loadtest runs in VMware-based virtual machines, has pg compiled against the libpq from PostgreSQL 9.2.4 (matching versions), and runs Postgres in a German locale.

To be honest, I'm a bit lost here. In your experience, what are the likely culprits, and how would I go about tracking down the underlying issue?

Any input and ideas are welcome!

Comments (13)

  1. Lars Kanis

    I don't have an explanation right now. This error is raised when a PG::Result object is cleared first but queried for result information later on. I inspected the Rails 3.2.13 code for any such use case, but this situation should not arise there. Do you use PG::Connection or PG::Result objects directly anywhere in your application?

    Could you also post the stack trace of the exception?

    The version mismatch isn't an issue.

  2. _sr_

    Do I use PG::Result or PG::Connection directly? Not that I know of. All of my queries go through ActiveRecord, and unicorn.conf has the ActiveRecord::Base.connection.disconnect! and ActiveRecord::Base.connection.establish_connection calls in the before_fork and after_fork hooks, so connection reuse across processes should be implausible.
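
    For reference, the hook setup described above typically looks like this in unicorn.conf (a sketch of the usual pattern, not the project's actual file):

    ```ruby
    # unicorn.conf -- fork-safety hooks for ActiveRecord (sketch)
    before_fork do |server, worker|
      # close the master's connection so forked workers don't share a socket
      ActiveRecord::Base.connection.disconnect!
    end

    after_fork do |server, worker|
      # each worker opens its own fresh connection
      ActiveRecord::Base.establish_connection
    end
    ```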

    The stack trace terminates in a save! call; it doesn't go any deeper than that (nothing within AR::B itself).

    What I am doing that is unusual (though it works in the production environment without a hitch) is using unique indexes to inform me of uniqueness-validation violations, see here:

    However, this mechanism is not currently being triggered in the loadtest environment; we're using UUIDs for the relevant HTTP parameters to ensure uniqueness.

  3. Lars Kanis

    Sorry, I still have no idea where the null pointer could come from, if not from PG::Result#clear. I also don't remember any past issue involving "result has been cleared".

    There are two ways to get a PG::Result object with a null pointer: either result#clear was called, or the query function (such as conn#exec) didn't return a result. The latter case is always checked in ruby-pg and a proper exception is raised, so such a result object should never be delivered to Ruby code.

    It also doesn't look like a GC issue: a PG::Result object is GC'ed and freed regardless of whether it was cleared or not.

    Do you use DRb in the project? DRb retrieves objects from the ObjectSpace and sometimes picks up the wrong object if the desired one has already been GC'ed.
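
    The mechanism behind that suspicion can be sketched in a few lines (a minimal illustration, not code from the project):

    ```ruby
    require 'drb/drb'

    # DRb's default id/object conversion round-trips through ObjectSpace,
    # so an id is only trustworthy while the original object is alive.
    conv = DRb::DRbIdConv.new
    obj  = Object.new
    id   = conv.to_id(obj)              # essentially obj.__id__
    same = conv.to_obj(id).equal?(obj)  # true while obj is still referenced
    # once obj has been GC'ed, the same id may resolve to an unrelated
    # object (or raise RangeError) -- the wrong-object delivery described
    # above
    ```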

    "The stack trace terminates in a save! call; doesn't go deeper than that (nothing within AR::B itself)."

    It seems the stack trace is an application-only version. Any chance of getting a trace that goes deeper than "save!"?

  4. _sr_

    No DRb. Personally, I suspect VMware may be doing something nasty in the background.

    I don't have a backtrace cleaner active; usually my backtraces go all the way down into library code, so having this one stop there is odd. However: Airbrake to the rescue. It has a more complete backtrace than the production logs:

    [GEM_ROOT]/gems/activerecord-3.2.13/lib/active_record/connection_adapters/postgresql_adapter.rb:1147:in `error_field'
    [GEM_ROOT]/gems/activerecord-3.2.13/lib/active_record/connection_adapters/postgresql_adapter.rb:1147:in `translate_exception'
    [GEM_ROOT]/gems/activerecord-3.2.13/lib/active_record/connection_adapters/abstract_adapter.rb:284:in `rescue in log'
    [GEM_ROOT]/gems/activerecord-3.2.13/lib/active_record/connection_adapters/abstract_adapter.rb:280:in `log'
    [GEM_ROOT]/gems/activerecord-3.2.13/lib/active_record/connection_adapters/postgresql_adapter.rb:649:in `execute'
    [GEM_ROOT]/gems/activerecord-3.2.13/lib/active_record/connection_adapters/postgresql_adapter.rb:709:in `rollback_db_transaction'
    [GEM_ROOT]/gems/activerecord-3.2.13/lib/active_record/connection_adapters/abstract/database_statements.rb:199:in `rescue in transaction'
    [GEM_ROOT]/gems/activerecord-3.2.13/lib/active_record/connection_adapters/abstract/database_statements.rb:180:in `transaction'
    [GEM_ROOT]/gems/activerecord-3.2.13/lib/active_record/transactions.rb:208:in `transaction'
    [GEM_ROOT]/gems/activerecord-3.2.13/lib/active_record/transactions.rb:311:in `with_transaction_returning_status'
    [GEM_ROOT]/gems/activerecord-3.2.13/lib/active_record/transactions.rb:264:in `save!'
    lib/user_profile.rb:49:in `save!'
    lib/user_profile/registration.rb:17:in `save_and_mail!'

  5. Lars Kanis

    Now I've got it. Of course: the result is delivered to the application as an attachment to the exception. So if sending the query fails, libpq returns a NULL result, and exception.result.error_field will raise "result has been cleared":

    c = PGconn.connect_start('', 54320, '', '', 'me', 'xxxx', 'somedb')
    begin
      c.exec "select 1"
    rescue PG::Error => e
      e.result.error_field(PG::Result::PG_DIAG_SQLSTATE)  # raises "result has been cleared"
    end

    I think the best solution is to let exception.result return nil. This will break ActiveRecord v3.2.13 too, but since that code was changed in v4.0, it will translate the exception into a StatementInvalid with a more meaningful error text there. I'll fix ruby-pg in that way.

    As a workaround for v3.2.13 you could insert

    rescue PG::Error

    into active_record/connection_adapters/postgresql_adapter.rb:1155. This should reveal the real error message, probably something like "no connection to the server" or "terminating connection due to administrator command". That, in turn, could be caused by a connection being closed from a forked process, or by something else that breaks your connection to the PostgreSQL server.
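
    The shape of that workaround would be roughly as follows. This is a hypothetical paraphrase of translate_exception in the 3.2.13 adapter, not a quote of the actual Rails source; only the rescue clause is the suggested change:

    ```ruby
    # active_record/connection_adapters/postgresql_adapter.rb (paraphrased)
    def translate_exception(exception, message)
      case exception.result.error_field(PGresult::PG_DIAG_SQLSTATE)
      when UNIQUE_VIOLATION
        RecordNotUnique.new(message, exception)
      when FOREIGN_KEY_VIOLATION
        InvalidForeignKey.new(message, exception)
      else
        super
      end
    rescue PG::Error
      # workaround: the attached result was never populated, so fall back
      # to the generic translation instead of raising "result has been
      # cleared"
      super
    end
    ```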

  6. _sr_

    A quick bit of feedback: the workaround for 3.2.13 worked; we're now getting a "PG::Error: socket not connected" message. That will help us track down the underlying culprit.

    Thanks for your help!

  7. _sr_

    OK, we managed to find the underlying culprit: SSL (the server was configured with the snakeoil SSL certs, and the clients were running without an explicit sslmode, thus defaulting to 'prefer'). Why only some connections failed, I don't really know, but the behavior seems odd on the client side.

    Do you want to continue with that issue in this ticket, or should I open a new one with details?

  8. Lars Kanis

    Since ruby-pg does not do anything in particular with SSL, I guess that moves the issue to libpq, and hence to the Postgres people, doesn't it?

    How did you end up solving the issue?

  9. _sr_

    If ruby-pg doesn't do anything with SSL other than pass the options along to libpq, then it's definitely a libpq issue. However, the whole behavior is odd, and I can't really categorize who the proper addressee would be. To elaborate: when the 'socket not connected' error pops up, the client sends a FIN/ACK to the server, which replies with two more TCP packets with PSH before sending its own FIN. By then, however, the client-side process is already gone (because of the exception, I presume), leading to a 'connection reset by peer' log message on the server. So the odd thing is: the client sends a FIN but complains about 'socket not connected', and only when using SSL.

    We "solved" the issue by disabling SSL for now (on the server side, as well as by adding 'sslmode: disable' to the relevant environments in config/database.yml). We're going to have to take a hard look at the entire situation before re-enabling SSL.
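
    For the record, the client-side half of that change is a single line per environment in config/database.yml; everything except the sslmode key below is a placeholder:

    ```yaml
    # config/database.yml -- placeholder values, except sslmode
    loadtest:
      adapter: postgresql
      database: myapp_loadtest
      sslmode: disable
    ```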
