integrate "pre-ping" into connection pool

Issue #3919 resolved
Mark Sapiro created an issue

This issue has been observed in Mailman 3 with

  • postgresql 9.3
  • psycopg2 2.6.2
  • SQLAlchemy 1.1.5.

There are several persistent runner processes. If the postgresql server is restarted without restarting the Mailman runners, each runner encounters an error similar to the following traceback upon its next database access which may be at a time significantly after the postgresql restart.

Feb 22 22:51:13 2017 (30637) LMTP message parsing
Traceback (most recent call last):
  File "/opt/mailman/mailman-bundler/venv-3.4/lib/python3.4/site-packages/sqlalchemy/engine/base.py", line 1182, in _execute_context
    context)
  File "/opt/mailman/mailman-bundler/venv-3.4/lib/python3.4/site-packages/sqlalchemy/engine/default.py", line 470, in do_execute
    cursor.execute(statement, parameters)
psycopg2.OperationalError: terminating connection due to administrator command
SSL connection has been closed unexpectedly

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/mailman/mailman-bundler/venv-3.4/lib/python3.4/site-packages/mailman-3.1.0b4-py3.4.egg/mailman/runners/lmtp.py", line 133, in process_message
    listnames = set(getUtility(IListManager).names)
  File "/opt/mailman/mailman-bundler/venv-3.4/lib/python3.4/site-packages/mailman-3.1.0b4-py3.4.egg/mailman/model/listmanager.py", line 107, in names
    MailingList.list_name):
  File "/opt/mailman/mailman-bundler/venv-3.4/lib/python3.4/site-packages/sqlalchemy/orm/query.py", line 1202, in values
    return iter(q)
  File "/opt/mailman/mailman-bundler/venv-3.4/lib/python3.4/site-packages/sqlalchemy/orm/query.py", line 2790, in __iter__
    return self._execute_and_instances(context)
  File "/opt/mailman/mailman-bundler/venv-3.4/lib/python3.4/site-packages/sqlalchemy/orm/query.py", line 2813, in _execute_and_instances
    result = conn.execute(querycontext.statement, self._params)
  File "/opt/mailman/mailman-bundler/venv-3.4/lib/python3.4/site-packages/sqlalchemy/engine/base.py", line 945, in execute
    return meth(self, multiparams, params)
  File "/opt/mailman/mailman-bundler/venv-3.4/lib/python3.4/site-packages/sqlalchemy/sql/elements.py", line 263, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/opt/mailman/mailman-bundler/venv-3.4/lib/python3.4/site-packages/sqlalchemy/engine/base.py", line 1053, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/opt/mailman/mailman-bundler/venv-3.4/lib/python3.4/site-packages/sqlalchemy/engine/base.py", line 1189, in _execute_context
    context)
  File "/opt/mailman/mailman-bundler/venv-3.4/lib/python3.4/site-packages/sqlalchemy/engine/base.py", line 1393, in _handle_dbapi_exception
    exc_info
  File "/opt/mailman/mailman-bundler/venv-3.4/lib/python3.4/site-packages/sqlalchemy/util/compat.py", line 203, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/opt/mailman/mailman-bundler/venv-3.4/lib/python3.4/site-packages/sqlalchemy/util/compat.py", line 186, in reraise
    raise value.with_traceback(tb)
  File "/opt/mailman/mailman-bundler/venv-3.4/lib/python3.4/site-packages/sqlalchemy/engine/base.py", line 1182, in _execute_context
    context)
  File "/opt/mailman/mailman-bundler/venv-3.4/lib/python3.4/site-packages/sqlalchemy/engine/default.py", line 470, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) terminating connection due to administrator command
SSL connection has been closed unexpectedly
 [SQL: 'SELECT mailinglist.mail_host AS mailinglist_mail_host, mailinglist.list_name AS mailinglist_list_name \nFROM mailinglist']

This appears similar to https://bitbucket.org/zzzeek/sqlalchemy/issues/2570 but that is old and resolved.

Comments (17)

  1. Mike Bayer repo owner

    this is by design - SQLAlchemy core pools connections by default and does not "pre-ping" the connection to see if it's alive.

    See the documentation at http://docs.sqlalchemy.org/en/latest/core/pooling.html#dealing-with-disconnects which discusses both "optimistic" and "pessimistic" approaches to dealing with disconnected connections inside the connection pool. "pessimistic" is simpler to add to an existing application but adds a small performance hit. Another option is to disable pooling entirely (see http://docs.sqlalchemy.org/en/latest/core/pooling.html#switching-pool-implementations).

  2. Mike Bayer repo owner

    I should admit that there is a theoretical way that "ping" approach could be integrated automatically, treating the first statement invoked in the transaction as the "ping" and replaying it. Replaying a statement that's already been invoked is a "big deal" though, as there's all kinds of events and defaults associated with an arbitrary execution and it's not clear if people would want those events/defaults replayed as well. It's one of those cases where making this automatic is 90% possible but would potentially be a total mess in 10% of the cases.

  3. Mike Bayer repo owner

    in fact in some cases the defaults for an INSERT do emit their own SQL so that would move the "replay" up to include all the events and defaults. having a "ping" that's distinct from the actual "first statement" is a lot easier, but perhaps this could be integrated into dialects (e.g. "ping").

  4. Barry Warsaw

    Thanks for the documentation references. I'm not sure the optimistic approach will work for us. I'm worried about the performance impact of the pessimistic approach, or disabling pooling. We should at least document this as a known issue in Mailman 3.1, recommending that if the database server is restarted, Mailman should be restarted.

  5. Mike Bayer repo owner

    @warsaw I have a feeling the "ping" approach is not that big a deal, keep in mind this is a per-transaction ping, not per statement. Assuming mailman does a bunch of SQL statements within one connection pool checkout, that is (which is the idiomatic style these days). All of openstack is using the aforementioned "ping" recipe. I'm starting to consider integrating a low-Python-overhead "ping" option into create_engine() at this point because it's true, "optimistic" means the application has to weather an exception which means having to retry an operation.

  6. Mike Bayer repo owner

    Integrate "pre-ping" into connection pool.

    Added native "pessimistic disconnection" handling to the :class:.Pool object. The new parameter :paramref:.Pool.pre_ping, available from the engine as :paramref:.create_engine.pool_pre_ping, applies an efficient form of the "pre-ping" recipe featured in the pooling documentation, which upon each connection check out, emits a simple statement, typically "SELECT 1", to test the connection for liveness. If the existing connection is no longer able to respond to commands, the connection is transparently recycled, and all other connections made prior to the current timestamp are invalidated.

    Change-Id: I89700d0075e60abd2250e54b9bd14daf03c71c00 Fixes: #3919

    → <<cset f881dae8179b>>

  7. Mark Sapiro reporter

    I have tested

    → <<cset f881dae8179b>>

    In my environment in Mailman 3 with pool_pre_ping=True added to the create_engine() call and once I reread http://docs.sqlalchemy.org/en/latest/core/pooling.html#disconnect-handling-pessimistic including the note It is critical to note that the pre-ping approach does not accommodate for connections dropped in the middle of transactions or other SQL operations. and adjusted my expectations and testing accordingly, I can say it's working as advertised. I was only concerned with connections that were lost during potentially long periods between transactions in the first place, so this fix is good.

    Many thanks to @zzzeek for implementing this.

  8. Mike Bayer repo owner

    great thanks for testing. early versions of SQLAlchemy were oriented around a "checkout for every statement" model, which is very obsolete so it's taken me some years to realize a simple ping at the top is not a big deal.

  9. Mike Bayer repo owner

    that document is all wrong anyway, the server kills the connection. the non-threadsafe use doesn't really cause that message anymore.

  10. Log in to comment