a gevent-killed greenlet causes pymysql (and mysqlconnector?) to continue on the wrong / a broken connection

Issue #3258 resolved
lxyu created an issue

The same error reported in https://bitbucket.org/zzzeek/sqlalchemy/issue/3043/error-file-descriptor-was-closed-in is back again.

Currently the most stable version is 0.9.3, which only produces a small number of AttributeErrors; later versions produce different exceptions, and especially more exceptions since 0.9.5.

I tested python-mysql-connector + gevent + sqlalchemy too and the problem is the same, so I assume the bug is on the SQLAlchemy side.

The test was done with your benchmark script; both the test script and the produced log are pasted here:

https://gist.github.com/lxyu/7cbd6cbe83b47e581c8d

The bug is very serious since it may cause strange bugs in MySQL transactions; we have captured situations where only part of a transaction succeeded on our prod servers.

Comments (24)

  1. Mike Bayer repo owner

    Let's consider that all the exception logs illustrated in your gist are produced by this log line: https://bitbucket.org/zzzeek/sqlalchemy/commits/85d1899b76a3#Llib/sqlalchemy/pool.pyT563, which note, is the log line added in 0.9.4 as part of the fix. So that is the very simple reason you don't see these exceptions in 0.9.3; they simply aren't reported.

    We can illustrate further. Check out 0.9.3, and apply this patch:

    diff --git a/lib/sqlalchemy/pool.py b/lib/sqlalchemy/pool.py
    index af9b8fc..e9a03d3 100644
    --- a/lib/sqlalchemy/pool.py
    +++ b/lib/sqlalchemy/pool.py
    @@ -539,6 +539,8 @@ def _finalize_fairy(connection, connection_record, pool, ref, echo, fairy=None):
                 if not connection_record:
                     pool._close_connection(connection)
             except Exception as e:
    +            pool.logger.error(
    +                "Exception during reset or similar", exc_info=True)
                 if connection_record:
                     connection_record.invalidate(e=e)
                 if isinstance(e, (SystemExit, KeyboardInterrupt)):
    

    and then we see the exact same exception, though it's much harder to get to because the pool itself more often than not fails much more seriously with "too many connections" (due to #2985 not being fixed in 0.9.3), ending the program. But here it is:

    ERROR:sqlalchemy.pool.QueuePool:Exception during reset or similar
    Traceback (most recent call last):
      File "/Users/classic/dev/sqlalchemy/lib/sqlalchemy/pool.py", line 536, in _finalize_fairy
        fairy._reset(pool, echo)
      File "/Users/classic/dev/sqlalchemy/lib/sqlalchemy/pool.py", line 666, in _reset
        self._reset_agent.rollback()
      File "/Users/classic/dev/sqlalchemy/lib/sqlalchemy/engine/base.py", line 1226, in rollback
        self._do_rollback()
      File "/Users/classic/dev/sqlalchemy/lib/sqlalchemy/engine/base.py", line 1264, in _do_rollback
        self.connection._rollback_impl()
      File "/Users/classic/dev/sqlalchemy/lib/sqlalchemy/engine/base.py", line 507, in _rollback_impl
        if self.connection._reset_agent is self.__transaction:
      File "/Users/classic/dev/sqlalchemy/lib/sqlalchemy/engine/base.py", line 232, in connection
        return self._revalidate_connection()
      File "/Users/classic/dev/sqlalchemy/lib/sqlalchemy/engine/base.py", line 238, in _revalidate_connection
        "Can't reconnect until invalid "
    InvalidRequestError: Can't reconnect until invalid transaction is rolled back
    

    This is nothing more than the database connection being killed right in the middle of the pool doing its cleanup work. The condition is detected, it's logged, and the connection is correctly invalidated afterwards; the program is allowed to continue without issue.

    we have captured situations where only part of a transaction succeeded on our prod servers.

    If only "part" of a transaction is succeeding, that strongly suggests you are not using InnoDB tables, and are instead using MyISAM. When using MyISAM, there are no actual transactions. Only when using InnoDB do you get correct ACID guarantees. When using a relational database with ACID guarantees, assuming it works correctly, it is impossible for only "part" of a transaction to succeed; the client drivers and application code have no ability whatsoever to change this behavior. This is one of the key rationales to the ACID system; the protection against such a failure is built into the database.

    Please ensure your database is using all InnoDB, because there's absolutely no action to take elsewhere, short of filing/searching for bugs against MySQL server, if you are observing correctly invoked database transactions being partially recorded to permanent storage.
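
    A quick way to verify this, as a minimal sketch (the connection URL and the schema name "test" here are placeholder assumptions):

    # Sketch: list the storage engine of every table in the schema; each one
    # should report 'InnoDB'. URL and schema name are placeholders.
    from sqlalchemy import create_engine

    engine = create_engine("mysql+pymysql://user:pw@localhost/test")
    with engine.connect() as conn:
        rows = conn.execute(
            "SELECT table_name, engine FROM information_schema.tables "
            "WHERE table_schema = 'test'"
        )
        for table_name, storage_engine in rows:
            print("%s: %s" % (table_name, storage_engine))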

    So far I'm not seeing an issue on the client and/or SQLAlchemy side yet. However, these are indeed issues to take very seriously, so please try to provide more detail, thanks.

  2. lxyu reporter

    I'm sure we're using InnoDB and surely the problem isn't on the database side.

    There are strange errors: InterfaceError("(0, '')"), OperationalError: (OperationalError) (2014, 'Command Out of Sync'), and RuntimeError: reentrant call inside ..

    I think this is more complicated than just "doing its cleanup work".

    About only part of a transaction succeeding, a typical case is:

    1. insert a line into table_a, call session.flush()
    2. insert a line into table_b with previous table_a's id, session.commit()

    With this function we detected some cases where there is a record in table_a but not in table_b, which should be impossible but did happen. I only saw the errors when I upgraded SQLAlchemy to 0.9.8 recently. I have downgraded back to 0.9.3 now and the errors disappeared.
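
    In code, the pattern is roughly this (a minimal self-contained sketch; the model definitions, URL, and column types are illustrative assumptions, not our actual prod code):

    from sqlalchemy import create_engine, Column, Integer, String
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import sessionmaker

    Base = declarative_base()

    class TableA(Base):
        __tablename__ = "table_a"
        id = Column(Integer, primary_key=True)
        data = Column(String(32))

    class TableB(Base):
        __tablename__ = "table_b"
        id = Column(Integer, primary_key=True)
        a_id = Column(Integer)
        data = Column(String(32))

    engine = create_engine("mysql+pymysql://user:pw@localhost/test")
    Base.metadata.create_all(engine)
    session = sessionmaker(bind=engine)()

    a = TableA(data="a")
    session.add(a)
    session.flush()                  # step 1: INSERT into table_a; a.id is populated here

    b = TableB(a_id=a.id, data="b")  # step 2: child row uses the id obtained from the flush
    session.add(b)
    session.commit()                 # both rows should become visible together (InnoDB/ACID)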

    I currently don't have a clear picture of how it happened; I can only guess that it may have something to do with connection disorder or reuse, since there are many reentrant errors.

    Here's a guess:

    greenlet_a is running a function with connection_a, but greenlet_a somehow gets killed because of a timeout, right in the middle of steps 1 and 2; the MySQL server doesn't know it, and the connection somehow gets used by another greenlet_b, which runs successfully and issues a commit on that connection. Boom.

    One difference between your benchmark script and this guess is that the connection is not killed/closed on the server or the client; rather, the greenlet holding the connection is killed on the client side.
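
    To illustrate the mechanism I'm guessing at: gevent.Timeout raises inside the running greenlet and abandons whatever it was doing at that instant (a minimal sketch, not tied to any database; the sleep stands in for a blocking driver call on a monkey-patched socket):

    import gevent

    try:
        with gevent.Timeout(0.1):    # arms a timer for the current greenlet
            gevent.sleep(1.0)        # stands in for a blocking DB call; cut off after 0.1s
    except gevent.Timeout:
        # the operation was abandoned mid-flight; the server side knows nothing about it
        print("greenlet interrupted")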

    It seems this only happens under heavy load. I'll try to get a reproduction script later.

  3. Mike Bayer repo owner

    The test case that we refer to in #3043 and the one that you ran in https://gist.github.com/lxyu/7cbd6cbe83b47e581c8d simulates the failure condition as emitting the MySQL KILL upon the connection itself. This terminates the connection, so it is not possible for the client to emit any further statements, and it's clear that it is not possible for any transaction prior to the KILL to be committed subsequent to that event, unless the database server itself is turning this KILL into a COMMIT, which would be a severe server error. The client has no ability to communicate with the server subsequent to this KILL.
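
    For reference, the KILL simulation in that test amounts to something like this (a sketch using raw pymysql; host/user/password are placeholder assumptions):

    import pymysql

    victim = pymysql.connect(host="localhost", user="user", passwd="pw", db="test")
    killer = pymysql.connect(host="localhost", user="user", passwd="pw", db="test")

    cur = victim.cursor()
    cur.execute("SELECT CONNECTION_ID()")
    victim_id = cur.fetchone()[0]
    cur.close()

    kcur = killer.cursor()
    kcur.execute("KILL %d" % victim_id)   # the server terminates the victim connection
    kcur.close()

    # any further statement on `victim` now fails; the transaction it had open
    # was rolled back by the server and can no longer be committed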

    Additionally, all stack traces at https://gist.github.com/lxyu/7cbd6cbe83b47e581c8d (in the first block, illustrated against 0.9.8) do not illustrate the errors at #3043; none of these exceptions are reentrant within the invalidation process, which was the issue repaired in #3043, and additionally none of them "corrupt" the connection pool such that the pool is now loaded with invalid connections, as was also the issue in #3043. All of them illustrate the pool trying to call connection.rollback() as part of the normal process of a connection being returned to the pool; in the test script this is from conn.close(). This ROLLBACK is then being chopped off in the middle as the KILL instruction elsewhere is cancelling it. As this occurs within the close() process, the pool doesn't propagate the error as it does with all other kinds of errors; it logs it and invalidates the connection appropriately, which is also why the test script only illustrates these errors and no others. All the rest are not logged unless you uncomment the traceback.print_exc() line, in which case you'll see tons more stack traces.
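
    In other words, the failing ROLLBACK is just the pool's standard reset-on-return step, roughly this sequence (a simplified sketch, not the actual pool internals; the URL is a placeholder):

    from sqlalchemy import create_engine

    engine = create_engine("mysql+pymysql://user:pw@localhost/test")

    conn = engine.connect()     # DBAPI connection checked out of the QueuePool
    conn.execute("SELECT 1")
    conn.close()                # returned to the pool; the pool then emits ROLLBACK as its
                                # reset-on-return step, and that is the statement the
                                # concurrent KILL chops off mid-flight in the test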

    This connection upon which the ROLLBACK is failing is no longer in communication with the server, hence you see only two well known networking errors, "InterfaceError("(0, '')")" and "OperationalError: (OperationalError) (2014, 'Command Out of Sync')". The error "RuntimeError: reentrant call inside .." ~is not reported under 0.9.8, that's in your test against 0.9.4~ [is a PyMySQL bug, and is probably the source of your issue involving pymysql, see my next comment], where there are much more serious errors which were repaired as part of #3043 for 0.9.5. In 0.9.4 the connection that was dead was attempted to be used again, so the MySQL drivers, which have very poor bookkeeping in regard to connection state, act that much more wildly.

    edited: the issue is probably this pymysql bug.

  4. Mike Bayer repo owner

    OK, there's something very interesting over on the PyMySQL list, which seems to be your issue.

    There are two instances of "reentrant call inside <_io.BufferedReader name=48>" in your test case under 0.9.5 as well. However, this is still just the rollback() process unable to proceed along a TCP connection that is dead, and as stated earlier, it seems clear PyMySQL isn't handling it gracefully.

    For that specific issue, here's the PyMySQL bug partially fixed on their end. "This exc leads to a serious consequence: a transaction is incompletely committed, only part of modified is committed", which they are claiming was fixed in 0.6.2. So perhaps PyMySQL is doing something very wrong at the moment the connection is being dropped (it's obviously not a simple KILL command, that's just our test here), such that transactional state is committed. Has to be, since if the connection is truly killed off, there's no more transaction to commit (I'd note that if this is in fact possible, it is definitely also a bug in MySQL server itself).

    I recommend you upgrade your PyMySQL, or better yet just use MySQL-connector-Python, which is now MySQL's official driver.

  5. lxyu reporter

    Hi, I finally got a stable reproduction script: https://gist.github.com/lxyu/4e14021e4bae7bc26da7 . I hope it helps.

    All software is at the latest stable versions installed by pip.

    In this script, we run the following process:

    1. generate a row in table_a
    2. session.flush()
    3. generate a row in table_b with table_a's id
    4. session.commit() / session.rollback()

    So all rows in table_b should have one matching row in table_a. In the test script, I reproduce the case where table_b.a_id does not exist in table_a.


    OK, let's look at the results; there are 3 abnormal things (checks for each are sketched after this list):

    • a. there are many rows with table_b.a_id == 0, which means that even after session.flush(), the script still doesn't have a_id.

    • b. there are many rows whose table_b.a_id does not exist in table_a; this is obviously the exact "transaction partially succeeds" bug.

    • c. there are many rows in table_a whose table_a.id does not exist in table_b.a_id.
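
    Queries along these lines can be used to check each situation (a sketch; it assumes the table_a / table_b layout described above and a placeholder URL):

    from sqlalchemy import create_engine

    engine = create_engine("mysql+pymysql://user:pw@localhost/test")
    with engine.connect() as conn:
        # a. rows where flush() apparently never supplied an id
        count_a = conn.scalar("SELECT COUNT(*) FROM table_b WHERE a_id = 0")

        # b. orphans: table_b rows whose parent row is missing ("partial commit")
        count_b = conn.scalar(
            "SELECT COUNT(*) FROM table_b b LEFT JOIN table_a a ON a.id = b.a_id "
            "WHERE b.a_id != 0 AND a.id IS NULL"
        )

        # c. table_a rows with no matching child in table_b
        count_c = conn.scalar(
            "SELECT COUNT(*) FROM table_a a LEFT JOIN table_b b ON b.a_id = a.id "
            "WHERE b.id IS NULL"
        )
        print("%s %s %s" % (count_a, count_b, count_c))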

    I have tested both pymysql and mysqlconnector with sqlalchemy (0.9.3 / 0.9.8); more tests show:

    1. pymysql + sqlalchemy shows lots of situations a, b and c; mysqlconnector + sqlalchemy shows lots of situations a and c. So is situation a expected to happen, or is it a bug? I don't get why a_id is not available even after flush().

    2. @maralla has provided a solution in the comments: with ping_connection on every pool checkout, situations a and b are totally gone, and situation c still exists but much less often.

    Is there anything SQLAlchemy can do, or do I have to post the script on the pymysql side?

  6. Mike Bayer repo owner

    I notice there's no InnoDB on these tables. Also, why is there no FOREIGN KEY on table_b.a_id? That should be raising right there, and would give you a stack trace showing exactly where in Python the invalid commit is occurring, if it is in fact in Python.
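
    For example, declaring it on the model (a sketch; the column types and the InnoDB table arg are assumptions, with the roughly equivalent DDL noted in a comment):

    from sqlalchemy import Column, ForeignKey, Integer, String
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class TableB(Base):
        __tablename__ = "table_b"
        __table_args__ = {"mysql_engine": "InnoDB"}       # FKs require InnoDB
        id = Column(Integer, primary_key=True)
        a_id = Column(Integer, ForeignKey("table_a.id"))  # server now rejects a child row
        data = Column(String(32))                         # whose parent row is missing

    # roughly equivalent DDL for an existing table:
    #   ALTER TABLE table_b ADD FOREIGN KEY (a_id) REFERENCES table_a (id);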

  7. Mike Bayer repo owner

    What happens if you don't use ScopedSession? Perhaps whatever you're doing that kills gevent here is breaking its threading.local() patch. ScopedSession with gevent + the default threading.local() seems like a very bad idea to me. Turn that off. Problem solved? Then it's gevent.
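
    If a registry-style session is still wanted under gevent, one option is to scope it to the current greenlet explicitly rather than relying on the patched threading.local (a sketch; the engine URL is a placeholder):

    import gevent
    from sqlalchemy import create_engine
    from sqlalchemy.orm import scoped_session, sessionmaker

    engine = create_engine("mysql+pymysql://user:pw@localhost/test")
    Session = scoped_session(sessionmaker(bind=engine), scopefunc=gevent.getcurrent)

    session = Session()      # each greenlet gets its own session from the registry
    try:
        pass                 # do work with `session` here
    finally:
        Session.remove()     # discard the greenlet's session when done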

  8. lxyu reporter
    1. the database was configured to use InnoDB as the default; I double checked and the tables created were InnoDB.

    2. all packages are installed at the latest stable version.

      $ pip freeze
      PyMySQL==0.6.2
      SQLAlchemy==0.9.8
      mysql-connector-python==2.0.2
      gevent==1.0.1

    3. mysql connector also has situations a and c as described above, but doesn't have situation b, which is the worst case.

    4. turning off the scoped session doesn't help. And I assume gevent shouldn't break threading.local; that should be exactly what monkey_patch is trying to do: patch threading (see the sketch after this list).

    5. the foreign key on the MySQL side was deliberately turned off to reduce the performance impact on MySQL, since the database does not scale well under heavy load. In our system we don't use joins and only deal with foreign keys on the application side.
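
    For completeness, the usual gevent setup order is sketched below; the key requirement is that monkey.patch_all() runs before anything that captures threading primitives is imported (a sketch of standard gevent usage, not our exact app code):

    from gevent import monkey
    monkey.patch_all()   # must run first, before other imports capture the real threading

    import threading                                           # now the patched module
    from sqlalchemy.orm import scoped_session, sessionmaker    # its default threading.local
                                                               # should now be greenlet-local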

  9. lxyu reporter

    Quote from the connection ping doc:

    This recipe will ensure that a new Connection will succeed even if connections in the pool have gone stale, provided that the database server is actually running.

    When this code is added, the connection errors are greatly reduced, so I'm wondering whether the connection can be detected as stale even without this event hook?
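
    For reference, the checkout-ping recipe being quoted is roughly the following (as given in the pooling docs of that era):

    from sqlalchemy import exc, event
    from sqlalchemy.pool import Pool

    @event.listens_for(Pool, "checkout")
    def ping_connection(dbapi_connection, connection_record, connection_proxy):
        cursor = dbapi_connection.cursor()
        try:
            cursor.execute("SELECT 1")
        except:
            # the pool discards this connection and retries with a fresh one
            raise exc.DisconnectionError()
        cursor.close()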

  10. Mike Bayer repo owner

    the bug is in the driver and/or gevent. Here is the equivalent program, using only pymysql: https://gist.github.com/zzzeek/947337f44d70fac537be.

    By turning on the foreign key, it is easy to see where it happens. The commit() is proceeding on some connection where the A row does not exist:

    
    ERROR:foo:error
    Traceback (most recent call last):
      File "test2.py", line 106, in transaction_kill_worker
        connection.commit()
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pymysql/connections.py", line 674, in commit
        self._read_ok_packet()
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pymysql/connections.py", line 658, in _read_ok_packet
        pkt = self._read_packet()
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pymysql/connections.py", line 826, in _read_packet
        packet.check_error()
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pymysql/connections.py", line 370, in check_error
        raise_mysql_exception(self._data)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pymysql/err.py", line 116, in raise_mysql_exception
        _check_mysql_exception(errinfo)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pymysql/err.py", line 109, in _check_mysql_exception
        raise errorclass(errno,errorvalue)
    IntegrityError: (1452, u'Cannot add or update a child row: a foreign key constraint fails (`test`.`table_b`, CONSTRAINT `table_b_ibfk_1` FOREIGN KEY (`a_id`) REFERENCES `table_a` (`id`))')
    

    We even see this happening, less frequently, when we're calling rollback(); I don't exactly understand what it means to see this, but here it is:

    
    ERROR:foo:Exception during rollback
    Traceback (most recent call last):
      File "test2.py", line 40, in return_conn
        conn.rollback()
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pymysql/connections.py", line 679, in rollback
        self._read_ok_packet()
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pymysql/connections.py", line 658, in _read_ok_packet
        pkt = self._read_packet()
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pymysql/connections.py", line 826, in _read_packet
        packet.check_error()
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pymysql/connections.py", line 370, in check_error
        raise_mysql_exception(self._data)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pymysql/err.py", line 116, in raise_mysql_exception
        _check_mysql_exception(errinfo)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pymysql/err.py", line 109, in _check_mysql_exception
        raise errorclass(errno,errorvalue)
    IntegrityError: (1452, u'Cannot add or update a child row: a foreign key constraint fails (`test`.`table_b`, CONSTRAINT `table_b_ibfk_1` FOREIGN KEY (`a_id`) REFERENCES `table_a` (`id`))')
    

    the reason this is happening is that gevent is yanking away the ability of pymysql to correctly continue its communication with the socket. pymysql is then getting confused in some catastrophic way.

    I've tried a bunch of ways to anticipate when this might happen and to react sooner before anything further is called upon the connection. As the commit() here seems to do something even though the connection is in some invalid state, adding an event hook to check connection_id() and then quickly invalidate seems to alleviate the issue:

    from sqlalchemy import event

    @event.listens_for(engine, "commit")
    def receive_commit(conn):
        # just before the COMMIT is emitted, verify the connection still answers
        try:
            conn.scalar("select connection_id()")
        except:
            conn.invalidate()
            raise Exception("no connection id!  dead connection!")
    

    but from everything I can see, the issue is exactly the one I said all the way at the top was so unlikely, and definitely not a SQLAlchemy issue, unless you have suggestions for what https://gist.github.com/zzzeek/947337f44d70fac537be is doing incorrectly. The driver is committing data on the wrong connection.

  11. Mike Bayer repo owner

    I think the issue is well-understood now. pymysql's (and apparently mysql-connector's) "rollback" just doesn't work very well in this state.

    a solution that's equivalent to the one we established at the end, for your script, is:

    def transaction_kill_worker():
        while True:
            try:
                with gevent.Timeout(0.1):
                    session = Session()
                    connection = session.connection()
                    s_a = TableA(data="a")
                    session.add(s_a)
                    session.flush()
                    s_b = TableB(a_id=s_a.id, data="b")
                    session.add(s_b)
                    gevent.sleep(random.random() * 0.2)
    
                    try:
                        session.commit()
                    except Exception:
                        # logger.exception(e)
                        session.rollback()
                        sys.stdout.write("$")
            except gevent.Timeout:
                connection.invalidate()
                sys.stdout.write("#")
            except Exception:
                # logger.exception(e)
                sys.stdout.write("@")
            else:
                sys.stdout.write(".")
            finally:
                try:
                    Session().close()
                except:
                    pass
    

    that is, we just call connection.invalidate() when the timeout occurs, without attempting to emit a rollback(). The Session unfortunately doesn't make it easy to get at this connection once an error has occurred (it's in there, but not with a public API), so for now I'm just pulling it out ahead of time in case we need to hit it.

    the approach with my event handler above also works.

    it's pretty clear that the MySQL drivers here aren't cleaning up properly in the face of the gevent timeout.

  12. Mike Bayer repo owner
    • Added new method :meth:.Session.invalidate, functions similarly to :meth:.Session.close, except also calls :meth:.Connection.invalidate on all connections, guaranteeing that they will not be returned to the connection pool. This is useful in situations e.g. dealing with gevent timeouts when it is not safe to use the connection further, even for rollbacks. references #3258

    → <<cset cf7981f60d48>>

  13. Mike Bayer repo owner
    • Added new method :meth:.Session.invalidate, functions similarly to :meth:.Session.close, except also calls :meth:.Connection.invalidate on all connections, guaranteeing that they will not be returned to the connection pool. This is useful in situations e.g. dealing with gevent timeouts when it is not safe to use the connection further, even for rollbacks. references #3258

    (cherry picked from commit cf7981f60d485f17465f44c6ff651ae283ade377)

    → <<cset c7de21b92cc0>>

  14. Mike Bayer repo owner

    per the above changeset, this will make it easier to invalidate the session's connections totally:

    def transaction_kill_worker():
        while True:
            try:
                session = None
                with gevent.Timeout(0.1):
                    session = Session()
                    s_a = TableA(data="a")
                    session.add(s_a)
                    session.flush()
                    s_b = TableB(a_id=s_a.id, data="b")
                    session.add(s_b)
                    gevent.sleep(random.random() * 0.2)
    
                    try:
                        session.commit()
                    except Exception:
                        session.rollback()
            except gevent.Timeout:
                session.invalidate()
            finally:
                session.close()
    
  15. lxyu reporter

    So this will invalidate all connections in the connection pool? Is it possible to only invalidate the current connection used by the greenlet (for example, by making the related API public)?

    When gevent is used, there are many greenlets within the same process, so if all connections are invalidated on every timeout, it may have some impact on performance.

  16. Mike Bayer repo owner

    So this will invalidate all connections in the connection pool?

    no. only the connection associated with the session. (the ORM Session; note the ORM Session can be associated with multiple engines/connections at one time, though not typically).

    Is it possible to only invalidate the current connection used by the greenlet (for example public the related API)?

    the connection used by the session that you're calling upon, that is what this does.

    When gevent is used, there are many greenlets within the same process, so if all connections are invalidated on every timeout, it may have some impact on performance.

    in fact I'm not even thrilled that a gevent timeout causes even one connection to be bad - it means an attacker could vastly slow down your site by sending a flood of interrupting HTTP requests. this is a serious issue in the gevent/MySQL system.
