Recurring consistency issues with higher loads

Issue #274 on hold
Jyrki Pulliainen created an issue

Under fairly high write load, devpi-server seems to have some sort of threading bug that causes it to try to write the same serial number into the changelog entries.

We tried to limit the impact of this by first introducing locks around the changelog writer; however, the process could still end up in a bad state. This then leads to intermittent failures for the clients.

The error we are seeing is the following:

2015-09-24 02:24:55,331 ERROR Exception when serving /spotify/trusty-production
Traceback (most recent call last):
  File "/usr/share/python/devpi-server/local/lib/python2.7/site-packages/waitress/channel.py", line 336, in service
    task.service()
  File "/usr/share/python/devpi-server/local/lib/python2.7/site-packages/waitress/task.py", line 169, in service
    self.execute()
  File "/usr/share/python/devpi-server/local/lib/python2.7/site-packages/waitress/task.py", line 388, in execute
    app_iter = self.channel.server.application(env, start_response)
  File "/usr/share/python/devpi-server/local/lib/python2.7/site-packages/devpi_server/views.py", line 118, in __call__
    return self.app(environ, start_response)
  File "/usr/share/python/devpi-server/local/lib/python2.7/site-packages/pyramid/router.py", line 242, in __call__
    response = self.invoke_subrequest(request, use_tweens=True)
  File "/usr/share/python/devpi-server/local/lib/python2.7/site-packages/pyramid/router.py", line 217, in invoke_subrequest
    response = handle_request(request)
  File "/usr/share/python/devpi-server/local/lib/python2.7/site-packages/devpi_server/views.py", line 142, in request_log_handler
    response = handler(request)
  File "/usr/share/python/devpi-server/local/lib/python2.7/site-packages/devpi_server/views.py", line 177, in request_tx_handler
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/share/python/devpi-server/local/lib/python2.7/site-packages/devpi_server/keyfs.py", line 513, in transaction
    self.commit_transaction_in_thread()
  File "/usr/share/python/devpi-server/local/lib/python2.7/site-packages/devpi_server/keyfs.py", line 501, in commit_transaction_in_thread
    self._threadlocal.tx.commit()
  File "/usr/share/python/devpi-server/local/lib/python2.7/site-packages/devpi_server/keyfs.py", line 724, in commit
    commit_serial = fswriter.fs.next_serial
  File "/usr/share/python/devpi-server/local/lib/python2.7/site-packages/devpi_server/keyfs.py", line 201, in __exit__
    changed_keys, files_commit, files_del = self.commit_to_filesystem()
  File "/usr/share/python/devpi-server/local/lib/python2.7/site-packages/devpi_server/keyfs.py", line 231, in commit_to_filesystem
    write_changelog_entry(self.sqlconn, self.fs.next_serial, entry)
  File "/usr/share/python/devpi-server/local/lib/python2.7/site-packages/devpi_server/keyfs.py", line 161, in write_changelog_entry
    (serial, sqlite3.Binary(data)))
IntegrityError: UNIQUE constraint failed: changelog.serial
Exception when serving /spotify/trusty-production

The only way out of this is to restart the service, which keeps it running for maybe 12 hours under our load.

Comments (21)

  1. Jyrki Pulliainen reporter

    A very rudimentary patch is attached to work around the issue. It leaves control of the serial number to the database, making sure we properly update the serial number after inserting a new row into the SQLite database.
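
    Roughly, the idea is to let SQLite assign the serial through an INTEGER PRIMARY KEY column instead of tracking next_serial by hand. A minimal sketch of that approach (the table layout here is assumed for illustration and is not taken from the actual patch):

        # Hypothetical sketch: let the database assign the changelog serial.
        import sqlite3

        conn = sqlite3.connect("changelog.sqlite")
        conn.execute(
            "CREATE TABLE IF NOT EXISTS changelog ("
            " serial INTEGER PRIMARY KEY,"  # SQLite assigns the next value itself
            " data BLOB NOT NULL)")

        def write_changelog_entry(conn, data):
            # Omitting the serial lets SQLite pick the next unique value, so two
            # concurrent writers cannot collide on the same serial.
            cur = conn.execute("INSERT INTO changelog (data) VALUES (?)",
                               (sqlite3.Binary(data),))
            conn.commit()
            return cur.lastrowid  # the serial actually assigned to this entry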

  2. Holger Krekel repo owner

    It seems we are doing something wrong with respect to restarting transactions, which happens when, while serving simple pages, we discover we need to perform writes. If two requests happen in this mode, the second one will use a serial that is already used by the first, which should result in the exception you are seeing. I need to think a bit about how to best fix the issue. Your patch might be a possible workaround, but it's a bit by chance if it is.

  3. Holger Krekel repo owner

    I am not totally sure about the root problem, though. From your log files, does it look like the problem is related to serving simple pages?

  4. Jyrki Pulliainen reporter

    It also shows up when serving actual files (which is where I guess the transactions happen). I don't know why it pops up when serving simple pages.

    As I have not gone through how the logging in devpi works, it might be that one of the previous changes we made to mitigate the transaction issues caused us to lose some log messages, or to get them in the wrong order. That fix is documented in #267. I have not looked into what 'thread_log_pop' ends up doing, as it now gets called in every case (even on failures).

  5. Holger Krekel repo owner

    The function I am concerned with is "keyfs.restart_as_write_transaction", which is used from various places, including the views.py pkgserv() function. The way it's implemented, it uses its current read serial as the write serial -- but other concurrent write requests might already have used that serial.
    (Actually I am still figuring things out -- maybe my analysis is not correct yet.)
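
    To illustrate the suspected pattern, here is a toy sketch (not devpi's actual keyfs code; the changelog table layout is assumed): a serial chosen at transaction start and reused at commit time can collide, while allocating it under a lock at commit time cannot.

        # Toy sketch of the suspected race; assumes a table like
        # changelog(serial INTEGER PRIMARY KEY, data BLOB).
        import sqlite3
        import threading

        write_lock = threading.Lock()

        def next_serial(conn):
            # The serial the next commit should use, derived from what is
            # currently in the database.
            (max_serial,) = conn.execute(
                "SELECT COALESCE(MAX(serial), -1) FROM changelog").fetchone()
            return max_serial + 1

        def commit_racy(conn, serial_from_transaction_start, data):
            # Reuses a serial chosen when the (read) transaction started.  If
            # another request committed in between, this raises
            # "IntegrityError: UNIQUE constraint failed: changelog.serial".
            conn.execute("INSERT INTO changelog (serial, data) VALUES (?, ?)",
                         (serial_from_transaction_start, sqlite3.Binary(data)))
            conn.commit()

        def commit_safe(conn, data):
            # Allocates the serial only at commit time, under a lock, so it
            # reflects any writes that happened since the request started.
            with write_lock:
                serial = next_serial(conn)
                conn.execute("INSERT INTO changelog (serial, data) VALUES (?, ?)",
                             (serial, sqlite3.Binary(data)))
                conn.commit()
            return serial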

  6. Holger Krekel repo owner

    FWIW, I don't think your patch in #267 loses logging. push/pop_log are only used for the prefixing bit of logging.

  7. Jyrki Pulliainen reporter

    Alright, that's good to know.

    I'll look into cooking this up into a proper pull request and hope to find a test to expose the bug :)

  8. Holger Krekel repo owner

    Are you on IRC on freenode by chance? If you run in --debug mode, do you see "restarting afresh as write transaction" messages?

  9. Holger Krekel repo owner

    I am actually not sure anymore that I know how the situation can arise. It seems that restart_as_write_transaction works, contrary to my earlier impression.

    Also, looking more closely, I can't match your traceback to the last couple of devpi-server versions IISIC. Do you have custom changes in your devpi-server deployment?

    FWIW, I just wrote a little stress test which gets simple pages and a file from them in 10 concurrent threads, and it all works and serializes OK.
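
    Such a stress test could look roughly like this (an illustrative sketch, not the actual test; it assumes a devpi-server on the default port 3141 with a root/pypi mirror index and uses the requests library):

        # Illustrative stress test: fetch a simple page and one file link from
        # it in 10 concurrent threads; BASE, INDEX and PKG are assumptions.
        import concurrent.futures
        import re

        import requests

        BASE = "http://localhost:3141"
        INDEX = "root/pypi"
        PKG = "pytest"

        def fetch_simple_and_file(_):
            # Fetching a file from a mirror index is what can trigger a write
            # transaction on the server side, where a serial race would show up.
            simple = requests.get("%s/%s/+simple/%s/" % (BASE, INDEX, PKG))
            simple.raise_for_status()
            links = re.findall(r'href="([^"]+)"', simple.text)
            if links:
                file_url = requests.compat.urljoin(simple.url, links[0])
                requests.get(file_url).raise_for_status()
            return simple.status_code

        with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
            codes = list(pool.map(fetch_simple_and_file, range(50)))
        print("%d of %d requests returned 200" % (codes.count(200), len(codes)))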

    So I am leaving it as is until you provide more information.

  10. Jyrki Pulliainen reporter

    So for some more information:

    Currently we run 2.1 with some custom patches on top of it. Nothing major, mainly the one mentioned in #267. I've enabled debug logging to see if I can spot any anomalies on that host. We could also upgrade to 2.3.1, but that would need some planned downtime, so it can't just be done as a quick test.

    Our load is also significantly more than 10 concurrent clients.

  11. Holger Krekel repo owner

    I am afraid this issue is getting a little out of scope for me to work on in unpaid time. You are using an older version with patches, and I can't reproduce the problem. If you can at least give me something with which I can reproduce it, I can look again.

  12. Holger Krekel repo owner

    Are you by chance using a network file system and have multiple devpi-server instances?

  13. Jyrki Pulliainen reporter

    Sorry for the lack of responses, it has been a bit busy here too.

    And no, we don't have multiple processes, and there is no NFS or anything similar, just a plain local disk.

  14. Jyrki Pulliainen reporter

    The underlying 'broken' code still exists in devpi. I just checked 4.0 (since we need to upgrade to fix issues with pip) and one can see the broken bits here: https://bitbucket.org/hpk42/devpi/src/5ab22f0228fed7fe24f9a40029a83bf6462ae5f8/server/devpi_server/keyfs_sqlite.py?at=devpi-4.0.0&fileviewer=file-view-default#keyfs_sqlite.py-53

    It still does not honor the database's incremental counter, which might lead to funny races when trying to insert twice with the same serial. Now, it may be that this is no longer an issue (I have not tried, though I don't see why it would not fail there), but I can't really understand why not to trust the database to do its job instead of trying to do it by hand.

  15. Florian Schulze

    We made some more improvements related to concurrency in 4.1.0.

    Using the db-provided id can leave gaps (at least in PostgreSQL) if a transaction is rolled back, so we would have to handle that case if we go down that route.

    Another option would be to retry the transaction on conflict with the next serial.
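
    A rough sketch of that retry option (illustrative only, not the devpi implementation; it assumes a SQLite changelog(serial, data) table with a unique serial column):

        # Illustrative retry-on-conflict sketch; the table layout is an assumption.
        import sqlite3

        def write_with_retry(conn, data, max_attempts=5):
            for _ in range(max_attempts):
                (serial,) = conn.execute(
                    "SELECT COALESCE(MAX(serial), -1) + 1 FROM changelog").fetchone()
                try:
                    conn.execute(
                        "INSERT INTO changelog (serial, data) VALUES (?, ?)",
                        (serial, sqlite3.Binary(data)))
                    conn.commit()
                    return serial
                except sqlite3.IntegrityError:
                    # Another writer took this serial first; undo and retry with
                    # the next one.
                    conn.rollback()
            raise RuntimeError("could not commit changelog entry after %d attempts"
                               % max_attempts)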

    Curious why you get so many concurrent writes. Automated CI or just many users?

  16. Jyrki Pulliainen reporter

    It is actually a combination of quite a lot of users, CI, and automated things like Puppet runs across a fairly large fleet of machines (though since they only pull content, I don't think they affect the transactions).

    So far we have managed with a single devpi instance at Spotify btw, which is pretty neat, and it has improved the dev experience quite a lot!

  17. Florian Schulze

    Just installing stuff can cause transactions if new versions are found on mirrors like PyPI.

    Do you have more info on how often this happens and how badly it affects you?

    Also, pip recently added caching for simple package index pages; do you see a difference since then?

    Do you have a webserver in front of devpi-server to handle the actual files statically?
