Intermittent AssertionError in devpi server 2.2.2

Issue #267 on hold
Jason R. Coombs created an issue

We recently updated our devpi server from 2.2.0 to 2.2.2. After doing so, we started encountering errors like the following:

2015-08-12 14:57:14,486 ERROR Exception when serving /root/pypi/+f/9ab/da702fcb41e47/Sphinx-1.3.1-py2.py3-none-any.whl
Traceback (most recent call last):
  File "/app/.heroku/venv/lib/python3.4/site-packages/waitress/channel.py", line 337, in service
    task.service()
  File "/app/.heroku/venv/lib/python3.4/site-packages/waitress/task.py", line 173, in service
    self.execute()
  File "/app/.heroku/venv/lib/python3.4/site-packages/waitress/task.py", line 392, in execute
    app_iter = self.channel.server.application(env, start_response)
  File "/app/.heroku/venv/lib/python3.4/site-packages/devpi_server/views.py", line 121, in __call__
    return self.app(environ, start_response)
  File "/app/.heroku/venv/lib/python3.4/site-packages/pyramid/router.py", line 242, in __call__
    response = self.invoke_subrequest(request, use_tweens=True)
  File "/app/.heroku/venv/lib/python3.4/site-packages/pyramid/router.py", line 217, in invoke_subrequest
    response = handle_request(request)
  File "/app/.heroku/venv/lib/python3.4/site-packages/devpi_server/views.py", line 144, in request_log_handler
    response = handler(request)
  File "/app/.heroku/venv/lib/python3.4/site-packages/devpi_server/views.py", line 177, in request_tx_handler
    with keyfs.transaction(write=write) as tx:
  File "/usr/lib/python3.4/contextlib.py", line 59, in __enter__
    return next(self.gen)
  File "/app/.heroku/venv/lib/python3.4/site-packages/devpi_server/keyfs.py", line 523, in transaction
    tx = self.begin_transaction_in_thread(write=write, at_serial=at_serial)
  File "/app/.heroku/venv/lib/python3.4/site-packages/devpi_server/keyfs.py", line 488, in begin_transaction_in_thread
    assert not hasattr(self._threadlocal, "tx")
AssertionError

There seemed to be no rhyme or reason to the errors. We rolled back to 2.2.0 and the errors stopped. Recommendations?

Comments (22)

  1. Jyrki Pulliainen

    I'm seeing the exact same errors, though running 2.1.0, not 2.2.0. Restarting seems to help for some time, but the errors come back eventually.

    No clear pattern of which package is causing it.

  2. Jyrki Pulliainen

    This does not seem to correlate with any particular package; in fact, it does not correlate with any package URL at all. Even a regular index fetch or a request for the devpi-server front page can result in a 500.

  3. Jyrki Pulliainen

    Is it normal for request IDs to pile up like this on a single log line? The logging format is somewhat confusing here.

    2015-09-07 07:54:41,152 INFO  [req6829] [req6879] [req9779] [req33431] [req96477] [req134427] [req158327] [req354003] [Wtx12811] [req354053] [req354101] [req354151] [req354
    201] [req354250] [req354300] [req354350] [req354399] [req354449] [req354499] [req354549] [req354599] [req354649] [req354699] [req354749] [req354799] [req354849] [req354899]
     [req354949] [req354999] [req355049] [req355099] [req355149] [req355199] [req355249] [req355299] [req355349] [req355399] [req355449] [req355499] [req355549] [req355599] [re
    q355649] [req355699] [req355749] [req355799] [req355849] [req355899] [req355949] [req355999] [req356049] [req356099] [req356149] [req356199] [req356249] [req356299] [req356
    349] [req356399] [req356449] [req356499] [req356549] [req356599] [req356649] [req356699] [req356749] [req356799] [req356849] [req356899] [req356949] [req356999] [req357049]
     [req357099] [req357149] [req357199] [req357249] [req357299] [req357349] [req357399] [req357449] [req357499] [req357549] [req357599] [req357649] [req357699] [req357749] [re
    q357799] [req357849] [req357899] [req357949] [req357999] [req358049] [req358099] [req358149] [req358199] [req358249] [req358299] [req358349] [req358399] [req358449] [req358
    499] [req358549] [req358599] [req358649] [req358699] [req358749] [req358799] [req358849] [req358899] [req358949] [req358999] [req359049] [req359099] [req359149] [req359199]
     [req359249] [req359299] [req359349] [req359399] [req359449] [req359499] [req359549] [req359599] [req359649] [req359699] [req359749] [req359799] [req359849] [req359899] [re
    q359949] [req359999] [req360049] [req360099] [req360149] [req360199] [req360249] [req360299] [req360349] [req360399] [req360449] [req360499] [req360549] [req360599] [req360
    649] [req360699] [req360749] [req360799] [req360849] [req360899] GET /spotify/production/+simple/
    2015-09-07 07:54:41,152 ERROR Exception when serving /spotify/production/+simple/
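
    One plausible reading of the piled-up tags (an assumption only, based on the thread_push_log call visible in keyfs.py in the patches below, not verified against the logging code): each request pushes a [reqN] prefix onto a thread-bound list and pops it when the request completes; if an exception skips the pop, every later request served by that thread inherits the stale prefixes. A toy illustration of that mechanism, with invented helper names:

    import threading

    _local = threading.local()

    def push_prefix(prefix):
        if not hasattr(_local, "prefixes"):
            _local.prefixes = []
        _local.prefixes.append(prefix)

    def pop_prefix():
        _local.prefixes.pop()

    def log(msg):
        print(" ".join(_local.prefixes), msg)

    def handle_request(n, fail=False):
        push_prefix("[req%d]" % n)
        if fail:
            raise RuntimeError("boom")   # pop_prefix() below is never reached
        log("GET /spotify/production/+simple/")
        pop_prefix()

    for n in range(3):
        try:
            handle_request(n, fail=(n < 2))
        except RuntimeError:
            pass

    # Output: "[req0] [req1] [req2] GET /spotify/production/+simple/"
    # -- the prefixes of the two failed requests are still attached.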
    
  4. Jyrki Pulliainen

    It looks like the transaction does not get cleared properly if commit or rollback fails (a minimal sketch of the failure mode follows the patch below).

    Here's the patch we applied to 2.1.0:

    diff --git a/server/devpi_server/keyfs.py b/server/devpi_server/keyfs.py
    index 5aa4653..445c1a0 100644
    --- a/server/devpi_server/keyfs.py
    +++ b/server/devpi_server/keyfs.py
    @@ -489,12 +489,16 @@ class KeyFS(object):
             thread_push_log("[Rtx%s]" %(tx.at_serial))
    
         def rollback_transaction_in_thread(self):
    -        self._threadlocal.tx.rollback()
    -        self.clear_transaction()
    +        try:
    +            self._threadlocal.tx.rollback()
    +        finally:
    +            self.clear_transaction()
    
         def commit_transaction_in_thread(self):
    -        self._threadlocal.tx.commit()
    -        self.clear_transaction()
    +        try:
    +            self._threadlocal.tx.commit()
    +        finally:
    +            self.clear_transaction()
    
         @contextlib.contextmanager
         def transaction(self, write=False, at_serial=None):
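
    To make the failure mode concrete, here is a minimal, self-contained sketch (not devpi code; the classes and method bodies are invented for illustration) of how a failing commit leaves the thread-local transaction behind and trips the assertion on the next request served by the same worker thread:

    import threading

    class FakeTx:
        def commit(self):
            # simulate a commit that fails, e.g. due to an I/O or sqlite error
            raise IOError("simulated commit failure")

    class KeyFSLike:
        def __init__(self):
            self._threadlocal = threading.local()

        def begin_transaction_in_thread(self):
            # the real method asserts that no transaction is already bound to
            # this thread -- the line raising AssertionError in the traceback
            assert not hasattr(self._threadlocal, "tx")
            self._threadlocal.tx = FakeTx()

        def clear_transaction(self):
            del self._threadlocal.tx

        def commit_transaction_in_thread(self):
            # pre-patch behaviour: if commit() raises, clear_transaction()
            # is never reached and the tx attribute stays on the thread-local
            self._threadlocal.tx.commit()
            self.clear_transaction()

    keyfs = KeyFSLike()
    keyfs.begin_transaction_in_thread()
    try:
        keyfs.commit_transaction_in_thread()
    except IOError:
        pass  # the web server keeps the worker thread alive and serves on

    # the next request handled by the same thread now fails immediately:
    keyfs.begin_transaction_in_thread()   # raises AssertionError

    With the try/finally from the patch, clear_transaction() runs even when commit() or rollback() raises, so the thread-local is left clean for the next request.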
    
  5. Jyrki Pulliainen

    And here's a patch against the current Mercurial tip:

    diff -r d4d31410217e server/devpi_server/keyfs.py
    --- a/server/devpi_server/keyfs.py  Thu Sep 03 13:10:56 2015 +0200
    +++ b/server/devpi_server/keyfs.py  Mon Sep 07 12:19:18 2015 +0200
    @@ -511,12 +511,16 @@
             thread_push_log("[Rtx%s]" %(tx.at_serial))
    
         def rollback_transaction_in_thread(self):
    -        self._threadlocal.tx.rollback()
    -        self.clear_transaction()
    +        try:
    +            self._threadlocal.tx.rollback()
    +        finally:
    +            self.clear_transaction()
    
         def commit_transaction_in_thread(self):
    -        self._threadlocal.tx.commit()
    -        self.clear_transaction()
    +        try:
    +            self._threadlocal.tx.commit()
    +        finally:
    +            self.clear_transaction()
    
         @contextlib.contextmanager
         def transaction(self, write=False, at_serial=None):
    
  6. Holger Krekel repo owner

    Can you do a PR for this change? If you can devise a test in test_keyfs.py, that'd be grand; otherwise do it without the test.
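
    A test along these lines might work (a rough sketch only; it assumes a keyfs fixture comparable to the existing ones in test_keyfs.py and monkeypatches commit() to fail, so the exact setup may need adjusting):

    import pytest

    def test_commit_failure_clears_thread_transaction(keyfs, monkeypatch):
        keyfs.begin_transaction_in_thread(write=True)

        def failing_commit():
            raise RuntimeError("simulated commit failure")

        monkeypatch.setattr(keyfs._threadlocal.tx, "commit", failing_commit)

        with pytest.raises(RuntimeError):
            keyfs.commit_transaction_in_thread()

        # with the try/finally in place, the thread-local tx is gone even
        # though commit blew up, so the next request in this thread can
        # begin a transaction without hitting the AssertionError
        assert not hasattr(keyfs._threadlocal, "tx")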

  7. Holger Krekel repo owner

    I looked at it again. The thing is that tx.commit and tx.rollback are not supposed to throw exceptions; if they do, we end up in a bad state like the one you are seeing. Do you have any logs showing how commit/rollback failed in a prior request?

  8. Jason R. Coombs reporter

    Thanks @hpk42. It's likely that the issue is related to our deployment system, which allows (for a very short interval) two devpi instances to run against the same file system. I've shared with you (privately) some logs from our production system where this condition may have been replicated.

  9. Holger Krekel repo owner

    Indeed, the logs I got likely indicate that the problem stems from multiple processes writing to the same file system/sqlite db. But I wonder whether the other people reporting on this issue also have a setup where multiple devpi-server instances write to the same (networked) file system?

  10. Matthias Bach

    Be aware that GlusterFS is not always fully serially consistent. For example, one has to run rsync with the --inplace option, as otherwise there is a finite chance that it will not be able to move the file it created during the transfer. Issues like that could also easily interfere with SQLite.

  11. Jason R. Coombs reporter

    Even with 3.0.1, we've continued to observe multiple [reqNNNN] tags in log messages on our read-only replica, which runs on GlusterFS but for which there should have been no overlapping instances (this was the first deployment against a clean folder, synced from the master). For example:

    /logs/yhost_procs/yprocs.log.3.gz:Feb 28 08:05:14 y32 supervisord: devpi-v4.0-EMEA_prod-5312b1c6-server-5000 2016-02-28 08:05:14,215 INFO  [req7435] [req8385] [req9335] [req9585] [req10285] [req10335] [req11235] [req11285] [req12185] [req12235] [req12835] [req13135] [req14085] [req25384] [req26734] [req27984] [req28184] [req28434] [req28934] [req29134] [req29734] [req29884] [req30084] [req30334] [req30834] [req31034] [req31634] [req31784] [req31984] [req32734] [req32934] [req33684] [req33884] [req34634] [req34834] [req35084] [req35434] [req35584] [req35784] [req36534] [req36734] [req37484] [req37684] [req38634] [req38684] [req39833] [req40083] [req44031] [req44631] [req44681] [req56381] [req56581] [req56831] [req57331] [req57531] [req61381] [req64729] [req65429] [req66379] [req67329] [req67379] [req67979] [req68279] [req68329] [req69229] [req70179] [req71129] [req72079] [req72329] [req73029] [req73679] [req91478] [req92628] [req92828] [req93828] [req94178] [req96076] [req96226] [req96426] [req97176] [req97376] [req98176] [req98526] [req98726] [req98976] [req101826] [req102176] [req102926] [req103126] [req103876] [req103926] [req104076] [req104826] [req105026] [req105776] [req105976] [req106726] [req106776] [req106926] [req107676] [req107876] [req108326] [req108626] [req108826] [req109576] [req109626] [req109776] [req110526] [req110576] [req110726] [req110776] [req111176] [req111476] [req111676] [req112126] [req112426] [req112626] [req113376] [req113576] [req114326] [req114526] [req115276] [req115326] [req115476] [req115526] [req116226] [req116426] [req116826] [req117672] [req117872] [req118622] [req118822] [req119572] [req119622] [req119772] [req120522] [req120722] [req121172] [req121472] [req121522] [req121672] [req122422] [req122472] [req122622] [req123372] [req123572] [req123622] [req124322] [req124372] [req124522] [req125272] [req125322] [req125472] [req125522] [req126222] [req126422] [req126872] [req127172] [req127222] [req127372] [req127422] [req128122] [req128172] [req128322] [req129072] [req129272] [req130022] [req130222] [req130272] [req130972] [req131022] [req131172] [req131922] [req132122] [req132872] [req133072] [req133522] [req133822] [req134022] [req134772] [req134972] [req135722] [req135772] [req135922] [req136672] [req136872] [req137622] [req137672] [req137822] [req138572] [req138622] [req138772] [req138822] [req139522] [req139722] [req140472] [req140672] [req141422] [req141472] [req141622] [req141672] [req142072] [req142372] [req142572] [req143322] [req143522] [req143972] [req144272] [req144472] [req144922] [req145222] [req145272] [req145422] [req146172] [req146372] [req147122] [req147172] [req147322] [req148072] [req148272] [req148322] [req149022] [req149222] [req149972] [req150172] [req150622] [req150922] [req150972] [req151122] [req172735] [req173135] [req175035] [req176185] [req177535] [req178085] [req178435] [req178835] [req180935] [req181885] [req182235] [req182285] [req190185] GET /root/yg/
    

    From the PR:

    I think within the devpi-server-3.0 branch we have fixed the underlying issue. The bad output should not happen anymore. Maybe it still makes sense to go for capturing the now hard-to-trigger (and thus hard-to-test) case?

    Given that we're probably still experiencing the issue, I think it does make sense to hard fail on this undesirable condition, or at least log more detail.
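
    To sketch what "hard fail with more detail" could look like (an illustration only, not a proposed patch; the helper name and logger name are invented, while at_serial is taken from the patch context above):

    import logging
    import threading

    log = logging.getLogger("devpi_server.keyfs")

    def check_no_stale_tx(threadlocal):
        # instead of a bare ``assert not hasattr(threadlocal, "tx")``, report
        # which transaction was left behind before failing, so the request
        # that leaked it can be tracked down
        stale = getattr(threadlocal, "tx", None)
        if stale is not None:
            log.error(
                "stale transaction left bound to thread %s (at_serial=%r); "
                "a previous request apparently failed to clear it",
                threading.current_thread().name,
                getattr(stale, "at_serial", None))
            raise RuntimeError("thread already has an active transaction")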

  12. Florian Schulze

    I'm now seeing this issue as well, even on a plain master, but I haven't been able to reproduce it yet.

    I think one issue might be this code: https://bitbucket.org/hpk42/devpi/src/74c1fd7807fa8d11e9d9053992c2c8b1307f456a/server/devpi_server/extpypi.py?at=default&fileviewer=file-view-default#extpypi.py-397

    If I understand the code correctly, then the tx variable in tween_keyfs_transaction won't be the correct one anymore after the code above has run. I think we rather need a restart_read_transaction or similar, so that the tx variable is preserved.
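
    To illustrate the concern, here is a simplified sketch (not the actual tween/extpypi code; the names are invented) of why a locally held tx reference goes stale when code further down commits and starts a new transaction mid-request:

    class FakeKeyFS:
        def __init__(self):
            self.tx = None

        def begin(self):
            self.tx = object()      # stands in for a real Transaction object
            return self.tx

        def commit_and_begin_new(self):
            # roughly what the linked extpypi.py code does while a request
            # is still being served
            self.tx = object()

    keyfs = FakeKeyFS()
    tx = keyfs.begin()              # the tween's local "tx" variable
    keyfs.commit_and_begin_new()    # happens somewhere inside the view
    print(tx is keyfs.tx)           # False: the tween now holds a stale tx

    A restart_read_transaction()-style helper would instead move the same transaction object to the new serial, keeping the tween's reference valid.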
