Bandersnatch sometimes hangs on teardown
I don't have a lot to go on here, so this is as much a request for ideas on how to debug this better as it is a bug report. Sometimes, seemingly at random, a run of bandersnatch 1.5 completes its sync but never actually exits. We usually discover this by noticing we have a stale mirror, and on investigation find a hung bandersnatch mirror process on the system. That run has logged that it reached the serial it was aiming for (so no further output from it is expected), but it has still, for whatever reason, not terminated:
2014-09-16 11:40:02,144 INFO: bandersnatch/1.5 (CPython 2.7.3-final0, Linux 3.2.0-68-generic x86_64)
2014-09-16 11:40:02,144 INFO: Syncing with https://pypi.python.org.
2014-09-16 11:40:02,144 INFO: Current mirror serial: 1226428
2014-09-16 11:40:02,144 INFO: Syncing based on changelog.
2014-09-16 11:40:02,369 INFO: Trying to reach serial: 1226432
2014-09-16 11:40:02,369 INFO: 2 packages to sync.
2014-09-16 11:40:02,370 INFO: Syncing package: django-media-library (serial 1226430)
2014-09-16 11:40:02,370 DEBUG: Getting /pypi/django-media-library/json (serial 1226430)
2014-09-16 11:40:02,371 INFO: Syncing package: oar-docker (serial 1226432)
2014-09-16 11:40:02,371 DEBUG: Getting /pypi/oar-docker/json (serial 1226432)
2014-09-16 11:40:02,420 INFO: Syncing index page: django-media-library
2014-09-16 11:40:02,421 DEBUG: Getting /simple/django-media-library/ (serial 1226430)
2014-09-16 11:40:02,430 DEBUG: Getting /serversig/django-media-library/ (serial 1226430)
2014-09-16 11:40:02,460 INFO: Downloading: https://pypi.python.org/packages/source/o/oar-docker/oar-docker-0.1.2.tar.gz
2014-09-16 11:40:02,460 DEBUG: Getting https://pypi.python.org/packages/source/o/oar-docker/oar-docker-0.1.2.tar.gz (serial None)
2014-09-16 11:40:02,487 INFO: Syncing index page: oar-docker
2014-09-16 11:40:02,488 DEBUG: Getting /simple/oar-docker/ (serial 1226432)
2014-09-16 11:40:02,502 DEBUG: Getting /serversig/oar-docker/ (serial 1226432)
2014-09-16 11:40:02,517 INFO: Generating global index page.
2014-09-16 11:40:02,814 INFO: New mirror serial: 1226432
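For context, the only way we notice the hang today is that the mirror stops advancing. Something like the following, run from cron, would flag it sooner; this is just a sketch for our setup, and the index path and age threshold are assumptions on my part, not anything bandersnatch defines:

# Cron-able staleness check (sketch for our setup, not part of bandersnatch):
# alert if the generated simple index has not been rewritten recently.
import os
import sys
import time

INDEX = "/srv/pypi/web/simple/index.html"   # assumed mirror location
MAX_AGE = 2 * 60 * 60                       # roughly two missed syncs for us

age = time.time() - os.path.getmtime(INDEX)
if age > MAX_AGE:
    sys.stderr.write("mirror looks stale: index last written %d seconds ago\n" % age)
    sys.exit(1)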
Attaching to the process with strace shows it sitting in what seems to be some sort of select/poll loop:
gettimeofday({1410961910, 563804}, NULL) = 0
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
gettimeofday({1410961910, 614091}, NULL) = 0
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
gettimeofday({1410961910, 664326}, NULL) = 0
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
gettimeofday({1410961910, 714626}, NULL) = 0
select(0, NULL, NULL, NULL, {0, 33214}) = 0 (Timeout)
gettimeofday({1410961910, 748154}, NULL) = 0
gettimeofday({1410961910, 748243}, NULL) = 0
gettimeofday({1410961910, 748321}, NULL) = 0
gettimeofday({1410961910, 748384}, NULL) = 0
gettimeofday({1410961910, 748444}, NULL) = 0
gettimeofday({1410961910, 748493}, NULL) = 0
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
gettimeofday({1410961910, 749718}, NULL) = 0
select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
gettimeofday({1410961910, 751927}, NULL) = 0
select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout)
gettimeofday({1410961910, 756149}, NULL) = 0
select(0, NULL, NULL, NULL, {0, 8000}) = 0 (Timeout)
gettimeofday({1410961910, 764421}, NULL) = 0
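For what it's worth, those timeouts (repeated 50 ms waits followed by 1, 2, 4 and 8 ms) look to me like CPython 2's timed-wait polling: threading.Condition.wait() with a timeout sleeps in doublings capped at 50 ms, and Thread.join(timeout) and Queue.get(timeout) both go through it. So my guess, and it is only a guess, is that the main thread is re-polling a timed wait on a worker thread or queue that never finishes. A tiny standalone script that produces the same strace signature:

# Not bandersnatch code - just a minimal script whose strace output shows the
# same pattern: a timed join on a thread that never finishes makes CPython 2's
# Condition.wait() sleep in doublings capped at 50 ms, and time.sleep() shows
# up in strace as the select(0, NULL, NULL, NULL, ...) calls above.
import threading
import time

def worker():
    time.sleep(3600)          # stands in for a worker that never completes

t = threading.Thread(target=worker)
t.daemon = True               # let the script exit despite the stuck worker
t.start()

for _ in range(100):          # strace this loop to see the select/poll pattern
    t.join(0.1)               # each timed join re-enters the backoff sleep loop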
Killing the hung process produces no additional output, and on the subsequent run bandersnatch finds the leftover todo list and believes it was interrupted:
2014-09-17 13:55:01,667 INFO: bandersnatch/1.5 (CPython 2.7.3-final0, Linux 3.2.0-68-generic x86_64)
2014-09-17 13:55:01,668 INFO: Syncing with https://pypi.python.org.
2014-09-17 13:55:01,668 INFO: Current mirror serial: 1226432
2014-09-17 13:55:01,668 INFO: Resuming interrupted sync from local todo list.
2014-09-17 13:55:01,668 INFO: Trying to reach serial: 1226437
2014-09-17 13:55:01,668 INFO: 0 packages to sync.
2014-09-17 13:55:01,669 INFO: Generating global index page.
2014-09-17 13:55:02,005 INFO: New mirror serial: 1226437
Thereafter, it runs normally.
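Since this is also a request for debugging ideas, one thing I'm considering for next time: wrapping our mirror cron job so a signal dumps the Python-level stack of every thread. This is my own wrapper sketch, not something bandersnatch provides, and it should work on the CPython 2.7 we run:

# Wrapper around our mirror cron job (my own sketch, not part of bandersnatch):
# on SIGUSR1, print the Python stack of every thread so a hung process can be
# inspected with a plain kill -USR1 <pid>.
import signal
import sys
import threading
import traceback

def dump_all_threads(signum, frame):
    names = dict((t.ident, t.name) for t in threading.enumerate())
    for ident, stack in sys._current_frames().items():
        sys.stderr.write("\n--- thread %s (%s) ---\n" % (ident, names.get(ident, "?")))
        traceback.print_stack(stack, file=sys.stderr)

signal.signal(signal.SIGUSR1, dump_all_threads)

# ... then invoke the normal bandersnatch mirror run from here.

Next time the hang shows up, kill -USR1 against the stuck pid should tell us exactly which thread the select/poll loop above belongs to.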
Comments (10)
-
I just recently fixed a deadlock bug - could you check whether this also happens with the current head?
- changed status to on hold
-
It's not something I'm able to recreate at will (since I have no idea what triggers the issue), and I don't have an easy way to upgrade all of our mirror servers to an unreleased version either. Is there a new release due out any time soon?
-
Ah. I'm happy to make a release soon. I'm currently discussing a few important changes that I might also want to pick up; if that doesn't happen soon, I'll just make a release with what I have.
-
- changed status to open
-
-
assigned issue to
-
assigned issue to
-
I released 1.6.1 a minute ago. If you like, try it out - it's likely to solve your problem.
-
Thanks! We've upgraded to 1.6.1, and I'll follow up on this bug if we notice any further recurrences of the teardown hang.
-
- changed status to resolved
I think this was solved, as I haven't heard about it again.
-
Agreed; we see no further evidence of this in our logs since upgrading to 1.6.1.