Issue #2953 new
Shaun Jackman created an issue

Hi. I'm running a pure Python app in pypy3. It often deadlocks, with the process in a sleeping state and making no progress. For example…

   PID TTY      STAT   TIME COMMAND
211617 ?        Sl    16:13 pypy3 mytool

I attach to the running process using gdb -p 211617. info threads shows two threads. My code has created no threads. Are those two threads the Python interpreter and the garbage collector (just for my curiosity)? Both threads are blocked in the function futex_abstimed_wait_cancelable. The backtrace is…

#0  0x00007fe62694c18a in futex_abstimed_wait_cancelable (private=128, 
    abstime=0x0, expected=0, futex_word=0x7fe62a671000)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  do_futex_wait (sem=sem@entry=0x7fe62a671000, abstime=0x0)
    at sem_waitcommon.c:111
#2  0x00007fe62694c231 in __new_sem_wait_slow (sem=0x7fe62a671000, abstime=0x0)
    at sem_waitcommon.c:181
#3  0x00007fe627e3c806 in ?? ()
   from /gsc/btl/linuxbrew/Cellar/pypy3/6.0.0/lib/libpypy3-c.so

Each thread is blocked on a different mutex. Thread 1 is blocked on mutex 0x7fe62a835780 and thread 2 on mutex 0x7fe62a671000. My guess is that this is the usual deadlock: thread 1 holds mutex A and is blocked on mutex B, while thread 2 holds mutex B and is blocked on mutex A.
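The lock-ordering pattern guessed at here can be reproduced in a few lines of plain Python. This is only a sketch of the general pattern, not of the locks inside PyPy: the function name, the two `threading.Lock` objects, and the timeout are all made up for illustration. A timeout and a second barrier are used so that the demonstration ends instead of hanging forever.

```python
import threading


def demonstrate_lock_order_deadlock(timeout=0.5):
    """Show two threads each holding one lock and waiting for the other."""
    lock_a = threading.Lock()
    lock_b = threading.Lock()
    both_hold_first = threading.Barrier(2)
    both_tried_second = threading.Barrier(2)
    results = {}

    def worker(name, first, second):
        with first:
            # Make sure both threads hold their first lock before going on.
            both_hold_first.wait()
            # A real deadlock would block here forever; the timeout lets
            # the demonstration finish and report what happened.
            got_second = second.acquire(timeout=timeout)
            results[name] = got_second
            # Keep holding the first lock until both attempts are over.
            both_tried_second.wait()
            if got_second:
                second.release()

    t1 = threading.Thread(target=worker, args=("thread 1", lock_a, lock_b))
    t2 = threading.Thread(target=worker, args=("thread 2", lock_b, lock_a))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return results


if __name__ == "__main__":
    print(demonstrate_lock_order_deadlock())
```

Neither thread can obtain its second lock while the other still holds it, which is the situation the two blocked threads in gdb would be in if this guess is right.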

Is there more information that would be useful to you, or anything I can do to help troubleshoot this situation?

$ pypy3 --version
Python 3.5.3 (fdd60ed87e941677e8ea11acf9f1819466521bf2, Jul 13 2018, 18:38:41)
[PyPy 6.0.0 with GCC 5.4.0 20160609]

I'm using PyPy 6. I see that PyPy 7 is released. I'll try it out.

I'm using an Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz with 72 physical cores (144 hyperthreads).

Thanks for your help, Shaun

Comments (13)

  1. mattip

    "My code has created new threads" -> "My code has created no threads"? Ideally we would need to see your code. When you say "pure python", are you sure there are no c-extension modules being used under the hood by some of your packages, or that a package is creating a thread? Do you have any __del__ methods that might be reviving dead objects?

  2. Shaun Jackman reporter

    "My code has created new threads" -> "My code has created no threads"?

    Correct! Sorry for the typo. I've fixed the typo above.

  3. Shaun Jackman reporter

    Here's the complete list of imports:

    import argparse
    import itertools
    import multiprocessing
    import os
    import random
    import re
    import statistics
    import sys
    import timeit
    from collections import Counter
    
    import networkx as nx
    import tqdm
    

    Both networkx and tqdm are pure Python. My code has no __del__ methods. networkx has no __del__ method. https://github.com/networkx/networkx/search?q=__del__

    tqdm has one __del__ method that calls self.close(). https://github.com/tqdm/tqdm/search?q=__del__

    Ideally we would need to see your code.

    The code is unfortunately in a private repo for now. I'm sorry that I'm not able to share it with you. I'm happy to answer any questions that you have. If you think that's holding up the investigation, I could send you the code privately.

    Cheers, Shaun

  4. mattip

    We have had others comment that tqdm seems to have a race condition. Can you try without it? It would be nice to verify if there is a problem with it. Also, multiprocessing.Pool has known bugs around closing the pool. From a cursory glance, the others should be fine, although the use of itertools is worrisome from a performance perspective: some of the tricks it uses complicate the code.

  5. Shaun Jackman reporter

    The particular branch of code that I'm testing right now isn't using multiprocessing.Pool, but I'll keep that in mind.

    I'm using only one itertools function, itertools.combinations(xs, 2). I can recode that in Python to remove its use.
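    A pure-Python stand-in for that one call could look like the sketch below. The name `pairs` is made up for illustration; it is meant to yield the same tuples, in the same order, as itertools.combinations(xs, 2).

```python
def pairs(xs):
    """Yield every unordered pair of items from xs, in input order."""
    items = list(xs)  # accept any iterable, as itertools.combinations does
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            yield items[i], items[j]


if __name__ == "__main__":
    print(list(pairs("abc")))
```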

    I'll try without tqdm and get back to you. Do you have a replacement progress bar that you suggest? I find tqdm really helpful. Some of these runs take multiple days, and it's encouraging to know how far it is along, and whether it's making progress.
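    As a stopgap while testing without tqdm, a very small single-threaded progress reporter might be enough. The sketch below is not a tqdm replacement and has not been tested against the hang on PyPy 6.0.0; the names `progress` and `_report` are made up. It uses no threads or locks and only writes a status line to stderr at most once per `every` seconds.

```python
import sys
import time


def _report(count, total, elapsed, out):
    """Write one status line, overwriting the previous one."""
    if total:
        line = "\r{}/{} ({:.1f}%) {:.0f}s elapsed".format(
            count, total, 100.0 * count / total, elapsed)
    else:
        line = "\r{} items, {:.0f}s elapsed".format(count, elapsed)
    out.write(line)
    out.flush()


def progress(iterable, total=None, every=1.0, out=sys.stderr):
    """Yield items from iterable, reporting progress every `every` seconds."""
    if total is None:
        try:
            total = len(iterable)
        except TypeError:
            total = None  # generators and other unsized iterables
    start = time.monotonic()
    last = start
    count = 0
    for count, item in enumerate(iterable, 1):
        yield item
        now = time.monotonic()
        if now - last >= every:
            last = now
            _report(count, total, now - start, out)
    # Final report once the iterable is exhausted.
    _report(count, total, time.monotonic() - start, out)
    out.write("\n")


if __name__ == "__main__":
    for _ in progress(range(100000)):
        pass
```

    Wrapping a loop as `for x in progress(xs):` mirrors how `tqdm.tqdm(xs)` is typically used, so switching back later should be a small change.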

    Thanks for your help, @mattip!

  6. Shaun Jackman reporter

    Hi, Marcin. Your test case also hangs up for me 70% of the time (7 trials of 10).

    Removing tqdm from my code prevents the deadlock, but I really miss tqdm.

  7. mattip

    I wrote a test based on the minimal test case on the semlock-deadlock branch in commit 09ebb064a7b8. It seems the code in _multiprocessing/interp_semaphore.{acquire,release} is written as if it is atomic, but it hangs very quickly when run untranslated.

  8. mattip

    The test now runs successfully on the semlock-deadlock branch. There is a pypy2 download available here. Could you try it out on the real-world code?

  9. Shaun Jackman reporter

    Does this mean that tqdm will work with pypy3 soon!? That would make me very excited.
