Python 3.8 segfault: CField_Type does not implement tp_traverse

Create issue
Issue #416 on hold
Christian Heimes created an issue

The CPython bug https://bugs.python.org/issue38006 describes a segfault on Python 3.8b4 that seems to be caused by a problem in cffi and triggered by the new vector call feature. The segfault occurs when Python shuts down and GC clears weak references

The CFieldObject object of CField_Type can hold a CTypeDescrObject object in its cf_type slot. This turns a CFieldObject into a container object. All container objects must implement GC protocol (Py_TPFLAGS_HAVE_GC, tp_traverse, tp_clear). Victor’s comment https://bugs.python.org/issue38006#msg351037 goes into more details.

Comments (14)

  1. Armin Rigo

    Yes, it's easy to add Py_TPFLAGS_HAVE_GC, tp_traverse and tp_clear to this type. I wonder, though. At some point in time I never used to say in the documentation that any type must implement the GC protocol, only that it should if it is part of a cycle. The only issue was of not freeing memory. It might be fine for some types like singletons, for example. There was nowhere the hint that missing the GC protocol might lead to segfaults.

    My own guess is that various other C extension modules (many of which you don't even know about because they are not public) will have the same "problem". This will look to users like their apps work in Python <= 3.7 and crash at shutdown in Python 3.8---a situation that they might remedy using e.g. os._exit() to shut Python down. Guess who is going to take the blame for this unclean result? Certainly not the author of some long-forgotten in-house C extension module they happen to use and that worked fine so far.

  2. Christian Heimes reporter

    Here is an experimental patch that adds GC to CField object. I haven't tested it yet because the patch leaks references like a sieve. I had to turn the borrow CField reference into an owned reference.

  3. Armin Rigo

    The same issue exists about many types in CFFI, not just CField.

    Most importantly, cdata objects, which are extremely common, have a strong reference to the ctype object but do not themselves implement the GC protocol. Should it be fixed too?

    What are the exact rules supposed to be? Just by a quick look in the source code of Python 3.8, I see that it is quite common to have types which don't implement the GC protocol but have a strong reference to an object which typically does:

    • os.DirEntry (no GC, but contains stats, which have GC)

    • select.poll (contains a dict)

    • _ctypes: PyCPointer_Type, PyCArray_Type, PyCArg_Type, ...

  4. Christian Heimes reporter

    Unfortunately it’s not that easy to add GC to CField_Type. The fields are stored in ct_extra, which is a *void and linked list of borrowed references to CFieldObjects. That has to change, too.

  5. Christian Heimes reporter

    Yes, the patch is PoC.

    I have asked Pablo and Victor to assist. I’m not an expert on GC.

  6. Armin Rigo

    Just saying, I'm fine writing up the patch myself too, but the real blocker here is that what works or doesn't work appears ill-defined and relying only on "someone saw a segfault in this precise case". Fixing CFieldObject won't fix CDataObject or GlobSupportObject, all of which have a strong reference to a CTypeDescrObject and no GC support.

  7. Victor Stinner

    Just saying, I'm fine writing up the patch myself too, but the real blocker here is that what works or doesn't work appears ill-defined and relying only on "someone saw a segfault in this precise case".

    FYI I’m working on a writing a reproducer, but it’s really hard to reproduce the bug: it requires a very specific order to destroy objects. At least, I managed to write a reproducer which only depends on cffi and pycparser, no longer on FreeIPA.

    Python must be fixed to detect C extension types which don’t implement GC: workaround the bug to prevent a hard crash. From an user point of view: CFFI worked well in Python 3.7 🙂 See https://bugs.python.org/issue38006 for the analysis of the crash.

    It would be nice to implement tp_traverse in CFFI types, but I’m not sure what are the minimum CFFI changes to prevent https://bugs.python.org/issue38006 crash. Implement tp_traverse for CField?

  8. Tim Peters

    Long ago, cyclic gc was added as an optional module types were free to ignore. If a type ignored it, memory might leak, but that was the worst that was expected.

    But things have gotten far more involved since then, with the addition of weakrefs w/ callbacks, running finalizers on objects in trash cycles, and people feeling ever-freer to give no thought at all to whether they're creating cycles. The last includes the CPython core now - cyclic gc is no longer optional in reality.

    weakrefs were designed with no thought to how they'd play with cyclic gc, and it shows ;-) They've been an historic source of miserable rare bugs, of which this is just the latest - although it's been years since the last one surfaced.

    "The rules" are simple: any type that may participate in a cycle really should implement the entire gc protocol. But many types don't, and we go to great lengths to "do OK" anyway.

    Same here: NeilS & I worked out changes to gc to prevent this specific kind of crash. Is this last possible? Doubt it! We always hope the last one was "the last" ;-)

    Armin, most important here is tp_traverse. That's the one & only clue CPython's gc has about what the object graph is. In this case there was a huge mass of cyclic trash at shutdown, including weakrefs with callbacks. The object, T, that was pointed to by the weakref was also part of cyclic trash, but the lack of tp_traverse blinded us to the one pointer that would have allowed us to deduce T was part of cyclic trash.

    For objects we know are part of cyclic trash, we run all their weakref callbacks before any objects are cleared or deallocated by brute force.

    But we didn't know here, and the callback function itself was made insane by brute force before breaking links caused ordinary refcount semantics to trigger T's (fatally damaged) callback.

    So the actual rules are "don't trigger things like that" ;-) We don't know what they are, so "implement everything" is the only safe answer.

    In this particular case we never intended to run a callback from a trash weakref, so we just clear all trash weakrefs now. It was news to us that it was possible for a missing tp_traverse to trick us into running one before. I never saw another form of gc that had to worry about "but what if what we're told about the object graph is lying"? ;-)

  9. Tim Peters

    Here's an easy example that may clarify the pragmatics: suppose we have a simple trash cycle, a circle of 50 objects of 50 different types.

    For CPython to recognize that it is a trash cycle, all 50 types must implement tp_traverse. If some type T doesn't, then the pointer starting from the object of type T is invisible, so the object that pointer ends at appears to be reachable from outside the current generation (something must account for its refcount of 1). So the cycle leaks.

    But if it is recognized as a trash cycle, to guarantee that it gets reclaimed only 1 of the types needs to implement tp_clear. Break any 1 of the pointers by brute force, and then regular old refcounting reclaims the whole circle.

    Lots of container types in the core haven't bothered with tp_clear because of that. They just don't expect to be involved in cycles devoid of the heavily used container types that do implement it. But they "should" implement tp_clear anyway: it's very easy to write one, often just moving Py_CLEAR(PyObject *) lines out of the type's deallocation function (and then calfling the new tp_clear from the latter).

    Ironically, that's what triggered the current segfault bug: adding tp_clear to function objects. That's what "caused" a weakref callback function to be rendered insane before it was used. But the real cause was that an absent tp_traverse prevented gc from knowing that relevant objects were in trash cycles so that callbacks needed special handling to be done safely. As far as gc could tell, the weakref referent was not trash, so its callback would never be invoked. That's one way in which failing to recognize a trash cycle can end up doing much worse than just leaking memory.

  10. Armin Rigo

    According to the rule "any type that may participate in a cycle really should implement the entire gc protocol", many of the more internal types (of cffi, of ctypes, of the rest of the CPython core) are bogus, and fixing them will require three more words for each instance. Some of them are very small and numerous objects. If that's the rule anyway, then fine, I guess just wasting all that memory in cffi is part of the price for avoiding these rare segfaults when something unrelated changes in CPython.

    I also guess that I need to fix the borrowed reference between cfield and ctype object in cffi. The idea was to avoid creating a ton of pointless cycles. I can give up that idea too and just throw much more work to CPython's GC.

    Better safe than sorry!

  11. Armin Rigo

    Marked as "on hold" until the CPython situation is clarified and someone un-marks this as "on hold". In particular I'd like some reasoning for why CFieldType needs to be fixed, but not the other similar types in cffi, nor the similar types inside CPython itself (see list in my comment above).

  12. Tim Peters

    You don't need to change anything. But, as noted, if a container type doesn't at least supply tp_traverse, the best it can hope for is to be implicated in memory leaks. It's simply impossible for CPython's gc to follow pointers it's not told about (note the twist here: for most forms of gc, missing a pointer can cause a live object to be mistakenly considered to be trash, which is catastrophic; but in CPython's gc it's the opposite: it assumes everything is alive at first, and chases pointers to prove things are trash).

    The docs already seem clear enough to me:

    Python’s support for detecting and collecting garbage which involves circular references requires support from object types which are “containers” for other objects which may also be containers. Types which do not store references to other objects, or which only store references to atomic types (such as numbers or strings), do not need to provide any explicit support for garbage collection.

    To create a container type, the tp_flags field of the type object must include the Py_TPFLAGS_HAVE_GC and provide an implementation of the tp_traverse handler. If instances of the type are mutable, a tp_clear implementation must also be provided.

    In slow motion, more types in CPython are changing to follow those rules, as they get implicated in memory leaks.

    The only failure modes I know of for not following the rules now (after the latest hacks inspired by this report):

    • Memory leaks.
    • Stats reported by gc for # of reclaimed objects may be wrong.
    • If an object in cyclic trash has a finalizer, and that object is resurrected, the finalizer may run again the next time the object becomes trash (when the rules are followed, a finalizer is run at most once total).

    It's possible segfaults can't happen anymore.

    So suit yourself - a new kind of segfault every few years isn't really a disaster 😉

  13. Tim Peters

    if an object in cyclic trash has a finalizer, and that object is resurrected, the finalizer may run again …

    Scratch that! Turns out different paths in the code cover this, independent of gc. So a finalizer will be run at most once regardless of whether all containers follow the rules. (Of course if an object with a finalizer leaks, its finalizer will never be invoked.)

  14. Log in to comment