pypy binary is linked to too much stuff, which breaks manylinux wheels

Issue #2617 new
Nathaniel Smith created an issue

With pypy 5.8, ldd pypy shows that it's linked to libbz2, libcrypto, libffi, libncurses, ... Likewise for pypy3 5.8 and ldd pypy3.5.

You would not think so, but it turns out that this is a big problem for distributing wheels.

The issue is that the way ELF works, any libraries that show up in ldd $TOPLEVELBINARY effectively get LD_PRELOADed into any extension modules that you load later. So, for example, if some wheel distributes its own version of openssl, then any symbols that show up in both their copy of openssl and pypy's copy of openssl will get shadowed and hello segfaults.

The cryptography project recently ran into this with uwsgi.

Fortunately this has not been a big deal so far because, uh... nobody distributes pypy wheels. But in the future maybe this is something that should be supported :-). And while in theory it would be nice if this could be fixed on the wheel side, this is not trivial.

The obvious solution would be to switch things around so that the top-level pypy executable does dlopen("libpypy-c.so", RTLD_LOCAL) to start the interpreter, instead of linking against it with -lpypy-c. Then the symbols from libpypy-c.so and everything it links to would be confined to an ELF local namespace, and would stop polluting the namespace of random extension modules.

However... there is a problem, which is that cpyext extension modules need some way to get at the C API symbols, and I assume cffi extension modules need access to some pypy symbols as well.

This is... tricky, given how rpython wants to mush everything together into one giant .so, and ELF makes it difficult to only expose some symbols from a binary like this. Some options:

  • when using libcrypto or whatever from rpython, use dlopen("libcrypto", RTLD_LOCAL) instead of -lcrypto. I guess this could be done systematically in rffi?
  • provide a special libcpyext that uses dlopen to fetch the symbols from libpypy-c.so and then manually re-exports them?

Comments (25)

  1. Armin Rigo

    For PyPy, we could imagine a solution where we use dlopen("libpypy-c.so", RTLD_LOCAL), and then not use the linker at all for cpyext and cffi modules. This would avoid this whole class of problems.

    Right now, a cpyext module is compiled with a large number of #define PyFoo PyPyFoo, and the names PyPyFoo are symbols exported by the main libpypy-c.so. The solution that completely avoids relying on the platform-specific linker would be to compile them instead with #define PyFoo (_pypy_cpyext.PyFoo) and change the initialization code so that the (cpyext-module-local) structure _pypy_cpyext is filled with function pointers when the extension module is loaded. The same can be done with cffi's few functions.
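    A toy, single-file version of what this could look like (all names here are hypothetical; one made-up function stands in for the whole C API):

    ```c
    #include <stdio.h>

    /* Hypothetical per-module table that pypy's Python.h would declare. */
    struct _pypy_cpyext_table {
        long (*add_two)(long, long);   /* slot for a made-up C API function */
    };
    static struct _pypy_cpyext_table _pypy_cpyext;

    /* Each C API name is redirected through the table, so the dynamic
       linker never has to resolve it: */
    #define Py_AddTwo (_pypy_cpyext.add_two)

    /* --- interpreter side: the real implementation plus the init hook
       that runs when the extension module is loaded --- */
    static long real_add_two(long a, long b) { return a + b; }

    static void _pypy_cpyext_init(struct _pypy_cpyext_table *t) {
        t->add_two = real_add_two;
    }

    int main(void) {
        _pypy_cpyext_init(&_pypy_cpyext);
        /* Extension-module code keeps writing Py_AddTwo(...) as before. */
        printf("%ld\n", Py_AddTwo(20, 22));
        return 0;
    }
    ```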

    It's a bit of work and requires a full breakage of cpyext's binary API (which is fine I guess). The upside is that we don't rely on the linker to find the symbols from the libpypy-c.so that imported the extension module, as opposed to not finding them at all (because of RTLD_LOCAL) or even finding the wrong ones (if there are several libpypy-c.so loaded in the same process).

    For backward compatibility with the old way to embed PyPy, we'd retain a few exported symbols on libpypy-c.so like pypy_execute_source.

    I think that what I'm suggesting would always just work, but I may be missing something...

  2. Nathaniel Smith reporter

    That sounds plausible to me.

    Note that on Windows this issue doesn't come up, because symbols are always resolved as (dll name, symbol name) pairs (unless you have dll name collisions, but there are ways to avoid that). But your suggestion should also work fine.

    On MacOS... it's complicated. The default and recommended way to do things is to use "two-level namespace" lookup which is like Windows, except it uses (path to shared library, symbol name), so you actually have to know -- at compile time! -- where the shared library will be placed on the end user's disk (either absolute, or relative to the extension module). Alternatively, you can use the old "single-level namespace" lookup, which is like ELF, except without RTLD_LOCAL – it really is just a single big soup of symbols for the whole process. I believe that what people usually do in practice on CPython is to use single-level lookup to find the CPython C API symbols (because you certainly don't know where the Python interpreter will be installed), and two-level lookup for vendored libraries. (Or at least that's how it's supposed to work?) Your proposal + always using two-level namespaces seems like it might even be an improvement on this, but since it's different from what CPython does you might have to hack at distutils a bit to get it to play along.

    One thing you have to be careful of is ABI compatibility – if you add a new symbol to cpyext and do nothing else, then with traditional linking that's backwards compatible. But with the _pypy_cpyext trick, the struct needs to get bigger, so either you need to bump your ABI compatibility version, or else you need to somehow detect that an extension was built against an older version of cpyext, and only fill in as much of the struct as it knows about. (This is what numpy does.)
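    A minimal single-file sketch of the "fill in only as much of the struct as the module knows about" variant (hypothetical names, mirroring in spirit what numpy's import_array does with its API table):

    ```c
    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical cpyext API table; v2 appended one slot to v1. */
    struct api {
        long (*add)(long, long);   /* present since ABI v1 */
        long (*mul)(long, long);   /* added in ABI v2 */
    };

    static long add_impl(long a, long b) { return a + b; }
    static long mul_impl(long a, long b) { return a * b; }

    /* Fill only as many slots as the module was compiled against. */
    static void fill_api(struct api *t, size_t module_abi_size) {
        const struct api full = { add_impl, mul_impl };
        size_t n = module_abi_size < sizeof full ? module_abi_size
                                                 : sizeof full;
        memset(t, 0, sizeof *t);
        memcpy(t, &full, n);
    }

    int main(void) {
        struct api old_module;   /* pretend this module was built against v1 */
        fill_api(&old_module, offsetof(struct api, mul));
        printf("add=%ld mul_present=%d\n",
               old_module.add(2, 3), old_module.mul != NULL);
        return 0;
    }
    ```

    The module would record the table size (or version) it was built against, and newer slots simply stay NULL for older modules.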

    On the other hand, this also potentially allows you quite a bit of freedom when it comes to evolving the cpyext ABI and controlling extension module ABI breakage -- for example, you could have different incompatible versions of the same PyWhatever symbol, and use metadata from the extension module to figure out which version it wants.

  3. Nathaniel Smith reporter

    Right, the connection to the cryptography issue isn't that uwsgi+pypy could somehow avoid this by changing pypy -- it's uwsgi causing the problem there. The connection is that right now, even the plain pypy executable causes the same problem, even without uwsgi getting involved :-).

  4. mattip

    @arigo when you say

    The solution that completely avoids relying on the platform-specific linker would be to compile them instead with #define PyFoo (_pypy_cpyext.PyFoo) and change the initialization code so that the (cpyext-module-local) structure _pypy_cpyext is filled with function pointers when the extension module is loaded

    the "initialization code" would then be like numpy's _import_array() that each c-extension module that uses numpy (pandas, matplotlib) is meant to call? Or is this the general PyPy startup code run in app_main?

  5. Nathaniel Smith reporter

    I think the idea is that it would be like import_array, except that the interpreter calls it automatically instead of requiring each module to do it explicitly. How to actually implement this is a bit trickier though :-).

    It could be checked and lazily initialized in every cpyext API call. There's some overhead here, but maybe it's not too bad on modern CPUs.
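    The lazy-check variant could look roughly like this (a toy with hypothetical names; the real version would fill in the whole table on first use):

    ```c
    #include <stdio.h>

    /* Toy of "check and lazily initialize in every cpyext API call":
       each shim tests the table before using it. */
    struct table { long (*add)(long, long); };
    static struct table tbl;          /* zeroed until first use */
    static int init_calls = 0;

    static long real_add(long a, long b) { return a + b; }

    static void ensure_init(void) {
        if (!tbl.add) {               /* cheap, well-predicted branch */
            init_calls++;
            tbl.add = real_add;       /* would fill the whole table here */
        }
    }

    static long Py_Add(long a, long b) { ensure_init(); return tbl.add(a, b); }

    int main(void) {
        long x = Py_Add(1, 2) + Py_Add(3, 4);
        printf("sum=%ld init_calls=%d\n", x, init_calls);
        return 0;
    }
    ```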

    With some cleverness, it might be possible to convince each cpyext module to export a special symbol, that pypy dlsym's and fills in itself. I don't think this can be done using only standard C though; you could have your Python.h declare a public variable, but then if a single extension is built out of multiple files you'd get collisions between each file's copy of the variable. Some platforms have ways to declare that redundant definitions should be merged; maybe that could be made to work.
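    For the "redundant definitions should be merged" route, one such platform mechanism is GCC/Clang's weak attribute on ELF; a sketch assuming that toolchain, with a hypothetical table layout:

    ```c
    #include <stdio.h>

    /* If pypy's Python.h defined the per-extension table like this,
       every .c file of a multi-file extension could include it: on ELF
       with GCC/Clang, weak definitions are merged at link time instead
       of triggering multiple-definition errors. */
    struct _pypy_cpyext_table { long (*PyLong_AsLong)(void *); };
    struct _pypy_cpyext_table _pypy_cpyext __attribute__((weak));

    int main(void) {
        /* Zero-initialized until the interpreter dlsym's and fills it. */
        printf("unfilled=%d\n", _pypy_cpyext.PyLong_AsLong == NULL);
        return 0;
    }
    ```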

    Or given that CPython has already done the work of figuring out how to export its symbols into extension module namespaces, maybe the thing to do is to continue to copy them. But if you want to avoid exporting too much, and to gain the ability to support multiple versions of the cpyext ABI simultaneously, then you could combine the two approaches: have the one symbol you export from pypy be pypy_cpyext_abi_v1, and then later you could add a pypy_cpyext_abi_v2, etc.

  6. mattip

    We never made progress on this, and now we are set to release PyPy6. Is this something we should consider release critical?

  7. Nathaniel Smith reporter

    It's something that should be sorted out whenever you want people to start putting wheels up on PyPI, but I don't see why it should be a blocker for PyPy6 in particular.

  8. Nathaniel Smith reporter

    I should clarify though: despite the title, this issue doesn't literally break all manylinux wheels, just the ones that have symbol collisions with any of the libraries that are linked to the main pypy binary. (Usually because they use that library themselves.) So it may not affect scipy.

  9. mattip

    Returning to the proposal to use dlopen in the main executable, I understand the opposition was

    there is a problem, which is that cpyext extension modules need some way to get at the C API symbols, and I assume cffi extension modules need access to some pypy symbols as well.

    However, cffi extensions that embed an interpreter and c-extension modules link to libpypy.so, not to the executable. So what am I missing?

  10. Nathaniel Smith reporter

    c-extension modules link to libpypy.so, not to the executable.

    There is definitely something to be said for requiring C extensions to link to libpypy.so, or otherwise getting access to pypy's symbols via some mechanism that's not "they're just present in the ambient environment". (The fact that cpython doesn't do this is why, for example, Apache's mod_python can't support some websites running under py2 and others running under py3 at the same time. Python extensions assume that they can get the Python symbols from the global namespace → therefore both libpythons have to be loaded into the global namespace → therefore the symbols clash.)

    You do still have to think about symbol pollution though: if extensions link against libpypy.so, and libpypy.so links against libcrypto, then (on Linux) those extensions are now linked against whichever version of openssl pypy is linked against, which may not be what you want. This is better than when libpypy.so is linked into the main executable, because of the way the symbol search order works: the executable comes really early, while a library-that's-linked-to-a-library comes later. But you'd still want to think it through carefully.

    OTOH on macOS you can't require everyone link against libpypy.dylib, because the way macOS linking works, when you link against something you have to give its path, not just its name, so the only way to distribute an extension linked against libpypy.dylib would be if you somehow forced every system to place libpypy.dylib at the same place on disk.

  11. mattip

    An alternative to this would be to adopt the cpython approach: only link in the bare minimum of shared libraries needed to run the interpreter. Pypy3 uses cffi for the ssl module, but readelf -d shows links to libcrypt, libexpat, libbz2, libffi, libtinfo, and libgcc that are not linked in cpython.

    We could rewrite the bz2, expat, and crypt modules as cffi external modules, which would leave libffi, which we need, and libtinfo and libgcc, which I am not sure why they are needed.

  12. Armin Rigo

    If we want to go down that path, we could do it automatically too: we hack around to convince the translator to not link statically to these libraries, but instead provide patchable NULL function pointers. Then we add logic to load the function pointers at runtime from a compiled cffi extension module. The only purpose of cffi in this case is to provide these function pointers; the logic for the modules remains written in RPython instead of in pure Python.
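    A single-file sketch of the patchable-pointer idea, using zlib's crc32 as a stand-in for a function like BZ2_bzCompress (in pypy the filling would happen from a compiled cffi module rather than by hand; assumes Linux with zlib installed):

    ```c
    #include <assert.h>
    #include <dlfcn.h>
    #include <stdio.h>

    /* The translated binary would ship this slot as NULL instead of
       linking the library with -l at build time. */
    static unsigned long (*crc32_ptr)(unsigned long, const unsigned char *,
                                      unsigned int) = NULL;

    /* Runtime patching step: locate the library and fill the slot. */
    static int patch_crc32(void) {
        void *h = dlopen("libz.so.1", RTLD_NOW | RTLD_LOCAL);
        if (!h) return -1;
        crc32_ptr = (unsigned long (*)(unsigned long, const unsigned char *,
                                       unsigned int))dlsym(h, "crc32");
        return crc32_ptr ? 0 : -1;
    }

    int main(void) {
        assert(patch_crc32() == 0);
        unsigned long c = crc32_ptr(0, (const unsigned char *)"hello", 5);
        printf("crc32=%08lx\n", c);
        return 0;
    }
    ```

    The RPython-level logic stays as it is; only the final call goes through the patched pointer instead of a link-time symbol.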

  13. Nathaniel Smith reporter

    In principle at least, you don't even need cffi to call dlopen + dlsym a few times. You might be able to avoid the libffi dependency too.

  14. Armin Rigo

    Yes, but cffi in API mode is useful to find the libraries and to deal with messes like #define foo1(x) foo2(x, NULL). Some libraries (like libssl) are full of that.

  15. mattip

    So three choices:

    1. Create a struct to abstract all the cpyext functions, dlopen libpypy at startup, add shim code to fill in the struct at cpyext module import
    2. Replace patchable NULL pointers with real functions at runtime
    3. Rewrite bz2, expat, crypt in cffi and port ssl to pypy2

    It seems to me 3 is the most straightforward and might enable drive-by contributions to fix issues; 2 is probably the fewest changes in terms of lines of code, but in some ways obscure; and 1 might enable us to change cpyext without recompiling c-extensions, so we might want to do it anyway.

    Are there other considerations?

  16. Armin Rigo

    An issue with the approach 1: say we have #define PyFoobar (_my_struct.PyFoobar), and we fill in _my_struct when the extension module is loaded. This would not compile typical C extension modules, which contain static initializers like this:

        static PyTypeObject MyType = {
            /* ... */
            PyObject_GenericGetAttr,   /* tp_getattro */
            /* ... */
        };

    With the #define in place, PyObject_GenericGetAttr expands to (_my_struct.PyObject_GenericGetAttr), which is not a compile-time constant, so the C compiler rejects the static initializer.
  17. mattip

    Started by porting ssl to pypy2.

    libncurses does not show up for me in pypy2-HEAD on ubuntu 18.04, should it?

    libcrypto is apparently needed for the extended hashlib algorithms. Should we move to a cffi-based solution, or try to do some kind of lazy binding to OpenSSL_add_all_digests, OBJ_NAME_do_all?

  18. mattip

    We now link to the same shared objects as cpython, with the addition of libbz2 (in addition to libz which both link to), librt, libffi, libtinfo, and libgcc_s. Are any of those problematic?

  19. Nathaniel Smith reporter

    librt and libgcc_s are fine – those are part of the C runtime.

    libbz2, libffi, and libtinfo are potentially dubious. I wouldn't expect them to be as common a source of symbol conflicts as something like libcrypto, but I can't guarantee they won't cause problems, either ¯\_(ツ)_/¯. I guess libbz2 probably doesn't have a lot of ABI breakage, since it didn't have any releases from 2010 to 2019? But libtinfo recently changed soname along with the rest of ncurses, and I have no idea how stable libffi's ABI is.

    Up to you I guess whether you want to try to fix those now just in case, leave the issue open but wait until you have the rest of your binary wheel story sorted out to worry about them, or just close it and wait for someone to hit a problem (which who knows, might never happen).
