Split large FFI objects into multiple compilation units for the C compiler

Issue #235 new
Alex Gaynor
created an issue

As we've seen with PyPy, if you give GCC a very very large compilation unit of C code, it will take tons of memory to compile. cffi should automatically split up very large FFI objects into multiple compilation units.

This has been reported as a bug to pyca/cryptography (which has hundreds of functions):

Comments (19)

  1. Jeff Hodges

    By the way, the title of the letsencrypt ticket says "192MB" but there was a 512MB user that also couldn't get a build going. Us Let's Encrypt folks think 512MB is probably a reasonable goal for us to be compilable on in 2015 while 192MB is not. Don't feel like you have to go too crazy with the cheez-wiz.

    (LE has a goal of serving the long tail, but there are parts of that tail that are too long even for us! There's a point where we're over-serving engineer-like folks and under-serving folks who, for lack of an engineering background, have to rely on their hosting provider to work with LE instead. Tradeoffs!)

  2. Armin Rigo

    On my system, I need 300MB to compile the .c source, which is 2.2MB of C code, or 70'000 lines. (If anything is excessive, it is gcc's own memory usage, imho...)

    It is not immediately clear how to split this .c file, because it is guaranteed that whatever you say in set_source() appears before the generated pieces. And duplicating this user-specified source into several .c files might lead to all sorts of issues... Measuring, I see that the main part of the memory is taken by compiling the numerous small functions generated. These must all be implemented after the user code from set_source(), so I don't see any obvious solution.
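
    For anyone who wants to reproduce this kind of measurement, here is a minimal sketch, not necessarily how the numbers above were obtained. It assumes a Unix system (the standard `resource` module is not available on Windows), and the helper name `peak_child_rss` is made up for illustration.

```python
import resource
import subprocess


def peak_child_rss(cmd):
    """Run cmd to completion and return the peak resident set size
    recorded for this process's finished children.

    The unit is platform dependent: kilobytes on Linux, bytes on macOS.
    The value is a high-water mark over all children waited for so far,
    so use a fresh Python process for each command you want to compare.
    """
    subprocess.run(cmd, check=True)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
```

    A call such as `peak_child_rss(["gcc", "-c", "generated.c"])` (file name hypothetical) would report the peak for one compilation unit, which makes it possible to compare a single large file against several smaller ones.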

  3. Alex Gaynor reporter

    Hmmm, what sorts of issues can come up from duplicating set_source:

    • Duplicate function definition (linker error?)
    • Duplicating assignment to globals. Not sure what happens?

    Is there anything else?

  4. Armin Rigo

    Let's think about it anyway, assuming that the user gives some "safe_to_split=True" flag and never defines any non-static function, and no global variable (static or not). We'd potentially have duplicate static functions, which is strange but not the end of the world. Each .c source generates a subset of all functions needed by cffi. We need to put them all in the big tables at some point. Suddenly, these functions can't be static any more. We could use the gcc trick of __attribute__((visibility("hidden"))) to hide them anyway from the outside of the .so... This is all a refactoring that smells haaaack all over the place, if you ask me.

  5. Armin Rigo

    Maybe you could split the ffi into pieces, and have a final ffi that is defined by ffi.include() of the others. Then you get several .so's, but you can import the final one only and it should expose all the others. If you're feeling nicer you can even split it into several .so's that don't include each other, and import them lazily, and use the right ffi/lib at the right place...
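
    A rough sketch of that idea, assuming the declarations can be grouped freely. The helpers `split_declarations` and `build_combined_ffi` are hypothetical and not part of cffi, and so are the module names `_piece0`, `_piece1`, ... and `_all`. Only `FFI()`, `cdef()`, `set_source()` and `include()` are real cffi calls.

```python
def split_declarations(decls, max_per_piece):
    """Slice a list of C declarations into consecutive pieces."""
    if max_per_piece < 1:
        raise ValueError("max_per_piece must be at least 1")
    return [decls[i:i + max_per_piece]
            for i in range(0, len(decls), max_per_piece)]


def build_combined_ffi(decls, max_per_piece, c_source):
    """Return (pieces, final): one FFI per piece of declarations, plus a
    final FFI that include()s every piece."""
    import cffi  # imported lazily so split_declarations works without cffi

    pieces = []
    for n, chunk in enumerate(split_declarations(decls, max_per_piece)):
        ffibuilder = cffi.FFI()
        ffibuilder.cdef("\n".join(chunk))
        # Hypothetical module names: _piece0, _piece1, ...
        ffibuilder.set_source("_piece%d" % n, c_source)
        pieces.append(ffibuilder)

    final = cffi.FFI()
    for ffibuilder in pieces:
        final.include(ffibuilder)
    final.set_source("_all", c_source)
    return pieces, final
```

    This naive slicing only works if no piece refers to a type declared in another piece. Otherwise the later piece would itself have to include() the earlier one, or the declarations would have to be grouped by hand.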

  6. Armin Rigo

    Yes. You'd get several .so's though. Maybe indeed we can have a way to compile all the .o's, built from several ffi objects, into a single .so. Then you import it and it gives you the sum of all the pieces, much like an extra ffi that would include all others...

  7. Armin Rigo

    Maybe you should try the ffi.include() approach. I tried this:

    File x_build.py:

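    import cffi
    ffibuilder = cffi.FFI()
    ffibuilder.cdef("""
        typedef int foo_t;
        #define FOO ...
        int foo(int x);
    """)
    ffibuilder.set_source("_x", """
        typedef int foo_t;
        #define FOO 42
        static int foo(int x) { return x + 42; }
    """)
    if __name__ == "__main__":    # do not compile when imported by z_build.py
        ffibuilder.compile()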

    File y_build.py:

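    import cffi
    ffibuilder = cffi.FFI()
    ffibuilder.cdef("""
        typedef struct { ...; } bar_t;
        #define BAR ...
        int bar(int x);
    """)
    ffibuilder.set_source("_y", """
        typedef struct { long x,y,z; } bar_t;
        #define BAR 2
        static int bar(int x) { return x * 2; }
    """)
    if __name__ == "__main__":    # do not compile when imported by z_build.py
        ffibuilder.compile()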

    File z_build.py:

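    import cffi
    import x_build
    import y_build
    ffibuilder = cffi.FFI()
    ffibuilder.include(x_build.ffibuilder)    # expose foo_t, FOO, foo()
    ffibuilder.include(y_build.ffibuilder)    # expose bar_t, BAR, bar()
    ffibuilder.set_source("_z", """
        typedef struct { long x,y,z; } bar_t;
    """)
    if __name__ == "__main__":
        ffibuilder.compile()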

    Note that we need to repeat the typedef for bar_t, but it should be automatic if you say #include <...> everywhere. We get three different .so files, but a program can import only the last one and access all functions, constants and types from it. If there is something missing it's probably not intended.

    Yes, we could think about a way to merge the three .so's into one, but I suppose that it's secondary at this point.
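
    To check the claim that the last module exposes everything, a small sketch like the following could be used once all three modules have been built. The function name `exercise_combined_lib` is made up. It only assumes that `lib` has the attributes declared in the build scripts above.

```python
def exercise_combined_lib(lib):
    """Collect values reached through a single lib object.

    `lib` is expected to be the lib of the final module,
    e.g. obtained with `from _z import lib`.
    """
    return {
        "foo(1)": lib.foo(1),    # function defined in _x
        "FOO": lib.FOO,          # constant defined in _x
        "bar(21)": lib.bar(21),  # function defined in _y
        "BAR": lib.BAR,          # constant defined in _y
    }
```

    With the C sources shown above, `exercise_combined_lib(lib)` after `from _z import lib` should give 43 for foo(1), 42 for FOO, 42 for bar(21) and 2 for BAR.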

  8. reaperhulk

    Okay, I've spent a bit of time investigating this and run into a few issues.

    • When compiling the extensions using setuptools you can't use ext_package in the setup function and must instead specify the fully qualified module name (e.g. mypkg._z)
    • Modules compiled as above do not inherit all the functions.

    I've built a small project that can demonstrate this behavior that you can grab from here: https://i.langui.sh/cffi-test-dUp3Q7h6Kl.tgz

    If you pip install -e . and then try to do python -c "from mypkg._z import lib;lib.foo(42)" it will fail since foo is not available on _z even though it is when compiled directly as in your example.
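
    For readers without the tarball, this is a sketch of what the setuptools side of such a project might look like. It is not the content of the demo project. The package name `mypkg` comes from the description above, while the helper `setup_kwargs`, the location of the build scripts next to setup.py, and the version pin are assumptions. `cffi_modules` is the real cffi/setuptools hook and takes "path/to/build_script.py:variable" strings.

```python
def setup_kwargs():
    """Keyword arguments for setuptools.setup() in a hypothetical mypkg project.

    Assumes each build script calls set_source() with the fully qualified
    module name ("mypkg._x", "mypkg._y", "mypkg._z") instead of relying on
    ext_package.
    """
    return {
        "name": "mypkg",
        "packages": ["mypkg"],
        "setup_requires": ["cffi>=1.0.0"],
        "install_requires": ["cffi>=1.0.0"],
        # One entry per FFI object: "build script path:variable name".
        "cffi_modules": [
            "x_build.py:ffibuilder",
            "y_build.py:ffibuilder",
            "z_build.py:ffibuilder",
        ],
    }
```

    In setup.py this would be used as `setup(**setup_kwargs())`. Because z_build.py imports x_build and y_build, the directory containing the build scripts has to be importable while setup.py runs. That holds when running `python setup.py` from that directory, but it is an assumption for other build front ends.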

  9. reaperhulk

    I am also unable to reproduce, which is confusing me quite a bit :) I have no idea what I was doing before (especially as the demo project works fine for me now). Sorry Armin!

  10. Armin Rigo

    And can you expand on the problem with ext_package? Not sure I'm following what you said. (Note: there is a typo in one place of the cffi doc, it's called ext_package and not ext_packages)
