is way slower than it should be; it should use calloc

Issue #295 resolved
Nathaniel Smith created an issue

Requests and PyOpenSSL recently ran into an issue where cffi's default allocator was causing pathological slowdowns due to memory zeroing, specifically in cases where they were allocating a large buffer but then only using a small portion of it:

Switching to a non-zeroing allocator produced real-world dramatic speedups (see the last link in particular for benchmarks).

I thought this was very odd, because on any even slightly modern system, allocating a large zeroed buffer should be just as cheap as allocating a large non-zeroed buffer, because of how calloc is implemented. E.g. on glibc, any allocation greater than 128 KiB (by default) is satisfied by directly asking the kernel for more memory via mmap, and the kernel always returns zeroed memory (for security reasons). BUT since it's the kernel, it does this in a clever way: it maps in a bunch of CoW views of the system zero page, so allocating N pages of zeroed memory is an O(1) operation, and the actual zeroing only happens lazily as the memory is accessed. And calloc knows when it's satisfying an allocation via mmap, so in this case it just returns the memory directly, and is also O(1). This means calloc is wayyyyy faster than malloc+memset: memset eagerly faults in all those pages, the kernel zeroes them, and then memset zeroes them again -- and then in the case requests/pyopenssl ran into, most of those pages just get thrown away without ever being touched again. It's a huge waste of time.
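A rough way to see the difference described above is to time both allocation strategies from Python through ctypes. This is only a sketch: it assumes a POSIX system where `ctypes.CDLL(None)` exposes the C library's `malloc`, `calloc`, `memset` and `free`; the 64 MiB size and the helper names (`time_calloc`, `time_malloc_memset`) are made up for illustration, and the actual timings depend on the libc and kernel in use.

```python
import ctypes
import time

# Assumption: POSIX, where CDLL(None) gives access to the C library's symbols.
libc = ctypes.CDLL(None)
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.calloc.restype = ctypes.c_void_p
libc.calloc.argtypes = [ctypes.c_size_t, ctypes.c_size_t]
libc.memset.restype = ctypes.c_void_p
libc.memset.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.c_size_t]
libc.free.restype = None
libc.free.argtypes = [ctypes.c_void_p]

SIZE = 64 * 1024 * 1024  # illustrative; well above glibc's default 128 KiB mmap threshold


def _malloc_memset(size):
    """Emulate calloc the slow way: malloc, then eagerly zero every byte."""
    ptr = libc.malloc(size)
    if ptr:
        libc.memset(ptr, 0, size)
    return ptr


def _timed(allocate, size):
    """Time one allocation; return (seconds, value of the buffer's first byte)."""
    start = time.perf_counter()
    ptr = allocate(size)
    elapsed = time.perf_counter() - start
    if not ptr:
        raise MemoryError("allocation of %d bytes failed" % size)
    first_byte = ctypes.c_ubyte.from_address(ptr).value
    libc.free(ptr)
    return elapsed, first_byte


def time_calloc(size=SIZE):
    return _timed(lambda n: libc.calloc(1, n), size)


def time_malloc_memset(size=SIZE):
    return _timed(_malloc_memset, size)


if __name__ == "__main__":
    print("calloc:        %.6f s" % time_calloc()[0])
    print("malloc+memset: %.6f s" % time_malloc_memset()[0])
```

Both helpers hand back zeroed memory; only the malloc+memset variant touches every page up front, which is where the extra time is expected to go.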

Unfortunately, cffi's default allocator emulates calloc using malloc+memset. It should use calloc instead.

There is one thing that makes this slightly tricky, which is that right now the default allocator uses PyObject_Malloc instead of calling malloc directly. CPython 3.5 provides a PyObject_Calloc, but earlier versions do not. So on earlier versions, the only way to get the benefits of calloc is to switch to using calloc/free directly instead of the PyObject_* wrappers. This seems like a plausibly good idea (I'm dubious that the PyObject_* wrappers are providing much value?), but I haven't benchmarked it or anything. On 3.5+ though it's a no-brainer.

Comments (5)

  1. Armin Rigo

    Did you try to use a non-default allocator (ffi.new_allocator)? A version based on calloc() as you describe would work; or possibly some more complicated system to avoid even going through the OS at every single allocation and free.

    But I see the point anyway and it should be solved more transparently. The problem is that I never fully understood the mess of PyObject_Malloc / PyObject_New / PyObject_Free / PyObject_Del and their uppercase macro variants. I guess it is wrong to call calloc() and have the destructor of the object call type->tp_free, which is initialized to either PyObject_Del or PyObject_GC_Del. Or maybe it is not wrong if type->tp_free is manually set to point to free. I guess I need to try and run all tests in a debug-mode Python (with PYMALLOC_DEBUG).

  2. Armin Rigo

    Tried to implement it in b6adad5f4ea3. Only changes the kind of cdata objects, to allocate them with malloc()/calloc() and replace their type->tp_free with free. It seems to pass all tests.

  3. Armin Rigo

    Note that code should avoid allocating large buffers too eagerly anyway, because on non-refcounted implementations of Python (i.e. PyPy), the objects are only freed when the GC runs. The fix here makes them at least not grab much actual memory, which is good, but the problem is that on 32-bit the address space of the process might be quickly exhausted.

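The ffi.new_allocator workaround mentioned in the first comment might look roughly like the following. This is a sketch rather than what cffi does internally: it assumes a POSIX system where `ffi.dlopen(None)` loads the standard C library, and the names `_calloc_bytes`, `new_zeroed` and `alloc_buffer` are hypothetical. Because calloc() already returns zeroed memory, `should_clear_after_alloc=False` tells cffi to skip its own clearing pass.

```python
from cffi import FFI

ffi = FFI()
ffi.cdef("""
    void *calloc(size_t nmemb, size_t size);
    void free(void *ptr);
""")
# Assumption: POSIX, where dlopen(None) gives access to the standard C library.
libc = ffi.dlopen(None)


def _calloc_bytes(size):
    """Allocation callback: cffi passes the total size in bytes."""
    return libc.calloc(1, size)


# calloc() already zeroes the memory, so ask cffi not to clear it a second time.
new_zeroed = ffi.new_allocator(
    alloc=_calloc_bytes,
    free=libc.free,
    should_clear_after_alloc=False,
)


def alloc_buffer(nbytes):
    """Return a zero-initialised char[] of nbytes, released with free()."""
    return new_zeroed("char[]", nbytes)
```

The object returned by `alloc_buffer` behaves like one from `ffi.new("char[]", n)`; the difference is only in how the underlying memory is obtained and released.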