cffi.new is way slower than it should be; it should use calloc
Requests and PyOpenSSL recently ran into an issue where cffi's default allocator was causing pathological slowdowns due to memory zeroing, specifically in cases where they were allocating a large buffer but then only using a small portion of it:
Switching to a non-zeroing allocator produced dramatic real-world speedups (see the last link in particular for benchmarks).
I thought this was very odd, because on any even slightly modern system, allocating a large zeroed buffer should be just as cheap as allocating a large non-zeroed buffer, because of how calloc is implemented. E.g. on glibc, any allocation greater than 128 KiB (by default) is satisfied by directly asking the kernel for more memory via mmap, and the kernel always returns zeroed memory (for security reasons). BUT since it's the kernel, it does this in a clever way: it maps in a bunch of CoW views of the system zero page, so allocating N pages of zeroed memory is an O(1) operation, and the actual zeroing only happens lazily as the memory is accessed. And calloc knows when it's satisfying an allocation via mmap, so in this case it just returns the memory directly, and is also O(1). This means calloc is wayyyyy faster than malloc + memset: memset eagerly faults in all those pages, the kernel zeroes them, and then memset zeroes them again -- and then in the case requests/pyopenssl ran into, most of those pages just get thrown away without ever being touched again. It's a huge waste of time.
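To make the difference concrete, here is a rough timing sketch that calls the C library through ctypes. It is not cffi's code. It assumes a POSIX system where `ctypes.CDLL(None)` exposes `malloc`, `calloc`, `memset` and `free`; the helper names `timed_calloc` and `timed_malloc_memset` are made up for this illustration, and the numbers will vary by platform and allocator.

```python
import ctypes
import time

# Assumption: a POSIX system where dlopen(NULL) exposes the C library
# (typical Linux and macOS Python builds; this does not work on Windows).
libc = ctypes.CDLL(None)
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.calloc.restype = ctypes.c_void_p
libc.calloc.argtypes = [ctypes.c_size_t, ctypes.c_size_t]
libc.memset.restype = ctypes.c_void_p
libc.memset.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.c_size_t]
libc.free.restype = None
libc.free.argtypes = [ctypes.c_void_p]


def timed_calloc(size):
    """Allocate `size` zeroed bytes with calloc; return (seconds, first 16 bytes)."""
    start = time.perf_counter()
    ptr = libc.calloc(1, size)
    elapsed = time.perf_counter() - start
    if not ptr:
        raise MemoryError(size)
    head = ctypes.string_at(ptr, 16)  # read after timing, touches one page only
    libc.free(ptr)
    return elapsed, head


def timed_malloc_memset(size):
    """Allocate with malloc, then zero eagerly with memset; same return shape."""
    start = time.perf_counter()
    ptr = libc.malloc(size)
    if not ptr:
        raise MemoryError(size)
    libc.memset(ptr, 0, size)  # writes every byte, so every page is faulted in
    elapsed = time.perf_counter() - start
    head = ctypes.string_at(ptr, 16)
    libc.free(ptr)
    return elapsed, head


if __name__ == "__main__":
    size = 64 * 1024 * 1024  # far above glibc's default 128 KiB mmap threshold
    t_calloc, _ = timed_calloc(size)
    t_memset, _ = timed_malloc_memset(size)
    print("calloc:        %.6f s" % t_calloc)
    print("malloc+memset: %.6f s" % t_memset)
```

Both paths hand back zeroed memory; the only thing being compared is whether the zeroing cost is paid up front for every page or deferred until a page is actually touched.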
Unfortunately, cffi's default allocator emulates calloc by doing malloc + memset. It should use calloc.
There is one thing that makes this slightly tricky, which is that right now the default allocator uses PyObject_Malloc instead of calling malloc directly. CPython 3.5 provides a PyObject_Calloc, but earlier versions do not. So on earlier versions, the only way to get the benefits of calloc is to switch to using calloc/free directly instead of the PyObject_* wrappers. This seems like a plausibly good idea (I'm dubious that the PyObject_* wrappers are providing much value?), but I haven't benchmarked it or anything. On 3.5+ though it's a no-brainer.
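The version split could look roughly like the sketch below. cffi's allocator is written in C, so this Python/ctypes version only illustrates the selection logic: use `PyObject_Calloc` where the interpreter exports it, otherwise fall back to the C library's `calloc`/`free`. The function name `pick_zeroing_allocator` is hypothetical, and the fallback assumes a POSIX system where `ctypes.CDLL(None)` exposes libc.

```python
import ctypes


def pick_zeroing_allocator():
    """Return (name, alloc, free); alloc(nbytes) yields the address of zeroed memory."""
    api = ctypes.pythonapi  # calls made through pythonapi keep the GIL held
    if hasattr(api, "PyObject_Calloc"):
        # CPython 3.5+ exports PyObject_Calloc alongside PyObject_Free.
        name = "PyObject_Calloc"
        alloc_fn, free_fn = api.PyObject_Calloc, api.PyObject_Free
    else:
        # Assumption: POSIX fallback, bypassing the PyObject_* wrappers.
        libc = ctypes.CDLL(None)
        name = "calloc"
        alloc_fn, free_fn = libc.calloc, libc.free

    # Both calloc variants take (nelem, elsize) and return a pointer.
    alloc_fn.restype = ctypes.c_void_p
    alloc_fn.argtypes = [ctypes.c_size_t, ctypes.c_size_t]
    free_fn.restype = None
    free_fn.argtypes = [ctypes.c_void_p]

    def alloc(nbytes):
        ptr = alloc_fn(1, nbytes)
        if not ptr:
            raise MemoryError(nbytes)
        return ptr

    return name, alloc, free_fn


if __name__ == "__main__":
    name, alloc, free = pick_zeroing_allocator()
    ptr = alloc(1 << 20)
    print(name, ctypes.string_at(ptr, 8))
    free(ptr)
```

Memory must be released with the `free` that matches the chosen `alloc`, which is why the two are returned as a pair.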