Creating a cdata from a buffer

Issue #47 resolved
Simon Sapin
created an issue

ffi.buffer is great to get read/write access to a chunk of C memory from Python, but something is needed the other way around.

cffi should have a ffi.from_buffer(ctype, obj) method that would accept any object that supports the buffer protocol and return a cdata that holds a reference to obj and points to the same data.

If obj is read-only, maybe from_buffer could only allow creating cdata of a "const" type (though I don’t know if "const" means anything to cffi.)

I would use this to implement ImageSurface.create_for_data in cairocffi, so that a cairo image could share memory with a bytearray or a numpypy array.

Comments (41)

  1. Armin Rigo

    The idea is simple enough on CPython, but will hit the same problem on PyPy as ctypes.c_char.from_buffer(). The issue is that Python objects within PyPy don't natively have a fixed address, because they can move. While array.array() is not a problem (it uses malloc() anyway), we'd have issues in exposing a bytearray as a raw C pointer. Would a partial implementation be considered reasonable? E.g. bytearrays wouldn't work, but array.arrays would work?

  2. Simon Sapin reporter

    Does the same issue exist in cpyext with eg. PyObject_AsReadBuffer? What happens there?

    There are two different use cases:

    • Read-only buffers, eg. from a bytes() object. In this case it’s acceptable to make a copy (as ffi.new('char[]', obj) does, I think) although avoiding it if possible would be nice, in cases where we know the buffer is not written to. (Of course this is unsafe but so is a lot in cffi.) Would be nice to accept "buffer-like" objects other than bytes, and copy once rather than twice.
    • Read-write buffers. I guess we really need a fixed address. Something that only works on some object types is better than nothing :) Would numpypy arrays work?
  3. Simon Sapin reporter

    In both cases (read-only and read-write) ffi.buffer() objects should be accepted too.

    Also, return not just the pointer but also the length in bytes. len(array) might be a number of multi-byte elements.
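
To illustrate the point about lengths, here is a minimal sketch using only the standard library. It assumes an array.array of C ints as the buffer object; the variable names are illustrative and not part of any cffi API.

```python
import array

# An array of four C ints; each element occupies a.itemsize bytes.
a = array.array("i", [1, 2, 3, 4])
view = memoryview(a)

n_items = len(a)        # number of elements, not bytes
n_bytes = view.nbytes   # size of the underlying memory in bytes

# Casting the view to unsigned bytes gives a length measured in bytes.
byte_view = view.cast("B")
```

A C function that takes a pointer and a size would need n_bytes here, not n_items.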

  4. Simon Sapin reporter

    A work-around is array.array.buffer_info(). I’m now using this:

    import ctypes

    def from_buffer(obj):
        """Return ``(address, length)``, with the length measured in bytes."""
        if hasattr(obj, 'buffer_info'):
            # Looks like an array.array object.
            # buffer_info() returns (address, number of elements).
            address, length = obj.buffer_info()
            return address, length * obj.itemsize
        else:
            # Other buffers.
            # XXX Unfortunately ctypes.c_char.from_buffer
            # does not have length information,
            # and we’re not sure that len(obj) is measured in bytes.
            # (It’s not for array.array, though that is taken care of.)
            return ctypes.addressof(ctypes.c_char.from_buffer(obj)), len(obj)
    

    So cairocffi now supports at least arrays on PyPy. Still, it would be good to support any buffer as it is possible in cpyext through PyObject_AsWriteBuffer.
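
The array.array branch of the workaround can be checked with a short standard-library sketch. It assumes CPython semantics for array.array.buffer_info() and uses ctypes only to show that the returned address really aliases the array's memory. The address is only meaningful as long as the array is alive and not resized.

```python
import array
import ctypes

# Three C shorts; typecode "h" matches ctypes.c_short.
a = array.array("h", [1, 2, 3])

# buffer_info() returns the address and the number of elements (not bytes).
address, n_items = a.buffer_info()
n_bytes = n_items * a.itemsize

# View the same memory through ctypes, without copying.
raw = (ctypes.c_short * n_items).from_address(address)

# Writing through the ctypes view changes the array, since both share memory.
raw[0] = 7
```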

  5. Amaury Forgeot d'Arc

    It's already the case. PyObject_AsWriteBuffer only works for objects defined in C extension modules (if they provide a tp_as_buffer->bf_getreadbuffer slot), and fails for all standard types.

    Strings support the read-only PyObject_AsReadBuffer, but by copying the content to a fixed location. No magic here...

  6. Simon Sapin reporter

    Maybe it’s not PyObject_AsWriteBuffer, but how does this work?

    >>>> a = bytearray(10)
    >>>> open('/dev/urandom','rb').readinto(a)
    10
    >>>> a
    bytearray(b'^\xe3Q\x8bJ9\xfb\xd7=\xf6')
    

    Maybe a can move, but file.readinto is special and keeps it at the same location during the call?

    More generally, is there such a thing as “The buffer protocol” on PyPy?
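
The Python-level behaviour asked about here can be reproduced without /dev/urandom. The sketch below uses io.BytesIO as a stand-in for the file (an assumption made for portability) and shows readinto() filling a bytearray in place, both directly and through a memoryview slice. It only demonstrates the observable behaviour and says nothing about how PyPy keeps the memory in place during the call.

```python
import io

a = bytearray(10)
src = io.BytesIO(b"0123456789abcdef")

# readinto() writes into the existing bytearray and returns the byte count.
n = src.readinto(a)
first_fill = bytes(a)

# A memoryview slice is also a writable buffer, so only part of the
# bytearray is overwritten by the second call.
m = memoryview(a)
n2 = src.readinto(m[:3])
```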

  7. David Wilson

    Alex Gaynor noted on IRC that since this ticket was filed, the GC has grown support for pinning, which might be the missing link to sharing more buffers or buffer-like types.

    Selfishly mentioning it here to trigger some Bitbucket emails that might attract comments from the right people. ;)

  8. Armin Rigo

    It needs thinking again, but it's still not completely obvious, because pinning works along a "best-effort" approach and may fail. It's not meant to be used for pinning objects for extended periods of time, and if you do, you reach the maximum number of pinned objects quickly.

  9. David Wilson

    At least for any use cases I had, pinning for the duration of the function call would be fine (assuming the pin/unpin is cheaper than a copy), though it sounds like Simon's use case would be more of the long-lived variety.

    Perhaps a suitably scarily named wrapper function, "ffi.unstable_pointer_for(...)"? :)

  10. Armin Rigo

    So, the main limitation to avoid making it inefficient on PyPy would be: you can call ffi.from_buffer(x) as long as x is not a bytearray. I think we can live with this limitation (and check it on top of CPython too).

    What GC pinning added is the ability to "very often" pass a PyPy string object directly to a call, without copying it first.

    These are two different issues that should probably both be implemented. :-)

  11. simonzack

    I would love to have this too. Including anything except for bytearray works for my use case, as I need to convert buffers returned from ffi to types again.

  12. simonzack

    Thanks, I've compiled it locally and it appears to work, but haven't tested it on my projects yet. Would it be reasonable to add a ctype parameter? I would use this to parse file structures on disk combined with readinto to save memory. I think that having this parameter would also be closer to the other ways of constructing cdata objects using ffi.new and ffi.cast.

  13. Armin Rigo

    You can always do p1 = ffi.from_buffer(x); p2 = ffi.cast('struct foo_s *', p1). But also, are you sure you need from_buffer() in your case? It is only useful if you have a buffer/memoryview-enabled object from elsewhere, which is of some type that you don't control. If you can control the type of object whose buffer is taken, then you can as well do p3 = ffi.new("struct foo_s[]", n) to allocate that object in the first place. Then you use file.readinto(ffi.buffer(p3)).
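
Both routes described in this comment can be sketched as follows. The layout of struct foo_s (two ints) and the in-memory stream are assumptions made for illustration; the thread does not define them. Note that the result of ffi.cast() does not keep the buffer alive, so p1 (or x) must stay referenced while p2 is in use.

```python
import array
import io
import struct

from cffi import FFI

ffi = FFI()
# Assumed layout for illustration only.
ffi.cdef("struct foo_s { int a; int b; };")

# Route 1: a buffer object from elsewhere, viewed as a struct.
x = array.array("i", [10, 20])
p1 = ffi.from_buffer(x)               # char[] cdata sharing x's memory
p2 = ffi.cast("struct foo_s *", p1)   # same address, struct view
first_a = p2.a
p2.b = 99                             # writes through to x[1]

# Route 2: allocate with ffi.new() and read straight into it.
n = 3
p3 = ffi.new("struct foo_s[]", n)
stream = io.BytesIO(struct.pack("6i", 1, 2, 3, 4, 5, 6))  # stands in for a file
nread = stream.readinto(ffi.buffer(p3))
```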

  14. simonzack

    Sorry if I wasn't clear enough, but in my use case, a lot of the windows structs also have ANYSIZE_ARRAY in them. This means that if the buffer is shared, I could use the field directly (which points to the buffer). Otherwise I would need to allocate the struct to first determine its size (which is stored in a field), then allocate another with the correct size, which can waste memory depending on the size. I could also compute the field's offset using offsetof and manually find the location within the existing buffer. Neither seems very desirable to me.

    You're right that I could use cast, but I just thought that having a ctype parameter would be more convenient.

    Casting to the struct using the buffer is what I would do in C and it would be nice to do it in the same way in python.

  15. Armin Rigo

    Can you paste code examples? I don't really follow you. If you're talking about variable-sized structures, surely you can't have an array of them? Or do you want to support ffi.from_buffer("struct foo_s", x) as well, and not only array types? This seems to be too much of a special case and adds complications to the C code of CFFI. Better to have the simple char[] array and do computations on it, like adding some offset and casting to some struct type, explicitly.

  16. simonzack

    What I would like is to do p1 = ffi.from_buffer(x); p2 = ffi.cast('struct foo_s *', p1) using p2 = ffi.cast('struct foo_s *', x). By array I mean the variable struct's last field. Here's a code example:

    from cffi import FFI
    
    ffi = FFI()
    ffi.cdef('''
        typedef struct _TOKEN_PRIVILEGES {
            DWORD               PrivilegeCount;
            LUID_AND_ATTRIBUTES Privileges[ANYSIZE_ARRAY];
        } TOKEN_PRIVILEGES, *PTOKEN_PRIVILEGES;
    ''')
    
    buffer = memoryview(bytearray(0x1000))
    some_stream.readinto(buffer)
    tp = ffi.cast("PTOKEN_PRIVILEGES", buffer)
    if tp.PrivilegeCount > 0:
        print(tp.Privileges[0])
    

    If I didn't have cast this would be a lot more inconvenient and I would need to unnecessarily allocate a TOKEN_PRIVILEGES struct.

  17. Armin Rigo

    As I said above, your case can be dealt with more easily by allocating the memory with ffi.new() instead of using bytearray(). Then there is no need for any of this and it works in old versions of cffi. Example:

    from cffi import FFI
    
    ffi = FFI()
    ffi.cdef('''
        typedef struct {
            int PrivilegeCount;
            int Privileges[];
        } foo_t;
    ''')
    
    tp = ffi.new('foo_t *', {"Privileges": 100})
    some_stream.readinto(ffi.buffer(tp))
    if tp.PrivilegeCount > 0:
        print(tp.Privileges[0])
    
  18. simonzack

    Yes that would work. I would certainly use the latter when I need a fresh copy, but I think the former uses less memory, when buffer is reused to read more data.

  19. Matthias Geier

    Thanks for adding from_buffer(), that's really useful!

    Would it be possible to add support for passing buffers directly to C functions?

    For example, if I have a C function taking an "unsigned char*", I can pass b"xyz" or (1, 2, 3) or a cdata object (if it has the correct type). It would be great if I could also pass a buffer object, which could be internally converted to a cdata object using from_buffer() (without copying the data, of course).

  20. Matthias Geier

    I have no clue what's going on under the hood when I call a C function via CFFI. Could you please give me a pointer to where in the code this happens?

    I guess there must be some kind of if statement that differentiates between bytes, list, tuple and CData inputs, right?

    I have also no clue about the complexity this feature would entail, but it would be definitely a very convenient and meaningful feature.

    I'm working on CFFI wrappers of a few libraries, all of which return pointers to some memory at some point and expect pointers to memory at some other point. If I want low-level access to the memory from within Python, I wrap them in a ffi.buffer(), which works great for reading and writing. However, if I want to pass some memory to a function that expects a pointer, I cannot simply pass a buffer, I'll have to wrap it with ffi.from_buffer() (or I have to make a copy with bytes(), which I would like to avoid).

    I don't want to expose ffi to the users of my wrapper, they should just get a Python buffer and they should be able to pass along such a buffer to a different function. I guess I can check with isinstance(my_input, _cffi_backend_buffer) (or something more meaningful? something that also allows non-CFFI buffers?) and then call ffi.from_buffer() in my wrapper code.

    But I thought since CFFI is already checking the types of inputs, it wouldn't hurt to add another case there ...?

  21. Armin Rigo

    As a general answer, yes, the way to design an API using CFFI is to hide CFFI behind some Pythonic abstraction layer. If you expect some function to be called with either a str/bytes or a buffer, you can check the type explicitly in your wrapper code. But I see your point too. It would not be impossible to fix, but involves at least two places: in the _cffi_backend.c, and in the C modules generated by vengine_*.py. The latter involves a compatibility breakage which I tried to avoid so far for the next release of CFFI.

  22. Matthias Geier

    Thanks for the pointers, I had a look at the code, but the Python C API and all this ctypes stuff is quite alien to me and I don't really get the bigger picture ...

    But anyway, breaking compatibility doesn't sound good.

    Since implementing this within CFFI seems not feasible right now, what's the best way to solve this in my Python code?

    1. Shall I try to call ffi.from_buffer() in the beginning and just continue with the original input data if this raises an error?

    2. Or shall I check the type of the input first? How to check for a buffer (any kind, including CFFI and, e.g., NumPy arrays)?

    3. Or shall I just try to call the C function and re-try it with ffi.from_buffer() if it raises an error?

    The first one seems quite Pythonic to me, but probably I'm missing something, probably there is even another way ...?

  23. Armin Rigo

    I think you should just say if not isinstance(x, str): x = ffi.from_buffer(x) (or bytes for Python 3). The other cases where a non-buffer might be accepted in the function call are probably not relevant in your case. E.g. users of your API don't expect a list of small integers to be acceptable. Maybe you want to also check for ffi.CData, but that's up to which exact design you are looking for.

  24. Matthias Geier

    Actually, I do want to accept lists of small integers at least in one case! The user is dealing with a sequence of bytes (MIDI data), so this is extremely convenient.

    I tried my option 1 from above and it seems to work quite well. I'm basically doing this before calling a C function:

    try:
        data = ffi.from_buffer(data)
    except AttributeError:
        pass
    except TypeError:
        pass
    

    Here is the actual code if someone is interested: https://github.com/spatialaudio/jackclient-python/commit/ac867386354c765090c72c200f5e9b6f0c2a2f3e

    I think this is a good solution for my case. If you add support for buffers at a later time, that's great, but using this work-around, I can also live without it.

    Thanks for your help!
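
The approaches from the last two comments can be combined in one small wrapper-side helper. The name as_c_argument is hypothetical and not part of cffi; the sketch assumes that bytes and existing cdata objects should be passed through untouched, and that anything ffi.from_buffer() rejects (a list of small integers, for instance) is handed to the C call as-is.

```python
import array

from cffi import FFI

ffi = FFI()


def as_c_argument(data):
    """Hypothetical helper: wrap buffer objects, pass everything else through."""
    # bytes and cdata are already accepted directly by C function calls.
    if isinstance(data, (bytes, ffi.CData)):
        return data
    try:
        # Zero-copy view for objects that support the buffer protocol.
        return ffi.from_buffer(data)
    except (AttributeError, TypeError):
        # Not a buffer (e.g. a list or tuple of small integers).
        return data


from_bytes = as_c_argument(b"xyz")
from_list = as_c_argument([1, 2, 3])
from_array = as_c_argument(array.array("B", [1, 2, 3]))
```

The object passed in must stay alive for as long as the returned cdata is used by C code.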

  25. Armin Rigo

    I'm taking advantage of my double hat as a CFFI and as a PyPy developer: we can't accept bytearray on CPython and refuse it on PyPy. Implementing it on PyPy would be a lot of trouble, so I'm happy to ignore it by declaring bytearray illegal here.

    Let's make release 0.9 official. (PyPy 2.5.0 already includes it, even if cffi.__version__ returns '0.8.6' by mistake...)

  26. Cory Benfield

    Armin, this discussion seems to suggest that bytearray is forbidden, but that memoryview objects wrapping bytes objects and similar are fine, thanks to PyPy GC pinning. However, you added code in response to issue #187 that explicitly forbade that logic.

    Is there any reason GC pinning is insufficient here?

  27. Armin Rigo

    I'm sorry if I implied that the GC is the reason why memoryviews wrapping bytes objects are fine but not directly bytes objects. This is wrong. With GC pinning we could accept both, given enough effort. But anyway we couldn't accept bytearray or memoryviews wrapping bytearray in the current PyPy. The reason is that GC pinning may fail; if it does, hopefully in the rare case, we can always make a read-only regular copy of a bytes, but we can't make a "read-write copy" of a bytearray.

  28. Cory Benfield

    Ah, that makes sense. Is there any value in trying to support the bytes and memoryview-wrapping-bytes objects in the next release, or should I just accept that it's going to be easier to make a copy of the buffer? (Sadly, in my implementation the buffer is potentially 65kB large and that's a copy I'd really rather not make.)

  29. Armin Rigo

    Can you explain concretely what you're trying to do? Do you simply want to pass a bytes to a function call? Or do you really need ffi.from_buffer("foobar")? Only the former has chances to work in a future version of PyPy. The latter would require long-term pinning of the string "foobar", which is harder.

  30. Cory Benfield

    Sure. I can go one better and give you some excruciating detail. =)

    I'm writing a CFFI-based wrapper for picohttpparser, which has a function with this signature:

    int phr_parse_request(const char* buf, size_t len, const char** method,
                          size_t* method_len, const char** path,
                          size_t* path_len, int* minor_version,
                          struct phr_header* headers, size_t* num_headers,
                          size_t last_len);
    

    Naturally, the function has a ton of out parameters that return pointers into buf, because it does zero-copy parsing: it just walks over the provided buffer and parses it there. This is potentially a huge time saver, particularly if there are lots of headers (for example).

    My ideal use case has me making as few copies as possible from the socket to the user. That means using socket.recv_into to use an in-memory buffer, and then passing a memoryview of that buffer into phr_parse_request. (I want to use a memoryview so that I can avoid constantly re-allocating buffers whenever I've parsed part of them.) This flow would lead to one copy from read to parse, and then one further copy to bring the parsed data structures into Python strings for exposure to the user.

    This means my ideal use-case is passing a memoryview as a char * parameter. The memoryview is backed by a bytes object that is definitely going to outlive the function call, so as long as we can prevent PyPy's GC moving that buffer while phr_parse_request is using it and while I'm doing pointer manipulation with it, I'm happy.

    It may be that we just can't convince PyPy to play ball, and that'll have to be ok, but obviously I'd rather be able to be as efficient as possible! =)

  31. Armin Rigo

    You just want to allocate a ffi.new("char[]"), and do the operations there. You don't need a bytes nor any memoryview as far as I can tell. You can pass the ffi.buffer() of the char[] object to socket.recv_into(), and then get pointers to the Nth character in the char[] buffer, and so on; do everything you need with it.
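
A minimal sketch of this suggestion, assuming a local socket pair in place of a real connection and an arbitrary 4096-byte buffer size. The request text and the offsets into it are made up for the example; picohttpparser itself is not called.

```python
import socket

from cffi import FFI

ffi = FFI()

# Memory owned by cffi: it has a fixed address for as long as buf is alive.
buf = ffi.new("char[]", 4096)

a, b = socket.socketpair()   # stands in for a real connection
try:
    b.sendall(b"GET /index HTTP/1.1\r\n\r\n")
    # ffi.buffer() exposes the char[] as a writable buffer for recv_into().
    nbytes = a.recv_into(ffi.buffer(buf))
finally:
    a.close()
    b.close()

# Copy out only what is needed, after the data has landed in C memory.
received = ffi.buffer(buf, nbytes)[:]

# Pointer to the Nth character: plain pointer arithmetic on the cdata.
path_ptr = buf + 4
path = ffi.buffer(path_ptr, 6)[:]
```

A pointer such as path_ptr could be passed to a C parser directly, since it points into memory that does not move.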

  32. Cory Benfield

    Yeah, the problem there is that hyper only wants to use CFFI libraries as an optimisation if it's available, but it needs a pure-Python fallback. To get a pure-Python fallback but still have the advantage of recv_into, I need to use a memoryview. =(

    For now, having the single copy is acceptable.
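
For comparison, the pure-Python fallback mentioned here might look like the sketch below. It assumes a bytearray as the backing store (recv_into() needs a writable buffer) and a local socket pair in place of a real connection; the response text is made up for the example.

```python
import socket

# Reusable backing store and a view onto it.
backing = bytearray(4096)
view = memoryview(backing)

a, b = socket.socketpair()   # stands in for a real connection
try:
    b.sendall(b"HTTP/1.1 200 OK\r\n\r\n")
    nbytes = a.recv_into(view)   # fills the bytearray in place
finally:
    a.close()
    b.close()

# Slicing a memoryview does not copy; it refers to the same memory.
data = view[:nbytes]

# The one unavoidable copy: turning the parsed part into a bytes object.
status_line = data[:15].tobytes()
```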
