Changes to pycparser break cffi's `...` handling

Issue #310 resolved
Nate Bogdanowicz
created an issue

I recently made some changes to pycparser (see PR #169) to fix an aspect of its parsing. This has the unfortunate side effect of breaking cffi's handling of its ... notation.

Basically, the c_parser module is replacing certain instances of ... with __dotdotdot__ or similar, and then typedef'ing __dotdotdot__ so it's a typedef name. With the recent changes, it's now invalid to have more than one type in a type declaration if one of them is a typedef name. For example,

typedef int __dotdotdot__;
int __dotdotdot__ a;

will not parse. (I'll note here that this behavior is in agreement with the C spec as well as GCC and clang).

Below I'll quote a proposal I made in that PR thread when this issue was raised:

Maybe we could add a minor way for users of pycparser to do this sort of thing, but I'm not sure how Eli feels about this. I do know that cffi uses another hack to work around handling of __stdcall using volatile volatile const which it would be nice to get rid of.

Here's a basic proposal: Add dummy token type called USERQUOTE, which can be any sequence of characters surrounded by backticks (or whatever invalid C we decide). Then we say USERQUOTE is a valid type-qualifier. Now cffi or whoever can insert arbitrary notes to themselves wherever type qualifiers are allowed. I just tested this out and it works with 6 lines of code added.
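To make the proposal concrete, here is a minimal, standard-library-only sketch of the idea. It is not pycparser's lexer or grammar. The token table, `tokenize` and `split_qualifiers` are hypothetical names made up for illustration, and only the USERQUOTE rule (text between backticks, accepted wherever a type qualifier would be) comes from the proposal itself.

```python
import re

# Hypothetical token table: only USERQUOTE comes from the proposal above;
# the other rules are just enough to tokenize a small declaration.
TOKEN_SPEC = [
    ("USERQUOTE", r"`[^`]*`"),        # any text between backticks
    ("ID", r"[A-Za-z_]\w*"),
    ("NUMBER", r"\d+"),
    ("PUNCT", r"[;,*()\[\]{}=]"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))

# Qualifiers accepted by this toy; USERQUOTE is treated like one of them.
TYPE_QUALIFIERS = {"const", "volatile", "restrict"}


def tokenize(source):
    """Return (kind, text) pairs, raising on characters no rule matches."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError("unexpected character %r at %d" % (source[pos], pos))
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def split_qualifiers(tokens):
    """Separate qualifier-like tokens (including USERQUOTE notes) from the rest."""
    qualifiers, rest = [], []
    for kind, text in tokens:
        if kind == "USERQUOTE" or (kind == "ID" and text in TYPE_QUALIFIERS):
            qualifiers.append(text)
        else:
            rest.append((kind, text))
    return qualifiers, rest
```

With this sketch, a caller could write something like int `stdcall` const f(int x); and recover the backtick note from the qualifier list afterwards, instead of encoding it as volatile volatile const.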

Please let me know if this sounds like a reasonable solution, or if there are other possible solutions. I'd be willing to code up the pycparser end of this (or at least see if it's likely to be accepted), but first I'd like to know what you cffi folks think of this.

Comments (8)

  1. Armin Rigo

    It's a bit more involved, because ... can appear in different positions. For example, you can say typedef ... foo_t; or typedef ... *foo_t;. It is to fulfill that role that __dotdotdot__ is defined as a typedef.

    I see the problem with int __dotdotdot__ a;, which basically worked by chance so far. There are various ways to fix it, but all involve some more regexps inside cffi's cparser module. I don't think we can simply replace all ... with a USERQUOTE that works as a type qualifier, because not all ... take the position of a type qualifier.

  2. Armin Rigo

    The best is to look for __dotdotdot (without the trailing __) inside cffi's cparser module and see how the various cases are handled. You have:

    • array notation: int a[...]; => int a[__dotdotdotarray__];, which is parsed as a generic c_ast.ID

    • enums: enum e { A, B, ... }; or enum e { A=..., B=... }; => enum e { A, B, __dotdotdotNUMBER__ };

    All other ... are replaced with __dotdotdot__, which is a typedef, and then detected in pycparser's output in these places:

    • typedef ... foo_t; or typedef ... *foo_t;

    • Same as above but with an integer or floating point type before the ... (the case that crashes right now)

    • In regular ellipsis parameters declaration, int f(int x, ...); parses as int f(int x, __dotdotdot__);, which works because __dotdotdot__ is a type. It is then turned back into an ellipsis declaration.

    • In structs/unions: struct foo { int x; ...; };, which also works because __dotdotdot__ is a type.
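The cases listed above can be sketched as a small preprocessing pass. This is an illustration only, not cffi's actual code: the regular expressions and the function name `replace_ellipses` are hypothetical, and reading NUMBER in __dotdotdotNUMBER__ as a running counter is an assumption.

```python
import itertools
import re

# Hypothetical patterns written for this illustration only.
ARRAY_RE = re.compile(r"\[\s*\.\.\.\s*\]")
# "=..." before "," or "}", or a bare "..." right before "}" (enum bodies).
ENUM_RE = re.compile(r"=\s*\.\.\.\s*(?=[,}])|\.\.\.\s*(?=\})")


def replace_ellipses(source):
    """Replace each "..." with a placeholder chosen by its syntactic position."""
    # Array notation: "[...]" becomes an ordinary identifier in the brackets.
    source = ARRAY_RE.sub("[__dotdotdotarray__]", source)

    # Enum positions: each occurrence gets its own numbered placeholder.
    counter = itertools.count()

    def numbered(match):
        prefix = "=" if match.group().startswith("=") else ""
        return "%s__dotdotdot%d__ " % (prefix, next(counter))

    source = ENUM_RE.sub(numbered, source)

    # Everything left over is treated as the __dotdotdot__ typedef name.
    return source.replace("...", " __dotdotdot__ ")
```

The order of the passes matters in this sketch: the array and enum forms are rewritten first, so that the final catch-all replacement only sees the occurrences meant to act as a type. The struct form ...; is left for the catch-all because the ellipsis there is followed by a semicolon rather than a closing brace.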

  3. Nate Bogdanowicz reporter

    As you've indicated, it seems to make the most sense for __dotdotdot__ to be interpreted as a type. Does it ever make sense to use ... with a typedef-name? If not, one path I see is to make the dummy token be a valid type (instead of a type-qualifier), so it would be on the same footing as something like unsigned. In that case, the parser allows it alongside other builtin types, but not alongside typedef-names or struct/union/enum declarations (inside structs is fine though, obviously).

  4. Armin Rigo

    Yes, that would work to cover the last four cases, I think. typedef int unsigned foo_t; is not commonly seen but it is certainly parseable and it seems to be valid C code. ... should never be used together with a typedef-name, only either on its own as a type, or in conjunction with a sequence of built-in type words from int|long|short|signed|unsigned|char|double|float.
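The rule described here can be written down as a small check over the type-specifier words of one declaration. This is a hedged sketch, not the pycparser or cffi implementation: `check_specifiers`, `BUILTIN_WORDS` and `DUMMY` are hypothetical names. Requiring the dummy to come last is an assumption based on the earlier description of a built-in type appearing before the .... The sketch does not check that the built-in words themselves form a valid C type.

```python
# Built-in type words named in the comment above.
BUILTIN_WORDS = {"int", "long", "short", "signed", "unsigned",
                 "char", "double", "float"}
DUMMY = "__dotdotdot__"  # stand-in for "..." used as a type


def check_specifiers(words, typedef_names=frozenset()):
    """Return True if the specifier words form a combination allowed here."""
    words = list(words)
    if not words:
        return False
    # A typedef name may not be combined with anything else, dummy included.
    if any(w in typedef_names for w in words):
        return len(words) == 1
    # Assumption: the dummy, if present, is the last specifier.
    builtin_part = words[:-1] if words[-1] == DUMMY else words
    # Whatever remains must consist only of built-in type words.
    return all(w in BUILTIN_WORDS for w in builtin_part)
```

Under these assumptions, check_specifiers(["int", "__dotdotdot__"]) accepts the combination, while check_specifiers(["foo_t", "__dotdotdot__"], {"foo_t"}) rejects it because a typedef name is involved.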

  5. Nate Bogdanowicz reporter

    Yes, it does appear that your patch fixes the specific code in cryptography that the parser was choking on before. If you're happy with this solution, then I'm happy. The only thing I noticed is that something like typedef ... int foo_t; fails to parse, but you may consider that improper usage of ....

  6. Armin Rigo

    Yes, typedef ... int foo_t; is invalid. It also fails to parse in cffi 1.9/pycparser 2.16, so I guess it never worked and there is no backward-compatibility issue.

    Marking this as resolved for now. It would be nice at some point to discuss a more official method for parsing the ellipsis. The general issue with that is really that the text ... can be used in different syntactic roles: its role as a built-in type name covers many cases, but it can also be a constant expression (int a[...]; or enum { X=... };) or just an identifier (enum { X, ... };).
