pyquery 1.2, queries are broken with xml.

illume created an issue

Before pyquery 1.2, the following worked. For example, with pyquery 1.1:

>>> from pyquery import PyQuery as pq
>>> d = pq("<X>1</X>", parser="xml")
>>> print d
<X>1</X>
>>> d('X')
[<X>]

This fails with pyquery 1.2:

>>> from pyquery import PyQuery as pq
>>> d = pq("<X>1</X>", parser="xml")
>>> print d
<X>1</X>
>>> d('X')
[]

It cannot find the node X in the example above.

Comments (5)

  1. Simon Sapin

    Hi, cssselect maintainer here.

    Short version: this particular problem should be fixed by setting translator.lower_case_element_names = False on the JQueryTranslator object in cssselectpatch.py for XML documents.

    Longer version: pyquery should probably use GenericTranslator instead of HTMLTranslator for non-HTML documents.

    Admittedly the documentation could be improved on this, but it is all explained in source comments: https://github.com/SimonSapin/cssselect/blob/master/cssselect/xpath.py#L123

    Element names in selectors should be case-sensitive for XML but case-insensitive for HTML. To do that, cssselect.HTMLTranslator lower-cases all element names in selectors and expects the HTML parser to do the same in the document. lxml.html does. lxml.etree, however, parses XML and keeps the element name upper-case in the example, so the selector does not match. cssselect makes this assumption because there is no lower-case function in XPath 1.0.

    Compared to GenericTranslator, HTMLTranslator makes element names and attributes lower-case, but also has an HTML-specific implementation of some pseudo-classes such as :link.

  2. Gael Pasgrimaud

    Yep! Thanks for the help. Even if I've already figured out the problem ;)

    PyQuery now accepts a custom css_translator and uses JQueryTranslator(xhtml=True) for XML documents.

    1.2.1 is available on pypi

  3. Simon Sapin

    Nice. I hadn’t thought of XHTML. To clarify, passing xhtml=True makes HTMLTranslator behave like XML with respect to case-sensitivity, but still keeps HTML semantics. I’ll leave it to you to decide if the latter is what you want, even for stuff that might really not be (X)HTML.
