pyquery 1.2, queries are broken with xml.

Issue #44 resolved
René Dudfield
created an issue

Before pyquery 1.2, the following query worked.

This works with pyquery 1.1:

{{{
>>> from pyquery import PyQuery as pq
>>> d = pq("<X>1</X>", parser="xml")
>>> print d
<X>1</X>
>>> d('X')
[<X>]
}}}

This fails with pyquery 1.2:

{{{
>>> from pyquery import PyQuery as pq
>>> d = pq("<X>1</X>", parser="xml")
>>> print d
<X>1</X>
>>> d('X')
[]
}}}

pyquery 1.2 cannot find the X node in the example above.

Comments (5)

  1. Simon Sapin

    Hi, cssselect maintainer here.

    Short version: this particular problem should be fixed by setting translator.lower_case_element_names = False on the JQueryTranslator object for XML documents.

    Longer version: pyquery should probably use GenericTranslator instead of HTMLTranslator for non-HTML documents.

    Admittedly the documentation could be improved on this, but it is all explained in source comments:

    Element names in selectors should be case-sensitive for XML but case-insensitive for HTML. To do that, cssselect.HTMLTranslator makes all element names in selectors lower-case and expects the HTML parser to do the same in the document. lxml.html does. lxml.etree, however, parses XML and keeps the upper-case element name in the example, so the selector does not match. cssselect makes this assumption because there is no lower-case function in XPath 1.0.

    Compared to GenericTranslator, HTMLTranslator makes element names and attribute names lower-case, but also has an HTML-specific implementation of some pseudo-classes such as :link
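The mismatch described in this comment can be reproduced without pyquery or cssselect. The following is only a minimal sketch using the standard library: `select_elements` is a hypothetical helper (not part of any of the libraries discussed), and its `lower_case_element_names` parameter merely imitates the translator flag of the same name. It lower-cases the selector name before a case-sensitive lookup, which is roughly what happens when an HTML-style translation is applied to an XML tree.

```python
import xml.etree.ElementTree as ET


def select_elements(root, name, lower_case_element_names):
    """Hypothetical helper: find descendants by tag name.

    When lower_case_element_names is True, the selector name is
    lower-cased first, imitating HTML-style selector translation.
    """
    if lower_case_element_names:
        name = name.lower()
    # ElementTree path lookups are case-sensitive, as XPath 1.0 is.
    return root.findall(".//" + name)


# The XML parser keeps the element name exactly as written ("X").
doc = ET.fromstring("<root><X>1</X></root>")

html_style = select_elements(doc, "X", lower_case_element_names=True)
xml_style = select_elements(doc, "X", lower_case_element_names=False)

print(len(html_style))  # 0: the lookup was for "x", the document has "X"
print(len(xml_style))   # 1: the name is left untouched and matches
```

The lower-cased lookup only works if the parser also lower-cases names in the document, which an XML parser does not do.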

  2. Gael Pasgrimaud

    Yep! Thanks for the help. Even if I've already figured out the problem ;)

    PyQuery now accepts a custom css_translator and uses JQueryTranslator(xhtml=True) for XML documents.

    1.2.1 is available on pypi

  3. Simon Sapin

    Nice. I hadn’t thought of XHTML. To clarify, passing xhtml=True makes HTMLTranslator behave like XML with respect to case-sensitivity, but still keeps HTML semantics. I’ll leave it to you to decide if the latter is what you want, even for stuff that might really not be (X)HTML.
