Commits

Olivier Lauzanne  committed a8fc828

use the html parser from lxml.html

  • Parent commits 3bce8f5


Files changed (2)

File pyquery/README.txt

 It can also be used for web scraping or for theming applications with
 `Deliverance`_.
 
-The project is being actively developped on `Bitbucket`_ and I have the policy
-of giving push access to anyone who wants it and then to review what he does.
-So if you want to contribute just email me.
+The `project`_ is being actively developed in a Mercurial repository on
+Bitbucket. I have the policy of giving push access to anyone who wants it
+and then reviewing what they do. So if you want to contribute, just email me.
 
 The Sphinx documentation is available on `pyquery.org`_.
 
 .. _deliverance: http://www.gawel.org/weblog/en/2008/12/skinning-with-pyquery-and-deliverance
-.. _bitbucket: http://www.bitbucket.org/olauzanne/pyquery/
+.. _project: http://www.bitbucket.org/olauzanne/pyquery/
 .. _pyquery.org: http://pyquery.org/
 
 .. contents::
 -----------------------
 
 By default pyquery uses the lxml xml parser and then if it doesn't work goes on
-to try the html parser. It can sometimes be problematic when parsing xhtml pages
-because the parser will not raise an error but give an unusable tree.
+to try the html parser from lxml.html. The xml parser can sometimes be
+problematic when parsing xhtml pages because it does not raise an error
+but returns an unusable tree (on w3c.org for example).
 
 You can also choose which parser to use explicitly::
 
-   >>> pq('<p>toto</p>', parser='html')
+   >>> pq('<html><body><p>toto</p></body></html>', parser='xml')
    [<html>]
-   >>> pq('<p>toto</p>', parser='xml')
+   >>> pq('<html><body><p>toto</p></body></html>', parser='html')
+   [<html>]
+   >>> pq('<html><body><p>toto</p></body></html>', parser='html_fragments')
    [<p>]
 
+The html and html_fragments parsers are the ones from lxml.html.
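The default fallback described above (try the strict XML parser first, then the forgiving HTML parser) can be sketched with the standard library alone. The `fromstring_fallback` helper below is illustrative only: it stands in for pyquery's real lxml-based logic, where `lxml.html.fromstring` would replace the placeholder fallback branch.

```python
import xml.etree.ElementTree as etree

def fromstring_fallback(context):
    # Sketch of pyquery's default behaviour: attempt a strict XML
    # parse first, and fall back to an HTML parse on syntax errors.
    # (The real code uses lxml's etree.fromstring / lxml.html.fromstring.)
    try:
        return 'xml', etree.fromstring(context)
    except etree.ParseError:
        return 'html', None  # placeholder for lxml.html.fromstring(context)

print(fromstring_fallback('<p>toto</p>')[0])  # well-formed: xml parser succeeds
print(fromstring_fallback('<p>toto')[0])      # unclosed tag: falls back to html
```

This mirrors why xhtml pages can be a trap: a lenient HTML parse never fails, so a silently mangled tree is possible where strict XML parsing would have raised.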
+
 Testing
 -------
 
 
     $ STATIC_DEPS=true bin/buildout
 
-Other documentations
---------------------
+More documentation
+------------------
 
 First there is the Sphinx documentation `here`_.
-Then for more documentation about the API you can use the jquery website http://docs.jquery.com/.
+Then for more documentation about the API you can use the `jquery website`_.
 The reference I'm now using for the API is ... the `color cheat sheet`_.
 Then you can always look at the `code`_.
 
+.. _jquery website: http://docs.jquery.com/
 .. _code: http://www.bitbucket.org/olauzanne/pyquery/src/tip/pyquery/pyquery.py
 .. _here: http://pyquery.org
 .. _color cheat sheet: http://colorcharge.com/wp-content/uploads/2007/12/jquery12_colorcharge.png

File pyquery/pyquery.py

 # Distributed under the BSD license, see LICENSE.txt
 from cssselectpatch import selector_to_xpath
 from lxml import etree
+import lxml.html
 from copy import deepcopy
 from urlparse import urljoin
 
     """
     if parser == None:
         try:
-            return etree.fromstring(context)
+            return [etree.fromstring(context)]
         except etree.XMLSyntaxError:
-            return etree.fromstring(context, etree.HTMLParser())
+            return [lxml.html.fromstring(context)]
     elif parser == 'xml':
-        return etree.fromstring(context)
+        return [etree.fromstring(context)]
     elif parser == 'html':
-        return etree.fromstring(context, etree.HTMLParser())
+        return [lxml.html.fromstring(context)]
+    elif parser == 'html_fragments':
+        return lxml.html.fragments_fromstring(context)
     else:
         ValueError('No such parser: "%s"' % parser)
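One subtlety in the dispatch above: the final `ValueError(...)` is constructed but never raised, so an unknown parser name falls through and the function returns None. A minimal sketch of the presumably intended validation, using the parser names from this commit (the helper name is hypothetical):

```python
def check_parser(parser):
    # Parser names taken from the diff; in the real code 'xml' maps to
    # etree.fromstring and 'html' to lxml.html.fromstring.
    known = ('xml', 'html', 'html_fragments')
    if parser is not None and parser not in known:
        # The original builds this exception but forgets `raise`.
        raise ValueError('No such parser: "%s"' % parser)
    return parser

check_parser('html')   # ok
try:
    check_parser('json')
except ValueError as e:
    print(e)           # No such parser: "json"
```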
 
         parser = kwargs.get('parser')
         if 'parser' in kwargs:
             del kwargs['parser']
+        if not kwargs and len(args) == 1 and isinstance(args[0], basestring) \
+           and args[0].startswith('http://'):
+            kwargs = {'url': args[0]}
+            args = []
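The new shortcut above rewrites a single `http://` string argument into a `url` keyword before the rest of `__init__` runs, so `pq('http://...')` behaves like `pq(url='http://...')`. A standalone sketch of that normalization (the helper name is hypothetical, and `str` stands in for the Python 2 `basestring` used in the diff):

```python
def normalize_args(args, kwargs):
    # Mirror the shortcut: a lone http:// string becomes a url keyword.
    if not kwargs and len(args) == 1 and isinstance(args[0], str) \
       and args[0].startswith('http://'):
        kwargs = {'url': args[0]}
        args = ()
    return args, kwargs

print(normalize_args(('http://example.com/',), {}))
print(normalize_args(('<p>toto</p>',), {}))
```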
 
         if 'parent' in kwargs:
             self._parent = kwargs.pop('parent')
                 self._base_url = url
             else:
                 raise ValueError('Invalid keyword arguments %s' % kwargs)
-            elements = [fromstring(html, parser)]
+            elements = fromstring(html, parser)
         else:
             # get nodes
 
             # get context
             if isinstance(context, basestring):
                 try:
-                    elements = [fromstring(context, parser)]
+                    elements = fromstring(context, parser)
                 except Exception, e:
                     raise ValueError('%r, %s' % (e, context))
             elif isinstance(context, self.__class__):
 
         """
         assert isinstance(value, basestring)
-        value = fromstring(value)
+        value = fromstring(value)[0]
         nodes = []
         for tag in self:
             wrapper = deepcopy(value)
             return self
 
         assert isinstance(value, basestring)
-        value = fromstring(value)
+        value = fromstring(value)[0]
         wrapper = deepcopy(value)
         if not wrapper.getchildren():
             child = wrapper