Olivier Lauzanne committed e438752

add the make links absolute method

Files changed (2)

pyquery/README.txt

 You can use the PyQuery class to load an xml document from a string, a lxml
 document, from a file or from an url::
 
-    >>> from pyquery import PyQuery
+    >>> from pyquery import PyQuery as pq
     >>> from lxml import etree
-    >>> d = PyQuery("<html></html>")
-    >>> d = PyQuery(etree.fromstring("<html></html>"))
-    >>> d = PyQuery(url='http://google.com/')
-    >>> d = PyQuery(filename=path_to_html_file)
+    >>> d = pq("<html></html>")
+    >>> d = pq(etree.fromstring("<html></html>"))
+    >>> d = pq(url='http://google.com/')
+    >>> d = pq(filename=path_to_html_file)
 
 Now d is like the $ in jquery::
 
 
 Filtering functions can refer to the current element as 'this', like in jQuery::
 
-    >>> d('p').filter(lambda i: PyQuery(this).text() == 'you know Python rocks')
+    >>> d('p').filter(lambda i: pq(this).text() == 'you know Python rocks')
     [<p#hello.hello>]
 
 The opposite of filter is `not_` - it returns the items that don't match the selector::
 You can map a callable onto a PyQuery and get a mutated result. The result can
 contain any items, not just elements::
 
-    >>> d('p').map(lambda i, e: PyQuery(e).text())
+    >>> d('p').map(lambda i, e: pq(e).text())
     ['you know Python rocks', 'hello python !']
 
 Like the filter method, map callbacks can reference the current item as this::
 
-    >>> d('p').map(lambda i, e: len(PyQuery(this).text()))
+    >>> d('p').map(lambda i, e: len(pq(this).text()))
     [21, 14]
 
 The map callback can also return a list, which will extend the resulting
 PyQuery::
 
-    >>> d('p').map(lambda i, e: PyQuery(this).text().split())
+    >>> d('p').map(lambda i, e: pq(this).text().split())
     ['you', 'know', 'Python', 'rocks', 'hello', 'python', '!']
 
 It is possible to select a single element with eq::
 .. _paste: http://pythonpaste.org/
 .. _proxy: http://pythonpaste.org/modules/proxy.html#paste.proxy.Proxy
 
+Making links absolute
+---------------------
+
+You can make all links on a page absolute, which can be useful for screen
+scraping::
+
+    >>> d = pq(url='http://google.com')
+    >>> d('a:last').attr('href')
+    '/intl/fr/privacy.html'
+    >>> d.make_links_absolute()
+    [<html>]
+    >>> d('a:last').attr('href')
+    'http://google.com/intl/fr/privacy.html'
+
+
 Testing
 -------
 
     $ bin/buildout
     $ bin/test
 
+You can build the Sphinx documentation by doing::
+
+    $ cd docs
+    $ make html
+
 If you don't already have lxml installed use this line::
 
     $ STATIC_DEPS=true bin/buildout

pyquery/pyquery.py

 from cssselectpatch import selector_to_xpath
 from lxml import etree
 from copy import deepcopy
+from urlparse import urljoin
 
 def fromstring(context):
     """use html parser if we don't have clean xml
     def __init__(self, *args, **kwargs):
         html = None
         elements = []
+        self._base_url = None
 
         if 'parent' in kwargs:
             self._parent = kwargs.pop('parent')
                 html = file(kwargs['filename']).read()
             elif 'url' in kwargs:
                 from urllib2 import urlopen
-                html = urlopen(kwargs['url']).read()
+                url = kwargs['url']
+                html = urlopen(url).read()
+                self._base_url = url
             else:
                 raise ValueError('Invalid keyword arguments %s' % kwargs)
             elements = [fromstring(html)]
             results = self.__class__(expr, self)
             results.remove()
         return self
+
+    #####################################################
+    # Additional methods that are not in the jQuery API #
+    #####################################################
+
+    @property
+    def base_url(self):
+        """Return the url of current html document or None if not available.
+        """
+        if self._base_url is not None:
+            return self._base_url
+        if self._parent is not no_default:
+            return self._parent.base_url
+
+    def make_links_absolute(self, base_url=None):
+        """Make all links absolute.
+        """
+        if base_url is None:
+            base_url = self.base_url
+            if base_url is None:
+                raise ValueError('You need a base URL to make your links '
+                 'absolute. It can be provided by the base_url parameter.')
+
+        self('a').each(lambda a:
+                       a.attr('href', urljoin(base_url, a.attr('href'))))
+        return self
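
The new method relies on ``urljoin`` to resolve each ``href`` against the base URL. As a rough, standalone sketch of that resolution step (not the code from this commit): ``absolutize`` is a hypothetical helper, the import uses Python 3's ``urllib.parse`` rather than the Python 2 ``urlparse`` module imported in the diff, and skipping missing ``href`` values is an assumption added here for illustration.

```python
# Python 3 location of urljoin; the diff imports it from Python 2's urlparse.
from urllib.parse import urljoin


def absolutize(base_url, hrefs):
    """Hypothetical helper: resolve each href against base_url.

    Missing hrefs (None) are passed through unchanged; that choice is an
    assumption of this sketch, not behaviour taken from the commit.
    """
    if base_url is None:
        raise ValueError('You need a base URL to make your links absolute.')
    return [urljoin(base_url, href) if href is not None else None
            for href in hrefs]


links = absolutize('http://google.com',
                   ['/intl/fr/privacy.html',    # root-relative path
                    'http://example.org/page',  # already absolute
                    None])                      # anchor without an href
print(links)
```

Root-relative paths are joined onto the base host, and links that are already absolute come back unchanged, which is why the method can be applied to every anchor on the page.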