Pynav 0.7
- Pynav 0.7
- Changelog
- Interactive mode examples
- Visit a URL and print it
- Get all links of a URL and filter them with regex
- Get all images of a URL, filter with regex and download them
- Check if a new resource is available
- Check erroneous links (error 404)
- Restrict links only to html content type
- Download files behind non-direct links
- Get response header information
- Dump forms and generate code
- Select a specific form after a dump
- FAQ
- How to restrict links only to specific content types
- How to allow all content types
- How to disable robots.txt handling (do this with thought and consideration)
- How to handle referer automatically
- How to set the timeout
- How to define a delay between page visits
- How to check if a cookie exists
- How to use HTTP Basic authentication
- How to use a proxy (does not work with HTTPS, a urllib restriction)
Changelog
Release date: 2011-02-22
- New: Split Pynav to Browser and Response classes, refactor code
- New: FormDumper class added to dump forms as readable text and pre-generate Python code for Pynav
- New: Python 2.5 is no longer supported; Python 2.6 is now the minimum required version.
- New: Change licence from GPL to LGPL
- New: Add Browser.check_404(url, values)
- New: Add Browser.check_new_resource(url, values, last_datetime)
- New: Add Browser.handle_robots boolean attribute to handle robots.txt
- New: Add methods to manage content types white list
- New: Add Response.read(boolean)
- Qual: Migrate Python 2.5 code to Python 2.6 (format strings, decorators, imports...)
- Qual: Refactor code to be more pythonic, rename attributes and clean code
- Fix: Typo error after refactoring, thanks gjbaker ;)
- Fix: empty link bug in get_all_links()
- Fix: the GET method was not used when no POST data was given. Thanks ranan. ;)
Interactive mode examples
Visit a URL and print it
>>> from pynav import Browser
>>> b = Browser()
>>> b.go('example.com')
<pynav.response.Response object at 0x8feb9cc>
>>> print b.response  # or print b.r
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML ...">
<head>
<title>IANA — Example domains</title>
...
-
Get all links of a URL and filter them with regex
>>> b.go('example.com')
<pynav.response.Response object at 0x8febdec>
>>> b.r.links
['http://example.com/foo.html', 'http://example.com/bar1.html',
'http://otherwebsite.com/myfoopage.html', 'http://example.com/bar2.html']
>>> b.r.get_links('foo')
['http://example.com/foo.html', 'http://otherwebsite.com/myfoopage.html']
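The filtering shown above can be approximated outside Pynav. This is a minimal sketch in Python 3 using only the standard library; `filter_links` is a hypothetical helper, not Pynav's implementation, and it assumes the pattern may match anywhere in the URL, which is what the output above suggests.

```python
import re


def filter_links(links, pattern):
    """Return the links that match the regular expression `pattern`."""
    regex = re.compile(pattern)
    # re.search matches anywhere in the URL, not only at its start.
    return [link for link in links if regex.search(link)]


links = [
    'http://example.com/foo.html',
    'http://example.com/bar1.html',
    'http://otherwebsite.com/myfoopage.html',
    'http://example.com/bar2.html',
]
foo_links = filter_links(links, 'foo')
print(foo_links)
```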
-
Get all images of a URL, filter with regex and download them
>>> b.go('example.com')
<pynav.response.Response object at 0x8febdec>
>>> b.r.images
['http://example.com/foobar.png', 'http://example.com/images/foo.jpeg',
'http://example.com/images/logobar.png']
>>> for image in b.r.get_images('bar\.png'):
...     b.download(image, '/tmp')
...
('/tmp/foobar.png', <httplib.HTTPMessage instance at 0x8ff7d2c>)
('/tmp/logobar.png', <httplib.HTTPMessage instance at 0x8ff7f6c>)
-
Check if a new resource is available
>>> from datetime import datetime
>>> last_check = datetime(2011, 2, 5, 10, 35, 12)
>>> b.check_new_resource('example.com/news.html', last_check)
True
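How Pynav decides that a resource is new is not described here. One plausible approach, assumed for this sketch, is to compare the `Last-Modified` response header with the date of the last check. The helper name `is_newer` is hypothetical, and the code is Python 3 standard library only.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def is_newer(last_modified_header, last_check):
    """Return True if the Last-Modified value is later than last_check.

    last_check is a naive datetime, assumed to be expressed in UTC.
    """
    modified = parsedate_to_datetime(last_modified_header)
    if modified.tzinfo is not None:
        # Normalise to naive UTC so both datetimes are comparable.
        modified = modified.astimezone(timezone.utc).replace(tzinfo=None)
    return modified > last_check


last_check = datetime(2011, 2, 5, 10, 35, 12)
is_new = is_newer('Wed, 09 Feb 2011 17:13:15 GMT', last_check)
print(is_new)
```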
-
Check erroneous links (error 404)
>>> b.go('example.com')
>>> for link in b.r.links:
...     print "{link}: {res}".format(link=link, res=b.check_404(link))
...
http://example.com/erroneous_link.php: True
http://example.com/valid_link.php: None
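For comparison, a similar check can be sketched with the Python 3 standard library. `is_404` and `fake_opener` are hypothetical names; the fake opener only stands in for `urllib.request.urlopen` so the example runs without network access. The `True` / `None` return values mirror the output above and are not taken from Pynav's code.

```python
import io
import urllib.error
import urllib.request


def is_404(url, opener=urllib.request.urlopen):
    """Return True if requesting `url` yields HTTP 404, otherwise None."""
    try:
        with opener(url):
            return None
    except urllib.error.HTTPError as error:
        return True if error.code == 404 else None


def fake_opener(url):
    # Stand-in for urlopen: raises a 404 for one known-bad URL.
    if url.endswith('erroneous_link.php'):
        raise urllib.error.HTTPError(url, 404, 'Not Found', None, None)
    return io.BytesIO(b'ok')


for link in ['http://example.com/erroneous_link.php',
             'http://example.com/valid_link.php']:
    print('{0}: {1}'.format(link, is_404(link, fake_opener)))
```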
-
Restrict links only to html content type
>>> b.go('example.com')
<pynav.response.Response object at 0x96d89ac>
>>> b.allow_html_only()
>>> for link in b.r.links:
...     b.go(link)
...
Content-Type image/x-icon is not allowed !
<pynav.response.Response object at 0x96d87cc>
Content-Type text/css is not allowed !
<pynav.response.Response object at 0x96d8aac>
Content-Type text/css; charset=UTF-8 is not allowed !
<pynav.response.Response object at 0x96d8acc>
<pynav.response.Response object at 0x96d8a0c>
<pynav.response.Response object at 0x96e0a8c>
...
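The messages above show that a value such as `text/css; charset=UTF-8` is rejected as a whole. A white list check of this kind can be sketched as follows in Python 3. `is_allowed` is a hypothetical helper, and dropping the parameters after `;` before comparing is an assumption of this sketch, not a statement about Pynav's internals.

```python
def is_allowed(content_type_header, allowed_types):
    """Return True if the media type of the header is in allowed_types."""
    # Keep only the media type: drop parameters such as "; charset=UTF-8".
    media_type = content_type_header.split(';', 1)[0].strip().lower()
    return media_type in allowed_types


allowed = ['text/html']
for header in ['text/html; charset=UTF-8', 'text/css; charset=UTF-8',
               'image/x-icon']:
    print('{0}: {1}'.format(header, is_allowed(header, allowed)))
```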
-
Download files behind non-direct links
>>> b.go('example.com/download.php')
<pynav.response.Response object at 0x8f78b2c>
>>> b.verbose = True
>>> b.r.links
['http://www.example.com/download.php?f=video145part1',
'http://www.example.com/download.php?f=video145part2',
'http://www.example.com/download.php?f=video145part3']
>>> for link in b.r.links:
...     b.download(link, '/tmp/videos')
...
Downloading switzerland_trip.avi (86 MB) to: /tmp/videos/switzerland_trip.avi
('/tmp/videos/switzerland_trip.avi', <httplib.HTTPMessage instance at 0x8f7848c>)
Downloading japan_trip.avi (531 MB) to: /tmp/videos/japan_trip.avi
('/tmp/videos/japan_trip.avi', <httplib.HTTPMessage instance at 0x8f786ec>)
Downloading europe_trip.avi (2.6 GB) to: /tmp/videos/europe_trip.avi
('/tmp/videos/europe_trip.avi', <httplib.HTTPMessage instance at 0x8f785ec>)
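The real file names above do not appear in the links, so they have to come from the server response. This page does not say how Pynav resolves them; a common mechanism, assumed for this sketch, is the `Content-Disposition` header. `filename_from_headers` is a hypothetical helper written for Python 3.

```python
from email.message import Message


def filename_from_headers(content_disposition, fallback):
    """Return the filename announced by the server, or `fallback`."""
    message = Message()
    message['Content-Disposition'] = content_disposition
    # get_filename() reads the "filename" parameter of Content-Disposition.
    return message.get_filename(fallback)


name = filename_from_headers('attachment; filename="japan_trip.avi"',
                             'download.php')
print(name)
```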
-
Get response header information
>>> b.go('example.com')
<pynav.response.Response object at 0x8f78fec>
>>> b.r.content_type
'text/html; charset=UTF-8'
>>> b.r.content_length
'2945'
>>> b.r.date
datetime.datetime(2011, 2, 20, 21, 20, 52)
>>> b.r.last_modified
datetime.datetime(2011, 2, 9, 17, 13, 15)
>>> b.r.status_code
200
>>> b.r.status_message
'OK'
>>> print b.r.headers
Server: Apache/2.2.3 (CentOS)
Last-Modified: Wed, 09 Feb 2011 17:13:15 GMT
Content-Type: text/html; charset=UTF-8
Accept-Ranges: bytes
Connection: close
Date: Sun, 20 Feb 2011 21:20:52 GMT
Age: 40
Content-Length: 2945
-
Dump forms and generate code
>>> b = Browser()
>>> b.go('http://www.example.com/form.html')
<pynav.response.Response object at 0x9f839ec>
>>> b.r.dump_form()
>There are 1 forms in this page: [0])
>Possibilities:
fd.dump(0)
fd.dump_all()
(POST) http://www.example.com/cgi-bin/echo.cgi multipart/form-data
(Textarea) blah=
(Text) comments=
(Checkbox) eggs=[spam]
(Select) cheeses=[mozz, caerphilly, gouda, gorgonzola, parmesan, leicester, cheddar, mascarpone, emmenthal]
(Checkbox) apples=[pears]
(Checkbox) whocares=[edam, gouda]
(Radio) spam=[spam, rhubarb]
(Radio) smelly=[on]
(Select) favorite_cheese=[*cheddar, brie, leicester, jahlsberg]
(File) No files added
values = {'blah':'', 'comments':'', 'eggs':'', 'cheeses':'', 'apples':'', 'whocares':'', 'spam':'', 'smelly':'',
'favorite_cheese':'cheddar'}
print b.go('http://www.example.com/cgi-bin/echo.cgi', values)
>>>
-
Select a specific form after a dump
>>> b.go('http://www.example.com/packages')
<pynav.response.Response object at 0x9f83f4c>
>>> b.r.dump_form()
>There are 4 forms in this page: [0, 1, 2, 3])
>Possibilities:
fd.dump(0)
fd.dump(1)
fd.dump(2)
fd.dump(3)
fd.dump_all()
>>> b.r.dump_form(2)
(GET) http://www.example.com/search application/x-www-form-urlencoded
(Hidden) searchon=contents
(Text) keywords=
(Submit) Search
(Ignore)
(Radio) mode=[*path, exactfilename, filename]
(Select) suite=[experimental, unstable, testing, *stable, oldstable]
(Select) arch=[*any, i386, m68k, alpha, amd64, sparc, powerpc, arm, hppa, ia64, mips, mipsel, s390]
values = {'searchon':'contents', 'keywords':'', 'mode':'path', 'suite':'stable', 'arch':'any'}
print b.go('http://www.example.com/search', values)
>>>
-
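The `values` dictionaries generated by the form dumps above map field names to field values. For a form whose method is GET, a browser normally sends such values in the query string. How Pynav transmits `values` is not stated on this page, so the sketch below only shows the standard encoding with the Python 3 standard library; the keyword `pynav` is a sample value.

```python
from urllib.parse import urlencode

# Field names taken from the dumped form; 'pynav' is a sample search keyword.
values = {'searchon': 'contents', 'keywords': 'pynav', 'mode': 'path',
          'suite': 'stable', 'arch': 'any'}

# Build the URL a GET submission of this form would request.
url = 'http://www.example.com/search?' + urlencode(values)
print(url)
```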
FAQ
How to restrict links only to specific content types
b.allowed_content_types = ['text/html', 'text/plain', 'text/css']
-
How to allow all content types
b.allow_all_content_types()
-
How to disable robots.txt handling (do this with thought and consideration)
b.handle_robots = False
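To see what is being switched off, here is a minimal sketch of a robots.txt check using the Python 3 standard library. The rules in `robots_txt` are sample data, and nothing here is taken from Pynav's own code.

```python
from urllib.robotparser import RobotFileParser

# Sample rules: every user agent is barred from /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

can_fetch_public = parser.can_fetch('*', 'http://example.com/index.html')
can_fetch_private = parser.can_fetch('*', 'http://example.com/private/data.html')
print(can_fetch_public, can_fetch_private)
```

With robots.txt handling disabled, no check of this kind is expected to run before a page is fetched, hence the warning in the heading.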
-
How to handle referer automatically
b.handle_referer = True
-
How to set the timeout
b.timeout = 15 #seconds
-
How to define a delay between page visits
b.set_page_delay(2, 6) #random time between 2 and 6 seconds
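A random delay of this kind can be sketched as follows in Python 3. `wait_between_pages` is a hypothetical helper, and the use of a uniform distribution is an assumption; the `sleep` parameter exists only so the example can run without actually waiting.

```python
import random
import time


def wait_between_pages(min_seconds, max_seconds, sleep=time.sleep):
    """Sleep for a random time in [min_seconds, max_seconds]; return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


# Replace the real sleep with a no-op to keep the example instantaneous.
delay = wait_between_pages(2, 6, sleep=lambda seconds: None)
print('would wait {0:.2f} seconds'.format(delay))
```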
-
How to check if a cookie exists
>>> 'my_cookie' in b.cookies
True
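The same membership test works on the standard library's cookie container, shown here as a Python 3 sketch. The cookie string is sample data, and whether `b.cookies` is built on a similar class is not stated on this page.

```python
from http.cookies import SimpleCookie

cookie = SimpleCookie()
# Sample Set-Cookie value; Path is an attribute, not a separate cookie.
cookie.load('my_cookie=abc123; Path=/')

has_cookie = 'my_cookie' in cookie
print(has_cookie)
```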
-
How to use HTTP Basic authentication
b.set_http_auth('http://example.com', 'login', 'pass')
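HTTP Basic authentication sends the login and password base64-encoded in an `Authorization` header. The sketch below, in Python 3, shows how that header value is built; `basic_auth_header` is a hypothetical helper and not part of Pynav.

```python
import base64


def basic_auth_header(login, password):
    """Return the value of the Authorization header for HTTP Basic auth."""
    credentials = '{0}:{1}'.format(login, password).encode('utf-8')
    return 'Basic ' + base64.b64encode(credentials).decode('ascii')


header = basic_auth_header('login', 'pass')
print(header)
```

Base64 is an encoding, not encryption, so the credentials are readable by anyone who can see the request unless the connection itself is encrypted.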
-
How to use a proxy (does not work with HTTPS, a urllib restriction)
b.proxy = 'http://www.example.com:3128/'
-