Pynav 0.7

Changelog

Release date: 2011-02-22

  • New: Split Pynav into Browser and Response classes and refactor the code (see the short sketch after this changelog)
  • New: FormDumper class added to dump forms as readable text and pre-generate Python code for Pynav
  • New: Python 2.5 is no longer supported; Python 2.6 or later is now required
  • New: Change licence from GPL to LGPL
  • New: Add Browser.check_404(url, values)
  • New: Add Browser.check_new_resource(url, values, last_datetime)
  • New: Add Browser.handle_robots boolean attribute to handle robots.txt
  • New: Add methods to manage content types white list
  • New: Add Response.read(boolean)
  • Qual: Migrate Python 2.5 code to Python 2.6 (format strings, decorators, imports...)
  • Qual: Refactor code to be more pythonic, rename attributes and clean code
  • Fix: Typo introduced during refactoring, thanks gjbaker ;)
  • Fix: Empty link bug in get_all_links()
  • Fix: Use the GET method when no POST data is given. Thanks ranan ;)
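
To give a feel for the new Browser / Response split, here is a minimal script-style sketch combining several of the 0.7 features listed above (robots.txt handling, content-type white list, page delay, 404 checking). All calls appear elsewhere in this page; the URL and the 'foo' regex are placeholders, not part of Pynav:

from pynav import Browser

b = Browser()
b.handle_robots = True    # respect robots.txt (new in 0.7)
b.allow_html_only()       # only visit HTML content types
b.set_page_delay(2, 6)    # wait between 2 and 6 seconds between visits
b.timeout = 15            # seconds

b.go('example.com')       # returns a Response, also available as b.r
for link in b.r.get_links('foo'):
    if b.check_404(link):
        print 'dead link: {0}'.format(link)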

Interactive mode examples

Visit a URL and print the response

>>> from pynav import Browser
>>> b = Browser()
>>> b.go('example.com')
<pynav.response.Response object at 0x8feb9cc>
>>> print b.response  # or print b.r
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML ...">
<head>
	<title>IANA &mdash; Example domains</title>
...

-

Get all the links of a page and filter them with a regex

>>> b.go('example.com')
<pynav.response.Response object at 0x8febdec>
>>> b.r.links
['http://example.com/foo.html', 'http://example.com/bar1.html',
'http://otherwebsite.com/myfoopage.html', 'http://example.com/bar2.html']
>>> b.r.get_links('foo')
['http://example.com/foo.html', 'http://otherwebsite.com/myfoopage.html']

-

Get all images of a URL, filter them with a regex and download them

>>> b.go('example.com')
<pynav.response.Response object at 0x8febdec>
>>> b.r.images
['http://example.com/foobar.png', 'http://example.com/images/foo.jpeg',
'http://example.com/images/logobar.png']
>>> for image in b.r.get_images('bar\.png'):
...   b.download(image, '/tmp')
... 
('/tmp/foobar.png', <httplib.HTTPMessage instance at 0x8ff7d2c>)
('/tmp/logobar.png', <httplib.HTTPMessage instance at 0x8ff7f6c>)

-

Check if a new resource is available

>>> from datetime import datetime
>>> last_check = datetime(2011, 2, 5, 10, 35, 12)
>>> b.check_new_resource('example.com/news.html', last_check)
True
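
The same call can drive a small polling loop. A sketch only: b is the Browser instance from the examples above, and the URL and the one-hour interval are illustrative:

import time
from datetime import datetime

last_check = datetime.now()
while True:
    time.sleep(3600)    # re-check once an hour
    if b.check_new_resource('example.com/news.html', last_check):
        print 'new content since {0}'.format(last_check)
        last_check = datetime.now()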

-

Check links for 404 errors

>>> b.go('example.com')
>>> for link in b.r.links:
...   print "{link}: {res}".format(link=link, res=b.check_404(link))
... 
http://example.com/erroneous_link.php: True
http://example.com/valid_link.php: None

-

Allow only HTML content types when following links

>>> b.go('example.com')
<pynav.response.Response object at 0x96d89ac>
>>> b.allow_html_only()
>>> for link in b.r.links:
...   b.go(link)
... 
Content-Type image/x-icon is not allowed !
<pynav.response.Response object at 0x96d87cc>
Content-Type text/css is not allowed !
<pynav.response.Response object at 0x96d8aac>
Content-Type text/css; charset=UTF-8 is not allowed !
<pynav.response.Response object at 0x96d8acc>
<pynav.response.Response object at 0x96d8a0c>
<pynav.response.Response object at 0x96e0a8c>
...

-

Download files in verbose mode

>>> b.go('example.com/download.php')
<pynav.response.Response object at 0x8f78b2c>
>>> b.verbose = True
>>> b.r.links
['http://www.example.com/download.php?f=video145part1',
'http://www.example.com/download.php?f=video145part2',
'http://www.example.com/download.php?f=video145part3']
>>> for link in b.r.links:
...   b.download(link, '/tmp/videos')
... 
Downloading switzerland_trip.avi (86 MB) to: /tmp/videos/switzerland_trip.avi
('/tmp/videos/switzerland_trip.avi', <httplib.HTTPMessage instance at 0x8f7848c>)
Downloading japan_trip.avi (531 MB) to: /tmp/videos/japan_trip.avi
('/tmp/videos/japan_trip.avi', <httplib.HTTPMessage instance at 0x8f786ec>)
Downloading europe_trip.avi (2.6 GB) to: /tmp/videos/europe_trip.avi
('/tmp/videos/europe_trip.avi', <httplib.HTTPMessage instance at 0x8f785ec>)

-

Get response header information

>>> b.go('example.com')
<pynav.response.Response object at 0x8f78fec>
>>> b.r.content_type
'text/html; charset=UTF-8'
>>> b.r.content_length
'2945'
>>> b.r.date
datetime.datetime(2011, 2, 20, 21, 20, 52)
>>> b.r.last_modified
datetime.datetime(2011, 2, 9, 17, 13, 15)
>>> b.r.status_code
200
>>> b.r.status_message
'OK'
>>> print b.r.headers
Server: Apache/2.2.3 (CentOS)
Last-Modified: Wed, 09 Feb 2011 17:13:15 GMT
Content-Type: text/html; charset=UTF-8
Accept-Ranges: bytes
Connection: close     
Date: Sun, 20 Feb 2011 21:20:52 GMT
Age: 40     
Content-Length: 2945

-

Dump forms and generate code

>>> b = Browser()
>>> b.go('http://www.example.com/form.html')
<pynav.response.Response object at 0x9f839ec>
>>> b.r.dump_form()

>There are 1 forms in this page: [0])

>Possibilities:
fd.dump(0)
fd.dump_all()

(POST) http://www.example.com/cgi-bin/echo.cgi multipart/form-data
  (Textarea) blah=
  (Text) comments=
  (Checkbox) eggs=[spam]
  (Select) cheeses=[mozz, caerphilly, gouda, gorgonzola, parmesan, leicester, cheddar, mascarpone, emmenthal]
  (Checkbox) apples=[pears]
  (Checkbox) whocares=[edam, gouda]
  (Radio) spam=[spam, rhubarb]
  (Radio) smelly=[on]
  (Select) favorite_cheese=[*cheddar, brie, leicester, jahlsberg]
  (File) No files added

values = {'blah':'', 'comments':'', 'eggs':'', 'cheeses':'', 'apples':'', 'whocares':'', 'spam':'', 'smelly':'',
'favorite_cheese':'cheddar'}
print b.go('http://www.example.com/cgi-bin/echo.cgi', values)

>>>

-

Select a specific form after a dump

>>> b.go('http://www.example.com/packages')
<pynav.response.Response object at 0x9f83f4c>
>>> b.r.dump_form()

>There are 4 forms in this page: [0, 1, 2, 3])

>Possibilities:
fd.dump(0)
fd.dump(1)
fd.dump(2)
fd.dump(3)
fd.dump_all()

>>> b.r.dump_form(2)
(GET) http://www.example.com/search application/x-www-form-urlencoded
  (Hidden) searchon=contents
  (Text) keywords=
  (Submit) Search
  (Ignore) 
  (Radio) mode=[*path, exactfilename, filename]
  (Select) suite=[experimental, unstable, testing, *stable, oldstable]
  (Select) arch=[*any, i386, m68k, alpha, amd64, sparc, powerpc, arm, hppa, ia64, mips, mipsel, s390]

values = {'searchon':'contents', 'keywords':'', 'mode':'path', 'suite':'stable', 'arch':'any'}
print b.go('http://www.example.com/search', values)

>>>
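
The two generated lines above are meant to be pasted into a script and edited before submitting. For instance, filling in the keywords field before posting the search form (the 'pynav' value is only an illustration):

values = {'searchon': 'contents', 'keywords': 'pynav', 'mode': 'path', 'suite': 'stable', 'arch': 'any'}
print b.go('http://www.example.com/search', values)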

-

FAQ

How to define the allowed content types

b.allowed_content_types = ['text/html', 'text/plain', 'text/css']

-

How to allow all content types

b.allow_all_content_types()

-

How to disable robots.txt handling (do this with thought and consideration)

b.handle_robots = False

-

How to handle referer automatically

b.handle_referer = True

-

How to set the timeout

b.timeout = 15 #seconds

-

How to define a delay between page visits

b.set_page_delay(2, 6) #random time between 2 and 6 seconds

-

How to check if a cookie is set

>>> 'my_cookie' in b.cookies
True

-

How to use HTTP Basic authentication

b.set_http_auth('http://example.com', 'login', 'pass')

-

How to use a proxy (does not work with HTTPS, a urllib restriction)

b.proxy = 'http://www.example.com:3128/'
