Pynav 0.7

Changelog

Release date: 2011-02-22

  • New: Split Pynav into Browser and Response classes and refactor the code (see the short sketch after this changelog)
  • New: FormDumper class added to dump forms as readable text and pre-generate Python code for Pynav
  • New: Python 2.5 is no longer supported; Python 2.6 or later is now required
  • New: Change licence from GPL to LGPL
  • New: Add Browser.check_404(url, values)
  • New: Add Browser.check_new_resource(url, values, last_datetime)
  • New: Add Browser.handle_robots boolean attribute to handle robots.txt
  • New: Add methods to manage content types white list
  • New: Add Response.read(boolean)
  • Qual: Migrate Python 2.5 code to Python 2.6 (format strings, decorators, imports...)
  • Qual: Refactor code to be more pythonic, rename attributes and clean code
  • Fix: Typo introduced during refactoring, thanks gjbaker ;)
  • Fix: Empty link bug in get_all_links()
  • Fix: Use the GET method when no POST data is given. Thanks ranan ;)
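
To give a feel for the new Browser / Response split, here is a minimal script-style sketch combining several of the 0.7 features listed above (robots.txt handling, content-type white list, page delay, 404 checking). All calls appear elsewhere in this page; the URL and the 'foo' regex are placeholders, not part of Pynav:

from pynav import Browser

b = Browser()
b.handle_robots = True    # respect robots.txt (new in 0.7)
b.allow_html_only()       # only visit HTML content types
b.set_page_delay(2, 6)    # wait between 2 and 6 seconds between visits
b.timeout = 15            # seconds

b.go('example.com')       # returns a Response, also available as b.r
for link in b.r.get_links('foo'):
    if b.check_404(link):
        print 'dead link: {0}'.format(link)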

Interactive mode examples

Visit a URL and print the response

>>> from pynav import Browser
>>> b = Browser()
>>> b.go('example.com')
<pynav.response.Response object at 0x8feb9cc>
>>> print b.response  # or print b.r
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML ...">
<head>
	<title>IANA &mdash; Example domains</title>
...

-

Get all the links of a page and filter them with a regex

>>> b.go('example.com')
<pynav.response.Response object at 0x8febdec>
>>> b.r.links
['http://example.com/foo.html', 'http://example.com/bar1.html',
'http://otherwebsite.com/myfoopage.html', 'http://example.com/bar2.html']
>>> b.r.get_links('foo')
['http://example.com/foo.html', 'http://otherwebsite.com/myfoopage.html']

-

Get all images of a URL, filter them with a regex and download them

>>> b.go('example.com')
<pynav.response.Response object at 0x8febdec>
>>> b.r.images
['http://example.com/foobar.png', 'http://example.com/images/foo.jpeg',
'http://example.com/images/logobar.png']
>>> for image in b.r.get_images('bar\.png'):
...   b.download(image, '/tmp')
... 
('/tmp/foobar.png', <httplib.HTTPMessage instance at 0x8ff7d2c>)
('/tmp/logobar.png', <httplib.HTTPMessage instance at 0x8ff7f6c>)

-

Check if a new resource is available

>>> from datetime import datetime
>>> last_check = datetime(2011, 2, 5, 10, 35, 12)
>>> b.check_new_resource('example.com/news.html', last_check)
True
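
The same call can drive a small polling loop. A sketch only: b is the Browser instance from the examples above, and the URL and the one-hour interval are illustrative:

import time
from datetime import datetime

last_check = datetime.now()
while True:
    time.sleep(3600)    # re-check once an hour
    if b.check_new_resource('example.com/news.html', last_check):
        print 'new content since {0}'.format(last_check)
        last_check = datetime.now()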

-

Check links for 404 errors

>>> b.go('example.com')
>>> for link in b.r.links:
...   print "{link}: {res}".format(link=link, res=b.check_404(link))
... 
http://example.com/erroneous_link.php: True
http://example.com/valid_link.php: None

-

Allow only HTML content types when following links

>>> b.go('example.com')
<pynav.response.Response object at 0x96d89ac>
>>> b.allow_html_only()
>>> for link in b.r.links:
...   b.go(link)
... 
Content-Type image/x-icon is not allowed !
<pynav.response.Response object at 0x96d87cc>
Content-Type text/css is not allowed !
<pynav.response.Response object at 0x96d8aac>
Content-Type text/css; charset=UTF-8 is not allowed !
<pynav.response.Response object at 0x96d8acc>
<pynav.response.Response object at 0x96d8a0c>
<pynav.response.Response object at 0x96e0a8c>
...

-

Download files in verbose mode

>>> b.go('example.com/download.php')
<pynav.response.Response object at 0x8f78b2c>
>>> b.verbose = True
>>> b.r.links
['http://www.example.com/download.php?f=video145part1',
'http://www.example.com/download.php?f=video145part2',
'http://www.example.com/download.php?f=video145part3']
>>> for link in b.r.links:
...   b.download(link, '/tmp/videos')
... 
Downloading switzerland_trip.avi (86 MB) to: /tmp/videos/switzerland_trip.avi
('/tmp/videos/switzerland_trip.avi', <httplib.HTTPMessage instance at 0x8f7848c>)
Downloading japan_trip.avi (531 MB) to: /tmp/videos/japan_trip.avi
('/tmp/videos/japan_trip.avi', <httplib.HTTPMessage instance at 0x8f786ec>)
Downloading europe_trip.avi (2.6 GB) to: /tmp/videos/europe_trip.avi
('/tmp/videos/europe_trip.avi', <httplib.HTTPMessage instance at 0x8f785ec>)

-

Get response header information

>>> b.go('example.com')
<pynav.response.Response object at 0x8f78fec>
>>> b.r.content_type
'text/html; charset=UTF-8'
>>> b.r.content_length
'2945'
>>> b.r.date
datetime.datetime(2011, 2, 20, 21, 20, 52)
>>> b.r.last_modified
datetime.datetime(2011, 2, 9, 17, 13, 15)
>>> b.r.status_code
200
>>> b.r.status_message
'OK'
>>> print b.r.headers
Server: Apache/2.2.3 (CentOS)
Last-Modified: Wed, 09 Feb 2011 17:13:15 GMT
Content-Type: text/html; charset=UTF-8
Accept-Ranges: bytes
Connection: close     
Date: Sun, 20 Feb 2011 21:20:52 GMT
Age: 40     
Content-Length: 2945

-

Dump forms and generate code

>>> b = Browser()
>>> b.go('http://www.example.com/form.html')
<pynav.response.Response object at 0x9f839ec>
>>> b.r.dump_form()

>There are 1 forms in this page: [0])

>Possibilities:
fd.dump(0)
fd.dump_all()

(POST) http://www.example.com/cgi-bin/echo.cgi multipart/form-data
  (Textarea) blah=
  (Text) comments=
  (Checkbox) eggs=[spam]
  (Select) cheeses=[mozz, caerphilly, gouda, gorgonzola, parmesan, leicester, cheddar, mascarpone, emmenthal]
  (Checkbox) apples=[pears]
  (Checkbox) whocares=[edam, gouda]
  (Radio) spam=[spam, rhubarb]
  (Radio) smelly=[on]
  (Select) favorite_cheese=[*cheddar, brie, leicester, jahlsberg]
  (File) No files added

values = {'blah':'', 'comments':'', 'eggs':'', 'cheeses':'', 'apples':'', 'whocares':'', 'spam':'', 'smelly':'',
'favorite_cheese':'cheddar'}
print b.go('http://www.example.com/cgi-bin/echo.cgi', values)

>>>

-

Select a specific form after a dump

>>> b.go('http://www.example.com/packages')
<pynav.response.Response object at 0x9f83f4c>
>>> b.r.dump_form()

>There are 4 forms in this page: [0, 1, 2, 3])

>Possibilities:
fd.dump(0)
fd.dump(1)
fd.dump(2)
fd.dump(3)
fd.dump_all()

>>> b.r.dump_form(2)
(GET) http://www.example.com/search application/x-www-form-urlencoded
  (Hidden) searchon=contents
  (Text) keywords=
  (Submit) Search
  (Ignore) 
  (Radio) mode=[*path, exactfilename, filename]
  (Select) suite=[experimental, unstable, testing, *stable, oldstable]
  (Select) arch=[*any, i386, m68k, alpha, amd64, sparc, powerpc, arm, hppa, ia64, mips, mipsel, s390]

values = {'searchon':'contents', 'keywords':'', 'mode':'path', 'suite':'stable', 'arch':'any'}
print b.go('http://www.example.com/search', values)

>>>
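
The two generated lines above are meant to be pasted into a script and edited before submitting. For instance, filling in the keywords field before posting the search form (the 'pynav' value is only an illustration):

values = {'searchon': 'contents', 'keywords': 'pynav', 'mode': 'path', 'suite': 'stable', 'arch': 'any'}
print b.go('http://www.example.com/search', values)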

-

FAQ

How to define the allowed content types

b.allowed_content_types = ['text/html', 'text/plain', 'text/css']

-

How to allow all content types

b.allow_all_content_types()

-

How to disable robots.txt handling (do this with thought and consideration)

b.handle_robots = False

-

How to handle referer automatically

b.handle_referer = True

-

How to set the timeout

b.timeout = 15 #seconds

-

How to define a delay between page visits

b.set_page_delay(2, 6) #random time between 2 and 6 seconds

-

How to check if a cookie is set

>>> 'my_cookie' in b.cookies
True

-

How to use HTTP Basic authentication

b.set_http_auth('http://example.com', 'login', 'pass')

-

How to use a proxy (does not work with HTTPS, a urllib restriction)

b.proxy = 'http://www.example.com:3128/'
