Commits

Pavel Zhukov committed e7f85c1

Adding documentation and bumping version

Files changed (2)

+Description:
+============
+Pbot contains two modules: Bot and Spider.
+
+Bot is a simple helper created to preserve request state (cookies, referrer) between HTTP requests. It also provides additional methods for adding cookies. With no dependencies, this module is easy to use when you need to simulate a browser.
+
+Spider is Bot armed with lxml (required). It provides additional methods for easy website crawling; see below.
+
+Bot is very easy to use::
+
+    from pbot.pbot import Bot
+    bot = Bot(proxies={'http': 'localhost:3128'})  # Proxies can be passed at creation or set later via bot.proxies
+    bot.add_cookie({'name': 'sample', 'value': 1, 'domain': 'example.com'})
+    response = bot.open('http://example.com')  # Open with cookies and an empty referrer
+    bot.follow('http://google.com')  # Open google.com with example.com as the referrer
+    response = bot.response  # The last response is saved and can be read later
+    bot.follow('http://example.com', post={'q': 'abc'})  # POST and GET data can be passed as keyword arguments
+    bot.refresh_connector()  # Flush cookies and referrer
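+
+For example, a short session that logs in and then navigates with state carried along (a minimal sketch; the URLs and form field names are hypothetical, and reading the saved response assumes it is file-like)::
+
+    bot = Bot()
+    bot.open('http://example.com/login')  # hypothetical login page
+    bot.follow('http://example.com/login', post={'user': 'me', 'password': 'secret'})
+    bot.follow('http://example.com/account')  # cookies and referrer are carried over
+    html = bot.response.read()  # assumption: the saved response object is file-like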
+
+Spider gives you special features::
+
+    from pbot.spider import Spider
+    bot = Spider()
+    bot.open('http://example.com')
+    bot.tree.xpath('//a')  # The lxml tree is available as .tree; the response is read and parsed by lxml.html automatically
+    form = bot.xpath('//form[@id="main"]')  # bot.xpath is a shortcut for bot.tree.xpath
+    bot.submit(form)  # Submit an lxml form
+
+    # Crawler: recursively crawl from the target page, yielding xml_tree, query_url and
+    # real_url (real_url is the URL after all redirects).
+    bot.crawl(
+        url=None,  # Target URL to start crawling from
+        check_base=True,  # Yield only pages on the same domain as url
+        only_descendant=True,  # Yield only pages whose URLs start with url
+        max_level=None,  # Maximum crawl depth
+        allowed_protocols=('http:', 'https:'),
+        ignore_errors=True,
+        ignore_starts=(),  # Tuple of prefixes; skip URLs that start with any of them (exclude parts of a site)
+        check_mime=())
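+
+A typical crawl loop might look like this (a minimal sketch; it assumes crawl() is used as a generator yielding the tuple described above)::
+
+    for tree, query_url, real_url in bot.crawl('http://example.com', max_level=2):
+        links = tree.xpath('//a')  # each yielded page arrives pre-parsed by lxml
+        print('%s: %d links' % (real_url, len(links)))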
+
 
 setup(
         name = 'pbot',
-        version = '1.2.0',
+        version = '1.3.0',
         packages = ['pbot'],
         install_requires = ['lxml'],
         author = 'Pavel Zhukov',
         author_email = 'gelios@gmail.com',
        description = 'A simple site crawler with proxy support',
-        long_description = 'An simple site grawler, project target - save state (cookies, referrer) between requests. \
-         Also support lxml.html.submit_form with bot.open_http method',
+        long_description = open('README').read(),
         license = 'GPL',
         keywords = 'crawling, bot',
         url = 'http://bitbucket.org/zeus/pbot'