Overview

Crawler parsing library:

Utilities for parsing the various kinds of page data the crawler pulls down.


Tools:
-----------------------
zkit/bot_txt.py

txt_wrap_by(start_str, end_str, text):
Returns the substring of text found between start_str and end_str. Only the first match is returned.

txt_wrap_by_all(start_str, end_str, text):
Returns all substrings of text found between start_str and end_str, as a list.
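
For reference, a minimal sketch of how these two helpers could be implemented; the actual code in zkit/bot_txt.py may differ:

def txt_wrap_by(start_str, end_str, text):
    # Minimal sketch; returns None when either delimiter is missing.
    start = text.find(start_str)
    if start < 0:
        return None
    start += len(start_str)
    end = text.find(end_str, start)
    if end < 0:
        return None
    return text[start:end]

def txt_wrap_by_all(start_str, end_str, text):
    # Minimal sketch; scans left to right, collecting every match.
    result = []
    pos = 0
    while True:
        start = text.find(start_str, pos)
        if start < 0:
            break
        start += len(start_str)
        end = text.find(end_str, start)
        if end < 0:
            break
        result.append(text[start:end])
        pos = end + len(end_str)
    return result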

Example:
text = '''<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    
    <title>8.6. array — Efficient arrays of numeric values &mdash; Python v2.7.2 documentation</title>
    <link rel="stylesheet" href="../_static/default.css" type="text/css" />
    <link rel="stylesheet" href="../_static/pygments.css" type="text/css" />
    <script type="text/javascript">
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    '../',
        VERSION:     '2.7.2',
        COLLAPSE_MODINDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };
    </script>
    <script type="text/javascript" src="../_static/jquery.js"></script>
    <script type="text/javascript" src="../_static/doctools.js"></script>
    <link rel="search" type="application/opensearchdescription+xml"
          title="Search within Python v2.7.2 documentation"
          href="../_static/opensearch.xml"/>
    <link rel="author" title="About these documents" href="../about.html" />
    <link rel="copyright" title="Copyright" href="../copyright.html" />
    <link rel="top" title="Python v2.7.2 documentation" href="../index.html" />
    <link rel="up" title="8. Data Types" href="datatypes.html" />
    <link rel="next" title="8.7. sets — Unordered collections of unique elements" href="sets.html" />
    <link rel="prev" title="8.5. bisect — Array bisection algorithm" href="bisect.html" />
    <link rel="shortcut icon" type="image/png" href="../_static/py.png" />
    <script type="text/javascript" src="../_static/copybutton.js"></script>
 

  </head>'''

txt_wrap_by('<title>','</title>',text)
returns:
8.6. array — Efficient arrays of numeric values &mdash; Python v2.7.2 documentation

txt_wrap_by_all('type="text/javascript" src="','"',text)
returns the paths of all the external JavaScript files:
[
    "../_static/jquery.js",
    "../_static/doctools.js",
    "../_static/copybutton.js",
]
-----------------------
Web crawler: zkit/spider

Usage:
#1. Set up the HTTP request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7',
    'Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
    'Accept-Language': 'zh-cn,zh;q=0.5',
    'Accept-Charset': 'gb18030,utf-8;q=0.7,*;q=0.7',
    'Content-type': 'application/x-www-form-urlencoded',
}

#2. Set up the fetcher. Two are provided: one with a cache and one without.
fetcher = NoCacheFetch(0, headers=headers)
#3. Initialize the Rolling crawler. The first argument is the fetcher; the second is an iterator of (url_handler, url) tuples for the first layer of URLs.
spider = Rolling(fetcher, url_iter)
#4. Start the crawler. workers_count is the number of worker threads (gevent is used under the hood, so install it beforehand).
spider_runner = GSpider(spider, workers_count=10)
spider_runner.start()

#5. url_iter is an iterator that yields (url_handler, url) tuples. When a page has been fetched, url_handler(page, url) is called, where page is the fetched content and url is the URL it came from. If url_handler in turn yields more (url_handler, url) tuples, the crawl continues with those; a sketch follows the demo reference below.

For a concrete reference, see zkit/spider/demo.py (it does not run as-is, but it illustrates the usage).
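
As a rough sketch of how the pieces fit together, here is a hypothetical url_iter plus handlers. The site, URLs, and link-extraction logic below are invented for illustration, and the import paths are assumptions:

# Hypothetical sketch: the URLs and parsing logic are made up; import paths are assumed.
from zkit.bot_txt import txt_wrap_by_all
from zkit.spider import NoCacheFetch, Rolling, GSpider

def parse_entry(page, url):
    # Leaf handler: process the fetched page and yield nothing, so the crawl stops here.
    print('fetched %s (%d bytes)' % (url, len(page)))

def parse_index(page, url):
    # Pull links out of the index page and queue each one with its own handler.
    for link in txt_wrap_by_all('<a href="', '"', page):
        yield parse_entry, link

# The first layer: (url_handler, url) tuples the crawler starts from.
url_iter = iter([(parse_index, 'http://example.com/index.html')])

fetcher = NoCacheFetch(0, headers=headers)  # headers as defined in step 1
spider = Rolling(fetcher, url_iter)
spider_runner = GSpider(spider, workers_count=10)
spider_runner.start()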