Quickstart

Pomp is fun to use and incredibly easy for basic applications.

A Minimal Application

For a minimal application, all you need to do is define your crawler by inheriting from :class:`BaseCrawler`:

import re
from pomp.core.base import BaseCrawler, BasePipeline
from pomp.contrib import SimpleDownloader


# raw string so that \w and \s reach the regex engine unescaped
python_sentence_re = re.compile(r'[\w\s]*python[\s\w]*', re.I | re.M)


class MyCrawler(BaseCrawler):
    """Extract all sentences with `python` word"""
    ENTRY_URL = 'http://python.org/news'  # entry point

    def extract_items(self, response):
        for i in python_sentence_re.findall(response.body.decode('utf-8')):
            yield i.strip()

    def next_url(self, response):
        return None  # one-page crawler, stop crawling


class PrintPipeline(BasePipeline):
    def process(self, item):
        print('Sentence:', item)


if __name__ == '__main__':
    from pomp.core.engine import Pomp

    pomp = Pomp(
        downloader=SimpleDownloader(),
        pipelines=(PrintPipeline(),)
    )

    pomp.pump(MyCrawler())
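
To see what ``extract_items`` would yield without fetching a page, the same regular expression can be tried on a plain string. This is only a sketch using the standard ``re`` module: the ``extract_sentences`` helper and the ``sample`` text are hypothetical stand-ins for the crawler method and a downloaded response body, not part of Pomp.

```python
import re

# Same pattern as in the crawler above: any run of word and whitespace
# characters surrounding the word "python", matched case-insensitively.
python_sentence_re = re.compile(r'[\w\s]*python[\s\w]*', re.I | re.M)


def extract_sentences(text):
    """Return the stripped fragments of `text` that contain `python`.

    Hypothetical helper mirroring the body of `extract_items`.
    """
    return [match.strip() for match in python_sentence_re.findall(text)]


# Hypothetical sample text standing in for a decoded response body.
sample = "Python 3 is out. Nothing here! We love python, really."
sentences = extract_sentences(sample)

for sentence in sentences:
    print('Sentence:', sentence)
```

Because the character class contains only word and whitespace characters, each match stops at the first punctuation mark, so the extracted items are fragments between punctuation rather than full grammatical sentences.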

Item pipelines

Custom downloader

Downloader middleware
