DjangoCon 2011 Notes / docs / scraping.rst

Y'all Wanna Scrape with Us? Content Ain't a Thing : Web Scraping With Our Favorite Python Libraries

Presented by Katharine Jarmul

I got 99 problems but content ain't one

* Everyone needs good content.
* Good content exists all over the web.
* Scrape it 'til you make it.

LXML: Diving in

``lxml.etree`` vs. ``lxml.html``

* ``etree``: best for well-formed XML/XHTML
* ``etree``: powerful and fast for SOAP or other XML-formatted content
* ``html``: best for web sites & irregular content
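
A minimal sketch of the difference, assuming ``lxml`` is installed. The sample markup and variable names are made up for illustration: the strict ``etree`` parser rejects sloppy markup, while ``lxml.html`` recovers from it.

```python
from lxml import etree, html

# Well-formed XML: etree parses it strictly and quickly.
xml_doc = etree.fromstring("<feed><entry id='1'>Hello</entry></feed>")
entry_text = xml_doc.find("entry").text

# Irregular markup (hypothetical sample): the <p> tags are never closed.
messy = "<div><p>Unclosed paragraph<p>Another one</div>"

# etree's XML parser refuses markup that is not well formed.
try:
    etree.fromstring(messy)
    strict_ok = True
except etree.XMLSyntaxError:
    strict_ok = False

# lxml.html repairs the markup and still builds a usable tree.
html_doc = html.fromstring(messy)
paragraphs = [p.text_content() for p in html_doc.findall(".//p")]
```

Here ``strict_ok`` ends up ``False``, while ``paragraphs`` still holds the text of both paragraphs.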

``lxml.html``: hidden gems

        uses CSS selector syntax to find and highlight HTML elements.
        creates a generator of all **linky** elements on the page.
        Remember: ads have lots of links.
        can identify the location of your element on the page.
        Exists in both ``lxml.html`` and ``lxml.etree``.
    ``find``, ``findall``
        can locate HTML elements within another node or a page.
        Exists in both ``lxml.html`` and ``lxml.etree``.
        all elements have ``iterchildren``, ``itersiblings``, ``iterancestors`` and ``iterdescendants``.
        can find all (normal) forms on a page.
        beware of CAPTCHAs and the like.
    ``text``, ``text_content``, and ``itertext``
        ways to get content without tags.
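
A rough sketch of a few of these helpers on a tiny, made-up page, assuming ``lxml`` is installed. The use of ``iterlinks`` and ``forms`` is my reading of the link and form entries above, which do not name the calls. The markup, paths, and variable names are illustrative only.

```python
from lxml import html

# Hypothetical page used only for this example.
page = html.fromstring(
    "<html><body>"
    "<div id='post'><h1>Title</h1>"
    "<p>Some <b>bold</b> text. <a href='/next'>Next</a></p></div>"
    "<form action='/search'><input name='q'></form>"
    "</body></html>"
)

# find / findall: locate elements inside a page or inside another node.
post = page.find(".//div[@id='post']")
anchors = post.findall(".//a")

# iterlinks (assumed to be the "linky elements" generator): yields
# (element, attribute, link, position) for every link-like attribute.
links = [(el.tag, attr, url) for el, attr, url, _pos in page.iterlinks()]

# forms (assumed to be the form finder): every <form> on the page.
search_action = page.forms[0].action

# Three ways to get content without tags.
paragraph = post.find("p")
leading_text = paragraph.text             # text before the first child only
all_text = post.text_content()            # all text, tags stripped (lxml.html)
text_chunks = list(paragraph.itertext())  # text pieces in document order
```

Note that ``text`` stops at the first child element, so ``text_content`` or ``itertext`` is usually what you want for a whole block of copy.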

If you have to parse in real time, lxml is sometimes too much.

    html == strings == parseable.
    standard XML has rules, feedparser knows them.
    good base class for your own HTML parser.
    good for "I have an idea about how I want to handle ``embed`` tags".
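
As an illustration of the last two points, here is a small sketch that assumes the base class in question is the standard library's ``HTMLParser`` (``html.parser`` in Python 3). The ``EmbedCollector`` class and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class EmbedCollector(HTMLParser):
    """Collects the src attribute of every <embed> tag it sees."""

    def __init__(self):
        super().__init__()
        self.embeds = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "embed":
            self.embeds.append(dict(attrs).get("src"))


collector = EmbedCollector()
collector.feed('<p>Video:</p><embed src="movie.swf" width="400">')
embed_sources = collector.embeds
```

The same pattern extends to any tag you have opinions about: override the handler, keep the bits you care about, and ignore the rest.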

Content is 1/2 of the equation.

I'm tired of ugly pages with badass content.

.. note:: Text = Content = Boss