.. DjangoCon 2011 Notes, docs/scraping.rst

===================================================================================================
Y'all Wanna Scrape with Us? Content Ain't a Thing: Web Scraping With Our Favorite Python Libraries
===================================================================================================

Presented by Katharine Jarmul
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I got 99 problems but content ain't one
---------------------------------------

* Everyone needs good content.
* Good content exists all over the web.
* Scrape it 'til you make it.

LXML: Diving in
---------------

``lxml.etree`` vs. ``lxml.html``
________________________________

* ``etree``: best for well-formed XML/XHTML
* ``etree``: powerful and fast for SOAP or other XML-formatted content
* ``html``: best for websites and irregular content
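
As a rough illustration of that split, the sketch below feeds the same kind of input to both modules. The markup strings are made up for the example, and it assumes ``lxml`` is installed.

```python
from lxml import etree, html

# Well-formed XML: etree parses it strictly and quickly.
xml_doc = "<feed><entry id='1'>First</entry><entry id='2'>Second</entry></feed>"
root = etree.fromstring(xml_doc)
entry_ids = [entry.get("id") for entry in root.findall("entry")]

# Irregular markup (hypothetical): unclosed <p> and <b> tags.
messy = "<div><p>Unclosed paragraph<p>Another <b>bold</div>"

# The strict XML parser in etree rejects it...
try:
    etree.fromstring(messy)
    strict_ok = True
except etree.XMLSyntaxError:
    strict_ok = False

# ...while lxml.html repairs it into a usable tree.
fragment = html.fromstring(messy)
paragraphs = [p.text_content() for p in fragment.findall(".//p")]
```

The strict parser is the one you want when malformed input should be treated as an error; the HTML parser is the forgiving one for pages you do not control.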

``lxml.html``: hidden gems
__________________________

    ``cssselect``
        uses CSS selector syntax to find and select HTML elements.
    ``iterlinks``
        creates a generator of all **linky** elements on the page.
        Remember: ads have lots of links.
    ``sourceline``
        can identify the location of your element on the page.
        Exists in both ``lxml.html`` and ``lxml.etree``.
    ``find``, ``findall``
        can locate html elements within another node or a page.
        Exists in both ``lxml.html`` and ``lxml.etree``.
    descendants/children/siblings/ancestors
        all elements have ``iterchildren``, ``itersiblings``, ``iterancestors`` and ``iterdescendants``.
    forms
        can find all (normal) forms on a page.
        beware of CAPTCHAs and the like.
    ``text``, ``text_content``, and ``itertext``
        ways to get content without tags.
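
A few of these can be tried on a tiny page. This is only a sketch: the markup is invented for illustration, ``lxml`` is assumed to be installed, and ``cssselect`` and forms are left out to keep it short.

```python
from lxml import html

# Hypothetical page used only for this example.
page = html.fromstring("""<html><body>
<div id="main">
  <h1>Title</h1>
  <p>Some <a href="/about">linked</a> text.</p>
  <img src="/logo.png">
</div>
</body></html>""")

# iterlinks: yields (element, attribute, link, position) for every linky element.
links = [(el.tag, attr, url) for el, attr, url, pos in page.iterlinks()]

# find / findall: locate elements inside a page or another node.
main = page.find(".//div")
heading = main.find("h1")
paragraph = main.find("p")

# sourceline: the line of the source where the element started.
heading_line = heading.sourceline

# Tree navigation helpers available on every element.
child_tags = [child.tag for child in main.iterchildren()]
ancestor_tags = [node.tag for node in paragraph.iterancestors()]
following_tags = [node.tag for node in paragraph.itersiblings()]

# Three ways to get content without tags.
leading_text = paragraph.text             # only the text before the first child
full_text = paragraph.text_content()      # all text, tags stripped
text_pieces = list(paragraph.itertext())  # each text chunk in document order
```

Note that ``text`` stops at the first child tag, so ``text_content`` or ``itertext`` is usually the better fit when a paragraph contains links or inline formatting.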

If you have to parse in real time, lxml is sometimes too much. Lighter-weight options:

``re``
    html == strings == parseable.
``feedparser``
    standard XML has rules; feedparser knows them.
``HTMLParser``
    good base class for your own HTML parser.
    good for "I have an idea about how I want to handle ``embed`` tags".
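
The two standard-library options might look like the sketch below. The snippet and the ``EmbedCollector`` class are hypothetical names for illustration; in Python 3 the base class lives in ``html.parser`` (the Python 2 module was called ``HTMLParser``).

```python
import re
from html.parser import HTMLParser

# Hypothetical markup used only for this example.
snippet = '<p>Watch: <embed src="/clip.swf" width="400"> or <a href="/next">next</a></p>'

# re: treat the markup as a plain string and pull out href values.
hrefs = re.findall(r'href="([^"]*)"', snippet)


class EmbedCollector(HTMLParser):
    """Collects the attributes of every <embed> tag it sees."""

    def __init__(self):
        super().__init__()
        self.embeds = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "embed":
            self.embeds.append(dict(attrs))


collector = EmbedCollector()
collector.feed(snippet)
collector.close()
embeds = collector.embeds
```

The regular expression is fine for a quick, narrow extraction; the subclass is the place to put tag-specific handling once the rules get more involved.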

Content is 1/2 of the equation.

I'm tired of ugly pages with badass content.

.. note:: Text = Content = Boss