DjangoCon 2011 Notes / docs / scraping.rst

Y’all Wanna Scrape with Us? Content Ain’t a Thing : Web Scraping With Our Favorite Python Libraries

I got 99 problems but content ain't one

  • Everyone needs good content.
  • Good content exists all over the web.
  • Scrape it 'til you make it.

LXML: Diving in

lxml.etree vs. lxml.html

  • etree: best for properly formatted xml/xhtml

  • etree: powerful and fast for SOAP or other xml-formatted content

  • html: best for web sites & irregular content

    cssselect

    uses CSS selector syntax to find and pick out HTML elements.

    iterlinks

    returns a generator over every link on the page (anchors, images, scripts, stylesheets and so on). Remember: ads have lots of links.

    sourceline

    gives the line number of your element in the page source. Exists in both lxml.html and lxml.etree.

    find, findall

    can locate html elements within another node or a page. Exists in both lxml.html and lxml.etree.

    descendants/children/siblings/ancestors

    all elements have iterchildren, itersiblings, iterancestors and iterdescendants.

    forms

    can find all (normal) forms on a page. Beware of CAPTCHAs and the like.

    text, text_content, and itertext

    ways to get content without tags.
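
A minimal sketch of a few of the calls above, assuming lxml is installed. The page snippet and its URLs are made up for illustration. cssselect is left out because recent lxml releases may need the separate cssselect package for it.

```python
from lxml import etree
import lxml.html

# Hypothetical page snippet; note the unclosed <b> tag.
PAGE = """<html><body>
<div id="story"><h1>Title</h1>
<p>Some <b>bold text.</p>
<a href="/next">Next</a>
<a href="http://ads.example.com/buy">Ad</a>
</div>
</body></html>"""

# etree expects well-formed XML and rejects the broken markup.
try:
    etree.fromstring(PAGE)
    strict_ok = True
except etree.XMLSyntaxError:
    strict_ok = False

# lxml.html recovers from the same input.
doc = lxml.html.fromstring(PAGE)

# iterlinks yields (element, attribute, link, pos) tuples.
links = [link for _el, _attr, link, _pos in doc.iterlinks()]

# findall takes an ElementPath expression relative to the node.
paragraphs = doc.findall(".//p")

# sourceline is the line of the input on which the element starts.
line_no = paragraphs[0].sourceline

# text_content() returns the element's text with all tags removed.
text = paragraphs[0].text_content()

print(strict_ok, links, line_no, text)
```

The same page goes through both parsers on purpose: etree refuses it, lxml.html repairs it, which is the trade-off described in the etree vs. html bullets.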

If you have to parse in realtime, LXML is sometimes too much.

re
html == strings == parseable.
feedparser
standard XML feeds (RSS, Atom) have rules; feedparser knows them.
HTMLParser
good base class for your own HTML parser. Good for "I have an idea about how I want to handle embed tags".
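
A rough sketch of the two standard-library options, assuming Python 3, where the HTMLParser class lives in html.parser (in 2011 it was the HTMLParser module). The snippet, the EmbedCollector class and the tag-stripping regex are illustrative assumptions, not code from the talk.

```python
import re
from html.parser import HTMLParser

# Hypothetical snippet for illustration.
SNIPPET = '<p>Watch: <embed src="http://example.com/clip.swf" width="400"></p>'

# re: fine for small, predictable patterns, brittle on nested or messy markup.
TAG_RE = re.compile(r"<[^>]+>")
plain = TAG_RE.sub("", SNIPPET).strip()


class EmbedCollector(HTMLParser):
    """Collects the src attribute of every <embed> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "embed":
            self.sources.append(dict(attrs).get("src"))


collector = EmbedCollector()
collector.feed(SNIPPET)
collector.close()

print(plain)
print(collector.sources)
```

Subclassing keeps the custom logic in one handler method, so changing how embed tags are treated means editing handle_starttag only.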

Content is 1/2 of the equation.

I'm tired of ugly pages with badass content.

Note

Text = Content = Boss