DjangoCon 2011 Notes / docs / scraping.rst

Y’all Wanna Scrape with Us? Content Ain’t a Thing: Web Scraping With Our Favorite Python Libraries

I got 99 problems but content ain't one

  • Everyone needs good content.
  • Good content exists all over the web.
  • Scrape it 'til you make it.

LXML: Diving in

lxml.etree vs. lxml.html

  • etree: best for properly formatted xml/xhtml
  • etree: powerful and fast for SOAP or other xml-formatted content
  • html: best for web sites & irregular content
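A minimal sketch of the difference, assuming lxml is installed. The sample markup and variable names are made up for illustration and are not from the talk.

```python
from lxml import etree, html

# Well-formed XML: the strict etree parser handles it directly.
xml_doc = etree.fromstring("<feed><entry id='1'>Hello</entry></feed>")
entry_text = xml_doc.find("entry").text

# Irregular markup (unclosed <p> and <br>), typical of real web pages.
messy = "<div><p>Unclosed paragraph<br>more text</div>"

# The XML parser in etree refuses markup that is not well formed...
try:
    etree.fromstring(messy)
    strict_ok = True
except etree.XMLSyntaxError:
    strict_ok = False

# ...while lxml.html recovers and still builds a usable tree.
html_doc = html.fromstring(messy)
recovered_text = html_doc.text_content()

print(entry_text, strict_ok, repr(recovered_text))
```

In practice this suggests reaching for lxml.html first when the source is an arbitrary web page, and keeping etree for feeds and services that promise valid XML.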

lxml.html: hidden gems

cssselect
uses CSS selector syntax to find and pick out HTML elements.
iterlinks
creates a generator of all linky elements on the page. Remember: ads have lots of links.
sourceline
can identify the location of your element on the page. Exists in both lxml.html and lxml.etree.
find, findall
can locate html elements within another node or a page. Exists in both lxml.html and lxml.etree.
descendants/children/siblings/ancestors
all elements have iterchildren, itersiblings, iterancestors and iterdescendants.
forms
can find all (normal) forms on a page. beware of CAPTCHAs and the like.
text, text_content, and itertext
ways to get content without tags.
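The helpers listed above can be tried together on a small page. This is only a sketch: the PAGE markup, URLs, and variable names are invented for illustration. It assumes lxml is installed, and it falls back to XPath when the separate cssselect package (which recent lxml releases need for cssselect) is missing.

```python
from lxml import html

# Hypothetical sample page, not taken from the talk.
PAGE = """<html><body>
<h1 class="title">Good content</h1>
<p>Read <a href="/post/1">this</a> and <a href="http://example.com/ad">that</a>.</p>
<form action="/search"><input type="text" name="q" value="lxml"></form>
</body></html>"""

doc = html.fromstring(PAGE)

# cssselect: CSS selectors; raises ImportError if the cssselect package is absent.
try:
    titles = doc.cssselect("h1.title")
except ImportError:
    titles = doc.xpath("//h1[@class='title']")
title = titles[0]

# sourceline: line of the element in the parsed source.
line_no = title.sourceline

# iterlinks: yields (element, attribute, link, pos) for every link-like attribute.
hrefs = [link for _el, attr, link, _pos in doc.iterlinks() if attr == "href"]

# find / findall: search below a node with ElementPath expressions.
paragraph = doc.find(".//p")
anchors = paragraph.findall("a")

# iterancestors: walk up from an element towards the root.
ancestor_tags = [el.tag for el in anchors[0].iterancestors()]

# forms: every <form> on the page, with dict-like access to its fields.
form = doc.forms[0]
query_value = form.fields["q"]

# text_content / itertext: text without the tags.
title_text = title.text_content()
paragraph_chunks = list(paragraph.itertext())

print(title_text, line_no, hrefs, ancestor_tags, form.action, query_value)
```

Filtering iterlinks on the attribute name is one simple way to keep anchors and drop form actions, images, and scripts; filtering ad links would need an extra check on the URL itself.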

If you have to parse in realtime, LXML is sometimes too much.

re
html == strings == parseable.
feedparser
standard XML has rules, feedparser knows them.
HTMLParser
good base class for your own HTML parser. good for "I have an idea about how I want to handle embed tags".
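A rough sketch of the two standard-library options, using only re and the parser base class (the module is named html.parser in Python 3 and HTMLParser in Python 2). The SNIPPET markup and the EmbedCollector class are hypothetical, chosen to mirror the embed-tag idea above; feedparser is left out because it is a third-party package.

```python
import re
from html.parser import HTMLParser

# Hypothetical snippet, not taken from the talk.
SNIPPET = ('<p>Watch: <embed src="/media/clip.swf" width="400"> '
           'or <a href="/post/2">read</a></p>')

# re: cheap for one narrow pattern, fragile for nested or irregular markup.
hrefs = re.findall(r'href="([^"]+)"', SNIPPET)


class EmbedCollector(HTMLParser):
    """Hypothetical parser that records the src of every <embed> tag."""

    def __init__(self):
        super().__init__()
        self.embeds = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "embed":
            self.embeds.append(dict(attrs).get("src"))


collector = EmbedCollector()
collector.feed(SNIPPET)
collector.close()

print(hrefs, collector.embeds)
```

Subclassing keeps the custom rule in one handler method, so other tags pass through untouched and the rest of the parsing stays with the base class.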

Content is 1/2 of the equation.

I'm tired of ugly pages with badass content.

Note

Text = Content = Boss
