Y’all Wanna Scrape with Us? Content Ain’t a Thing: Web Scraping With Our Favorite Python Libraries
I got 99 problems but content ain't one
- Everyone needs good content.
- Good content exists all over the web.
- Scrape it 'til you make it.
LXML: Diving in
lxml.etree vs. lxml.html
etree: best for well-formed XML/XHTML
etree: powerful and fast for SOAP or other XML-formatted content
html: best for websites & irregular content
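A minimal sketch of the difference, using made-up markup: the strict XML parser in lxml.etree rejects sloppy markup, while lxml.html repairs it and keeps going. The sample strings and variable names are illustrative only.

```python
from lxml import etree, html

# Well-formed XML: etree parses it strictly and quickly.
xml_doc = etree.fromstring(
    "<feed><item id='1'>first</item><item id='2'>second</item></feed>"
)
item_ids = [item.get("id") for item in xml_doc.findall("item")]

# Irregular markup (unclosed <p> tags): the strict XML parser refuses it...
messy = "<div><p>unclosed paragraph<p>another one</div>"
try:
    etree.fromstring(messy)
    strict_ok = True
except etree.XMLSyntaxError:
    strict_ok = False

# ...while lxml.html repairs the markup and carries on.
page = html.fromstring(messy)
paragraphs = [p.text for p in page.findall(".//p")]
```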
- cssselect
uses CSS selector syntax to find and pick out HTML elements.
- iterlinks
creates a generator of all linky elements on the page. Remember: ads have lots of links.
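A small sketch of link iteration, assuming the method meant here is lxml.html's `iterlinks()`. The markup and the "ads." host filter are hypothetical; real ad filtering would need something smarter.

```python
from lxml import html

page = html.fromstring(
    '<div><a href="/about">About</a>'
    '<img src="/logo.png">'
    '<a href="http://ads.example.com/click">Ad</a></div>'
)

# iterlinks() lazily yields (element, attribute, link, pos) tuples for
# anything that carries a URL: anchors, images, stylesheets, and so on.
links = [(el.tag, attr, url) for el, attr, url, pos in page.iterlinks()]

# Hypothetical filter: keep anchors only, and skip an obvious ad host.
content_links = [
    url for tag, attr, url in links
    if tag == "a" and "ads." not in url
]
```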
can identify the location of your element on the page. Exists in both lxml.html and lxml.etree.
- find, findall
can locate html elements within another node or a page. Exists in both lxml.html and lxml.etree.
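For instance, with some made-up markup, `find` returns the first match (or None) and `findall` returns a list of every match. Paths are relative to the node you call them on.

```python
from lxml import html

page = html.fromstring(
    "<div id='post'><h1>Title</h1><ul><li>one</li><li>two</li></ul></div>"
)

heading = page.find("h1")        # first matching direct child, or None
items = page.findall(".//li")    # every matching descendant, as a list
missing = page.find("table")     # no match: None, not an exception

item_texts = [li.text for li in items]
```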
all elements have iterchildren, itersiblings, iterancestors and iterdescendants.
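A quick sketch of the four traversal iterators on a toy tree. It uses lxml.etree so the tree has no implicit wrapper elements; elements from lxml.html have the same methods.

```python
from lxml import etree

root = etree.fromstring(
    "<div><ul><li>a</li><li id='mid'>b</li><li>c</li></ul></div>"
)
mid = root.find(".//li[@id='mid']")

children = [el.tag for el in root.iterchildren()]        # direct children only
descendants = [el.tag for el in root.iterdescendants()]  # whole subtree, document order
ancestors = [el.tag for el in mid.iterancestors()]       # nearest parent first
following = [el.text for el in mid.itersiblings()]       # siblings after this element
preceding = [el.text for el in mid.itersiblings(preceding=True)]  # siblings before it
```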
- forms
can find all (normal) forms on a page. Beware of CAPTCHAs and the like.
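A minimal sketch, assuming the feature meant here is the `forms` property of lxml.html elements. The search form below is invented for illustration.

```python
from lxml import html

page = html.fromstring(
    '<html><body>'
    '<form action="/search" method="get">'
    '<input type="text" name="q" value="lxml">'
    '<input type="hidden" name="lang" value="en">'
    '</form></body></html>'
)

form = page.forms[0]                        # list of every <form> found in the tree
field_names = sorted(form.fields.keys())    # names of the form's inputs
query_value = form.fields["q"]              # current value of a named field
form_target = (form.method, form.action)    # method is upper-cased by lxml
```

Hidden inputs show up in `fields` too, which is handy for spotting tokens a site expects you to send back.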
- text, text_content, and itertext
ways to get content without tags.
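The three differ in how much text they return. A short sketch with an invented paragraph (note that `text_content()` is specific to lxml.html elements):

```python
from lxml import html

para = html.fromstring("<p>Scraping is <b>mostly</b> about <i>text</i>.</p>")

# .text stops at the first child tag.
leading_text = para.text

# .text_content() flattens the whole subtree into one string.
full_text = para.text_content()

# .itertext() yields each text chunk in document order.
chunks = list(para.itertext())
```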
If you have to parse in real time, LXML is sometimes too much.
- html == strings == parseable.
- standard XML has rules, feedparser knows them.
- good base class for your own HTML parser. good for "I have an idea about how I want to handle embed tags".
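A sketch of the "I have an idea about embed tags" case, assuming the base class meant here is the standard library's HTMLParser (`html.parser.HTMLParser` in Python 3). The `EmbedCollector` class and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class EmbedCollector(HTMLParser):
    """Hypothetical parser that records the src of every <embed> tag."""

    def __init__(self):
        super().__init__()
        self.embeds = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs; tag names are lower-cased.
        if tag == "embed":
            self.embeds.append(dict(attrs).get("src"))


collector = EmbedCollector()
collector.feed(
    '<p>Watch:</p>'
    '<embed src="/video.swf" type="application/x-shockwave-flash">'
)
collector.close()
embed_sources = collector.embeds
```

Everything else on the page is ignored, since the default handlers do nothing. You only override the callbacks you care about.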
Content is 1/2 of the equation.
I'm tired of ugly pages with badass content.
Text = Content = Boss