Kenneth Love  committed 54d859b

first chunk of scraping talk

  • Parent commits 9278ae4
  • Branches default

File docs/scraping.rst

 Presented by Katharine Jarmul
+I got 99 problems but content ain't one
+* Everyone needs good content.
+* Good content exists all over the web.
+* Scrape it 'til you make it.
+LXML: Diving in
+``lxml.etree`` vs. ``lxml.html``
+* ``etree``: best for properly formatted xml/xhtml
+* ``etree``: powerful and fast for SOAP or other xml-formatted content
+* ``html``: best for web sites & irregular content
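A minimal sketch of that split, assuming ``lxml`` is installed. The markup strings and variable names are made up for illustration; the point is only that the strict XML parser rejects tag soup while ``lxml.html`` recovers from it.

```python
from lxml import etree, html

# Well-formed XML: etree parses it strictly and quickly.
xml_doc = etree.fromstring(
    "<feed><item id='1'>first</item><item id='2'>second</item></feed>"
)
item_ids = [item.get("id") for item in xml_doc.findall("item")]

# Tag soup (hypothetical sample): <p> and <b> are never closed.
messy = "<div><p>Hello <b>world</div>"

# The strict XML parser refuses the mismatched tags.
try:
    etree.fromstring(messy)
    strict_ok = True
except etree.XMLSyntaxError:
    strict_ok = False

# lxml.html repairs the same markup and still returns a usable tree.
fragment = html.fromstring(messy)
recovered_text = fragment.text_content()

print(item_ids, strict_ok, recovered_text)
```

If the input is something you control (SOAP responses, feeds), the strictness is a feature; for pages found in the wild, the forgiving parser is usually the safer starting point.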
+    ``lxml.html``: hidden gems
+    __________________________
+    ``cssselect``
+        uses CSS selector syntax to find and select html elements.
+    ``iterlinks``
+        creates a generator of all **linky** elements on the page.
+        Remember: ads have lots of links.
+    ``sourceline``
+        can identify the line number of your element in the page source.
+        Exists in both ``lxml.html`` and ``lxml.etree``.
+    ``find``, ``findall``
+        can locate html elements within another node or a page.
+        Exists in both ``lxml.html`` and ``lxml.etree``.
+    descendants/children/siblings/ancestors
+        all elements have ``iterchildren``, ``itersiblings``, ``iterancestors`` and ``iterdescendants``.
+    forms
+        can find all (normal) forms on a page.
+        beware of CAPTCHAs and the like.
+    ``text``, ``text_content``, and ``itertext``
+        ways to get content without tags.
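Several of the gems above can be tried on one small page. This is a rough sketch, assuming ``lxml`` is installed; the sample markup and variable names are invented for illustration. ``cssselect`` is left out because recent lxml releases need the separate ``cssselect`` package for it.

```python
from lxml import html

# Hypothetical sample page, small enough to read at a glance.
page = html.fromstring("""<html>
<body>
<div id="main">
<h1>Title</h1>
<p>Some <a href="/about">linked</a> text.</p>
<img src="/logo.png">
</div>
<form action="/search"><input name="q"></form>
</body>
</html>""")

# iterlinks: yields (element, attribute, link, pos) for every linky attribute.
links = [(el.tag, attr, link) for el, attr, link, pos in page.iterlinks()]

# find / findall: locate elements below another node.
heading = page.find(".//h1")
paragraphs = page.findall(".//p")
para = paragraphs[0]

# sourceline: line number of the element in the parsed source.
heading_line = heading.sourceline

# Walking the tree: ancestors of the <a>, children of the <div>.
anchor = page.find(".//a")
ancestors = [el.tag for el in anchor.iterancestors()]
children = [el.tag for el in page.find(".//div").iterchildren()]

# forms: every <form> on the page.
form_actions = [form.action for form in page.forms]

# Three ways to get content without tags.
leading_text = para.text              # text before the first child only
full_text = para.text_content()       # all text, flattened into one string
text_pieces = list(para.itertext())   # all text, one chunk at a time

print(links)
print(heading_line, ancestors, children, form_actions)
print(repr(leading_text), repr(full_text), text_pieces)
```

Note the difference between ``text`` and ``text_content``: ``text`` stops at the first child tag, which is a common source of "where did my content go" surprises.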
+If you have to parse in realtime, LXML is sometimes too much.
+    html == strings == parseable.
+    standard XML has rules, feedparser knows them.
+    good base class for your own HTML parser.
+    good for "I have an idea about how I want to handle ``embed`` tags".
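The notes do not name the base class, so the sketch below assumes Python's standard-library ``html.parser.HTMLParser`` as one possible choice. ``EmbedCollector`` and the sample markup are hypothetical; the idea is just "I have an idea about how I want to handle ``embed`` tags", without building a full tree.

```python
from html.parser import HTMLParser


class EmbedCollector(HTMLParser):
    """Collects the src of every <embed> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.embeds = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "embed":
            self.embeds.append(dict(attrs).get("src"))


# Hypothetical sample markup; html is just a string, so feed it in.
collector = EmbedCollector()
collector.feed(
    '<p>Watch: <embed src="https://example.com/clip.swf" '
    'type="application/x-shockwave-flash"></p>'
)
embed_sources = collector.embeds

print(embed_sources)
```

Because the parser is event-driven and never builds a tree, it can be a lighter fit than LXML when only one or two tags matter.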
+Content is 1/2 of the equation.
+I'm tired of ugly pages with badass content.
+.. note:: Text = Content = Boss