Kenneth Love committed 54d859b

first chunk of scraping talk

docs/scraping.rst

 Presented by Katharine Jarmul
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
+I got 99 problems but content ain't one
+---------------------------------------
 
+* Everyone needs good content.
+* Good content exists all over the web.
+* Scrape it 'til you make it.
+
+LXML: Diving in
+---------------
+
+``lxml.etree`` vs. ``lxml.html``
+________________________________
+
+* ``etree``: best for properly formatted XML/XHTML
+* ``etree``: powerful and fast for SOAP or other XML-formatted content
+* ``html``: best for web sites & irregular content
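
A minimal sketch of the difference, assuming ``lxml`` is installed. The two sample strings are made up for illustration: the strict ``etree`` parser handles the well-formed one and rejects the tag soup, while ``lxml.html`` repairs the tag soup instead of raising.

```python
from lxml import etree, html

# Hypothetical inputs: one well-formed XML string, one piece of tag soup.
well_formed = "<root><item>one</item><item>two</item></root>"
messy = "<div><p>unclosed paragraph<br><b>bold text</div>"

# etree parses well-formed XML strictly.
root = etree.fromstring(well_formed)
items = [item.text for item in root.findall("item")]

# The same strict parser refuses the irregular markup.
try:
    etree.fromstring(messy)
    strict_ok = True
except etree.XMLSyntaxError:
    strict_ok = False

# lxml.html closes the dangling tags for us and returns a usable tree.
fragment = html.fromstring(messy)
bold = fragment.find(".//b").text

print(items, strict_ok, bold)
```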
+
+    ``lxml.html``: hidden gems
+    __________________________
+
+    ``cssselect``
+        uses CSS selector syntax to find and select HTML elements.
+    ``iterlinks``
+        creates a generator of all **linky** elements on the page.
+        Remember: ads have lots of links.
+    ``sourceline``
+        can identify the location of your element on the page.
+        Exists in both ``lxml.html`` and ``lxml.etree``.
+    ``find``, ``findall``
+        can locate html elements within another node or a page.
+        Exists in both ``lxml.html`` and ``lxml.etree``.
+    descendants/children/siblings/ancestors
+        all elements have ``iterchildren``, ``itersiblings``, ``iterancestors`` and ``iterdescendants``.
+    forms
+        can find all (normal) forms on a page.
+        beware of CAPTCHAs and the like.
+    ``text``, ``text_content``, and ``itertext``
+        ways to get content without tags.
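
Several of these can be tried on a small page. The sketch below assumes ``lxml`` is installed and uses a made-up ``PAGE`` string. It leaves out ``cssselect``, which needs the separate ``cssselect`` package, and it iterates over text with the ``itertext`` method.

```python
from lxml import html

# Hypothetical page used only for this illustration.
PAGE = """<html><body>
<div id="main">
  <h1>Title</h1>
  <p>Some <b>bold</b> text with a <a href="/about">link</a>.</p>
  <img src="/logo.png">
</div>
<form action="/search"><input name="q"></form>
</body></html>"""

doc = html.fromstring(PAGE)

# iterlinks yields (element, attribute, link, position) for link-like attributes.
links = [(el.tag, attr, url) for el, attr, url, pos in doc.iterlinks()]

# find returns the first match, findall returns every match.
paragraph = doc.find(".//p")
anchors = doc.findall(".//a")

# Three ways to get at text without tags.
leading_text = paragraph.text            # text before the first child only
plain_text = paragraph.text_content()    # all text, tags stripped
pieces = list(paragraph.itertext())      # text chunks in document order

# Tree navigation from any element.
ancestors = [el.tag for el in anchors[0].iterancestors()]
children = [el.tag for el in doc.find(".//div").iterchildren()]

# sourceline reports where the element started in the input.
title_line = doc.find(".//h1").sourceline

# forms lists the form elements found on the page.
form_action = doc.forms[0].action

print(links, plain_text, ancestors, children, title_line, form_action)
```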
+
+If you have to parse in realtime, LXML is sometimes too much.
+
+``re``
+    html == strings == parseable.
+``feedparser``
+    standard XML has rules, feedparser knows them.
+``HTMLParser``
+    good base class for your own HTML parser.
+    good for "I have an idea about how I want to handle ``embed`` tags".
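
As a rough illustration of the lighter-weight options, here is a sketch that uses only the standard library (Python 3, where the parser base class lives in ``html.parser``). The ``EmbedCollector`` class and the ``SAMPLE`` string are hypothetical names invented for this example, not something from the talk.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup for the illustration.
SAMPLE = (
    "<title>Demo page</title>"
    '<p>Intro</p><embed src="movie.swf" width="400">'
    '<p>Outro</p><EMBED SRC="clip.mov">'
)

# re: HTML is just a string, so a quick pattern is often enough.
title = re.search(r"<title>(.*?)</title>", SAMPLE, re.IGNORECASE | re.DOTALL).group(1)


class EmbedCollector(HTMLParser):
    """Collects the src attribute of every embed tag it sees."""

    def __init__(self):
        super().__init__()
        self.embeds = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "embed":
            self.embeds.append(dict(attrs).get("src"))


collector = EmbedCollector()
collector.feed(SAMPLE)
collector.close()

print(title, collector.embeds)
```

Subclassing keeps the custom handling in one place: only the callbacks that matter for ``embed`` tags are overridden, and everything else falls through to the base class.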
+
+Content is 1/2 of the equation.
+
+I'm tired of ugly pages with badass content.
+
+.. note:: Text = Content = Boss