Commits

Lars Yencken  committed 858d96b

Adds a basic tutorial for use

  • Parent commits a6ca392
  • Branches default

Files changed (1)

File Home.wiki

-== Welcome ==
-
-Welcome to your wiki! This is the default page we've installed for your convenience. Go ahead and edit it.
-
-=== Wiki features ===
-
-This wiki uses the [[http://www.wikicreole.org/|Creole]] syntax, and is fully compatible with the 1.0 specification.
-
-The wiki itself is actually a hg repository, which means you can clone it, edit it locally/offline, add images or any other file type, and push it back to us. It will be live immediately.
-
-Go ahead and try:
-
-{{{
-$ hg clone http://bitbucket.org/lars512/iterxmlwiki/
-}}}
-
-Wiki pages are normal files, with the .wiki extension. You can edit them locally, as well as creating new ones.
-
-=== Syntax highlighting ===
-
-You can also highlight snippets of text, we use the excellent [[http://www.pygments.org/|Pygments]] library.
-
-Here's an example of some Python code:
-
-{{{
-#!python
-
-def wiki_rocks(text):
-	formatter = lambda t: "funky"+t
-	return formatter(text)
-}}}
-
-You can check out the source of this page to see how that's done, and make sure to bookmark [[http://pygments.org/docs/lexers/|the vast library of Pygment lexers]], we accept the 'short name' or the 'mimetype' of anything in there.
-
-Have fun!
+== {{{iterxml}}} tutorial ==
+
+So you're equipped with Python and a huge XML file which you want to parse, a file too big to fit into memory. {{{iterxml}}} to the rescue!
+
+In most situations like this, the key is that the file is just a collection of smaller documents which are the parts you really care about. For example, I keep running into huge files like:
+
+{{{
+#!xml
+
+<docs>
+<doc>...</doc>
+<doc>...</doc>
+...
+</docs>
+}}}
+
+First, let's install {{{iterxml}}}:
+
+{{{
+$ sudo easy_install iterxml
+}}}
+
+Once that's done, we can start using it. For comparison, let's first look at what you might naturally do with the built-in Python libraries. For fun, let's assume the file is compressed with bz2 (large XML files compress well!).
+
+{{{
+#!python
+
+import bz2
+from xml.etree import cElementTree as ElementTree
+
+istream = bz2.BZ2File('my_huge_file.xml.bz2', 'r')
+root = ElementTree.parse(istream).getroot()
+for doc in root.findall('doc'):
+    print doc.find('title').text
+}}}
+
+This works fine for small files, but because it eagerly parses the entire XML file up front, its memory use grows with the size of the file, to the point where large files can't be parsed at all.
+
+Let's try {{{iterxml}}} for comparison:
+
+{{{
+#!python
+
+import iterxml
+
+for doc in iterxml.iterxml('my_huge_file.xml.bz2', 'doc'):
+    print doc.find('title').text
+}}}
+
+Shorter, and easier. Now we only use as much memory as each {{{<doc>...</doc>}}} element requires, so we're CPU-bound rather than memory-bound, and vastly larger files become practical to work with. Notice that {{{iterxml}}} also used the filename to detect the bz2 compression and transparently decompressed the file on the fly.
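+
+If you're curious what's going on under the hood, here's a minimal sketch of the kind of incremental parsing {{{iterxml}}} presumably does, using {{{cElementTree.iterparse()}}} and clearing each element once we're finished with it (the tag names are just the ones from our example):
+
+{{{
+#!python
+
+import bz2
+from xml.etree import cElementTree as ElementTree
+
+istream = bz2.BZ2File('my_huge_file.xml.bz2', 'r')
+for event, elem in ElementTree.iterparse(istream, events=('end',)):
+    if elem.tag == 'doc':
+        print elem.find('title').text
+        elem.clear()  # discard the finished document so memory stays bounded
+}}}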
+
+Aside: for truly large data sets, I hate XML, and suggest YAML instead. YAML has the concept of many documents per file built in, so you can iterate over documents without the fancy parsing hacks which {{{iterxml}}} has to resort to. Its data is also typed, which can save time and code when deserializing. Check out [[http://pyyaml.org/|pyyaml]], and be sure to compile it with libyaml bindings.
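+
+For instance, with pyyaml you can stream a multi-document file like this (the file name and {{{title}}} field are just hypothetical, mirroring the XML example above):
+
+{{{
+#!python
+
+import yaml
+
+for doc in yaml.safe_load_all(open('my_huge_file.yaml')):
+    print doc['title']
+}}}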