iterxml tutorial

So you're equipped with Python and a huge XML file which you want to parse, a file too big to fit into memory. iterxml to the rescue!

In most situations like this, the key is that the file is really just a collection of smaller documents, and those documents are the parts you actually care about. For example, I keep running into huge files like:
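That is, something of this general shape (a hypothetical sketch; the element names here match the code examples below):

```xml
<docs>
  <doc>
    <title>First document</title>
    <!-- ...more content... -->
  </doc>
  <doc>
    <title>Second document</title>
  </doc>
  <!-- ...millions more <doc> entries... -->
</docs>
```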


First, install iterxml:

$ sudo easy_install iterxml

Once that's done, we can start using it. For comparison, let's first see what you might naturally do with the Python standard library alone. For fun, let's assume the file is compressed with bz2 (large XML files compress well!) and that each doc has a <title> node whose contents we want to print.

import bz2
from xml.etree import cElementTree as ElementTree

# Decompress by hand, then parse the entire tree into memory at once.
istream = bz2.BZ2File('my_huge_file.xml.bz2', 'r')
root = ElementTree.parse(istream).getroot()
for doc in root.findall('doc'):
    print doc.find('title').text

This works fine for small files, but because it eagerly parses the whole document into a tree first, its memory use grows with the size of the file; beyond a point, large files simply cannot be parsed at all.
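For reference, the standard library can also parse incrementally via ElementTree's iterparse. The following sketch (hypothetical names, and roughly the kind of trick a library like iterxml has to resort to internally, though its actual implementation may differ) streams matching elements without ever building the whole tree:

```python
import bz2
from xml.etree import ElementTree

def iter_docs(path, tag):
    # Read compressed bytes lazily; iterparse consumes them as a stream.
    stream = bz2.BZ2File(path, 'r')
    context = ElementTree.iterparse(stream, events=('start', 'end'))
    event, root = next(context)  # the first event is the root's 'start'
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem
            # Drop already-yielded children so memory stays bounded.
            root.clear()
```

Each yielded element is an ordinary ElementTree node, so `elem.find('title').text` works as before; the `root.clear()` call is what keeps memory flat.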

Let's try iterxml for comparison:

import iterxml

for doc in iterxml.iterxml('my_huge_file.xml.bz2', 'doc'):
    print doc.find('title').text

Shorter and easier. Now we only use as much memory as each <doc>...</doc> element requires, which means you're CPU-bound rather than memory-bound, and vastly larger files become practical to work with. Notice that iterxml also used the filename to detect the bz2 compression and transparently decompressed the file on the fly.

Aside: for truly large data sets, I hate XML and suggest YAML instead. YAML has the concept of many documents per file built in, so you can iterate over documents without the fancy parsing hacks that iterxml has to resort to. Its data is also typed, which can save time and code when deserializing. Check out PyYAML, and be sure to compile it with the libyaml bindings.
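To illustrate (hypothetical document contents; requires the third-party PyYAML package):

```python
import yaml  # third-party: the PyYAML package

stream = "---\ntitle: First doc\n---\ntitle: Second doc\n"
# safe_load_all parses one document at a time rather than the whole
# stream, so iterating stays memory-bounded even for huge inputs.
for doc in yaml.safe_load_all(stream):
    print(doc['title'])
```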