So you're equipped with Python and a huge XML file which you want to parse, a file too big to fit into memory.
iterxml to the rescue!
In most situations like this, the key is that the file is just a collection of smaller documents which are the parts you really care about. For example, I keep running into huge files like:
<docs> <doc>...</doc> <doc>...</doc> ... </docs>
First, install it:
$ sudo easy_install iterxml
Once that's done, we can start using it. For comparison, let's first see what you might naturally do with the built-in Python libraries. For fun, let's assume the file is compressed with bz2 (large XML files compress well!), and that each doc has a
<title> node whose contents we want to print.
import bz2
from xml.etree import cElementTree as ElementTree

# parse() eagerly reads the whole file into memory before we iterate
istream = bz2.BZ2File('my_huge_file.xml.bz2', 'r')
root = ElementTree.parse(istream).getroot()
for doc in root.findall('doc'):
    print(doc.find('title').text)
This works fine for small files, but because it eagerly parses the entire XML file up front, its memory use grows with the size of the file; for truly large files it will exhaust memory before you can do anything useful with them.
iterxml for comparison:
import iterxml

for doc in iterxml.iterxml('my_huge_file.xml.bz2', 'doc'):
    print(doc.find('title').text)
Shorter, and easier. Now we only use as much memory as each
<doc>...</doc> requires, which means you're CPU-bound rather than memory-bound, and vastly larger files become practical to work with. Notice that iterxml also used the filename to detect the bz2 compression and transparently decompressed the file on the fly.
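If you'd rather stay in the standard library, the same streaming idea can be sketched with ElementTree's iterparse, which fires events as elements finish parsing instead of building the whole tree first. The function and file names here are my own illustration, not part of iterxml:

```python
import bz2
from xml.etree import ElementTree

def iter_docs(path, tag):
    """Yield each <tag> element as it is fully parsed, then free it."""
    with bz2.BZ2File(path, 'r') as istream:
        # 'end' events fire once an element and all its children are parsed
        for event, elem in ElementTree.iterparse(istream, events=('end',)):
            if elem.tag == tag:
                yield elem
                elem.clear()  # drop the element's children once consumed
```

One caveat: cleared elements still hang off the accumulating root element, so a fully memory-bounded version also has to prune the root as it goes; that's the kind of bookkeeping iterxml takes care of for you.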
Aside: for truly large data sets, I hate XML, and suggest YAML instead. YAML has the concept of many-documents-per-file built-in, so you can iterate over documents without the fancy parsing hacks which
iterxml has to resort to. Its data is also typed, which can save time and code when deserializing. Check out PyYAML, and be sure to compile it with the libyaml bindings.
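For instance, a single YAML stream can hold many documents separated by --- lines, and each scalar carries a type (this is just an illustrative fragment, not a real data set):

```yaml
---
title: First document
count: 1          # parsed as an integer, not a string
---
title: Second document
published: true   # parsed as a boolean
```

PyYAML's yaml.safe_load_all() returns a lazy iterator over the documents in a stream like this, so memory use stays bounded by the largest single document rather than the whole file.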