Victor Stinner / hachoir

hachoir / Features

Hachoir core features:

  • Autofix: on error, Hachoir can automatically work around bugs in a file or in a parser (more details below).
  • Lazy: field values, sizes, descriptions, absolute addresses, etc. are computed on demand (more details below).
  • No arbitrary limit: addresses can be bigger than 4 GB, an integer can be 32, 64, 128 bits or more, and there is no limit on the number of fields, on nesting depth, etc.
  • Types: Hachoir has many predefined field types: integer, string, boolean, byte array, ...
  • Bit granularity: sizes and addresses are expressed in bits, so it is easy to mix fields sized in bytes and in bits.
  • Unicode: string values are stored as Unicode (if the string charset is specified).
  • Endianness: you don't have to care about byte order: set it once for the whole parser.
  • No dependency: Hachoir only needs Python 2.4 (but some frontends need more libraries).
  • API: data is represented as a tree of fields, where each field is a Python object (more details below).
  • Parsers: 70 parsers are included: JPEG picture, ZIP archive, MP3 audio, ... (see the parser list).
  • Guess parser: an algorithm automatically chooses the right parser using the file extension, a given MIME type, or each parser's validate() function.
  • etc.

Nice API

A file is split into a tree of fields. A field set is itself a field, and each field is an object. Some interesting attributes:

  • size: size in bits
  • value: field value
  • description: string description
  • address: relative address (relative to its parent's absolute address)
  • absolute_address: address in the stream
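As a minimal sketch of how these attributes can relate (assumed names and classes, not Hachoir's real implementation): a field's relative address starts right after its previous sibling, and its absolute address is its parent's absolute address plus its own relative address, all in bits.

```python
# Sketch only: toy classes illustrating the attribute relationships above.
class SketchField:
    def __init__(self, parent, name, size, value=None, description=""):
        self.parent = parent          # enclosing field set (None for the root)
        self.name = name
        self.size = size              # size in bits
        self.value = value
        self.description = description
        # relative address: right after the previous sibling
        self.address = parent._next_address() if parent else 0

    @property
    def absolute_address(self):
        # address in the stream = parent's absolute address + own offset
        if self.parent is None:
            return self.address
        return self.parent.absolute_address + self.address

class SketchFieldSet(SketchField):
    def __init__(self, parent, name, size):
        super().__init__(parent, name, size)
        self.fields = []

    def _next_address(self):
        return sum(field.size for field in self.fields)

    def append(self, field):
        self.fields.append(field)
        return field

root = SketchFieldSet(None, "root", 96)
header = root.append(SketchFieldSet(root, "header", 64))
width = header.append(SketchField(header, "width", 32, value=640))
height = header.append(SketchField(header, "height", 32, value=480))
```

Here `height` sits 32 bits after the start of `header`, so its relative and absolute addresses are both 32.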

Lazy parser

Nearly everything in Hachoir is lazy: a piece of information is only read or computed when it is asked for, or when Hachoir needs it to read or compute another piece of information.

Examples:

  • The value and description of a field are read/created on first access.
  • The fields of a field set are created on demand (or eagerly, when Hachoir is unable to guess the field set's size).
  • etc.

Thanks to this laziness, Hachoir is able to read big files, like a 10 GB FAT32 partition, and create complex and deep field trees.

The "secret" is the Python keyword yield.
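A tiny sketch of the idea (assumed names, not Hachoir's actual code): when field creation is written as a generator, each field is only built when the consumer asks for the next one.

```python
created = []

def create_fields():
    # Generator: the body only runs as far as needed for each next() call.
    for name in ("width", "height", "data"):
        created.append(name)    # side effect marks when a field is built
        yield name

fields = create_fields()        # nothing has been parsed yet
assert created == []
first = next(fields)            # only now is the first field created
assert first == "width" and created == ["width"]
```

Until `next()` is called again, "height" and "data" are never built, which is what makes huge inputs cheap to open.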

Many ways to explore a field set

Hachoir lets you explore a field set in different ways.

Iterate on each field:

for field in fieldset:
  print "%s=%s" % (field.name, field.value)

Get a field using its path:

# Get a field
header = png["header"]
height = png["height"].value
width = png["/header/width"].value

# Get parents
assert png == header.parent == header[".."]
grandfather = field["../.."]
grandfather = field["..."]  # alternative syntax
assert field["../../.."] == field["...."]
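Such path lookups can be implemented by walking a parent-linked tree; here is a hypothetical sketch (the `Node` class and its details are assumptions, not Hachoir's code) where `"/a/b"` descends from the root, `".."` climbs one parent, and n+1 dots climb n parents.

```python
class Node:
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent
        self.children = {}
        if parent:
            parent.children[name] = self

    def root(self):
        return self if self.parent is None else self.parent.root()

    def __getitem__(self, path):
        # "..." is shorthand for "../..": n+1 dots climb n parents
        if set(path) == {"."}:
            node = self
            for _ in range(len(path) - 1):
                node = node.parent
            return node
        # absolute paths start from the root, relative ones from self
        node = self.root() if path.startswith("/") else self
        for part in path.strip("/").split("/"):
            node = node.parent if part == ".." else node.children[part]
        return node

png = Node("png")
header = Node("header", png)
width = Node("width", header)
assert png["/header/width"] is width
assert width[".."] is header and width["..."] is png
```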

Autofix

Hachoir is able to fix (some) parser errors: errors from invalid/truncated input files or from parser code.

  • If a field is bigger than it should be, it's truncated or removed ;
  • On parser error, the field is removed ;
  • When parsing is done, the field set is filled with padding if needed.

Other autofixes:

  • GenericString tries to fix truncated UTF-16-LE string (add null byte at the end)
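The UTF-16-LE repair can be sketched as follows (a toy function, not GenericString itself): a truncated string has an odd number of bytes, so appending a null byte restores a whole 2-byte code unit and the data decodes again.

```python
def fix_utf16le(data: bytes) -> bytes:
    # Sketch of the autofix: complete a half code unit with a null byte.
    if len(data) % 2:          # truncated in the middle of a code unit
        data += b"\x00"
    return data

truncated = "abc".encode("utf-16-le")[:-1]   # drop the last byte
assert fix_utf16le(truncated).decode("utf-16-le") == "abc"
```

For ASCII text the fix even recovers the last character, since its high byte in UTF-16-LE is 0x00 anyway.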

Too big

class Header(FieldSet):
   def createFields(self):
      yield UInt32(self, "width")
      yield UInt32(self, "height")
      yield UInt32(self, "flags")
      yield UInt32(self, "extra_header")

class Parser(Parser):
   def createFields(self):
      ...
      yield Header(self, "header", size=12*8)
      yield UInt32(self, "important_value")
      ...

Hachoir will not create the /header/extra_header integer (parsing stops after /header/flags), but /important_value is still available.
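The "too big" fix can be sketched like this (assumed logic, not Hachoir's code): stop taking fields from the generator once the declared field-set size is reached, dropping any field that would overflow it.

```python
def take_fields(fields, max_size_bits):
    # Sketch: yield fields until the next one would overflow the field set.
    used = 0
    for name, size in fields:
        if used + size > max_size_bits:
            break                      # field would not fit: drop it
        used += size
        yield name, size

header = [("width", 32), ("height", 32), ("flags", 32), ("extra_header", 32)]
kept = list(take_fields(header, 12 * 8))   # field set declared as 12 bytes
assert [name for name, _ in kept] == ["width", "height", "flags"]
```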

Too small

class Header(FieldSet):
   def createFields(self):
      yield UInt32(self, "width")
      yield UInt32(self, "height")

class Parser(Parser):
   def createFields(self):
      ...
      yield Header(self, "header", size=12*8)
      yield UInt32(self, "important_value")
      ...

Hachoir will add a 4-byte padding field /header/raw[0], and /important_value is still available.
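The "too small" fix amounts to measuring the gap between the parsed fields and the declared size, and exposing it as a raw padding field; a sketch (names like `raw[0]` follow the example above, the rest is assumed):

```python
def pad_field_set(fields, declared_size_bits):
    # Sketch: fill the unparsed tail of the field set with a raw field.
    used = sum(size for _, size in fields)
    if used < declared_size_bits:
        fields.append(("raw[0]", declared_size_bits - used))
    return fields

header = [("width", 32), ("height", 32)]     # only 8 of 12 bytes parsed
padded = pad_field_set(header, 12 * 8)
assert padded[-1] == ("raw[0]", 32)          # 4 bytes of padding
```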

Catch error

class Image(FieldSet):
   def createFields(self):
      yield UInt32(self, "width")
      if not(10 <= self["width"].value <= 1000):
      raise ParserError("Invalid width")
      yield UInt32(self, "height")
      if not(10 <= self["height"].value <= 1000):
         raise ParserError("Invalid height")
      yield RawBytes(self, "data", self["width"].value * self["height"].value)

class Parser(Parser):
   def createFields(self):
      ...
      yield Image(self, "image")
      yield UInt32(self, "important_value")
      ...

If width is invalid, Hachoir stops parsing just after /image/width: you can't read the image data, but you still get the width. If height is invalid, Hachoir stops after /image/height: you get both width and height.

This example is trivial but it shows that even on error, you're able to read the beginning of a file. Errors can be:

  • Stream error: end of stream, disk error, ...
  • Field error: trying to create an empty string, an invalid value, ...
  • Parser error: a buggy parser, a ParserError raised by the parser, ...

In this example, it's not possible to read /important_value. To be able to read it, you have to know the size of /image in advance and specify it as follows:

class Parser(Parser):
   def createFields(self):
      ...
      yield UInt32(self, "imagelen", "Image size in bytes")
      ...
      yield Image(self, "image", size=self["imagelen"].value*8)
      yield UInt32(self, "important_value")
      ...
