hachoir / ReverseEngineering

Warning: First of all, read the legal issues page.

Reverse engineering tools in Hachoir

I wrote some tools in Hachoir 0.1 to detect patterns or properties of a stream:

  • entropy(): compute the entropy of a stream;
  • Functions to detect some common patterns;
  • Functions to find constant bytes;
  • Functions to try to find a field.

The code [[source:tags/old-hachoir/0.1/|]] only works with Hachoir version 0.1 (ask us for an update :-)).


Entropy is not an exact tool, but it can help. It measures the "quantity of information" in a stream, in bits per value. Examples:

  • {0, 0, 0, 0}: entropy=0 bits => the stream contains no information :-P
  • {0, 0, 1, 1}: entropy=1 bit => 1 bit is enough to encode each value
  • {0, 1, 2, 3}: entropy=2 bits
  • The entropy can also be a real number, for example entropy=2.34 bits
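The examples above can be checked with a minimal sketch of Shannon entropy. This is not the Hachoir 0.1 `entropy()` code; the name `shannon_entropy` is hypothetical and the function simply applies the textbook formula to any sequence of hashable values (including a `bytes` object).

```python
import math
from collections import Counter


def shannon_entropy(data):
    """Return the Shannon entropy of `data`, in bits per value.

    `data` is any sequence of hashable values, for example a list of
    integers or a bytes object (hypothetical helper, not Hachoir's API).
    """
    total = len(data)
    if total == 0:
        return 0.0
    counts = Counter(data)
    # Sum p * log2(1/p) over every distinct value, with p = n / total.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


print(shannon_entropy([0, 0, 0, 0]))  # a single value: no information
print(shannon_entropy([0, 0, 1, 1]))  # two equally likely values
print(shannon_entropy([0, 1, 2, 3]))  # four equally likely values
```

For a stream of bytes the result is at most 8 bits, reached when all 256 byte values are equally frequent.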

Some values:

  • A compiled Python script: 3.936 bits
  • A Python script (plain text): 4.007 bits
  • English text (GPLv2 license): 4.686 bits
  • Linux executable program: 6.447 bits
  • Compressed data in a PNG picture: 7.974 bits
  • Compressed data in a JPEG picture: 7.975 bits
  • Big RAR archive file: 7.998 bits (wow!)


I realized that we can find some patterns in files. The most common one is:


It can be found in PNG pictures, for example:

Header (PNG signature): 8 bytes
  <type "end">

So my algorithm tries different header sizes, different chunk header sizes, and different data_size deltas (because a chunk may have a footer or another header). The algorithm works well and is very fast!
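The brute-force search described above could look like the following sketch. It is not the original Hachoir 0.1 code. It assumes that each chunk header starts with a 32-bit big-endian data size (as in PNG), and it accepts a layout when walking the chunks ends exactly at the end of the stream. The name `find_chunk_layout` and its limit parameters are hypothetical.

```python
import struct


def find_chunk_layout(data, max_header=16, max_chunk_header=12,
                      max_delta=8, min_chunks=2):
    """Search for (header_size, chunk_header_size, delta).

    Assumption: every chunk header begins with a 32-bit big-endian
    data size. A candidate is accepted when following the chunks one
    after another lands exactly on the end of `data`.
    """
    for header_size in range(max_header + 1):
        for chunk_header_size in range(4, max_chunk_header + 1):
            for delta in range(max_delta + 1):
                pos = header_size
                chunks = 0
                while pos + chunk_header_size <= len(data):
                    (size,) = struct.unpack_from(">I", data, pos)
                    # Skip the chunk header, its data, and any footer.
                    pos += chunk_header_size + size + delta
                    chunks += 1
                if pos == len(data) and chunks >= min_chunks:
                    return header_size, chunk_header_size, delta
    return None
```

Note that this walk only constrains the sum `chunk_header_size + delta`: a 4-byte header with 8 extra bytes is indistinguishable from an 8-byte header with a 4-byte footer. Telling them apart needs another heuristic, which this sketch does not attempt.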

Compare many streams (files)

When it's possible to have two or more files in the same file format, it's interesting to compare them. I wrote some functions to:

  1. find constant bytes
  2. find a field in the data (non-constant) bytes

Finding constant bytes is easy: just compare bytes, input1[i] with input2[i], input1[i] with input3[i], etc.
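As an illustration, the byte-by-byte comparison can be sketched like this. The function name `find_constant_bytes` is hypothetical, and the comparison stops at the length of the shortest input, which is an assumption of this sketch.

```python
def find_constant_bytes(inputs):
    """Return the offsets where every input holds the same byte.

    `inputs` is a list of bytes objects. Only offsets present in all
    inputs are compared (hypothetical helper, not Hachoir's API).
    """
    first = inputs[0]
    length = min(len(stream) for stream in inputs)
    return [
        i for i in range(length)
        if all(stream[i] == first[i] for stream in inputs[1:])
    ]


samples = [b"GIF89a\x01\x02", b"GIF87a\x05\x02", b"GIF89a\x09\x02"]
print(find_constant_bytes(samples))  # offsets shared by all three samples
```

The more inputs are compared, the fewer offsets are constant only by coincidence.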

To find a field, the algorithm is:

  1. Find the positions of the value in the data bytes of the first input
  2. Find the positions of the value in the data bytes of the second input (and remove false positives)
  3. Find the positions of the value in the data bytes of the ...

The value can be an integer or, better, a range (value-delta ... value+delta).

The algorithm works well with delta=10. It accepts a different value for each input. Example: to find the file size in a gzip archive, tell it that foo is 1300 bytes for foo.gz and that bar is 3400 bytes for bar.gz.
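A minimal sketch of this search follows, under stated assumptions: the field is a 32-bit little-endian unsigned integer, and positions are offsets from the start of each input. The names `candidate_offsets` and `find_field` and the `fmt` parameter are hypothetical, not the Hachoir 0.1 functions.

```python
import struct


def candidate_offsets(data, value, delta=10, fmt="<I"):
    """Return every offset whose decoded integer lies within
    value-delta ... value+delta (fmt is a struct format, assumed
    here to be a 32-bit little-endian unsigned integer)."""
    size = struct.calcsize(fmt)
    offsets = set()
    for pos in range(len(data) - size + 1):
        (found,) = struct.unpack_from(fmt, data, pos)
        if value - delta <= found <= value + delta:
            offsets.add(pos)
    return offsets


def find_field(inputs, values, delta=10, fmt="<I"):
    """Keep only the offsets that match in every input.

    `values` holds one expected value per input. Intersecting the
    candidate sets removes the false positives of each single input.
    """
    common = None
    for data, value in zip(inputs, values):
        offsets = candidate_offsets(data, value, delta, fmt)
        common = offsets if common is None else common & offsets
    return sorted(common or [])
```

Because offsets are counted from the start, a field stored in a trailer (at a fixed distance from the end of files of different lengths) would need the offsets to be measured from the end instead.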

