# hachoir / ReverseEngineering

Warning: Before anything else, read the legal issues page.

# Reverse engineering tools in Hachoir

I wrote some tools in Hachoir 0.1 to detect patterns or properties of a stream:

• entropy(): compute the entropy of a stream;
• Functions to detect some common patterns;
• Functions to find constant bytes;
• Functions to try to find a field.

The code [[source:tags/old-hachoir/0.1/reverse.py|reverse.py]] only works with Hachoir version 0.1 (ask us for an update :-)).

# Entropy

Entropy is not an exact tool, but it can help. The entropy is the "quantity of information" in a stream. Examples:

• {0, 0, 0, 0}: entropy=0 bit => contains no information :-P
• {0, 0, 1, 1}: entropy=1 bit => 1 bit is enough to encode a value
• {0, 1, 2, 3}: entropy=2 bits
• (but the value can be a real number, for example entropy=2.34 bits)
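The examples above can be reproduced with a few lines of Python. This is a minimal sketch of Shannon entropy over byte values, not the `entropy()` function from reverse.py; the name and signature here are assumptions chosen for illustration.

```python
import math
from collections import Counter


def entropy(data: bytes) -> float:
    """Return the Shannon entropy of `data`, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # H = sum(p * log2(1 / p)) over the byte values that actually occur
    return sum((n / total) * math.log2(total / n) for n in counts.values())
```

With this definition the result for a byte stream always lies between 0 and 8 bits, which matches the range of the measured values listed on this page.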

Some values:

• A compiled Python script: 3.936 bits
• A Python script (plain text): 4.007 bits
• English text (GPLv2 license): 4.686 bits
• Linux executable program: 6.447 bits
• Compressed data in a PNG picture: 7.974 bits
• Compressed data in a JPEG picture: 7.975 bits
• Big RAR archive file: 7.998 bits (wow!)

# Patterns

I realized that we can find some patterns in files. The most common one is:

```
<header>
<chunk1>
   <data_size>
   <data>
<chunk2>
   <data_size>
   <data>
<chunk3>
   <data_size>
   <data>
<end>
```

It can be found in PNG pictures, for example:

```
Header (PNG signature): 8 bytes
Chunk1
   <data_size>
   <type>
   <data>
   <crc32>
Chunk2
   <data_size>
   <type>
   <data>
   <crc32>
...
ChunkN
   0
   <type "end">
   <crc32>
```

So my algorithm tries different header sizes, different chunk header sizes, and different data_size deltas (because a chunk may have a footer or additional header fields). The algorithm works well and is very fast!
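The search described above could look roughly like the following. This is a simplified sketch, not the code from reverse.py: the function names, the search limits, and the assumption that data_size is a 32-bit big-endian integer at the very start of each chunk are all hypothetical choices made for the example.

```python
import struct
from itertools import product


def walk_chunks(data, header_size, chunk_header_size, delta):
    """Follow the chunk chain; return the chunk count, or 0 if the layout does not fit."""
    pos = header_size
    count = 0
    while pos < len(data):
        if pos + 4 > len(data):
            return 0  # not enough bytes left to read a size field
        # Assumption: data_size is a 32-bit big-endian integer at the start of the chunk
        (data_size,) = struct.unpack_from(">I", data, pos)
        pos += chunk_header_size + data_size + delta
        count += 1
    # A layout only fits if the last chunk ends exactly at the end of the stream
    return count if pos == len(data) else 0


def find_layouts(data, max_header=16, max_chunk_header=12, max_delta=8, min_chunks=3):
    """Yield every (header_size, chunk_header_size, delta) that tiles the whole stream."""
    for header_size, chunk_header_size, delta in product(
        range(max_header + 1), range(4, max_chunk_header + 1), range(max_delta + 1)
    ):
        if walk_chunks(data, header_size, chunk_header_size, delta) >= min_chunks:
            yield header_size, chunk_header_size, delta
```

For a PNG-like stream, (8, 8, 4) is among the results: an 8-byte file header, an 8-byte chunk header (size and type), and a 4-byte footer (the CRC). Combinations with the same total, such as (8, 4, 8), match as well, because this sketch only checks where each chunk starts and ends.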

# Compare many streams (files)

When two or more files in the same file format are available, it is interesting to compare them. I wrote some functions to:

1. find constant bytes
2. find a field in the data (non-constant) bytes

Finding constant bytes is easy: just compare the bytes at each position i, input1[i] with input2[i], input1[i] with input3[i], etc.
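As an illustration, the comparison can be written as a short function. This is a sketch under the assumption that every input is a `bytes` object already loaded in memory; `find_constant_offsets` is a hypothetical name, not a function from reverse.py.

```python
def find_constant_offsets(inputs):
    """Return the offsets whose byte value is identical in every input."""
    first, others = inputs[0], inputs[1:]
    # Only offsets that exist in every input can be compared
    length = min(len(stream) for stream in inputs)
    return [
        i for i in range(length)
        if all(stream[i] == first[i] for stream in others)
    ]
```

The more inputs are compared, the fewer accidental matches remain, so the surviving offsets are more likely to be real constants such as a signature or a version number.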

To find a field, the algorithm is:

1. Find the positions of the value in the data bytes of the first input
2. Find the positions of the value in the data bytes of the second input (and remove false positives)
3. Find the positions of the value in the data bytes of the ...

The value can be an integer or, better, a range (value-delta ... value+delta).

The algorithm works well with delta=10. It can accept a different value for each input. Example: to find the file size in a gzip archive, tell it that for foo.gz, foo is 1300 bytes, and for bar.gz, bar is 3400 bytes.
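A minimal sketch of this search is shown below. It is not the code from reverse.py: the function names, the `fmt` parameter, and the choice of a 32-bit little-endian integer as the default field type are assumptions made for the example.

```python
import struct


def candidate_offsets(data, value, delta=10, fmt="<I"):
    """Offsets where `fmt` decodes to a number in [value - delta, value + delta]."""
    size = struct.calcsize(fmt)
    return {
        pos for pos in range(len(data) - size + 1)
        if abs(struct.unpack_from(fmt, data, pos)[0] - value) <= delta
    }


def find_field(inputs, delta=10, fmt="<I"):
    """`inputs` is a list of (data, expected_value) pairs; return the common offsets."""
    offsets = None
    for data, value in inputs:
        found = candidate_offsets(data, value, delta, fmt)
        # Intersecting with earlier results removes the false positives
        offsets = found if offsets is None else offsets & found
    return sorted(offsets or ())
```

A call would look like `find_field([(foo_gz, 1300), (bar_gz, 3400)])`, where `foo_gz` and `bar_gz` hold the raw file contents. Offsets are counted from the start of the stream here; a field stored at the end of a file, as the size field of a gzip archive is, would need offsets counted from the end instead.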