Technologies: Python

Status: Complete, No longer maintained.

This is a helper program I wrote when writing my disassembler. At that time I had a full time job and I had to implement an x86 disassembler for the final project of my Software Analysis class, so I had to get creative. In the other repo you can read the story of how the disassembler was implemented, here I'll just put information about the opcodeDB.

What works

Takes an xml file from the awesome x86ref ( and generates a binary structure with all the information.

The objective

You can read the prelude to this in the disassembler repo. Suffice to say, I had found an awesome source of information of every x86 assembler instruction ever, this one: (I'll link as many times as needed, these guys deserve a lot of credit). But I had a problem, I needed to be able to parse the file in C, I had little time and the file is incredibly complex (even I had trouble reading it). So I figured: this file won't change often and parsing XML in C sounds incredibly time consuming, perhaps I could preprocess it and generate some C friendly notation?

So thats the idea: parse the file and generate an output that can be parsed easily in C.

More Information

At that time I was working developing tests for UEFI compatible BIOSes, so I was familiar with the way it stores data on the firmware, in particular, there is a module for storing UI elements in a "binary HTML" format. Instead of storing HTML using characters, it converted everything into opcodes (unrelated to machine code, they were similar to html tags) and then stored them as a binary blob. The parsing was done by using C structs and pointers.

Using that as inspiration I created a simple binary format to encode the XML file:

  • Consists a series of binary structures (Entry), each one containing one opcode and all data (mnemonic, opcode, parameters, size, etc). They are saved one next to the other as a binary blob.
  • There is a top level 'Index' structure, inspired on innoDB. It groups entries in chunks of 256 and allows sub-indexes.
  • Searching is incredibly fast: O(n) where n is the number of indexes.
  • It should be parseable by C using structs

The end design consisted of a file with single MAIN index, a series of sub indexes (for 2 and 3 byte opcodes) and a large contiguous blob of 'Entry' structures.

Indexes have 256 entries because there can only be up to 256 unique opcodes with a single byte. For 2 and 3 bytes opcodes sub-indexes can be used.

Roughly speaking, this is what happened:

  1. A MAIN index is created
  2. The XML is processed, for each opcode found:
  3. If it is a single byte opcode, create an Entry structure, save it into storage and save an offset in the corresponding position in index, based on the binary representation of the opcode. For example, if the opcode is 0x12 then the offset is saved in the position 0x12.
  4. if it is a multi-byte opcode, get the most significative byte and look at that position in the MAIN index, if its empty create a sub-index and save its offset to the position. In the new subindex, if the opcode is 2 byte, then create an entry as with 1 byte opcodes, but use the LSB as its position instead. For 3 byte opcodes, create another index using the 2nd byte and then create an entry in the new index using the LSB.

Parsing consisted on loading the entire file on memory (or using memory mapped files) and parsing the MAIN header using a simple struct pointer. Since every header contained an offset to the beginning of the entries searching was incredibly fast.

For example, assuming that we want to find the entry for opcode 0x12 0xdf (a 2 byte opcode):

  • Locate position 0x12 in MAIN using pointer arithmetic:

    offset = MAIN->entries+0x12

    object = MAIN->storage+offset

  • Since its 2 bytes, 'object' will be of type 'index', so repeat using the second byte 0xbf:

    offset = object->entries+0xbf

    opcode = object->storage+offset

  • Now 'opcode' contains the entry structure with all information

My participation

I did all the design and coding.

What is broken

Nothing that I am aware of.