README

Steps to get it up and running:

Writing an emitter

There's a special section on this: Recompilation

What is this repository for?

  • How does it work
  • Components

How does it work

If you have a 'Main' method in your program, it will usually call another method in another type. Let's assume for a moment that we can follow these calls, jumping from method to method and from type to type. We know whether a type is a base type or a derived type, and we know whether a call is virtual.

This means that the complete set of methods and types that are 'hit' by your program can be analyzed. And that means we can take all that code and compile it to another language, for example C++, Objective-C, JavaScript, Flash or Java. That would be cool!
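The idea of collecting every method that is 'hit' can be sketched as a worklist traversal over a call graph. This is only an illustration in Python, not the NativeCompiler code: the `CALLS` table, the method names and `reachable_methods` are made up for the example, and a virtual call is modelled by simply listing every override that could be the target.

```python
from collections import deque

# Hypothetical call graph: method name -> methods it calls directly.
# A virtual call lists every override that might be invoked.
CALLS = {
    "Program.Main": ["Greeter.Greet", "Console.WriteLine"],
    "Greeter.Greet": ["Greeter.Format", "LoudGreeter.Format"],  # virtual call
    "Greeter.Format": [],
    "LoudGreeter.Format": ["String.ToUpper"],
    "String.ToUpper": [],
    "Console.WriteLine": [],
    "Unused.Helper": ["Console.WriteLine"],  # never reached from Main
}


def reachable_methods(calls, entry):
    """Return the set of methods reachable from `entry` (breadth-first)."""
    seen = {entry}
    work = deque([entry])
    while work:
        method = work.popleft()
        for callee in calls.get(method, ()):
            if callee not in seen:
                seen.add(callee)
                work.append(callee)
    return seen


hit = reachable_methods(CALLS, "Program.Main")
print(sorted(hit))
```

In this sketch 'Unused.Helper' is not part of the result, so it would never have to be compiled to the target language.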

Static code analysis / decompiler

  • The project NativeCompiler contains code for analyzing a complete .NET project.
  • It also decompiles all methods, resolves virtual calls and branches, and does basic flow analysis.
  • It finds blocks, which are the basic building blocks of an OO language.

A block is basically a sequence of IL instructions with no "start points". Start points are mostly try-catch-finally blocks. It is guaranteed that all instructions within a block have a fixed number of elements (with the same types) on the stack and that the stack depth is the same at the start and at the end of the block.

This is important, because we have to modify the elements on the stack in certain scenarios. For example, if a 'finally' block handles a 'pop', this would mean the 'try' block always has an element on the stack before it's called. The way to simplify this is to pop the items off the stack into local variables before entering the 'try' block (they are reloaded when the 'try' is hit). Effectively this means that we can make sure there are always 0 or 1 (Exception object) items on the stack at the beginning of a block, which simplifies the decompilation process greatly.

In other words:

  • A block starts with stack depth X and ends with the same stack depth (regardless of how it ends, e.g. with a branch, jump or throw).
  • A block contains instructions. Instructions are stack operators, which is basically .NET IL code.
  • The NativeCompiler then attempts to convert stack operators into statements, operators and operands. It will emit fields if necessary.
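As a rough illustration of turning stack operators into statements, the sketch below keeps a symbolic stack of expression strings and emits a statement whenever a value is stored. It is written in Python for brevity and is not the project's implementation; the opcode names are simplified stand-ins for IL instructions and `decompile_block` is a hypothetical helper.

```python
def decompile_block(instructions):
    """Turn (opcode, operand) stack operations into statement strings."""
    stack = []        # symbolic stack: holds expressions, not values
    statements = []
    for op, arg in instructions:
        if op == "ldc":        # push a constant
            stack.append(str(arg))
        elif op == "ldloc":    # push a local variable
            stack.append(f"var_{arg}")
        elif op == "add":      # pop two operands, push the combined expression
            right = stack.pop()
            left = stack.pop()
            stack.append(f"({left} + {right})")
        elif op == "stloc":    # pop an expression and emit an assignment
            statements.append(f"var_{arg} = {stack.pop()};")
        else:
            raise ValueError(f"unsupported opcode: {op}")
    if stack:
        # The block started with depth 0, so it has to end with depth 0.
        raise ValueError("stack depth differs between start and end of block")
    return statements


print(decompile_block([("ldc", 1), ("ldloc", 1), ("add", None), ("stloc", 1)]))
```

The final check mirrors the rule that a block ends with the same stack depth it started with (0 in this sketch).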

(Unconditional) branches need special treatment, since the stack after these branches is unknown at first. Branches between blocks are especially hard to solve (but DO happen... and often). To handle this, a bit of pre-processing is required that splits blocks into multiple sub-blocks. These sub-blocks don't follow the above stack criterion that the start and end depth must be the same.

Another nasty case is where elements are 'dupped' and used in multiple consecutive statements. For example: 'b=2; a=2;' can be written as 'a=b=2;' using a 'dup' instruction. Pre-processing of the instructions ensures we don't encounter this.
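One possible way to pre-process 'dup' is to store the duplicated value in a fresh temporary and load it twice. Whether NativeCompiler does exactly this is an assumption; the Python sketch below, with its simplified opcodes and the hypothetical `remove_dups` helper, only illustrates the idea.

```python
def remove_dups(instructions):
    """Rewrite every 'dup' as a store to a fresh temporary plus two loads."""
    result = []
    temp_count = 0
    for op, arg in instructions:
        if op == "dup":
            temp = f"tmp_{temp_count}"   # hypothetical temporary name
            temp_count += 1
            result.append(("stloc", temp))
            result.append(("ldloc", temp))
            result.append(("ldloc", temp))
        else:
            result.append((op, arg))
    return result


# 'a=b=2;' expressed as simplified stack operations
chained = [("ldc", 2), ("dup", None), ("stloc", "b"), ("stloc", "a")]
for instruction in remove_dups(chained):
    print(instruction)
```

After the rewrite every value on the stack is consumed by exactly one instruction, so each store can be turned into its own statement.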

The result of all this work is a piece of code that can best be described as 'statements':

Start block 0000
0002  |var_1 = 0;
0003  |goto label0022;

Start block 0005
000c  |void Console.WriteLine(@"Foo!")
0013  |Leave() 001c

Start block 001c
0021  |var_1 = 1 + var_1;

Start block 0022
0027  |var_2 = 10 < var_1;
0029  |if (var_2) { goto label0005; }

Start block 002b
002b  |return;

Note that this is a very basic representation. Constructs like 'for', 'while', 'do', and even 'else' don't exist at this level; they are expressed with 'if' and 'goto'. The project contains a flow analysis component that does the transformations for 'for', 'if' and 'while'.
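A first step towards recovering loops from 'if' and 'goto' is spotting jumps that go backwards. The Python sketch below applies that simple heuristic to the blocks of the listing above. It is not the project's flow analysis (a real implementation would more likely use dominators), and the `BLOCKS` table and `find_loops` are hypothetical.

```python
# (block start, kind of exit, jump target) for each block in the listing above
BLOCKS = [
    (0x0000, "goto", 0x0022),
    (0x0005, "leave", 0x001C),
    (0x001C, "fallthrough", 0x0022),
    (0x0022, "if", 0x0005),
    (0x002B, "return", None),
]


def find_loops(blocks):
    """Return (loop head, loop tail) for every jump that goes backwards."""
    loops = []
    for start, _kind, target in blocks:
        if target is not None and target <= start:
            loops.append((target, start))
    return loops


for head, tail in find_loops(BLOCKS):
    print(f"loop from {head:04x} to {tail:04x}")
```

For the listing this reports a single loop spanning blocks 0005 to 0022, whose condition sits in block 0022. That is the shape a 'while' or 'for' loop typically compiles to.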

This is basically the input for the cross compiler.

Cross compiler

Each language emitter then accepts a set of types, each with a set of methods, each consisting of a number of blocks, and emits the code in the target language.
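The shape of that input (types, methods, blocks) and a toy emitter can be sketched as follows. The class names `TypeDef`, `Method` and `Block` and the function `emit_type` are invented for this example and are not the project's API; the output is only C++-like text.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    label: str
    statements: list


@dataclass
class Method:
    name: str
    blocks: list


@dataclass
class TypeDef:
    name: str
    base: str
    methods: list = field(default_factory=list)


def emit_type(type_def):
    """Emit one type as C++-like text: each block becomes a label plus statements."""
    lines = [f"class {type_def.name} : public {type_def.base} {{", "public:"]
    for method in type_def.methods:
        lines.append(f"    void {method.name}() {{")
        for block in method.blocks:
            lines.append(f"    {block.label}:")
            for statement in block.statements:
                lines.append(f"        {statement}")
        lines.append("    }")
    lines.append("};")
    return "\n".join(lines)


demo = TypeDef("Program", "Object", [
    Method("Main", [
        Block("label0000", ["var_1 = 0;", "goto label0022;"]),
        Block("label0022", ["return;"]),
    ]),
])
print(emit_type(demo))
```

An emitter for another target language would walk the same structure and only change the text it produces.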

The main requirement for a target language is that it follows (more or less) the same rules for access, type system and inheritance. As for the instruction set, you need basic types, jumps, conditions, ... the usual. Most modern OO languages fit these criteria.

What cannot be cross-compiled

Basically Reflection.Emit (dynamic proxies) and native P/Invoke calls. The latter means:

  • User interface components
  • Threads, Socket, Math, Environment, Buffer, Enum, GC, Object -- these need to be rebuilt in the target language

That's it :-)

  • For C++ there's a 'Corlib' folder, with a compilable stub implementation. (See also C++ emitting below)
  • These stubs are initially generated through reflection.

Current state of decompiler

  • Block analyzer is done
  • Decompiler is done
  • Flow analysis is done

This combination already produces quite nice code. Some optimizations are still possible, e.g. with logical expressions, but to be honest that has no real practical value.

C# emitting

  • Basically this is just for test purposes, although it's quite easy to create a full-blown decompiler with this.

Java emitting

  • The Java byte code decompiler is mostly done
  • The Java emitter is still quite some work
  • TODO: figure out if Java byte code can be mapped 1:1 to IL. At first glance, it seems Java has an easier IL / type system than C#. Having such a tool would make the cross-compiler usable for Java code as well!

C++ emitting

  • Practically everything is implemented.
  • A separate standard library (Corlib) project is parsed; if a method is found there, it's copied instead of emitted.
  • C++ code and header files are emitted separately. All references are forward declared except for base classes.
  • The emitter assumes a GC is present to handle memory management.
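The 'copy from Corlib if found, otherwise emit' rule from the list above can be sketched as a simple lookup. This is a Python illustration only: the `CORLIB` table, the method names and both functions are assumptions, not the project's code.

```python
# Hypothetical hand-written C++ bodies parsed from the Corlib project.
CORLIB = {
    "System::Math::Abs": "return value < 0 ? -value : value;",
}


def fake_emitter(method_name):
    """Stand-in for the real emitter, which works on the decompiled blocks."""
    return f"/* emitted from IL: {method_name} */"


def body_for(method_name, corlib, emit_from_il):
    """Copy the Corlib body when one exists, otherwise emit from IL."""
    if method_name in corlib:
        return corlib[method_name]
    return emit_from_il(method_name)


print(body_for("System::Math::Abs", CORLIB, fake_emitter))
print(body_for("Program::Main", CORLIB, fake_emitter))
```

With this rule, methods that cannot be cross-compiled (for example those relying on native calls) can be supplied by hand in Corlib without changing the emitter.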

TODO:

  • Reflection (perhaps even Emit via an interpreter)
  • Arrays, strings
  • Standard library (corlib)

Other languages

  • No concrete plans.
  • JavaScript seems interesting.