Source

compiler-libs-hack /

Filename Size Date modified Message
ocaml
overload
ppx
retype
vanilla
287 B
430 B
224 B
14.9 KB

A safe but strange way of modifying OCaml compiler

OCaml 4.00.0 is out, and now we are with 4.02.0. Have you tried it? GADT? Better first class modules? More warnings?

I am going to talk something different, something more exciting at least for me. Compiler-libs.

Compiler-libs are modules of the compiler itself, and now available for everyone as a library, even for binary package users. This means that we can deliver compilish hacks much easier to everyone. If the hack is reasonably small we can publish them not as a compiler patch which requires boring source download + patching + entire recompilation of the compiler (even with the help of OPAM, it takes long time), but as a stand alone tool which compiles in really shorter time. Here, I am going to demonstrate such a small compiler hack, freely extensible SML style overloading, my favorite compiler mod.

Safe compiler mod ever

What is great since 4.00.0 is it also have the untyper (Untypeast) and the AST printer (Pprintast). They are not in the part of the compiler-libs, but found in tools dir.

The untyper takes a type-checked source tree (Typedtree), strips away its attached type information, then returns the corresponding untyped tree (Parsetree). Pprintast prints out a Parsetree as a reparseable OCaml source code for debugging purpose.

Using them we can create safe compiler mods: our modified compiler can do whatever it wants, then it makes an AST of Parsetree as the final result, then feeds it to the original typechecker and compiler. If the mod does something wrong, the original compiler part should find it. If a user is paranoiac about what our mod does, we can always print out the result as vanilla OCaml code using Pprintast. Cool.

Preparation

All the code is available here:

$ hg clone https://bitbucket.org/camlspotter/compiler-libs-hack
$ hg update 4.02.0

Note that the code and compiler-libs are highly dependent on OCaml 4.02.0 compiler internals. It is unlikely that it will work with later versions of OCaml.

Vanilla compiler

First of all, lets start cloning a vanilla compiler from compiler-libs. It is very easy:

$ cd vanilla
$ make

cp ../ocaml/driver/main.ml main.ml
ocamlc -I +compiler-libs -I +unix -c main.ml
ocamlc -o vanilla -I +compiler-libs ocamlcommon.cma ocamlbytecomp.cma main.cmo

cp ../ocaml/driver/optmain.ml optmain.ml
ocamlc -I +compiler-libs -I +unix -c optmain.ml
ocamlc -o vanillaopt -I +compiler-libs ocamlcommon.cma ocamloptcomp.cma optmain.cmo

To build a vanilla ocamlc, we need the original main.ml and link it with ocamlcommon.cma and ocamlbytecomp.cma. main.ml must be copied from the original source tree, since it is not included in the compiler-libs.

For the native code compiler, instead of main.ml and ocamlbytecomp.cma, we use optmain.ml and ocamloptcompo.cma.

Now you have two executables vanilla and vanillaopt, which are actually clones of ocamlc and ocamlopt. Try using them to compile some simple modules to see they are really working.

Now you know how to use compiler-libs. Let's do something more interesting.

Compiler with untype+retyping

The next thing is to use the untyper and the AST printer. Here we modify the bytecode compiler workflow a bit, so that once the original compiler type-check the source code, we untype it, then print it as readable OCaml source, then retype it again. The workflow is implemented in ocaml/driver/compile.ml:

ast
++ print_if ppf Clflags.dump_parsetree Printast.implementation
++ print_if ppf Clflags.dump_source Pprintast.structure
++ Typemod.type_implementation sourcefile outputprefix modulename env
++ print_if ppf Clflags.dump_typedtree
            Printtyped.implementation_with_coercion
++ Translmod.transl_implementation modulename
++ print_if ppf Clflags.dump_rawlambda Printlambda.lambda
++ Simplif.simplify_lambda
++ print_if ppf Clflags.dump_lambda Printlambda.lambda
++ Bytegen.compile_implementation modulename
++ print_if ppf Clflags.dump_instr Printinstr.instrlist
++ Emitcode.to_file oc modulename objfile;

Simple. The source file is first parsed as ast, then it is sent to the next line of the parsetree dumper, then sent to the type checker, and so on... The source is pipelined from the top line to the bottom.

We here insert few extra steps into this pipeline to untype and print:

ast
++ print_if ppf Clflags.dump_parsetree Printast.implementation
++ print_if ppf Clflags.dump_source Pprintast.structure

(* HACK: Save the parsed result as xxx.ml1 *)
++ (fun ptree -> 
  let ml1file = outputprefix ^ ".ml1" in
  let oc1 = open_out_bin ml1file in
  let ppf = Format.formatter_of_out_channel oc1 in
  Format.fprintf ppf "%a@." Pprintast.structure ptree; 
  close_out oc1;
  ptree
)
(* HACK END *)

(* continues... *)

First, we save parsed ast as xxx.ml1 using the AST printer Pprintast.structure. Then ast is given to the type checker:

++ Typemod.type_implementation sourcefile outputprefix modulename env
++ print_if ppf Clflags.dump_typedtree
            Printtyped.implementation_with_coercion

(* HACK: untype + retype *)
++ (fun (str, _) -> 
  (* Untype then save it as xxx.ml2 *)
  let ptree =  Untypeast.untype_structure str in
  let ml2file = outputprefix ^ ".ml2" in
  let oc2 = open_out_bin ml2file in
  let ppf = Format.formatter_of_out_channel oc2 in
  Format.fprintf ppf "%a@." Pprintast.structure ptree; 
  close_out oc2;
  ptree)

(* continues .. *)

Typed structure str from Typemod.type_implementation is untyped back to ptree by Untypeast.untype_structure, then it is printed to xxx.ml2 by Pprintast.structure.

++ ( fun ptree -> 
  (* Retype!

     Beware! We must reset the state of typing, 
     or anything strange could happen!

     I am not sure all the followings are required but at least
     they are done at the beginning of implementation, so it should be
     ok.
  *)
  Compmisc.init_path false;
  Env.set_unit_name modulename;
  let env = Compmisc.initial_env () in
  Typemod.type_implementation sourcefile outputprefix modulename env ptree)
  (* HACK END *)

++ Translmod.transl_implementation modulename
++ print_if ppf Clflags.dump_rawlambda Printlambda.lambda
++ Simplif.simplify_lambda
++ print_if ppf Clflags.dump_lambda Printlambda.lambda
++ Bytegen.compile_implementation modulename
++ print_if ppf Clflags.dump_instr Printinstr.instrlist
++ Emitcode.to_file oc modulename objfile;

Finally, the untyped tree is sent again to the type checker and the later steps.

Does it really work? Yes!:

$ cd retype
$ make

It creates a bytecode compiler retype. It just works as ocamlc, but it also save out the source code. Try it to compile some files.

Compiler mod!

So far we see that we can clone ocamlc compiler pretty easily with complier-libs, and also can perform something in addition in our clone. What do these mean? Here is the idea of compiler modification with compiler-libs: you can build your modified OCaml compiler pretty easily with compiler-libs.

In addition, the example of retype shows us how to make our modified compiler safer. Once we get an output typed tree from our modified compiler, we convert+untype it to vanilla OCaml parse tree and pass it to the vanilla type checker. It retypes it and assures the safety of the output then pass it down to the code generation. (For this, of course the mod must happen around the typing phase. This trick does not work for modifications in and after the byte code compilation.) By this, you can even have your own parser and you own type-checker in order to implement a completely diffrent language which uses OCaml as a backend with type safety!

But for this time, I would like to demonstrate something much simpler: using the original parser and type-checker, then modify that typedtree: adding another pipeline step after the first type checking of the retype compiler:

(* See overload/compile.ml *)
...
++ Typemod.type_implementation sourcefile outputprefix modulename env (* normal typing phase *)
++ print_if ppf Clflags.dump_typedtree
            Printtyped.implementation_with_coercion

(* HACK: typed mod *)
++ (fun (str, mc)   -> Mod.Map.map_structure str, mc)
(* HACK END *)

(* HACK: untype + retype *)
++ (fun (str, _) -> 
  (* Untype then save it as xxx.ml2 *)
  let ptree =  Untypeast.untype_structure str in
  ptree)

++ ( fun ptree ->
  Compmisc.init_path false;
  Env.set_unit_name modulename;
  let env = Compmisc.initial_env () in
  Typemod.type_implementation sourcefile outputprefix modulename env ptree)
  (* HACK END *)

++ ...

Mod.structure : Typedtree.structure -> Typedtree.structure does something fancy, in this article, SML style overloading resolution!

SML style overloading

SML style overloading is very simple way to overload things. Much simpler than Haskell type classes, so you cannot derive overloading from overloaded values. You can get the idea from my past article. Let's try to overload (+) here too.

The design of the mod of this time is as follows. We need a seed of an overloaded value, with a polymorphic type, but without any actual definition. Fortunately, we have a way for this in OCaml: primitive declaration:

module Loaded = struct
  external (+) : 'a -> 'a -> 'a = "%OVERLOADED"
end

Here we declare Loaded.(+) to be a polymorphic function whose implementation is by a primitive named %OVERLODED. But we do not give any deinition of the primitive. The name %OVERLOADED is just a mark for our overloading. Very luckily, we can have such a fake polymorphic value in OCaml as far as it is never actually used.

In this Loaded module, we stack sub-modules which provide overloaded instances for this (+):

module Loaded = struct
  external (+) : 'a -> 'a -> 'a = "%OVERLOADED"
  module Int = struct
    let (+) = Pervasives.(+)
  end
  module Float = struct
    let (+) = Pervasives.(+.)
  end
end

Here we have pluses for int and float. Now the preparation is done! Let's use Loaded.(+) as if it is overloaded by these two instances!:

open Loaded
let _ = 
  assert (1 + 2 = 3);
  assert (1.2 + 3.4 = 4.6) (* See it is not +. but + !!! *)

Hey, I used Loaded.(+), which is actually a C primitive without C code! Is it ok? It is NOT, without our compiler mod. The mod must replace the use of Loaded.(+) by Loaded.Int.(+) or Loaded.Float.(+) appropriately depending on its type from the context: the first + is int -> int -> int and the second is float -> float -> float:

(* See overload/mod.ml *)
let resolve_overloading e lidloc path = ...

module MapArg : TypedtreeMap.MapArgument = struct
  include TypedtreeMap.DefaultMapArgument

  let enter_expression = function
    | ({ exp_desc= Texp_ident (path, lidloc, vdesc) } as e)-> 
        begin match vdesc.val_kind with
        | Val_prim { Primitive.prim_name = "%OVERLOADED" } ->
            resolve_overloading e lidloc path
        | _ -> e
        end
    | e -> e
end

module Map = TypedtreeMap.MakeMap(MapArg)

Here is (some part of) the code of the mod. It defines a map of Typedtree AST, which transforms variables in expressions which are declared as "%OVERLOADED" primitive using resolve_overloading. To skip the boilerplate coding to dig into the AST, I used TypedtreeMap, a module for generic map for typed trees available in compiler-libs.

The actual overload resolution happens in resolve_overloading is quite simple if you know the internals of OCaml type-checker. But if you don't, it is just too painful to read. So it is skipped :^) (see mod.ml if you are really interested). The big picture is: traverse the module which defines the primitive to find the values with the same name, then filter out those which do not match the context type. If there is none left, error. If there are more than one matches, error (ambiguous). If there is only one candidate, replace the primitive use by the candidate variable.

Anyway, building and playing this mod is very easy:

$ cd overload
$ make

It creates a bytecode compiler poorman. Well, compared to the full overloading by type classes, this is very simple, a poorman's overloading solution. We have a test code at test/test.ml so you can try compiling it by poorman:

$ ./poorman -o test/test test/test.ml
$ ./test/test  # Well, it just tests some assertions

You can see the uses of Loaded.(+) are replaced by pluses which match with the typing contexts appropriately.

Do you see how the overloaded instances are declared in test/test.ml? They are separately defined in modules and then gathered under Loaded with the %OVERLOADED primitive by module aliases. Actually it is very powerful mechanism to tailor overloading!

Conclusion

This kind of compiler modifications are of course possible even in the previous versions of OCaml compilers, but their distributions had to be as patches against the original compilers, and the users need to recompile the whole compiler sets, which took about 10 minutes. But now, with compiler-libs, it is less than one minute. Compiler-libs are not just for strange compiler mods, but also good for compiler related tool development. It is really encouraging for us, OCaml mutators, since we can deliver our compiler prototypes very easily to end users!

One more thing! -ppx!!

From 4.02.0, -ppx option is fully available. This makes the compiler mod trick even easier to deploy for the other users: if your compiler modification does not involve with OCaml syntax, you can distribute your compiler mod as a ppx filter. People can use your compiler modification with the official OCaml compiler!

A ppx filter gets a parse tree AST as an input from the vanilla OCaml compiler. It modifies the AST, then returns another parse tree AST back to the vanilla. During this conversion of parse trees the filter can what ever it wants, even type-check the AST just like we have done in poorman. Good news are that some of important compiler options given to the vanilla OCaml compiler like -I dir are sent to ppx filters along with parse tree ASTs, therefore the ppx filters can run their type checker with the same configuration as the vanilla compiler almost automatically.

In ppx directory I put this ppx filter version of poorman:

$ cd ppx
$ make

This should build a ppx filter named ppx_overload. With this filer you can enjoy the overloading with the vanilla OCaml compiler:

$ ocamlc -ppx ./ppx_overload test/test.ml
$ ./a.out

Unfortunately this ppx trick does not work for the toplevel. It is since OCaml toplevel re-execute the ppx filter each time it gets a toplevel phrase. This makes the ppx filter hard to keep its state, in this case, the typing environment.