Commits

camlspotter committed 55af6fd

document

Comments (0)

Files changed (1)

+=====================================================
+A safe but strange way of modifying OCaml compiler
+=====================================================
+
+OCaml 4.00.0 is out! Have you tried it? GADT? Better first class modules? More warnings?
+
+I am talking something different, something more exciting at least for me. Compiler-libs.
+
+Compiler-libs are modules of the compiler itself, and now available for everyone as a library, even for binary package users. This means that we can deliver compilish hacks much easier to everyone. If the hack is reasonably small we can publish them not as a compiler patch which requires boring source download + patching + entire recompilation of  the compiler, but as a stand alone tool which compiles in really shorter time. Here, I am going to demonstrate such a small compiler hack, SML style overloading, my favorite compiler mod.
+
+Safe compiler mod ever
+============================
+
+What is great about 4.00.0 is it also have an untyper and an AST printer. They are not in the part of the compiler-libs, but found in tools dir. (So for binary package users we must copy them but they are very small, and I hope they are soon in compiler-libs in 4.00.1 or 4.01.0.)
+
+The untyper takes a type-checked source tree (Typedtree), strips away its attached type information, then returns the corresponding untyped tree (Parsetree). The AST printer prints out a Parsetree as a reparseable OCaml source code.
+
+Using them we can create safe compiler mods: our modified compiler can do whatever it wants, then it makes the result back to Parsetree and refeeds it to the original typechecker and compiler. If the mod does something wrong, the original compiler part should find it. If a user is paranoiac about what our mod does, we can always print out the result as vanilla OCaml code. Cool.
+
+
+Preparation
+===============
+
+All the code is available here::
+
+    hg clone https://bitbucket.org/camlspotter/compiler-libs-hack
+
+It contains the full source tree of the official OCaml 4.00.0, but it is attached only for the copyright requirements. We only need few files of it. And of course, you must have OCaml 4.00.0 installed.  
+
+Vanilla compiler
+====================
+
+First of all, lets start cloning a vanilla compiler from compiler-libs. It is very easy::
+
+    $ cd vanilla
+    $ make
+
+    cp ../ocaml/driver/main.ml main.ml
+    ocamlc -I +compiler-libs -I +unix -c main.ml
+    ocamlc -o vanilla -I +compiler-libs ocamlcommon.cma ocamlbytecomp.cma main.cmo
+
+    cp ../ocaml/driver/optmain.ml optmain.ml
+    ocamlc -I +compiler-libs -I +unix -c optmain.ml
+    ocamlc -o vanillaopt -I +compiler-libs ocamlcommon.cma ocamloptcomp.cma optmain.cmo
+
+To build a vanilla ocamlc, we need the original ``main.ml`` and link it with ``ocamlcommon.cma`` and ``ocamlbytecomp.cma``. ``main.ml`` must be copied from the original source tree, since it is not included in the compiler-libs.
+
+For the native code compiler, instead of ``main.ml`` and ``ocamlbytecomp.cma``, we use ``optmain.ml`` and ``ocamloptcompo.cma``.
+
+Now you have two executables ``vanilla`` and ``vanillaopt``, which are actually clones of ``ocamlc`` and ``ocamlopt``. Try using them to compile some simple modules to see they are really working.
+
+Now you know how to use compiler-libs. Let's do something more interesting.
+
+Compiler with untype+retyping
+===============================
+
+The next thing is to use the untyper and the AST printer. Here we modify the bytecode compiler workflow a bit, so that once the original compiler type-check the source code, we untype it, then print it as readable OCaml source, then retype it again. The workflow is implemented in ``ocaml/driver/compile.ml``::
+
+      Pparse.file ppf inputfile Parse.implementation ast_impl_magic_number
+      ++ print_if ppf Clflags.dump_parsetree Printast.implementation
+      ++ Typemod.type_implementation sourcefile outputprefix modulename env
+      ++ Translmod.transl_implementation modulename
+      ++ print_if ppf Clflags.dump_rawlambda Printlambda.lambda
+      ++ Simplif.simplify_lambda
+      ++ print_if ppf Clflags.dump_lambda Printlambda.lambda
+      ++ Bytegen.compile_implementation modulename
+      ++ print_if ppf Clflags.dump_instr Printinstr.instrlist
+      ++ Emitcode.to_file oc modulename;
+
+Simple. The source file is first parsed by ``Pparse.file``, then the result is sent to the next line of the parsetree dumper, then sent to the type checker, and so on... The source is pipelined from the top line to the bottom.
+
+We here insert few extra steps into this pipeline to untype and print::
+
+      Pparse.file ppf inputfile Parse.implementation ast_impl_magic_number
+      ++ print_if ppf Clflags.dump_parsetree Printast.implementation
+      ++ Typemod.type_implementation sourcefile outputprefix modulename env
+      ++ (fun (str, _) -> 
+        let ptree =  Untypeast.untype_structure str in
+        Format.eprintf "%a@." Pprintast.structure ptree;
+        ptree
+      )
+      ++ Translmod.transl_implementation modulename
+      ++ print_if ppf Clflags.dump_rawlambda Printlambda.lambda
+      ++ Simplif.simplify_lambda
+      ++ print_if ppf Clflags.dump_lambda Printlambda.lambda
+      ++ Bytegen.compile_implementation modulename
+      ++ print_if ppf Clflags.dump_instr Printinstr.instrlist
+      ++ Emitcode.to_file oc modulename;
+
+Typed structure ``str`` from ``Typemod.type_implementation`` is untyped back to ``ptree`` by ``Untypeast.untype_structure``, then it is printed out by ``Pprintast.structure``. The untyped tree is sent again to the type checker and the later steps.
+
+Does it really work? Yes!::
+
+    $ cd retype
+    $ make
+
+It creates a bytecode compiler ``retype``. It just works as ``ocamlc``, but it also prints out the source code.
+
+Now compiler mod!
+===============================
+
+Now you should get the idea of compiler modification with compiler-libs: your compiler mod somehow creates a untyped AST, then feed it to the original typechecker and the following compiler pipeline. The original type-checker assures the safety of the output of your mod. The output can be printed as a normal OCaml code by the AST printer, too.
+
+By this, you can even have your own parser and you own type-checker in order to implement a completely diffrent language which uses OCaml as a backend!
+
+But for this time, I would like to demonstrate something much simpler: using the original parser and type-checker, then modify that typedtree: adding another pipeline step after the first type checking of the ``retype`` compiler::
+
+      (* See overload/compile.ml *)
+      ...
+      ++ Typemod.type_implementation sourcefile outputprefix modulename env
+      ++ (fun (str, _) -> Mod.structure str)   (* We modify the tree! *)
+      ++ (fun str -> 
+        let ptree =  Untypeast.untype_structure str in
+        Format.eprintf "%a@." Pprintast.structure ptree;
+        ptree)
+      ++ Typemod.type_implementation sourcefile outputprefix modulename env
+      ++ ...
+
+``Mod.structure : Typedtree.structure -> Typedtree.structure`` does something fancy, in this article, SML styple overloading resolution!
+
+SML style overloading
+===============================
+
+SML style overloading is very simple way to overload things. Much simpler than Haskell type classes, so you cannot derive overloading from overloaded values. You can get the idea from my past article `http://camlspotter.blogspot.sg/2011/09/small-patch-for-bizarre-but-user.html`. Let's try to overload ``(+)`` here too.
+
+The design of the mod of this time is as follows. We need a seed of overloaded values, with a polymorphic type, but without any actual definition. Fortunately, we have a way for this in OCaml: primitive declaration::
+
+    module Loaded = struct
+      external (+) : 'a -> 'a -> 'a = "OVERLOADDED"
+    end
+
+Here we declare ``Loaded.(+)`` to be a polymorphic function whose implementation is by C function named ``OVERLODED``. But we do not give any C code. The name ``OVERLOADED`` is just a marking for our overloading. Very luckily, we can have such a fake polymorphic value in OCaml as far as such a value is never used.
+
+In this ``Loaded`` module we stack sub-modules which provide overloaded instances for this ``(+)``:: 
+
+    module Loaded = struct
+      external (+) : 'a -> 'a -> 'a = "OVERLOADDED"
+      module Int = struct
+        let (+) = Pervasives.(+)
+      end
+      module Float = struct
+        let (+) = Pervasives.(+.)
+      end
+    end
+
+Here we have pluses for ``int`` and ``float``. Now the preparation is done! Let's use ``Loaded.(+)`` as if it is overloaded by these two instances!::
+
+    open Loaded
+    let _ = 
+      assert (1 + 2 = 3);
+      assert (1.2 + 3.4 = 4.6)
+
+Hey, I used ``Loaded.(+)``, which is actually a C primitive without C code! Is it ok? It is NOT, without our compiler mod. The mod must replace the use of ``Loaded.(+)`` by ``Loaded.Int.(+)`` or ``Loaded.Float.(+)`` appropriately depending on its type from the context: the first ``+`` is ``int -> int -> int`` and the second is ``float -> float -> float``::
+
+    (* See overload/mod.ml *)
+    let resolve_overloading e lidloc path = ...
+
+    class map = object (self)
+      inherit Ttmap.map as super
+    
+      method! expression = function
+        | ({ exp_desc= Texp_ident (path, lidloc, vdesc) } as e)-> 
+            begin match vdesc.val_kind with
+            | Val_prim { Primitive.prim_name = "OVERLOADED" } ->
+                self, resolve_overloading e lidloc path
+            | _ -> super#expression e
+            end
+        | e -> super#expression e
+    end
+    
+    let structure str = 
+      let o = new map in
+      let _, str =  o#structure str in
+      str
+
+Here is (some part of) the code of the mod. It is a function of ``Typedtree.structure -> Typedtree.structure``, but we are only interested in the uses of identifiers whose definitions are by primitives ``OVERLOADED``. So the boilerplate code to dig into the AST data types I used a generic map class ``Ttmap`` created by a CamlP4 hack. For each identifier whose definition is ``OVERLOADED`` is converted by the function ``resolve_overloading`` function.
+
+The actual overload resolution is quite simple, if you know the internals of OCaml type-checker. But if you don't, it is just too painful to read. So skipped (see ``mod.ml`` if you are really interested). The big picture is: traverse the module which defines the primitive to find the values with the same name, then filter out those which do not match the context type. If there is none matched, error. If there are more than one matches, error. If there is only one candidate, replace the primitive use by the candidate variable.
+
+Anyway, building and playing this mod is very easy::
+
+    $ cd overlaod
+    $ make
+
+It creates a bytecode compiler ``poorman``. Well, compared to the full overloading by type classes, this is very simple, a poorman's solution. We have a test code at ``test/test.ml`` so you can try compiling it by ``poorman``::
+
+    $ ./poorman -o test/test test/test.ml
+    $ ./test/test  # Well, it just tests some assertions
+
+Do you see how the overloaded instances are declared in ``test/test.ml``? They are separately defined in modules and then gathered under ``Loaded`` with the ``OVERLOADED`` primitive by module aliases. Actually it is very powerful mechanism to tailor overloading!
+
+
+That's all, folks!
+===============================
+
+This kind of compiler modifications are of course possible even in the previous versions of OCaml compilers, but their distributions had to be as patches against the original compilers, and the users need to recompile the whole compiler sets, which took about 10 minutes. But now, with compiler-libs, it is less than one minute. Compiler-libs are not just for strange compiler mods, but also good for compiler related tool development. It is really encouraging for us, OCaml mutators, since we can deliver our compiler prototypes very easily to end users!
+