thesis / related.tex

Full commit
\chapter{Related work}
This chapter draws connections to two related work on optimizing Haskell with
trace-based compilation: One, \citet{peixotto} with DynamoRIO and Htrace
(\autoref{sec:peixotto}); And two, \citet{schilling} with a method based on
LuaJIT (\autoref{sec:lambdachine}). While the LLVM backend for \gls{ghc}
contains a \gls{jit}, it is not trace-based (\autoref{sec:llvm}).
These optimizing attempts and how they relate to this thesis are summarized in

\section{LLVM backend for GHC}
\citet{terei} have created a new backend for \gls{ghc} that uses LLVM, which
is an optimizing compiler framework.

\Gls{ghc} started with translating \gls{stg} to C, which made \gls{ghc}
portable across multiple architectures and operating systems. This C backend
targets the GNU C compiler and its language extensions to get access to
proper tail calls, first-class labels and more.

\Gls{ghc}'s \gls{ncg} was created to solve some of the downsides of the C
backend: Relative long compilation time; Generated assembly code that are
inefficient; And a complex Perl script nicknamed ``the evil mangler'' used to
rewrite assembly code.

According to \citeauthor{terei}, the LLVM backend provides two main benefits
compared to the \gls{ncg} and C backends. First, offloading of work by
outsourcing native code generating to the externally maintained LLVM project.
Second, better performance by generating more efficient assembly and
the LLVM \gls{jit}.

\Gls{ghc}'s \gls{cmm} intermediate representation is the input to the LLVM
backend, just as the other two backends. The LLVM backend therefor maintains
\gls{abi} compatibility with the C and \gls{ncg} backends.
The backend performs the following steps:
    \item Translate \gls{cmm} code to unoptimized LLVM code.
    \item Convert variables into \gls{ssa} form, as demanded by LLVM\@.
        The backend allocate each mutable \gls{cmm} variable on the stack,
        and then uses \mycode{mem2reg} from LLVM to perform \gls{ssa}
    \item Map \gls{stg} registers to hardware registers, by creating a new
        calling convention for LLVM where arguments of function calls
        are stored in registers. These arguments are only in the
        appropriate hardware registers on entry to any function, which is
        enough to satisfy \gls{ghc} \gls{abi}.

\citeauthor{terei} have evaluated the LLVM backend against the C and \gls{ncg}
backends, in regards to their code size and the performance of the assembly
code they generate. The LLVM backend is the smallest with 3.1 \gls{kloc}, the
C backend is 5.3 \gls{kloc}, and the \gls{ncg} backend is 20.5 \gls{kloc}.

The performance evaluation uses the \emph{nofib}~\cite{partain} benchmark
suite and some additional benchmarks. The three backends had overall little
difference on runtime for the nofib benchmarks. The \gls{ncg} was \( 0.1 \% \)
better than the LLVM, and the C backend was \( 2.7 \% \) slower than
LLVM~\cite{terei}. \citeauthor{terei} believes that the \gls{cmm} code used
as input by all three backends are hard to optimize. The \gls{cmm} code is
essentially memory bound as Haskell uses lazy evaluation. They tested this
with other benchmarks with tight loops, where they saw considerable better
runtimes for the LLVM backend.

\section{DynamoRIO and Htrace}
David M.~\citet{peixotto} has attempted to optimize low-level imperative code
generated by \gls{ghc}, with trace-based binary optimization techniques. He
created a hand-coded case study that showed trace-based optimizations can be
profitable for Haskell programs. Then he tested two different methods: First,
with a dynamic binary trace-based optimizer; and second, with a static
trace-based optimizer.

\subsection{Nofib benchmark suite}
\citeauthor{peixotto}'s initial investigations used the nofib~\cite{partain}
benchmark suite. He felt \emph{nofib} was ``difficult to collect accurate
benchmark numbers''~\cite[p.~2]{peixotto} with, as many of them ran in less
than one second. \citeauthor{peixotto} therefor created the Fibon benchmark
suite to more accurately evaluate the effects of compiler optimizations. Fibon
consist of 32 benchmarks from four sources:
Hackage\footnote{Hackage: \url{}},
Shootout\footnote{The Computer Language Benchmarks Game: \url{}},
Repa~\cite{repa}, and \gls{dph}~\cite{dph}.

DynamoRIO allows automatic tracing of programs at runtime, which
\citeauthor{peixotto} has used for exploring dynamic trace-based optimizations
of Haskell. DynamoRIO build traces by monitoring application's stream of
instructions, and therefor works on unmodified program binaries without need
for the application's source code. \citeauthor{peixotto} discovered that
DynamoRIO added a 57\% overhead to just find traces, so optimizations of
traces must overcome this overhead for this method to be beneficial.

\citeauthor{peixotto} believed the main problem was the heuristics used by
DynamoRIO to build traces are not suitable for Haskell programs. Another issue
was that the traces included arbitrary parts of \gls{ghc}'s runtime, such as
the garbage collector.

To avoid the runtime overhead from DynamoRIO, \citeauthor{peixotto} created
Htrace, which ``finds traces in a separate profiling run and uses them to
restructure the program offline''~\cite[p.~87]{peixotto}. Htrace consist of
three distinct phases: One, finding traces; Two, restricting low-level code
around traces; Three, optimizing the traces.

Htrace is build with \gls{ghc} and LLVM --- traces are optimized with compiler
optimizations provided by LLVM. \gls{ghc} required some small changes to
enable dynamic linking of some C functions into LLVM\@. Htrace disables
\gls{ghc}'s ``tables next to code'' optimization and uses a pure
Haskell library for integer arithmetic, \mycode{integer-simple}. Two new
passes were added to LLVM: inserting trace instrumentation and building
traces. The LLVM bitcode interpreter, \mycode{lli}, was changed to add
callbacks to the trace runtime. Htrace performs four main tasks:
    \item Create LLVM bitcode from program source.
    \item Create LLVM bitcode from Haskell libraries.
    \item Determine external libraries used.
    \item Create makefile to do the build.

\citeauthor{peixotto} compared Htrace against \gls{ghc} with the LLVM backend.
His Htrace results show an average speedup of 5\% on the Fibon benchmarks, and
a maximum speedup of 86\% on a single benchmark. Only two benchmarks showed
over 5\% performance degradation.

Thomas~\citet{schilling} has implemented a prototype \gls{vm} with a tracing
\gls{jit} compiler called lambdachine. This \gls{vm} use \gls{ghc} as a
frontend to compile Haskell code into Core. Core is then compiled into a
custom bytecode format that lambdachine interprets. The \gls{vm} and the
bytecode format adopt ideas and techniques from
LuaJIT~2\footnote{The LuaJIT project: \url{}}.

\citeauthor{schilling} has identified and formulated three challenges for
optimizing lazy evaluation using trace-compilation:
    \item[Challenge 1] Can we use specialization to
        remove the overhead of evaluation and indirect function calls due to
        type class dictionaries. If the number of possible shapes per
        evaluation site is small, then the size of trace trees will remain
        small and thus remain efficient. It is, however, likely that there a
        few functions which are megamorphic, \ie exhibit a large amount of
        different argument shapes~\cite[p.~5]{schilling}.
    \item[Challenge 2] Sharing information at runtime
        is important to enable deforestation with a dynamic compiler. Both
        static and dynamic approaches are possible and likely have different
        trade-offs in terms of accuracy, implementation cost, and runtime
        overhead. It is not clear which trade-offs will work best for dynamic
        optimization systems~\cite[p.~7]{schilling}.
    \item[Challenge 3] Evaluate different trace
        selection schemes by their coverage and code size. Functional
        programming benchmarks may exhibit different execution behavior from
        standard benchmarks for imperative/object-oriented

Lambdachine performs the following standard optimization techniques, where
the forward optimization are performed immediately, and the rest happen right
before machine code generation:
    \item Common sub-expression elimination.
    \item Constant folding, algebraic optimizations and reassociation.
    \item Redundant load removal.
    \item Store-to-load forwarding.
    \item Redundant guard removal.
    \item Dead code elimination.
    \item Loop unrolling.
    \item Allocation removal.

According to \citeauthor{schilling}, lambdachine support: ``basic Haskell
programs that use only a small subset of built-in types, Char, Bool and Int.
All user defined types are supported, but the IO monad or arrays are not
supported.''~\cite[p.~13]{schilling}. Loop unrolling are only applied to
traces with constant stack usage, and side traces are not implemented.

\citeauthor{schilling} has evaluated lambdachine on two simple benchmarks with
the \gls{jit} enabled or disabled. His results show that lambdachine are able
to remove significant amount of operations executed, when the \gls{jit} is

The attempts to optimize \gls{ghc} described in this chapter are summarized in
\autoref{tab:related}. The LLVM backend for \gls{ghc} has not achieved any clear
performance advantage~\cite{terei} over \gls{ghc}'s \gls{ncg}. \gls{ghc} with
DynamoRIO was abandoned as tracing created too much overhead, and produced
poor traces~\cite{peixotto}. Htrace produced an average of 5\% speedup over
the LLVM backend~\cite{peixotto}. Lambdachine is still in development, and has
not yet published any results comparing it to any other \gls{ghc}

\begin{table}[tbp] \footnotesize \center
\caption{Overview of GHC optimizations attempts described\label{tab:related}}
\begin{tabular}{l c c c l}
    Name & Trace-based & \gls{jit} & \gls{ghc} version & Creator \\
    GHC LLVM backend & No & Yes & 7.6+ & \citet{terei} \\
    GHC with DynamoRIO & Yes & Yes & 7.4 & \citet{peixotto} \\
    Htrace & Yes & No & 7.4 & \citet{peixotto} \\
    Lambdachine & Yes & Yes & 7.0 & \citet{schilling} \\
    PyHaskell & Yes & Yes & 7.6 & TODO \\

\mytodo{TODO: maybe add a paragraph where I summarize how the different
approaches differ from my approach.\
draw lessons from LLVM backend to my RPython backend.
Something about RPython backend translates from STG instead of Cmm.
Therefor no ABI compatibility with other backends. And none of the steps
LLVM backend takes are relevant for us. But the runtime evaluation methods
are same/similar, and benefit that the LLVM backend provides, so do we.\
explain how we use RPython/PyPy while lambdachine ``use''
LuaJIT, and how this makes our goals the same but different methods.
Also his challenges.}