% thesis/related.tex

%---------------------
\chapter{Related work}
%---------------------
\label{cha:related}

This chapter draws connections to two related works on optimizing Haskell
with trace-based compilation: first, \citet{peixotto} with DynamoRIO and
Htrace (\autoref{sec:peixotto}); and second, \citet{schilling} with a
method based on LuaJIT (\autoref{sec:lambdachine}).
While the LLVM backend for \gls{ghc} contains a \gls{jit}, it is not
trace-based (\autoref{sec:llvm}).
These optimization attempts, and how they relate to this thesis, are
summarized in \mypref{sec:relatedsummary}.

\section{LLVM backend for GHC}
%-----------------------------
\label{sec:llvm}

\citet{terei} have created a new backend for \gls{ghc} that uses LLVM, an
optimizing compiler framework.
\Gls{ghc} started out by translating \gls{stg} to C, which made \gls{ghc}
portable across multiple architectures and operating systems.
This C backend targets the GNU C compiler and its language extensions to
get access to proper tail calls, first-class labels, and more.
\Gls{ghc}'s \gls{ncg} was created to address some of the downsides of the
C backend: relatively long compilation times; inefficient generated
assembly code; and a complex Perl script, nicknamed ``the evil mangler'',
used to rewrite the assembly code.
According to \citeauthor{terei}, the LLVM backend provides two main
benefits compared to the \gls{ncg} and C backends.
First, it offloads work by outsourcing native code generation to the
externally maintained LLVM project.
Second, it offers better performance through more efficient generated
assembly and access to the LLVM \gls{jit}.

\Gls{ghc}'s \gls{cmm} intermediate representation is the input to the
LLVM backend, just as for the other two backends.
The LLVM backend therefore maintains \gls{abi} compatibility with the C
and \gls{ncg} backends.
The backend performs the following steps:
\begin{itemize}
\item Translate \gls{cmm} code to unoptimized LLVM code.
\item Convert variables into \gls{ssa} form, as demanded by LLVM\@.
The backend allocates each mutable \gls{cmm} variable on the stack, and
then uses \mycode{mem2reg} from LLVM to perform the \gls{ssa} conversion.
\item Map \gls{stg} registers to hardware registers by creating a new
calling convention for LLVM where the arguments of function calls are
stored in registers.
These arguments are only in the appropriate hardware registers on entry
to a function, which is enough to satisfy the \gls{ghc} \gls{abi}.
\end{itemize}

\citeauthor{terei} have evaluated the LLVM backend against the C and
\gls{ncg} backends, with regard to their code size and the performance of
the assembly code they generate.
The LLVM backend is the smallest at 3.1 \gls{kloc}, the C backend is
5.3 \gls{kloc}, and the \gls{ncg} backend is 20.5 \gls{kloc}.
The performance evaluation uses the \emph{nofib}~\cite{partain} benchmark
suite and some additional benchmarks.
The three backends showed little overall difference in runtime on the
nofib benchmarks.
The \gls{ncg} was $0.1\%$ faster than the LLVM backend, and the C backend
was $2.7\%$ slower than the LLVM backend~\cite{terei}.
\citeauthor{terei} believe that the \gls{cmm} code used as input by all
three backends is hard to optimize.
The \gls{cmm} code is essentially memory bound, as Haskell uses lazy
evaluation.
They tested this with additional benchmarks containing tight loops, where
they saw considerably better runtimes for the LLVM backend.

\section{DynamoRIO and Htrace}
%-----------------------------
\label{sec:peixotto}

David M.~\citet{peixotto} has attempted to optimize the low-level
imperative code generated by \gls{ghc} with trace-based binary
optimization techniques.
He created a hand-coded case study showing that trace-based optimizations
can be profitable for Haskell programs.
He then tested two different methods: first, a dynamic binary trace-based
optimizer; and second, a static trace-based optimizer.

\subsection{Nofib benchmark suite}

\citeauthor{peixotto}'s initial investigations used the
nofib~\cite{partain} benchmark suite.
He found it ``difficult to collect accurate benchmark
numbers''~\cite[p.~2]{peixotto} with \emph{nofib}, as many of its
programs ran in less than one second.
\citeauthor{peixotto} therefore created the Fibon benchmark suite to more
accurately evaluate the effects of compiler optimizations.
Fibon consists of 32 benchmarks from four sources:
Hackage\footnote{Hackage: \url{http://hackage.haskell.org}},
Shootout\footnote{The Computer Language Benchmarks Game:
\url{http://shootout.alioth.debian.org/}}, Repa~\cite{repa}, and
\gls{dph}~\cite{dph}.

\subsection{DynamoRIO}

DynamoRIO allows automatic tracing of programs at runtime, which
\citeauthor{peixotto} used to explore dynamic trace-based optimizations
of Haskell.
DynamoRIO builds traces by monitoring the application's stream of
instructions, and therefore works on unmodified program binaries without
needing the application's source code.
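The general mechanism behind this kind of trace building can be
illustrated with a toy interpreter. The following Python sketch is only
an illustration of the idea, not DynamoRIO's actual design: the bytecode
format, the hotness threshold, and the backward-jump heuristic are all
invented for this example. A counter per jump target detects a hot loop
header; once the counter crosses the threshold, the instructions
executed during the next loop iteration are recorded as a trace.

```python
# Toy sketch of trace recording: detect hot loop headers by counting
# backward jumps, then record one iteration's instructions as a trace.
# The bytecode and threshold are invented for illustration only.

HOT_THRESHOLD = 3

def run(program):
    counters = {}          # loop header pc -> times jumped to
    traces = {}            # loop header pc -> recorded trace
    recording_from = None  # header we are currently recording for
    trace = []
    pc = 0
    env = {"i": 0}
    while pc < len(program):
        op, arg = program[pc]
        if recording_from is not None:
            trace.append((op, arg))    # record while in a hot loop
        if op == "incr":
            env[arg] += 1
            pc += 1
        elif op == "jump_if_lt":       # arg = (var, limit, target)
            var, limit, target = arg
            taken = env[var] < limit
            if taken and target < pc:  # taken backward jump: a loop
                counters[target] = counters.get(target, 0) + 1
                if recording_from == target:
                    traces[target] = trace     # loop closed: trace done
                    recording_from, trace = None, []
                elif counters[target] == HOT_THRESHOLD and target not in traces:
                    recording_from, trace = target, []  # start recording
            pc = target if taken else pc + 1
        else:
            raise ValueError(op)
    return env, traces

# A two-instruction loop: increment i until it reaches 10.
prog = [("incr", "i"), ("jump_if_lt", ("i", 10, 0))]
env, traces = run(prog)
```

After the loop header at \mycode{pc = 0} becomes hot, one full iteration
(the increment plus the loop-back branch) is recorded as the trace.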
\citeauthor{peixotto} discovered that DynamoRIO added a 57\% overhead
just to find traces, so the trace optimizations must overcome this
overhead for the method to be beneficial.
\citeauthor{peixotto} believed the main problem was that the heuristics
DynamoRIO uses to build traces are not suitable for Haskell programs.
Another issue was that the traces included arbitrary parts of
\gls{ghc}'s runtime, such as the garbage collector.

\subsection{Htrace}

To avoid the runtime overhead of DynamoRIO, \citeauthor{peixotto}
created Htrace, which ``finds traces in a separate profiling run and
uses them to restructure the program offline''~\cite[p.~87]{peixotto}.
Htrace consists of three distinct phases: first, finding traces; second,
restructuring the low-level code around traces; and third, optimizing
the traces.
Htrace is built with \gls{ghc} and LLVM --- traces are optimized with
the compiler optimizations provided by LLVM.
\gls{ghc} required some small changes to enable dynamic linking of some
C functions into LLVM\@.
Htrace disables \gls{ghc}'s ``tables next to code'' optimization and
uses a pure Haskell library for integer arithmetic,
\mycode{integer-simple}.
Two new passes were added to LLVM: inserting trace instrumentation and
building traces.
The LLVM bitcode interpreter, \mycode{lli}, was changed to add callbacks
to the trace runtime.
Htrace performs four main tasks:
\begin{itemize}
\item Create LLVM bitcode from the program source.
\item Create LLVM bitcode from the Haskell libraries.
\item Determine the external libraries used.
\item Create a makefile to do the build.
\end{itemize}

\citeauthor{peixotto} compared Htrace against \gls{ghc} with the LLVM
backend.
His Htrace results show an average speedup of 5\% on the Fibon
benchmarks, and a maximum speedup of 86\% on a single benchmark.
Only two benchmarks showed more than 5\% performance degradation.
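Htrace delegates the trace-optimization phase to LLVM's existing
optimization passes. To give a flavor of what optimizing a linear trace
involves, the following Python sketch applies constant folding to a toy
three-address trace; the trace representation and the operation set are
invented for this illustration and do not correspond to Htrace's or
LLVM's actual internals.

```python
# Simplified illustration of optimizing a linear trace: constant
# folding over toy three-address operations (dest, op, lhs, rhs).
# Operands are either integer constants or variable names.

def fold_constants(trace):
    """Drop operations whose operands are all known constants,
    substituting the folded values into later operations."""
    consts = {}      # variable -> known constant value
    optimized = []
    for dest, op, lhs, rhs in trace:
        lhs = consts.get(lhs, lhs)   # substitute known constants
        rhs = consts.get(rhs, rhs)
        if isinstance(lhs, int) and isinstance(rhs, int):
            consts[dest] = {"add": lhs + rhs, "mul": lhs * rhs}[op]
        else:
            optimized.append((dest, op, lhs, rhs))
    return optimized

# t0 = 2 + 3; t1 = t0 * 4; t2 = x + t1
trace = [("t0", "add", 2, 3),
         ("t1", "mul", "t0", 4),
         ("t2", "add", "x", "t1")]
optimized = fold_constants(trace)
```

Only the last operation survives, with the folded constant 20
substituted for \mycode{t1}; because a trace is a single linear
instruction sequence with no join points, such forward passes need only
one pass over the operations.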
\section{Lambdachine}
%--------------------
\label{sec:lambdachine}

Thomas~\citet{schilling} has implemented a prototype \gls{vm} with a
tracing \gls{jit} compiler, called lambdachine.
The \gls{vm} uses \gls{ghc} as a frontend to compile Haskell code into
Core.
Core is then compiled into a custom bytecode format that lambdachine
interprets.
The \gls{vm} and the bytecode format adopt ideas and techniques from
LuaJIT~2\footnote{The LuaJIT project: \url{http://luajit.org/}}.
\citeauthor{schilling} has identified and formulated three challenges
for optimizing lazy evaluation using trace compilation:
\begin{description}
\item[Challenge 1] Can we use specialization to remove the overhead of
evaluation and of indirect function calls due to type class
dictionaries?
If the number of possible shapes per evaluation site is small, then the
size of the trace trees will remain small, and they thus remain
efficient.
It is, however, likely that there are a few functions which are
megamorphic, \ie exhibit a large number of different argument
shapes~\cite[p.~5]{schilling}.
\item[Challenge 2] Sharing information at runtime is important to
enable deforestation with a dynamic compiler.
Both static and dynamic approaches are possible, and they likely have
different trade-offs in terms of accuracy, implementation cost, and
runtime overhead.
It is not clear which trade-offs will work best for dynamic
optimization systems~\cite[p.~7]{schilling}.
\item[Challenge 3] Evaluate different trace selection schemes by their
coverage and code size.
Functional programming benchmarks may exhibit different execution
behavior from standard benchmarks for imperative and object-oriented
languages~\cite[p.~8]{schilling}.
\end{description}

Lambdachine performs the following standard optimization techniques,
where the forward optimizations are performed immediately, and the rest
happen right before machine code generation:
\begin{itemize}
\item Common sub-expression elimination.
\item Constant folding, algebraic optimizations, and reassociation.
\item Redundant load removal.
\item Store-to-load forwarding.
\item Redundant guard removal.
\item Dead code elimination.
\item Loop unrolling.
\item Allocation removal.
\end{itemize}

According to \citeauthor{schilling}, lambdachine supports ``basic
Haskell programs that use only a small subset of built-in types, Char,
Bool and Int. All user defined types are supported, but the IO monad or
arrays are not supported''~\cite[p.~13]{schilling}.
Loop unrolling is only applied to traces with constant stack usage, and
side traces are not implemented.
\citeauthor{schilling} has evaluated lambdachine on two simple
benchmarks, with the \gls{jit} enabled and disabled.
His results show that lambdachine is able to remove a significant
amount of the operations executed when the \gls{jit} is enabled.

\section{Summary}
%----------------
\label{sec:relatedsummary}

The attempts to optimize \gls{ghc} described in this chapter are
summarized in \autoref{tab:related}.
The LLVM backend for \gls{ghc} has not achieved any clear performance
advantage~\cite{terei} over \gls{ghc}'s \gls{ncg}.
\gls{ghc} with DynamoRIO was abandoned, as tracing created too much
overhead and produced poor traces~\cite{peixotto}.
Htrace produced an average speedup of 5\% over the LLVM
backend~\cite{peixotto}.
Lambdachine is still in development, and no results comparing it to
other \gls{ghc} backends have yet been published~\cite{schilling}.
\begin{table}[tbp]
\footnotesize
\center
\caption{Overview of the GHC optimization attempts
described\label{tab:related}}
\begin{tabular}{l c c c l}
\toprule
Name & Trace-based & \gls{jit} & \gls{ghc} version & Creator \\
\midrule
GHC LLVM backend & No & Yes & 7.6+ & \citet{terei} \\
GHC with DynamoRIO & Yes & Yes & 7.4 & \citet{peixotto} \\
Htrace & Yes & No & 7.4 & \citet{peixotto} \\
Lambdachine & Yes & Yes & 7.0 & \citet{schilling} \\
PyHaskell & Yes & Yes & 7.6 & TODO \\
\bottomrule
\end{tabular}
\end{table}

\mytodo{TODO: maybe add a paragraph where I summarize how the different
approaches differ from my approach.\ draw lessons from LLVM backend to
my RPython backend. Something about RPython backend translates from STG
instead of Cmm. Therefor no ABI compatibility with other backends. And
none of the steps LLVM backend takes are relevant for us. But the
runtime evaluation methods are same/similar, and benefit that the LLVM
backend provides, so do we.\ explain how we use RPython/PyPy while
lambdachine ``uses'' LuaJIT, and how this makes our goals the same but
different methods. Also his challenges.}