% thesis/ghc.tex

%------------------------
\chapter{Haskell and GHC}
%------------------------

This chapter describes: the Haskell programming language (\autoref{sec:haskell}); \gls{ghc}, the main implementation of Haskell (\autoref{sec:ghc}); and Core, a \gls{ghc} intermediate language (\autoref{sec:core}).

\section{Haskell language}
%-------------------------
\label{sec:haskell}

Haskell is a purely functional, lazy programming language. Purely functional means that functions cannot have side effects or mutate data. Lazy means that evaluation is delayed until a result is required, \eg function arguments are passed unevaluated and then evaluated on demand. Delaying evaluation comes with several costs, \eg it is less efficient and it makes the space behavior of programs hard to predict. Purity is essentially a prerequisite for laziness, but I/O is clumsy in a pure language, so monadic I/O was invented for Haskell to solve this problem. A monad is a powerful abstraction that consists of a type constructor and a pair of functions: return and bind. Monads make it possible to express structured, sequential computations that pure functional languages do not otherwise support.
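The two monad operations can be sketched in Haskell itself. The following is a simplified, illustrative version of the standard \texttt{Monad} class; the primed names (\texttt{Monad'}, \texttt{return'}, \texttt{>>=:}) are hypothetical, chosen here only to avoid clashing with the Prelude. The last definition also demonstrates laziness with an infinite list.

```haskell
-- A simplified Monad class: return injects a value into the monad,
-- and bind sequences a computation with a function that produces
-- the next computation.
class Monad' m where
  return' :: a -> m a
  (>>=:)  :: m a -> (a -> m b) -> m b

-- Maybe as a monad: Nothing short-circuits the rest of the computation.
instance Monad' Maybe where
  return' = Just
  Nothing >>=: _ = Nothing
  Just x  >>=: f = f x

safeDiv :: Int -> Int -> Maybe Int
safeDiv _ 0 = Nothing
safeDiv x y = Just (x `div` y)

-- Bind chains the division with a further step; a division by zero
-- would propagate Nothing through the whole chain.
example :: Maybe Int
example = safeDiv 10 2 >>=: \q -> return' (q + 1)  -- Just 6

-- Laziness: the infinite list [1 ..] is only evaluated as far as
-- take demands.
firstFive :: [Int]
firstFive = take 5 [1 ..]  -- [1,2,3,4,5]
```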
Haskell also introduced type classes to support overloading of built-in numeric operators. Type classes define the behavior of types (similar to Java's interfaces) and provide support for ad-hoc polymorphism~\cite{hudak, ghc}. Other notable Haskell features include lazy evaluation, pattern matching, list comprehensions, and type inference. Type inference means that type annotations are rarely needed even though Haskell is statically typed. Haskell's list comprehensions have been adopted by both Python and JavaScript~\cite{hudak, haskell98, ghc}. Type classes and monads are perhaps Haskell's most distinctive design features, and both have influenced many other programming languages. For example, Scala, Coq, and Rust have adopted both monads and type classes, while C\#'s \gls{linq} was directly inspired by Haskell's monad comprehensions~\cite{hudak}.

Haskell was created to consolidate more than a dozen similar functional languages that were started in the late 1970s and early 1980s. A meeting in January 1988 defined the following six goals for Haskell~\cite[p.~4]{hudak}:
\begin{enumerate}
\item It should be suitable for teaching, research, and applications, including building large systems.
\item It should be completely described via the publication of a formal syntax and semantics.
\item It should be freely available. Anyone should be permitted to implement the language and distribute it to whomever they please.
\item It should be usable as a basis for further language research.
\item It should be based on ideas that enjoy a wide consensus.
\item It should reduce unnecessary diversity in functional programming languages.
\end{enumerate}
The second goal was never realized, as Haskell's syntax and semantics have not been formally described. The plan for the last goal was to base Haskell on an existing functional language called OL\@. This plan was abandoned early.
Haskell has successfully achieved most of the remaining goals, although some features, such as type classes, were added without regard for goal five~\cite{hudak}. Because Haskell was designed by committee, it is a rather big language, and there is usually more than one way to do something in Haskell. As the language grew it also evolved quickly, which was problematic for teaching and for applications that require stability. The committee therefore defined a stable version of the language, ``Haskell 98'', which implementations committed to support indefinitely. In 2005 the design of Haskell$'$ (pronounced ``Haskell Prime'') was started, to succeed Haskell 98 and to cover heavily used extensions~\cite{hudak}. Haskell 2010 is the latest stable version of the Haskell programming language~\cite{haskell2010}.

\section{The Glasgow Haskell Compiler}
%-------------------------------------
\label{sec:ghc}
\glsreset{ghc}

\Gls{ghc} is the most fully featured Haskell compiler today, and has been the main Haskell implementation since the release of the ``Haskell Report 1.0'' in 1990~\cite{hudak}. \citeauthor{ghc} states that today ``\gls{ghc} releases are downloaded by hundreds of thousands of people, the online repository of Haskell libraries has over 3,000 packages, \gls{ghc} is used to teach Haskell in many undergraduate courses''~\cite[p.~1]{ghc}.

\gls{ghc} can be divided into three distinct parts:
\begin{itemize}
\item The compiler, a Haskell program that converts Haskell source code to machine code.
\item The boot libraries that the compiler depends on.
\item The runtime system, a large library of C code that handles running the compiled Haskell code, including garbage collection, threads, and exceptions.~\cite{ghc}
\end{itemize}
The compiler itself is also divided into three parts:
\begin{itemize}
\item The compilation manager, which manages the compilation of multiple Haskell files, determines the order of compilation, and checks whether any modules do not require recompilation.
\item The \gls{hsc}, which compiles a single Haskell file to machine code, depending on the selected backend.
\item The pipeline, which handles Haskell code that interfaces with or requires external programs, for example a source file that requires preprocessing with a C preprocessor.~\cite{ghc}
\end{itemize}
The compiler is also available as a library, the \gls{ghc} \gls{api}, which provides access to internal parts of \gls{ghc} and allows programs to work with Haskell code.

\subsection{Compilation process}
\label{sec:compilation}

\gls{ghc}'s compilation process is a linear process divided into a frontend and a backend, with several optimization passes in the middle. \autoref{fig:ghc-compilation} provides an overview of this process~\cite{ghc}.

\begin{figure}[tbp]
\center
\includegraphics[scale=0.70]{./img/ghc-compilation}
\caption{GHC's compilation process\label{fig:ghc-compilation}}
\end{figure}

The frontend starts with \textit{parsing} Haskell source files, which produces an \gls{ast} in which identifiers are simple strings. The second step is called \textit{renaming}, where identifiers are turned into fully qualified names. This step also spots duplicate declarations, rearranges infix expressions, collects the equations of functions together, and more. The third step of the frontend is \textit{type checking}. Any program that passes the type checker is type-safe and guaranteed to be free of type errors at runtime. The last step of the frontend is \textit{desugaring}, where all ``syntactic sugar'' is removed and the full Haskell syntax is converted into a much smaller intermediate language called Core. Core is further described in \mypref{sec:core}. The \textit{optimization} stage, which joins the frontend and the backend, consists of several passes in which code represented as Core is transformed into more optimized code (which is still Core). These optimizations, and other optimizations performed by \gls{ghc}, are described in \autoref{sec:ghc-optimizations}.
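To give a concrete sense of what the \textit{desugaring} step does, the following hand-written sketch (not \gls{ghc}'s actual output) rewrites a list comprehension into plain function applications, which is much closer to the small language the desugarer targets:

```haskell
-- Sugared: a list comprehension with two generators and a guard.
pairs :: [(Int, Int)]
pairs = [ (x, y) | x <- [1, 2], y <- [1, 2], x /= y ]

-- Desugared by hand: nothing but lambdas, applications, and an if,
-- the kind of vocabulary that survives the translation to Core.
pairsDesugared :: [(Int, Int)]
pairsDesugared =
  concatMap
    (\x -> concatMap
             (\y -> if x /= y then [(x, y)] else [])
             [1, 2])
    [1, 2]
```

Both definitions produce the same list; the compiler only has to know how to optimize and generate code for the second, sugar-free form.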
After the \textit{optimization} stage the code can either be handed to the backend to generate low-level code, or it can be turned into bytecode. The bytecode is used by the interactive Haskell interpreter, \gls{ghci}. Before code generation, Core is transformed into another intermediate representation called \gls{stg}. \gls{stg} is an A-normalized lambda calculus that defines \gls{ghc}'s execution model~\cite{jones89}. The next step is the actual \textit{code generation}, which converts \gls{stg} into yet another intermediate representation, \gls{cmm}. \gls{cmm} is a variant of the C-{}- language, and is almost a subset of C that supports proper tail calls~\cite{terei, ghc}. All remaining non-strict and functional aspects of Haskell are removed during code generation, which allows \gls{cmm} to be a simple intermediate language. \gls{cmm} is finally the input for one of three object-code-generating backends:
\begin{itemize}
\item The C code generator, which pretty-prints \gls{cmm} as C code that is then compiled with \gls{gcc}. It is very portable, as it can be used on most architectures that \gls{gcc} supports, but the produced code is not as fast as that of the other two backends and the compilation process is significantly slower.
\item The \gls{ncg}, which only supports a few architectures, but produces faster code than the C backend.
\item The LLVM code generator, which produces LLVM IR that is compiled with LLVM\@. This backend is described in more detail in \mypref{sec:llvm}.
\end{itemize}
An overview of these code generator backends can be seen in \autoref{fig:ghc-backends}~\cite{ghc, terei}.

\begin{figure}[tbp]
\center
\includegraphics[scale=0.50]{./img/ghc-backends}
\caption{Code generator backends included with GHC\label{fig:ghc-backends}}
\end{figure}

\subsection{Optimizations}
\label{sec:ghc-optimizations}

\gls{ghc} starts the optimization process after syntactic sugar has been removed and Haskell has been transformed to Core.
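As a small illustration of the kind of code these Core-to-Core passes target, consider summing a list with an accumulator. The strict variant below makes explicit, via a bang pattern, the strictness that the analyzer can often infer on its own; this is an illustrative sketch, not compiler output.

```haskell
{-# LANGUAGE BangPatterns #-}

-- Lazy accumulator: without optimization, acc builds up a chain of
-- suspended additions (thunks) that is only forced at the very end.
sumLazy :: [Int] -> Int
sumLazy = go 0
  where
    go acc []       = acc
    go acc (x : xs) = go (acc + x) xs

-- Strict accumulator: the bang pattern forces acc at every step.
-- When strictness analysis proves this is safe, GHC can go further
-- and unbox acc into a raw machine integer, avoiding allocation.
sumStrict :: [Int] -> Int
sumStrict = go 0
  where
    go !acc []       = acc
    go !acc (x : xs) = go (acc + x) xs
```

Both functions compute the same result; the point is that strictness information is what licenses optimizations such as unboxing, which would change the meaning of a genuinely lazy argument.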
As can be seen in \autoref{fig:ghc-optimizations}, there are several different optimization steps, with the simplifier run in between them. The simplifier applies many small, local optimizations. The actual steps taken depend on the specified optimization level, where one can trade compilation speed for faster generated code~\cite{ghc}. During these steps \gls{ghc} performs traditional optimizations such as common sub-expression elimination, unboxing, inlining, and more.

\begin{figure}[tbp]
\center
\includegraphics[scale=0.50]{./img/ghc-optimizations}
\caption{Optimization steps performed by GHC\label{fig:ghc-optimizations}}
\end{figure}

The strictness analyzer finds variables and arguments that can be treated strictly, which enables optimizations such as unboxing that would not be allowed for lazy arguments~\cite{jones93}. Let-floating moves let bindings closer to where they are used, which avoids unnecessary allocations when a binding sits on a branch that is never executed~\cite{jones96}. Constructor specialization specializes recursive functions according to the shapes of the arguments they are called with, \ie their call patterns~\cite{jones07}.

\gls{ghc} also performs some optimizations at later stages of the compilation process. For example, \textit{code generation} includes the \gls{tntc} optimization, which places the meta-data of a closure directly before the code for that closure. \gls{tntc} allows accessing both closure meta-data and code through a single pointer~\cite{terei, peixotto}.

\subsection{Extensibility}
\label{sec:ghc-extensibility}

\gls{ghc} supports extensibility in several ways, the most significant probably being the \gls{ghc} \gls{api}. \gls{ghc} has essentially been built as a library, and the \gls{ghc} executable is just a small Main module linked with this library. \gls{ghc}'s functionality is exposed through an \gls{api}, which provides access to individual steps of the compilation process and to data structures and intermediate representations.
Another extension mechanism in \gls{ghc} is External Core, an option that makes \gls{ghc} serialize Core into an external, human-readable representation. This option is further described in \mypref{sec:extcore}.

\section{Core intermediate language}
%-----------------------------------
\label{sec:core}

While Haskell is a very large, implicitly-typed language, it can be fully translated into Core, an explicitly-typed and statically-typed intermediate language. The theory behind Core has changed over the years; the name Core refers to the implementation of the intermediate language in \gls{ghc}. Currently Core is an implementation of \sysfp (which has recently been formally defined in a technical paper by \citet{eisenberg}).

Core was initially based on the lambda calculus. To be able to decorate Core with types, it was upgraded to a polymorphic lambda calculus, \sysfo~\cite{hudak}. \sysfo is based on \sysf, which was originally developed as a foundational calculus for typed computation. To support type equality constraints and safe coercions, Core was further extended to \sysfc~\cite{sulzmann, tolmach, vytiniotis}. \sysfc also provides simple support for \texttt{kinds} (the type of a type). \citet{weirich} introduced \sysft, which simplified \sysfc and decorated \texttt{kinds} with roles to mark type contexts. \sysfp is Core's most recent upgrade, with more complex \texttt{kinds} that provide better support for \texttt{type families} and \texttt{\gls{gadt}}~\cite{vytiniotis, yorgey}.
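To give a flavour of Core's explicit typing, the following shows a polymorphic Haskell function together with, in a comment, the kind of \sysf-style term it roughly corresponds to. The Core shown is a hand-written approximation; \gls{ghc}'s actual Core output differs in detail and varies between versions.

```haskell
-- Haskell source: type abstraction and application are implicit,
-- and the type signature could even be omitted thanks to inference.
constFst :: a -> b -> a
constFst x _ = x

-- Rough Core equivalent (comment only; not valid Haskell source):
--
--   constFst = \ (@ a) (@ b) (x :: a) (y :: b) -> x
--
-- Type variables are bound explicitly with (@ a) and every binder
-- carries its type, so a Core term can be type-checked on its own
-- without re-running type inference.
```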