Source

thesis / ghc.tex

Full commit
%------------------------
\chapter{Haskell and GHC}
%------------------------
This chapter describes: The Haskell programming language
(\autoref{sec:haskell}); \gls{ghc}, the main implementation of Haskell
(\autoref{sec:ghc}); and Core, a \gls{ghc} intermediate language
(\autoref{sec:core}).


\section{Haskell language}
%-------------------------
\label{sec:haskell}
Haskell is a purely functional, lazy programming language. Purely functional
means functions cannot have side effects or mutate data. Lazy means
evaluation is delayed until required, \eg function arguments are passed
unevaluated and then evaluated on demand. Delaying evaluation comes with
several costs, \eg being less efficient and making it hard to predict space
behavior of programs. Purity is pretty much required for lazy languages, but
I/O is very clumsy in pure languages, so monadic I/O was invented for Haskell
to solve this problem. A monad is a powerful abstraction that consist of a type
constructor and a pair of functions: return and bind. Monads provide syntax to
express structured procedures that are not otherwise supported by functional
languages. Haskell also introduced type classes to support overloading of
built-in numeric operators. Type classes define behavior of types (which behave
similar to Java's interfaces) to provide support for ad-hoc
polymorphism~\cite{hudak, ghc}.

Other notable Haskell features include lazy evaluation, pattern matching, list
comprehensions, and type inference. Type inference means type annotations are
rarely needed despite Haskell being statically typed. Haskell's list
comprehensions have been adopted by both Python and
JavaScript~\cite{hudak, haskell98, ghc}.

Type classes and Monads are perhaps Haskell's most distinctive design features,
which have influenced many other programming languages. For example: Scala,
Coq, and Rust have adopted both monads and type classes, while C\#'s \gls{linq}
was directly inspired by Haskell's monad comprehensions~\cite{hudak}.

Haskell was created to consolidate more than a dozen similar functional
languages that was started in late 1970s and early 1980s. A meeting in January
1988 defined the following six goals for Haskell~\cite[p.~4]{hudak}:
\begin{enumerate}
    \item It should be suitable for teaching, research, and applications,
        including building large systems.
    \item It should be completely described via the publication of a formal
        syntax and semantics.
    \item It should be freely available. Anyone should be permitted to
        implement the language and distribute it to whomever they please.
    \item It should be usable as a basis for further language research.
    \item It should be based on ideas that enjoy a wide consensus.
    \item It should reduce unnecessary diversity in functional programming
        languages.
\end{enumerate}
The second goal was not realized as Haskell's syntax and semantics have never
been formally described. The plan for the last goal was to base Haskell on an
an existing functional language called OL\@. This plan was abandoned early.
Haskell has successfully achieved the majority of the remaining goals, although
some features such as type classes were added without regard for goal
five~\cite{hudak}.

As Haskell is designed by committee, it is a rather big language, and there are
usually more than one way to do something in Haskell. As the language grew it
also quickly evolved, which was problematic for teaching and application that
require stability. The committee therefor defined a stable version of the
language, ``Haskell 98'', that implementations committed to support
indefinitely. In 2005 design of Haskell´ (pronounced Haskell Prime) was
started, to succeed Haskell 98 and to cover heavily used
extensions~\cite{hudak}. Haskell 2010 is the latest stable version of the
Haskell programming language~\cite{haskell2010}.


\section{The Glasgow Haskell Compiler}
%-------------------------------------
\label{sec:ghc}
\glsreset{ghc}
\Gls{ghc} is the most fully featured Haskell compiler today, and has been the
main Haskell implementation since the release of ``Haskell Report 1.0'' in
1990~\cite{hudak}. \citeauthor{ghc} states that today ``\gls{ghc} releases are
downloaded by hundreds of thousands of people, the online repository of Haskell
libraries has over 3,000 packages, \gls{ghc} is used to teach Haskell in many
undergraduate courses''~\cite[p.~1]{ghc}.

\gls{ghc} can be divided into three distinct parts:
\begin{itemize}
	\item The compiler, a Haskell program that converts Haskell source code to
        machine code.
	\item The boot libraries that the compiler depend on.
	\item The runtime system. Large library of C code that handles running the
		compiled Haskell code, such as garbage collection, threads, and
		exceptions.~\cite{ghc}
\end{itemize}
The compiler itself is also divided into three parts:
\begin{itemize}
    \item The compilation manager, which manages compilation of multiple
        Haskell files, the order of compilation and check if any modules
        does not require recompilation.
    \item The \gls{hsc}, which compiles a single Haskell file to machine code,
        depending on the selected backend.
    \item The pipeline, which handles Haskell code that interface with or
        require external programs. For example if a source file require
        preprocessing with a C preprocessor.~\cite{ghc}
\end{itemize}
The compiler is also a library, called \gls{ghc} \gls{api}, which provides
access to internal parts of \gls{ghc} and allows working with Haskell code.

\subsection{Compilation process}
\label{sec:compilation}
\gls{ghc}'s compilation process is a linear process divided into the frontend
and backend with several optimization passes in the middle.
\autoref{fig:ghc-compilation} provides an overview of this process~\cite{ghc}.

\begin{figure}[tbp]
	\center
	\includegraphics[scale=0.70]{./img/ghc-compilation}
    \caption{GHC's compilation process\label{fig:ghc-compilation}}
\end{figure}

The frontend starts with \textit{parsing} Haskell source files with a fully
functional parser that produces an \gls{ast} where identifiers are simple
strings. The second step is called \textit{renaming}, where identifiers are
turned into fully qualified names. This step also spots duplicate declarations,
rearrange infix expressions, collect the equations of functions together, and
more. The third step of frontend process is \textit{type checking}. Any
program that passes the type checker is type-safe and guaranteed to not crash
at runtime. The last step of the frontend is \textit{desugaring}, where all
``syntactic sugar'' is removed and the full Haskell syntax is converted into a
much smaller intermediate language called Core. Core is further described
in \mypref{sec:core}.

The \textit{optimization} stage, which joins the frontend and the backend
process, consist of several passes where code represented as Core is
transformed into more optimized code (which is still Core). These
optimizations, and other optimizations performed by \gls{ghc}, are described in
\autoref{sec:ghc-optimizations}.

After the \textit{optimization} stage the code can either be turned over to the
backend to generate low-level code, or it can be turned into bytecode. The
bytecode is used by the interactive Haskell interpreter, \gls{ghci}. Before
code generation, Core is transformed into another intermediate representation
called \gls{stg}. \gls{stg} is a A-normalized lambda calculus that defines
\gls{ghc}'s execution model~\cite{jones89}.

The next step is the actual \textit{code generation}, which convert \gls{stg}
into another intermediate representation, \gls{cmm}. \gls{cmm} is a variant of
the C-{}- language, and is almost a subset of C that support proper tail
calls~\cite{terei, ghc}. All remaining non-strict and functional aspects of
Haskell are removed during code generation, which allow \gls{cmm} to be a
simple intermediate language. \gls{cmm} is finally the input for one of three
object code generating backends:
\begin{itemize}
    \item The C code generator, which pretty-print \gls{cmm} to C code. The C
        code is then compiled with \gls{gcc}. Is very portable, as it can be
        used on most architectures that support \gls{gcc}, but the produced
        code is not as fast as the other two backends and the compilation
        process is significantly slower.
    \item The \gls{ncg} that only support a few architectures, but
        produces faster code than the C backend.
    \item The LLVM code generator, which produce LLVM IR that is compiled with
        LLVM\@. This backend is described in more detail in \mypref{sec:llvm}.
\end{itemize}
An overview over the process of these code generator backends can be seen in
\autoref{fig:ghc-backends}~\cite{ghc, terei}.

\begin{figure}[tbp]
	\center
	\includegraphics[scale=0.50]{./img/ghc-backends}
    \caption{Code generator backends included with GHC\label{fig:ghc-backends}}
\end{figure}

\subsection{Optimizations}
\label{sec:ghc-optimizations}
\gls{ghc} starts the optimization process after syntactic sugar is removed and
Haskell is transformed to Core. As can be seen in
\autoref{fig:ghc-optimizations}, there are several different optimization
steps, and the simplifier is run in between. The simplifier applies a lot of
small, local optimizations such as let-to-case transformation and case
elimination. The actual steps taken depend on optimization level specified,
where one can trade compilation speed for generated code with is
faster~\cite{ghc, jones94}. During these steps \gls{ghc} performs traditional
optimizations such as common sub-expression elimination, unboxing, inlining,
and more~\cite{jones91}.
\begin{figure}[tbp]
	\center
	\includegraphics[scale=0.50]{./img/ghc-optimizations}
    \caption{Optimizations steps performed by GHC\label{fig:ghc-optimizations}}
\end{figure}

The strictness analyzer finds variables and arguments that can be treated
strictly, which enable optimizations such as unboxing that would not be allowed
for lazy arguments~\cite{jones93}. Let-floating moves let bindings closer to
where they are used, which avoids unnecessary allocations if they are on a
branch that is never executed~\cite{jones96}. Constructor specialization
enables specialization based on call-patterns, which specialize recursive
functions according to their argument shapes~\cite{jones07}.

\gls{ghc} also perform some optimizations at later stages of the compilation
process. For example \textit{code generation} includes the \gls{tntc}
optimization, which places meta-data of closures right before the code for
the closure. \gls{tntc} allow accessing both closure meta-data and code from
a single pointer~\cite{terei, peixotto}.

\subsection{Extensibility}
\label{sec:ghc-extensibility}
\gls{ghc} support extensibility in several ways, the most significant is
probably with the \gls{ghc} \gls{api}. Basically, \gls{ghc} has been built as a
library, and the \gls{ghc} executable is just a small Main module linked with
this library. \gls{ghc}'s functionality is exposed through an \gls{api}, which
provide access to individual steps of the compilation process and access to
data structures and intermediate representations.

Another extensibility of \gls{ghc} is external core, which is a runtime option
for \gls{ghc} to serialize Core into an external human-readable representation.
This option is further described in \mypref{sec:extcore}.


\section{Core intermediate language}
%-----------------------------------
\label{sec:core}
While Haskell is a very large implicitly-typed language, it can be fully
translated into Core, an explicitly- and statically-typed intermediate
language.

The theory behind Core has changed over the years, and the Core name
is used for the implementation of the intermediate language in \gls{ghc}.
Currently Core is the implementation of \sysfp (which have been recently
defined formally in a technical paper by \citet{eisenberg}).

Core was initially based on lambda calculus. To be able to decorate Core with
types, it was upgraded to a polymorphic lambda calculus \sysfo~\cite{hudak}.
\sysfo is based on \sysf, which was originally developed as a foundational
calculus for typed computation. To support type equality constraints and safe
coercions, Core was further extended to
\sysfc~\cite{sulzmann, tolmach, vytiniotis}. \sysfc also provide simple support
for \texttt{kinds} (the type of a type). \citet{weirich} introduced \sysft,
which simplified \sysfc and decorated \texttt{kinds} with roles to mark type
contexts. \sysfp is Core's most recent upgrade, which has more complex
\texttt{kinds} that provide better support for \texttt{type families} and
\texttt{\gls{gadt}}~\cite{vytiniotis, yorgey}.