\usepackage[parfill]{parskip}

\newcommand{\specialcell}[2][c]{%
  \begin{tabular}[#1]{@{}l@{}}#2\end{tabular}}

\title{Short Answer Scoring by Stacking}

% model ensembling, if any

\section{Features Selection / Extraction}

% 2. Features Selection / Extraction

runner that broke records.}''. The first rater was obviously impressed
by the answer, awarding it the highest grade possible. The second
rater was much less so, which he demonstrated by giving the answer a 0.

In the datasets provided, 71 samples were found to be outliers and were

\subsection{Dataset Construction}

All answers were transformed to lowercase. Only letters and numbers were

retained, all other characters were removed. The text was then cut into

character n-grams and was represented with the bag-of-words model. Six
training sets were constructed:

\begin{description}
\item[4c] Use character four-grams.
\item[6c] Use character six-grams.
\item[4cc] Correct spelling mistakes and apply the Porter stemming
algorithm~\cite{porter1980algorithm} to each word. Proceed by using
character four-grams as in \textbf{4c}.
\item[6cc] Same as \textbf{4cc} but use character six-grams instead.
\item[4cp500] Same as \textbf{4cp200} but keep the first 500 components.
\end{description}
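The basic preprocessing used by the character n-gram datasets can be sketched in Python as follows (an illustrative reimplementation, not the competition code; the function name is mine, and spaces are kept so that n-grams can span word boundaries, matching the worked example below):

```python
import re

def char_ngram_bag(text, n=4):
    """Lowercase the answer, keep only letters, digits and spaces,
    and return the set of character n-grams (counts are discarded)."""
    text = re.sub(r'[^a-z0-9 ]', '', text.lower())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

bag = char_ngram_bag("Hello, hello!")  # the 4c representation
```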


Consider the text ``\textit{Hello, hello!}''. After undergoing basic
preprocessing, it was represented as the bag \textit{\{'hell', 'ello',
'llo ', 'lo h', 'o he', ' hel'\}}, under the \textbf{4c} model. Note
that I did not count how many times an n-gram appeared. In the example
above, the four-grams \verb+'hell'+ and \verb+'ello'+ appear twice,
but the representation I chose discarded this information. A separate
bag-of-words model was built for each essay prompt. Using \textit{tf-idf}
weighting or whole words instead of character n-grams did not improve my score. The number

\subsection{Base Learners}

\begin{table}[h]
\centering
\begin{tabular}{ll}\hline
\textbf{Model} & \textbf{Parameters} \\\hline\hline
Ridge regression & $\lambda = 2^4, 2^{4.2}, 2^{4.4}, \ldots, 2^{9.8}$ \\\hline
SVR (linear kernel) & $C = 2^{-11}, 2^{-10.8}, 2^{-10.6}, \ldots, 2^{-5.8}$ \\\hline
SVR (RBF kernel) & \specialcell{$C = 2^{-1}, 2^0, \ldots, 2^{5}$\\
$\gamma = 2^{-13}, 2^{-12}, \ldots, 2^{-7}$}\\\hline
GBM & \specialcell{n\_estimators = 500, 1000, 2000\\
max\_features = 0.5, 1} \\\hline
Random forests & -- \\\hline
K-nearest neighbor & $k = 3, 4, \ldots, 20$ \\\hline
\end{tabular}
\caption{The base learners and the parameter values searched over.}
\end{table}

\subsubsection{Ridge Regression}

We are given a list of $m$ training examples $\{(x^{(i)}, y^{(i)}); i =

1, \ldots, m\}$, where $x^{(i)} \in \mathbb{R}^n$ is the $i$-th answer
and $y^{(i)} \in \mathbb{R}$ is the grade of the $i$-th answer.

In ridge regression (RR) we want to compute the parameter vector
$\theta$, which minimizes the cost function $J(\theta)$, defined as
\[ J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( \theta^T x^{(i)} - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n} \theta_j^2. \]
We minimize $J(\theta)$ by computing the gradient $\nabla_\theta J(\theta)$ and using the

limited-memory BFGS algorithm~\cite{byrd1995limited} to find the minimum

of $J(\theta)$. The regularization parameter $\lambda$ is used to
prevent overfitting. Prior to training, a constant feature with value 1
is appended to each sample $x^{(i)}$. I implemented ridge regression
in Python, making heavy use of the NumPy and SciPy libraries.

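The fitting procedure just described can be sketched as follows (a minimal reimplementation under the definitions above, not the competition code; for simplicity this sketch regularizes the bias term as well):

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

def ridge_fit(X, y, lam):
    """Fit ridge regression by minimizing J(theta) with L-BFGS.
    A constant feature with value 1 is appended to each sample."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])

    def cost_and_grad(theta):
        r = Xb.dot(theta) - y                       # residuals
        J = 0.5 * r.dot(r) + 0.5 * lam * theta.dot(theta)
        grad = Xb.T.dot(r) + lam * theta            # gradient of J
        return J, grad

    theta, _, _ = fmin_l_bfgs_b(cost_and_grad, np.zeros(Xb.shape[1]))
    return theta
```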

\subsubsection{Support Vector Regression}
Support vector regression (SVR) was implemented using the LIBSVM and
LIBLINEAR~\cite{chang2011libsvm, fan2008liblinear} software libraries.
I used both the linear and the radial basis function (RBF) kernel. $C >
0$ is the penalty parameter of the error term and $\gamma$ is the RBF

\subsubsection{Gradient Boosting Machine}


Gradient boosting machines (GBM) were introduced by
Friedman~\cite{friedman2001greedy,friedman2002stochastic}. I used the
implementation from Scikit-learn~\cite{pedregosa2011scikit}. Character
n-grams appearing in fewer than 30 answers were removed from the
dataset. The parameters were set as follows:
\begin{description}
\item[n\_estimators] The number of boosting stages to perform.
\item[max\_depth] The maximum depth of the individual regression estimators.
\item[max\_features] Determines the number of features to consider when looking
for the best split. This parameter was set to either 1 (consider all features) or
0.5 (consider only half of the features).
\item[subsample] The fraction of samples used for fitting the
individual base learners. This parameter was set to 0.5.
\item[learn\_rate] The amount by which the contribution of each tree is shrunk.
\end{description}
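As an illustration, one configuration from this grid could be instantiated in Scikit-learn as follows (the shrinkage value shown is an example, not one reported above; recent Scikit-learn versions spell the shrinkage parameter \verb+learning_rate+ rather than \verb+learn_rate+):

```python
from sklearn.ensemble import GradientBoostingRegressor

# One example configuration from the grid described above.
gbm = GradientBoostingRegressor(
    n_estimators=500,    # number of boosting stages
    max_depth=3,         # depth of each individual regression tree
    max_features=0.5,    # consider half of the features at each split
    subsample=0.5,       # stochastic gradient boosting on half the samples
    learning_rate=0.1,   # shrinkage of each tree's contribution
)
```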

\subsubsection{Random Forests}
Employing the same feature selection procedure as before, i.e., keeping
only the features which appear in at least 30 answers, I trained a random
forest (RF) classifier~\cite{breiman2001random} with 50 trees using the Scikit-learn

\subsubsection{K-Nearest Neighbor}

So far, the $j$-th feature of the $i$-th sample was defined as
\[x_j^{(i)} = 1\{\text{answer $x^{(i)}$ contains the $j$-th n-gram}\},\]
where $1\{\cdot\}$ denotes the indicator function, i.e., $1\{\mbox{True}\} = 1$ and
$1\{\mbox{False}\} = 0$. For the k-nearest neighbor (KNN) algorithm, we change the
definition to
\[x_j^{(i)} = 1\{\text{answer $x^{(i)}$ contains the $j$-th n-gram}\} \times \theta_j, \]
where $\theta_j$ is the $j$-th component of the ridge regression
parameter vector. In other words, we weigh each feature by the
corresponding ridge regression coefficient. I then ran the k-nearest
neighbor algorithm on the resulting training set using the cosine
similarity measure.
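This weighting scheme can be sketched as follows (an illustrative implementation; the function name and parameter defaults are mine):

```python
import numpy as np

def knn_predict(X, y, theta, x_new, k=5):
    """Predict a grade with KNN on ridge-weighted features.
    X holds binary n-gram indicators, theta the ridge coefficients."""
    Xw = X * theta                      # weigh each feature
    xw = x_new * theta
    denom = np.linalg.norm(Xw, axis=1) * np.linalg.norm(xw)
    sims = Xw.dot(xw) / np.where(denom == 0, 1.0, denom)  # cosine similarity
    nearest = np.argsort(-sims)[:k]     # the k most similar answers
    return y[nearest].mean()            # average of their grades
```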

Stacking~\cite{wolpert1992stacked} is a technique for combining the
predictions of several learners. It was used successfully in past
competitions~\cite{toscher2010collaborative,yu2011feature,zbontar2012team},
most notably in the Netflix Prize~\cite{koren2009bellkor}.

I conducted 5-fold cross-validation on all base learners to obtain
out-of-sample predictions for all training examples. These predictions are
subsequently used as training data for ridge regression, which creates
an affine combination of the predictions of the base learners. Informally,
we can say that predictions obtained in this way are usually better
than predictions produced by any single base learner.
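The procedure can be sketched as follows (a simplified illustration using Scikit-learn estimators in place of my actual base learners):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor

def stack(base_models, X, y, n_folds=5):
    """Collect out-of-fold predictions from every base learner and
    fit a ridge meta-learner on them."""
    meta_X = np.zeros((len(y), len(base_models)))
    for j, model in enumerate(base_models):
        for train, test in KFold(n_splits=n_folds).split(X):
            model.fit(X[train], y[train])
            meta_X[test, j] = model.predict(X[test])  # out-of-sample
    meta = Ridge(alpha=1.0).fit(meta_X, y)            # combine predictions
    return meta, meta_X
```

At test time, each base learner (retrained on the full training set) produces a prediction, and the meta-learner combines them; the intercept of the ridge meta-learner is what makes the combination affine rather than merely linear.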

\begin{table}[h]
\centering
\begin{tabular}{ll}\hline
\textbf{Dataset} & \textbf{Models} \\\hline\hline
4c & RR, RF, KNN, GBM(3, 1), GBM(4, 0.5), GBM(5, 0.5) \\\hline
4cc & RR, RF, SVM\_RBF, GBM(3, 1), GBM(5, 0.5) \\\hline
6c & RR, SVM\_RBF \\\hline
6cc & RR, SVM\_RBF, KNN \\\hline
4cp200 & KNN, SVM\_RBF, SVM\_LINEAR \\\hline
4cp500 & SVM\_RBF, KNN, GBM(3, 1) \\\hline
\end{tabular}
\caption{All dataset/model combinations used in my final model.
The GBM learner is parametrized by the max\_depth and subsample parameters.}
\end{table}


\section{How To Generate the Solution}

% 5. How To Generate the Solution (aka README file)

% from the code provided. Include that description here and in a separate

% README file to accompany the code.

\subsection{Obtaining the Source Code}
The source code can be downloaded from an online repository at
\url{https://bitbucket.org/jzbontar/asap}. In order to replicate my
results exactly, you might want to fetch revision number a22e080625de,
which is the revision I used for my final submission and can be found
at \url{https://bitbucket.org/jzbontar/asap/src/a22e080625de}.

\subsection{Software Used}