Commits

Jure Žbontar committed 19fe188

Work on report.

Files changed (2)

report/jzbontar.tex

 \usepackage{hyperref}
 \usepackage[parfill]{parskip}
 \usepackage{amsfonts}
+\usepackage{amsmath}
+
+\newcommand{\specialcell}[2][c]{%
+  \begin{tabular}[#1]{@{}l@{}}#2\end{tabular}}
 
 \begin{document}
 \title{Short Answer Scoring by Stacking}
 %     model ensembling, if any
 %     external data sources
 
+\cite{zbontar2012team}
+
 \section{Features Selection / Extraction}
 % 2. Features Selection / Extraction
 % 
 runner that broke records.}``. The first rater was obviously impressed
 by the answer, awarding it the highest grade possible. The second
 rater much less so, which he demonstrated by rating the answer with 0.
-
 In the datasets provided, 71 samples were found to be outliers and were
 consequently removed.
 
 \subsection{Dataset Construction}
 All answers were transformed to lowercase. Only letters, digits, and spaces
 were retained; all other characters were removed. The text was then cut into
-character n-grams and was represented with the bag-of-words model. I
-constructed six training datasets:
+character n-grams and was represented with the bag-of-words model. Six
+training sets were constructed:
 
 \begin{description}
 \item[4c] Use character four-grams.
 \item[6c] Use character six-grams.
-\item[4cc] Correcting spelling mistakes and apply the Porter stemming
+\item[4cc] Correct spelling mistakes and apply the Porter stemming
 algorithm~\cite{porter1980algorithm} to each word. Proceed by using
 character four-grams as in \textbf{4c}.
 \item[6cc] Same as \textbf{4cc} but use character six-grams instead.
 \item[4cp500] Same as \textbf{4cp200} but keep the first 500 components.
 \end{description}
 
-For example, the text ``\textit{Hello, hello!}'', after undergoing basic
-preprocessing, was represented as the bag 
-\textit{\{'hell', 'ello', 'llo ', 'lo h', 'o he', ' hel'\}}, under
-he \textbf{4c} model. Note that I did not count how many times a n-gram
-appeared. In the example above, the four-gram \verb+'hell'+ appears twice,
+Consider the text ``\textit{Hello, hello!}''. After undergoing basic
+preprocessing, it was represented as the bag \textit{\{'hell', 'ello',
+'llo ', 'lo h', 'o he', ' hel'\}}, under the \textbf{4c} model. Note
+that I did not count how many times an n-gram appeared. In the example
+above, the four-grams \verb+'hell'+ and \verb+'ello'+ appear twice,
 but the representation I chose discarded this information. A separate
 bag-of-words model was built for each essay prompt. Using \textit{tf-idf}
 or words instead of character n-grams did not improve my score. The number
 % obtained and used.
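+
+To make the preprocessing concrete, the following minimal sketch shows one
+way to produce the \textbf{4c} bag of character four-grams for a single
+answer. It is an illustration rather than the actual competition code; in
+particular, the regular expression that keeps letters, digits and spaces is
+an assumption.
+
+\begin{verbatim}
+import re
+
+def char_ngrams(text, n=4):
+    # return the set of character n-grams of a preprocessed answer
+    text = text.lower()
+    # keep letters, digits and spaces; drop everything else
+    text = re.sub(r'[^a-z0-9 ]', '', text)
+    return {text[i:i + n] for i in range(len(text) - n + 1)}
+
+# char_ngrams('Hello, hello!') yields the six four-grams
+# 'hell', 'ello', 'llo ', 'lo h', 'o he' and ' hel'.
+\end{verbatim}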
 
 \subsection{Base Learners}
+
+\begin{table}[htb]
+\centering
+\begin{tabular}{ll}\hline
+\textbf{Model} & \textbf{Parameters} \\\hline\hline
+Ridge regression & $\lambda = 2^4, 2^{4.2}, 2^{4.4}, \ldots, 2^{9.8}$ \\\hline
+
+SVR linear kernel & 
+$C = 2^{-11}, 2^{-10.8}, 2^{-10.6}, \ldots, 2^{-5.8}$ \\\hline
+
+SVR RBF kernel &
+\specialcell{$C = 2^{-1}, 2^0, \ldots, 2^{5}$\\
+$\gamma = 2^{-13}, 2^{-12}, \ldots, 2^{-7}$}\\\hline
+
+GBM & \specialcell{n\_estimators = 500, 1000, 2000\\
+max\_depth = 3, 4, 5\\
+max\_features = 0.5, 1} \\\hline
+
+Random forests & -- \\\hline
+
+K-Nearest Neighbor & $k = 3, 4, \ldots, 20$ \\\hline
+
+\end{tabular}
+\caption{Base learners and the parameter values considered for each.}
+\label{tab:models}
+\end{table}
+
 \subsubsection{Ridge Regression}
 
 We are given a list of $m$ training examples $\{(x^{(i)}, y^{(i)}); i =
-1, \ldots, m\}$, where $x^{(i)} \in \mathbb{R}^n$ is the $i$-th feature
-and $y^{(i)} \in \mathbb{R}$ is the $i$-th target or grade in our case.
-In ridge regression we want to compute the parameter vector $\theta$,
-which minimizes the cost function $J(\theta)$, defined as
+1, \ldots, m\}$, where $x^{(i)} \in \mathbb{R}^n$ is the $i$-th answer
+and $y^{(i)} \in \mathbb{R}$ is the grade of the $i$-th answer.
+In ridge regression (RR) we want to compute the parameter vector
+$\theta$, which minimizes the cost function $J(\theta)$, defined as
 
 $$
 J(\theta) = \sum_{i=1}^{m} \left( \theta^T x^{(i)} - y^{(i)} \right)^2 + \lambda \|\theta\|^2.
 $$
 
 The parameter vector $\theta$ is obtained by computing the gradient
 $\nabla_\theta J(\theta)$ and using the
 limited-memory BFGS algorithm~\cite{byrd1995limited} to find the minimum
-of $J(\theta)$. The regularization parameter $\lambda$ is used
-to prevent overfitting.
+of $J(\theta)$. The regularization parameter $\lambda$ is used to
+prevent overfitting. Prior to training, a constant feature with value 1
+is appended to each sample $x^{(i)}$. I implemented ridge regression
+in Python, making heavy use of the numpy and scipy libraries.
 
-I used my own implementation of ridge regression written in Python.
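+
+As an illustration of the approach described above, a minimal ridge
+regression fitted with L-BFGS might look as follows. This is a sketch, not
+the competition code; whether the bias term is regularized, for instance,
+is an assumption made here.
+
+\begin{verbatim}
+import numpy as np
+from scipy.optimize import fmin_l_bfgs_b
+
+def fit_ridge(X, y, lam):
+    # append the constant feature with value 1 to every sample
+    X = np.hstack([X, np.ones((X.shape[0], 1))])
+
+    def cost_grad(theta):
+        r = X.dot(theta) - y
+        J = r.dot(r) + lam * theta.dot(theta)
+        grad = 2 * X.T.dot(r) + 2 * lam * theta
+        return J, grad
+
+    theta, _, _ = fmin_l_bfgs_b(cost_grad, np.zeros(X.shape[1]))
+    return theta
+\end{verbatim}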
+\subsubsection{Support Vector Regression}
 
-\subsubsection{SVM}
+Support vector regression (SVR) was implemented using the LIBSVM and
+LIBLINEAR~\cite{chang2011libsvm, fan2008liblinear} software libraries.
+I used both the linear and the radial basis function (RBF) kernel. $C >
+0$ is the penalty parameter of the error term and $\gamma$ is the RBF
+kernel parameter.
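+
+Purely as an illustration, the snippet below fits the two SVR variants
+through scikit-learn's LIBSVM wrapper; the competition scripts called
+LIBSVM and LIBLINEAR directly, and the toy data and the particular values
+of $C$ and $\gamma$ are placeholders taken from the grids in
+Table~\ref{tab:models}.
+
+\begin{verbatim}
+import numpy as np
+from sklearn.svm import SVR
+
+rng = np.random.RandomState(0)
+X, y = rng.rand(100, 20), rng.rand(100)   # stand-in for the n-gram features
+
+# RBF-kernel SVR
+svr_rbf = SVR(kernel='rbf', C=2.0, gamma=2.0 ** -10).fit(X, y)
+predictions = svr_rbf.predict(X)
+
+# linear-kernel SVR
+svr_linear = SVR(kernel='linear', C=2.0 ** -8).fit(X, y)
+\end{verbatim}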
+
 \subsubsection{Gradient Boosting Machine}
-\subsubsection{Random Forest}
+
+Gradient boosting machines (GBM) were introduced by
+Friedman~\cite{friedman2001greedy,friedman2002stochastic}. I used the
+implementation from Scikit-learn~\cite{pedregosa2011scikit}. Character n-grams
+appearing in fewer than 30 answers were removed from the dataset. The parameters
+of this method are:
+
+\begin{description}
+\item[n\_estimators] The number of boosting stages to perform.
+\item[max\_depth] The maximum depth of the individual regression estimators.
+\item[max\_features] The fraction of features to consider when looking
+for the best split. This parameter was set to either 1 (consider all
+features) or 0.5 (consider half of them).
+\item[subsample] The fraction of samples to be used for fitting the
+individual base learners. This parameter was set to 0.5.
+\item[learn\_rate] The amount by which the contribution of each tree is shrunk.
+
+\end{description}
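+
+As a rough sketch of how one point of this parameter grid could be fitted
+with Scikit-learn (newer versions name the shrinkage parameter
+\verb+learning_rate+ rather than \verb+learn_rate+), consider the
+following; the toy data and the learning rate value are placeholders,
+since the text above does not fix the latter.
+
+\begin{verbatim}
+import numpy as np
+from sklearn.ensemble import GradientBoostingRegressor
+
+rng = np.random.RandomState(0)
+X = (rng.rand(200, 300) < 0.2).astype(float)   # toy binary n-gram matrix
+y = rng.rand(200)
+
+# drop n-grams that occur in fewer than 30 answers
+X = X[:, X.sum(axis=0) >= 30]
+
+gbm = GradientBoostingRegressor(n_estimators=500, max_depth=4,
+                                max_features=0.5, subsample=0.5,
+                                learning_rate=0.1).fit(X, y)
+\end{verbatim}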
+
+\subsubsection{Random Forests}
+
+Employing the same feature selection procedure as before, i.e., keeping
+only the features which appear in at least 30 answers, I trained a random
+forest (RF) classifier~\cite{breiman2001random} with 50 trees using the
+Scikit-learn implementation.
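+
+A correspondingly brief sketch, again on placeholder data, might be the
+following; all settings other than the 50 trees are library defaults.
+
+\begin{verbatim}
+import numpy as np
+from sklearn.ensemble import RandomForestClassifier
+
+rng = np.random.RandomState(0)
+X = (rng.rand(200, 300) < 0.2).astype(float)   # toy binary n-gram matrix
+y = rng.randint(0, 4, 200)                     # toy integer grades
+
+X = X[:, X.sum(axis=0) >= 30]                  # same frequency filter as above
+rf = RandomForestClassifier(n_estimators=50).fit(X, y)
+\end{verbatim}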
+
 \subsubsection{K-Nearest Neighbor}
 
+So far, the $j$-th feature of the $i$-th sample was defined as
+
+\[x_j^{(i)} = 1\{\text{answer $x^{(i)}$ contains the $j$-th n-gram}\},\]
+
+where $1\{\cdot\}$ denotes the indicator function, i.e., $1\{\mbox{True}\} = 1$ and
+$1\{\mbox{False}\} = 0$. For the k-nearest neighbor (KNN) algorithm, we change the
+definition to:
+
+\[x_j^{(i)} = 1\{\text{answer $x^{(i)}$ contains the $j$-th n-gram}\} \times \theta_j, \]
+
+where $\theta_j$ is the $j$-th component of the ridge regression parameter
+vector $\theta$. In other words, we weight each feature by the corresponding
+ridge regression coefficient. I then ran the k-nearest neighbor algorithm on the
+resulting training set using the cosine similarity measure.
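+
+A possible implementation of this weighting and neighbor lookup is sketched
+below. The text does not state how the neighbors' grades were combined into
+a prediction, so the plain average used here is an assumption.
+
+\begin{verbatim}
+import numpy as np
+
+def knn_predict(X_train, y_train, X_test, theta, k=10):
+    # weight each binary n-gram feature by its ridge coefficient
+    Xw_train = X_train * theta
+    Xw_test = X_test * theta
+
+    def normalize(A):
+        return A / (np.linalg.norm(A, axis=1, keepdims=True) + 1e-12)
+
+    # cosine similarity between every test and training answer
+    sim = normalize(Xw_test).dot(normalize(Xw_train).T)
+    nearest = np.argsort(-sim, axis=1)[:, :k]
+    return y_train[nearest].mean(axis=1)
+\end{verbatim}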
+
 \subsection{Stacking}
 
+Stacking~\cite{wolpert1992stacked} is a technique for combining
+predictions of several learners.  It was used successfully in past
+competitions~\cite{toscher2010collaborative,yu2011feature,zbontar2012team},
+most notably in the Netflix Prize~\cite{koren2009bellkor}.  
+
+I conducted 5-fold cross-validation on all base learners to obtain
+out-of-sample predictions for all training examples. These predictions were
+then used as training data for ridge regression, which forms an affine
+combination of the predictions of the base learners. Informally, predictions
+obtained in this way are usually better than those produced by any single
+base learner.
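+
+A compact sketch of this procedure, using scikit-learn helpers rather than
+the competition scripts, is given below; the ridge penalty of the combiner
+and the refitting of the base learners on the full training set are
+assumptions.
+
+\begin{verbatim}
+import numpy as np
+from sklearn.model_selection import cross_val_predict
+from sklearn.linear_model import Ridge
+
+def stack(base_models, X, y, X_test):
+    # out-of-sample predictions for every training example
+    Z = np.column_stack([cross_val_predict(m, X, y, cv=5)
+                         for m in base_models])
+    combiner = Ridge(alpha=1.0).fit(Z, y)
+
+    # refit the base models on all data and combine their test predictions
+    Z_test = np.column_stack([m.fit(X, y).predict(X_test)
+                              for m in base_models])
+    return combiner.predict(Z_test)
+\end{verbatim}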
+
+
+\begin{table}[htb]
+\centering
+\begin{tabular}{ll}\hline
+\textbf{Dataset} & \textbf{Models} \\\hline\hline
+4c & RR, RF, KNN, GBM(3, 1), GBM(4, 0.5), GBM(5, 0.5) \\\hline
+4cc & RR, RF, SVM\_RBF, GBM(3, 1), GBM(5, 0.5) \\\hline
+6c & RR, SVM\_RBF \\\hline
+6cc & RR, SVM\_RBF, KNN \\\hline
+4cp200 & KNN, SVM\_RBF, SVM\_LINEAR \\\hline
+4cp500 & SVM\_RBF, KNN, GBM(3, 1) \\\hline
+
+
+\end{tabular}
+\caption{This table presents all the dataset/model combinations used in my final model.
+The GBM learner is parametrized by the max\_depth and max\_features parameters.}
+\label{tab:stacking}
+\end{table}
+
 \subsection{Calibrating}
 
-\section{Source Code}
-% 4. Code Description
-% 
-% Provide a description of your code here. Code itself should be commented
-% clearly and concisely. For each function provide
-% 
-%     function inputs
-%     function outputs
-%     what function does
-
-The source code of my solution can be obtained from an online repository
-at \url{https://bitbucket.org/jzbontar/asap}. In order to replicate
-my results exactly, you might want to fetch revision number a22e080625de,
-which is the revision I used for my final submission and can be found at
-\url{https://bitbucket.org/jzbontar/asap/src/a22e080625de}.
-
- 
-
 \section{How To Generate the Solution}
 % 5. How To Generate the Solution (aka README file)
 % 
 % from the code provided. Include that description here and in a separate
 % README file to accompany the code.
 
+\subsection{Obtaining the Source Code}
+The source code can be downloaded from an online repository at
+\url{https://bitbucket.org/jzbontar/asap}. In order to replicate my
+results exactly, you might want to fetch revision number a22e080625de,
+which is the revision I used for my final submission and can be found
+at \url{https://bitbucket.org/jzbontar/asap/src/a22e080625de}.
+
 \subsection{Software Used}
 
 \begin{itemize}

report/references.bib

   year={1995},
   publisher={Philadelphia, PA: SIAM, c1993-}
 }
+
+@article{fan2008liblinear,
+  title={LIBLINEAR: A library for large linear classification},
+  author={Fan, R.E. and Chang, K.W. and Hsieh, C.J. and Wang, X.R. and Lin, C.J.},
+  journal={The Journal of Machine Learning Research},
+  volume={9},
+  pages={1871--1874},
+  year={2008},
+  publisher={JMLR.org}
+}
+
+@article{chang2011libsvm,
+  title={LIBSVM: A library for support vector machines},
+  author={Chang, C.C. and Lin, C.J.},
+  journal={ACM Transactions on Intelligent Systems and Technology (TIST)},
+  volume={2},
+  number={3},
+  pages={27},
+  year={2011},
+  publisher={ACM}
+}
+
+@article{pedregosa2011scikit,
+  title={Scikit-learn: Machine learning in Python},
+  author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V. and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P. and Weiss, R. and Dubourg, V. and others},
+  journal={The Journal of Machine Learning Research},
+  volume={12},
+  pages={2825--2830},
+  year={2011},
+  publisher={JMLR.org}
+}
+
+@article{friedman2001greedy,
+  title={Greedy function approximation: a gradient boosting machine},
+  author={Friedman, J.H.},
+  journal={Annals of Statistics},
+  pages={1189--1232},
+  year={2001},
+  publisher={JSTOR}
+}
+
+@article{friedman2002stochastic,
+  title={Stochastic gradient boosting},
+  author={Friedman, J.H.},
+  journal={Computational Statistics \& Data Analysis},
+  volume={38},
+  number={4},
+  pages={367--378},
+  year={2002},
+  publisher={Elsevier}
+}
+
+@article{breiman2001random,
+  title={Random forests},
+  author={Breiman, L.},
+  journal={Machine Learning},
+  volume={45},
+  number={1},
+  pages={5--32},
+  year={2001},
+  publisher={Springer}
+}
+
+@article{wolpert1992stacked,
+  title={Stacked generalization},
+  author={Wolpert, D.H.},
+  journal={Neural Networks},
+  volume={5},
+  number={2},
+  pages={241--259},
+  year={1992},
+  publisher={Elsevier}
+}
+
+@inproceedings{yu2011feature,
+  title={Feature engineering and classifier ensemble for KDD cup 2010},
+  author={Yu, H.F. and Lo, H.Y. and Hsieh, H.P. and Lou, J.K. and McKenzie, T.G. and Chou, J.W. and Chung, P.H. and Ho, C.H. and Chang, C.F. and Wei, Y.H. and others},
+  booktitle={JMLR Workshop and Conference Proceedings},
+  year={2011}
+}
+
+@article{toscher2010collaborative,
+  title={Collaborative filtering applied to educational data mining},
+  author={Toscher, A. and Jahrer, M.},
+  journal={KDD Cup},
+  year={2010}
+}
+
+
+@article{koren2009bellkor,
+  title={The {BellKor} solution to the {Netflix Grand Prize}},
+  author={Koren, Y.},
+  journal={Netflix Prize documentation},
+  year={2009}
+}