Jure Žbontar committed 19fe188

Work on report.


+  \begin{tabular}[#1]{@{}l@{}}#2\end{tabular}}
 \title{Short Answer Scoring by Stacking}
 %     model ensembling, if any
 %     external data sources
 \section{Features Selection / Extraction}
 % 2. Features Selection / Extraction
runner that broke records.}''. The first rater was obviously impressed
by the answer, awarding it the highest grade possible. The second
rater was far less impressed and graded the answer with a 0.
 In the datasets provided, 71 samples were found to be outliers and were
 consequently removed.
 \subsection{Dataset Construction}
 All answers were transformed to lowercase. Only letters and numbers were
 retained, all other characters were removed. The text was then cut into
-character n-grams and was represented with the bag-of-words model. I
-constructed six training datasets:
+character n-grams and was represented with the bag-of-words model. Six
+training sets were constructed:
 \item[4c] Use character four-grams.
 \item[6c] Use character six-grams.
-\item[4cc] Correcting spelling mistakes and apply the Porter stemming
+\item[4cc] Correct spelling mistakes and apply the Porter stemming
 algorithm~\cite{porter1980algorithm} to each word. Proceed by using
 character four-grams as in \textbf{4c}.
 \item[6cc] Same as \textbf{4cc} but use character six-grams instead.
 \item[4cp500] Same as \textbf{4cp200} but keep the first 500 components.
-For example, the text ``\textit{Hello, hello!}'', after undergoing basic
-preprocessing, was represented as the bag 
-\textit{\{'hell', 'ello', 'llo ', 'lo h', 'o he', ' hel'\}}, under
-he \textbf{4c} model. Note that I did not count how many times a n-gram
-appeared. In the example above, the four-gram \verb+'hell'+ appears twice,
+Consider the text ``\textit{Hello, hello!}''. After undergoing basic
+preprocessing, it was represented as the bag \textit{\{'hell', 'ello',
+'llo ', 'lo h', 'o he', ' hel'\}}, under the \textbf{4c} model. Note
+that I did not count how many times an n-gram appeared. In the example
+above, the four-grams \verb+'hell'+ and \verb+'ello'+ appear twice,
 but the representation I chose discarded this information. A separate
bag-of-words model was built for each essay prompt. Using \textit{tf-idf}
weighting, or word tokens instead of character n-grams, did not improve my
score. The number
 % obtained and used.
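The preprocessing and featurization described above can be sketched as
follows. This is a minimal illustration, not the submission code; the
function name is my own, and spaces are kept so that the n-grams match
the \textbf{4c} example.

```python
import re

def char_ngram_bag(text, n=4):
    """Bag of character n-grams under the 4c model: lowercase, keep only
    letters, digits and spaces, then collect the distinct n-grams.
    Presence only -- counts are deliberately discarded."""
    text = re.sub(r'[^a-z0-9 ]', '', text.lower())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

# char_ngram_bag('Hello, hello!') ->
# {'hell', 'ello', 'llo ', 'lo h', 'o he', ' hel'}
```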
 \subsection{Base Learners}
+\textbf{Model} & \textbf{Parameters} \\\hline\hline
+Ridge regression & $\lambda = 2^4, 2^{4.2}, 2^{4.4}, \ldots, 2^{9.8}$ \\\hline
+SVR linear kernel & 
+$C = 2^{-11}, 2^{-10.8}, 2^{-10.6}, \ldots, 2^{-5.8}$ \\\hline
+SVR RBF kernel &
+\specialcell{$C = 2^{-1}, 2^0, \ldots, 2^{5}$\\
+$\gamma = 2^{-13}, 2^{-12}, \ldots, 2^{-7}$}\\\hline
+GBM & \specialcell{n\_estimators = 500, 1000, 2000\\
+max\_depth = 3, 4, 5\\
+max\_features = 0.5, 1} \\\hline
+Random forests & -- \\\hline
+K-Nearest Neighbor & $k = 3, 4, \ldots, 20$ \\\hline
+\caption{Base learners and the parameter values considered for each.}
 \subsubsection{Ridge Regression}
 We are given a list of $m$ training examples $\{(x^{(i)}, y^{(i)}); i =
-1, \ldots, m\}$, where $x^{(i)} \in \mathbb{R}^n$ is the $i$-th feature
-and $y^{(i)} \in \mathbb{R}$ is the $i$-th target or grade in our case.
-In ridge regression we want to compute the parameter vector $\theta$,
-which minimizes the cost function $J(\theta)$, defined as
+1, \ldots, m\}$, where $x^{(i)} \in \mathbb{R}^n$ is the $i$-th answer
+and $y^{(i)} \in \mathbb{R}$ is the grade of the $i$-th answer.
+In ridge regression (RR) we want to compute the parameter vector
+$\theta$, which minimizes the cost function $J(\theta)$, defined as
J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( \theta^T x^{(i)} - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n} \theta_j^2.
The minimizer is obtained
by computing the gradient $\nabla_\theta J(\theta)$ and using the
limited-memory BFGS algorithm~\cite{byrd1995limited} to find the minimum
-of $J(\theta)$. The regularization parameter $\lambda$ is used
-to prevent overfitting.
+of $J(\theta)$. The regularization parameter $\lambda$ is used to
+prevent overfitting. Prior to training, a constant feature with value 1
+is appended to each sample $x^{(i)}$. I implemented ridge regression
+in Python, making heavy use of the numpy and scipy libraries.
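A minimal sketch of this approach using scipy's L-BFGS-B optimizer
(function and variable names are my own, not the report's; the intercept
is regularized here for brevity):

```python
import numpy as np
from scipy.optimize import minimize

def ridge_fit(X, y, lam):
    """Fit ridge regression by minimizing J(theta) with L-BFGS.
    A constant feature is appended, as described above."""
    X = np.hstack([X, np.ones((X.shape[0], 1))])

    def cost_grad(theta):
        r = X @ theta - y                            # residuals
        J = 0.5 * (r @ r) + 0.5 * lam * (theta @ theta)
        g = X.T @ r + lam * theta                    # gradient of J
        return J, g

    res = minimize(cost_grad, np.zeros(X.shape[1]),
                   jac=True, method='L-BFGS-B')
    return res.x
```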
-I used my own implementation of ridge regression written in Python.
+\subsubsection{Support Vector Regression}
+Support vector regression (SVR) was implemented using the LIBSVM and
+LIBLINEAR~\cite{chang2011libsvm, fan2008liblinear} software libraries.
+I used both the linear and the radial basis function (RBF) kernel. $C >
+0$ is the penalty parameter of the error term and $\gamma$ is the RBF
+kernel parameter.
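The report calls LIBSVM and LIBLINEAR directly; as an illustration only,
equivalent models can be configured through scikit-learn's \verb+SVR+
wrapper around LIBSVM, where \verb+C+ and \verb+gamma+ correspond to the
grid parameters above (the values shown are single points from those
grids, not the selected ones):

```python
from sklearn.svm import SVR

# Illustrative configuration: one point from each parameter grid above.
svr_rbf = SVR(kernel='rbf', C=2.0 ** 3, gamma=2.0 ** -10)
svr_linear = SVR(kernel='linear', C=2.0 ** -8)
```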
 \subsubsection{Gradient Boosting Machine}
-\subsubsection{Random Forest}
+Gradient boosting machines (GBM) were introduced by
+Friedman~\cite{friedman2001greedy,friedman2002stochastic}. I used the
+implementation from Scikit-learn~\cite{pedregosa2011scikit}. Character n-grams
+appearing in fewer than 30 answers were removed from the dataset. The parameters
+of this method are:
+\item[n\_estimators] The number of boosting stages to perform.
+\item[max\_depth] The maximum depth of the individual regression estimators.
+\item[max\_features] The fraction of features to consider when looking
+for the best split. This parameter was set to either 1 (consider all
+features) or 0.5 (consider half of the features).
+\item[subsample] The fraction of samples to be used for fitting the
+individual base learners. This parameter was set to 0.5.
+\item[learn\_rate] The amount by which the contribution of each tree is shrunk.
+\subsubsection{Random Forests}
+Employing the same feature selection procedure as before, i.e., keeping
+the features which appear in at least 30 answers, I trained a random forest
+(RF) classifier~\cite{breiman2001random} with 50 trees using the Scikit-learn
 \subsubsection{K-Nearest Neighbor}
+So far, the $j$-th feature of the $i$-th sample was defined as
+\[x_j^{(i)} = 1\{\text{answer $x^{(i)}$ contains the $j$-th n-gram}\},\]
+where $1\{\cdot\}$ denotes the indicator function, i.e., $1\{\mbox{True}\} = 1$ and
+$1\{\mbox{False}\} = 0$. For the k-nearest neighbor (KNN) algorithm, we change the
+definition to:
+\[x_j^{(i)} = 1\{\text{answer $x^{(i)}$ contains the $j$-th n-gram}\} \times \theta_j, \]
+where $\theta_j$ is the $j$-th component of the ridge regression
+parameter vector $\theta$. In other words, we weigh each feature by
+the corresponding ridge regression coefficient. I then ran the
+k-nearest neighbor algorithm on the resulting training set using the
+cosine similarity measure.
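This weighting scheme can be sketched in a few lines of numpy (an
illustration; the function and variable names are assumed, and the
prediction is a plain average of the $k$ nearest grades):

```python
import numpy as np

def knn_cosine_predict(X_tr, y_tr, X_te, theta, k=5):
    """KNN with binary n-gram features weighted by ridge coefficients
    and compared with cosine similarity."""
    A = X_tr * theta                       # weigh feature j by theta_j
    B = X_te * theta
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    sim = B @ A.T                          # cosine similarities
    nearest = np.argsort(-sim, axis=1)[:, :k]
    return y_tr[nearest].mean(axis=1)      # average grade of neighbors
```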
+Stacking~\cite{wolpert1992stacked} is a technique for combining the
+predictions of several learners. It has been used successfully in past
+competitions, most notably in the Netflix Prize~\cite{koren2009bellkor}.
+I conducted 5-fold cross-validation on all base learners to obtain
+out-of-sample predictions for all training examples. These predictions are
+subsequently used as training data for ridge regression, which creates
+an affine combination of the predictions of the base learners. Informally,
+we can say that predictions obtained in this way are usually better
+than the predictions produced by any single base learner.
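The out-of-fold construction can be sketched as follows (a simplified
illustration; the base learners in \verb+models+ are assumed to expose
scikit-learn-style \verb+fit+/\verb+predict+ methods):

```python
import numpy as np

def out_of_fold(models, X, y, n_folds=5):
    """Column j holds the 5-fold cross-validated (out-of-sample)
    predictions of base learner j; this matrix then serves as the
    training data for the level-1 ridge regression."""
    m = len(y)
    folds = np.array_split(np.arange(m), n_folds)
    Z = np.zeros((m, len(models)))
    for j, model in enumerate(models):
        for holdout in folds:
            train = np.setdiff1d(np.arange(m), holdout)
            model.fit(X[train], y[train])
            Z[holdout, j] = model.predict(X[holdout])
    return Z
```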
+\textbf{Dataset} & \textbf{Models} \\\hline\hline
+4c & RR, RF, KNN, GBM(3, 1), GBM(4, 0.5), GBM(5, 0.5) \\\hline
+4cc & RR, RF, SVM\_RBF, GBM(3, 1), GBM(5, 0.5) \\\hline
+6c & RR, SVM\_RBF \\\hline
+6cc & RR, SVM\_RBF, KNN \\\hline
+4cp200 & KNN, SVM\_RBF, SVM\_LINEAR \\\hline
+4cp500 & SVM\_RBF, KNN, GBM(3, 1) \\\hline
+\caption{The dataset/model combinations used in my final model.
+The GBM learner is parametrized by its max\_depth and max\_features parameters.}
-\section{Source Code}
-% 4. Code Description
-% Provide a description of your code here. Code itself should be commented
-% clearly and concisely. For each function provide
-%     function inputs
-%     function outputs
-%     what function does
-The source code of my solution can be obtained from an online repository
-at \url{}. In order to replicate
-my results exactly, you might want to fetch revision number a22e080625de,
-which is the revision I used for my final submission and can be found at
 \section{How To Generate the Solution}
 % 5. How To Generate the Solution (aka README file)
 % from the code provided. Include that description here and in a separate
 % README file to accompany the code.
+\subsection{Obtaining the Source Code}
+The source code can be downloaded from an online repository at
+\url{}. In order to replicate my
+results exactly, you might want to fetch revision number a22e080625de,
+which is the revision I used for my final submission and can be found
+at \url{}.
 \subsection{Software Used}


  publisher={SIAM}
+  title={LIBLINEAR: A library for large linear classification},
+  author={Fan, R.E. and Chang, K.W. and Hsieh, C.J. and Wang, X.R. and Lin, C.J.},
+  journal={The Journal of Machine Learning Research},
+  volume={9},
+  pages={1871--1874},
+  year={2008},
+  publisher={JMLR.org}
+  title={LIBSVM: a library for support vector machines},
+  author={Chang, C.C. and Lin, C.J.},
+  journal={ACM Transactions on Intelligent Systems and Technology (TIST)},
+  volume={2},
+  number={3},
+  pages={27},
+  year={2011},
+  publisher={ACM}
+  title={Scikit-learn: Machine learning in Python},
+  author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V. and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P. and Weiss, R. and Dubourg, V. and others},
+  journal={The Journal of Machine Learning Research},
+  volume={12},
+  pages={2825--2830},
+  year={2011},
+  publisher={JMLR.org}
+  title={Greedy function approximation: a gradient boosting machine},
+  author={Friedman, J.H.},
+  journal={Annals of Statistics},
+  pages={1189--1232},
+  year={2001},
+  publisher={JSTOR}
+  title={Stochastic gradient boosting},
+  author={Friedman, J.H.},
+  journal={Computational Statistics \& Data Analysis},
+  volume={38},
+  number={4},
+  pages={367--378},
+  year={2002},
+  publisher={Elsevier}
+  title={Random forests},
+  author={Breiman, L.},
+  journal={Machine Learning},
+  volume={45},
+  number={1},
+  pages={5--32},
+  year={2001},
+  publisher={Springer}
+  title={Stacked generalization},
+  author={Wolpert, D.H.},
+  journal={Neural Networks},
+  volume={5},
+  number={2},
+  pages={241--259},
+  year={1992},
+  publisher={Elsevier}
+  title={Feature engineering and classifier ensemble for KDD cup 2010},
+  author={Yu, H.F. and Lo, H.Y. and Hsieh, H.P. and Lou, J.K. and McKenzie, T.G. and Chou, J.W. and Chung, P.H. and Ho, C.H. and Chang, C.F. and Wei, Y.H. and others},
+  booktitle={JMLR Workshop and Conference Proceedings},
+  year={2011}
+  title={Collaborative filtering applied to educational data mining},
+  author={Toscher, A. and Jahrer, M.},
+  journal={KDD Cup},
+  year={2010}
+  title={The BellKor solution to the Netflix Grand Prize},
+  author={Koren, Y.},
+  journal={Netflix Prize documentation},
+  year={2009}