
Jure Žbontar committed 09ed57c

what works well


Files changed (2)

         id, set_id, score1, score2, text = line.split('\t')
         if id == 'Id' or id in outliers:
             continue
-        ys[int(set_id) - 1].append(int(score1))
+        ys[int(set_id) - 1].append((int(score1), int(score2)))
         essay_sets[int(set_id) - 1].append(text.lower())
     ys = map(np.array, ys)
 
             preprocessor=preprocessor,
             binary=True,
             analyzer='char',
-            min_n=max_n,
-            max_n=max_n,
+            ngram_range=(max_n, max_n),
         )
 
         def process(i):
         for id, grade in zip(public_ids[i], p):
             f.write('%d,%d\n' % (id, grade))
 
-    print mqwk(scores)
+    #print scores
+
+    y = pickle.load(open('data/y.pkl'))
+    for i in range(10):
+        rater = qwk(y[i][:,0], y[i][:,1])
+        model = scores[i]
+        print '%d %.4f %.4f %.4f' % (i + 1, model, rater, model / rater)
+        
+
+    #print mqwk(scores)
     sys.exit()
 
 
 if sys.argv[1] != 'stack':
     data = sys.argv.pop(1)
     X = pickle.load(open('data/%s.pkl' % data))[essay_set]
-    y = pickle.load(open('data/y.pkl'))[essay_set]
+    y = pickle.load(open('data/y.pkl'))[essay_set][:,0]
 
     np.random.seed(42)
     kf = StratifiedKFold(y, k=folds)
 def gbr(X, y, X_test):
     from sklearn.ensemble import GradientBoostingRegressor
 
-    m = GradientBoostingRegressor(learn_rate=learn_rate,
+    m = GradientBoostingRegressor(learning_rate=learn_rate,
         n_estimators=n_estimators, max_depth=max_depth,
         subsample=subsample, max_features=max_features)
     m.fit(X, y)

report/works_well.tex

+\documentclass[a4paper]{article}
+\usepackage[utf8]{inputenc}
+\usepackage{hyperref}
+\usepackage[parfill]{parskip}
+\usepackage{amsfonts}
+\usepackage{amsmath}
+
+\begin{document}
+\title{Characteristics of items that can be successfully scored}
+\author{Jure Zbontar\\
+\texttt{jure.zbontar@gmail.com}}
+\date{May 22, 2013}
+\maketitle
+
+In order to understand the strengths and weaknesses of my approach it is
+important to keep in mind that all my models represent their input as a ``bag
+of n-grams''. The strength of my model lies in figuring out how important
+each n-gram is and what the interactions between them are.
+
+To understand the ``bag of n-grams'' representation, let's look at an example:
+the input ``birds can fly but penguins can't'' is first chopped into
+overlapping runs of n consecutive characters, or n-grams (4-grams in this
+example), to produce the set \{``bird'', ``irds'', ``rds '', ``ds c'', \ldots,
+`` can'', ``cant''\}. This set is then expanded into a very large binary
+vector whose length equals the number of distinct 4-grams across all the
+responses. Each position in the vector indicates whether a particular n-gram
+is present in the response.
+
+
+\begin{table}[htb]
+\centering
+\begin{tabular}{l c c c c c c}
+response & bird & peng & t bi & t pe & \ldots & dino \\
+birds can fly but penguins can't & 1 & 1 & 1 & 0 & & 0 \\
+\end{tabular}
+\end{table}
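+
+As a concrete illustration, the construction above can be reproduced with
+scikit-learn's \texttt{CountVectorizer}. The snippet below is only a minimal
+sketch (with a few made-up toy responses), not the exact pipeline from my
+submission, which additionally applies its own preprocessing:
+
+\begin{verbatim}
+from sklearn.feature_extraction.text import CountVectorizer
+
+# toy responses; in the competition the vocabulary is built from all responses
+responses = ["birds can fly but penguins can't",
+             "most birds can fly",
+             "penguins cannot fly"]
+
+# binary presence/absence of character 4-grams
+vectorizer = CountVectorizer(analyzer='char', ngram_range=(4, 4), binary=True)
+X = vectorizer.fit_transform(responses).toarray()
+
+print(sorted(vectorizer.vocabulary_)[:6])  # a few of the extracted 4-grams
+print(X[0])                                # binary vector of the first response
+\end{verbatim}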
+
+This representation has two interesting properties:
+
+\begin{enumerate}
+
+\item The order in which words appear in the response is almost completely lost.
+Another way of saying this is that the inputs ``birds can fly but penguins
+can't'' and ``penguins can fly but birds can't'' have nearly identical binary
+vector representations (the sketch after this list makes this concrete), which
+makes it hard for the learning algorithms to assign different grades to them,
+although it is obvious that the second sentence is nonsense and should receive
+a lower grade than the first one. This property also makes it hard for the
+algorithm to check facts, as one could easily get away with writing ``World
+War II started in 1945 and ended in 1939'', although, admittedly, there are
+other reasons why checking facts is hard.
+
+\item The number of times an n-gram appears is also lost. I don't see how this
+could present a problem, though it is something we should keep in mind. During
+the competition I also tried to replace the binary vector with a vector that
+counts the number of times each n-gram appears, but that gave worse results.
+
+\end{enumerate}
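+
+The loss of word order in item 1 is easy to check directly. In the sketch
+below (again using \texttt{CountVectorizer} as a stand-in for my actual
+pipeline) the two permuted sentences differ in only a handful of 4-grams, the
+ones around ``but birds'' and ``but penguins'':
+
+\begin{verbatim}
+import numpy as np
+from sklearn.feature_extraction.text import CountVectorizer
+
+pair = ["birds can fly but penguins can't",
+        "penguins can fly but birds can't"]
+
+vectorizer = CountVectorizer(analyzer='char', ngram_range=(4, 4), binary=True)
+X = vectorizer.fit_transform(pair).toarray()
+
+# map column indices back to 4-grams
+terms = {j: t for t, j in vectorizer.vocabulary_.items()}
+
+# positions where the two binary vectors disagree
+diff = np.flatnonzero(X[0] != X[1])
+print(len(diff), X.shape[1])   # only a few positions out of all 4-grams
+print([terms[j] for j in diff])
+\end{verbatim}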
+
+What my model is good at is figuring out exactly which n-grams the top scorers
+use and the interactions between those n-grams (for example, it can figure out
+that ``poisonous'' and ``venomous'' are synonyms and that it is enough to use
+one or the other when describing snakes).
+
+
+\section{ASAP Kaggle dataset}
+
+To help us understand which types of items my models score well, I measure the
+quadratic weighted kappa on each of the 10 items from the competition and
+compare it to the quadratic weighted kappa of the reference second rater
+(Table~\ref{tbl:score}).
+
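+For reference, the quadratic weighted kappa between two raters is defined as
+\begin{equation*}
+\kappa = 1 - \frac{\sum_{i,j} w_{i,j}\, O_{i,j}}{\sum_{i,j} w_{i,j}\, E_{i,j}},
+\qquad w_{i,j} = \frac{(i - j)^2}{(N - 1)^2},
+\end{equation*}
+where $O_{i,j}$ counts the responses scored $i$ by one rater and $j$ by the
+other, $E$ is the outer product of the two raters' score histograms, scaled so
+that it sums to the same total as $O$, and $N$ is the number of possible
+scores.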
+
+\begin{table}[htb]
+\centering
+\begin{tabular}{c c c c}
+dataset ID & my score & 2nd rater's score & ratio (my score / 2nd rater's score) \\
+9 & 0.7785 & 0.8367 & 0.9305 \\
+4 & 0.6907 & 0.7564 & 0.9132 \\
+6 & 0.8717 & 0.9612 & 0.9069 \\
+3 & 0.6871 & 0.7598 & 0.9043 \\
+1 & 0.8431 & 0.9429 & 0.8942 \\
+10 & 0.7456 & 0.8838 & 0.8436 \\
+5 & 0.7983 & 0.9534 & 0.8373 \\
+2 & 0.7486 & 0.9138 & 0.8192 \\
+8 & 0.6533 & 0.8981 & 0.7274 \\
+7 & 0.6945 & 0.9701 & 0.7160 \\
+\end{tabular}
+\caption{Quadratic weighted kappa achieved by my model and by the second human
+rater on each of the 10 ASAP items, sorted by the ratio between the two.}
+\label{tbl:score}
+\end{table}
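+
+For completeness, the following is a minimal NumPy sketch of the metric
+defined above; it is meant to illustrate the computation, not to reproduce
+the exact implementation used in my code. The second rater's column of
+Table~\ref{tbl:score} is the kappa between the two human scores of each item.
+
+\begin{verbatim}
+import numpy as np
+
+def qwk(a, b):
+    # quadratic weighted kappa between two integer score vectors
+    a, b = np.asarray(a, dtype=int), np.asarray(b, dtype=int)
+    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
+    n = hi - lo + 1
+
+    # observed agreement matrix
+    O = np.zeros((n, n))
+    for i, j in zip(a - lo, b - lo):
+        O[i, j] += 1
+
+    # expected matrix for independent raters, scaled to the same total as O
+    E = np.outer(np.bincount(a - lo, minlength=n),
+                 np.bincount(b - lo, minlength=n)) / float(len(a))
+
+    # quadratic disagreement weights
+    w = (np.arange(n)[:, None] - np.arange(n)[None, :]) ** 2
+    w = w / float((n - 1) ** 2)
+
+    return 1.0 - (w * O).sum() / (w * E).sum()
+\end{verbatim}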
+
+
+\end{document}