e450ca1
committed
Files changed (3)

+124 −122  IJDAR/ijdar.bib

+1 −1  IJDAR/ijdar.tex

+36 −25  IJDAR/prediction.tex
IJDAR/ijdar.bib
 booktitle={Proceedings of the International Conference on Frontiers in Handwriting Recognition (ICFHR 2010)},
+ Booktitle = {Proceedings of the International Conference on Frontiers in Handwriting Recognition (ICFHR 2010)},
 booktitle={Proceedings of the International Conference on Document Analysis and Recognition (ICDAR 2011)},
+ Booktitle = {Proceedings of the International Conference on Document Analysis and Recognition (ICDAR 2011)},
 booktitle = {Proceedings of the Symposium on Document Image Understanding Technology (SDIUT 2003)},
+ Booktitle = {Proceedings of the Symposium on Document Image Understanding Technology (SDIUT 2003)},
keywords={ image restoration; perspective distortion; polynomial regression; scanned grayscale image; warped text line straightening; document image processing; image restoration; optical distortion; polynomials; statistical analysis; text analysis;},
+ Keywords = {image restoration; perspective distortion; polynomial regression; scanned grayscale image; warped text line straightening; document image processing; optical distortion; polynomials; statistical analysis; text analysis;},
 Booktitle = {Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003},
+ Booktitle = {Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003},
 Booktitle = {Proceedings of of the International Conference on Document Analysis and Recognition (ICDAR 2011)},
+ Booktitle = {Proceedings of the International Conference on Document Analysis and Recognition (ICDAR 2011)},
 title = {Ancient documents bleedthrough evaluation and its application for predicting OCR error rates},
+ Title = {Ancient documents bleedthrough evaluation and its application for predicting OCR error rates},
IJDAR/ijdar.tex
This paper presented $18$ features that characterize the quality of a document image. These features are used in stepwise multivariate linear regression to create prediction models for $11$ binarization methods. Repeated random subsampling crossvalidation shows that $10$ of $11$ models are very accurate and can be used to automatically choose the best binarization method. Moreover, given the stepwise approach of the linear regression, these models are not over parameterized.
+This paper presented $18$ features that characterize the quality of a document image. These features are used in stepwise multivariate linear regression to create prediction models for $12$ binarization methods. Repeated random subsampling cross-validation shows that these $12$ models are accurate (maximum percentage error of 11\%) and can be used to automatically choose the best binarization method. Moreover, given the stepwise approach of the linear regression, these models are not over-parameterized.
%In \cite{rabeux2011ancient}, similar features are used with a multivariate linear regression to predict the OCR error rate.
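The repeated random subsampling cross-validation mentioned above can be sketched as follows. This is a minimal illustration, not the paper's exact protocol: the split fraction, number of repetitions, and the plain least-squares fit are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def subsampling_cv(X, y, n_splits=50, test_frac=0.3):
    """Repeated random subsampling cross-validation for a linear model
    fitted by ordinary least squares; returns the mean held-out R^2."""
    n = len(y)
    n_test = max(1, int(round(test_frac * n)))
    scores = []
    for _ in range(n_splits):
        idx = rng.permutation(n)
        test, train = idx[:n_test], idx[n_test:]
        # fit an OLS model (with intercept) on the training split
        A = np.column_stack([np.ones(len(train)), X[train]])
        coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        # score on the held-out split
        pred = np.column_stack([np.ones(len(test)), X[test]]) @ coef
        ss_res = np.sum((y[test] - pred) ** 2)
        ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return float(np.mean(scores))
```

Averaging held-out $R^{2}$ over many random splits gives the $\bar{R^{2}}$ robustness measure reported for each prediction model.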
IJDAR/prediction.tex
\item Li \cite{li1998iterative} is a cross-entropic thresholding method based on the minimization of an information-theoretic distance (Kullback-Leibler).
%\item Niblack \cite{niblack1985introduction} : is a locally adaptive thresholding method using pixels intensity variance.
+ \item Niblack \cite{niblack1985introduction} is a locally adaptive thresholding method using pixel intensity variance.
\item Ridler \cite{calvard1978picture} is an iterative thresholding method based on two-class Gaussian mixture models.
\item Shanbhag \cite{shanbhag1994utilization} is a fuzzy entropic thresholding technique that considers fuzzy memberships as an indication of how strongly a gray value belongs to the background or to the foreground.
\item Sauvola \cite{sauvola2000adaptive} is a locally adaptive thresholding method using pixel intensity variance.
 \item Shijian \cite{su2011combination} is a recent method based on an \textit{adhoc} combination of existing techniques. \cite{su2011combination} has proven to have very good accuracy on the ICDAR 2011 Binarization Contest.
+ \item Shijian \cite{lu2010document} is a recent method based on an \textit{ad hoc} combination of existing techniques. It has proven to be very accurate in the ICDAR 2009 Binarization Contest.
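As an illustration of the locally adaptive thresholding used by methods such as Niblack and Sauvola, here is a minimal, unoptimized sketch of Sauvola's rule $T = m\,(1 + k\,(s/R - 1))$, where $m$ and $s$ are the local mean and standard deviation. The window size, $k$, and $R$ values below are commonly cited defaults, not necessarily the parameters used in this paper.

```python
import numpy as np

def sauvola_threshold(img, window=15, k=0.5, R=128.0):
    """Binarize a grayscale image with Sauvola's locally adaptive rule:
    T(x, y) = m * (1 + k * (s / R - 1)), computed over a sliding window.
    Returns a boolean array where True marks background (white) pixels."""
    pad = window // 2
    # reflect-pad so the window is defined at the image borders
    padded = np.pad(img.astype(float), pad, mode="reflect")
    h, w = img.shape
    out = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            patch = padded[y:y + window, x:x + window]
            m, s = patch.mean(), patch.std()
            T = m * (1.0 + k * (s / R - 1.0))
            out[y, x] = img[y, x] > T
    return out
```

Real implementations compute the local mean and variance with integral images instead of an explicit double loop; the nested loops here are kept only for readability.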
Some binarization methods rely on parameters. In this article, we do not focus on parameter optimization. Therefore, we chose to use the parameters given by the authors of each method in their corresponding original articles. Table \ref{parameters} summarizes the values of these parameters. Importantly, note that the prediction models created are only able to predict the performance of a binarization method with a specific set of parameters. However, a binarization method can have several prediction models, one for each set of parameters. To illustrate the difference between two sets of parameters, we will create two different prediction models for Sauvola's method. The second set of parameters was manually chosen (Table \ref{parameters}).
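Since a prediction model is tied to one specific parameter set, a natural way to organize several models per method is to key them by the method name together with a frozen parameter set. The sketch below is purely illustrative: the parameter names, values, and coefficient lists are hypothetical, not those of Table \ref{parameters}.

```python
# Registry of prediction models, keyed by (method, frozen parameter set).
# All concrete values below are made up for illustration.
models = {}

def register_model(method, params, coefficients):
    """Store regression coefficients under an immutable parameter key,
    so the same method can carry one model per parameter set."""
    key = (method, tuple(sorted(params.items())))
    models[key] = coefficients

# Two distinct Sauvola models, one for each parameter set:
register_model("sauvola", {"window": 15, "k": 0.5}, [0.10, 0.90])
register_model("sauvola", {"window": 31, "k": 0.2}, [0.20, 0.80])
```

Looking up a model then requires supplying the exact parameter set it was trained with, which mirrors the restriction stated above.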
\caption{Method parameters: we chose to use the parameter values given by the authors of each method in their original articles.}
\caption{Statistical results of $11$ binarization algorithms applied to all DIBCO images. Except for the Sahoo algorithm, all binarization methods have a significant min/max fscore gap and standard deviation between $0.1$ and $0.3$, indicating that the dataset is heterogeneous and well suited for the learning step of our prediction model.}
+\caption{Statistical results of $12$ binarization algorithms applied to all DIBCO images. Except for the Sahoo algorithm, all binarization methods have a significant min/max f-score gap and a standard deviation between $0.1$ and $0.3$, indicating that the dataset is heterogeneous and well suited for the learning step of our prediction model.}
The best theoretical value for $R^{2}$ is $1$. Moreover, a p-value is computed for each selected feature to indicate its significance: a low p-value leads to rejecting the null hypothesis that the selected feature is not significant.
There is no automatic rule to decide whether a model is valid. In our tests, we choose to keep the model only if $R^2 > 0.7$ and if a majority of pvalues are lower than $0.1$.
+At this step, there is no automatic rule to decide whether a model is valid. The $R^{2}$ value computed here gives an indication of how well the model can be expected to perform in practice. The model still needs to be statistically validated; this validation is carried out in the next step.
+%However, in our tests, we choose to keep the model only if a majority of p-values are lower than $0.1$.
%??? We also look at the slope coefficient of the validation regression, which also needs to be the closest to 1.
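A minimal sketch of forward stepwise feature selection is given below. It greedily adds the feature that most improves adjusted $R^{2}$; the paper's stepwise regression may instead use p-value-based entry and exit criteria, and the stopping tolerance and data shapes here are illustrative assumptions.

```python
import numpy as np

def fit_r2(X, y, cols):
    """OLS fit (with intercept) on the selected columns;
    returns (R^2, adjusted R^2)."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot
    n, p = len(y), len(cols)
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return r2, adj

def forward_stepwise(X, y, tol=1e-3):
    """Greedy forward selection: at each round, add the candidate feature
    that most improves adjusted R^2; stop when no candidate improves it
    by more than tol. Returns (selected columns, final adjusted R^2)."""
    selected, best_adj = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = [(fit_r2(X, y, selected + [c])[1], c) for c in remaining]
        adj, c = max(scores)
        if adj - best_adj <= tol:
            break
        selected.append(c)
        remaining.remove(c)
        best_adj = adj
    return selected, best_adj
```

Because adjusted $R^{2}$ penalizes each added regressor, the loop stops before absorbing uninformative features, which is what keeps the resulting models from being over-parameterized.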
\caption{Otsu prediction model: all selected features are significant (p-value $<0.1$), and the model is likely to correctly predict future unknown images given that the $R^{2}$ value is higher than $0.9$. $\hat{mpe}$ denotes the mean percentage error.}
%\caption{Sauvola prediction model: all features are significant (p-value $<0.1$); the model is also likely to correctly predict future unknown images given that $R^{2}$ equals $0.8$ and adjusted $R^{2}$ equals $0.77$.}
%\caption{Shijian prediction model: the model is likely to correctly predict future unknown images given that $R^{2}$ equals $0.86$ and adjusted $R^{2}$ equals $0.82$.}
The same experiment was conducted on the other binarization methods (see Table~\ref{otherPredictionModel}). Except for Sahoo's method, all prediction models have an $R^{2}$ value higher than $0.7$, indicating that it is possible to predict the results of $10$ of $11$ binarization methods.
+The same experiment was conducted on the other binarization methods (see Table~\ref{otherPredictionModel}). All prediction models have an $\bar{R^{2}}$ value higher than $0.7$, indicating that it is possible to predict the results of all $12$ binarization methods.
\caption{Accuracy of the prediction model for the other eight binarization methods. The selected features are different from one method to another. The accuracy and robustness of the prediction models are good ($R^2 > 0.7$, cross validation $\bar{R^{2}} > 0.83$). $\hat{mpe}$ denotes the mean percentage error of each model.}
+\caption{Accuracy of the prediction models for the remaining binarization methods. The selected features differ from one method to another. The accuracy and robustness of the prediction models are good (cross-validation $\bar{R^{2}} > 0.7$). $\hat{mpe}$ denotes the mean percentage error of each model.}
+Kapur & $ \mIInk$; $\mA$; $\mu$; $v$; $s_{D}$; $v_{I}$; $\mu_{D}$; $\mu_{I}$ & 0.78 & 0.99 & 2\% \\
\caption{Binarization of the DIBCO dataset: comparison between the best theoretical f-score (computed from the ground truth), the f-scores obtained using only Shijian's method, and the f-scores obtained from our automatic selection.}