Commits

vialard committed 46467bd

Final corrections.


Files changed (2)

ICDAR2013/ICDAR2013_paper/biblio.bib

 	Title = {Combination of Document Image Binarization Techniques},
 	Year = {2011}}
 
+@article{lu2010document,
+	Author = {Lu, Shijian and Su, Bolan and Tan, Chew Lim},
+	Journal = {International Journal on Document Analysis and Recognition},
+	Number = {4},
+	Pages = {303--314},
+	Publisher = {Springer},
+	Title = {Document Image Binarization Using Background Estimation and Stroke Edges},
+	Volume = {13},
+	Year = {2010}}
+
+
 @article{white1983image,
 	Author = {White, J.M. and Rohrer, G.D.},
 	Date-Added = {2012-06-06 16:22:35 +0200},

ICDAR2013/ICDAR2013_paper/icdar.tex

 \end{figure}
 
 
-Let  $c_{\inkp} \in \inkc$  be an ink component and  $c_{\degp} \in \dc$ be a degradation component. We denote the predicate returning true by $SG(c_{\inkp}, c_{\degp})$  if $ c_{\inkp}  $ and $ c_{\degp}  $ are connected~:
+Let $c_{\inkp} \in \inkc$ be an ink component and $c_{\degp} \in \dc$ be a degradation component. We denote by $SG(c_{\inkp}, c_{\degp})$ the predicate that is true if and only if $c_{\inkp}$ and $c_{\degp}$ are connected:
 
 $$ SG (c_{\inkp}, c_{\degp}) =  \exists (p_{\inkp}, p_{\degp}) \in c_{\inkp} \times c_{\degp} \mid p_{\inkp} \mbox{~and~} p_{\degp} \mbox{~are 4-connected}$$ 
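As a concrete illustration of this predicate (an editorial sketch, not code from the paper), assuming each connected component is represented as a set of (row, col) pixel coordinates:

    # Sketch of the SG predicate: True when some pixel of the ink component
    # c_ink is 4-connected to some pixel of the degradation component c_deg.
    # Components are assumed to be sets of (row, col) tuples.
    def sg(c_ink, c_deg):
        for (r, c) in c_ink:
            # the four 4-connected neighbours of pixel (r, c)
            if {(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)} & c_deg:
                return True
        return False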
 
-We distinguish three different cases that can produce different types of binarization errors~: 
+We distinguish three different cases that can produce different types of binarization errors: 
 
 
 \begin{enumerate}
 
-\item  If  $c_{\inkp}$ and  $c_{\degp}$  are not connected (figure \ref{locations}.a), the original character will not be altered by the binarization process. If this configuration occurs numerous times, the binarization can lead to a document image highly degraded by many small black spots between characters. Let $\cma$ be  the set of degradation components that are not connected to any ink component~:
+\item If $c_{\inkp}$ and $c_{\degp}$ are not connected (figure \ref{locations}.a), the original character will not be altered by the binarization process. If this configuration occurs many times, the binarization can produce a document image that is highly degraded by many small black spots between characters. Let $\cma$ be the set of degradation components that are not connected to any ink component:
 
 $$
 \cma = \{c_{\degp} \in \dc \mid \forall c_{\inkp} \in \inkc, SG (c_{\inkp}, c_{\degp})=false \}
 $$
 
 
-The relative quantity of non-connected ink and degradation components is measured by $ \mA $~:
+The quantity of degradation components that are not connected to any ink component, relative to the number of ink components, is measured by $\mA$:
 
 $$ \mA = \displaystyle\frac{ \card{ \cma } }{ \card{\inkc} } $$
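Continuing the same sketch, $\cma$ and $\mA$ follow directly from the two formulas above; the list-of-sets representation of $\inkc$ and $\dc$ is again an assumption made for illustration:

    # C_mA: degradation components connected to no ink component;
    # mA: their count relative to the number of ink components.
    def measure_mA(ink_components, deg_components):
        c_ma = [c_deg for c_deg in deg_components
                if not any(sg(c_ink, c_deg) for c_ink in ink_components)]
        return len(c_ma) / len(ink_components)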
 
 
 The measures introduced in this paper characterize a document's quality. We focus on a use case that is rarely presented in the state of the art: the prediction of the accuracy of binarization methods.
 
-This section presents a unified methodology that is able to predict most types of binarization methods (for example, adaptive thresholding, clustering, entropic, document dedicated). Our methodology is evaluated on $11$ binarization methods used in document analysis. The methods are referenced in the text by their author's names : Bernsen ; Kapur ; Kittler ; Li ; Ridler ; Sauvola ; Otsu (these $7$ methods are described in \cite{stathis2008evaluation}); Sahoo  \cite{sauvola2000adaptive}; Shanbag  \cite{shanbhag1994utilization};  White \cite{white1983image}; Shijian \cite{su2011combination}.
+This section presents a unified methodology that is able to predict the accuracy of most types of binarization methods (for example, adaptive thresholding, clustering, entropic, document-dedicated). Our methodology is evaluated on $11$ binarization methods used in document analysis. The methods are referenced in the text by their authors' names: Bernsen; Kapur; Kittler; Li; Ridler; Sauvola; Otsu (these $7$ methods are described in \cite{stathis2008evaluation}); Sahoo \cite{sauvola2000adaptive}; Shanbhag \cite{shanbhag1994utilization}; White \cite{white1983image}; Shijian \cite{lu2010document}.
 
 %Some binarization methods rely on parameters. In this article, we do not focus on parameter optimization. Therefore, we chose to use the parameters given by the authors of each method in their corresponding original articles. Importantly, note that the prediction models created are only able to predict the performance of a binarization method with a specific set of parameters. However, a binarization method can have several prediction models, one for each set of parameters.
  
 \label{methodology}
 To create the prediction model, we use a multivariate stepwise linear regression \cite{thompson1978selectionp2}, followed by repeated random sub-sampling validation (cross-validation). This overall process can be divided into several steps (a simplified code sketch of the pipeline is given after the step list below):
 \begin{enumerate}
-	\item  Features and F-scores computation: The 18 proposed features are computed for each image.  We also run the binarization algorithm on the overall dataset and measure its accuracy relative to the ground truth. In the follow-ing section, these f-scores are called ground truth f-scores.
+	\item Features and F-score computation: The 18 proposed features are computed for each image. We also run the binarization algorithm on the whole dataset and measure its accuracy relative to the ground truth. In the following section, these F-scores are called ground-truth F-scores.
 	\item Generation of the predictive model: This step consists of applying a stepwise multivariate linear regression to the whole dataset, allowing us to select the most significant features for predicting the accuracy of the given binarization algorithm. The output of this step is a linear function that, from the selected features, gives a predicted F-score for any image for one binarization algorithm.
-	\item Evaluation of model accuracy: The $R^2$ value indicates the proportion of variability in a data set that is accounted for by the statistical model and provides a measure of how well the model predicts future outcomes. The best theoretical value for $R^2$ is 1. Moreover, a p-value is computed for each selected feature indicating its significance. We choose to keep the model only if $R^2 > 0.7$ and if a majority of p-values are lower than $0.1$.
+	\item Evaluation of model accuracy: The $R^2$ value indicates the proportion of variability in a dataset that is accounted for by the statistical model and provides a measure of how well the model predicts future outcomes. The best theoretical value for $R^2$ is 1. Moreover, a p-value is computed for each selected feature indicating its significance. We choose to keep the model only if $R^2 > 0.7$ and if a majority of p-values are lower than $0.1$.
 
 	\item Model validation using cross-validation: the training of a prediction model and the measurement of its accuracy are repeated several times (in our experiments, 100 times) on different subsets of the dataset:
 \begin{enumerate}
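For readers who want to reproduce the pipeline, the following is a simplified, self-contained Python sketch (an editorial illustration, not the authors' code). It replaces the p-value-driven stepwise selection with a greedy forward selection on the R^2 gain, and it scores the repeated random train/test splits by the mean absolute error of the predicted F-scores; the min_gain, test_frac and n_runs values are assumptions, with n_runs=100 matching the experiments described above.

    import numpy as np

    def fit_r2(X, y):
        # Ordinary least squares with an intercept; returns coefficients and R^2.
        A = np.column_stack([np.ones(len(y)), X])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        residuals = y - A @ coef
        r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
        return coef, r2

    def forward_stepwise(X, y, min_gain=0.01):
        # Greedy stand-in for stepwise selection: repeatedly add the feature
        # that most improves R^2, stop when the gain falls below min_gain.
        selected, best_r2 = [], 0.0
        remaining = list(range(X.shape[1]))
        while remaining:
            gain, best_j = max(
                (fit_r2(X[:, selected + [j]], y)[1] - best_r2, j) for j in remaining)
            if gain < min_gain:
                break
            selected.append(best_j)
            remaining.remove(best_j)
            best_r2 += gain
        return selected, best_r2

    def repeated_subsampling(X, y, features, n_runs=100, test_frac=0.3, seed=0):
        # Repeated random sub-sampling validation: refit on each training split
        # and measure the mean absolute error of the predicted F-scores.
        rng = np.random.default_rng(seed)
        errors = []
        for _ in range(n_runs):
            idx = rng.permutation(len(y))
            n_test = int(test_frac * len(y))
            test, train = idx[:n_test], idx[n_test:]
            coef, _ = fit_r2(X[train][:, features], y[train])
            pred = np.column_stack([np.ones(len(test)), X[test][:, features]]) @ coef
            errors.append(np.mean(np.abs(pred - y[test])))
        return float(np.mean(errors))

With X the (number of images x 18) feature matrix and y the ground-truth F-scores for one binarization method, forward_stepwise corresponds to step 2 (and the R^2 part of step 3), while repeated_subsampling corresponds to step 4.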
 
 \section{Conclusion and research perspectives}
 
-This paper presented $18$ features that characterize the quality of a document image. These features are used in step-wise multivariate linear regression to create prediction models for  $11$ binarization methods. Repeated random sub-sampling cross-validation shows that $10$ of $11$ models are very accurate and can be used to automatically choose the best binarization method. Moreover, given the step-wise approach of the linear regression, these models are not over parameterized.  
+This paper presented $18$ features that characterize the quality of a document image. These features are used in a stepwise multivariate linear regression to create prediction models for $11$ binarization methods. Repeated random sub-sampling cross-validation shows that $10$ of the $11$ models are very accurate and can be used to automatically choose the best binarization method. Moreover, given the stepwise approach of the linear regression, these models are not over-parameterized.
 
 One of our future research goals is to apply the same methodology to predict OCR error rates. 
 %However, OCRs today are very complex engines that are able to restore documents and perform layout analysis. Therefore, OCR failure cases are not only the result of a document's quality but also of its complexity (font, tables, figures, mathematical formulas). This complexity has to be evaluated with new OCR dedicated features. Our second research goal is to improve the binarization algorithm selection method. We believe that the method can be tuned by studying different strategies. One notion is to take into account  $R^{2}$ and p-values measures in the automatic selection of a method. Another idea is to weight predicted fscores with computational costs: with similar accuracy, choosing the quickest one may be preferable.