62d87f9
committed
Files changed (4)

+2 −2 IJDAR/ijdar.tex

+8 −8 IJDAR/intro.tex

+3 −3 IJDAR/measures.tex

+17 −17 IJDAR/prediction.tex
IJDAR/ijdar.tex
This paper presented $18$ features that characterize the quality of a document image. These features are used in stepwise multivariate linear regression to create prediction models for $12$ binarization methods. Repeated random subsampling crossvalidation shows that the models are accurate (max percentage error equals 11\%). Moreover, given the stepwise approach of the linear regression, these models are not over parameterized. As a result, 10 models out of 12 are validated and show sufficient accuracy to be used in an automated selection method of the optimal binarization method for each image.
+This paper presented $18$ features that characterize the quality of a document image. These features are used in stepwise multivariate linear regression to create prediction models for $12$ binarization methods. Repeated random subsampling cross-validation shows that the models are accurate (maximum percentage error of 11\%). Moreover, given the stepwise approach of the linear regression, these models are not overfit. As a result, 10 models out of 12 are validated and show sufficient accuracy to be used in an automated selection method of the optimal binarization method for each image.
Our second research goal is to improve the binarization algorithm selection method. We believe that the method can be tuned by studying different strategies. One idea is to take into account the $R^{2}$ and p-value measures in the automatic selection of a method. Another is to weight predicted f-scores by computational cost: with similar accuracy, choosing the quickest method may be preferable.
At last, 2 prediction models are rejected due to the lack of accuracy of the latters. New dedicated features have to be created and used in the presented methodology to circumvent this issue.
+Finally, 2 prediction models are rejected due to their lack of accuracy. New dedicated features must be created and used in the presented methodology to circumvent this issue.
IJDAR/intro.tex
\caption{Effect of binarization errors on a layout analysis algorithm. The processed document is composed of four text paragraphs. The background is severly degraded by a large amount of bleedthrough particularly in the margins (a). Three binarization algorithms are applied to the document image. Depending on the chosen binarization method, the bleed though of the original document will induce more or fewer binarization errors. For example, the Bernsen algorithm clearly fails where bleedthrough is important (bc). In the next step, a layout analysis algorithm is applied to the binarized document. The more binarization errors there are, the more inaccurate the layout analysis (in light red, text blocks are extracted with a white spaces segmentation algorithm).}
+\caption{Effect of binarization errors on a layout analysis algorithm. The processed document is composed of four text paragraphs. The background is severely degraded by a large amount of bleed-through, particularly in the margins (a). Three binarization algorithms are applied to the document image. Depending on the chosen binarization method, the bleed-through of the original document will induce more or fewer binarization errors. For example, Bernsen's algorithm clearly fails where bleed-through is important (b, c). In the next step, a layout analysis algorithm is applied to the binarized document. The more binarization errors there are, the more inaccurate the layout analysis (in light red, text blocks are extracted with a white-space segmentation algorithm).}
\caption{OCR (ABBYY Finereader 9) errors from bleedthrough: a document with a low level of bleedthrough (a), and the same document with a higher level bleedthrough(b). The OCR fails when the document contains more bleedthrough. A red zone corresponds to a low confidence measure, whereas a green zone corresponds to a high confidence measure: the OCR may be confident that some bleedthrough regions are text regions (b).}
+\caption{OCR (ABBYY FineReader 9) errors from bleed-through: a document with a low level of bleed-through (a), and the same document with a higher level of bleed-through (b). The OCR fails when the document contains more bleed-through. A red zone corresponds to a low confidence measure, whereas a green zone corresponds to a high confidence measure: the OCR may be confident that some bleed-through regions are text regions (b).}
We propose another approach, based on algorithm prediction models, to select the best algorithm for a specific task. Our approach is based on the following fact: the global quality of a document image directly affects the result of any processing algorithm (binarization, segmentation). Thus, we aim to predict the result of an algorithm according to the degradation type and quantity of the processed document. In this paper, we focus on binarization algorithm prediction.
For a given binarization algorithm and a set of images with their binarization groundtruth, the significant correlation between algorithm performance and the quality of the images allows us to build a prediction function. The document image quality is characterized by new, dedicated features. This prediction function can forecast the binarization algorithm result for any new image on which quality features have been previously computed.
+For a given binarization algorithm and a set of images with their binarization ground truth, the significant correlation between algorithm performance and the quality of the images allows us to build a prediction function. The document image quality is characterized by new, dedicated features. This prediction function can forecast the binarization algorithm result for any new image on which quality features have been previously computed.
%On the contrary, with algorithms prediction models, the optimal selection of an algorithm can be done in an automated way.
%Knowing the type and quantity of the document degradations, we are able to create precise error rates prediction models of such algorithms. With the resulting prediction we can choose the best performing algorithm or decide to use another solution (hand made transcriptions for the OCR for example).
Second, none of the presented articles dealing with prediction models analyze the significance of each feature. Only the authors of \cite{reed2008correlating} propose an interesting correlation analysis between several quality metrics. %previously cited \cite{cannon1999quality}
This preliminary study shows that, even if most features are highly correlated with the defects, some are not. However, this study does not address the essential issue of the selection of relevant quality features to avoid overparameterization of an algorithm prediction.
+This preliminary study shows that, even if most features are highly correlated with the defects, some are not. However, this study does not address the essential issue of selecting the relevant quality features to avoid overfitting the algorithm prediction models.
%But even if the measures are more or less correlated to the OCR results, it does not mean that they are essential for the prediction.
IJDAR/measures.tex
\caption{The three classes of pixels. a) the original grayscale document image b) its grayscale histogram with two thresholds $ s_{0} $ and $ s_{1} $ obtained by a 3means algorithm. c) classification result : pixels lower than the threshold $ s_{0} $ in black, pixels between $ s_{0} $ and $ s_{1} $ in gray and pixels higher that $ s_{1} $ in white. The gray set of pixels (between $ s_{0} $ and $ s_{1} $) contains most of the instances of document degradation, such as bleed through, speckles, spots and ink loss.}
+\caption{The three classes of pixels. (a) The original grayscale document image. (b) Its grayscale histogram with two thresholds $ s_{0} $ and $ s_{1} $ obtained by a 3-means algorithm. (c) Classification result: pixels lower than the threshold $ s_{0} $ in black, pixels between $ s_{0} $ and $ s_{1} $ in gray, and pixels higher than $ s_{1} $ in white. The gray set of pixels (between $ s_{0} $ and $ s_{1} $) contains most of the instances of document degradation, such as bleed-through, speckles, spots and ink loss.}
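The 3-means classification described in the caption can be sketched as a simple 1-D Lloyd's iteration over pixel intensities, with the thresholds $s_0$ and $s_1$ taken midway between adjacent cluster centers. This is a minimal illustrative sketch, not the paper's implementation:

```python
import numpy as np

def three_means_thresholds(pixels, iters=50):
    """Cluster gray levels into 3 classes (1-D k-means) and return
    the two thresholds separating ink / degradation / background."""
    pixels = np.asarray(pixels, dtype=float).ravel()
    # Initialize three centers spread over the observed intensity range.
    centers = np.linspace(pixels.min(), pixels.max(), 3)
    for _ in range(iters):
        # Assign each pixel to its nearest center.
        labels = np.argmin(np.abs(pixels[:, None] - centers[None, :]), axis=1)
        new_centers = np.array([pixels[labels == k].mean() if np.any(labels == k)
                                else centers[k] for k in range(3)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    centers.sort()
    # Thresholds halfway between adjacent centers.
    s0 = (centers[0] + centers[1]) / 2
    s1 = (centers[1] + centers[2]) / 2
    return s0, s1

# Synthetic image: dark ink (~30), gray degradation (~120), light background (~220).
rng = np.random.default_rng(0)
img = np.concatenate([rng.normal(30, 5, 500),
                      rng.normal(120, 5, 300),
                      rng.normal(220, 5, 1200)]).clip(0, 255)
s0, s1 = three_means_thresholds(img)
```

The middle class (between $s_0$ and $s_1$) is the one the paper associates with degradations.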
\caption{Examples of global graylevel histograms : a. a relatively clean document, b. its corresponding graylevel histogram, c. an ancient degraded document, d. its corresponding histogram (more scattered and irregular than in b). The graylevel histogram is used to provide a first indication of the quality of the document.}
+\caption{Examples of global gray-level histograms: a relatively clean document (a), its corresponding gray-level histogram (b), an ancient degraded document (c), and its corresponding histogram (d), which is more scattered and irregular than in (b). The gray-level histogram is used to provide a first indication of the quality of the document.}
\caption{The different locations of a degradation component on the page: a. the degradation component is not connected to an ink component, b. a small degradation component is adjacent to an ink component, c. a large degradation component is adjacent to an ink component.}
+\caption{The different locations of a degradation component on the page: the degradation component is not connected to an ink component (a), a small degradation component is adjacent to an ink component (b), and a large degradation component is adjacent to an ink component (c).}
IJDAR/prediction.tex
\item Sauvola's method \cite{sauvola2000adaptive} is a locally adaptive thresholding algorithm using pixel intensity variance.
 \item Shijian \cite{lu2010document} is a recent method based on an \textit{adhoc} combination of existing techniques. \cite{lu2010document} has proven to have very good accuracy on the ICDAR 2009 Binarization Contest.
+ \item Shijian Lu's method \cite{lu2010document} is a recent method based on an \textit{ad hoc} combination of existing techniques. It has proven to have very good accuracy in the ICDAR 2009 Binarization Contest.
Some binarization methods rely on parameters. In this article, we do not focus on parameter optimization. Therefore, we chose to use the parameters given by the authors of each method in their corresponding original articles. Table \ref{parameters} summarizes the values of these parameters. Importantly, note that the prediction models created are only able to predict the performance of a binarization method with a specific set of parameters. However, a binarization method can have several prediction models, one for each set of parameters. To illustrate the difference between two sets of parameters, we will create two different prediction models for Sauvola's method. The second set of parameters was manually chosen (Table \ref{parameters}).
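As a concrete illustration of what such parameters look like, Sauvola's local threshold is classically computed from the window mean $m$ and standard deviation $s$ as $T = m\,(1 + k(s/R - 1))$. The sketch below applies it naively per pixel; the $k$ and $R$ values are commonly published defaults, not necessarily the exact ones in Table \ref{parameters}:

```python
import numpy as np

def sauvola_threshold(img, w=15, k=0.5, R=128.0):
    """Naive per-pixel Sauvola binarization: threshold = m * (1 + k*(s/R - 1)),
    with m, s the mean and std of the w x w window centered on the pixel."""
    img = np.asarray(img, dtype=float)
    pad = w // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros(img.shape, dtype=np.uint8)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            win = padded[i:i + w, j:j + w]
            t = win.mean() * (1.0 + k * (win.std() / R - 1.0))
            out[i, j] = 0 if img[i, j] <= t else 255  # ink -> black
    return out

# Tiny example: a dark stroke (20) on a light page (200).
page = np.full((32, 32), 200.0)
page[8:24, 14:18] = 20.0  # a vertical "stroke"
binary = sauvola_threshold(page)
```

Varying $k$ or the window size $w$ changes the output, which is why each parameter set needs its own prediction model.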
%The measures introduced in this paper characterize a document's quality. They can be used in a lot of applications such as human perceived quality prediction or restoration algorithms selection. In this paper we focus on a use case that is rarely presented in the state of the art : the prediction of binarization methods accuracy.
To predict the accuracy of the binarization method, we follow a methodology based on a stepwise linear regression. Section \ref{subsectiondataset} presents the dataset we used to train and validate our prediction models. This predictive methodology can be applied to all types of binarization methods and is presented in a general way in section~\ref{subsectionprediction}. We then detail the prediction models corresponding to three popular binarization methods for document images: Otsu's, Sauvola's and Shijian's (section \ref{subsectioncommonmethods}). The prediction model accuracy for the other methods is presented in section \ref{subsectionotherprediction} to highlight the generality of the presented methodology.
+To predict the accuracy of the binarization method, we follow a methodology based on a stepwise linear regression. Section \ref{subsectiondataset} presents the dataset we used to train and validate our prediction models. This predictive methodology can be applied to all types of binarization methods and is presented in a general way in section~\ref{subsectionprediction}. We then detail the prediction models corresponding to three popular binarization methods for document images: Otsu's, Sauvola's and Shijian Lu's (section \ref{subsectioncommonmethods}). The prediction model accuracy for the other methods is presented in section \ref{subsectionotherprediction} to highlight the generality of the presented methodology.
%In the next two subsections we present two prediction models for the main binarization methods : Otsu's \cite{otsu1975threshold} and Sauvola's \cite{sauvola2000adaptive}. We chose two different types of approach to binarise a document in order to show that even if our measure are defined on a global level they can be used to predict adaptive local methods like Sauvola's.
To estimate the f-score of a binarization algorithm, we automatically build a prediction model based on the most significant features among the $18$. More precisely, the prediction model is computed using multivariate stepwise linear regression \cite{thompson1978selectionp1,thompson1978selectionp2,hocking,cohen}, followed by repeated random subsampling validation (cross-validation).
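Repeated random subsampling validation can be sketched as follows: repeatedly split the data into train/test subsets, fit the linear model on the training part, then compare predicted and ground-truth values on the test part to obtain a slope $\alpha$ and an $R^2$ that are finally averaged. This is a minimal sketch on synthetic data, not the paper's R script:

```python
import numpy as np

def repeated_subsampling(X, y, n_splits=100, test_frac=0.3, seed=0):
    """Average slope (alpha) and R^2 of predicted-vs-true values over
    repeated random train/test splits of a linear model."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = max(1, int(test_frac * n))
    alphas, r2s = [], []
    for _ in range(n_splits):
        idx = rng.permutation(n)
        test, train = idx[:n_test], idx[n_test:]
        # Ordinary least squares with an intercept column.
        A = np.column_stack([np.ones(len(train)), X[train]])
        coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        pred = np.column_stack([np.ones(len(test)), X[test]]) @ coef
        # Slope of true ~ predicted (ideally close to 1).
        alphas.append(np.polyfit(pred, y[test], 1)[0])
        ss_res = np.sum((y[test] - pred) ** 2)
        ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
        r2s.append(1.0 - ss_res / ss_tot)
    return np.mean(alphas), np.mean(r2s)

# Synthetic dataset: 3 features, nearly linear target.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 0.8 + X @ np.array([0.5, -0.3, 0.2]) + rng.normal(0, 0.01, 200)
alpha_bar, r2_bar = repeated_subsampling(X, y)
```

A mean slope near 1 together with a high mean $R^2$ is what the paper reports as the pair $(\bar{\alpha}, \bar{R^{2}})$.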
The linear regression models, as an hyperplane, the relationship between the features and the groundtruthed fscores. This result can then be used to predict a fscore according to a set of computed features. The prediction can be improved by using only a pertinent subset of features among the $18$ independent computed features.
+The linear regression models the relationship between the features and the ground-truth f-scores as a hyperplane. This result can then be used to predict an f-score from a set of computed features. The prediction can be improved by using only a pertinent subset of features among the $18$ independently computed features.
There are three main ways to carry out this selection. First, the forward strategy consists of computing a criterion (linked to the $R^2$ value) while adding one feature at a time. Conversely, the backward strategy starts with all the features and deletes them one at a time, recomputing the criterion after each deletion. The last strategy consists of testing all the possible combinations. As we have only 18 features, we decided to use the exhaustive strategy.
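The exhaustive strategy can be sketched by scoring every non-empty feature subset with adjusted $R^2$ and keeping the best one. This is illustrative only; the paper relies on R's stepwise machinery:

```python
import numpy as np
from itertools import combinations

def best_subset(X, y):
    """Exhaustive subset selection: fit OLS on every non-empty subset of
    columns of X and return the subset maximizing adjusted R^2."""
    n, p = X.shape
    best = (None, -np.inf)
    for k in range(1, p + 1):
        for subset in combinations(range(p), k):
            A = np.column_stack([np.ones(n), X[:, subset]])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            resid = y - A @ coef
            r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
            adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)  # penalize model size
            if adj > best[1]:
                best = (subset, adj)
    return best

# Synthetic data: only features 0 and 2 actually matter.
rng = np.random.default_rng(2)
X = rng.normal(size=(150, 5))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(0, 0.05, 150)
subset, adj_r2 = best_subset(X, y)
```

With 18 features this loop visits $2^{18}-1$ subsets, which remains tractable and is why the exhaustive strategy is affordable here.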
This overall process which is simplified on figure \ref{shema} can be divided into five steps\footnote{The overall R project script and our evaluation data can be downloaded from the following website \texttt{https://bitbucket.org/vrabeux/qualityevaluation}}~:
+This overall process, which is presented in Figure \ref{shema}, can be divided into five steps\footnote{The overall R project script and our evaluation data can be downloaded from the following website: \texttt{https://bitbucket.org/vrabeux/qualityevaluation}}~:
 \item \textbf{Fscores computation:} We run the binarization algorithm and compute the fscore for each image by comparing the binarization result and the ground truth. In the following, these groundtruth fscores and denoted by $f_i, i\in[0,N]$ (with $N=\#images$)
+ \item \textbf{F-score computation:} We run the binarization algorithm and compute the f-score for each image by comparing the binarization result and the ground truth. In the following, these ground-truth f-scores are denoted by $f_i, i\in[0,N]$ (with $N$ the total number of images).
\item \textbf{Generation of the predictive model}: This step consists of applying a stepwise multivariate linear regression to the overall dataset, allowing us to select the most significant features for predicting the given binarization algorithm. Keeping all features in each prediction model would lead to overfit models. Indeed, some features may not be significant for predicting a specific binarization method. Moreover, even if a feature is highly correlated with the accuracy of an algorithm, it may have a weak contribution to the final prediction model. The output of this step is a linear function that gives a predicted f-score value for any image, for one binarization algorithm, knowing the selected features. We denote by $\hat{f_i}, i\in[0,N]$ the predicted f-scores.
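The f-score of the first step, for a binary ink mask compared against its ground truth, reduces to precision and recall over ink pixels. A minimal sketch (assuming ink is encoded as 1/True):

```python
import numpy as np

def f_score(binarized, ground_truth):
    """F-score between a binarized image and its ground truth,
    where ink pixels are True (or 1) and background False (or 0)."""
    b = np.asarray(binarized, dtype=bool)
    g = np.asarray(ground_truth, dtype=bool)
    tp = np.logical_and(b, g).sum()    # ink correctly kept
    fp = np.logical_and(b, ~g).sum()   # background marked as ink
    fn = np.logical_and(~b, g).sum()   # ink lost
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gt = np.array([[1, 1, 0, 0],
               [1, 1, 0, 0]])
out = np.array([[1, 0, 0, 0],
                [1, 1, 1, 0]])
# tp=3, fp=1, fn=1 -> precision = recall = 0.75 -> f-score = 0.75
```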
The prediction model for Otsu's, Sauvola's and Shijian's binarization algorithms were generated with the methodology described in section~\ref{subsectionprediction}. The coefficients associated with the most significant selected features, their pvalues and the intercept of the linear predictive function are detailed in Tables~\ref{otsuPredictionModel}, \ref{sauvolaPredictionModel} and \ref{shijianPredictionModel}. If a feature is not present in a table, then it was not selected by the step wise algorithm. As mentioned in the previous section, the cross validation for each model gives the pair $(\bar{\alpha}, \bar{R^{2}})$.
+The prediction models for Otsu's, Sauvola's and Shijian Lu's binarization algorithms were generated with the methodology described in section~\ref{subsectionprediction}. The coefficients associated with the most significant selected features, their p-values and the intercept of the linear predictive function are detailed in Tables~\ref{otsuPredictionModel}, \ref{sauvolaPredictionModel} and \ref{shijianPredictionModel}. If a feature is not present in a table, then it was not selected by the stepwise algorithm. As mentioned in the previous section, the cross-validation for each model gives the pair $(\bar{\alpha}, \bar{R^{2}})$.
The most significant selected features for Otsu's prediction model are $ \mIInk $, $ v_{I}$, $ v_{B} $, $ \mu_{B} $, $\mu$ and $v$ (see Table~\ref{otsuPredictionModel} for the coefficients of the predictive function). For Otsu's prediction model, we can explain the feature selection by the fact that Otsu's binarization method is based on global thresholding. That is why features such as $\mIInk$, $\mu$ and $v$ are significant and have such low p-values. The model's $ R^{2}$ equals $0.93$, which is considered very good \cite{cohen}.
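Otsu's global thresholding, whose reliance on global statistics explains the selected features above, can be sketched in a few lines: maximize the between-class variance over all candidate gray-level thresholds (an illustrative re-implementation, not the paper's code):

```python
import numpy as np

def otsu_threshold(pixels):
    """Otsu's global threshold: pick the gray level maximizing the
    between-class variance of the two resulting pixel classes."""
    pixels = np.asarray(pixels).ravel()
    hist, _ = np.histogram(pixels, bins=256, range=(0, 256))
    prob = hist / hist.sum()
    levels = np.arange(256)
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()  # class weights
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (levels[:t] * prob[:t]).sum() / w0  # class means
        mu1 = (levels[t:] * prob[t:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t

# Bimodal image: ink around 40, background around 200.
rng = np.random.default_rng(3)
img = np.concatenate([rng.normal(40, 8, 400),
                      rng.normal(200, 8, 600)]).clip(0, 255)
t = otsu_threshold(img)
```

Because the threshold depends only on the global histogram, global features such as $\mu$ and $v$ are natural predictors of this method's accuracy.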
\caption{Otsu prediction model : all selected features are significant (pvalue $<0.1$), and the model is likely to correctly predict future unknown images given that the $R^{2}$ value is higher than $0.9$. The mean percentage error is denoted by $mpe$.}
+\caption{Otsu's prediction model: all selected features are significant (p-value $<0.1$), and the model is likely to correctly predict future unknown images given that the $R^{2}$ value is higher than $0.9$. The mean percentage error is denoted by $mpe$.}
The selected features and their estimated coefficients for Shijian's prediction model are presented in Table \ref{shijianPredictionModel}. The step wise linear regression selects two spatial deformation features, $\mA$ and $\mSG$, and a global feature, $\mIBack$. This choice is not surprising because this method is a combination of several global and window based techniques. The prediction model is also very accurate ($0.86$). The cross validation gives a $\bar{R^{2}}$ of 0.99 and a mean slope $\bar{\alpha}$ of 1.06.
+The selected features and their estimated coefficients for Shijian Lu's prediction model are presented in Table \ref{shijianPredictionModel}. The stepwise linear regression selects two spatial deformation features, $\mA$ and $\mSG$, and a global feature, $\mIBack$. This choice is not surprising because this method is a combination of several global and window-based techniques. The prediction model is also very accurate ($0.86$). The cross-validation gives a $\bar{R^{2}}$ of 0.99 and a mean slope $\bar{\alpha}$ of 1.06.
The methodology previously explained allows the creation of an accurate prediction model for any binarization method. Given a document image, a binarization method and its prediction model, we can compute all of the features required by the model and use them as inputs. The result is the predicted accuracy of this specific binarization method for this specific image. Given several binarization prediction models, we can create a binarization process that uses these prediction models to select the optimal binarization method for each image of a dataset.
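Given one linear prediction model per binarization method, the selection step reduces to evaluating each model on the image's feature vector and keeping the argmax. A minimal sketch with hypothetical intercepts and coefficients (the real ones come from the fitted models in Tables \ref{otsuPredictionModel}, \ref{sauvolaPredictionModel} and \ref{shijianPredictionModel}):

```python
# Hypothetical models: {method: (intercept, coefficients over selected features)}.
# Real values would come from the fitted stepwise regressions.
models = {
    "otsu":    (0.60, {"mu": 0.002, "v": -0.001}),
    "sauvola": (0.70, {"mu": 0.001, "v_B": -0.003}),
    "lu":      (0.85, {"mA": -0.010, "mIBack": 0.0005}),
}

def select_method(features, models):
    """Predict an f-score with each linear model and return the best method."""
    def predict(model):
        intercept, coefs = model
        return intercept + sum(c * features[name] for name, c in coefs.items())
    scores = {name: predict(m) for name, m in models.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical feature values computed on one document image.
features = {"mu": 120.0, "v": 900.0, "v_B": 50.0, "mA": 0.4, "mIBack": 30.0}
best, scores = select_method(features, models)
```

Only the features actually selected by each model need to be computed, which keeps the per-image cost of the selection low.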
For instance, Shijian's method is a performant binarization method which gives the best results on average. However, in some borderline cases Shijian's significantly fails while other methods perform better. This is illustrated in Figure \ref{figshijianfails} where the bleedthrough defect disrupts methods which use a local analysis of the image.
+For instance, Shijian Lu's method is a high-performing binarization method which gives the best results on average. However, in some borderline cases Shijian Lu's method significantly fails while other methods perform better. This is illustrated in Figure \ref{figshijianfails}, where the bleed-through defect disrupts methods that use a local analysis of the image.
\caption{Sophisticated binarization algorithms do not always give the best output : a. original image, b. Shijian's binarization output, c. Sauvola's binarization output, d. Otsu's binarization output. Ostu's algorithm has the best performances on this specific image.}
+\caption{Sophisticated binarization algorithms do not always give the best output: original image (a), Shijian Lu's binarization output (b), Sauvola's binarization output (c) and Otsu's binarization output (d). Otsu's algorithm performs best on this specific image.}
More generally, table \ref{selectionRes} presents some fscore statistics obtained from binarizing the DIBCO dataset. The first line corresponds to the best theoretical fscores (having the ground truth, we know for each image the binarization method that will provide the best fscore). The second line corresponds to the fscores obtained using only Shijian's method. The last line corresponds to the fscores obtained using our automatic binarization selection.
+More generally, Table \ref{selectionRes} presents some f-score statistics obtained from binarizing the DIBCO dataset. The first line corresponds to the best theoretical f-scores (having the ground truth, we know for each image the binarization method that will provide the best f-score). The second line corresponds to the f-scores obtained using only Shijian Lu's method. The last line corresponds to the f-scores obtained using our automatic binarization selection.
We analyse the accuracy of our binarization method selection algorithms in several ways. As expected, the method has only a slightly better (2\%) mean accuracy than using only Shijian's method. What is significant is that the standard deviation lowers from $0.12$ to $0.04$. It means that the worst binarization result of our method is much higher than Shijian's (56\%).
+We analyse the accuracy of our binarization method selection algorithm in several ways. As expected, the method has only a slightly better (2\%) mean accuracy than using Shijian Lu's method alone. What is significant is that the standard deviation drops from $0.12$ to $0.04$. This means that the worst binarization result of our method is much higher than Shijian Lu's (56\%).
We also compared our method with the optimal selection that we can compute from the ground truth. The results are very similar, indicating that the prediction models are accurate enough to select the best binarization method for each image (70\% perfect match). The mean error of our method is $0.009$ (standard deviation equals $0.02$), and the worst error equals $0.06$.
\caption{Binarization of the DIBCO dataset. Comparison between the best theoretical fscore (computed from the ground truth), fscores obtained using only Shijian's method and fscores obtained from our automatic selection.}
+\caption{Binarization of the DIBCO dataset. Comparison between the best theoretical f-score (computed from the ground truth), the f-scores obtained using only Shijian Lu's method, and the f-scores obtained from our automatic selection.}