3ca1ce7
committed
Commits
Comments (0)
Files changed (4)

+5 2IJDAR/ijdar.tex

+2 2IJDAR/intro.tex

+3 4IJDAR/measures.tex

+12 9IJDAR/prediction.tex
IJDAR/ijdar.tex
This paper presented $18$ features that characterize the quality of a document image. These features are used in stepwise multivariate linear regression to create prediction models for $12$ binarization methods. However, 2 prediction models are rejected due to the lack of accuracy of the latters. New dedicated features have to be created and used in the presented methodology to circumvent this issue.
Repeated random subsampling crossvalidation shows that the models are accurate (max percentage error equals 11\%) and can be used to automatically choose the best binarization method. Moreover, given the stepwise approach of the linear regression, these models are not over parameterized.
+This paper presented $18$ features that characterize the quality of a document image. These features are used in stepwise multivariate linear regression to create prediction models for $12$ binarization methods. Repeated random subsampling crossvalidation shows that the models are accurate (max percentage error equals 11\%). Moreover, given the stepwise approach of the linear regression, these models are not over parameterized. As a result, 10 models out of 12 are validated and show sufficient accuracy to be used in an automated selection method of the optimal binarization method for each image.
%In \cite{rabeux2011ancient}, similar features are used with a multivariate linear regression to predict the OCR error rate.
Our second research goal is to improve the binarization algorithm selection method. We believe that the method can be tuned by studying different strategies. One notion is to take into account $R^{2}$ and pvalues measures in the automatic selection of a method. Another idea is to weight predicted fscores with computational costs: with similar accuracy, choosing the quickest one may be preferable.
+At last, 2 prediction models are rejected due to the lack of accuracy of the latters. New dedicated features have to be created and used in the presented methodology to circumvent this issue.
We would like to thanks the DIBCO team for providing datasets. This work was completed within the DIGIDOC project financed by the ANR (\textit{Agence Nationale de la Recherche})
IJDAR/intro.tex
As ancient documents often present significant degradation, we focus this paper on their quality evaluation. However, the global methodology is suited for any damaged document images.
Essentially, most document analysis systems are built by sequentially applying algorithms (preprocessing, binarization, layout analysis, OCR, indexing). These chains of algorithms are \textit{adhoc} workflows, built for a specific set of images. In such a workflow, the result of one algorithm can affect the result of all of the following ones. For example, Figure \ref{figurebinarisationversussegmentation} illustrates the effect of a binarization algorithm on the result of a layout analysis. Thus, choosing the best algorithm available for one specific image is very important at each step of a workflow. This choice needs to be automated to avoid tremendous manual work.
+Essentially, most document analysis systems are created by sequentially applying algorithms (preprocessing, binarization, layout analysis, OCR, indexing). These chains of algorithms are \textit{adhoc} workflows, built for a specific set of images. In such a workflow, the result of one algorithm can affect the result of all of the following ones. For example, Figure \ref{figurebinarisationversussegmentation} illustrates the effect of a binarization algorithm on the result of a layout analysis. Thus, choosing the best algorithm available for one specific image is very important at each step of a workflow. This choice needs to be automated to avoid tremendous manual work.
We propose another approach, based on algorithm prediction models, to select the best algorithm for a specific task. Our approach is based on the following fact: the global quality of a document image directly affects the result of any processing algorithm (binarization, segmentation). Thus, we aim to predict the result of an algorithm according to the degradation type and quantity of the processed document. In this paper, we focus on binarization algorithm prediction.
For a given binarization algorithm and a set of groundtruthed binarized images, the significant correlation between algorithm performance and the quality of the images allows us to build a prediction function. The document image quality is characterized by new, dedicated features. This prediction function can forecast the binarization algorithm result for any new image on which quality features have been previously computed.
+For a given binarization algorithm and a set of images with their binarization groundtruth, the significant correlation between algorithm performance and the quality of the images allows us to build a prediction function. The document image quality is characterized by new, dedicated features. This prediction function can forecast the binarization algorithm result for any new image on which quality features have been previously computed.
%On the contrary, with algorithms prediction models, the optimal selection of an algorithm can be done in an automated way.
%Knowing the type and quantity of the document degradations, we are able to create precise error rates prediction models of such algorithms. With the resulting prediction we can choose the best performing algorithm or decide to use another solution (hand made transcriptions for the OCR for example).
IJDAR/measures.tex
This section analyzes the $18$ features computed on two different document images containing several defects (Table \ref{measuresExamplesOnRealImages}).
%This analysis aims to show that these values are useful as input for a prediction binarization algorithm.
These two examples emphasize the link between the $18$ features and the fscore obtained after having binarized the images with Otsu's and Sauvola's methods (global {\it versus} local thresholding method). There are multiple ways to measure binarization accuracy. In this paper, we used the fscore.
+These two examples emphasize the link between the $18$ features and the fscore obtained after having binarized the images with Otsu's and Sauvola's methods (a global {\it versus} a local thresholding method). There are multiple ways to measure binarization accuracy. In this paper, we used the fscore.
The first document image (Table \ref{measuresExamplesOnRealImages}, line 1) is damaged by a large spot that overlaps text lines. The graylevels of the spot are close to the graylevel of the text pixels. Because the Otsu method is based on a global threshold, the spot pixels tend to be misclassified as ink. On the contrary, the local method is more likely to achieve a correct separation of ink and background on the defective area, which explains why the respective fscores of the Otsu and Sauvola methods are $0.4$ and $0.7$ on this image. The second document image (Table \ref{measuresExamplesOnRealImages}, line 2) presents a noneven background with speckles. Moreover the ink color is light relative to the background color. On this image, the respective fscores of Otsu and Sauvola are $0.8$ and $0.4$. The Sauvola method is not robust to the background speckles, which are classified as ink. The faded ink defect is a drawback for a global method and lowers the performance of Otsu's method.
Table 2 shows that specific defects that reduce binarization performance are captured by the proposed features. Even if the global features based on histogram analysis are meaningful, they are not sufficient in that case to choose the best binarization method. The ink pixels' mean value $\mu_{I}$ of the first image is lower than that of the second one, indicating that the ink layer seems easier to identify using a global thresholding method. However, the skewness of the ink $s_{I}$ is negative, indicating that most pixels are concentrated on the right part of the distribution: there are more gray pixels than really dark pixels. The skewness of the second global histogram $s$ is much higher than that of the the first image, indicating that the background of the second image is easy to separate using a global thresholding method. This separation is confirmed by the global variance $v$. Without additional information, the global thresholding method seems adapted to the second image but we cannot draw a similar conclusion for the first image.
+Table 2 shows that specific defects that reduce binarization performance are captured by the proposed features. Even if the global features based on histogram analysis are meaningful, they are not sufficient in that case to choose the best binarization method. The ink pixels' mean value $\mu_{I}$ of the first image is lower than that of the second one, indicating that the ink layer seems easier to identify using a global thresholding method. However, the skewness of the ink $s_{I}$ is negative, indicating that most pixels are concentrated on the right part of the distribution: there are more gray pixels than really dark pixels. The skewness of the second global histogram $s$ is much higher than that of the the first image, indicating that the background of the second image is easy to separate using a global thresholding method. This separation is confirmed by the global variance $v$. Without additional information, the global thresholding method seems adapted to the second image but we cannot draw a similar conclusion for the first image.
In the first image, the values of $\mIInk$ and $\mIBack$ are low, indicating that a global thresholding method is likely to fail to correctly classify the pixels. The value of $\mSG$ is also high, indicating that there are large spots around the characters. Generally, windowbased methods have better results on this type of document.
\caption{Two document image examples from the DIBCO dataset and their feature vectors. The proposed features capture different degradation types (for example, ink spots, faded ink, background speckles) }
\multicolumn{5}{c}{Image} & \multicolumn{6}{c}{GrayScale Histogram} & \multicolumn{7}{c}{3mean clusters} \\
IJDAR/prediction.tex
\item Li \cite{li1998iterative} is a crossentropic thresholding method based on the minimization of an information theoretic distance (KullbackLeibler).
 \item Niblack \cite{niblack1985introduction} : is a locally adaptive thresholding method using pixels intensity variance.
+ \item Niblack \cite{niblack1985introduction} : is a locally adaptive thresholding method using pixels intensity variance.
\item Ridler \cite{calvard1978picture} is an iterative thresholding method based on twoclass Gaussian mixture models.
\item Shanbag \cite{shanbhag1994utilization} is a fuzzy entropic thresholding technique that considers fuzzy memberships as an indication of how strongly a gray value belongs to the background or to the foreground.
 \item Sauvola \cite{sauvola2000adaptive} is a locally adaptive thresholdingmethodusing pixel intensity variance.
+ \item Sauvola \cite{sauvola2000adaptive} is a locally adaptive thresholding algorithm using pixel intensity variance.
\item Shijian \cite{lu2010document} is a recent method based on an \textit{adhoc} combination of existing techniques. \cite{lu2010document} has proven to have very good accuracy on the ICDAR 2009 Binarization Contest.
%The measures introduced in this paper characterize a document's quality. They can be used in a lot of applications such as human perceived quality prediction or restoration algorithms selection. In this paper we focus on a use case that is rarely presented in the state of the art : the prediction of binarization methods accuracy.
To predict the accuracy of the binarization method, we follow a methodology based on a stepwise linear regression. Section \ref{subsectiondataset} presents the dataset we used to train and validate our prediction models. This predictive methodology can be applied to all types of binarization methods and is presented in a general way in section~\ref{subsectionprediction}. We then detail the prediction models corresponding to three popular binarization methods for document images: Otsu's, Sauvola's and Shijian's (section \ref{subsectioncommonmethods}). The prediction model accuracy for the other eight methods is briefly presented in section \ref{subsectionotherprediction} to highlight the generality of the presented methodology.
+To predict the accuracy of the binarization method, we follow a methodology based on a stepwise linear regression. Section \ref{subsectiondataset} presents the dataset we used to train and validate our prediction models. This predictive methodology can be applied to all types of binarization methods and is presented in a general way in section~\ref{subsectionprediction}. We then detail the prediction models corresponding to three popular binarization methods for document images: Otsu's, Sauvola's and Shijian's (section \ref{subsectioncommonmethods}). The prediction model accuracy for the other methods is presented in section \ref{subsectionotherprediction} to highlight the generality of the presented methodology.
%In the next two subsections we present two prediction models for the main binarization methods : Otsu's \cite{otsu1975threshold} and Sauvola's \cite{sauvola2000adaptive}. We chose two different types of approach to binarise a document in order to show that even if our measure are defined on a global level they can be used to predict adaptive local methods like Sauvola's.
%that are either easy to binarize (high mean fscore), or hard (low mean fscore) is needed. %Indeed, without an heterogenous dataset, we would train our prediction model on a subset of images. The trained prediction model would not be usable in real life.
We develop a new dataset by merging the DIBCO\footnote{http://users.iit.demokritos.gr/~bgat/DIBCO2009/} and HDIBCO\footnote{http://users.iit.demokritos.gr/~bgat/HDIBCO2010/} datasets \cite{gatos2009icdar,pratikakis2010h,pratikakis2011icdar}. These datasets are primarily used as data for binarization contests and contain a heterogeneous set of images from difficult to easy to binarize. Table \ref{fscoredistrib} summarizes some statistical results of the $11$ binarization algorithms applied to all DIBCO images ($36$ images).%\footnote{The overall merged dataset can be obtained at http://sd22392.dedibox.fr:8080/DoQuBookWeb/pagecollections.jsp}).
+We develop a new dataset by merging the DIBCO\footnote{http://users.iit.demokritos.gr/~bgat/DIBCO2009/} and HDIBCO\footnote{http://users.iit.demokritos.gr/~bgat/HDIBCO2010/} datasets \cite{gatos2009icdar,pratikakis2010h,pratikakis2011icdar}. These datasets are primarily used as data for binarization contests and contain a heterogeneous set of images from difficult to easy to binarize. Table \ref{fscoredistrib} summarizes some statistical results of the $12$ binarization algorithms applied to all DIBCO images ($36$ images).%\footnote{The overall merged dataset can be obtained at http://sd22392.dedibox.fr:8080/DoQuBookWeb/pagecollections.jsp}).
 \item \textbf{Fscores computation:} We run the binarization algorithm and compute the fscore for each image by comparing the binarization result and the ground truth. In the following, these fscores are called ground truth fscores and denoted by $f_i, i\in[0,N]$ (with $N=\#images$)
+ \item \textbf{Fscores computation:} We run the binarization algorithm and compute the fscore for each image by comparing the binarization result and the ground truth. In the following, these groundtruth fscores and denoted by $f_i, i\in[0,N]$ (with $N=\#images$)
\item \textbf{Generation of the predictive model}: This step consists of applying a step wise multivariate linear regression to the overall dataset, allowing us to select the most significant features for predicting the given binarization algorithm. Keeping all features in each prediction model would lead to overfitt models. Indeed, some features may not be significant for predicting a specific binarization method. Moreover, even if a feature is highly correlated to the accuracy of an algorithm, it may have a weak contribution to the final prediction model. The output of this step is a linear function that gives a predicted fscore value for any image, for one binarization algorithm, knowing the selected features. We denote by $\hat{f_i}, i\in[0,N]$ the predicted fscores.
The prediction model for Otsu's, Sauvola's and Shijian's binarization algorithms were generated with the methodology described in section~\ref{subsectionprediction}. The coefficients associated with the most significant selected features, their pvalues and the intercept of the linear predictive function are detailed in Tables~\ref{otsuPredictionModel}, \ref{sauvolaPredictionModel} and \ref{shijianPredictionModel}. If a feature is not present in a table, then it was not selected by the step wise algorithm. As mentioned in the previous section, the cross validation for each model gives the pair $(\bar{\alpha}, \bar{R^{2}})$.
The most significant selected features for Otsu's prediction model are $ \mIInk $, $ v_{I}$, $ v_{B} $, $ \mu_{B} $, $\mu$ and $v$ (see Table~\ref{otsuPredictionModel} for the coefficients of the predictive function). For Otsu's prediction model, we can explain the feature selection by the fact that Otsu's binarization method is based on global thresholding. That is why features such as $\mIInk$, $\mu$ and $v$ are significant and have such low pvalues. The model's $ R^{2}$ equals $0.93$, which is considered very good \cite{cohen}.
+The most significant select\ed features for Otsu's prediction model are $ \mIInk $, $ v_{I}$, $ v_{B} $, $ \mu_{B} $, $\mu$ and $v$ (see Table~\ref{otsuPredictionModel} for the coefficients of the predictive function). For Otsu's prediction model, we can explain the feature selection by the fact that Otsu's binarization method is based on global thresholding. That is why features such as $\mIInk$, $\mu$ and $v$ are significant and have such low pvalues. The model's $ R^{2}$ equals $0.93$, which is considered very good \cite{cohen}.
The crossvalidation gives a $\bar{\alpha}$ coefficient of $0.989$ and $\bar{R^{2}}$ of $0.987$. These results indicate that our model does not depend on the chosen training data.
Among the 18 features, most models embed about 7 features. Globally the selected features are consistent with the binarization algorithm : the step wise selection process tends to keep global (resp. local) features for global (resp. local) binarization algorithms. We also note that $\mS$ is never selected by any prediction model. Indeed, the binarization accuracy is measured at the pixel level (fscore). With this accuracy measure, the feature $\mSG$ becomes more significant than $\mS$, which may not have been the case with another evaluation measure.
The $R^{2}$ values show the quality of each prediction model. The prediction models of Sahoo and Niblack binarization methods were not kept for the statistical validation step since the $R^{2}$ values were below 0.7. We can point out that the $R^2$ value lowers when few dedicated features are selected in the model (e.g. 1 for Sahoo's and 0 for Niblack's). Therefore new features should be designed in theses cases in order to obtain more accurate prediction models.
+The $R^{2}$ values show the quality of each prediction model. The prediction models of Sahoo and Niblack binarization methods were not kept for the statistical validation step since the $R^{2}$ values were below 0.7. For these two binarization models new features have to be created in order to obtain more accurate prediction models.
The two values $\bar{R^2}$ and $mpe$ show the accuracy of each prediction model on the validation step. A $\bar{R^{2}}$ value higher than $0.7$ indicates that it is possible to predict the results of a binarization method~\cite{cohen}. As a result, $12$ binarization methods can be well predicted. The mean percentage error ($mpe$) is the average difference between predicted fscores and real fscores. This value is around $5\%$.
\caption{Accuracy of the prediction model for the other eight binarization methods. The selected features are different from one method to another. The accuracy and robustness of the prediction models are good (cross validation $\bar{R^{2}} > 0.7$). The mean percentage error of each model is denoted by $mpe$.}
+\caption{Accuracy of the prediction model for the other binarization methods. The selected features are different from one method to another. Two models where not kept for the crossvalidation step due to their low $R^{2}$ value ($<0.7$). The accuracy and robustness of the other prediction models are good (cross validation $\bar{R^{2}} > 0.7$). The mean percentage error of each model is denoted by $mpe$.}