e4d2c0d

committed
# Commits

# Files changed (1)

# ICDAR2013/ICDAR2013_paper/icdar.tex

-The previous global features characterizing the histograms cannot precisely represent the relationship between the ink layer, the degradation layer and the background layer. Therefore, we introduce two last global features extracted from the grayscale histogram to characterize the distance between the three layers : $ \mIInk $ and $ \mIBack $, where $ \mIInk $ corresponds to the distance between the average intensity of degradation pixels and the average intensity of ink pixels and, $ \mIBack $ is the distance between the average intensity of degradation pixels and the average intensity of background pixels.

+The previous global features characterizing the histograms cannot precisely represent the relationship between the ink layer, the degradation layer and the background layer. Therefore, we introduce two final global features, extracted from the grayscale histogram, to characterize the distances between the three layers: $ \mIInk $ and $ \mIBack $, where $ \mIInk $ is the distance between the average intensity of degradation pixels and the average intensity of ink pixels, and $ \mIBack $ is the distance between the average intensity of degradation pixels and the average intensity of background pixels. Both are defined for images with an 8-bit intensity range.

+\caption{Example on an image from the DIBCO dataset : extraction of the degradation layer and features values.}

-\caption{Example on an image from the DIBCO dataset : extraction of the degradation layer and features values.}

The most significant measures selected are: $ \mIInk $, $ v_{i}$, $ v_{b} $, $ \mu_{b} $, $\mu$ and $v$. This can be explained by the fact that Otsu's binarization method is based on thresholding a global grayscale histogram, which is why measures such as $\mIInk$, $\mu$ and $v$ are significant and have such low p-values. The estimated coefficients are presented in Table \ref{otsuPredictionModel}. Repeating a random sub-sampling validation 100 times gives a mean slope coefficient of 0.989 and a mean $R^{2}$ of 0.987. This cross-validation step estimates how well the predictive model will perform in practice.

+\caption{Otsu prediction model: all measures are significant ($p$-value $< 0.1$); the model is also likely to correctly predict future unknown images, given that the $R^{2}$ and adjusted $R^{2}$ measures are higher than 0.9. }

-\caption{Otsu prediction Model : all measures are significant ($p-value <0.1$), the model is also likely to predict correctly future unknown images given that the $R^{2}$ measures and adjusted $R^{2}$ measure are higher that 0.9. }

+\caption{Sauvola prediction model: all measures are significant ($p$-value $< 0.1$); the model is also likely to correctly predict future unknown images, given that $R^{2}$ equals 0.8 and the adjusted $R^{2}$ equals 0.77. }

-\caption{Sauvola prediction Model : all measures are significant ($p-value <0.1$), the model is also likely to predict correctly future unknown images given that the $R^{2}$ equals 0.8 and adjusted $R^{2}$ equals 0.77. }

+\caption{Shijian prediction Model : the model is likely to predict correctly future unknown images given that the $R^{2}$ equals 0.86 and adjusted $R^{2}$ equals 0.82. }

-\caption{Shijian prediction Model : the model is likely to predict correctly future unknown images given that the $R^{2}$ equals 0.86 and adjusted $R^{2}$ equals 0.82. }

+\caption{Binarization of the DIBCO dataset. Comparison between the best theoretical f-score (computed from the ground truth), f-scores obtained using only Shijian's method and f-scores obtained from our automatic selection.}

-\caption{Binarization of the DIBCO dataset. Comparison between the best theoretical f-score (computed from the ground truth), f-scores obtained using only Shijian's method and f-scores obtained from our automatic selection.}

-This paper presented $18$ features that characterize the quality of a document image. These features are used a in step-wise multivariate linear regression to create prediction models for $11$ binarization methods. Repeated random sub-sampling cross-validation shows that $10$ of $11$ models are very accurate and can be used to automatically choose the best binarization method. Moreover, given the step-wise approach of the linear regression, these models are not over parameterized.

+This paper presented $18$ features that characterize the quality of a document image. These features are used in a step-wise multivariate linear regression to create prediction models for $11$ binarization methods. Repeated random sub-sampling cross-validation shows that $10$ of the $11$ models are very accurate and can be used to automatically choose the best binarization method. Moreover, given the step-wise approach of the linear regression, these models are not over-parameterized. One of our future research goals is to apply the same methodology to predict OCR error rates.

%However, OCR engines today are very complex systems that can restore documents and perform layout analysis. Therefore, OCR failures result not only from a document's quality but also from its complexity (fonts, tables, figures, mathematical formulas). This complexity has to be evaluated with new OCR-dedicated features. Our second research goal is to improve the binarization algorithm selection method. We believe that the method can be tuned by studying different strategies. One idea is to take the $R^{2}$ and p-value measures into account in the automatic selection of a method. Another is to weight predicted f-scores by computational cost: at similar accuracy, choosing the fastest method may be preferable.