4b0c01b
Merge
committed
Commits
Comments (0)
Files changed (7)

+1 0.hgtags

+152 124ICDAR2013/ICDAR2013_paper/icdar.tex

+18 0IJDAR/ijdar.bib

+1 1IJDAR/ijdar.tex

+4 3IJDAR/intro.tex

+6 7IJDAR/measures.tex

+26 19IJDAR/prediction.tex
ICDAR2013/ICDAR2013_paper/icdar.tex
This article proposes an approach to predict the result of binarization algorithms on a given document image according to its state of degradation. Indeed, historical documents suffer from different types of degradation which result in binarization errors. We intend to characterize the degradation of a document image by using different features based on the intensity, quantity and location of the degradation. These features allow us to build prediction models of binarization algorithms that are very accurate according to $R^2$ values and pvalues. The prediction models are used to select the best binarization algorithm for a given document image. Obviously, this imagebyimage strategy improves the binarization of the entire dataset.
+This article proposes an approach to predict the result of binarization algorithms on a given document image according to its state of degradation. Indeed, historical documents suffer from different types of degradation which result in binarization errors. We intend to characterize the degradation of a document image by using different features based on the intensity, quantity and location of the degradation. These features allow us to build prediction models of binarization algorithms that are very accurate according to \boldmath$R^2$ values and pvalues. The prediction models are used to select the best binarization algorithm for a given document image. Obviously, this imagebyimage strategy improves the binarization of the entire dataset.
This paper involves quality evaluation of ancient document images.% Document quality evaluation is needed at every stage of the digitization workflow, for instance in the scanning stage to ensure that the scanner's settings are optimal, in the processing stage to apply the best algorithms (for example, restoration, binarization, OCR) and in the visualization stage to provide the best image quality. We focus here on the processing stage.
 We propose a methodology based on algorithm prediction models to select the best algorithm for a specific task. Our approach is based on the following fact : the global quality of a document image directly affects the result of any processing algorithm (binarization, segmentation). We aim to predict the result of an algorithm according to the degradation type and quantity of the processed document. In this paper, we focus on binarization prediction.
+This paper involves quality evaluations of ancient document images.% Document quality evaluation is needed at every stage of the digitization workflow, for instance in the scanning stage to ensure that the scanner's settings are optimal, in the processing stage to apply the best algorithms (for example, restoration, binarization, OCR) and in the visualization stage to provide the best image quality. We focus here on the processing stage.
+We propose a methodology based on algorithm prediction models for selecting the best algorithm for a specific task. Our approach is based on the following fact : the global quality of a document image directly impacts the result of any processing algorithm (binarization, segmentation,...). We thus propose to predict the result of an algorithm according to the type and quantity of the degradation of the processed document. We focus now on binarization prediction.
For a given binarization algorithm and a set of groundtruthed binarized images, a prediction function is built by searching a significant correlation between algorithm performances and the quality of the images. The document image quality is characterized by new, dedicated features. This prediction function can then forecast the binarization algorithm result for any new image on which quality features have been previously computed.
+For a given binarization algorithm and a set of groundtruthed binarized images, a prediction function is built by searching a significant correlation between the algorithm performances and the quality of the images. The document image quality is measured with new dedicated features. The prediction function can then be used to predict the binarization algorithm result for any new image on which quality features have been previously computed.
 To our knowledge there are no studies on binarization prediction. The existing work on algorithm prediction in the field of document image analysis entails OCRs which typically use the quality features in order to create prediction models.
The first features related to character quality were introduced in \cite{blando1995prediction}. In this article the authors evaluate the quality of binary text documents by analysing black and white connected components. The OCR result is predicted by thresholding the quality measures (proportion of thick and broken characters). Each document image is finally labbeled as good or poor.
+ To our knowledge there are no work on binarization prediction. The existing work on algorithm prediction in the field of document image analysis entails OCRs which typically use the quality features in order to create prediction models.
+The first quality metrics were introduced in \cite{blando1995prediction}. In this article the authors evaluate the quality of binary text documents by analysing black and white connected components. The OCR result is predicted by thresholding the quality measures (proportion of thick and broken characters). Each document image is finally labbeled as good or poor.
In \cite{cannon1997automated} two new measures are introduced to account for speckles and touching characters. A linear regression is used to predict the OCR performance on handwritten black and white documents. The authors of \cite{gonzalez1998prediction} complete the set of measures with new ones, which are used as inputs to a neural network to classify the images in two classes (poor or good). By reusing a script identification engine, the method proposed in \cite{ablavsky2003} can select the better of two OCRs according to a classification of the text image as {\em broken}, {\em clean} or {\em merged}.
Other works propose strategies to select the best restoration algorithm. As in OCR prediction methods, dedicated defect features are computed on a binary image. These values are then used as inputs for different types of semisupervised classification algorithms. The authors of \cite{souza2003automatic} use the features of \cite{gonzalez1998prediction} with three new ones from \cite{cannon1999quality} to select a restoration algorithm using decision rules with thresholds. %The restoration algorithm selection is based on decision rules using thresholds. Another automatic restoration method selection is presented in \cite{cannon1999quality}. In this latter article, the restoration algorithm selection is based on a linear classifier.
+Other works propose strategies to select the best restoration algorithm. As in OCR prediction methods, dedicated defect features are computed on a binary image. These values are then used as inputs for different types of semisupervised classification algorithms. The authors of \cite{souza2003automatic} use the features of \cite{gonzalez1998prediction} with three new ones from \cite{cannon1999quality} to select a restoration algorithm using a linear classifier. %The restoration algorithm selection is based on decision rules using thresholds. Another automatic restoration method selection is presented in \cite{cannon1999quality}. In this latter article, the restoration algorithm selection is based on a linear classifier.
Previous methods suffer from two main drawbacks. First, most of them require a connected component extraction and, therefore, a binarization step. These methods strongly depend on the accuracy of this preprocessing step. We believe that a better approach consist of directly analyzing the defect pixels in the initial grayscale image. Second, none of the presented articles dealing with prediction models analyze the significance of each feature. %Only the authors of \cite{reed2008correlating} propose an interesting correlation analysis between several quality metrics and the parameters of the degradation model used to produce the test images. This preliminary study shows that, even if most features are highly correlated with the defects, some are not. However, this study does not address the essential issue of the selection of relevant quality features to avoid overparameterization of an algorithm prediction.
+Previous methods suffer from three main drawbacks. First, most of them require a connected component extraction and, therefore, a binarization step. These methods strongly depend on the accuracy of this preprocessing step. We believe that a better approach consist of directly analyzing the defect pixels in the initial grayscale image. Second, none of the presented articles dealing with prediction models analyze the significance of each feature. %Only the authors of \cite{reed2008correlating} propose an interesting correlation analysis between several quality metrics and the parameters of the degradation model used to produce the test images. This preliminary study shows that, even if most features are highly correlated with the defects, some are not. However, this study does not address the essential issue of the selection of relevant quality features to avoid overparameterization of an algorithm prediction.
In the following sections, we introduce different features that characterize ancient documents degradation. These features rely on a decomposition of the document gray levels in three different classes : ink pixels, degradation pixels and background pixels. We characterize the degradation layer by analyzing the distribution of its intensities, its quantity and its location within the image. The proposed features, dedicated to binarization evaluation, are presented in section \ref{measures}. Section \ref{prediction} details the methodology used for creating algorithms prediction models. Prediction models of several binarization methods are then presented, all of which present very high accuracy. Finally, section \ref{sectionoptimalselection} explains how to use the prediction models to select the best binarization algorithm for a specific image.
+In the following sections, we introduce different features characterizing ancient documents degradations. These features rely on a document gray levels decomposition in three different classes : ink pixels, degradation pixels and background pixels. We characterize the degradation layer by analyzing the distribution of its intensities, its quantity and its location within the image. The proposed features, dedicated to binarization evaluation, are presented in section \ref{measures}. Section \ref{prediction} details the methodology used for creating algorithms prediction models. Prediction models of several binarization methods are then presented, all of which present very high accuracy. Finally, section \ref{sectionoptimalselection} explains how to use the prediction models to select the best binarization algorithm for a specific image.
This section details new features used to characterize document image degradation. A first set of global features is extracted directly from grayscale histograms without spatial consideration. A second set of features characterizes the localization of the degradation.
We assume that an ancient document can be modeled as the combination of three different layers: the text pixel layer, the background pixel layer and the degradation pixel layer. Most of the degradation (for example, bleedthrough, spots, speckles, nonuniform illumination, ink loss) appears as connected components with grayscale values that differ from background and ink pixels. We distinguish the three different layers of pixels according to the pixels' gray level. Let us denote the gray level of pixel $p$ by $g(p)$. Let $\inkp$ be the set of ink pixels, $\degp$ be the set of degradation pixels and $\backp$ be the set of background pixels defined as follows:
+In this article, we do not measure each type of degradation separately. Indeed, we assume that an historical document can be modeled as the diffusion of several information layers. Most degradations altering a binarization result are in a layer composed of gray pixels different from ink and background pixels (bleedthrough, spots, speckles, nonuniform illumination, ink loss, ...). That's why we globally measure and characterize document degradations by distinguishing three different layers of pixels :
+ Setting the two thresholds $ s_{0} $ and $ s_{1} $ can be determined using any classification algorithm. Our experiments used a 3means clustering algorithm. We do not aim to enhance the document with a pixel close identification. Therefore, the layers dissociation does not need to be highly accurate. Table \ref{measuresExamplesOnRealImages} shows an example of a 3means clustering algorithm applied on a image with huge defects.
+Once the three different layers are dissociated, we extract the three corresponding sets of pixels : $\inkp = \{ p, g(p) \leq s_{0} \}$; $\degp = \{ p, s_{0} < g(p) < s_{1} \}$;$ \backp = \{ p, g(p) \leq s_{1} \}$.
+From these sets of pixels, we define three sets of values corresponding to the pixels intensities :
+$\inkInt = \{ g(p), p \in \inkp \}$; $ \grayInt = \{ g(p), p \in \degp \}$; $ \backInt = \{ g(p), p \in \backp \}$.
+Let $S$ be a set of pixels. We denote the set of the 4connected components of $S$ by $CC(S)$. In the rest of the section, we use the following notations : $\inkc = CC(\inkp)$, $\dc = CC(\degp)$ and $\backc = CC(\backp)$.
Setting the two thresholds $ s_{0} $ and $ s_{1} $ can be determined using any classification algorithm. Our experiments used a 3means clustering algorithm. Table \ref{measuresExamplesOnRealImages} shows that most degradation present in a document image can be extracted using these two thresholds.% Obviously, it is not possible to perfectly classify the image pixels using only the graylevel histogram.
+The following sections detail our proposition for characterizing the degradation layer previously extracted. A first group of features is extracted directly from the grayscale histogram. A second family of features is dedicated to the characterization of the local degradation surrounding ink components.
We compute the following global statistic features of the grayscale histogram: mean, variance and skewness. We denote the mean of the global histogram by $\mu$, its variance by $v$, and its skewness by $s$. The mean, variance and skewness are also computed on the three \emph{subhistograms} to characterize each layer distribution (ink, background and degradation):
The previous global features characterizing the histograms cannot precisely represent the relationship between the ink layer, the degradation layer and the background layer. Therefore, we introduce two last global features extracted from the grayscale histogram to characterize the distance between the three layers : $ \mIInk $ and $ \mIBack $, where $ \mIInk $ corresponds to the distance between the average intensity of degradation pixels and the average intensity of ink pixels and, $ \mIBack $ is the distance between the average intensity of degradation pixels and the average intensity of background pixels.
+The global grayscale histogram contains information characterizing the document's quality. We aim to compute the following global statistic features of the grayscale histogram: mean, variance and skewness. We denote the mean of the global histogram by $\mu$, its variance by $v$, and its skewness by $s$. The mean, variance and skewness are also computed on the three \emph{subhistograms} to characterize each layer distribution (ink, background and degradation).
+This step provides 12 features : $\mu$, $v$, $s$, $\mu_{\inkInt}$, $v_{\inkInt}$, $s_{\inkInt}$, $\mu_{\grayInt}$, $v_{\grayInt}$, $s_{\grayInt}$, $\mu_{\backInt}$, $v_{\backInt}$, $s_{\backInt}$.
+The previous global measures cannot precisely represent the relation between the ink layer, the degradation layer and the background layer. Therefore, we define two feature $ \mIInk $ and $ \mIBack $, where $ \mIInk $ corresponds to the distance between the average intensity of degradation pixels and the average intensity of ink pixels, and, $ \mIBack $ is the distance between the average intensity of degradation pixels and the average intensity of background pixels.
+%\caption{Two different images illustrating $\mIInk$ and $\mIBack$. a. contains one dark spot, the distance to the ink is low $\mIInk = (80  0) / 255 = 0.3 $ the distance to the background is high $ \mIBack = (25580)/255 = 0.7$
+%b. contains the same spot but with a lighter gray value, the distance to the ink is high $\mIInk = (180  0) / 255 = 0.7 $ the distance to the background lowers $ \mIBack = (255180)/255 = 0.3$.
The grayvalues of the three layers are not the only characteristics that could affect a binarization algorithm. The amount of degradation pixels is also directly correlated with the binarization performance.
We measure this performance as the relative quantity of ink and degradation pixels. We define $ \mQ $ as the following ratio : $\mQ = \frac{\card{\degp}}{\card{ \inkp }}$.
+We aim to measure this performance as the relative quantity of ink and degradation pixels. We define $ \mQ $ as the following ratio : $ \mQ = \frac{\card{\degp}}{\card{ \inkp }}$
+%Figure \ref{MQ} illustrates the value range of $ \mQ $ regarding to the ratio between the quantity of ink pixels and degraded pixels.
+%\caption{$\mQ$ example on two images : a. does not contain a lot of noise, $\mQ$ is low : $\mQ = 1 / 9 = 0.1 $, b. has much more degradation pixels $\mQ$ is higher : $\mQ = (1+3+14)/9=2 $
As a good binarization should preserve the shape of the objects and avoid the creation of unwanted black or white components, the location of the degradation pixels is a significant characteristic that can influence the binarization result. Figure \ref{locations} illustrates the main situations observed in real documents in which the degradation pixels spatially interfere with ink pixels.
Let $S$ be a set of pixels. We denote the set of the 4connected components of $S$ by $CC(S)$. In the rest of the section, we use the following notations : $\inkc = CC(\inkp)$, $\dc = CC(\degp)$ and $\backc = CC(\backp)$.
+The location of the degradation pixels is also a significant characteristic that needs to be considered and measured. Figure \ref{locations} illustrates the three main situations seen in real documents were the degradation pixels spatially interfere with ink pixels.
Let $c_{\inkp} \in \inkc$ be an ink component and $c_{\degp} \in \dc$ be a degradation component. We denote the predicate returning true if $ c_{\inkp} $ and $ c_{\degp} $ are connected by $SG(c_{\inkp}, c_{\degp})$:
$$ SG (c_{\inkp}, c_{\degp}) = \exists (p_{\inkp}, p_{\degp}) \in c_{\inkp} \times c_{\degp} \mid p_{\inkp} \mbox{~and~} p_{\degp} \mbox{~are 4connected}$$
\item If $c_{\inkp}$ and $c_{\degp}$ are not connected (figure \ref{locations}.a), the original character will not be altered by the binarization process. If this configuration occurs numerous times, the binarization can lead to a document image highly degraded by many small black spots between characters. Let $\cma$ be the set of degradation components that are not connected to any ink component~:
+\textbf{Case 1 :} If $c_{\inkp}$ and $c_{\degp}$ are not connected (figure \ref{locations}.a), the original character will not be altered by the binarization process. If this configuration occurs numerous times, the binarization can lead to a document image highly degraded by many small black spots between characters. Let $\cma$ be the set of degradation components that are not connected to any ink component~:
+ \cma = \{c_{\degp} \in \dc \mid \forall p_{\degp} \in c_{\degp}, \nexists c_{\inkp} \in \inkc, p_{\degp} \in \n(c_{\inkp}) \}
+\textbf{Case 2 :} %if $ c_{0} $ and $ c_{1} $ are connected (figure \ref{locations}.b), the original letter will be altered. %The quantity of touching connected components is measured by $ \mS $.
+If $c_{\inkp}$ and $c_{\degp}$ are connected (Figure \ref{locations}.b), the original character may be altered by the binarization: degraded pixels may be misclassified as ink pixels. Let $\cms$ be the set of all ink components that are connected to at least one degradation component:
+\cms = \{ c_{\inkp} \in \inkc \mid \exists p_{\inkp} \in c_{\inkp} \mbox{~and~} c_{\degp} \in \dc, p_{\inkp} \in \n(c_{\degp}) \}
%The range of this feature depends on the image size, but it can still be used to create a prediction model.% This is discussed in section \ref{prediction}.
+ The feature $ \mS $ is defined as the ratio between the number of ink components that may be expended by at least one degradation component and the total number of ink components : $ \mS = \displaystyle\frac{ \card{ \cms } }{ \card{ \inkc}} $
\item If $c_{\inkp}$ and $c_{\degp}$ are connected (Figure \ref{locations}.b), the original character may be altered by the binarization: degraded pixels may be misclassified as ink pixels. Let $\cms$ be the set of all ink components that are connected to at least one degradation component:
The feature $ \mS $ is defined as the ratio between the number of ink components that may be expended by at least one degradation component and the total number of ink components:
\item $ \mSG $ measures the possible extent of ink component deformation using the number of known ink components that may be modified by the binarization process. It is defined as the mean area of the pairs of components that satisfy $ SG $ over the mean area of all ink components:
+\textbf{Case 3 :} $ \mSG $ measures the possible extent of ink component deformation using the number of known ink components that may be modified by the binarization process. It is defined as the mean area of the pairs of components that satisfy $ SG $ over the mean area of all ink components. Let $ c_{\inkp} $ be an ink connected component and $ c_{\degp} $ a degradation connected component. We denote by $SG$ the predicate returning true if $ c_{\inkp} $ and $ c_{\degp} $ are touching :
+$\mSG$ can now be defined as the mean area of all connected components that satisfies $ SG $ over the mean area of all ink components :
\mSG = \frac{\displaystyle {Average}_{\{ (c_{\inkp}, c_{\degp})\mid SG (c_{\inkp}, c_{\degp})\}} (\card{c_{\inkp}} + \card{c_{\degp}})}{\displaystyle {Average}_{c_{\inkp} \in \inkc} (\card{c_{\inkp}}) } $$
+\mSG = \frac{\displaystyle {Average}_{\{ (c_{\inkp}, c_{\degp}), SG (c_{\inkp}, c_{\degp})\}} (\card{c_{\inkp}} + \card{c_{\degp}})}{\displaystyle {Average}_{c_{\inkp} \in \inkc} (\card{c_{\inkp}}) } $$
The higher $ \mSG $ is, the more likely it is that the document has large spots around ink components. Combined with other features (for example, $ \mIInk $), $ \mSG $ helps predict whether the spots lead to binarization errors.
Given all of the previously defined features, each document image is characterized by a vector of dimension $18$. An example is given in Table~\ref{measuresExamplesOnRealImages} which shows the degradation extraction and the values of the proposed features on one document image. The analysis of these values indicates that it may be preferable to use Sauvola's method to binarize this image. Indeed, the values of $\mIInk$ and $\mIBack$ are low meaning that a global thresholding method like Otsu's is likely to fail to correctly classify the pixels. The value of $\mSG$ is also high : there are large spots around the characters. Windowbased method have, most of the time, better results on this kind of documents. This hypothesis is confirmed with the fscore of Otsu's and Sauvola's methods. On this image, Ostu makes a score of 0.4 and Sauvola of 0.7.
%% The location of the degradation pixels is also a significant characteristic that needs to be considered and measured. Figure \ref{locations} illustrates the three main situations seen in real documents were the degradation pixels spatially interfere with ink pixels.
%% \caption{The different locations of a degradation component on the page: a. the degradation component is not connected to an ink component, b. a small degradation component is adjacent to an ink component, c. a large degradation component is adjacent to an ink component.}
%% \textbf{Case 1 :} If $c_{\inkp}$ and $c_{\degp}$ are not connected (figure \ref{locations}.a), the original character will not be altered by the binarization process. If this configuration occurs numerous times, the binarization can lead to a document image highly degraded by many small black spots between characters. Let $\cma$ be the set of degradation components that are not connected to any ink component~:
%% \cma = \{c_{\degp} \in \dc \mid \forall p_{\degp} \in c_{\degp}, \nexists c_{\inkp} \in \inkc, p_{\degp} \in \n(c_{\inkp}) \}
%% \textbf{Case 2 :} %if $ c_{0} $ and $ c_{1} $ are connected (figure \ref{locations}.b), the original letter will be altered. %The quantity of touching connected components is measured by $ \mS $.
%% If $c_{\inkp}$ and $c_{\degp}$ are connected (Figure \ref{locations}.b), the original character may be altered by the binarization: degraded pixels may be misclassified as ink pixels. Let $\cms$ be the set of all ink components that are connected to at least one degradation component:
%% \cms = \{ c_{\inkp} \in \inkc \mid \exists p_{\inkp} \in c_{\inkp} \mbox{~and~} c_{\degp} \in \dc, p_{\inkp} \in \n(c_{\degp}) \}
%% The feature $ \mS $ is defined as the ratio between the number of ink components that may be expended by at least one degradation component and the total number of ink components : $ \mS = \displaystyle\frac{ \card{ \cms } }{ \card{ \inkc}} $
%% \textbf{Case 3 :} $ \mSG $ measures the possible extent of ink component deformation using the number of known ink components that may be modified by the binarization process. It is defined as the mean area of the pairs of components that satisfy $ SG $ over the mean area of all ink components. Let $ c_{\inkp} $ be an ink connected component and $ c_{\degp} $ a degradation connected component. We denote by $SG$ the predicate returning true if $ c_{\inkp} $ and $ c_{\degp} $ are touching :
%% $\mSG$ can now be defined as the mean area of all connected components that satisfies $ SG $ over the mean area of all ink components :
%% \mSG = \frac{\displaystyle {Average}_{\{ (c_{\inkp}, c_{\degp}), SG (c_{\inkp}, c_{\degp})\}} (\card{c_{\inkp}} + \card{c_{\degp}})}{\displaystyle {Average}_{c_{\inkp} \in \inkc} (\card{c_{\inkp}}) } $$
%% The higher $ \mSG $ is, the more likely it is that the document has large spots around ink components. Combined with other features (for example, $ \mIInk $), $ \mSG $ helps predict whether the spots lead to binarization errors.
+%Table \ref{locationExemples} shows the values of the three spatial derfomation features on the examples of figure \ref{locations}. With all the previously defined features, each document image is characterized by vector of dimension $18$.
+%\caption{Example of the location measures on previous examples images. $ \mA $ is none 0 in only the first case. $ \mSG $ is higher the less the two connected components have pixels in common.}
0.2 & 0.1 & 0.3 & 0.05 & 0.2 & 3,6 & 0.4 & 0.05 & 0.5 & 741 & 392 & 161 & 66 & 135 & 199 & 1.25 & 2065 & 171 \\
+% \multicolumn{5}{c}{Image} & \multicolumn{6}{c}{GrayScale Histogram} & \multicolumn{7}{c}{3mean clusters} \\
+% $\mIInk $ & $\mIBack$ & $\mQ$ & $ \mA $ & $ \mS $ & $ \mSG $ & $ s_{i} $ & $ s_{g} $ & $ s_{b} $ & $ v_{i} $ & $ v_{g} $ & $ v_{b}$ & $ \mu_{i} $ & $ \mu_{g} $ & $ \mu_{b} $ & s & v & $ \mu $ \\
+% 0.13 & 0.2 & 0.03 & 0.3 & 0.2 & 1.4 & 0.6 & 0.02 & 0.5 & 257 & 206 & 30 & 98 & 146 & 189 & 3 & 356 & 185 \\
\caption{Example onanimage from the DIBCO dataset: extraction of the degradation layer and features values.}
+The table \ref{measuresExamplesOnRealImages} shows the degradation extraction and the values of the presented features on one document image. The manual analysis of these values indicates that it may be preferable to use Sauvola's for this image. Indeed, the values of $\mIInk$ and $\mIBack$ are low meaning that a global thresholding method like Otsu's is likely to fail to correctly classify the pixels. The value of $\mSG$ is also high : there are large spots around the characters. Windowbased method have, most of the time, better results on this kind of documents. This hypothesis is confirmed with the fscore of Otsu's and Sauvola's methods. On this image, Ostu makes a score of 0.4 and Sauvola of 0.7.
%Some binarization methods rely on parameters. In this article, we do not focus on parameter optimization. Therefore, we chose to use the parameters given by the authors of each method in their corresponding original articles. Importantly, note that the prediction models created are only able to predict the performance of a binarization method with a specific set of parameters. However, a binarization method can have several prediction models, one for each set of parameters.
For this specific use case we follow a methodology based on a stepwise linear regression. This methodology can be applied to all types of binarization methods and therefore is first detailed in a general context. The following subsection (\ref{methodology}) presents this methodology and the dataset we used to train and validate our prediction models. In subsection \ref{otsusauvola}, we analyze the accuracy of 3 binarization prediction models that are highly used in document images : Otsu, Sauvola and Shijian. In this article, the accuracy of the last 10 methods is not presented. However, all prediction models are used to create a process that selects the binarization algorithm that is the most suited for a given image. The accuracy of this process (and therefore the accuracy of the prediction models) is analyzed in subsection \ref{sectionoptimalselection}.
+For this specific use case we follow a methodology based on a stepwise linear regression. This methodology can be applied to all types of binarization methods and therefore is first detailed in a general context. The following subsection (\ref{methodology}) presents this methodology and the dataset we used to train and validate our prediction models. In subsection \ref{otsusauvola}, we analyze the accuracy of 3 binarization prediction models that are highly used in document images : Otsu, Sauvola and Shijian. In this article, the accuracy of the last 10 methods is not presented. However, all prediction models are used to create a process that selects the binarization algorithm that is the most suited for an image. The accuracy of this process (and therefore the accuracy of the prediction models) is analyzed in subsection \ref{sectionoptimalselection}.
To create the prediction model, we use a multivariate step wise linear regression \cite{thompson1978selectionp2}, followed by a repeated random subsampling validation (cross validation). This over all process can be divided in several steps :
 \item Features and fscores computation: The 18 proposed features are computed for each image. We also run the binarization algorithm on the overall dataset and measure its accuracy relative to the ground truth. In the following section, these fscores are called ground truth fscores.
 \item Generation of the predictive model : This step consists of applying a step wise multivariate linear regression to the overall dataset, allowing us to select the most signicant features for predicting the given binarization algorithm. The output of this step is a linear function that gives a predicted fscore value for any image, for one binarization algorithm, knowing the selected features.
 \item Evaluation of model accuracy: The $R^2$ value indicates the proportion of variability in a data set that is accounted for by the statistical model and provides a measure of how well the model predicts future outcomes. The best theoretical value for $R^2$ is 1. Moreover, a pvalue is computed for each selected feature indicating its signicance. We choose to keep the model only if $R^2 > 0.7$ and if a majority of pvalues are lower than $0.1$.
+ \item Features and Fscores computation: The 18 proposed features are computed for each image. We also run the binarization algorithm on the overall dataset and measure its accuracy relative to the ground truth. In the following section, these fscores are called ground truth fscores.
+ \item Generation of the predictive model : This step consists of applying a step wise multivariate linear regression to the overall dataset, allowing us to select the most significant features for predicting the given binarization algorithm. The output of this step is a linear function that gives a predicted fscore value for any image, for one binarization algorithm, knowing the selected features.
+ \item Evaluation of model accuracy: The $R^2$ value indicates the proportion of variability in a data set that is accounted for by the statistical model and provides a measure of how well the model predicts future outcomes. The best theoretical value for $R^2$ is 1. Moreover, a pvalue is computed for each selected feature indicating its significance. We choose to keep the model only if $R^2 > 0.7$ and if a majority of pvalues are lower than $0.1$.
\item Model validation using CrossValidation : the training of a prediction model and its accuracy measurement is done a several times (in our experiments : 100 times) on different subsets of a dataset :
\item The overall set of images is randomly split (90\% is used as training set and 10\% as validation set).
 \item On the validation set, the accuracy of the model is measured by two values. The $R^{2}$, which is the proportion of variability in a data set that is accounted for by the statistical model and provides a measure of how well future outcomes are likely to be predicted by the model. We also look at the slope coefficient of the validation regression, which also needs to be the closest to 1.
+ \item On the validation set, the accuracy of the model is measured by two values. The $R^{2}$ and the slope coefficient of the validation regression, which also needs to be the closest to 1.
In this experiment we use a merge of the well known datasets DIBCO\footnote{http://users.iit.demokritos.gr/~bgat/DIBCO2009/} and HDIBCO\footnote{http://users.iit.demokritos.gr/~bgat/HDIBCO2010/}.
+In this experiment we use a merge of the well known datasets named DIBCO\footnote{http://users.iit.demokritos.gr/~bgat/DIBCO2009/} and HDIBCO\footnote{http://users.iit.demokritos.gr/~bgat/HDIBCO2010/}.
 The selected most significant measures are : $ \mIInk $, $ v_{I}$ , $ v_{B} $, $ \mu_{B} $, $\mu$ and $v$. This can be explained by the fact that Otsu's binarization method is based on a global grayscale histogram thresholding. That is why measures such as $\mIInk$, $\mu$ and $v$ are significant and have such low pvalues. The estimated coefficients are presented in table \ref{otsuPredictionModel}. By repeating 100 times a random subsampling validation gives a mean slope coefficient of 0.989 and a mean $R^{2}$ of 0.987. This cross validation step estimates that the predictive model will perform in practice.
+ The selected most significant measures are : $ \mIInk $, $ v_{i}$ , $ v_{b} $, $ \mu_{b} $, $\mu$ and $v$. This can be explained by the fact that Otsu's binarization method is based on a global grayscale histogram thresholding. That is why measures such as $\mIInk$, $\mu$ and $v$ are significant and have such low pvalues. The estimated coefficients are presented in table \ref{otsuPredictionModel}. By repeating 100 times a random subsampling validation gives a mean slope coefficient of 0.989 and a mean $R^{2}$ of 0.987. This cross validation step estimates that the predictive model will perform in practice.
The selected measures are : $\mIBack$, $\mQ$, $\mA$, $\mu$, $s$, $s_{I}$, $v_{I}$ . The estimated coefficients are presented in table \ref{sauvolaPredictionModel}. It is no surprise that $\mA$ is selected for this binarization method. Indeed window based methods are sensitive to small noise components without any real ink information. The cross validation by repeating 100 times a random subsampling validation gives a mean slope coefficient of 1.0007 and a mean $R^{2}$ of 0.99. These results allows us to conclude that, as with Otsu's prediction model, this model is accurate and can be used in practice.
+The selected measures are : $\mIBack$, $\mQ$, $\mA$, $\mu$, $s$, $s_{i}$, $v_{i}$ . The estimated coefficients are presented in table \ref{sauvolaPredictionModel}. It is no surprise that $\mA$ is selected for this binarization method. Indeed window based methods are sensitive to small noise components without any real ink information. The cross validation by repeating 100 times a random subsampling validation gives a mean slope coefficient of 1.0007 and a mean $R^{2}$ of 0.99. These results allows us to conclude that, as with Otsu's prediction model, this model is accurate and can be used in practice.
 Given a document image, a binarization method and its prediction model, we can compute all of the features required by the model and use them as inputs. The result is the predicted accuracy of this specific binarization method for this specific image. Table \ref{selectionRes} presents some fscore statistics obtained from binarizing the DIBCO dataset. The first line corresponds to the best theoretical fscores (having the ground truth, we know for each image the binarization method that will provide the best fscore). The second line corresponds to the fscores obtained using only Shijian's method. The last line corresponds to the fscores obtained using our automatic binarization selection. We analyse the accuracy of our binarization method selection algorithms in several ways. First, the method has a slightly better (2\%) mean accuracy than using only Shijian's method. Importantly, note that our algorithm has a higher global accuracy (the standard deviation equals $0.04$). Second, the worst binarization result of our method is much higher than Shijian's (56\%). Last, we compared our method with the optimal selection that we can compute from the ground truth. The results are very similar, indicating that the prediction models are accurate enough to select the best binarization method for each image (70\% perfect match). The mean error of our method is $0.009$ (standard deviation equals $0.02$), and, the worst error equals $0.06$.
+ Given a document image, a binarization method and its prediction model, we can compute all of the features required by the model and use them as inputs. The result is the predicted accuracy of this specific binarization method for this specific image. Table \ref{selectionRes} presents some fscore statistics obtained from binarizing the DIBCO dataset. The first line corresponds to the best theoretical fscores (having the ground truth, we know for each image the binarization method that will provide the best fscore). The second line corresponds to the fscores obtained using only Shijian's method. The last line corresponds to the fscores obtained using our automatic binarization selection. We analyse the accuracy of our binarization method selection algorithms in several ways. First, the method has a slightly better (2\%) mean accuracy than using only Shijian's method. Importantly, note that our algorithm has a higher global accuracy (the standard deviation equals $0.04$). Last, the worst binarization result of our method is much higher than Shijian's (56\%).
+Second, we compared our method with the optimal selection that we can compute from the ground truth. The results are very similar, indicating that the prediction models are accurate enough to select the best binarization method for each image (70\% perfect match). The mean error of our method is $0.009$ (standard deviation equals $0.02$), and, the worst error equals $0.06$.
IJDAR/ijdar.bib
+ title = {{A Biometrics Invited Paper. The Analysis and Selection of Variables in Linear Regression}},
IJDAR/intro.tex
This paper involves quality evaluations of ancient document images. Document quality evaluation is needed at every stage of the digitization workflow, for instance in the scanning stage to ensure that the scanner's settings are optimal, in the processing stage to apply the best algorithms (for example, restoration, binarization, OCR) and in the visualization stage to provide the best image quality.
+This paper involves quality evaluations of document images. Document quality evaluation is needed at every stage of the digitization workflow, for instance in the scanning stage to ensure that the scanner's settings are optimal, in the processing stage to apply the best algorithms (for example, restoration, binarization, OCR) and in the visualization stage to provide the best image quality.
To improve the results of the processing stage, it is necessary to take into account specific image defects.
Ancient document images suffer from several types of degradation. According to \cite{baird2007state}, degradation can have different origins :
+Document images may suffer from several types of degradation. According to \cite{baird2007state}, degradation can have different origins :
+As ancient documents often present significant degradation, we focus this paper on their quality evaluation. However, the global methodology is suited for any damaged document images.
The second part of our work involves predicting the result of a binarization algorithm. To our knowledge, there are no studies on binarization prediction. The existing work on algorithm prediction in the field of document image analysis entails OCR prediction methods, which typically use the quality features of characters to create prediction models.
+The second part of our work involves predicting the result of a binarization algorithm. To our knowledge, there are no studies on binarization prediction. The existing work on algorithm prediction for document image analysis is only found in the OCR field, which typically use the quality features of characters to create prediction models.
The first features related to character quality were introduced in \cite{blando1995prediction}. In this article the authors evaluate the quality of binary text documents by analyzing black and white connected components. The OCR result is predicted by simply thresholding the quality ratios (proportion of thick and broken characters). Each document image is finally labeled as good or poor.
% This technique as several disadvantages : first, we believe that a predicted classification in two classes (good or poor) for the OCR is not enough; second, the thresholds have to be changed and manually set for each new data set.
IJDAR/measures.tex
We aim to compute the following global statistic features of the grayscale histogram: mean, variance and skewness.
+We aim to compute the following global statistic features of the grayscale histogram: mean, variance and skewness. The skewness quantifies the asymmetry of the histogram. For example, a negative skewness indicates that the distribution of pixels graylevels has relatively few low values.
We denote the mean of the global histogram by $\mu$, its variance by $v$, and its skewness by $s$. A good value for the skewness is a high negative value : the left tail of the histogram is longer, the intensities are concentrated on the right and the histogram has relatively few gray values. In that case, the image is likely easily binarized (see the images in Figure \ref{histogramExample}.a and Table \ref{measuresExamplesOnRealImages} line 2)
The mean, variance and skewness are also computed on the three \emph{subhistograms} to characterize each layer distribution (ink, background and degradation).
Table \ref{locationExemples} shows the values of the three spatial deformation features on the examples in Figure \ref{locations}.
+\caption{Spatial deformation features measured on the examples in Figure \ref{locations}. $ \mA $ is equal to $1$ in Figure \ref{locations}.a and figure \ref{locations}.b because one degradation component is connected to one ink component.
+$ \mSG $ equals to $0$ in Figure \ref{locations}.a because no component are connected. $ \mSG $ has a lower value in Figure \ref{locations}.b than in Figure \ref{locations}.c because the union area size between the ink component and the degradation component is smaller.
\caption{Spatial deformation features measured on the examples in Figure \ref{locations}. $ \mA $ is equal to $1$ in Figure \ref{locations}.a and figure \ref{locations}.b because one degradation component is connected to one ink component.
$ \mSG $ equals to $0$ in Figure \ref{locations}.a because no component are connected. $ \mSG $ has a lower value in Figure \ref{locations}.b than in Figure \ref{locations}.c because the union area size between the ink component and the degradation component is smaller.
+\caption{Two document image examples from the DIBCO dataset and their feature vectors. The proposed features capture different degradation types (for example, ink spots, faded ink, background speckles) }
\multicolumn{5}{c}{Image} & \multicolumn{6}{c}{GrayScale Histogram} & \multicolumn{7}{c}{3mean clusters} \\
0.2 & 0.1 & 0.3 & 0.05 & 0.2 & 3,6 & 0.4 & 0.05 & 0.5 & 741 & 392 & 161 & 66 & 135 & 199 & 1.25 & 2065 & 171 \\
\multicolumn{5}{c}{Image} & \multicolumn{6}{c}{GrayScale Histogram} & \multicolumn{7}{c}{3mean clusters} \\
\caption{Two document image examples from the DIBCO dataset and their feature vectors. The proposed features capture different degradation types (for example, ink spots, faded ink, background speckles) }
IJDAR/prediction.tex
\item Sauvola \cite{sauvola2000adaptive} is a locally adaptive thresholding method using pixel intensity variance.
 \item Shijian \cite{su2011combination} is a recent method based on an \textit{adhoc} combination of existing techniques. \cite{su2011combination} has proved to have very good accuracy on the ICDAR 2011 Binarization Contest.
+ \item Shijian \cite{su2011combination} is a recent method based on an \textit{adhoc} combination of existing techniques. \cite{su2011combination} has proven to have very good accuracy on the ICDAR 2011 Binarization Contest.
Some binarization methods rely on parameters. In this article, we do not focus on parameter optimization. Therefore, we chose to use the parameters given by the authors of each method in their corresponding original articles. Table \ref{parameters} summarizes the values of these parameters. Importantly, note that the prediction models created are only able to predict the performance of a binarization method with a specific set of parameters. However, a binarization method can have several prediction models, one for each set of parameters. To illustrate the difference between two sets of parameters, we will create two different prediction models for Sauvola's method. The second set of parameters was manually chosen (Table \ref{parameters}).
+\caption{Methods parameters: we chose to use the parameters given by each author in their original articles.}
\caption{Methods parameters: we chose to use the parameters given by each author in their original articles.}
+\caption{Statistical results of $11$ binarization algorithms applied to all DIBCO images. Except for the Sahoo algorithm, all binarization methods have a significant min/max fscore gap and standard deviation between $0.1$ and $0.3$, indicating that the dataset is heterogeneous and well suited for the learning step of our prediction model.}
\caption{Statistical results of $11$ binarization algorithms applied to all DIBCO images. Except for the Sahoo algorithm, all binarization methods have a significant min/max fscore gap and standard deviation between $0.1$ and $0.3$, indicating that the dataset is heterogeneous and well suited for the learning step of our prediction model.}
\subsection {Using step wise multivariate linear regression to predict the result of the binarization algorithm}
To estimate the fscore of a binarization algorithm, we automatically build a prediction model based on the most significant features among the $18$. More precisely, the prediction model is computed using multivariate step wise linear regression \cite{thompson1978selectionp1,thompson1978selectionp2}, followed by repeated random subsampling validation (cross validation). This overall process can be divided into five steps\footnote{The overall R project script and our evaluation data can be downloaded from the following website \texttt{https://bitbucket.org/vrabeux/qualityevaluation}}~:
+To estimate the fscore of a binarization algorithm, we automatically build a prediction model based on the most significant features among the $18$. More precisely, the prediction model is computed using multivariate step wise linear regression \cite{thompson1978selectionp1,thompson1978selectionp2,hocking}, followed by repeated random subsampling validation (cross validation).
+The linear regression models (as an hyperplane) the relationship between the features and the groundtruthed fscores. This result can then be used to predict a fscore according to the set of computed features.
+The prediction can be improved by using only a pertinent subset of features among the $18$ independent computed features.
+There are 3 main ways to carry out a selection. First, Forward strategy consist in computing a criteria by adding one feature at a time. On the contrary, a second approach (Backward) consist in starting with all the features and deleting them one at a time. After each deleting a criteria is computed. At last, a strategy consist of testing all the possible combination.
+This overall process can be divided into five steps\footnote{The overall R project script and our evaluation data can be downloaded from the following website \texttt{https://bitbucket.org/vrabeux/qualityevaluation}}~:
The prediction model for Otsu's, Sauvola's and Shijian's binarization algorithms were generated with the methodology described in section~\ref{subsectionprediction}. The coefficients associated with the most significant selected features, their pvalues and the intercept of the linear predictive function are detailed in Tables~\ref{otsuPredictionModel}, \ref{sauvolaPredictionModel} and \ref{shijianPredictionModel}. If a feature is not present in a table, then it was not selected by the step wise algorithm. As mentioned in the previous section, the cross validation for each model gives the pair $(\bar{\alpha}, \bar{R^{2}}$.
+The prediction model for Otsu's, Sauvola's and Shijian's binarization algorithms were generated with the methodology described in section~\ref{subsectionprediction}. The coefficients associated with the most significant selected features, their pvalues and the intercept of the linear predictive function are detailed in Tables~\ref{otsuPredictionModel}, \ref{sauvolaPredictionModel} and \ref{shijianPredictionModel}. If a feature is not present in a table, then it was not selected by the step wise algorithm. As mentioned in the previous section, the cross validation for each model gives the pair $(\bar{\alpha}, \bar{R^{2}})$.
The most significant selected features for Otsu's prediction model are $ \mIInk $, $ v_{I}$, $ v_{B} $, $ \mu_{B} $, $\mu$ and $v$ (see Table~\ref{otsuPredictionModel} for the coefficients of the predictive function). For Otsu's prediction model, we can explain the feature selection by the fact that Otsu's binarization method is based on global thresholding. That is why features such as $\mIInk$, $\mu$ and $v$ are significant and have such low pvalues. The model's $ R^{2}$ equals $0.93$, which is considered very good.
+\caption{Otsu prediction model : all selected features are significant (pvalue $<0.1$), and the model is likely to correctly predict future unknown images given that the $R^{2}$ value is higher than $0.9$. }
\caption{Otsu prediction model : all selected features are significant (pvalue $<0.1$), and the model is likely to correctly predict future unknown images given that the $R^{2}$ value is higher than $0.9$. }
%\caption{Sauvola prediction Model : all features are significant ($pvalue <0.1$), the model is also likely to predict correctly future unknown images given that the $R^{2}$ equals $0.8$ and adjusted $R^{2}$ equals 0.77. }
%\caption{Shijian prediction Model : the model is likely to predict correctly future unknown images given that the $R^{2}$ equals $0.86$ and adjusted $R^{2}$ equals 0.82. }
+\caption{Accuracy of the prediction model for the other eight binarization methods. The selected features are different from one method to another. The accuracy and robustness of the prediction models are good ($R^2 > 0.7$, cross validation $\bar{R^{2}} > 0.83$).}
\caption{Accuracy of the prediction model for the other eight binarization methods. The selected features are different from one method to another. The accuracy and robustness of the prediction models are good ($R^2 > 0.7$, cross validation $\bar{R^{2}} > 0.83$).}
+\caption{Binarization of the DIBCO dataset. Comparison between the best theoretical fscore (computed from the ground truth), fscores obtained using only Shijian's method and fscores obtained from our automatic selection.}
\caption{Binarization of the DIBCO dataset. Comparison between the best theoretical fscore (computed from the ground truth), fscores obtained using only Shijian's method and fscores obtained from our automatic selection.}