\title{Quality evaluation of ancient digitized documents for binarization prediction}
\author{\IEEEauthorblockN{Vincent Rabeux}
\IEEEauthorblockA{University of Bordeaux\\
\IEEEauthorblockN{Nicholas Journet}
\IEEEauthorblockA{University of Bordeaux\\
\IEEEauthorblockN{Anne Vialard}
\IEEEauthorblockA{University of Bordeaux\\
\IEEEauthorblockN{Jean-Philippe Domenger}
\IEEEauthorblockA{University of Bordeaux\\
This article proposes an approach to predict the result of binarization algorithms on a given document image according to its state of degradation. Indeed, historical documents suffer from different types of degradation which result in binarization errors. We intend to characterize the degradation of a document image by using different features based on the intensity, quantity and location of the degradation. These features allow us to build prediction models of binarization algorithms that are very accurate according to $R^2$ values and p-values. The prediction models are used to select the best binarization algorithm for a given document image. This image-by-image strategy improves the binarization of the entire dataset.
This paper deals with the quality evaluation of ancient document images.% Document quality evaluation is needed at every stage of the digitization workflow, for instance in the scanning stage to ensure that the scanner's settings are optimal, in the processing stage to apply the best algorithms (for example, restoration, binarization, OCR) and in the visualization stage to provide the best image quality. We focus here on the processing stage.
We propose a methodology based on algorithm prediction models to select the best algorithm for a specific task. Our approach is based on the following fact: the global quality of a document image directly affects the result of any processing algorithm (for example, binarization or segmentation). We aim to predict the result of an algorithm according to the type and quantity of degradation of the processed document. In this paper, we focus on binarization prediction.
For a given binarization algorithm and a set of ground-truthed binarized images, a prediction function is built by searching for a significant correlation between the algorithm's performance and the quality of the images. The document image quality is characterized by new, dedicated features. This prediction function can then forecast the binarization result for any new image on which the quality features have been previously computed.
To our knowledge, there are no existing studies on binarization prediction. The existing work on algorithm prediction in the field of document image analysis concerns OCR, which typically uses quality features to create prediction models.
The first features related to character quality were introduced in \cite{blando1995prediction}. In this article, the authors evaluate the quality of binary text documents by analysing black and white connected components. The OCR result is predicted by thresholding the quality measures (proportion of thick and broken characters). Each document image is finally labeled as good or poor.
In \cite{cannon1997automated}, two new measures are introduced to account for speckles and touching characters. A linear regression is used to predict the OCR performance on handwritten black and white documents. The authors of \cite{gonzalez1998prediction} complete the set of measures with new ones, which are used as inputs to a neural network to classify the images into two classes (poor or good). By reusing a script identification engine, the method proposed in \cite{ablavsky2003} can select the better of two OCRs according to a classification of the text image as {\em broken}, {\em clean} or {\em merged}.
Other works propose strategies to select the best restoration algorithm. As in OCR prediction methods, dedicated defect features are computed on a binary image. These values are then used as inputs for different types of semi-supervised classification algorithms. The authors of \cite{souza2003automatic} use the features of \cite{gonzalez1998prediction} with three new ones from \cite{cannon1999quality} to select a restoration algorithm using decision rules with thresholds. %Another automatic restoration method selection is presented in \cite{cannon1999quality}. In this latter article, the restoration algorithm selection is based on a linear classifier.
Previous methods suffer from two main drawbacks. First, most of them require a connected component extraction and, therefore, a binarization step. These methods strongly depend on the accuracy of this preprocessing step. We believe that a better approach consists of directly analyzing the defect pixels in the initial grayscale image. Second, none of the presented articles dealing with prediction models analyze the significance of each feature. %Only the authors of \cite{reed2008correlating} propose an interesting correlation analysis between several quality metrics and the parameters of the degradation model used to produce the test images. This preliminary study shows that, even if most features are highly correlated with the defects, some are not. However, this study does not address the essential issue of the selection of relevant quality features to avoid over-parameterization of an algorithm prediction.
In the following sections, we introduce different features that characterize the degradation of ancient documents. These features rely on a decomposition of the document gray levels into three classes: ink pixels, degradation pixels and background pixels. We characterize the degradation layer by analyzing the distribution of its intensities, its quantity and its location within the image. The proposed features, dedicated to binarization evaluation, are presented in section \ref{measures}. Section \ref{prediction} details the methodology used for creating algorithm prediction models. Prediction models of several binarization methods are then presented, all of which are highly accurate. Finally, section \ref{section-optimal-selection} explains how to use the prediction models to select the best binarization algorithm for a specific image.
\section{Characterization of the degradation layer }
This section details the new features used to characterize document image degradation. A first set of global features is extracted directly from the grayscale histogram, without spatial consideration. A second set of features characterizes the localization of the degradation.
\subsection{The degradation layer extraction}
In this article, we do not measure each type of degradation separately. Instead, we assume that an ancient document can be modeled as the combination of several layers of pixels. Most of the degradation that alters a binarization result (for example, bleed-through, spots, speckles, non-uniform illumination, ink loss) appears as pixels with gray levels that differ from both ink and background pixels. We therefore measure and characterize document degradation globally, by distinguishing three layers of pixels:
 \item dark pixels (most of the ink pixels), with gray level below $ s_{0} $;
 \item gray pixels (degradation), with gray level between $ s_{0} $ and $ s_{1} $;
 \item background pixels, with gray level above $ s_{1} $.
 The two thresholds $ s_{0} $ and $ s_{1} $ can be set using any classification algorithm; our experiments use a 3-means clustering algorithm. Since we do not aim to enhance the document with a pixel-accurate classification, the layer separation does not need to be highly precise. Table \ref{measuresExamplesOnRealImages} shows an example of the 3-means clustering applied to an image with severe defects.
Let us denote by $g(p)$ the intensity of the pixel $p$.
Once the three layers are dissociated, we extract the three corresponding sets of pixels: $\inkp = \{ p, g(p) \leq s_{0} \}$; $\degp = \{ p, s_{0} < g(p) < s_{1} \}$; $\backp = \{ p, g(p) \geq s_{1} \}$.
From these sets of pixels, we define three sets of values corresponding to the pixel intensities:
$\inkInt = \{ g(p), p \in \inkp \}$; $ \grayInt = \{ g(p), p \in \degp \}$; $ \backInt = \{ g(p), p \in \backp \}$.
+Some measures are also based on sets of 4-connected components.
+Let $S$ be a set of pixels. We denote the set of the 4-connected components of $S$ by $CC(S)$. In the rest of the section, we use the following notations : $\inkc = CC(\inkp)$, $\dc = CC(\degp)$ and $\backc = CC(\backp)$.
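As an illustration, the layer separation described above can be sketched with a simple 1-D 3-means clustering of the gray levels. This is only a sketch under our own assumptions (numpy available; function names and the midpoint rule for deriving $s_{0}$ and $s_{1}$ from the cluster centers are ours, not the paper's):

```python
import numpy as np

def extract_layers(gray, n_iter=50):
    """Split a grayscale image (2-D uint8 array) into ink, degradation and
    background layers with a 1-D 3-means clustering of the gray levels.
    Illustrative sketch only: names and the thresholding rule are ours."""
    values = gray.ravel().astype(float)
    # start the three centroids at dark / mid / light gray levels
    centers = np.array([values.min(), values.mean(), values.max()])
    for _ in range(n_iter):
        # assign each gray value to its nearest centroid
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        new_centers = np.array([values[labels == k].mean() if (labels == k).any()
                                else centers[k] for k in range(3)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    dark, mid, light = np.sort(centers)
    # thresholds s0, s1: midpoints between consecutive cluster centers
    s0 = (dark + mid) / 2.0
    s1 = (mid + light) / 2.0
    return s0, s1, gray <= s0, (gray > s0) & (gray < s1), gray >= s1
```

As noted above, the separation need not be highly precise, so any reasonable 1-D clustering (k-means, Otsu-style multi-thresholding) would serve equally well here.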
\subsection{Global Features}
The global grayscale histogram contains information characterizing the document's quality. We compute the following statistics of the global histogram: its mean $\mu$, its variance $v$ and its skewness $s$. The same statistics are also computed on the three \emph{sub-histograms} to characterize each layer distribution (ink, degradation and background).
This step provides 12 features: $\mu$, $v$, $s$, $\mu_{\inkInt}$, $v_{\inkInt}$, $s_{\inkInt}$, $\mu_{\grayInt}$, $v_{\grayInt}$, $s_{\grayInt}$, $\mu_{\backInt}$, $v_{\backInt}$, $s_{\backInt}$.
These global measures cannot precisely represent the relation between the ink layer, the degradation layer and the background layer. Therefore, we define two further features, $ \mIInk $ and $ \mIBack $, where $ \mIInk $ is the distance between the average intensity of the degradation pixels and the average intensity of the ink pixels, and $ \mIBack $ is the distance between the average intensity of the degradation pixels and the average intensity of the background pixels.
 \mIInk = \displaystyle\frac{\moy{\grayInt} - \moy{\inkInt}}{255}
 \mIBack = \displaystyle\frac{\moy{\backInt} - \moy{\grayInt}}{255}
The gray values of the three layers are not the only characteristics that can affect a binarization algorithm. The amount of degradation pixels is also directly correlated with the binarization performance.
We measure it as the relative quantity of degradation and ink pixels, defining $ \mQ $ as the following ratio: $\mQ = \frac{\card{\degp}}{\card{ \inkp }}$.
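The global features above reduce to a few lines of array arithmetic. The sketch below (illustrative names, numpy assumed) computes the 12 histogram statistics plus $\mIInk$, $\mIBack$ and $\mQ$ from precomputed boolean masks of the three layers:

```python
import numpy as np

def layer_stats(values):
    """Mean, variance and skewness of one (sub-)histogram's gray values."""
    mu = values.mean()
    v = values.var()
    s = ((values - mu) ** 3).mean() / (v ** 1.5 + 1e-12)
    return mu, v, s

def global_features(gray, ink, degradation, background):
    """Global features of a grayscale image, given boolean masks for the three
    layers; dictionary keys (mu, v, s, MI_ink, MI_back, MQ) are illustrative."""
    feats = {}
    feats["mu"], feats["v"], feats["s"] = layer_stats(gray.astype(float).ravel())
    for name, mask in (("ink", ink), ("deg", degradation), ("back", background)):
        feats["mu_" + name], feats["v_" + name], feats["s_" + name] = \
            layer_stats(gray[mask].astype(float))
    # distances between the layers' mean intensities, normalized to [0, 1]
    feats["MI_ink"] = (feats["mu_deg"] - feats["mu_ink"]) / 255.0
    feats["MI_back"] = (feats["mu_back"] - feats["mu_deg"]) / 255.0
    # relative quantity of degradation pixels vs. ink pixels
    feats["MQ"] = degradation.sum() / max(ink.sum(), 1)
    return feats
```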
\subsection{Spatial deformation features}
As a good binarization should preserve the shape of the objects and avoid the creation of unwanted black or white components, the location of the degradation pixels is a significant characteristic that can influence the binarization result. Figure \ref{locations} illustrates the three main situations observed in real documents in which the degradation pixels spatially interfere with ink pixels.
More precisely, from these observations, we propose to compute three different measures.
Let $ c_{\inkp} \in \inkc $ be an ink component and $ c_{\degp} \in \dc $ be a degradation component.
Let us also denote by $\n(c)$ the set of neighboring pixels of a connected component $c$:
\n(c) = \{p \notin c \mid \exists q \in c, p \mbox{~and~} q \mbox{~are 4-connected}\}
We distinguish three different cases that can produce different types of binarization errors.
\textbf{Case 1:} If $c_{\inkp}$ and $c_{\degp}$ are not connected (figure \ref{locations}.a), the original character will not be altered by the binarization process. However, if this configuration occurs many times, the binarization can lead to a document image highly degraded by many small black spots between characters. Let $\cma$ be the set of degradation components that are not connected to any ink component:
 \cma = \{c_{\degp} \in \dc \mid \forall p_{\degp} \in c_{\degp}, \nexists c_{\inkp} \in \inkc, p_{\degp} \in \n(c_{\inkp}) \}
The feature $\mA$ is defined as the ratio between the number of such isolated degradation components and the total number of ink components: $ \mA = \displaystyle\frac{ \card{ \cma } }{ \card{\inkc} } $
\textbf{Case 2:} If $c_{\inkp}$ and $c_{\degp}$ are connected (figure \ref{locations}.b), the original character may be altered by the binarization: degradation pixels may be misclassified as ink pixels. Let $\cms$ be the set of all ink components that are connected to at least one degradation component:
\cms = \{ c_{\inkp} \in \inkc \mid \exists p_{\inkp} \in c_{\inkp} \mbox{~and~} c_{\degp} \in \dc, p_{\inkp} \in \n(c_{\degp}) \}
The feature $ \mS $ is defined as the ratio between the number of ink components that may be enlarged by at least one degradation component and the total number of ink components: $ \mS = \displaystyle\frac{ \card{ \cms } }{ \card{ \inkc}} $
\textbf{Case 3:} $ \mSG $ measures the possible extent of the deformation of ink components caused by touching degradation components. Let $ c_{\inkp} $ be an ink connected component and $ c_{\degp} $ a degradation connected component. We denote by $SG$ the predicate returning true if $ c_{\inkp} $ and $ c_{\degp} $ are touching:
$$ SG (c_{\inkp}, c_{\degp}) = \exists p_{\inkp} \in c_{\inkp}, p_{\inkp} \in \n(c_{\degp})$$
$\mSG$ is then defined as the mean area of the pairs of components that satisfy $ SG $ over the mean area of all ink components:
$$ \mSG = \frac{\displaystyle {Average}_{\{ (c_{\inkp}, c_{\degp}) \mid SG (c_{\inkp}, c_{\degp})\}} (\card{c_{\inkp}} + \card{c_{\degp}})}{\displaystyle {Average}_{c_{\inkp} \in \inkc} (\card{c_{\inkp}}) } $$
The higher $ \mSG $ is, the more likely it is that the document has large spots around ink components. Combined with other features (for example, $ \mIInk $), $ \mSG $ helps predict whether the spots lead to binarization errors.
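The three spatial features can be sketched as follows, assuming boolean ink and degradation masks (for example, obtained from the layer separation of section \ref{measures}) and a plain BFS for the 4-connected components; all function names are illustrative:

```python
import numpy as np
from collections import deque

def components(mask):
    """4-connected components of a boolean mask, as a list of pixel sets."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    comps = []
    for i in range(h):
        for j in range(w):
            if mask[i, j] and not seen[i, j]:
                comp, queue = set(), deque([(i, j)])
                seen[i, j] = True
                while queue:
                    y, x = queue.popleft()
                    comp.add((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            queue.append((ny, nx))
                comps.append(comp)
    return comps

def touching(a, b):
    """The SG predicate: True if components a and b (pixel sets) are 4-adjacent."""
    return any((y + dy, x + dx) in b
               for (y, x) in a for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)))

def spatial_features(ink, degradation):
    """Compute mA, mS and mSG from boolean ink and degradation masks."""
    ink_cc, deg_cc = components(ink), components(degradation)
    # Case 1: degradation components isolated from every ink component
    isolated = [d for d in deg_cc if not any(touching(d, c) for c in ink_cc)]
    # Case 2: ink components touched by at least one degradation component
    expanded = [c for c in ink_cc if any(touching(c, d) for d in deg_cc)]
    mA = len(isolated) / max(len(ink_cc), 1)
    mS = len(expanded) / max(len(ink_cc), 1)
    # Case 3: mean area of touching (ink, degradation) pairs over mean ink area
    pair_areas = [len(c) + len(d) for c in ink_cc for d in deg_cc if touching(c, d)]
    mean_ink = sum(len(c) for c in ink_cc) / max(len(ink_cc), 1)
    mSG = (sum(pair_areas) / len(pair_areas)) / mean_ink if pair_areas else 0.0
    return mA, mS, mSG
```

A BFS is used here for self-containment; any labeling routine with 4-connectivity would do.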
Given all of the previously defined features, each document image is characterized by a vector of dimension $18$. An example is given in Table~\ref{measuresExamplesOnRealImages}, which shows the degradation extraction and the values of the proposed features on one document image. The analysis of these values indicates that it may be preferable to use Sauvola's method to binarize this image. Indeed, the values of $\mIInk$ and $\mIBack$ are low, meaning that a global thresholding method like Otsu's is likely to fail to correctly classify the pixels. The value of $\mSG$ is also high: there are large spots around the characters. Window-based methods have, most of the time, better results on this kind of document. This hypothesis is confirmed by the f-scores of Otsu's and Sauvola's methods on this image: Otsu achieves 0.4 and Sauvola 0.7.
$\mIInk$ & $\mIBack$ & $\mQ$ & $ \mA $ & $ \mS $ & $ \mSG $ & $ s_{i} $ & $ s_{g} $ & $ s_{b} $ & $ v_{i} $ & $ v_{g} $ & $ v_{b} $ & $ \mu_{i} $ & $ \mu_{g} $ & $ \mu_{b} $ & $s$ & $v$ & $\mu$ \\
0.2 & 0.1 & 0.3 & 0.05 & 0.2 & 3.6 & -0.4 & -0.05 & -0.5 & 741 & 392 & 161 & 66 & 135 & 199 & -1.25 & 2065 & 171 \\
\caption{Example on an image from the DIBCO dataset: extraction of the degradation layer and feature values.}
\label{measuresExamplesOnRealImages}
Table \ref{measuresExamplesOnRealImages} shows the degradation extraction and the values of the presented features on one document image. A manual analysis of these values suggests that Sauvola's method is preferable for this image. Indeed, the values of $\mIInk$ and $\mIBack$ are low, meaning that a global thresholding method like Otsu's is likely to misclassify pixels. The value of $\mSG$ is also high: there are large spots around the characters, and window-based methods usually give better results on this kind of document. This hypothesis is confirmed by the f-scores of Otsu's and Sauvola's methods: on this image, Otsu's method scores 0.4 and Sauvola's 0.7.
%Some binarization methods rely on parameters. In this article, we do not focus on parameter optimization. Therefore, we chose to use the parameters given by the authors of each method in their corresponding original articles. Importantly, note that the prediction models created are only able to predict the performance of a binarization method with a specific set of parameters. However, a binarization method can have several prediction models, one for each set of parameters.
For this specific use case we follow a methodology based on a step-wise linear regression. This methodology can be applied to all types of binarization methods and is therefore first detailed in a general context. Subsection \ref{methodology} presents this methodology and the dataset we used to train and validate our prediction models. In subsection \ref{otsusauvola}, we analyze the accuracy of the prediction models of three binarization methods widely used on document images: Otsu's, Sauvola's and Shijian's. The accuracy of the 10 remaining methods is not presented in this article. However, all prediction models are used to create a process that selects the binarization algorithm best suited to a given image. The accuracy of this process (and therefore of the prediction models) is analyzed in subsection \ref{section-optimal-selection}.
\subsection{Predicting algorithm results with a step-wise multivariate linear regression}
To create the prediction model, we use a multivariate step-wise linear regression \cite{thompson1978selectionp2}, followed by a repeated random sub-sampling validation (cross-validation). This overall process can be divided into several steps:
 \item Feature and f-score computation: the 18 proposed features are computed for each image. We also run the binarization algorithm on the whole dataset and measure its accuracy relative to the ground truth. In the following sections, these f-scores are called ground truth f-scores.
 \item Generation of the predictive model: this step consists of applying a step-wise multivariate linear regression to the whole dataset, allowing us to select the most significant features for predicting the results of the given binarization algorithm. The output of this step is a linear function that gives a predicted f-score for any image and one binarization algorithm, knowing the selected features.
 \item Evaluation of model accuracy: the $R^2$ value indicates the proportion of variability in a data set that is accounted for by the statistical model and provides a measure of how well the model predicts future outcomes. The best theoretical value for $R^2$ is 1. Moreover, a p-value is computed for each selected feature, indicating its significance. We keep the model only if $R^2 > 0.7$ and if a majority of p-values are lower than $0.1$.
 \item Model validation using cross-validation: the training of a prediction model and the measurement of its accuracy are repeated several times (100 times in our experiments) on different subsets of the dataset:
\item The overall set of images is randomly split (90\% is used as training set and 10\% as validation set).
\item A prediction model is trained on the 90\%.
 \item On the validation set, the accuracy of the model is measured by two values: the $R^{2}$, which measures how well future outcomes are likely to be predicted by the model, and the slope coefficient of the validation regression, which should be as close as possible to 1.
 \item These two metrics are averaged over all splits.
 \item The averaged metrics allow us to statistically validate the prediction model.
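The validation loop above can be sketched as follows. This is a minimal sketch using ordinary least squares on an already-selected feature subset; the step-wise feature-selection stage is omitted, and all function and variable names are illustrative, not from the paper.

```python
import numpy as np

def cross_validate(X, y, n_splits=100, train_frac=0.9, seed=0):
    """Repeated random sub-sampling validation of a linear prediction model.

    X: (n_images, n_features) feature matrix; y: ground truth f-scores.
    Returns the mean validation R^2 and the mean slope of the validation
    regression (both ideally close to 1).
    """
    rng = np.random.default_rng(seed)
    r2s, slopes = [], []
    n = len(y)
    for _ in range(n_splits):
        # Random 90% / 10% split of the image set.
        idx = rng.permutation(n)
        cut = int(train_frac * n)
        tr, va = idx[:cut], idx[cut:]
        # Fit ordinary least squares (with intercept) on the training split.
        A = np.column_stack([np.ones(len(tr)), X[tr]])
        coefs, *_ = np.linalg.lstsq(A, y[tr], rcond=None)
        # Predict f-scores on the validation split.
        pred = np.column_stack([np.ones(len(va)), X[va]]) @ coefs
        # R^2: proportion of variability accounted for by the model.
        ss_res = np.sum((y[va] - pred) ** 2)
        ss_tot = np.sum((y[va] - y[va].mean()) ** 2)
        r2s.append(1.0 - ss_res / ss_tot)
        # Slope of regressing ground truth f-scores on predicted ones.
        slopes.append(np.polyfit(pred, y[va], 1)[0])
    # Average both metrics over all splits.
    return float(np.mean(r2s)), float(np.mean(slopes))
```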
%\subsection{The Dataset}
In this experiment we use a merge of the well-known DIBCO\footnote{http://users.iit.demokritos.gr/~bgat/DIBCO2009/} and H-DIBCO\footnote{http://users.iit.demokritos.gr/~bgat/H-DIBCO2010/} datasets.
\subsection{Prediction models of commonly used binarization methods in document analysis systems}
\paragraph{Otsu's binarization method}
 The most significant selected measures are: $ \mIInk $, $ v_{i}$, $ v_{b} $, $ \mu_{b} $, $\mu$ and $v$. This can be explained by the fact that Otsu's binarization method is based on a global grayscale histogram thresholding. That is why measures such as $\mIInk$, $\mu$ and $v$ are significant and have such low p-values. The estimated coefficients are presented in table \ref{otsuPredictionModel}. Repeating a random sub-sampling validation 100 times gives a mean slope coefficient of 0.989 and a mean $R^{2}$ of 0.987. This cross-validation step estimates how well the predictive model will perform in practice.
Intercept & $1.187e+00$ & $1.604e-01$ & $<0.0001$ \\
$\mIInk$ & $1.244e+00$ & $2.042e-01$ & $<0.0001$ \\
$v_{i}$ & $2.422e-02$ & $1.534e-02$ & $<0.1$ \\
$v_{b}$ & $-4.336e-02$ & $1.095e-02$ & $<0.01$ \\
$\mu_{b}$ & $-2.662e-02$ & $3.585e-03$ & $<0.0001$ \\
$\mu$ & $2.445e-02$ & $3.296e-03$ & $<0.0001$ \\
$v$ & $3.262e-04$ & $5.326e-05$ & $<0.0001$ \\
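The fitted model is a plain linear function of the selected features, so evaluating it reduces to one dot product. The sketch below uses the coefficients printed in table \ref{otsuPredictionModel}; the dictionary keys are ad hoc transliterations of the paper's symbols ($\mIInk$, $v_i$, $v_b$, $\mu_b$, $\mu$, $v$), not names from any published code.

```python
# Otsu prediction model as a linear function. Coefficients are copied
# from the table above; key names are illustrative transliterations.
OTSU_MODEL = {
    "intercept": 1.187e+00,
    "mIInk":  1.244e+00,
    "v_i":    2.422e-02,
    "v_b":   -4.336e-02,
    "mu_b":  -2.662e-02,
    "mu":     2.445e-02,
    "v":      3.262e-04,
}

def predict_fscore(features, model=OTSU_MODEL):
    """Predicted f-score = intercept + sum of coefficient * feature value."""
    return model["intercept"] + sum(
        coef * features[name] for name, coef in model.items() if name != "intercept"
    )
```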
\paragraph{Sauvola's binarization method}
The selected measures are: $\mIBack$, $\mQ$, $\mA$, $\mu$, $s$, $s_{i}$ and $v_{i}$. The estimated coefficients are presented in table \ref{sauvolaPredictionModel}. It is no surprise that $\mA$ is selected for this binarization method: window-based methods are indeed sensitive to small noise components without any real ink information. Repeating a random sub-sampling validation 100 times gives a mean slope coefficient of 1.0007 and a mean $R^{2}$ of 0.99. These results allow us to conclude that, as with Otsu's prediction model, this model is accurate and can be used in practice.
$ \mA $ & 2.3e-01 & 1.22e-02 & $< 0.05$ \\
$\mu$ & -4.56e-03 & 9.54e-04 & $<0.0001 $ \\
$s$ & 7.709e-02 & 2.334e-02 & $<0.0001$ \\
$s_{i}$ & 1.431e-01 & 3.255e-02 & $<0.0001$ \\
$v_{i}$ & 4.264e-04 & 8.307e-05 & $<0.0001$ \\
$\mA$ & 3.162e-02 & 6.469e-03 & $<0.0001$ \\
$\mSG$ & -3.276e-02 & 2.846e-03 & $<0.0001$ \\
$v$ & -1.389e-04 & 5.131e-05 & $<0.0001$ \\
$s_{i}$ & 3.882e-02 & 2.219e-02 & $<0.0001$ \\
$s_{g}$ & 1.328e-01 & 3.597e-02 & $<0.001$ \\
$\mu_{i}$ & -4.004e-04 & 4.387e-04 & $<0.5$ \\
\label{section-optimal-selection}
 Given a document image, a binarization method and its prediction model, we can compute all of the features required by the model and use them as inputs. The result is the predicted accuracy of this specific binarization method for this specific image. Table \ref{selectionRes} presents some f-score statistics obtained from binarizing the DIBCO dataset. The first line corresponds to the best theoretical f-scores (having the ground truth, we know for each image the binarization method that provides the best f-score). The second line corresponds to the f-scores obtained using only Shijian's method. The last line corresponds to the f-scores obtained using our automatic binarization selection. We analyze the accuracy of our binarization method selection in several ways. First, it has a slightly better (2\%) mean accuracy than using only Shijian's method; it is also more consistent over the whole dataset (the standard deviation equals $0.04$), and its worst binarization result is much higher than Shijian's (56\%). Second, we compared our method with the optimal selection that can be computed from the ground truth. The results are very similar, indicating that the prediction models are accurate enough to select the best binarization method for each image (70\% perfect match). The mean error of our method is $0.009$ (standard deviation $0.02$) and the worst error equals $0.06$.
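The image-by-image selection step reduces to an argmax over the per-method predicted f-scores. A minimal sketch, assuming one fitted coefficient dictionary per binarization method; the models and coefficient values below are illustrative placeholders, not the fitted coefficients from the tables.

```python
def predict(model, features):
    # Linear prediction model: intercept + sum of coefficient * feature value.
    return model["intercept"] + sum(
        coef * features[name] for name, coef in model.items() if name != "intercept"
    )

def select_binarization(features, models):
    """Return the name of the method whose model predicts the highest f-score."""
    return max(models, key=lambda name: predict(models[name], features))

# Illustrative placeholder models (not the paper's fitted coefficients).
toy_models = {
    "otsu": {"intercept": 0.40},
    "sauvola": {"intercept": 0.20, "mSG": 0.10},
}
```

For each image of the dataset, the features are computed once, every prediction model is evaluated on them, and the method with the highest predicted f-score is applied to that image.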