Commits

journet nicholas committed 5f3b0e9

version sent to ijdar

  • Parent commits 62d87f9


Files changed (3)

 
 The first part of our work is to identify the degradations within document images. The related works are ancient document image enhancement methods whose first step often consists of identifying and localizing specific degradation pixels. 
 
-Among the methods focusing on pixel degradation identification, the authors of \cite{wang2003document} propose a directional wavelet transform to identify bleed-through pixels. The authors of \cite{dubois2001reduction} also localize document pixels suffering from bleed-through by a recto-verso registration : a parameter optimization method aims to find the appropriate transformation matrix that minimizes the difference between gray recto pixels and ink pixels from the verso. The recto pixels corresponding to the verso ones can then be labelled as bleed-through pixels. The problem addressed in \cite{zhang2002warped} is the localization of pixels that suffer from illumination defects. This problem occurs when scanning documents with large bookbindings. The authors propose a line-by-line thresholding to localize the boundary of the dark area near the bookbinding.
+Among the methods focusing on pixel degradation identification, the authors of \cite{wang2003document} propose a directional wavelet transform to identify bleed-through pixels. The authors of \cite{dubois2001reduction} also localize document pixels suffering from bleed-through by a recto-verso registration: a parameter optimization method aims to find the appropriate transformation matrix that minimizes the difference between gray recto pixels and ink pixels from the verso. The recto pixels corresponding to the verso ones can then be labelled as bleed-through pixels. The problem addressed in \cite{zhang2002warped} is the localization of pixels that suffer from illumination defects. This problem occurs when scanning documents with large bookbindings. The authors propose a line-by-line thresholding to localize the boundary of the dark area near the bookbinding.
 
 %The problem of perspective distortion occurring when scanning thick documents is addressed in \cite{zhang2002warped}. The first step of the proposed method is to locate areas suffering from illumination defects : a line-by-line thresholding is used to localize the boundary of the dark area near the bookbinding.
 

IJDAR/measures.tex

 a.\includegraphics[width=200px]{imgs/grayDegradation.png}
 b.\includegraphics[width=200px]{imgs/grayComponentsHistogram.png}
 c.\includegraphics[width=200px]{imgs/grayComponents.png}
-\caption{The three classes of pixels. (a) the original grayscale document image. (b) its grayscale histogram with two thresholds $ s_{0} $  and $ s_{1} $ obtained by a 3-means algorithm. (c) classification result : pixels lower than the threshold $ s_{0} $ in black, pixels between $ s_{0} $ and $ s_{1} $ in gray and pixels higher that $ s_{1} $ in white. The gray set of pixels (between $ s_{0} $ and $ s_{1} $) contains most of the instances of document degradation, such as bleed through, speckles, spots and ink loss.}
+\caption{The three classes of pixels. (a) the original grayscale document image. (b) its grayscale histogram with two thresholds $ s_{0} $ and $ s_{1} $ obtained by a 3-means algorithm. (c) classification result: pixels lower than the threshold $ s_{0} $ in black, pixels between $ s_{0} $ and $ s_{1} $ in gray and pixels higher than $ s_{1} $ in white. The gray set of pixels (between $ s_{0} $ and $ s_{1} $) contains most of the instances of document degradation, such as bleed-through, speckles, spots and ink loss.}
 \label{grayscaleHisto}
 \end{center}
 \end{figure}
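As a concrete illustration of the 3-means thresholding step, here is a minimal sketch in Python. The quantile initialization and the midpoint rule for deriving $s_0$ and $s_1$ from adjacent cluster centers are assumptions for illustration; the text only states that the two thresholds are obtained by a 3-means algorithm.

```python
import numpy as np

def kmeans_1d(values, k=3, iters=50):
    """Plain 1-D k-means on gray values (a stand-in for the paper's 3-means step).
    Initial centers are spread over the value quantiles (an assumed init scheme)."""
    values = np.asarray(values, dtype=float).ravel()
    centers = np.quantile(values, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        # assign each gray value to its nearest center
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            mask = labels == j
            if mask.any():
                centers[j] = values[mask].mean()
    return np.sort(centers)

def three_layer_thresholds(gray_pixels):
    """Return (s0, s1), assumed here to be the midpoints between sorted centers."""
    c = kmeans_1d(gray_pixels, k=3)
    s0 = (c[0] + c[1]) / 2.0   # ink / degradation boundary
    s1 = (c[1] + c[2]) / 2.0   # degradation / background boundary
    return s0, s1
```

Pixels below $s_0$ are then labelled ink (black), pixels between $s_0$ and $s_1$ degradation (gray), and pixels above $s_1$ background (white).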
 c. \includegraphics[width=100px]{imgs/H05.jpg} &
 d. \includegraphics[width=100px]{imgs/H05_histo.png} 
 \end{tabular}
-\caption{Examples of global gray-level histograms :  a relatively clean document (a), its corresponding gray-level histogram (b) , an ancient degraded document (c) and (d) its corresponding histogram (more scattered and irregular than in b). The gray-level histogram is used to provide a first indication of the quality of the document.}
+\caption{Examples of global gray-level histograms: a relatively clean document (a), its corresponding gray-level histogram (b), an ancient degraded document (c) and (d) its corresponding histogram (more scattered and irregular than in b). The gray-level histogram is used to provide a first indication of the quality of the document.}
 \label{histogramExample}
 \end{center}
 \end{figure}
 
 We aim to compute the following global statistical features of the grayscale histogram: mean, variance and skewness. The skewness quantifies the asymmetry of the histogram. For example, a negative skewness indicates that the distribution of pixel gray-levels has relatively few low values. 
-We denote the mean of the global histogram by $\mu$, its variance by $v$, and its skewness by $s$. A good value for the skewness is a high negative value : the left tail of the histogram is longer, the intensities are concentrated on the right  and the histogram has relatively few gray values. In that case, the image is likely easily binarized (see the images in Figure \ref{histogramExample}.a and Table \ref{measuresExamplesOnRealImages} line 2)
+We denote the mean of the global histogram by $\mu$, its variance by $v$, and its skewness by $s$. A good value for the skewness is a large negative value: the left tail of the histogram is longer, the intensities are concentrated on the right and the histogram has relatively few gray values. In that case, the image is likely to be easily binarized (see the images in Figure \ref{histogramExample}.a and Table \ref{measuresExamplesOnRealImages}, line 2).
 The mean, variance and skewness are also computed on the three \emph{sub-histograms} to characterize each layer distribution (ink, background and degradation).  
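The three statistics can be computed directly from the gray-level counts. A small sketch follows; the skewness normalization (the standard third standardized moment) is one common choice, as the text does not spell it out:

```python
import numpy as np

def histogram_moments(hist, levels=None):
    """Mean, variance and skewness of a gray-level histogram.
    `hist[i]` is the pixel count of gray level `levels[i]` (0..255 by default)."""
    hist = np.asarray(hist, dtype=float)
    if levels is None:
        levels = np.arange(hist.size, dtype=float)
    n = hist.sum()
    mu = (levels * hist).sum() / n                      # mean gray level
    v = (hist * (levels - mu) ** 2).sum() / n           # variance
    s = (hist * (levels - mu) ** 3).sum() / n / v ** 1.5  # skewness
    return mu, v, s
```

The same function applied to the ink, degradation and background sub-histograms yields $\mu_{I}, v_{I}, s_{I}$ and so on.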
 
-This step provides 12 features : 
+This step provides 12 features: 
 \begin{itemize}
 	\item $\mu$, $v$, $s$ (global histogram)
 	\item $\mu_{\inkp}$,  $v_{\inkp}$, $s_{\inkp}$ (ink histogram)
 
 The gray-values of the three layers are not the only characteristics that could affect a binarization algorithm. 
 The amount of degradation pixels is also directly correlated with the binarization performance. 
-We aim to measure this performance as the relative quantity of ink and degradation pixels. We define $ \mQ  $ as the following ratio : 
+We capture this aspect as the relative quantity of ink and degradation pixels. We define $ \mQ $ as the following ratio: 
 
 
 
 Binarization is a segmentation task meant to extract objects of interest (for example, characters, drawings). A good binarization should preserve the shape of the objects and avoid the creation of unwanted black or white components. Obviously, the location of the degradation pixels is a significant characteristic that can influence the binarization result. Figure \ref{locations} illustrates the main situations observed in real documents in which the degradation pixels spatially interfere with ink pixels. For example, the binarization results worsen if dark spots overlap characters (Figures \ref{locations}.b and c). In other words, an ink component may be even more deformed because it is connected with a degraded component. The following features are meant to capture the possible creation of unwanted black components and the possible deformation of the characters through the binarization process. 
 % Some local binarization algorithms are sensitive to the proximity between degraded pixels and ink pixels. Some have wrong results if degraded pixels are touching ink pixels where others misclassify the degraded pixels that are located far from ink. 
 
-Let $S$ be a set of pixels. We denote the set of the 4-connected components of $S$ by $CC(S)$. In the rest of the section, we use the following notations : $\inkc = CC(\inkp)$, $\dc = CC(\degp)$ and $\backc = CC(\backp)$.
+Let $S$ be a set of pixels. We denote the set of the 4-connected components of $S$ by $CC(S)$. In the rest of the section, we use the following notations: $\inkc = CC(\inkp)$, $\dc = CC(\degp)$ and $\backc = CC(\backp)$.
 
 \begin{figure}[htbp]
 \begin{center}
 \end{center}
 \end{figure}
 
-Let  $c_{\inkp} \in \inkc$  be an ink component and  $c_{\degp} \in \dc$ be a degradation component. We denote the predicate returning true by $SG(c_{\inkp}, c_{\degp})$  if $ c_{\inkp}  $ and $ c_{\degp}  $ are connected~:
+Let $c_{\inkp} \in \inkc$ be an ink component and $c_{\degp} \in \dc$ be a degradation component. We denote by $SG(c_{\inkp}, c_{\degp})$ the predicate that returns true if $ c_{\inkp} $ and $ c_{\degp} $ are connected:
 $$ SG (c_{\inkp}, c_{\degp}) =  \exists (p_{\inkp}, p_{\degp})_{ \in c_{\inkp} \times c_{\degp}} \mid p_{\inkp} \mbox{,~} p_{\degp} \mbox{~are 4-connected}$$ 
-We distinguish three different cases that can produce different types of binarization errors~: 
+We distinguish three different cases that can produce different types of binarization errors: 
 
 \begin{enumerate}
-\item  If  $c_{\inkp}$ and  $c_{\degp}$  are not connected (figure \ref{locations}.a), the original character will not be altered by the binarization process. If this configuration occurs numerous times, the binarization can lead to a document image highly degraded by many small black spots between characters. Let $\cma$ be  the set of degradation components that are not connected to any ink component~:
+\item  If  $c_{\inkp}$ and  $c_{\degp}$  are not connected (Figure \ref{locations}.a), the original character will not be altered by the binarization process. If this configuration occurs numerous times, the binarization can lead to a document image highly degraded by many small black spots between characters. Let $\cma$ be the set of degradation components that are not connected to any ink component:
 $$
 \cma = \{c_{\degp} \in \dc \mid \forall c_{\inkp} \in \inkc, SG (c_{\inkp}, c_{\degp})=false \} 
 $$
 
-The relative quantity of non-connected ink and degradation components is measured by $ \mA $~:
+The relative quantity of non-connected ink and degradation components is measured by $ \mA $:
 $$ \mA = \displaystyle\frac{ \card{ \cma } }{ \card{\inkc} } $$
 %The range of this feature depends on the image size, but it can still be used to create a prediction model.% This is discussed in section \ref{prediction}.
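Under the definitions above, $\mA$ can be computed with standard connected-component labelling. A sketch using SciPy follows; 4-connectivity matches the paper's $CC$ definition, while testing adjacency by dilating each degradation component is an implementation choice not stated in the text:

```python
import numpy as np
from scipy import ndimage

# 4-connectivity structuring element (a cross), matching the paper's CC definition
FOUR = ndimage.generate_binary_structure(2, 1)

def measure_A(ink_mask, deg_mask):
    """mA = (# degradation components touching no ink component) / (# ink components).
    Both arguments are boolean 2-D masks for the ink and degradation layers."""
    n_ink = ndimage.label(ink_mask, structure=FOUR)[1]
    deg_labels, n_deg = ndimage.label(deg_mask, structure=FOUR)
    isolated = 0
    for lab in range(1, n_deg + 1):
        comp = deg_labels == lab
        # the component is connected to ink iff its 4-neighbourhood overlaps ink
        grown = ndimage.binary_dilation(comp, structure=FOUR)
        if not np.logical_and(grown, ink_mask).any():
            isolated += 1
    return isolated / n_ink
```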
 
 
 The first document image (Table \ref{measuresExamplesOnRealImages}, line 1) is damaged by a large spot that overlaps text lines. The gray-levels of the spot are close to the gray-levels of the text pixels. Because the Otsu method is based on a global threshold, the spot pixels tend to be misclassified as ink. On the contrary, the local method is more likely to achieve a correct separation of ink and background on the defective area, which explains why the respective f-scores of the Otsu and Sauvola methods are $0.4$ and $0.7$ on this image. The second document image (Table \ref{measuresExamplesOnRealImages}, line 2) presents an uneven background with speckles. Moreover, the ink color is light relative to the background color. On this image, the respective f-scores of Otsu and Sauvola are $0.8$ and $0.4$. The Sauvola method is not robust to the background speckles, which are classified as ink. The faded-ink defect is a drawback for a global method and lowers the performance of Otsu's method. 
 
-Table 2 shows that specific defects that reduce binarization performance are captured by the proposed features. Even if the global features based on histogram analysis are meaningful, they are not sufficient in that case to choose the best binarization method. The ink pixels' mean value $\mu_{I}$ of the first image is lower than that of the second one, indicating that the ink layer seems easier to identify using a global thresholding meth-od. However, the skewness of the ink $s_{I}$ is negative, indicating that most pixels are concentrated on the right part of the distribution: there are more gray pixels than really dark pixels. The skewness of the second global histogram $s$ is much higher than that of the the first image, indicating that the background of the second image is easy to separate using a global thresholding method. This separation is confirmed by the global variance $v$. Without additional information, the global thresholding method seems adapted to the second image but we cannot draw a similar conclusion for the first image.  
-
-In the first image, the values of $\mIInk$ and $\mIBack$ are low, indicating that a global thresholding method is likely to fail to correctly classify the pixels. The value of $\mSG$ is also high, indicating that there are large spots around the characters. Generally, window-based methods have better results on this type of document.  
-	 
-On the second image, the values of $\mIInk$ and $\mIBack$  are even lower: Otsu's method will also yield a bad result for the second image, but other features such as $s$ or the relatively low value of $v$ indicate that failure may be relative. Moreover, the value of $\mA$ is high, meaning that many components do not touch text pixels. This type of degradation is likely to produce binarization errors with windows based methods such as Sauvola's method.
-	 
-According to the computed features, it is preferable to use Sauvola's method for the first image and Otsu's for the second. Doing so is consistent with the f-scores of the two binarization methods. 
-	
-The proposed features characterize three different aspects of degradation: intensity, quantity and location. The next section details a methodology that uses these features to predict the result of a binarization algorithm, which is applied to the  prediction of $12$ binarization algorithms on the DIBCO dataset.
-
-% Depending of the algorithm that we aim to predict, all these measures may not be use on the same prediction model. A sub-selection of measures is necessary. This process is done in an automated way witch is presented in following section.
 
 \begin{center}
 \begin{table*}[!htdp]
 \end{table*}
 \end{center}
 
+Table 2 shows that specific defects that reduce binarization performance are captured by the proposed features. Even if the global features based on histogram analysis are meaningful, they are not sufficient in that case to choose the best binarization method. The ink pixels' mean value $\mu_{I}$ of the first image is lower than that of the second one, indicating that the ink layer seems easier to identify using a global thresholding method. However, the skewness of the ink $s_{I}$ is negative, indicating that most pixels are concentrated on the right part of the distribution: there are more gray pixels than really dark pixels. The skewness of the second global histogram $s$ is much higher than that of the first image, indicating that the background of the second image is easy to separate using a global thresholding method. This separation is confirmed by the global variance $v$. Without additional information, the global thresholding method seems adapted to the second image, but we cannot draw a similar conclusion for the first image.  
+
+In the first image, the values of $\mIInk$ and $\mIBack$ are low, indicating that a global thresholding method is likely to fail to correctly classify the pixels. The value of $\mSG$ is also high, indicating that there are large spots around the characters. Generally, window-based methods have better results on this type of document.  
+	 
+On the second image, the values of $\mIInk$ and $\mIBack$ are even lower: Otsu's method will also yield a bad result for the second image, but other features such as $s$ or the relatively low value of $v$ indicate that the failure may be relative. Moreover, the value of $\mA$ is high, meaning that many components do not touch text pixels. This type of degradation is likely to produce binarization errors with window-based methods such as Sauvola's method.
+	 
+According to the computed features, it is preferable to use Sauvola's method for the first image and Otsu's for the second. Doing so is consistent with the f-scores of the two binarization methods. 
+	
+The proposed features characterize three different aspects of degradation: intensity, quantity and location. The next section details a methodology that uses these features to predict the result of a binarization algorithm, which is applied to the prediction of $12$ binarization algorithms on the DIBCO dataset.
+
+% Depending on the algorithm that we aim to predict, not all of these measures may be used in the same prediction model. A sub-selection of measures is necessary. This process is done in an automated way, which is presented in the following section.
+%
+%\begin{center}
+%\begin{table*}[!htdp]
+%{\scriptsize
+%\hfill{}
+%\caption{Two document image examples from the DIBCO dataset and their feature vectors. The proposed features capture different degradation types (for example, ink spots, faded ink, background speckles) }
+%\label{measuresExamplesOnRealImages}
+%\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|}
+%\hline
+%	 \multicolumn{5}{|c|}{Image}  & \multicolumn{6}{c|}{GrayScale Histogram}  & \multicolumn{7}{c|}{3-mean clusters}   \\
+%
+%\hline
+%	 \multicolumn{5}{|c|}{}  & \multicolumn{6}{c|}{}  & \multicolumn{7}{c|}{}   \\
+%	 \multicolumn{5}{|c|}{\includegraphics[width=140px]{imgs/H04-2.png}} &
+%	 \multicolumn{6}{|c|}{\includegraphics[width=140px]{imgs/H04-2-histo.png}} &
+%	 \multicolumn{7}{|c|}{\includegraphics[width=140px]{imgs/H04-2-seg.png}} \\
+%	 
+%\hline	 
+%	 $\mIInk $ & $\mIBack$ & $\mQ$ & $ \mA $ & $ \mS $ & $ \mSG $ & $ s_{I} $ & $ s_{D} $ & $ s_{B} $ & $ v_{I} $ & $ v_{D} $ &  $ v_{B}$ & $ \mu_{I} $ & $ \mu_{D} $ & $ \mu_{B} $  & s          & v & $ \mu $ \\
+%	 0.2  		& 	0.1		& 0.3 	&	0.05	&	0.2	&	3,6	&	-0.4	&  -0.05    & -0.5          &  741 	&  392         &    161       &     66            &        135          &        199       &  -1.25    &  2065  & 171              \\
+%	
+%\hline		
+%%\hline
+%
+%	 \multicolumn{5}{|c|}{Image}  & \multicolumn{6}{c|}{GrayScale Histogram}  & \multicolumn{7}{c|}{3-mean clusters}   \\
+%
+%\hline
+%	 \multicolumn{5}{|c|}{}  & \multicolumn{6}{c|}{}  & \multicolumn{7}{c|}{}   \\
+%	 \multicolumn{5}{|c|}{\includegraphics[width=140px]{imgs/H03.png}} &
+%	 \multicolumn{6}{|c|}{\includegraphics[width=140px]{imgs/H03-histo.png}} &
+%	 \multicolumn{7}{|c|}{\includegraphics[width=140px]{imgs/H03-seg.png}} \\
+%	 
+%\hline	 
+%	 $\mIInk $ & $\mIBack$ & $\mQ$ & $ \mA $ & $ \mS $ & $ \mSG $ & $ s_{I} $ & $ s_{D} $ & $ s_{B} $ & $ v_{I} $ & $ v_{D} $ &  $ v_{B}$ & $ \mu_{I} $ & $ \mu_{D} $ & $ \mu_{B} $  & s & v & $ \mu $ \\
+%	0.13   	& 	0.2		&  0.03	&	0.3	&	0.2	&	1.4	&  -0.6  &    -0.02      &  -0.5	&     257     &  206          &         30    &       98         &       146       &   189               & -3     & 356  & 185    \\
+%	
+%\hline
+%
+%\end{tabular}
+%} \hfill{}
+%\end{table*}
+%\end{center}
+

IJDAR/prediction.tex

 	\item Kapur \cite{kapur1985new} is an entropy-based thresholding method.
 	\item Kittler \cite{kittler1985threshold} is a clustering-based thresholding algorithm.
 	\item Li \cite{li1998iterative} is a cross-entropic thresholding method based on the minimization of an information theoretic distance (Kullback-Leibler).
-	\item Niblack \cite{niblack1985introduction} : is a locally adaptive thresholding meth-od using pixels intensity variance.
+	\item Niblack \cite{niblack1985introduction} is a locally adaptive thresholding method using pixel intensity variance.
 %	\item Ramesh \cite{ramesh1995thresholding} : is a shape-modeling thresholding technique.
 	\item Ridler \cite{calvard1978picture} is an iterative thresholding method based on two-class Gaussian mixture models.
 	\item Sahoo \cite{sahoo1997threshold} is an entropy-based thresholding method.
 
 %that are either easy to binarize (high mean f-score),  or hard (low mean f-score) is needed. %Indeed, without an heterogenous dataset, we would train our prediction model on a subset of images. The trained prediction model would not be usable in real life. 
 
-We develop a new dataset by merging the DIBCO\footnote{http://users.iit.demokritos.gr/~bgat/DIBCO2009/}  and H-DIBCO\footnote{http://users.iit.demokritos.gr/~bgat/H-DIBCO2010/} datasets \cite{gatos2009icdar,pratikakis2010h,pratikakis2011icdar}. These datasets are primarily used as data for binarization contests and contain a heterogeneous set of images from difficult to easy to binarize. Table \ref{fscoredistrib} summarizes some statistical results of the $12$ binarization algorithms applied to all DIBCO images  ($36$ images).%\footnote{The overall merged dataset can be obtained at http://sd-22392.dedibox.fr:8080/DoQuBookWeb/pagecollections.jsp}).	
+We develop a new dataset by merging the DIBCO\footnote{http://users.iit.demokritos.gr/$\sim$bgat/DIBCO2009/} and H-DIBCO\footnote{http://users.iit.demokritos.gr/$\sim$bgat/H-DIBCO2010/} datasets \cite{gatos2009icdar,pratikakis2010h,pratikakis2011icdar}. These datasets are primarily used as data for binarization contests and contain a heterogeneous set of images, ranging from difficult to easy to binarize. Table \ref{fscoredistrib} summarizes some statistical results of the $12$ binarization algorithms applied to all DIBCO images ($36$ images).%\footnote{The overall merged dataset can be obtained at http://sd-22392.dedibox.fr:8080/DoQuBookWeb/pagecollections.jsp}).	
 
 
 
 
 
 
-This overall process which is presented on figure \ref{shema} can be divided into five steps\footnote{The overall R project script and our evaluation data can be downloaded from the following website \texttt{https://bitbucket.org/vrabeux/qualityevaluation}}~:
+This overall process, which is presented in Figure \ref{shema}, can be divided into five steps\footnote{The overall R project script and our evaluation data can be downloaded from the following website \texttt{https://bitbucket.org/vrabeux/qualityevaluation}}:
 \begin{enumerate}
 	\item \textbf{Feature computation:} The $18$ proposed features are computed for each image.  
 	
  
 
 
-The best theoretical value for $ R^{2}$ is 1. Moreover, a p-value is computed for each selected feature indicating its significance :  a low p-value leads to reject the hypothesis that the selected feature is not significant (null hypothesis). At this step, there is no automatic rule to decide whether a model is valid or not. However, in our experiments, we choose to keep the model only if the $R^{2}$ value is higher than 0.7 and if a majority of p-values are lower than $0.1$. 
+The best theoretical value for $ R^{2}$ is 1. Moreover, a p-value is computed for each selected feature, indicating its significance: a low p-value leads to rejecting the hypothesis that the selected feature is not significant (the null hypothesis). At this step, there is no automatic rule to decide whether a model is valid or not. However, in our experiments, we choose to keep the model only if the $R^{2}$ value is higher than 0.7 and if a majority of p-values are lower than $0.1$. 
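The retention rule can be sketched as follows. The ordinary-least-squares fit and the two-sided t-test p-values are standard linear-regression machinery, and the `keep_model` helper is an illustrative name; the paper itself only states the $R^{2} > 0.7$ and majority-of-p-values $< 0.1$ criteria:

```python
import numpy as np
from scipy import stats

def ols_r2_pvalues(X, y):
    """OLS fit returning R^2 and per-feature p-values (two-sided t-tests).
    X is an (n, k) feature matrix; an intercept column is added internally."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    dof = n - Xd.shape[1]
    r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    sigma2 = (resid @ resid) / dof
    cov = sigma2 * np.linalg.inv(Xd.T @ Xd)
    tvals = beta / np.sqrt(np.diag(cov))
    pvals = 2 * stats.t.sf(np.abs(tvals), dof)
    return r2, pvals[1:]          # drop the intercept's p-value

def keep_model(r2, pvals):
    """The retention rule: R^2 > 0.7 and a majority of p-values < 0.1."""
    return bool(r2 > 0.7 and np.mean(np.asarray(pvals) < 0.1) > 0.5)
```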
 
 %???  We also look at the slope coefficient of the validation regression, which also needs to be the closest to 1.
 
-	\item \textbf{Model validation} : 
+	\item \textbf{Model validation}: 
 Because of the relatively few images in the dataset, we use cross validation to estimate the performance of the predictive function generated in step 2. We randomly split the overall set of images into two different subsets: the training set and the validation set. In our experiments, the training set is composed of 90\% of the dataset images and the validation set is composed of the remaining 10\%.
 
 By applying linear regression to the training set, we compute a new prediction function with its associated $R^2$. The features used here are those selected at step 2.
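The 90/10 split validation can be sketched as below. Repeating the random split several times and reporting a mean absolute error are assumptions for illustration, and `fit`/`predict` are hypothetical stand-ins for the linear-regression step described above:

```python
import numpy as np

def random_split_validation(features, fscores, fit, predict,
                            rounds=100, train_frac=0.9, seed=0):
    """Repeated random train/validation splits (90%/10% by default).
    `fit(X, y)` returns a model; `predict(model, X)` returns predicted f-scores.
    Returns the mean absolute prediction error over the validation folds."""
    rng = np.random.default_rng(seed)
    n = len(fscores)
    errors = []
    for _ in range(rounds):
        idx = rng.permutation(n)
        cut = int(round(train_frac * n))
        train, valid = idx[:cut], idx[cut:]
        model = fit(features[train], fscores[train])
        pred = predict(model, features[valid])
        errors.append(np.mean(np.abs(pred - fscores[valid])))
    return float(np.mean(errors))
```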
 
 \begin{center}
 \begin{table}[ht]
-\caption{Otsu's prediction model : all selected features are significant (p-value $<0.1$), and the model is likely to correctly predict  future unknown images given that the $R^{2}$ value is higher than $0.9$. The mean percentage error is denoted by $mpe$.}
+\caption{Otsu's prediction model: all selected features are significant (p-value $<0.1$), and the model is likely to correctly predict future unseen images given that the $R^{2}$ value is higher than $0.9$. The mean percentage error is denoted by $mpe$.}
 \label{otsuPredictionModel}
 {\small
 \hfill{}
 
 The same experiment was conducted on the other binarization methods. Table~\ref{otherPredictionModel} sums up the selected features and the significant information to validate or not a binarization prediction model. 
 
-Among the 18 features, most models embed about 7 features. Globally the selected features are consistent with the binarization algorithm : the step wise selection process tends to keep global (resp. local) features for global (resp. local) binarization algorithms. We also note that $\mS$ is never selected by any prediction model. Indeed, the binarization accuracy is measured at the pixel level (f-score). With this accuracy measure, the feature $\mSG$ becomes more significant than $\mS$, which may not have been the case with another evaluation measure.
+Among the 18 features, most models embed about 7 features. Globally, the selected features are consistent with the binarization algorithm: the stepwise selection process tends to keep global (resp. local) features for global (resp. local) binarization algorithms. We also note that $\mS$ is never selected by any prediction model. Indeed, the binarization accuracy is measured at the pixel level (f-score). With this accuracy measure, the feature $\mSG$ becomes more significant than $\mS$, which may not have been the case with another evaluation measure.
 
 
 The $R^{2}$ values show the quality of each prediction model. The prediction models of the Sahoo and Niblack binarization methods were not kept for the statistical validation step since their $R^{2}$ values were below 0.7. For these two binarization methods, new features would have to be created to obtain more accurate prediction models.
 \subfloat[]{
 \includegraphics[width=200px]{imgs/diffMethodesBinar/Otsu.png}
 }
-\caption{Sophisticated binarization algorithms do not always give the best output : original image (a),  Shijian Lu's binarization output (b),  Sauvola's binarization output (c) and Otsu's binarization output (d). Ostu's algorithm has the best performances on this specific image.}
+\caption{Sophisticated binarization algorithms do not always give the best output: original image (a), Shijian Lu's binarization output (b), Sauvola's binarization output (c) and Otsu's binarization output (d). Otsu's algorithm has the best performance on this specific image.}
 \label{fig-shijian-fails}
 \end{center}
 \end{figure}