To estimate the f-score of a binarization algorithm, we automatically build a prediction model based on the most significant features among the $18$. More precisely, the prediction model is computed using multivariate stepwise linear regression \cite{thompson1978selectionp1,thompson1978selectionp2,hocking,cohen}, followed by repeated random sub-sampling validation (cross-validation).


The linear regression models the relationship between the features and the ground truth f-scores as a hyperplane. This result can then be used to predict an f-score from a set of computed features. The prediction can be improved by using only a pertinent subset of the $18$ independent computed features.
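With a subset of $k$ selected features $x_1, \dots, x_k$, the fitted hyperplane takes the usual affine form (illustrative notation, assuming ordinary least squares):
\begin{equation*}
\hat{f} = \beta_0 + \sum_{j=1}^{k} \beta_j \, x_j ,
\end{equation*}
where the coefficients $\beta_j$ are estimated by least squares from the ground truth f-scores.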

There are three main ways to carry out the selection. The forward strategy computes a criterion (linked to the $R^2$ value) while adding one feature at a time. Conversely, the backward strategy starts with all the features and deletes them one at a time, recomputing the criterion after each deletion. The last strategy tests all possible combinations. As we have only $18$ features, we decided to use the exhaustive strategy.
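As an illustration, the exhaustive strategy over all $2^{18}-1$ non-empty feature subsets can be sketched as follows. This is a minimal sketch, not the paper's implementation: it assumes ordinary least squares fits and uses an adjusted-$R^2$ criterion to compare subsets of different sizes; the function name and criterion choice are ours.

```python
from itertools import combinations

import numpy as np


def best_subset(X, y):
    """Exhaustively test every non-empty feature subset and return the
    subset whose least-squares fit maximizes the adjusted R^2.

    X: (n_images, n_features) feature matrix; y: ground truth f-scores.
    """
    n, p = X.shape
    ss_tot = ((y - y.mean()) ** 2).sum()
    best_subset_found, best_r2 = None, -np.inf
    for k in range(1, p + 1):
        for subset in combinations(range(p), k):
            # ordinary least squares with an intercept column
            A = np.column_stack([np.ones(n), X[:, subset]])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            resid = y - A @ coef
            ss_res = resid @ resid
            # adjusted R^2 penalizes larger subsets
            r2_adj = 1 - (ss_res / (n - k - 1)) / (ss_tot / (n - 1))
            if r2_adj > best_r2:
                best_subset_found, best_r2 = subset, r2_adj
    return best_subset_found, best_r2
```

With $18$ features this amounts to $262{,}143$ small regressions, which remains tractable.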

\begin{enumerate}

\item \textbf{Features computation:} The $18$ proposed features are computed for each image.

\item \textbf{F-scores computation:} We run the binarization algorithm and compute the f-score for each image by comparing the binarization result with the ground truth. In the following, these f-scores are called ground truth f-scores and are denoted by $f_i$, $i \in \{1, \dots, N\}$, where $N$ is the number of images.

\item \textbf{Generation of the predictive model:} This step consists of applying a stepwise multivariate linear regression to the overall dataset, allowing us to select the most significant features for predicting the given binarization algorithm. Keeping all features in each prediction model would lead to overfitted models. Indeed, some features may not be significant for predicting a specific binarization method. Moreover, even if a feature is highly correlated with the accuracy of an algorithm, it may have a weak contribution to the final prediction model. The output of this step is a linear function that gives a predicted f-score for any image, for one binarization algorithm, given the selected features. We denote by $\hat{f_i}$, $i \in \{1, \dots, N\}$, the predicted f-scores.

\item \textbf{Evaluation of model accuracy}:

The $R^{2}$ value measures the quality of the prediction model. It can be interpreted as a correlation between the ground truth and the prediction:
\begin{equation}
R^{2} = 1 - \frac{\sum_{i}\left(f_i - \hat{f_i}\right)^{2}}{\sum_{i}\left(f_i - \bar{f}\right)^{2}},
\end{equation}
with $\bar{f}$ the mean of the ground truth f-scores.

The best theoretical value for $R^{2}$ is $1$. Moreover, a p-value is computed for each selected feature, indicating its significance: a low p-value leads to rejecting the hypothesis that the selected feature is not significant (null hypothesis).

There is no automatic rule for deciding whether a model is valid. In our tests, we keep a model only if $R^2 > 0.7$ and if a majority of the p-values are lower than $0.1$.

\item \textbf{Model validation} :

Because of the relatively small number of images in the dataset, we use cross-validation to estimate the performance of the predictive function generated in step 2. We randomly split the overall set of images into two subsets: the training set and the validation set. In our experiments, the training set is composed of 90\% of the dataset images and the validation set of the remaining 10\%.

By applying linear regression to the training set, we compute a new prediction function with its associated $R^2$. The features used here are those selected at step 2.

We apply this new predictive function to the validation set. The obtained predicted f-scores are compared with the ground truth f-scores using a linear regression that provides the slope coefficient $\alpha$.

If $\alpha$ is close to $1$, then the f-scores of the binarization are well predicted on the validation set.

\end{enumerate}
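The repeated random sub-sampling validation described above can be sketched as follows. The 90\%/10\% split and the slope coefficient $\alpha$ follow the text; the function name, the number of repetitions, and averaging the slope over splits are our assumptions for illustration.

```python
import numpy as np


def cross_validate_slope(X, y, n_splits=100, holdout=0.1, seed=0):
    """Repeated random sub-sampling validation.

    For each split: fit an OLS model on 90% of the images, predict the
    f-scores of the held-out 10%, then regress the ground truth f-scores
    on the predictions.  The slope alpha is close to 1 when the model
    predicts well.  Returns the mean slope over all splits.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    n_val = max(2, int(round(holdout * n)))  # need >= 2 points for a slope
    slopes = []
    for _ in range(n_splits):
        idx = rng.permutation(n)
        val, train = idx[:n_val], idx[n_val:]
        # fit the linear predictive function on the training set
        A = np.column_stack([np.ones(len(train)), X[train]])
        coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        # predict f-scores on the validation set
        pred = np.column_stack([np.ones(n_val), X[val]]) @ coef
        # regress ground truth on predictions to obtain the slope alpha
        B = np.column_stack([np.ones(n_val), pred])
        s, *_ = np.linalg.lstsq(B, y[val], rcond=None)
        slopes.append(s[1])
    return float(np.mean(slopes))
```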

%This $6$-steps process permits to have a prediction model validated for an algorithm (function).

Some binarization methods described in this article require parameter settings. Note that our methodology involves the creation of different predictive models, one for each parameter set. %For example, Sauvola's method with a $5 \times 5$ window size is different from Sauvola's method with an $8 \times 8$ window size and will require the creation of a different prediction model.

%Linear regression makes sense here because we observed that the computed features are linearly correlated to ?

%SVM?

The prediction models for Otsu's, Sauvola's and Shijian's binarization algorithms were generated with the methodology described in section~\ref{subsection-prediction}. The coefficients associated with the most significant selected features, their p-values and the intercept of the linear predictive function are detailed in Tables~\ref{otsuPredictionModel}, \ref{sauvolaPredictionModel} and \ref{shijianPredictionModel}. If a feature is not present in a table, then it was not selected by the stepwise algorithm. As mentioned in the previous section, the cross-validation for each model gives the pair $(\bar{\alpha}, \bar{R^{2}})$.

\paragraph{Otsu's binarization method}

The most significant selected features for Otsu's prediction model are $\mIInk$, $v_{I}$, $v_{B}$, $\mu_{B}$, $\mu$ and $v$ (see Table~\ref{otsuPredictionModel} for the coefficients of the predictive function). The feature selection can be explained by the fact that Otsu's binarization method is based on global thresholding, which is why features such as $\mIInk$, $\mu$ and $v$ are significant and have such low p-values. The model's $R^{2}$ equals $0.93$, which is considered very good \cite{cohen}.

The cross-validation gives an $\bar{\alpha}$ coefficient of $0.989$ and a $\bar{R^{2}}$ of $0.987$. These results indicate that our model does not depend on the chosen training data.
