imageRF - Manual for Application
Aims and Concept of imageRF
imageRF is an IDL-based tool for the supervised classification and regression analysis of remote sensing image data. It implements the machine learning approach of Random Forests™ (RF) (Breiman & Cutler, 2011), which combines multiple self-learning decision trees to parameterize models and uses them for estimating categorical or continuous variables.
The classification of image data in imageRF follows the workflow shown in the next figure. First, the RF Classification (RFC) model is parameterized using a reference data set for training and internal validation. The result is stored as an *.rfc file (see Section Parameterization - Default Settings). This makes the model independent of the current working session and allows it to be shared easily.
In a second step, the model is used to perform an image classification. The model's accuracy is calculated in the third step, where its estimations are compared with an independent validation set.
The same workflow is used to estimate continuous values from spectral imagery with the help of RF Regression (RFR) models, as shown in this figure.
For methodological studies it might be interesting to perform different model trainings and only compare results based on a set of independent validation points, not the entire image. By using the Fast Accuracy Assessment tool (see Section Fast Accuracy Assessment of RFC/RFR Models), only the reference areas used for the independent validation will be classified/estimated on the basis of the RFC/RFR model, as shown in the next figure.
Background
Random Forests (RF), first developed by Breiman (2001), are an ensemble method for supervised classification and regression based on classification and regression trees (CART). They rely on the assumption that different independent predictors predict incorrectly in different areas. By combining the prediction results, it is possible to improve the overall prediction accuracy (see also Polikar (2006)).
CARTs show significant differences in their structure if their training data varies only slightly. By using this characteristic of CARTs and combining it with bagging (bootstrap aggregating) and random feature selection, independent predictors can be created. These binary trees are grown to maximum size and are not pruned.
Bagging: For each predictor of the ensemble, new training data is generated by resampling: from the training data with n observations, n observations are selected randomly with replacement. This is called bootstrapping. In large data sets, about 63% of the observations of the original training data end up in the bootstrap sample, since the probability that a given observation is never drawn converges to 1/e ≈ 37%. To validate the RF, the remaining observations, which are called out-of-bag data, are used to determine the out-of-bag error.
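This bootstrap behavior is easy to verify numerically. The following minimal Python sketch (illustrative only, not part of the IDL-based imageRF) draws one bootstrap sample and measures the out-of-bag share:

```python
import numpy as np

# Draw n observations with replacement (bootstrapping) and count how
# many of the original observations are left out-of-bag.
rng = np.random.default_rng(seed=0)
n = 10_000                               # number of training observations
bootstrap = rng.integers(0, n, size=n)   # indices drawn with replacement
in_bag = np.unique(bootstrap).size       # distinct observations in the sample

print(f"in-bag share:     {in_bag / n:.3f}")      # approx. 0.632
print(f"out-of-bag share: {1 - in_bag / n:.3f}")  # approx. 0.368
```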
Random Feature Selection: In order to differentiate the ensemble predictors, not all available features are used to determine the optimum split point in a node. Instead, only a predetermined number of randomly selected features is used, which avoids overfitting of the model.
Out-Of-Bag Error: For every tree, its particular out-of-bag data is used for prediction, and the results are aggregated over all trees to compute the error rate, the out-of-bag error. Compared to cross-validation, the out-of-bag error is unbiased and a good estimate of the generalization error. With an increasing number of trees the out-of-bag error decreases and converges to a limit, as for example in the next figure. This is the lowest error rate that can be achieved by the RF according to the training data.
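The convergence of the out-of-bag error can be reproduced with any RF implementation. Below is a short sketch using Python and scikit-learn as an illustrative stand-in (imageRF itself is implemented in IDL):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for image spectra with class labels.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           random_state=0)

# The out-of-bag error typically drops quickly as the ensemble grows
# and then levels off at its limit.
for n_trees in (25, 50, 100, 200):
    rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                                random_state=0)
    rf.fit(X, y)
    print(f"{n_trees:4d} trees: OOB error = {1 - rf.oob_score_:.3f}")
```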
According to Breiman (2001) and Gislason et al. (2006), RF do not tend to overfit, although the binary trees of an RF are not pruned. This behavior can be traced back to the law of large numbers. Furthermore, RF are robust against outliers in the training data and, according to Pal (2003), generate good results with noisy data.
User Guide
Data Types
imageRF uses image data that is stored according to the ENVI File Format. A description of this format and its different file types is given at Data Format Definition.
For users already familiar with the ENVI File Format: Please note that files used as a regression reference must provide the following entries in the header file (*.hdr):
bands = 1
file type = ENVI Standard
data ignore value = <your data ignore value>
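For orientation, a minimal header for a single-band regression reference might look as follows. All values apart from the three required entries above are placeholder assumptions and must match your actual data (e.g. data type = 4 denotes 32-bit float in the ENVI convention):

```
ENVI
description = {regression reference, single band}
samples = 300
lines = 300
bands = 1
header offset = 0
file type = ENVI Standard
data type = 4
interleave = bsq
byte order = 0
data ignore value = -9999
```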
Parameterization of RFC/RFR Models
The parameterization of RF requires only the number of trees to be grown. In imageRF, the user can decide to parameterize the RFC or RFR model using either default or advanced settings. The following example shows how to parameterize an RFC Model. (This is analogous for RFR models using the imageRF Regression > Parameterize RF Regression (RFR) menu. Differences between RFC and RFR Models are highlighted.)
Default Settings
- From the imageRF Classification menu, select Parameterize RF Classifier (RFC).
- Select the Image to be classified and the file specifying the Reference Data for the training.
- In the parameter frame you can change the number of trees used to grow the RF. The default number of trees is 100. Default values can be examined by clicking Advanced (for more information see Advanced Settings). In most cases, the default values already lead to high accuracies.
- Specify path and filename for the RFC model (*.rfc).
- Click Accept when you are finished and wait until the parameterization is done.
An RFC file will be written to the specified directory. A report on the generated RFC file can be viewed using the View RFC Parameters tool.
- You may proceed by directly applying the model to an image, as described in Section Apply RFC/RFR Models to Image.
Advanced Settings
The Advanced Settings option allows the user to modify the setup of the RF. The user can:
- define the function to determine the number of randomly selected features to compute the best split point or set this number manually
- select the impurity function (RFC only)
- set the stop criteria by defining a minimum impurity and a minimum number of samples per node
- From the imageRF Classification menu, select Parameterize RF Classifier (RFC).
- Select the Image to be classified and the file associated with the Reference Areas for the training.
- Specify path and filename for the RFC model (*.rfc).
- Click Advanced to continue with the advanced settings. The parameterization dialog will be expanded.
- The functions to determine the number of randomly selected features n_r from the total number of features n_a are:
- n_r = sqrt(n_a) (default)
- n_r = log2(n_a)
- n_r = a user-defined value
- The functions to determine the impurity in a node are the Gini Index and the Entropy (not required for an RF regression; a short computational sketch follows at the end of this section):
- Gini Index: Gini(t) = 1 - Σ p(i|t)²
- Entropy: Entropy(t) = -Σ p(i|t) * log2(p(i|t))
where the sums run over the c classes, t is a node of a tree, and p(i|t) is the relative frequency of class i in node t
- The stop criteria to stop splitting are:
- the Minimum number of samples in a node (default = 1)
- the Minimum impurity in a node (default = 0.0)
Using the default values, the decision trees will be fully grown.
- Specify RFC Model file path for output.
- Click Accept when you are finished with the Advanced Settings. A file with the RF Classifier (*.rfc) will be written to disk. A report on the generated RFC file can be viewed using View RFC Parameters.
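The sketch announced above illustrates the two impurity measures and the default feature subset size in Python (illustrative only; imageRF computes these internally in IDL):

```python
import numpy as np

def gini(labels):
    # Gini index of a node t: 1 - sum over classes of p(i|t)^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy of a node t: -sum over classes of p(i|t) * log2(p(i|t))
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

node = np.array([1, 1, 1, 2, 2, 3])   # hypothetical class labels in one node
print(f"Gini = {gini(node):.3f}, Entropy = {entropy(node):.3f}")

n_a = 200                  # total number of features (example value)
n_r = int(np.sqrt(n_a))    # default number of randomly selected features
print(f"n_r = sqrt({n_a}) = {n_r}")
```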
Apply RFC/RFR Models to Image
After successfully parameterizing the RFC or RFR model, you can apply it to an image file. This is shown in the following example for an RFC model. (In case of RFR models this is analogous, starting from the imageRF Regression > Apply RFR to Image menu.)
- From the imageRF Classification menu, select Classify Image.
- Select the RFC Model and the Image to be applied.
- Optionally select a Mask file if you want to constrain the estimation to specific areas. Masked pixels will be set to zero, i.e. unclassified. Note: In case of a regression, masked pixels will be set to the data ignore value of the reference area file that was used to train the model.
- Specify a file name for the RFC Result. Optionally, an image showing the probabilities for each class can be added (not available for RFR models).
- Click Accept when you are finished and wait until the classification is done.
The final classification and the optional class probabilities will appear in the Filelist and can be opened in the View Manager (see Figure 5).
Figure 5: Output of RF Classification in the EnMAP-Box. A classification with five landcover classes and underlying class probabilities for (1) vegetation, (2) built-up, (3) impervious, (4) soil and (5) water.
Fast Accuracy Assessment of RFC/RFR Models
The following example shows the Fast Accuracy Assessment for RFC models. (This is analogous for RFR models in the imageRF Regression menu.)
- From the imageRF Classification menu select Fast Accuracy Assessment.
- Specify an RFC Model, the associated Image and the Reference Data for independent validation.
- Click Accept and wait until the assessment is done. The Accuracy Assessment Report appears.
The image pixels are extracted from locations defined in the reference file only and used to create a temporary estimation. Comparing the estimation with the corresponding reference values yields accuracy statistics.
Accuracy Report for Classification Results
The output is summarized and several performance measures are provided. (A small computational sketch of these measures follows after the list.)
- Quick Overview: Overall accuracy measures and class-wise measures including the 95% confidence interval.
- Error Matrix: Contains the number of correctly classified pixels on the diagonal, omitted pixels in each class's column, and falsely included pixels in each class's row.
- Estimated Map Areas
- Performance Measures for each class
- Error of Omission [%]: The share of reference pixels of that class that have been "omitted" in the classification image (pixels in the column excluding the diagonal). Equals 100 minus Producer's Accuracy.
- Error of Commission [%]: Percentage of class pixels in the classification image which are falsely classified. Equals 100 minus User's Accuracy.
- User's Accuracy [%]: 100 minus Error of Commission.
- Producer's Accuracy [%]: 100 minus Error of Omission.
- F1 Measure [%]: Harmonic mean of the User's Accuracy (UA_i) and Producer's Accuracy (PA_i) of class i: F1_i = 2 * UA_i * PA_i / (UA_i + PA_i).
- Avg. F1 Accuracy: Arithmetic mean of class-wise F1 measures.
- Overall Accuracy [%]: Percentage of correctly classified pixels.
- Kappa Accuracy: Cohen's kappa coefficient, i.e. the agreement between classification and reference corrected for chance agreement.
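As a worked example of the measures listed above, the following Python sketch derives them from a small hypothetical error matrix (rows = classification, columns = reference, as in the report):

```python
import numpy as np

# Hypothetical 3-class error matrix; counts chosen only for illustration.
cm = np.array([[50,  3,  2],
               [ 4, 45,  6],
               [ 1,  2, 37]])

diag = np.diag(cm)
ua = 100 * diag / cm.sum(axis=1)   # user's accuracy per class (row-wise)
pa = 100 * diag / cm.sum(axis=0)   # producer's accuracy per class (column-wise)
f1 = 2 * ua * pa / (ua + pa)       # class-wise F1 measure
oa = 100 * diag.sum() / cm.sum()   # overall accuracy

# Cohen's kappa: observed agreement corrected for chance agreement.
po = diag.sum() / cm.sum()
pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / cm.sum() ** 2
kappa = (po - pe) / (1 - pe)

print("UA [%]:", ua.round(1), " PA [%]:", pa.round(1))
print("F1 [%]:", f1.round(1), " avg F1:", f1.mean().round(1))
print(f"OA = {oa:.1f} %, Kappa = {kappa:.3f}")
```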
Accuracy Report for Regression Results
The Fast Accuracy Assessment of an RFR model shows three windows. The first gives a textual report of the residual statistics, including the values for MAE, RMSE and r².
The scatter plot shows the values of the validation reference against the RFR estimation. Ideally, the points lie along the diagonal, which would mean that all estimated values equal the reference values.
The third graph shows the distribution of the residuals. Underestimations have negative values and overestimations positive values. Position 0 corresponds to the diagonal in the previous graph.
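The reported statistics can be recomputed from any pair of reference and estimation vectors. Below is a minimal Python sketch with hypothetical values; r² is computed here as the squared Pearson correlation, which is an assumption and may differ from the exact definition used in the report:

```python
import numpy as np

# Hypothetical validation reference and corresponding RFR estimates.
ref = np.array([0.12, 0.40, 0.55, 0.73, 0.90])
est = np.array([0.15, 0.35, 0.60, 0.70, 0.95])

residuals = est - ref                    # > 0: overestimation, < 0: underestimation
mae  = np.mean(np.abs(residuals))        # mean absolute error
rmse = np.sqrt(np.mean(residuals ** 2))  # root mean square error
r2   = np.corrcoef(ref, est)[0, 1] ** 2  # squared Pearson correlation

print(f"MAE = {mae:.3f}, RMSE = {rmse:.3f}, r2 = {r2:.3f}")
```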
View Parameterization of RFC/RFR Models
Information on RFC or RFR models can be examined using the View RFC Parameters tool. This is especially helpful for keeping an overview when various models have been trained. The Model window contains information on the training data used, the out-of-bag error, the number of trees of the RF, the number of features used for splitting and the total number of features.
The following example shows how to view the parameters of an RFC model. For RFR models this works analogously, starting from the imageRF Regression menu.
- From the imageRF Classification main menu, select View RFC Parameters.
- In the following dialog, specify the RFC file.
- Click Accept to view the model's learning curve and its parameterization values.
Variable Importance of RFC/RFR Models
A further option implemented in imageRF 1.1 is a tool to compute the variable importance for feature selection. Using a smaller number of features may yield an accuracy comparable to that of larger feature sets, while offering potential advantages regarding data storage and computational processing costs.
The following example shows how to calculate and view the variable importance of an RFC model. For RFR models this is analogous.
- From the imageRF Classification menu, select RFC Variable Importance.
- In the dialog that appears, choose a previously created *.rfc model file.
- Click Accept and wait until the calculation of Variable Importance is done.
The normalized and raw variable importance will now be computed. For each tree, the out-of-bag samples are permuted in the respective variable, passed down the tree, and the accuracies are computed. The accuracies of the permuted out-of-bag samples are subtracted from the accuracies of the original samples. The average of these accuracy differences over all trees is the raw importance of the variable. Dividing the raw variable importance by the respective standard deviation results in the normalized variable importance. A high value means that the variable has a high importance for the entire RF, and vice versa.
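The permutation idea can be imitated with scikit-learn's generic permutation importance. Note that this is only an analogous sketch: imageRF permutes the out-of-bag samples tree by tree as described above, while permutation_importance permutes a user-supplied evaluation set for the whole forest.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic stand-in data for image features and class labels.
X, y = make_classification(n_samples=800, n_features=10, n_informative=4,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
raw = result.importances_mean                        # mean accuracy drop per feature
normalized = raw / (result.importances_std + 1e-12)  # scaled by its std. deviation

for i in np.argsort(raw)[::-1]:
    print(f"feature {i}: raw = {raw[i]:.4f}, normalized = {normalized[i]:.2f}")
```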
Normalized Variable Importance of an RFC model
Raw Variable Importance of an RFC model
References
Breiman, L. (2001): Random Forests. Machine Learning 45(1), 5–32.
Breiman, L. & Cutler, A. (2011): Random Forests – Leo Breiman and Adele Cutler. URL: https://www.stat.berkeley.edu/~breiman/RandomForests/, last visited on 2015-06-22.
Gislason, P. O., Benediktsson, J. A. & Sveinsson, J. R. (2006): Random Forests for land cover classification. Pattern Recognition Letters 27(4), 294–300.
Pal, M. (2003): Random Forests for Land Cover Classification. Proceedings of the International Geoscience and Remote Sensing Symposium 2003, 3510–3512.
Polikar, R. (2006): Ensemble based systems in decision making. IEEE Circuits and Systems Magazine 6(3), 21–45.