
Selection (``selection``)
=========================

The feature selection module contains several utility functions for selecting features based on their scores, normally obtained in classification or regression problems. A typical example is the function :obj:`select`, which returns a subset of the highest-scored features:
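
The original example script is not reproduced on this page; the following is a minimal sketch of the idea. It assumes the ``voting`` data set and that :obj:`select` takes a data table, the list of (name, score) pairs returned by :obj:`Orange.feature.scoring.score_all`, and the number of features to keep::

    import Orange

    voting = Orange.data.Table("voting")

    # score all features with ReliefF and keep only the three best ones
    # (assumed signature: select(data, scores, n))
    scores = Orange.feature.scoring.score_all(voting, Orange.feature.scoring.Relief())
    best = Orange.feature.selection.select(voting, scores, 3)

    print "Best 3 features:"
    for feature in best.domain.features:
        print feature.name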

The script outputs::

    Best 3 features:

The module also includes a learner that incorporates feature subset selection.

Functions for feature subset selection
--------------------------------------

Learning with feature subset selection
--------------------------------------

Class wrappers for selection functions
--------------------------------------

Below are a few examples of how to use this class, assuming ``data`` holds a data table such as ``Orange.data.Table("voting")``; the last call also passes a scoring measure (Gini serves here only as an illustration)::

    >>> filter = Orange.feature.selection.FilterAboveThreshold(threshold=.15)
    >>> new_data = filter(data)
    >>> new_data = Orange.feature.selection.FilterAboveThreshold(data)
    >>> new_data = Orange.feature.selection.FilterAboveThreshold(data, threshold=.1)
    >>> new_data = Orange.feature.selection.FilterAboveThreshold(data, threshold=.1, \
    ...     measure=Orange.feature.scoring.Gini())


The following script defines a new Naive Bayes classifier that selects the five best features from the data set before learning. The new classifier is wrapped up in a special class (see the :doc:`/tutorial/rst/python-learners` lesson in :doc:`/tutorial/rst/index`). The script compares this filtered learner with one that uses the complete set of features.
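
The script itself is not included here; the sketch below illustrates the idea, again assuming the ``voting`` data set and the :obj:`select` signature described above::

    import Orange


    class BayesFSS(object):
        """Naive Bayes learner that trains only on the N highest-scored features."""

        def __init__(self, name='Naive Bayes with FSS', N=5):
            self.name = name
            self.N = N

        def __call__(self, data, weight=None):
            # score all features, keep the N best ones, train naive Bayes on them
            scores = Orange.feature.scoring.score_all(data, Orange.feature.scoring.Relief())
            filtered = Orange.feature.selection.select(data, scores, self.N)
            return Orange.classification.bayes.NaiveLearner(filtered)


    # compare the filtered learner with plain naive Bayes on the voting data set
    voting = Orange.data.Table("voting")
    learners = [Orange.classification.bayes.NaiveLearner(name='Naive Bayes'),
                BayesFSS(name='with FSS')]
    results = Orange.evaluation.testing.cross_validation(learners, voting)

    print "Learner      CA"
    for learner, ca in zip(learners, Orange.evaluation.scoring.CA(results)):
        print "%-12s %5.3f" % (learner.name, ca)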


Interestingly, though somewhat expectedly, feature subset selection helps. This is the output that we get::

    Learner      CA
    Naive Bayes  0.903
    with FSS     0.940

We can do all of the above by wrapping the learner with :class:`~Orange.feature.selection.FilteredLearner`, thus creating an object that is assembled from a data filter and a base learner. When given a data table, this learner uses the attribute filter to construct a new data set and the base learner to construct a corresponding classifier. Attribute filters should be classes like :class:`~Orange.feature.selection.FilterAboveThreshold` or :class:`~Orange.feature.selection.FilterBestN` that can be initialized with arguments and later presented with data, returning a new, reduced data set.
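
For example, :class:`~Orange.feature.selection.FilterBestN` can be constructed with its arguments first and applied to data later (a sketch; it assumes the number of retained features is set through a parameter named ``n``)::

    >>> filter = Orange.feature.selection.FilterBestN(n=3)
    >>> new_data = filter(data)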

The following code fragment replaces the bulk of the code from the previous example, and compares the naive Bayesian classifier to the same classifier when only the single most important attribute is used.
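
A sketch of that fragment is given below; it assumes that :class:`~Orange.feature.selection.FilteredLearner` accepts the base learner as its first argument together with ``filter`` and ``name`` keyword arguments::

    import Orange

    voting = Orange.data.Table("voting")

    # naive Bayes wrapped so that it sees only the single best feature
    nb = Orange.classification.bayes.NaiveLearner()
    fl = Orange.feature.selection.FilteredLearner(nb,
        filter=Orange.feature.selection.FilterBestN(n=1), name='filtered')

    learners = [Orange.classification.bayes.NaiveLearner(name='bayes'), fl]
    results = Orange.evaluation.testing.cross_validation(learners, voting)

    print "Learner      CA"
    for learner, ca in zip(learners, Orange.evaluation.scoring.CA(results)):
        print "%-12s %5.3f" % (learner.name, ca)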


Now, let's decide to retain three features and observe how many times each attribute was used. Remember, 10-fold cross-validation constructs ten instances of each classifier, and each time we run :class:`~.FilteredLearner` a different set of features may be selected. :obj:`Orange.evaluation.testing.cross_validation` stores the classifiers in the results variable, and :class:`~.FilteredLearner` returns a classifier that can tell which features it used, so the code to do all this is quite short.
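
A sketch of this step is shown below. It assumes that classifiers are kept by passing ``store_classifiers=True`` to :obj:`Orange.evaluation.testing.cross_validation`, that they are indexed as ``results.classifiers[fold][learner]``, and that the classifier built by :class:`~.FilteredLearner` exposes its reduced domain through a ``domain`` attribute::

    import Orange

    voting = Orange.data.Table("voting")

    # naive Bayes wrapped so that it keeps the three best features
    fl = Orange.feature.selection.FilteredLearner(
        Orange.classification.bayes.NaiveLearner(),
        filter=Orange.feature.selection.FilterBestN(n=3), name='filtered')
    learners = [Orange.classification.bayes.NaiveLearner(name='bayes'), fl]

    # keep the trained classifiers so that they can be inspected afterwards
    results = Orange.evaluation.testing.cross_validation(
        learners, voting, store_classifiers=True)

    # count how often each feature was retained by the filtered learner
    used = {}
    for fold_classifiers in results.classifiers:
        filtered_classifier = fold_classifiers[1]   # classifier built by 'filtered'
        for feature in filtered_classifier.domain.features:
            used[feature.name] = used.get(feature.name, 0) + 1

    print "Number of times features were used in cross-validation:"
    for name, count in used.items():
        print "%2d x %s" % (count, name)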

Running the script with three features selected each time a learner is run gives the following result::

    Learner      CA
    bayes        0.903
    filtered     0.956

    Number of times features were used in cross-validation:
     3 x el-salvador-aid
     6 x synfuels-corporation-cutback
     7 x adoption-of-the-budget-resolution
    10 x physician-fee-freeze
     4 x crime

