Files changed (42)
+Much of Orange is devoted to machine learning methods for classification, or supervised data mining. These methods rely on
+the data with class-labeled instances, like that of senate voting. Here is a code that loads this data set, displays the first data instance and shows its predicted class (``republican``)::
+Classification uses two types of objects: learners and classifiers. Learners consider class-labeled data and return a classifier. Given a data instance (a vector of feature values), classifiers return a predicted class::
+Above, we read the data, constructed a `naive Bayesian learner <http://en.wikipedia.org/wiki/Naive_Bayes_classifier>`_, gave it the data set to construct a classifier, and used it to predict the class of the first data item. We also use these concepts in the following code that predicts the classes of the first five instances in the data set:
+Naive Bayesian classifier has made a mistake in the third instance, but otherwise predicted correctly. No wonder, since this was also the data it trained from.
+additional parameter that specifies the output type. If this is ``Orange.classification.Classifier.GetProbabilities``, the classifier will output class probabilities:
+The output of the script also shows how badly the naive Bayesian classifier missed the class for the thrid data item::
+Validating the accuracy of classifiers on the training data, as we did above, serves demonstration purposes only. Any performance measure that assess accuracy should be estimated on the independent test set. Such is also a procedure called `cross-validation <http://en.wikipedia.org/wiki/Cross-validation_(statistics)>`_, which averages performance estimates across several runs, each time considering a different training and test subsets as sampled from the original data set:
+Cross-validation is expecting a list of learners. The performance estimators also return a list of scores, one for every learner. There was just one learner in the script above, hence the list of size one was used. The script estimates classification accuracy and area under ROC curve. The later score is very high, indicating a very good performance of naive Bayesian learner on senate voting data set::
+Some of these are included in the code that estimates the probability of a target class on a testing data. This time, training and test data sets are disjoint:
+For these five data items, there are no major differences between predictions of observed classification algorithms::
+The following code cross-validates several learners. Notice the difference between this and the code above. Cross-validation requires learners, while in the script above, learners were immediately given the data and the calls returned classifiers.
+Classification models are objects, exposing every component of its structure. For instance, one can traverse classification tree in code and observe the associated data instances, probabilities and conditions. It is often, however, sufficient, to provide textual output of the model. For logistic regression and trees, this is illustrated in the script below:
+print "\nAfter feature subset selection with margin %5.3f (%d attributes):" % (marg, len(ndata.domain.attributes))
+print ", ".join("%s: %.3f" % (l.name, s) for l, s in zip(learners, Orange.evaluation.scoring.AUC(res)))
+This section describes how to load and save the data. We also show how to explore the data, its domain description, how to report on basic data set statistics, and how to sample the data.
+Orange can read files in native and other data formats. Native format starts with feature (attribute) names, their type (continuous, discrete, string). The third line contains meta information to identify dependent features (class), irrelevant features (ignore) or meta features (meta). Here are the first few lines from a data set :download:`lenses.tab <code/lenses.tab>` on prescription of eye
+Values are tab-limited. The data set has four attributes (age of the patient, spectacle prescription, notion on astigmatism, and information on tear production rate) and an associated three-valued dependent variable encoding lens prescription for the patient (hard contact lenses, soft contact lenses, no lenses). Feature descriptions could use one letter only, so the header of this data set could also read::
+You may download :download:`lenses.tab <code/lenses.tab>` to a target directory and there open a python shell. Alternatively, just execute the code below; this particular data set comes with Orange instalation, and Orange knows where to find it:
+Note that for the file name no suffix is needed; as Orange checks if any files in the current directory are of the readable type. The call to ``Orange.data.Table`` creates an object called ``data`` that holds your data set and information about the lenses domain:
+<Orange.feature.Discrete 'age', Orange.feature.Discrete 'prescription', Orange.feature.Discrete 'astigmatic', Orange.feature.Discrete 'tear_rate'>
+The following script wraps-up everything we have done so far and lists first 5 data instances with ``soft`` perscription:
+Note that data is an object that holds both the data and information on the domain. We show above how to access attribute and class names, but there is much more information there, including that on feature type, set of values for categorical features, and other.
+This time, we have to provide the extension for Orange to know which data format to use. An extension for native Orange's data format is ".tab". The following code saves only the data items with myope perscription:
+Data table object stores information on data instances as well as on data domain. Domain holds the names of features, optional classes, their types and, if categorical, value names.
+Orange's objects often behave like Python lists and dictionaries, and can be indexed or accessed through feature names.
+Data table stores data instances (or examples). These can be index or traversed as any Python list. Data instances can be considered as vectors, accessed through element index, or through feature name.
+Iris data set we have used above has four continous attributes. Here's a script that computes their mean:
+Above also illustrates indexing of data instances with objects that store features; in ``d[x]`` variable ``x`` is an Orange object. Here's the output::
+The particular data instance included missing data (represented with '?') for first and fourth feature. We can use the method ``is_special()`` to detect parts of the data which is missing. In the original data set file, the missing values are, by default, represented with a blank space. We use the method ``is_special()`` below to examine each feature and report on proportion of instances for which this feature was undefined:
+``Orange.data.Table`` accepts a list of data items and returns a new data set. This is useful for any data subsetting:
+and inherits the data description (domain) from the original data set. Changing the domain requires setting up a new domain descriptor. This feature is useful for any kind of feature selection:
+By default, ``Orange.data.Domain`` assumes that last feature in argument feature list is a class variable. This can be changed with an optional argument::
+The first call to ``Orange.data.Domain`` constructed the classless domain, while the second used the last feature and constructed the domain with one input feature and a continous class.
+`Learning of ensembles <http://en.wikipedia.org/wiki/Ensemble_learning>`_ combines the predictions of separate models to gain in accuracy. The models may come from different training data samples, or may use different learners on the same data sets. Learners may also be diversified by changing their parameter sets.
+In Orange, ensembles are simply wrappers around learners. They behave just like any other learner. Given the data, they return models that can predict the outcome for any data instance::
+Most ensemble methods can wrap either classification or regression learners. Exceptions are task-specialized techniques such as boosting.
+`Bootstrap aggregating <http://en.wikipedia.org/wiki/Bootstrap_aggregating>`_, or bagging, samples the training data uniformly and with replacement to train different predictors. Majority vote (classification) or mean (regression) across predictions then combines independent predictions into a single prediction.
+In general, boosting is a technique that combines weak learners into a single strong learner. Orange implements `AdaBoost <http://en.wikipedia.org/wiki/AdaBoost>`_, which assigns weights to data instances according to performance of the learner. AdaBoost uses these weights to iteratively sample the instances to focus on those that are harder to classify. In the aggregation AdaBoost emphases individual classifiers with better performance on their training sets.
+The following script wraps a classification tree in boosted and bagged learner, and tests the three learner through cross-validation:
+The benefit of the two ensembling techniques, assessed in terms of area under ROC curve, is obvious::
+Consider we partition a training set into held-in and held-out set. Assume that our taks is prediction of y, either probability of the target class in classification or a real value in regression. We are given a set of learners. We train them on held-in set, and obtain a vector of prediction on held-out set. Each element of the vector corresponds to prediction of individual predictor. We can now learn how to combine these predictions to form a target prediction, by training a new predictor on a data set of predictions and true value of y in held-out set. The technique is called `stacked generalization <http://en.wikipedia.org/wiki/Ensemble_learning#Stacking>`_, or in short stacking. Instead of a single split to held-in and held-out data set, the vectors of predictions are obtained through cross-validation.
+By default, the meta classifier is naive Bayesian classifier. Changing this to logistic regression may be a good idea as well::
+Stacking is often better than each of the base learners alone, as also demonstrated by running our script::
+`Random forest <http://en.wikipedia.org/wiki/Random_forest>`_ ensembles tree predictors. The diversity of trees is achieved in randomization of feature selection for node split criteria, where instead of the best feature one is picked arbitrary from a set of best features. Another source of randomization is a bootstrap sample of data from which the threes are developed. Predictions from usually several hundred trees are aggregated by voting. Constructing so many trees may be computationally demanding. Orange uses a special tree inducer (Orange.classification.tree.SimpleTreeLearner, considered by default) optimized for speed in random forest construction:
+Random forests are often superior when compared to other base classification or regression learners::
+This is a gentle introduction on scripting in Orange. Orange is a Python `Python <http://www.python.org/>`_ library, and the tutorial is a guide through Orange scripting in this language.
+We here assume you have already `downloaded and installed Orange <http://orange.biolab.si/download/>`_ and have a working version of Python. Python scripts can run in a terminal window, integrated environments like `PyCharm <http://www.jetbrains.com/pycharm/>`_ and `PythonWin <http://wiki.python.org/moin/PythonWin>`_,
+or shells like `iPython <http://ipython.scipy.org/moin/>`_. Whichever environment you are using, try now to import Orange. Below, we used a Python shell::
+Orange comes with plenty classification and regression algorithms. But its also fun to make the new ones. You can build them anew, or wrap existing learners and add some preprocessing to construct new variants. Notice that learners in Orange have to adhere to certain rules. Let us observe them on a classification algorithm::
+When learner is given the data it returns a predictor. In our case, classifier. Classifiers are passed data instances and return a value of a class. They can also return probability distribution, or this together with a class value::
+Regression is similar, just that the regression model would return only the predicted continuous value.
+Notice also that the constructor for the learner can be given the data, and in that case it will construct a classifier (what else could it do?)::
+Consider a naive Bayesian classifiers. They do perform well, but could loose accuracy when there are many features, especially when these are correlated. Feature selection can help. We may want to wrap naive Bayesian classifier with feature subset selection, such that it would learn only from the few most informative features. We will assume the data contains only discrete features and will score them with information gain. Here is an example that sets the scorer (``gain``) and uses it to find best five features from the classification data set:
+We need to incorporate the feature selection within the learner, at the point where it gets the data. Learners for classification tasks inherit from ``Orange.classification.PyLearner``:
+The initialization part of the learner (``__init__``) simply stores the based learner (in our case a naive Bayesian classifier), the name of the learner and a number of features we would like to use. Invocation of the learner (``__call__``) scores the features, stores the best one in the list (``best``), construct a data domain and then uses the one to transform the data (``Orange.data.Table(domain, data)``) by including only the set of the best features. Besides the most informative features we needed to include also the class. The learner then returns the classifier by using a generic classifier ``Orange.classification.PyClassifier``, where the actual prediction model is passed through ``classifier`` argument.
+Note that classifiers in Orange also use the weight vector, which records the importance of training data items. This is useful for several algorithms, like boosting.
+It does! We constructed the naive Bayesian classifier with only three features. But how do we know what is the best number of features we could use? It's time to construct one more learner.
+Given a training data, what is the best number of features we could use with a training algorithm? We can estimate that through cross-validation, by checking possible feature set sizes and estimating how well does the classifier on such reduced feature set behave. When we are done, we use the feature sets size with best performance, and build a classifier on the entire training set. This procedure is often referred to as internal cross validation. We wrap it into a new learner:
+Again, our code stores the arguments at initialization (``__init__``). The learner invocation part selects the best value of parameter ``m``, the size of the feature set, and uses it to construct the final classifier.
+We can now compare the three classification algorithms. That is, the base classifier (naive Bayesian), the classifier with a fixed number of selected features, and the classifier that estimates the optimal number of features from the training set:
+And the result? The classifier with feature set size wins (but not substantially. The results would be more pronounced if we would run this on the datasets with larger number of features)::
+From the interface point of view, regression methods in Orange are very similar to classification. Both intended for supervised data mining, they require class-labeled data. Just like in classification, regression is implemented with learners and regression models (regressors). Regression learners are objects that accept data and return regressors. Regression models are given data items to predict the value of continuous class:
+Let us start with regression trees. Below is an example script that builds the tree from data on housing prices and prints out the tree in textual form:
+Following is initialization of few other regressors and their prediction of the first five data instances in housing price data set:
+Just like for classification, the same evaluation module (``Orange.evaluation``) is available for regression. Its testing submodule includes procedures such as cross-validation, leave-one-out testing and similar, and functions in scoring submodule can assess the accuracy from the testing:
+`MARS <http://en.wikipedia.org/wiki/Multivariate_adaptive_regression_splines>`_ has the lowest root mean squared error::