Commits

Blaz Zupan committed 04009d1

new tutorial

  • Participants
  • Parent commits 782c9d8

Comments (0)

Files changed (11)

Orange/classification/__init__.py

 RandomLearner = core.RandomLearner
 ClassifierFromVar = core.ClassifierFromVar
 ConstantClassifier = core.DefaultClassifier
+
+class PyLearner(object):
+    def __new__(cls, data=None, **kwds):
+        learner = object.__new__(cls)
+        if data:
+            learner.__init__(**kwds)  # returning a non-instance from __new__ skips __init__, so force it
+            return learner(data)  # train immediately and return a classifier
+        else:
+            return learner  # plain instance; Python will now invoke __init__
+
+    def __init__(self, name='learner'):
+        self.name = name
+
+    def __call__(self, data, weight=None):
+        return None
+
+class PyClassifier:
+    def __init__(self, **kwds):
+        self.__dict__.update(kwds)
+
+    def __call__(self, example, resultType = Classifier.GetValue):
+        return self.classifier(example, resultType)

docs/tutorial/rst/classification.rst

 ==============
 
 .. index:: classification
-.. index:: supervised data mining
+.. index:: 
+   single: data mining; supervised
 
-A substantial part of Orange is devoted to machine learning methods
-for classification, or supervised data mining. These methods start
-from the data that incorporates class-labeled instances, like
-:download:`voting.tab <code/voting.tab>`::
+Much of Orange is devoted to machine learning methods for classification, or supervised data mining. These methods rely on
+data with class-labeled instances, such as the senate voting records. Here is the code that loads this data set, displays the first data instance and shows its class (``republican``)::
 
-   >>> data = orange.ExampleTable("voting.tab")
+   >>> data = Orange.data.Table("voting")
    >>> data[0]
    ['n', 'y', 'n', 'y', 'y', 'y', 'n', 'n', 'n', 'y', '?', 'y', 'y', 'y', 'n', 'y', 'republican']
-   >>> data[0].getclass()
+   >>> data[0].get_class()
    <orange.Value 'party'='republican'>
 
-Supervised data mining attempts to develop predictive models from such
-data that, given the set of feature values, predict a corresponding
-class.
+Learners and Classifiers
+------------------------
 
-.. index:: classifiers
 .. index::
-   single: classifiers; naive Bayesian
+   single: classification; learner
+.. index::
+   single: classification; classifier
+.. index::
+   single: classification; naive Bayesian classifier
 
-There are two types of objects important for classification: learners
-and classifiers. Orange has a number of build-in learners. For
-instance, ``orange.BayesLearner`` is a naive Bayesian learner. When
-data is passed to a learner (e.g., ``orange.BayesLearner(data))``, it
-returns a classifier. When data instance is presented to a classifier,
-it returns a class, vector of class probabilities, or both.
+Classification uses two types of objects: learners and classifiers. A learner is given class-labeled data and returns a classifier. Given a data instance (a vector of feature values), a classifier returns the predicted class::
 
-A Simple Classifier
--------------------
+    >>> import Orange
+    >>> data = Orange.data.Table("voting")
+    >>> learner = Orange.classification.bayes.NaiveLearner()
+    >>> classifier = learner(data)
+    >>> classifier(data[0])
+    <orange.Value 'party'='republican'>
 
-Let us see how this works in practice. We will
-construct a naive Bayesian classifier from voting data set, and
-will use it to classify the first five instances from this data set
-(:download:`classifier.py <code/classifier.py>`)::
+Above, we read the data, constructed a `naive Bayesian learner <http://en.wikipedia.org/wiki/Naive_Bayes_classifier>`_, gave it the data set to construct a classifier, and used the classifier to predict the class of the first data item. The following code applies the same concepts to predict the classes of the first five instances in the data set:
 
-   import orange
-   data = orange.ExampleTable("voting")
-   classifier = orange.BayesLearner(data)
-   for i in range(5):
-       c = classifier(data[i])
-       print "original", data[i].getclass(), "classified as", c
+.. literalinclude:: code/classification-classifier1.py
+   :lines: 4-
 
-The script loads the data, uses it to constructs a classifier using
-naive Bayesian method, and then classifies first five instances of the
-data set. Note that both original class and the class assigned by a
-classifier is printed out.
+The script outputs::
 
-The data set that we use includes votes for each of the U.S.  House of
-Representatives Congressmen on the 16 key votes; a class is a
-representative's party. There are 435 data instances - 267 democrats
-and 168 republicans - in the data set (see UCI ML Repository and
-voting-records data set for further description).  This is how our
-classifier performs on the first five instances:
+    republican; originally republican
+    republican; originally republican
+    republican; originally democrat
+      democrat; originally democrat
+      democrat; originally democrat
 
-   1: republican (originally republican)
-   2: republican (originally republican)
-   3: republican (originally democrat)
-   4: democrat (originally democrat)
-   5: democrat (originally democrat)
+The naive Bayesian classifier mispredicted the third instance but got the others right. No wonder: it was predicting on the very data it was trained on.
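The learner/classifier split is just a calling convention: a learner is a callable that, given data, returns another callable, the classifier. The sketch below illustrates that protocol in plain Python, without Orange; the majority-class learner and the ``(features, label)`` data representation are hypothetical stand-ins used only to show the idea:

```python
class MajorityLearner:
    """Learner: called with data, returns a classifier."""
    def __call__(self, data):
        # data: list of (features, label) pairs
        labels = [label for _, label in data]
        majority = max(set(labels), key=labels.count)
        return MajorityClassifier(majority)

class MajorityClassifier:
    """Classifier: called with an instance, returns a class."""
    def __init__(self, majority):
        self.majority = majority

    def __call__(self, instance):
        return self.majority

data = [(['n', 'y'], 'republican'), (['y', 'n'], 'democrat'),
        (['y', 'y'], 'democrat')]
classifier = MajorityLearner()(data)  # learner(data) -> classifier
prediction = classifier(['n', 'n'])   # classifier(instance) -> class
```

Orange's learners and classifiers follow this same two-step convention, only with richer data structures.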
 
-Naive Bayes made a mistake at a third instance, but otherwise predicted
-correctly.
-
-Obtaining Class Probabilities
------------------------------
+Probabilistic Classification
+----------------------------
 
 To find out the probability that the classifier assigns
 to, say, the democrat class, we call the classifier with an
-additional parameter ``orange.GetProbabilities``. Also, note that the
-democrats have a class index 1. We find this out with print
-``data.domain.classVar.values`` (:download:`classifier2.py <code/classifier2.py>`)::
+additional parameter that specifies the output type. If this is ``Orange.classification.Classifier.GetProbabilities``, the classifier will output class probabilities:
 
-   import orange
-   data = orange.ExampleTable("voting")
-   classifier = orange.BayesLearner(data)
-   print "Possible classes:", data.domain.classVar.values
-   print "Probabilities for democrats:"
-   for i in range(5):
-       p = classifier(data[i], orange.GetProbabilities)
-       print "%d: %5.3f (originally %s)" % (i+1, p[1], data[i].getclass())
+.. literalinclude:: code/classification-classifier2.py
+   :lines: 4-
 
-The output of this script is::
+The output of the script also shows how badly the naive Bayesian classifier missed the class for the third data item::
 
-   Possible classes: <republican, democrat>
-   Probabilities for democrats:
-   1: 0.000 (originally republican)
-   2: 0.000 (originally republican)
-   3: 0.005 (originally democrat)
-   4: 0.998 (originally democrat)
-   5: 0.957 (originally democrat)
+   Probabilities for democrat:
+   0.000; originally republican
+   0.000; originally republican
+   0.005; originally democrat
+   0.998; originally democrat
+   0.957; originally democrat
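Where such probabilities come from can be sketched in plain Python. The function below is a conceptual illustration of naive Bayesian estimation for discrete features, with Laplace smoothing; it is not Orange's implementation, and the ``(features, label)`` data representation is an assumption of the sketch:

```python
from collections import Counter

def naive_bayes_probs(train, instance):
    """P(c | x) is proportional to P(c) * product of P(x_j | c) over features;
    counts are Laplace-smoothed to avoid zero probabilities."""
    class_counts = Counter(label for _, label in train)
    n = len(train)
    scores = {}
    for c, nc in class_counts.items():
        p = nc / n  # class prior
        for j, value in enumerate(instance):
            match = sum(1 for feats, label in train
                        if label == c and feats[j] == value)
            p *= (match + 1) / (nc + 2)  # smoothed conditional probability
        scores[c] = p
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}  # normalize to sum to 1
```

Probabilities close to 0 or 1, as in the output above, arise when many features agree on the same class.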
 
-The printout, for example, shows that with the third instance
-naive Bayes has not only misclassified, but the classifier missed
-quite substantially; it has assigned only a 0.005 probability to
-the correct class.
+Cross-Validation
+----------------
 
-.. note::
-   Python list indexes start with 0.
+.. index:: cross-validation
 
-.. note::
-   The ordering of class values depend on occurence of classes in the
-   input data set.
+Validating the accuracy of classifiers on the training data, as we did above, serves demonstration purposes only. Any performance measure that assesses accuracy should be estimated on an independent test set. One such procedure is `cross-validation <http://en.wikipedia.org/wiki/Cross-validation_(statistics)>`_, which averages performance estimates across several runs, each time using different training and test subsets sampled from the original data set:
 
-Classification tree
--------------------
+.. literalinclude:: code/classification-cv.py
+   :lines: 3-
 
-.. index:: classifiers
 .. index::
-   single: classifiers; classification trees
+   single: classification; scoring
+.. index::
+   single: classification; area under ROC
+.. index::
+   single: classification; accuracy
 
-Classification tree learner (yes, this is the same *decision tree*)
-is a native Orange learner, but because it is a rather
-complex object that is for its versatility composed of a number of
-other objects (for attribute estimation, stopping criterion, etc.),
-a wrapper (module) called ``orngTree`` was build around it to simplify
-the use of classification trees and to assemble the learner with
-some usual (default) components. Here is a script with it (:download:`tree.py <code/tree.py>`)::
+Cross-validation expects a list of learners, and the performance estimators likewise return a list of scores, one for every learner. There was just one learner in the script above, hence a list of size one. The script estimates classification accuracy and area under the ROC curve. The latter score is very high, indicating very good performance of the naive Bayesian learner on the senate voting data set::
 
-   import orange, orngTree
-   data = orange.ExampleTable("voting")
+   Accuracy: 0.90
+   AUC:      0.97
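What cross-validation does internally can also be sketched in a few lines of plain Python. This is a conceptual stand-in for what Orange handles for you; the fold assignment by index and the learner protocol are assumptions of the sketch:

```python
def kfold_indices(n, k):
    """Split range(n) into k disjoint test folds plus their training complements."""
    folds = [[i for i in range(n) if i % k == f] for f in range(k)]
    return [([i for i in range(n) if i % k != f], folds[f])
            for f in range(k)]

def cross_val_accuracy(learner, data, k=10):
    """Train on each training subset, score on the held-out fold, average."""
    correct = total = 0
    for train_idx, test_idx in kfold_indices(len(data), k):
        classifier = learner([data[i] for i in train_idx])
        for i in test_idx:
            features, label = data[i]
            correct += classifier(features) == label
            total += 1
    return correct / total
```

Every instance is used for testing exactly once, and never by a model that saw it during training.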
+
+
+Handful of Classifiers
+----------------------
+
+Orange implements a wide range of classification algorithms, including:
+
+- logistic regression (``Orange.classification.logreg``)
+- k-nearest neighbors (``Orange.classification.knn``)
+- support vector machines (``Orange.classification.svm``)
+- classification trees (``Orange.classification.tree``)
+- classification rules (``Orange.classification.rules``)
+
+Some of these are used in the code below, which estimates the probability of a target class on test data. This time, the training and test data sets are disjoint:
+
+.. index::
+   single: classification; logistic regression
+.. index::
+   single: classification; trees
+.. index::
+   single: classification; k-nearest neighbors
+
+.. literalinclude:: code/classification-other.py
+
+For these five data items, there are no major differences between the predictions of the classification algorithms observed::
+
+   Probabilities for republican:
+   original class  tree      k-NN      lr       
+   republican      0.949     1.000     1.000
+   republican      0.972     1.000     1.000
+   democrat        0.011     0.078     0.000
+   democrat        0.015     0.001     0.000
+   democrat        0.015     0.032     0.000
+
+The following code cross-validates several learners. Notice the difference from the code above: cross-validation requires learners, while in the script above the learners were immediately given the data and the calls returned classifiers.
+
+.. literalinclude:: code/classification-cv2.py
+
+Logistic regression wins in area under the ROC curve::
+
+            nbc  tree lr  
+   Accuracy 0.90 0.95 0.94
+   AUC      0.97 0.94 0.99
+
+Reporting on Classification Models
+----------------------------------
+
+Classification models are objects that expose every component of their structure. For instance, one can traverse a classification tree in code and observe the associated data instances, probabilities and conditions. Often, however, a textual printout of the model suffices. For logistic regression and trees, this is illustrated in the script below:
+
+.. literalinclude:: code/classification-models.py
+
+The logistic regression part of the output is::
    
-   tree = orngTree.TreeLearner(data, sameMajorityPruning=1, mForPruning=2)
-   print "Possible classes:", data.domain.classVar.values
-   print "Probabilities for democrats:"
-   for i in range(5):
-       p = tree(data[i], orange.GetProbabilities)
-       print "%d: %5.3f (originally %s)" % (i+1, p[1], data[i].getclass())
+   class attribute = survived
+   class values = <no, yes>
+
+         Feature       beta  st. error     wald Z          P OR=exp(beta)
    
-   orngTree.printTxt(tree)
+       Intercept      -1.23       0.08     -15.15      -0.00
+    status=first       0.86       0.16       5.39       0.00       2.36
+   status=second      -0.16       0.18      -0.91       0.36       0.85
+    status=third      -0.92       0.15      -6.12       0.00       0.40
+       age=child       1.06       0.25       4.30       0.00       2.89
+      sex=female       2.42       0.14      17.04       0.00      11.25
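The last column is just the exponentiated coefficient, OR = exp(beta), which is easy to verify directly (plain Python; the coefficients below are copied from the output above):

```python
import math

# beta coefficients from the logistic regression printout
betas = {'status=first': 0.86, 'status=second': -0.16,
         'status=third': -0.92, 'age=child': 1.06, 'sex=female': 2.42}

# odds ratio: how the odds of the target class change when the feature holds
odds_ratios = {f: round(math.exp(b), 2) for f, b in betas.items()}
# e.g. exp(2.42) ~= 11.25: roughly 11-fold odds of survival for female passengers
```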
 
-.. note:: 
-   The script for classification tree is almost the same as the one
-   for naive Bayes (:download:`classifier2.py <code/classifier2.py>`), except that we have imported
-   another module (``orngTree``) and used learner
-   ``orngTree.TreeLearner`` to build a classifier called ``tree``.
+Trees can also be rendered in `dot <http://en.wikipedia.org/wiki/DOT_language>`_::
 
-.. note::
-   For those of you that are at home with machine learning: the
-   default parameters for tree learner assume that a single example is
-   enough to have a leaf for it, gain ratio is used for measuring the
-   quality of attributes that are considered for internal nodes of the
-   tree, and after the tree is constructed the subtrees no pruning
-   takes place.
+   tree.dot(file_name="0.dot", node_shape="ellipse", leaf_shape="box")
 
-The resulting tree with default parameters would be rather big, so we
-have additionally requested that leaves that share common predecessor
-(node) are pruned if they classify to the same class, and requested
-that tree is post-pruned using m-error estimate pruning method with
-parameter m set to 2.0. The output of our script is::
-
-   Possible classes: <republican, democrat>
-   Probabilities for democrats:
-   1: 0.051 (originally republican)
-   2: 0.027 (originally republican)
-   3: 0.989 (originally democrat)
-   4: 0.985 (originally democrat)
-   5: 0.985 (originally democrat)
-
-Notice that all of the instances are classified correctly. The last
-line of the script prints out the tree that was used for
-classification::
-
-   physician-fee-freeze=n: democrat (98.52%)
-   physician-fee-freeze=y
-   |    synfuels-corporation-cutback=n: republican (97.25%)
-   |    synfuels-corporation-cutback=y
-   |    |    mx-missile=n
-   |    |    |    el-salvador-aid=y
-   |    |    |    |    adoption-of-the-budget-resolution=n: republican (85.33%)
-   |    |    |    |    adoption-of-the-budget-resolution=y
-   |    |    |    |    |    anti-satellite-test-ban=n: democrat (99.54%)
-   |    |    |    |    |    anti-satellite-test-ban=y: republican (100.00%)
-   |    |    |    el-salvador-aid=n
-   |    |    |    |    handicapped-infants=n: republican (100.00%)
-   |    |    |    |    handicapped-infants=y: democrat (99.77%)
-   |    |    mx-missile=y
-   |    |    |    religious-groups-in-schools=y: democrat (99.54%)
-   |    |    |    religious-groups-in-schools=n
-   |    |    |    |    immigration=y: republican (98.63%)
-   |    |    |    |    immigration=n
-   |    |    |    |    |    handicapped-infants=n: republican (98.63%)
-   |    |    |    |    |    handicapped-infants=y: democrat (99.77%)
-
-The printout includes the feature on which the tree branches in the
-internal nodes. For leaves, it shows the the class label to which a
-tree would make a classification. The probability of that class, as
-estimated from the training data set, is also displayed.
-
-If you are more of a *visual* type, you may like the graphical 
-presentation of the tree better. This was achieved by printing out a
-tree in so-called dot file (the line of the script required for this
-is ``orngTree.printDot(tree, fileName='tree.dot',
-internalNodeShape="ellipse", leafShape="box")``), which was then
-compiled to PNG using program called `dot`_.
 The following figure shows an example of such a rendering.
 
 .. image:: files/tree.png
    :alt: A graphical presentation of a classification tree
-
-.. _dot: http://graphviz.org/
-
-Nearest neighbors and majority classifiers
-------------------------------------------
-
-.. index:: classifiers
-.. index:: 
-   single: classifiers; k nearest neighbours
-.. index:: 
-   single: classifiers; majority classifier
-
-Let us here check on two other classifiers. Majority classifier always
-classifies to the majority class of the training set, and predicts 
-class probabilities that are equal to class distributions from the training
-set. While being useless as such, it may often be good to compare this
-simplest classifier to any other classifier you test &ndash; if your
-other classifier is not significantly better than majority classifier,
-than this may a reason to sit back and think.
-
-The second classifier we are introducing here is based on k-nearest
-neighbors algorithm, an instance-based method that finds k examples
-from training set that are most similar to the instance that has to be
-classified. From the set it obtains in this way, it estimates class
-probabilities and uses the most frequent class for prediction.
-
-The following script takes naive Bayes, classification tree (what we
-have already learned), majority and k-nearest neighbors classifier
-(new ones) and prints prediction for first 10 instances of voting data
-set (:download:`handful.py <code/handful.py>`)::
-
-   import orange, orngTree
-   data = orange.ExampleTable("voting")
-   
-   # setting up the classifiers
-   majority = orange.MajorityLearner(data)
-   bayes = orange.BayesLearner(data)
-   tree = orngTree.TreeLearner(data, sameMajorityPruning=1, mForPruning=2)
-   knn = orange.kNNLearner(data, k=21)
-   
-   majority.name="Majority"; bayes.name="Naive Bayes";
-   tree.name="Tree"; knn.name="kNN"
-   
-   classifiers = [majority, bayes, tree, knn]
-   
-   # print the head
-   print "Possible classes:", data.domain.classVar.values
-   print "Probability for republican:"
-   print "Original Class",
-   for l in classifiers:
-       print "%-13s" % (l.name),
-   print
-   
-   # classify first 10 instances and print probabilities
-   for example in data[:10]:
-       print "(%-10s)  " % (example.getclass()),
-       for c in classifiers:
-           p = apply(c, [example, orange.GetProbabilities])
-           print "%5.3f        " % (p[0]),
-       print
-
-The code is somehow long, due to our effort to print the results
-nicely. The first part of the code sets-up our four classifiers, and
-gives them names. Classifiers are then put into the list denoted with
-variable ``classifiers`` (this is nice since, if we would need to add
-another classifier, we would just define it and put it in the list,
-and for the rest of the code we would not worry about it any
-more). The script then prints the header with the names of the
-classifiers, and finally uses the classifiers to compute the
-probabilities of classes. Note for a special function ``apply`` that
-we have not met yet: it simply calls a function that is given as its
-first argument, and passes it the arguments that are given in the
-list. In our case, ``apply`` invokes our classifiers with a data
-instance and request to compute probabilities. The output of our
-script is::
-
-   Possible classes: <republican, democrat>
-   Probability for republican:
-   Original Class Majority      Naive Bayes   Tree          kNN
-   (republican)   0.386         1.000         0.949         1.000
-   (republican)   0.386         1.000         0.973         1.000
-   (democrat  )   0.386         0.995         0.011         0.138
-   (democrat  )   0.386         0.002         0.015         0.468
-   (democrat  )   0.386         0.043         0.015         0.035
-   (democrat  )   0.386         0.228         0.015         0.442
-   (democrat  )   0.386         1.000         0.973         0.977
-   (republican)   0.386         1.000         0.973         1.000
-   (republican)   0.386         1.000         0.973         1.000
-   (democrat  )   0.386         0.000         0.015         0.000
-
-.. note::
-   The prediction of majority class classifier does not depend on the
-   instance it classifies (of course!).
-
-.. note:: 
-   At this stage, it would be inappropriate to say anything conclusive
-   on the predictive quality of the classifiers - for this, we will
-   need to resort to statistical methods on comparison of
-   classification models.

docs/tutorial/rst/code/assoc1.py

-# Description: Creates a list of association rules, selects five rules and prints them out
-# Category:    description
-# Uses:        imports-85
-# Classes:     orngAssoc.build, Preprocessor_discretize, EquiNDiscretization
-# Referenced:  assoc.htm
+import orngAssoc
+import Orange
 
-import orange, orngAssoc
+data = Orange.data.Table("imports-85")
+data = Orange.data.Table("zoo")
+#data = Orange.data.preprocess.Discretize(data, \
+#  method=Orange.data.discretization.EqualFreq(numberOfIntervals=3))
+# data = data.select(range(10))
 
-data = orange.ExampleTable("imports-85")
-data = orange.Preprocessor_discretize(data, \
-  method=orange.EquiNDiscretization(numberOfIntervals=3))
-data = data.select(range(10))
-
-rules = orange.AssociationRulesInducer(data, support=0.4)
+rules = Orange.associate.AssociationRulesInducer(data, support=0.4)
 
 print "%i rules with support higher than or equal to %5.3f found.\n" % (len(rules), 0.4)
 

docs/tutorial/rst/code/assoc2.py

 # Classes:     orngAssoc.build, Preprocessor_discretize, EquiNDiscretization
 # Referenced:  assoc.htm
 
-import orange, orngAssoc
+import orngAssoc
+import Orange
 
-data = orange.ExampleTable("imports-85")
-data = orange.Preprocessor_discretize(data, \
-  method=orange.EquiNDiscretization(numberOfIntervals=3))
+data = Orange.data.Table("imports-85")
+data = Orange.data.preprocess.Discretize(data, \
+  method=Orange.data.discretization.EqualFreq(numberOfIntervals=3))
 data = data.select(range(10))
 
-rules = orange.AssociationRulesInducer(data, support=0.4)
+rules = Orange.associate.AssociationRulesInducer(data, support=0.4)
 
 n = 5
 print "%i most confident rules:" % (n)

docs/tutorial/rst/code/bagging.py

 # Category:    modelling
 # Referenced:  c_bagging.htm
 
-import orange, random
+import random
+import Orange
 
 def Learner(examples=None, **kwds):
     learner = apply(Learner_Class, (), kwds)
     def __init__(self, **kwds):
         self.__dict__.update(kwds)
 
-    def __call__(self, example, resultType = orange.GetValue):
+    def __call__(self, example, resultType = Orange.classification.Classifier.GetValue):
         freq = [0.] * len(self.domain.classVar.values)
         for c in self.classifiers:
             freq[int(c(example))] += 1
         index = freq.index(max(freq))
-        value = orange.Value(self.domain.classVar, index)
+        value = Orange.data.Value(self.domain.classVar, index)
         for i in range(len(freq)):
             freq[i] = freq[i]/len(self.classifiers)
-        if resultType == orange.GetValue: return value
-        elif resultType == orange.GetProbabilities: return freq
+        if resultType == Orange.classification.Classifier.GetValue: return value
+        elif resultType == Orange.classification.Classifier.GetProbabilities: return freq
         else: return (value, freq)
         

docs/tutorial/rst/code/bagging_test.py

 # Referenced:  c_bagging.htm
 # Classes:     orngTest.crossValidation
 
-import orange, orngTree, orngStat, orngTest, orngStat, bagging
-data = orange.ExampleTable("adult_sample.tab")
+import bagging
+import Orange
+data = Orange.data.Table("adult_sample.tab")
 
-tree = orngTree.TreeLearner(mForPrunning=10, minExamples=30)
+tree = Orange.classification.tree.TreeLearner(mForPruning=10, minExamples=30)
 tree.name = "tree"
 baggedTree = bagging.Learner(learner=tree, t=5)
 
 learners = [tree, baggedTree]
 
-results = orngTest.crossValidation(learners, data, folds=5)
+results = Orange.evaluation.testing.cross_validation(learners, data, folds=5)
 for i in range(len(learners)):
-    print "%s: %5.3f" % (learners[i].name, orngStat.CA(results)[i])
+    print "%s: %5.3f" % (learners[i].name, Orange.evaluation.scoring.CA(results)[i])

docs/tutorial/rst/code/fss6.py

 # Uses:        adult_sample.tab
 # Referenced:  o_fss.htm
 
-import orange, orngFSS
-data = orange.ExampleTable("adult_sample.tab")
+import orngFSS
+import Orange
+data = Orange.data.Table("adult_sample.tab")
 
 def report_relevance(data):
-  m = orngFSS.attMeasure(data)
+  m = Orange.feature.scoring.score_all(data)
   for i in m:
     print "%5.3f %s" % (i[1], i[0])
 
 print "Before feature subset selection (%d attributes):" % len(data.domain.attributes)
 report_relevance(data)
-data = orange.ExampleTable("adult_sample.tab")
+data = Orange.data.Table("adult_sample.tab")
 
 marg = 0.01
-filter = orngFSS.FilterRelief(margin=marg)
+filter = Orange.feature.selection.FilterRelief(margin=marg)
 ndata = filter(data)
 print "\nAfter feature subset selection with margin %5.3f (%d attributes):" % (marg, len(ndata.domain.attributes))
 report_relevance(ndata)

docs/tutorial/rst/code/fss7.py

 # Uses:        crx.tab
 # Referenced:  o_fss.htm
 
-import orange, orngDisc, orngTest, orngStat, orngFSS
+import orngFSS
+import Orange
 
-data = orange.ExampleTable("crx.tab")
+data = Orange.data.Table("crx.tab")
 
-bayes = orange.BayesLearner()
-dBayes = orngDisc.DiscretizedLearner(bayes, name='disc bayes')
-fss = orngFSS.FilterAttsAboveThresh(threshold=0.05)
-fBayes = orngFSS.FilteredLearner(dBayes, filter=fss, name='bayes & fss')
+bayes = Orange.classification.bayes.NaiveLearner()
+dBayes = Orange.feature.discretization.DiscretizedLearner(bayes, name='disc bayes')
+fss = Orange.feature.selection.FilterAboveThreshold(threshold=0.05)
+fBayes = Orange.feature.selection.FilteredLearner(dBayes, filter=fss, name='bayes & fss')
 
 learners = [dBayes, fBayes]
-results = orngTest.crossValidation(learners, data, folds=10, storeClassifiers=1)
+results = Orange.evaluation.testing.cross_validation(learners, data, folds=10, storeClassifiers=1)
 
 # how many attributes did each classifier use?
 
 
 print "\nLearner         Accuracy  #Atts"
 for i in range(len(learners)):
-  print "%-15s %5.3f     %5.2f" % (learners[i].name, orngStat.CA(results)[i], natt[i])
+  print "%-15s %5.3f     %5.2f" % (learners[i].name, Orange.evaluation.scoring.CA(results)[i], natt[i])
 
 # which attributes were used in filtered case?
 

docs/tutorial/rst/ensembles.rst

 .. index:: ensembles
+
+Ensembles
+=========
+
+`Ensemble learning <http://en.wikipedia.org/wiki/Ensemble_learning>`_ combines the predictions of separate models to gain accuracy. The models may come from different training data samples, or may use different learners on the same data set. Learners may also be diversified by changing their parameter sets.
+
+In Orange, ensembles are simply wrappers around learners. They behave just like any other learner. Given the data, they return models that can predict the outcome for any data instance::
+
+   >>> import Orange
+   >>> data = Orange.data.Table("housing")
+   >>> tree = Orange.classification.tree.TreeLearner()
+   >>> btree = Orange.ensemble.bagging.BaggedLearner(tree)
+   >>> btree
+   BaggedLearner 'Bagging'
+   >>> btree(data)
+   BaggedClassifier 'Bagging'
+   >>> btree(data)(data[0])
+   <orange.Value 'MEDV'='24.6'>
+
+The last line builds a predictor (``btree(data)``) and then uses it on the first data instance.
+
+Most ensemble methods can wrap either classification or regression learners. Exceptions are task-specialized techniques such as boosting.
+
+Bagging and Boosting
+--------------------
+
 .. index:: 
    single: ensembles; bagging
+
+`Bootstrap aggregating <http://en.wikipedia.org/wiki/Bootstrap_aggregating>`_, or bagging, samples the training data uniformly and with replacement to train different predictors. Their independent predictions are then combined into a single prediction by majority vote (classification) or averaging (regression).
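The procedure can be sketched in a few lines of plain Python. This is a conceptual illustration, not Orange's ``BaggedLearner``; the learner protocol (a callable that maps data to a classifier) is an assumption of the sketch:

```python
import random
from collections import Counter

def bagged_predict(learner, data, instance, t=10, seed=1):
    """Train t models, each on a bootstrap sample (uniform, with replacement),
    then combine their class predictions by majority vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(t):
        sample = [rng.choice(data) for _ in range(len(data))]
        votes.append(learner(sample)(instance))
    return Counter(votes).most_common(1)[0][0]
```

For regression, the vote would be replaced by the mean of the predictions.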
+
 .. index:: 
    single: ensembles; boosting
 
-Ensemble learners
-=================
+In general, boosting combines weak learners into a single strong learner. Orange implements `AdaBoost <http://en.wikipedia.org/wiki/AdaBoost>`_, which assigns weights to data instances according to the performance of the learner. AdaBoost uses these weights to iteratively resample the instances, focusing on those that are harder to classify. In the aggregation, AdaBoost emphasizes individual classifiers that performed better on their training sets.
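One AdaBoost round can be sketched as follows (plain Python; a conceptual illustration of the weight update for binary classification, not Orange's implementation):

```python
import math

def adaboost_round(weights, mistakes):
    """Given current instance weights and a Boolean mistake mask for the weak
    learner, return its voting weight alpha and the updated distribution."""
    # weighted error of the weak learner on the current distribution
    err = sum(w for w, wrong in zip(weights, mistakes) if wrong)
    # alpha: the classifier's vote in the aggregation; larger when error is lower
    alpha = 0.5 * math.log((1 - err) / err)
    # up-weight misclassified instances, down-weight the rest, renormalize
    updated = [w * math.exp(alpha if wrong else -alpha)
               for w, wrong in zip(weights, mistakes)]
    total = sum(updated)
    return alpha, [w / total for w in updated]
```

After the update, misclassified and correctly classified instances each carry half of the total weight, which is what forces the next weak learner to focus on the hard cases.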
 
-Building ensemble classifiers in Orange is simple and easy. Starting
-from learners/classifiers that can predict probabilities and, if
-needed, use example weights, ensembles are actually wrappers that can
-aggregate predictions from a list of constructed classifiers. These
-wrappers behave exactly like other Orange learners/classifiers. We
-will here first show how to use a module for bagging and boosting that
-is included in Orange distribution (:py:mod:`Orange.ensemble` module), and
-then, for a somehow more advanced example build our own ensemble
-learner. Using this module, using it is very easy: you have to define
-a learner, give it to bagger or booster, which in turn returns a new
-(boosted or bagged) learner. Here goes an example (:download:`ensemble3.py <code/ensemble3.py>`)::
+The following script wraps a classification tree in boosted and bagged learners, and tests all three learners through cross-validation:
 
-   import orange, orngTest, orngStat, orngEnsemble
-   data = orange.ExampleTable("promoters")
-   
-   majority = orange.MajorityLearner()
-   majority.name = "default"
-   knn = orange.kNNLearner(k=11)
-   knn.name = "k-NN (k=11)"
-   
-   bagged_knn = orngEnsemble.BaggedLearner(knn, t=10)
-   bagged_knn.name = "bagged k-NN"
-   boosted_knn = orngEnsemble.BoostedLearner(knn, t=10)
-   boosted_knn.name = "boosted k-NN"
-   
-   learners = [majority, knn, bagged_knn, boosted_knn]
-   results = orngTest.crossValidation(learners, data, folds=10)
-   print "        Learner   CA     Brier Score"
-   for i in range(len(learners)):
-       print ("%15s:  %5.3f  %5.3f") % (learners[i].name,
-           orngStat.CA(results)[i], orngStat.BrierScore(results)[i])
+.. literalinclude:: code/ensemble-bagging.py
 
-Most of the code is used for defining and naming objects that learn,
-and the last piece of code is to report evaluation results. Notice
-that to bag or boost a learner, it takes only a single line of code
-(like, ``bagged_knn = orngEnsemble.BaggedLearner(knn, t=10)``)!
-Parameter ``t`` in bagging and boosting refers to number of
-classifiers that will be used for voting (or, if you like better,
-number of iterations by boosting/bagging). Depending on your random
-generator, you may get something like::
+The benefit of the two ensemble techniques, assessed in terms of area under the ROC curve, is obvious::
 
-           Learner   CA     Brier Score
-           default:  0.473  0.501
-       k-NN (k=11):  0.859  0.240
-       bagged k-NN:  0.813  0.257
-      boosted k-NN:  0.830  0.244
+    tree: 0.83
+   boost: 0.90
+    bagg: 0.91
 
+Stacking
+--------
 
+.. index:: 
+   single: ensembles; stacking
+
+Suppose we partition a training set into a held-in and a held-out set. Our task is to predict y: the probability of the target class in classification, or a real value in regression. Given a set of learners, we train them on the held-in set and obtain a vector of predictions on the held-out set, where each element of the vector corresponds to the prediction of one model. We can now learn how to combine these predictions into a target prediction by training a new predictor on the data set of base predictions and true values of y from the held-out set. This technique is called `stacked generalization <http://en.wikipedia.org/wiki/Ensemble_learning#Stacking>`_, or stacking for short. Instead of a single split into held-in and held-out data, the vectors of predictions are usually obtained through cross-validation.
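The construction of the meta-level data set can be sketched in plain Python (conceptual only; the learner protocol and the ``(features, label)`` data representation are assumptions of the sketch):

```python
def meta_data(base_learners, held_in, held_out):
    """Train every base learner on the held-in part, then describe each
    held-out instance by the base models' predictions; the meta learner
    is trained on these rows."""
    models = [learn(held_in) for learn in base_learners]
    return [([model(features) for model in models], label)
            for features, label in held_out]
```

The meta learner then sees only the base predictions, learning which models to trust in which situations.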
+
+Orange provides a wrapper for stacking that is given a set of base learners and a meta learner:
+
+.. literalinclude:: code/ensemble-stacking.py
+   :lines: 3-
+
+By default, the meta classifier is a naive Bayesian classifier. Changing it to logistic regression can be a good idea as well::
+
+    stack = Orange.ensemble.stacking.StackedClassificationLearner(base_learners, \
+               meta_learner=Orange.classification.logreg.LogRegLearner)
+
+Stacking is often better than each of the base learners alone, as also demonstrated by running our script::
+
+   stacking: 0.967
+      bayes: 0.933
+       tree: 0.836
+        knn: 0.947
+
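The procedure can also be illustrated with a small, self-contained sketch in plain Python (hypothetical helper names, not Orange's ``StackedClassificationLearner``): leave-one-out predictions of two toy base regressors form the level-1 data, and a one-parameter least-squares blend stands in for the meta learner:

```python
def mean_learner(xs, ys):
    """Toy base learner: always predicts the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m

def nn_learner(xs, ys):
    """Toy base learner: 1-nearest neighbour on a single numeric feature."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def loo_predictions(xs, ys, learner):
    """Held-out predictions: each item is predicted by a model trained without it."""
    preds = []
    for i in range(len(xs)):
        model = learner(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(model(xs[i]))
    return preds

def stack(xs, ys, learners):
    # level-1 data: held-out predictions of each base learner
    p1, p2 = (loo_predictions(xs, ys, L) for L in learners)
    # meta learner: one-parameter least-squares blend w*p1 + (1-w)*p2
    num = sum((y - b) * (a - b) for a, b, y in zip(p1, p2, ys))
    den = sum((a - b) ** 2 for a, b in zip(p1, p2)) or 1.0
    w = num / den
    models = [L(xs, ys) for L in learners]  # refit base learners on all data
    return lambda x: w * models[0](x) + (1 - w) * models[1](x)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.9]
blend = stack(xs, ys, [nn_learner, mean_learner])
print(round(blend(2.5), 2))
```

A real meta learner, like the naive Bayesian or logistic regression models above, learns a richer combination than this single blending weight, but the held-in/held-out bookkeeping is the same.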
+Random Forests
+--------------
+
+.. index::
+   single: ensembles; random forests
+
+`Random forest <http://en.wikipedia.org/wiki/Random_forest>`_ is an ensemble of tree predictors. The diversity of trees is achieved through randomized feature selection in the node split criterion, where instead of the single best feature one is picked arbitrarily from a set of the best candidates. Another source of randomization is the bootstrap sample of the data from which the trees are developed. Predictions from usually several hundred trees are aggregated by voting. Constructing so many trees may be computationally demanding, so Orange uses a special tree inducer (``Orange.classification.tree.SimpleTreeLearner``, used by default) optimized for speed in random forest construction:
+
+.. literalinclude:: code/ensemble-forest.py
+   :lines: 3-
+
+Random forests are often superior to other base classification or regression learners::
+
+   forest: 0.976
+    bayes: 0.935
+      knn: 0.952
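The two sources of randomness mentioned above can be mimicked in a toy sketch (plain Python, hypothetical names, not Orange's ``SimpleTreeLearner``): each "tree" is a depth-one stump grown on a bootstrap sample that splits on a randomly chosen feature, and the forest aggregates the stumps by voting:

```python
import random

def stump_fit(data, n_features, rng):
    """One randomized 'tree' (a depth-one stump): split on a random feature."""
    f = rng.randrange(n_features)                  # random feature choice
    thr = sum(x[f] for x, _ in data) / len(data)   # split at the feature mean
    vote = lambda ys: max(set(ys), key=ys.count) if ys else data[0][1]
    left = vote([y for x, y in data if x[f] <= thr])
    right = vote([y for x, y in data if x[f] > thr])
    return lambda x: left if x[f] <= thr else right

def forest_predict(data, item, n_trees=50, seed=1):
    """Grow n_trees randomized stumps on bootstrap samples; aggregate by voting."""
    rng = random.Random(seed)
    n_features = len(data[0][0])
    votes = []
    for _ in range(n_trees):
        boot = [rng.choice(data) for _ in data]    # bootstrap sample
        votes.append(stump_fit(boot, n_features, rng)(item))
    return max(set(votes), key=votes.count)

data = [((0.0, 5.0), "a"), ((0.2, 4.0), "a"), ((1.0, 0.5), "b"), ((0.9, 1.0), "b")]
print(forest_predict(data, (0.1, 4.5)))
```

Real forests grow full trees and restrict the feature choice at every internal node, but the bootstrap-plus-random-features recipe is the same.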

docs/tutorial/rst/index.rst

 Orange Tutorial
 ###############
 
-If you are new to Orange, then this is probably the best place to start. This
-tutorial was written with a purpose to provide a gentle tutorial over basic
-functionality of Orange. As Orange is integrated within `Python <http://www.python.org/>`_, the tutorial
-is in essence a guide through some basic Orange scripting in this language.
-Although relying on Python, those of you who have some knowledge on programming
-won't need to learn Python first: the tutorial should be simple enough to skip
-learning Python itself.
+This is a gentle introduction to scripting in Orange. Orange is a `Python <http://www.python.org/>`_ library, and the tutorial is a guide through Orange scripting in this language.
 
-Contents:
+Here we assume that you have already `downloaded and installed Orange <http://orange.biolab.si/download/>`_ and have a working version of Python. Python scripts can run in a terminal window, in integrated environments like `PyCharm <http://www.jetbrains.com/pycharm/>`_ and `PythonWin <http://wiki.python.org/moin/PythonWin>`_,
+or in shells like `iPython <http://ipython.scipy.org/moin/>`_. Whichever environment you are using, now try to import Orange. Below, we use a Python shell::
+
+   % python
+   >>> import Orange
+   >>> Orange.version.version
+   '2.6a2.dev-a55510d'
+   >>>
+
+If this produces no errors or warnings, Orange and Python are properly
+installed and you are ready to continue with this tutorial.
+
+********
+Contents
+********
 
 .. toctree::
    :maxdepth: 1
 
-   start.rst
-   load-data.rst
-   basic-exploration.rst
+   data.rst
    classification.rst
-   evaluation.rst
-   learners-in-python.rst
    regression.rst
-   association-rules.rst
-   feature-subset-selection.rst
    ensembles.rst
-   discretization.rst
+   python-learners.rst
 
 ****************
-Index and search
+Index and Search
 ****************
 
 * :ref:`genindex`

docs/tutorial/rst/regression.rst

-.. index:: regression
-
 Regression
 ==========
 
-At the time of writing of this part of tutorial, there were
-essentially two different learning methods for regression modelling:
-regression trees and instance-based learner (k-nearest neighbors). In
-this lesson, we will see that using regression is just like using
-classifiers, and evaluation techniques are not much different either.
+.. index:: regression
+
+From the interface point of view, regression methods in Orange are very similar to classification ones. Both are intended for supervised data mining and require class-labeled data. Just like in classification, regression is implemented with learners and regression models (regressors). Regression learners are objects that accept data and return regressors. Regression models are given data items to predict the value of a continuous class:
+
+.. literalinclude:: code/regression.py
+
+
+A Handful of Regressors
+-----------------------
 
 .. index::
-   single: regression; regression trees
+   single: regression; tree
 
-Few simple regressors
----------------------
+Let us start with regression trees. Below is an example script that builds the tree from data on housing prices and prints out the tree in textual form:
 
-Let us start with regression trees. Below is an example script that builds
-the tree from :download:`housing.tab <code/housing.tab>` data set and prints
-out the tree in textual form (:download:`regression1.py <code/regression1.py>`)::
+.. literalinclude:: code/regression-tree.py
+   :lines: 3-
 
-   import orange, orngTree
+The script outputs the tree::
    
-   data = orange.ExampleTable("housing.tab")
-   rt = orngTree.TreeLearner(data, measure="retis", mForPruning=2, minExamples=20)
-   orngTree.printTxt(rt, leafStr="%V %I")
-   
-Notice special setting for attribute evaluation measure! Following is
-the output of this script::
-   
-   RM<6.941: 19.9 [19.333-20.534]
-   RM>=6.941
-   |    RM<7.437
-   |    |    CRIM>=7.393: 14.4 [10.172-18.628]
-   |    |    CRIM<7.393
-   |    |    |    DIS<1.886: 45.7 [37.124-54.176]
-   |    |    |    DIS>=1.886: 32.7 [31.656-33.841]
-   |    RM>=7.437
-   |    |    TAX<534.500: 45.9 [44.295-47.498]
-   |    |    TAX>=534.500: 21.9 [21.900-21.900]
+   RM<=6.941: 19.9
+   RM>6.941
+   |    RM<=7.437
+   |    |    CRIM>7.393: 14.4
+   |    |    CRIM<=7.393
+   |    |    |    DIS<=1.886: 45.7
+   |    |    |    DIS>1.886: 32.7
+   |    RM>7.437
+   |    |    TAX<=534.500: 45.9
+   |    |    TAX>534.500: 21.9
+
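To read the printout: each internal line tests a feature against a threshold, and each leaf carries the predicted house price. The tree above, transcribed by hand into a plain Python function (a hypothetical helper, for illustration only; RM, CRIM, DIS and TAX are features of the housing data):

```python
def predict_price(rm, crim, dis, tax):
    """The printed regression tree, transcribed into nested conditions."""
    if rm <= 6.941:
        return 19.9
    if rm <= 7.437:
        if crim > 7.393:
            return 14.4
        return 45.7 if dis <= 1.886 else 32.7
    return 45.9 if tax <= 534.5 else 21.9

print(predict_price(rm=7.0, crim=2.0, dis=3.0, tax=300))  # prints 32.7
```

Following the branches for rm=7.0 (above 6.941 but below 7.437), crim below 7.393 and dis above 1.886 leads to the leaf predicting 32.7.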
+Following is the initialization of a few other regressors and their predictions on the first few data instances of the housing price data set:
 
 .. index::
-   single: regression; k nearest neighbours
+   single: regression; mars
+   single: regression; linear
 
-Predicting continues classes is just like predicting crisp ones. In
-this respect, the following script will be nothing new. It uses both
-regression trees and k-nearest neighbors, and also uses a majority
-learner which for regression simply returns an average value from
-learning data set (:download:`regression2.py <code/regression2.py>`)::
+.. literalinclude:: code/regression-other.py
+   :lines: 3-
 
-   import orange, orngTree, orngTest, orngStat
-   
-   data = orange.ExampleTable("housing.tab")
-   selection = orange.MakeRandomIndices2(data, 0.5)
-   train_data = data.select(selection, 0)
-   test_data = data.select(selection, 1)
-   
-   maj = orange.MajorityLearner(train_data)
-   maj.name = "default"
-   
-   rt = orngTree.TreeLearner(train_data, measure="retis", mForPruning=2, minExamples=20)
-   rt.name = "reg. tree"
-   
-   k = 5
-   knn = orange.kNNLearner(train_data, k=k)
-   knn.name = "k-NN (k=%i)" % k
-   
-   regressors = [maj, rt, knn]
-   
-   print "\n%10s " % "original",
-   for r in regressors:
-     print "%10s " % r.name,
-   print
-   
-   for i in range(10):
-     print "%10.1f " % test_data[i].getclass(),
-     for r in regressors:
-       print "%10.1f " % r(test_data[i]),
-     print
+Looks like the housing prices are not that hard to predict::
 
-The otput of this script is::
+   y    lin  mars tree
+   21.4 24.8 23.0 20.1
+   15.7 14.4 19.0 17.3
+   36.5 35.7 35.6 33.8
 
-     original     default   reg. tree  k-NN (k=5)
-         24.0        50.0        25.0        24.6
-         21.6        50.0        25.0        22.0
-         34.7        50.0        35.4        26.6
-         28.7        50.0        25.0        36.2
-         27.1        50.0        21.7        18.9
-         15.0        50.0        21.7        18.9
-         18.9        50.0        21.7        18.9
-         18.2        50.0        21.7        21.0
-         17.5        50.0        21.7        16.6
-         20.2        50.0        21.7        23.1
+Cross Validation
+----------------
 
-.. index: mean squared error
+Just like for classification, the same evaluation module (``Orange.evaluation``) is available for regression. Its testing submodule includes procedures such as cross-validation, leave-one-out testing and similar, while functions in the scoring submodule can assess the accuracy from the testing:
 
-Evaluation and scoring
-----------------------
+.. literalinclude:: code/regression-other.py
+   :lines: 3-
 
-For our third and last example for regression, let us see how we can
-use cross-validation testing and for a score function use
-(:download:`regression3.py <code/regression3.py>`, uses `housing.tab <code/housing.tab>`)::
+.. index::
+   single: regression; root mean squared error
 
-   import orange, orngTree, orngTest, orngStat
-   
-   data = orange.ExampleTable("housing.tab")
-   
-   maj = orange.MajorityLearner()
-   maj.name = "default"
-   rt = orngTree.TreeLearner(measure="retis", mForPruning=2, minExamples=20)
-   rt.name = "regression tree"
-   k = 5
-   knn = orange.kNNLearner(k=k)
-   knn.name = "k-NN (k=%i)" % k
-   learners = [maj, rt, knn]
-   
-   data = orange.ExampleTable("housing.tab")
-   results = orngTest.crossValidation(learners, data, folds=10)
-   mse = orngStat.MSE(results)
-   
-   print "Learner        MSE"
-   for i in range(len(learners)):
-     print "%-15s %5.3f" % (learners[i].name, mse[i])
+`MARS <http://en.wikipedia.org/wiki/Multivariate_adaptive_regression_splines>`_ has the lowest root mean squared error::
 
-Again, compared to classification tasks, this is nothing new. The only
-news in the above script is a mean squared error evaluation function
-(``orngStat.MSE``). The scripts prints out the following report::
+   Learner  RMSE
+   lin      4.83
+   mars     3.84
+   tree     5.10
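
Root mean squared error itself is simple to compute by hand. A minimal sketch, using as example the three true prices and linear-model predictions shown in the table of regressor outputs earlier:

```python
from math import sqrt

def rmse(truths, predictions):
    """Root mean squared error: square root of the average squared residual."""
    squared = [(t - p) ** 2 for t, p in zip(truths, predictions)]
    return sqrt(sum(squared) / len(squared))

y = [21.4, 15.7, 36.5]       # true prices
pred = [24.8, 14.4, 35.6]    # linear model predictions
print(round(rmse(y, pred), 2))  # prints 2.16
```

Being in the same units as the class variable, RMSE is easier to interpret than plain mean squared error.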
 
-   Learner        MSE
-   default         84.777
-   regression tree 40.096
-   k-NN (k=5)      17.532
-
-Other scoring techniques are available to evaluate the success of
-regression. Script below uses a range of them, plus features a nice
-implementation where a list of scoring techniques is defined
-independetly from the code that reports on the results (part of
-:download:`regression4.py <code/regression4.py>`)::
-
-   lr = orngRegression.LinearRegressionLearner(name="lr")
-   rt = orngTree.TreeLearner(measure="retis", mForPruning=2,
-                             minExamples=20, name="rt")
-   maj = orange.MajorityLearner(name="maj")
-   knn = orange.kNNLearner(k=10, name="knn")
-   learners = [maj, lr, rt, knn]
-   
-   # evaluation and reporting of scores
-   results = orngTest.learnAndTestOnTestData(learners, train, test)
-   scores = [("MSE", orngStat.MSE),
-             ("RMSE", orngStat.RMSE),
-             ("MAE", orngStat.MAE),
-             ("RSE", orngStat.RSE),
-             ("RRSE", orngStat.RRSE),
-             ("RAE", orngStat.RAE),
-             ("R2", orngStat.R2)]
-   
-   print "Learner  " + "".join(["%-7s" % s[0] for s in scores])
-   for i in range(len(learners)):
-       print "%-8s " % learners[i].name + "".join(["%6.3f " % s[1](results)[i] for s in scores])
-
-Here, we used a number of different scores, including:
-
-* MSE - mean squared errror,
-* RMSE - root mean squared error,
-* MAE - mean absolute error,
-* RSE - relative squared error,
-* RRSE - root relative squared error,
-* RAE - relative absolute error, and
-* R2 - coefficient of determinatin, also referred to as R-squared.
-
-For precise definition of these measures, see :py:mod:`Orange.statistics`. Running
-the script above yields::
-
-   Learner  MSE    RMSE   MAE    RSE    RRSE   RAE    R2
-   maj      84.777  9.207  6.659  1.004  1.002  1.002 -0.004
-   lr       23.729  4.871  3.413  0.281  0.530  0.513  0.719
-   rt       40.096  6.332  4.569  0.475  0.689  0.687  0.525
-   knn      17.244  4.153  2.670  0.204  0.452  0.402  0.796
-