Commits

Miha Stajdohar committed f7f0c7b Merge

Merged with tutorial updates.

  • Parent commits 3da1cf3, a68fd2f

Comments (0)

Files changed (113)

Orange/classification/__init__.py

 RandomLearner = core.RandomLearner
 ClassifierFromVar = core.ClassifierFromVar
 ConstantClassifier = core.DefaultClassifier
+
+class PyLearner(object):
+    def __new__(cls, data=None, **kwds):
+        learner = object.__new__(cls)
+        if data:
+            learner.__init__(**kwds) # force init
+            return learner(data)
+        else:
+            return learner  # invokes the __init__
+
+    def __init__(self, name='learner'):
+        self.name = name
+
+    def __call__(self, data, weight=None):
+        return None
+
+class PyClassifier:
+    def __init__(self, **kwds):
+        self.__dict__.update(kwds)
+
+    def __call__(self, example, resultType = Classifier.GetValue):
+        return self.classifier(example, resultType)
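The ``PyLearner`` pattern above encodes Orange's learner/classifier calling convention: calling a learner class with data trains and applies it in one step, while calling it without data yields a reusable learner. Here is a minimal, self-contained sketch of that convention with a hypothetical majority-class learner (plain Python, not part of Orange):

```python
# A minimal sketch of the learner/classifier protocol that PyLearner above
# mimics: calling the learner class with data returns a trained classifier
# directly; calling it without data returns a learner to apply later.
# MajorityLearner is a hypothetical stand-in for a real Orange learner.
from collections import Counter

class MajorityLearner(object):
    def __new__(cls, data=None, **kwds):
        learner = object.__new__(cls)
        if data is not None:
            learner.__init__(**kwds)      # force init, as in PyLearner
            return learner(data)          # shortcut: return a classifier
        return learner                    # plain construction; __init__ runs next

    def __init__(self, name='majority'):
        self.name = name

    def __call__(self, data):
        # "Train" by finding the most common label in (features, label) pairs.
        majority = Counter(label for _, label in data).most_common(1)[0][0]
        return MajorityClassifier(majority=majority)

class MajorityClassifier(object):
    def __init__(self, **kwds):
        self.__dict__.update(kwds)

    def __call__(self, example):
        return self.majority

data = [([1, 2], 'yes'), ([3, 4], 'yes'), ([5, 6], 'no')]
classifier = MajorityLearner(data)        # instant classifier
print(classifier([7, 8]))                 # prints: yes
```

Because ``__new__`` returns an object of a different class when data is given, Python skips the implicit ``__init__`` call, which is why the pattern invokes ``__init__`` explicitly before training.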

docs/tutorial/rst/association-rules.rst

-.. index:: association rules
-
-Association rules
-=================
-
-Association rules are fun to do in Orange. One reason for this is
-Python, and particular implementation that allows a list of
-association rules to behave just like any list in Python. That is, you
-can select parts of the list, you can remove rules, even add them
-(yes, ``append()`` works on Orange association rules!).
-
-For association rules, Orange straightforwardly implements the APRIORI
-algorithm (see Agrawal et al., Fast discovery of association rules, a
-chapter in Advances in knowledge discovery and data mining, 1996);
-Orange includes an optimized version of the algorithm that works on
-tabular data.  For a number of reasons (mostly for convenience),
-association rules should be constructed and managed through the
-interface provided by :py:mod:`Orange.associate`.  As implemented in Orange,
-the association rule construction procedure does not handle continuous
-attributes, so make sure that your data is categorized. Also, class
-variables are treated just like attributes.  For examples in this
-tutorial, we will use data from the data set :download:`imports-85.tab <code/imports-85.tab>`, which
-surveys different types of cars and lists their characteristics. We
-will use only the first ten attributes from this data set and categorize
-them so that three equally populated intervals will be created for each
-continuous variable.  This will be done through the following part of
-the code::
-
-   data = orange.ExampleTable("imports-85")
-   data = orange.Preprocessor_discretize(data, \
-     method=orange.EquiNDiscretization(numberOfIntervals=3))
-   data = data.select(range(10))
-
-Now, to our examples. First one uses the data set constructed with
-above script and shows how to build a list of association rules which
-will have support of at least 0.4. Next, we select a subset of first
-five rules, print them out, delete first three rules and repeat the
-printout. The script that does this is (part of :download:`assoc1.py <code/assoc1.py>`, uses
-:download:`imports-85.tab <code/imports-85.tab>`)::
-
-   rules = orange.AssociationRulesInducer(data, support=0.4)
-   
-   print "%i rules with support higher than or equal to %5.3f found." % (len(rules), 0.4)
-   
-   orngAssoc.sort(rules, ["support", "confidence"])
-   
-   orngAssoc.printRules(rules[:5], ["support", "confidence"])
-   print
-   
-   del rules[:3]
-   orngAssoc.printRules(rules[:5], ["support", "confidence"])
-   print
-
-The output of this script is::
-
-   87 rules with support higher than or equal to 0.400 found.
-   
-   supp    conf    rule
-   0.888   0.984   engine-location=front -> fuel-type=gas
-   0.888   0.901   fuel-type=gas -> engine-location=front
-   0.805   0.982   engine-location=front -> aspiration=std
-   0.805   0.817   aspiration=std -> engine-location=front
-   0.785   0.958   fuel-type=gas -> aspiration=std
-   
-   supp    conf    rule
-   0.805   0.982   engine-location=front -> aspiration=std
-   0.805   0.817   aspiration=std -> engine-location=front
-   0.785   0.958   fuel-type=gas -> aspiration=std
-   0.771   0.981   fuel-type=gas aspiration=std -> engine-location=front
-   0.771   0.958   aspiration=std engine-location=front -> fuel-type=gas
-   
-Notice that when printing out the rules, the user can specify which
-rule evaluation measures are to be printed. Choose anything from
-``['support', 'confidence', 'lift', 'leverage', 'strength',
-'coverage']``.
-
-The second example uses the same data set, but first prints out the five
-most confident rules. Then, it shows a rather advanced type of
-filtering: every rule has parameters that record its support,
-confidence, etc... These may be used when constructing your own filter
-functions. The one in our example uses ``confidence`` and ``lift``.
-
-.. note:: 
-   If you have just started with Python: lambda is a compact way to
-   specify a simple function without using a def statement. As a
-   function, it has its own namespace, so the minimal confidence and lift
-   requested in our example should be passed as function
-   arguments.
-
-Here goes the code (part of :download:`assoc2.py <code/assoc2.py>`)::
-
-   rules = orange.AssociationRulesInducer(data, support = 0.4)
-   
-   n = 5
-   print "%i most confident rules:" % (n)
-   orngAssoc.sort(rules, ["confidence"])
-   orngAssoc.printRules(rules[0:n], ['confidence','support','lift'])
-   
-   conf = 0.8; lift = 1.1
-   print "\nRules with confidence>%5.3f and lift>%5.3f" % (conf, lift)
-   rulesC=rules.filter(lambda x: x.confidence>conf and x.lift>lift)
-   orngAssoc.sort(rulesC, ['confidence'])
-   orngAssoc.printRules(rulesC, ['confidence','support','lift'])
-   
-Just one rule with the requested confidence and lift is found in our rule set::
-
-   5 most confident rules:
-   conf    supp    lift    rule
-   1.000   0.478   1.015   fuel-type=gas aspiration=std drive-wheels=fwd -> engine-location=front
-   1.000   0.429   1.015   fuel-type=gas aspiration=std num-of-doors=four -> engine-location=front
-   1.000   0.507   1.015   aspiration=std drive-wheels=fwd -> engine-location=front
-   1.000   0.449   1.015   aspiration=std num-of-doors=four -> engine-location=front
-   1.000   0.541   1.015   fuel-type=gas drive-wheels=fwd -> engine-location=front
-   
-   Rules with confidence>0.800 and lift>1.100
-   conf    supp    lift    rule
-   0.898   0.429   1.116   fuel-type=gas num-of-doors=four -> aspiration=std engine-location=front
-   
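The rule measures used above (support, confidence, lift) are simple functions of itemset frequencies. Here is a plain-Python sketch, not Orange's implementation, with a toy transaction list echoing the car data:

```python
# A minimal sketch (not Orange's implementation) of the three rule measures
# used above, computed from raw transactions. A rule is a pair of itemsets
# (antecedent -> consequent), each represented as a set of value strings.
def support(itemset, transactions):
    # Fraction of transactions containing every item of the itemset.
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / float(len(transactions))

def confidence(antecedent, consequent, transactions):
    # P(consequent | antecedent)
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    # How much more often the two sides co-occur than expected if independent.
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

transactions = [
    {'fuel-type=gas', 'engine-location=front'},
    {'fuel-type=gas', 'engine-location=front'},
    {'fuel-type=gas', 'engine-location=rear'},
    {'fuel-type=diesel', 'engine-location=front'},
]
a, c = {'fuel-type=gas'}, {'engine-location=front'}
print(support(a | c, transactions))    # prints: 0.5
print(confidence(a, c, transactions))  # 2/3 -> 0.666...
print(lift(a, c, transactions))        # (2/3)/0.75 -> 0.888...
```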

docs/tutorial/rst/basic-exploration.rst

-Basic data exploration
-======================
-
-.. index:: basic data exploration
-
-Until now we have looked only at data files that include solely
-nominal (discrete) attributes. Let's now make things more interesting,
-and look at another file with a mixture of attribute types. We will
-first use the adult data set from the UCI ML Repository. The prediction task
-related to this data set is to determine whether a person
-characterized by 14 attributes like education, race, occupation, etc.,
-makes over $50K/year. Because the original set :download:`adult.tab <code/adult.tab>` is
-rather big (32561 data instances, about 4 MBytes), we will first
-create a smaller sample of about 3% of instances and use it in our
-examples. If you are curious how we do this, here is the code
-(:download:`sample_adult.py <code/sample_adult.py>`)::
-
-   import orange
-   data = orange.ExampleTable("adult")
-   selection = orange.MakeRandomIndices2(data, 0.03)
-   sample = data.select(selection, 0)
-   sample.save("adult_sample.tab")
-
-The script above loads the data and prepares a selection vector of
-length equal to the number of data instances, which includes 0's and
-1's, with about 3% of 0's. Then, the instances with a corresponding 0
-in the selection vector are selected and stored
-in an object called *sample*. The sampled data is then saved in a
-file.  Note that ``MakeRandomIndices2`` performs a stratified selection,
-i.e., the class distribution of original and sampled data should be
-nearly the same.
-
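Stratified selection, as attributed to ``MakeRandomIndices2`` above, can be sketched in plain Python: sample each class separately, so the sample keeps the original class distribution (``stratified_sample`` is a hypothetical helper, not Orange's implementation):

```python
# A plain-Python sketch of the idea behind stratified sampling: sample
# within each class separately, so the class distribution of the sample
# matches the original. stratified_sample is a hypothetical helper.
import random
from collections import defaultdict

def stratified_sample(instances, get_class, fraction, seed=42):
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for inst in instances:
        by_class[get_class(inst)].append(inst)
    sample = []
    for members in by_class.values():
        # Keep the same fraction of every class (at least one instance).
        k = max(1, int(round(fraction * len(members))))
        sample.extend(rng.sample(members, k))
    return sample

data = [('a', '>50K')] * 25 + [('b', '<=50K')] * 75
sample = stratified_sample(data, get_class=lambda inst: inst[1], fraction=0.2)
# 5 of 25 '>50K' and 15 of 75 '<=50K' instances are kept,
# preserving the original 1:3 class ratio.
```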
-Basic characteristics of data sets
-----------------------------------
-
-.. index::
-   single: basic data exploration; attributes
-.. index::
-   single: basic data exploration; classes
-.. index::
-   single: basic data exploration; missing values
-
-For classification data sets, basic data characteristics are most
-often number of classes, number of attributes (and of these, how many
-are nominal and continuous), information if data contains missing
-values, and class distribution. Below is the script that does all
-this (:download:`data_characteristics.py <code/data_characteristics.py>`, :download:`adult_sample.tab <code/adult_sample.tab>`)::
-
-   import orange
-   data = orange.ExampleTable("adult_sample")
-   
-   # report on number of classes and attributes
-   print "Classes:", len(data.domain.classVar.values)
-   print "Attributes:", len(data.domain.attributes), ",",
-   
-   # count number of continuous and discrete attributes
-   ncont=0; ndisc=0
-   for a in data.domain.attributes:
-       if a.varType == orange.VarTypes.Discrete:
-           ndisc = ndisc + 1
-       else:
-           ncont = ncont + 1
-   print ncont, "continuous,", ndisc, "discrete"
-   
-   # obtain class distribution
-   c = [0] * len(data.domain.classVar.values)
-   for e in data:
-       c[int(e.getclass())] += 1
-   print "Instances: ", len(data), "total",
-   for i in range(len(data.domain.classVar.values)):
-       print ",", c[i], "with class", data.domain.classVar.values[i],
-   print
-
-The first part is the one that we know already: the script imports the
-Orange library into Python, and loads the data. The information on the
-domain (class and attribute names, types, values, etc.) is stored in
-``data.domain``. Information on the class variable is accessible through the
-``data.domain.classVar`` object, which stores
-a vector of class values. Its length is obtained using the function
-``len()``. Similarly, the list of attributes is stored in
-``data.domain.attributes``. Notice that to obtain the information on the i-th
-attribute, this list can be indexed, e.g., ``data.domain.attributes[i]``.
-
-To count the number of continuous and discrete attributes, we have
-first initialized two counters (``ncont``, ``ndisc``), and then iterated
-through the attributes (variable ``a`` is an iteration variable that is
-associated with a single attribute in each loop).  The field ``varType``
-contains the type of the attribute; for discrete attributes, ``varType``
-is equal to ``orange.VarTypes.Discrete``, and for continuous ``varType`` is
-equal to ``orange.VarTypes.Continuous``.
-
-To obtain the number of instances for each class, we first
-initialized a vector ``c`` of length equal to the number of
-different classes. Then, we iterated through the data;
-``e.getclass()`` returns the class of an instance ``e``, and ``int()``
-turns it into a class index (a number in the range from 0 to n-1,
-where n is the number of classes) that is used as the index of the
-element of ``c`` that should be incremented.
-
-Throughout the code, notice that a print statement in Python prints
-whatever items it has in the line that follows. The items are
-separated with commas, and Python will by default put a blank between
-them when printing. It will also print a new line, unless the print
-statement ends with a comma. It is possible to use print statement in
-Python with formatting directives, just like in C or C++, but this is
-beyond this text.
-
-Running the above script, we obtain the following output::
-
-   Classes: 2
-   Attributes: 14 , 6 continuous, 8 discrete
-   Instances:  977 total , 236 with class >50K , 741 with class <=50K
-
-If you would like class distributions printed as proportions of
-each class in the data sets, then the last part of the script needs
-to be slightly changed. This time, we have used string formatting
-with print as well (part of :download:`data_characteristics2.py <code/data_characteristics2.py>`)::
-
-   # obtain class distribution
-   c = [0] * len(data.domain.classVar.values)
-   for e in data:
-       c[int(e.getclass())] += 1
-   print "Instances: ", len(data), "total",
-   r = [0.] * len(c)
-   for i in range(len(c)):
-       r[i] = c[i]*100./len(data)
-   for i in range(len(data.domain.classVar.values)):
-       print ", %d(%4.1f%s) with class %s" % (c[i], r[i], '%', data.domain.classVar.values[i]),
-   print
-
-The new script outputs the following information::
-
-   Classes: 2
-   Attributes: 14 , 6 continuous, 8 discrete
-   Instances:  977 total , 236(24.2%) with class >50K , 741(75.8%) with class <=50K
-
-As it turns out, there are more people that earn less than those
-that earn more... On a more technical side, such information may
-be important when you build your classifier; the base error for this
-data set is 1-.758 = .242, and your constructed models should be
-better than this.
-
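The base-error arithmetic above generalizes to a tiny helper: the error of always predicting the majority class is one minus the largest class proportion (a plain-Python sketch):

```python
# A small sketch of the "base error" computation mentioned above: the error
# of always predicting the majority class, which any useful model should beat.
from collections import Counter

def base_error(labels):
    counts = Counter(labels)
    majority = counts.most_common(1)[0][1]
    return 1.0 - majority / float(len(labels))

# Class counts from the sampled adult data above: 236 '>50K', 741 '<=50K'.
labels = ['>50K'] * 236 + ['<=50K'] * 741
print(round(base_error(labels), 3))   # prints: 0.242, matching 1-.758 above
```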
-Contingency matrix
-------------------
-
-.. index::
-   single: basic data exploration; class distribution
-
-Another interesting piece of information that we can obtain from the
-data is the distribution of classes for each value of the discrete
-attribute, and means for continuous attributes (we will leave the
-computation of standard deviation and other statistics to you). Let's
-compute means of continuous attributes first (part of :download:`data_characteristics3.py <code/data_characteristics3.py>`)::
-
-   print "Continuous attributes:"
-   for a in range(len(data.domain.attributes)):
-       if data.domain.attributes[a].varType == orange.VarTypes.Continuous:
-           d = 0.; n = 0
-           for e in data:
-               if not e[a].isSpecial():
-                   d += e[a]
-                   n += 1
-           print "  %s, mean=%3.2f" % (data.domain.attributes[a].name, d/n)
-
-This script iterates through attributes (outer for loop), and for
-attributes that are continuous (first if statement) computes a sum
-over all instances. A single new trick that the script uses is that it
-checks if the instance has a defined attribute value.  Namely, for
-instance ``e`` and attribute ``a``, ``e[a].isSpecial()`` is true if
-the value is not defined (unknown). Variable ``n`` stores the number of
-instances with a defined value of the attribute. For our sampled adult data
-set, this part of the code outputs::
-
-   Continuous attributes:
-     age, mean=37.74
-     fnlwgt, mean=189344.06
-     education-num, mean=9.97
-     capital-gain, mean=1219.90
-     capital-loss, mean=99.49
-     hours-per-week, mean=40.27
-   
-For nominal attributes, we could now compose code that computes,
-for each attribute, how many times a specific value was used for each
-class. Instead, we used the built-in method ``DomainContingency``, which
-does just that. All that our script does is, mainly, print it
-out in a readable form (part of :download:`data_characteristics3.py <code/data_characteristics3.py>`)::
-
-   print "\nNominal attributes (contingency matrix for classes:", data.domain.classVar.values, ")"
-   cont = orange.DomainContingency(data)
-   for a in data.domain.attributes:
-       if a.varType == orange.VarTypes.Discrete:
-           print "  %s:" % a.name
-           for v in range(len(a.values)):
-               sum = 0
-               for cv in cont[a][v]:
-                   sum += cv
-               print "    %s, total %d, %s" % (a.values[v], sum, cont[a][v])
-           print
-
-Notice that the first part of this script is similar to the one that
-is dealing with continuous attributes, except that the for loop is a
-little bit simpler. With continuous attributes, the iterator in the
-loop was an attribute index, whereas in the script above we iterate
-through members of ``data.domain.attributes``, which are objects that
-represent attributes. Data structures that are addressed in Orange
-by attribute can most often be addressed either by attribute index,
-attribute name (string), or an object that represents the attribute.
-
-The output of the code above is rather long (this data set has
-some attributes that have rather large sets of values), so we show
-only the output for two attributes::
-
-   Nominal attributes (contingency matrix for classes: <>50K, <=50K> )
-     workclass:
-       Private, total 729, <170.000, 559.000>
-       Self-emp-not-inc, total 62, <19.000, 43.000>
-       Self-emp-inc, total 22, <10.000, 12.000>
-       Federal-gov, total 27, <10.000, 17.000>
-       Local-gov, total 53, <14.000, 39.000>
-       State-gov, total 39, <10.000, 29.000>
-       Without-pay, total 1, <0.000, 1.000>
-       Never-worked, total 0, <0.000, 0.000>
-   
-     sex:
-       Female, total 330, <28.000, 302.000>
-       Male, total 647, <208.000, 439.000>
-
-First, notice that in the vectors the first number refers to the
-higher income, and the second number to the lower income (e.g., from
-this data it looks like women earn less than men). Notice that
-Orange outputs the tuples. To change this, we would need another loop
-that would iterate through members of the tuples. You may also foresee
-that it would be interesting to compute the proportions rather than
-number of instances in above contingency matrix, but that we leave for
-your exercise.
-
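The contingency counts printed above, and the proportions left as an exercise, can be sketched without Orange in a few lines (a hypothetical ``contingency`` helper over rows stored as dicts):

```python
# A plain-Python sketch of a contingency matrix like the one printed above:
# for one discrete attribute, count class occurrences per attribute value.
from collections import Counter, defaultdict

def contingency(rows, attribute, klass):
    # rows are dicts; returns {attribute value: Counter of class values}
    table = defaultdict(Counter)
    for row in rows:
        table[row[attribute]][row[klass]] += 1
    return table

rows = [
    {'sex': 'Female', 'income': '>50K'},
    {'sex': 'Female', 'income': '<=50K'},
    {'sex': 'Male', 'income': '>50K'},
    {'sex': 'Male', 'income': '>50K'},
]
table = contingency(rows, 'sex', 'income')
for value, counts in sorted(table.items()):
    total = sum(counts.values())
    # Proportions rather than raw counts, as the tutorial leaves for exercise.
    props = {c: n / float(total) for c, n in counts.items()}
    print(value, total, props)
```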
-Missing values
---------------
-
-.. index::
-   single: missing values; statistics
-
-It is often interesting to see, for a given attribute, what the
-proportion of instances with that attribute unknown is. We have
-already learned that the function ``isSpecial()`` can be used to
-determine whether, for a specific instance and attribute, the value is
-not defined. Let us use this function to compute the proportion of missing
-values for each attribute (:download:`report_missing.py <code/report_missing.py>`)::
-
-   import orange
-   data = orange.ExampleTable("adult_sample")
-   
-   natt = len(data.domain.attributes)
-   missing = [0.] * natt
-   for i in data:
-       for j in range(natt):
-           if i[j].isSpecial():
-               missing[j] += 1
-   missing = map(lambda x, l=len(data):x/l*100., missing)
-   
-   print "Missing values per attribute:"
-   atts = data.domain.attributes
-   for i in range(natt):
-       print "  %5.1f%s %s" % (missing[i], '%', atts[i].name)
-
-The integer variable ``natt`` stores the number of attributes in the data
-set. An array ``missing`` stores the number of missing values per attribute;
-its size is therefore equal to ``natt``, and all of its elements are
-initially 0 (in fact, 0.0, since we purposely initialized it as a real
-number, which helped us later when we converted it to percentages).
-
-The only line that possibly looks (very?) strange is ``missing =
-map(lambda x, l=len(data):x/l*100., missing)``. This line could be
-replaced with a for loop, but we just wanted to have it here to show how
-coding in Python may look very strange, but may gain in
-efficiency. The function ``map`` takes a vector (in our case ``missing``), and
-executes a function on each of its elements, thus obtaining a new
-vector. The function it executes is in our case defined inline, and is
-in Python called a lambda expression. You can see that our lambda
-function takes a single argument (when mapped, an element of the vector
-``missing``), and returns its value normalized by the number of
-data instances (``len(data)``) and multiplied by 100, to turn it into a
-percentage. Thus, the map function in fact normalizes the elements of
-``missing`` to express the proportion of missing values over the instances
-of the data set.
-
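For comparison, the ``map``/lambda idiom discussed above is equivalent to a list comprehension, the more conventional spelling in modern Python (a standalone sketch; the counts are chosen so the percentages match the output below):

```python
# The map/lambda idiom discussed above, next to its more conventional
# list-comprehension equivalent; both normalize counts to percentages.
missing = [0.0, 44.0, 0.0, 19.0]   # assumed per-attribute missing counts
n = 977                            # number of data instances

via_map = list(map(lambda x, l=n: x / l * 100.0, missing))
via_comprehension = [x / n * 100.0 for x in missing]

assert via_map == via_comprehension
print([round(p, 1) for p in via_map])   # prints: [0.0, 4.5, 0.0, 1.9]
```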
-Finally, let us see the output of the script we have just been working
-on::
-
-   Missing values per attribute:
-       0.0% age
-       4.5% workclass
-       0.0% fnlwgt
-       0.0% education
-       0.0% education-num
-       0.0% marital-status
-       4.5% occupation
-       0.0% relationship
-       0.0% race
-       0.0% sex
-       0.0% capital-gain
-       0.0% capital-loss
-       0.0% hours-per-week
-       1.9% native-country
-
-In our sampled data set, just three attributes contain the missing
-values.
-
-Distributions of feature values
--------------------------------
-
-For some of the tasks above, Orange provides a shortcut by means of the
-``orange.DomainDistributions`` function, which returns an object that
-holds averages and mean square errors for continuous attributes, value
-frequencies for discrete attributes, and, for both, the number of instances
-where a specific attribute has a missing value.  The use of this object
-is exemplified in the following script (:download:`data_characteristics4.py <code/data_characteristics4.py>`)::
-
-   import orange
-   data = orange.ExampleTable("adult_sample")
-   dist = orange.DomainDistributions(data)
-   
-   print "Average values and mean square errors:"
-   for i in range(len(data.domain.attributes)):
-       if data.domain.attributes[i].varType == orange.VarTypes.Continuous:
-           print "%s, mean=%5.2f +- %5.2f" % \
-               (data.domain.attributes[i].name, dist[i].average(), dist[i].error())
-   
-   print "\nFrequencies for values of discrete attributes:"
-   for i in range(len(data.domain.attributes)):
-       a = data.domain.attributes[i]
-       if a.varType == orange.VarTypes.Discrete:
-           print "%s:" % a.name
-           for j in range(len(a.values)):
-               print "  %s: %d" % (a.values[j], int(dist[i][j]))
-   
-   print "\nNumber of items where attribute is not defined:"
-   for i in range(len(data.domain.attributes)):
-       a = data.domain.attributes[i]
-       print "  %2d %s" % (dist[i].unknowns, a.name)
-
-Check this script out. Its results should match the results we
-derived with other scripts in this lesson.
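The statistics gathered by the scripts in this lesson (means, value frequencies, missing counts) can be collected in a single pass; here is a plain-Python sketch of the kind of summary ``orange.DomainDistributions`` provides (``summarize`` is a hypothetical helper; missing values are represented as ``None``):

```python
# A one-pass sketch of the kind of summary orange.DomainDistributions
# provides: per-column mean for continuous attributes, value frequencies
# for discrete ones, plus a count of missing values (here None).
from collections import Counter

def summarize(column):
    known = [v for v in column if v is not None]
    missing = len(column) - len(known)
    if known and isinstance(known[0], (int, float)):
        return {'mean': sum(known) / float(len(known)), 'unknowns': missing}
    return {'frequencies': Counter(known), 'unknowns': missing}

age = [25, 38, None, 45]
workclass = ['Private', 'Private', None, 'State-gov']
print(summarize(age))        # prints: {'mean': 36.0, 'unknowns': 1}
print(summarize(workclass))
```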

docs/tutorial/rst/classification.rst

 ==============
 
 .. index:: classification
-.. index:: supervised data mining
+.. index:: 
+   single: data mining; supervised
 
-A substantial part of Orange is devoted to machine learning methods
-for classification, or supervised data mining. These methods start
-from the data that incorporates class-labeled instances, like
-:download:`voting.tab <code/voting.tab>`::
+Much of Orange is devoted to machine learning methods for classification, or supervised data mining. These methods rely on
+data with class-labeled instances, like that of senate voting. Here is the code that loads this data set, displays the first data instance and shows its class (``republican``)::
 
-   >>> data = orange.ExampleTable("voting.tab")
+   >>> data = Orange.data.Table("voting")
    >>> data[0]
    ['n', 'y', 'n', 'y', 'y', 'y', 'n', 'n', 'n', 'y', '?', 'y', 'y', 'y', 'n', 'y', 'republican']
-   >>> data[0].getclass()
+   >>> data[0].get_class()
    <orange.Value 'party'='republican'>
 
-Supervised data mining attempts to develop predictive models from such
-data that, given the set of feature values, predict a corresponding
-class.
+Learners and Classifiers
+------------------------
 
-.. index:: classifiers
 .. index::
-   single: classifiers; naive Bayesian
+   single: classification; learner
+.. index::
+   single: classification; classifier
+.. index::
+   single: classification; naive Bayesian classifier
 
-There are two types of objects important for classification: learners
-and classifiers. Orange has a number of build-in learners. For
-instance, ``orange.BayesLearner`` is a naive Bayesian learner. When
-data is passed to a learner (e.g., ``orange.BayesLearner(data))``, it
-returns a classifier. When data instance is presented to a classifier,
-it returns a class, vector of class probabilities, or both.
+Classification uses two types of objects: learners and classifiers. Learners consider class-labeled data and return a classifier. Given a data instance (a vector of feature values), classifiers return a predicted class::
 
-A Simple Classifier
--------------------
+    >>> import Orange
+    >>> data = Orange.data.Table("voting")
+    >>> learner = Orange.classification.bayes.NaiveLearner()
+    >>> classifier = learner(data)
+    >>> classifier(data[0])
+    <orange.Value 'party'='republican'>
 
-Let us see how this works in practice. We will
-construct a naive Bayesian classifier from voting data set, and
-will use it to classify the first five instances from this data set
-(:download:`classifier.py <code/classifier.py>`)::
+Above, we read the data, constructed a `naive Bayesian learner <http://en.wikipedia.org/wiki/Naive_Bayes_classifier>`_, gave it the data set to construct a classifier, and used it to predict the class of the first data item. We also use these concepts in the following code that predicts the classes of the first five instances in the data set:
 
-   import orange
-   data = orange.ExampleTable("voting")
-   classifier = orange.BayesLearner(data)
-   for i in range(5):
-       c = classifier(data[i])
-       print "original", data[i].getclass(), "classified as", c
+.. literalinclude:: code/classification-classifier1.py
+   :lines: 4-
 
-The script loads the data, uses it to construct a classifier using
-the naive Bayesian method, and then classifies the first five instances
-of the data set. Note that both the original class and the class
-assigned by the classifier are printed out.
+The script outputs::
 
-The data set that we use includes votes for each of the U.S.  House of
-Representatives Congressmen on the 16 key votes; a class is a
-representative's party. There are 435 data instances - 267 democrats
-and 168 republicans - in the data set (see UCI ML Repository and
-voting-records data set for further description).  This is how our
-classifier performs on the first five instances:
+    republican; originally republican
+    republican; originally republican
+    republican; originally democrat
+      democrat; originally democrat
+      democrat; originally democrat
 
-   1: republican (originally republican)
-   2: republican (originally republican)
-   3: republican (originally democrat)
-   4: democrat (originally democrat)
-   5: democrat (originally democrat)
+The naive Bayesian classifier made a mistake in the third instance, but otherwise predicted correctly. No wonder, since this was also the data it was trained on.
 
-Naive Bayes made a mistake at a third instance, but otherwise predicted
-correctly.
-
-Obtaining Class Probabilities
------------------------------
+Probabilistic Classification
+----------------------------
 
 To find out the probability that the classifier assigns
 to, say, the democrat class, we need to call the classifier with
-additional parameter ``orange.GetProbabilities``. Also, note that the
-democrats have a class index 1. We find this out with print
-``data.domain.classVar.values`` (:download:`classifier2.py <code/classifier2.py>`)::
+an additional parameter that specifies the output type. If this is ``Orange.classification.Classifier.GetProbabilities``, the classifier will output class probabilities:
 
-   import orange
-   data = orange.ExampleTable("voting")
-   classifier = orange.BayesLearner(data)
-   print "Possible classes:", data.domain.classVar.values
-   print "Probabilities for democrats:"
-   for i in range(5):
-       p = classifier(data[i], orange.GetProbabilities)
-       print "%d: %5.3f (originally %s)" % (i+1, p[1], data[i].getclass())
+.. literalinclude:: code/classification-classifier2.py
+   :lines: 4-
 
-The output of this script is::
+The output of the script also shows how badly the naive Bayesian classifier missed the class for the third data item::
 
-   Possible classes: <republican, democrat>
-   Probabilities for democrats:
-   1: 0.000 (originally republican)
-   2: 0.000 (originally republican)
-   3: 0.005 (originally democrat)
-   4: 0.998 (originally democrat)
-   5: 0.957 (originally democrat)
+   Probabilities for democrat:
+   0.000; originally republican
+   0.000; originally republican
+   0.005; originally democrat
+   0.998; originally democrat
+   0.957; originally democrat
 
-The printout, for example, shows that with the third instance
-naive Bayes has not only misclassified, but the classifier missed
-quite substantially; it has assigned only a 0.005 probability to
-the correct class.
+Cross-Validation
+----------------
 
-.. note::
-   Python list indexes start with 0.
+.. index:: cross-validation
 
-.. note::
-   The ordering of class values depend on occurence of classes in the
-   input data set.
+Validating the accuracy of classifiers on the training data, as we did above, serves demonstration purposes only. Any performance measure that assesses accuracy should be estimated on an independent test set. One such procedure is `cross-validation <http://en.wikipedia.org/wiki/Cross-validation_(statistics)>`_, which averages performance estimates across several runs, each time considering different training and test subsets sampled from the original data set:
 
-Classification tree
--------------------
+.. literalinclude:: code/classification-cv.py
+   :lines: 3-
 
-.. index:: classifiers
 .. index::
-   single: classifiers; classification trees
+   single: classification; scoring
+.. index::
+   single: classification; area under ROC
+.. index::
+   single: classification; accuracy
 
-Classification tree learner (yes, this is the same *decision tree*)
-is a native Orange learner, but because it is a rather
-complex object that is for its versatility composed of a number of
-other objects (for attribute estimation, stopping criterion, etc.),
-a wrapper (module) called ``orngTree`` was build around it to simplify
-the use of classification trees and to assemble the learner with
-some usual (default) components. Here is a script with it (:download:`tree.py <code/tree.py>`)::
+Cross-validation expects a list of learners. The performance estimators also return a list of scores, one for every learner. There was just one learner in the script above, hence a list of size one was used. The script estimates classification accuracy and area under the ROC curve. The latter score is very high, indicating a very good performance of the naive Bayesian learner on the senate voting data set::
 
-   import orange, orngTree
-   data = orange.ExampleTable("voting")
+   Accuracy: 0.90
+   AUC:      0.97
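The fold bookkeeping behind cross-validation, as described above, can be sketched in plain Python: each instance lands in the test set exactly once, with the remaining folds used for training (a hypothetical round-robin split, not Orange's sampling):

```python
# A plain-Python sketch of the fold bookkeeping behind cross-validation:
# each instance serves as test data exactly once, and the remaining folds
# form the training set. Not Orange's implementation.
def k_fold_indices(n, k):
    # Assign each of n instances to one of k folds, round-robin.
    return [i % k for i in range(n)]

def cross_validation_splits(data, k):
    folds = k_fold_indices(len(data), k)
    for test_fold in range(k):
        train = [x for x, f in zip(data, folds) if f != test_fold]
        test = [x for x, f in zip(data, folds) if f == test_fold]
        yield train, test

data = list(range(10))
for train, test in cross_validation_splits(data, k=5):
    assert len(test) == 2 and len(train) == 8
    assert sorted(train + test) == data   # every instance is used each round
```

In practice the assignment to folds is randomized (and often stratified, as with the sampling earlier in the tutorial), but the accounting is the same.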
+
+
+Handful of Classifiers
+----------------------
+
+Orange includes a wide range of classification algorithms, including:
+
+- logistic regression (``Orange.classification.logreg``)
+- k-nearest neighbors (``Orange.classification.knn``)
+- support vector machines (``Orange.classification.svm``)
+- classification trees (``Orange.classification.tree``)
+- classification rules (``Orange.classification.rules``)
+
+Some of these are used in the code below, which estimates the probability of a target class on test data. This time, the training and test data sets are disjoint:
+
+.. index::
+   single: classification; logistic regression
+.. index::
+   single: classification; trees
+.. index::
+   single: classification; k-nearest neighbors
+
+.. literalinclude:: code/classification-other.py
+
+For these five data items, there are no major differences among the predictions of the observed classification algorithms::
+
+   Probabilities for republican:
+   original class  tree      k-NN      lr       
+   republican      0.949     1.000     1.000
+   republican      0.972     1.000     1.000
+   democrat        0.011     0.078     0.000
+   democrat        0.015     0.001     0.000
+   democrat        0.015     0.032     0.000
+
+The following code cross-validates several learners. Notice the difference between this and the code above. Cross-validation requires learners, while in the script above, learners were immediately given the data and the calls returned classifiers.
+
+.. literalinclude:: code/classification-cv2.py
+
+Logistic regression wins in area under ROC curve::
+
+            nbc  tree lr  
+   Accuracy 0.90 0.95 0.94
+   AUC      0.97 0.94 0.99
+
+Reporting on Classification Models
+----------------------------------
+
+Classification models are objects, exposing every component of their structure. For instance, one can traverse a classification tree in code and observe the associated data instances, probabilities and conditions. It is often sufficient, however, to provide a textual output of the model. For logistic regression and trees, this is illustrated in the script below:
+
+.. literalinclude:: code/classification-models.py
+
+The logistic regression part of the output is::
    
-   tree = orngTree.TreeLearner(data, sameMajorityPruning=1, mForPruning=2)
-   print "Possible classes:", data.domain.classVar.values
-   print "Probabilities for democrats:"
-   for i in range(5):
-       p = tree(data[i], orange.GetProbabilities)
-       print "%d: %5.3f (originally %s)" % (i+1, p[1], data[i].getclass())
+   class attribute = survived
+   class values = <no, yes>
+
+         Feature       beta  st. error     wald Z          P OR=exp(beta)
    
-   orngTree.printTxt(tree)
+       Intercept      -1.23       0.08     -15.15      -0.00
+    status=first       0.86       0.16       5.39       0.00       2.36
+   status=second      -0.16       0.18      -0.91       0.36       0.85
+    status=third      -0.92       0.15      -6.12       0.00       0.40
+       age=child       1.06       0.25       4.30       0.00       2.89
+      sex=female       2.42       0.14      17.04       0.00      11.25
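The last column of the report is the odds ratio, OR = exp(beta); it can be reproduced directly from the reported coefficients:

```python
import math

# coefficients (beta) copied from the logistic regression report above
betas = {"status=first": 0.86, "status=second": -0.16,
         "status=third": -0.92, "age=child": 1.06, "sex=female": 2.42}

for name, beta in betas.items():
    # e.g. exp(0.86) = 2.36, matching the OR=exp(beta) column
    print("%-14s OR = exp(%5.2f) = %5.2f" % (name, beta, math.exp(beta)))
```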
 
-.. note:: 
-   The script for classification tree is almost the same as the one
-   for naive Bayes (:download:`classifier2.py <code/classifier2.py>`), except that we have imported
-   another module (``orngTree``) and used learner
-   ``orngTree.TreeLearner`` to build a classifier called ``tree``.
+Trees can also be rendered in `dot <http://en.wikipedia.org/wiki/DOT_language>`_::
 
-.. note::
-   For those of you that are at home with machine learning: the
-   default parameters for tree learner assume that a single example is
-   enough to have a leaf for it, gain ratio is used for measuring the
-   quality of attributes that are considered for internal nodes of the
-   tree, and after the tree is constructed the subtrees no pruning
-   takes place.
+   tree.dot(file_name="0.dot", node_shape="ellipse", leaf_shape="box")
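The DOT language itself is plain text, so the kind of file this call writes can be mimicked by hand. The toy tree below is illustrative only (not actual Orange output), with ellipses for internal nodes and boxes for leaves, mirroring the `node_shape`/`leaf_shape` arguments above:

```python
# A hand-written toy tree in DOT: one internal node, two leaves.
lines = ['digraph tree {',
         '    "sex" [shape=ellipse];',
         '    "yes" [shape=box];',
         '    "no" [shape=box];',
         '    "sex" -> "yes" [label="female"];',
         '    "sex" -> "no" [label="male"];',
         '}']
dot_source = "\n".join(lines)
print(dot_source)   # save as tree.dot, then: dot -Tpng tree.dot -o tree.png
```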
 
-The resulting tree with default parameters would be rather big, so we
-have additionally requested that leaves that share common predecessor
-(node) are pruned if they classify to the same class, and requested
-that tree is post-pruned using m-error estimate pruning method with
-parameter m set to 2.0. The output of our script is::
-
-   Possible classes: <republican, democrat>
-   Probabilities for democrats:
-   1: 0.051 (originally republican)
-   2: 0.027 (originally republican)
-   3: 0.989 (originally democrat)
-   4: 0.985 (originally democrat)
-   5: 0.985 (originally democrat)
-
-Notice that all of the instances are classified correctly. The last
-line of the script prints out the tree that was used for
-classification::
-
-   physician-fee-freeze=n: democrat (98.52%)
-   physician-fee-freeze=y
-   |    synfuels-corporation-cutback=n: republican (97.25%)
-   |    synfuels-corporation-cutback=y
-   |    |    mx-missile=n
-   |    |    |    el-salvador-aid=y
-   |    |    |    |    adoption-of-the-budget-resolution=n: republican (85.33%)
-   |    |    |    |    adoption-of-the-budget-resolution=y
-   |    |    |    |    |    anti-satellite-test-ban=n: democrat (99.54%)
-   |    |    |    |    |    anti-satellite-test-ban=y: republican (100.00%)
-   |    |    |    el-salvador-aid=n
-   |    |    |    |    handicapped-infants=n: republican (100.00%)
-   |    |    |    |    handicapped-infants=y: democrat (99.77%)
-   |    |    mx-missile=y
-   |    |    |    religious-groups-in-schools=y: democrat (99.54%)
-   |    |    |    religious-groups-in-schools=n
-   |    |    |    |    immigration=y: republican (98.63%)
-   |    |    |    |    immigration=n
-   |    |    |    |    |    handicapped-infants=n: republican (98.63%)
-   |    |    |    |    |    handicapped-infants=y: democrat (99.77%)
-
-The printout includes the feature on which the tree branches in the
-internal nodes. For leaves, it shows the the class label to which a
-tree would make a classification. The probability of that class, as
-estimated from the training data set, is also displayed.
-
-If you are more of a *visual* type, you may like the graphical 
-presentation of the tree better. This was achieved by printing out a
-tree in so-called dot file (the line of the script required for this
-is ``orngTree.printDot(tree, fileName='tree.dot',
-internalNodeShape="ellipse", leafShape="box")``), which was then
-compiled to PNG using program called `dot`_.
+The following figure shows an example of such a rendering.
 
 .. image:: files/tree.png
    :alt: A graphical presentation of a classification tree
-
-.. _dot: http://graphviz.org/
-
-Nearest neighbors and majority classifiers
-------------------------------------------
-
-.. index:: classifiers
-.. index:: 
-   single: classifiers; k nearest neighbours
-.. index:: 
-   single: classifiers; majority classifier
-
-Let us here check on two other classifiers. Majority classifier always
-classifies to the majority class of the training set, and predicts 
-class probabilities that are equal to class distributions from the training
-set. While being useless as such, it may often be good to compare this
-simplest classifier to any other classifier you test &ndash; if your
-other classifier is not significantly better than majority classifier,
-than this may a reason to sit back and think.
-
-The second classifier we are introducing here is based on k-nearest
-neighbors algorithm, an instance-based method that finds k examples
-from training set that are most similar to the instance that has to be
-classified. From the set it obtains in this way, it estimates class
-probabilities and uses the most frequent class for prediction.
-
-The following script takes naive Bayes, classification tree (what we
-have already learned), majority and k-nearest neighbors classifier
-(new ones) and prints prediction for first 10 instances of voting data
-set (:download:`handful.py <code/handful.py>`)::
-
-   import orange, orngTree
-   data = orange.ExampleTable("voting")
-   
-   # setting up the classifiers
-   majority = orange.MajorityLearner(data)
-   bayes = orange.BayesLearner(data)
-   tree = orngTree.TreeLearner(data, sameMajorityPruning=1, mForPruning=2)
-   knn = orange.kNNLearner(data, k=21)
-   
-   majority.name="Majority"; bayes.name="Naive Bayes";
-   tree.name="Tree"; knn.name="kNN"
-   
-   classifiers = [majority, bayes, tree, knn]
-   
-   # print the head
-   print "Possible classes:", data.domain.classVar.values
-   print "Probability for republican:"
-   print "Original Class",
-   for l in classifiers:
-       print "%-13s" % (l.name),
-   print
-   
-   # classify first 10 instances and print probabilities
-   for example in data[:10]:
-       print "(%-10s)  " % (example.getclass()),
-       for c in classifiers:
-           p = apply(c, [example, orange.GetProbabilities])
-           print "%5.3f        " % (p[0]),
-       print
-
-The code is somehow long, due to our effort to print the results
-nicely. The first part of the code sets-up our four classifiers, and
-gives them names. Classifiers are then put into the list denoted with
-variable ``classifiers`` (this is nice since, if we would need to add
-another classifier, we would just define it and put it in the list,
-and for the rest of the code we would not worry about it any
-more). The script then prints the header with the names of the
-classifiers, and finally uses the classifiers to compute the
-probabilities of classes. Note for a special function ``apply`` that
-we have not met yet: it simply calls a function that is given as its
-first argument, and passes it the arguments that are given in the
-list. In our case, ``apply`` invokes our classifiers with a data
-instance and request to compute probabilities. The output of our
-script is::
-
-   Possible classes: <republican, democrat>
-   Probability for republican:
-   Original Class Majority      Naive Bayes   Tree          kNN
-   (republican)   0.386         1.000         0.949         1.000
-   (republican)   0.386         1.000         0.973         1.000
-   (democrat  )   0.386         0.995         0.011         0.138
-   (democrat  )   0.386         0.002         0.015         0.468
-   (democrat  )   0.386         0.043         0.015         0.035
-   (democrat  )   0.386         0.228         0.015         0.442
-   (democrat  )   0.386         1.000         0.973         0.977
-   (republican)   0.386         1.000         0.973         1.000
-   (republican)   0.386         1.000         0.973         1.000
-   (democrat  )   0.386         0.000         0.015         0.000
-
-.. note::
-   The prediction of majority class classifier does not depend on the
-   instance it classifies (of course!).
-
-.. note:: 
-   At this stage, it would be inappropriate to say anything conclusive
-   on the predictive quality of the classifiers - for this, we will
-   need to resort to statistical methods on comparison of
-   classification models.

docs/tutorial/rst/code/accuracy.py

-# Description: Learn a naive Bayesian classifier, and measure classification accuracy on the same data set
-# Category:    evaluation
-# Uses:        voting.tab
-# Referenced:  c_performance.htm
-
-import orange
-data = orange.ExampleTable("voting")
-classifier = orange.BayesLearner(data)
-
-# compute classification accuracy
-correct = 0.0
-for ex in data:
-    if classifier(ex) == ex.getclass():
-        correct += 1
-print "Classification accuracy:", correct/len(data)

docs/tutorial/rst/code/accuracy2.py

-# Description: Set a number of learners, for each build a classifier from the data and determine classification accuracy
-# Category:    evaluation
-# Uses:        voting.tab
-# Referenced:  c_performance.htm
-
-import orange, orngTree
-
-def accuracy(test_data, classifiers):
-    correct = [0.0]*len(classifiers)
-    for ex in test_data:
-        for i in range(len(classifiers)):
-            if classifiers[i](ex) == ex.getclass():
-                correct[i] += 1
-    for i in range(len(correct)):
-        correct[i] = correct[i] / len(test_data)
-    return correct
-
-# set up the classifiers
-data = orange.ExampleTable("voting")
-bayes = orange.BayesLearner(data)
-bayes.name = "bayes"
-tree = orngTree.TreeLearner(data);
-tree.name = "tree"
-classifiers = [bayes, tree]
-
-# compute accuracies
-acc = accuracy(data, classifiers)
-print "Classification accuracies:"
-for i in range(len(classifiers)):
-    print classifiers[i].name, acc[i]

docs/tutorial/rst/code/accuracy3.py

-# Category:    evaluation
-# Description: Set a number of learners, split data to train and test set, learn models from train set and estimate classification accuracy on the test set
-# Uses:        voting.tab
-# Classes:     MakeRandomIndices2
-# Referenced:  c_performance.htm
-
-import orange, orngTree
-
-def accuracy(test_data, classifiers):
-    correct = [0.0]*len(classifiers)
-    for ex in test_data:
-        for i in range(len(classifiers)):
-            if classifiers[i](ex) == ex.getclass():
-                correct[i] += 1
-    for i in range(len(correct)):
-        correct[i] = correct[i] / len(test_data)
-    return correct
-
-# set up the classifiers
-data = orange.ExampleTable("voting")
-selection = orange.MakeRandomIndices2(data, 0.5)
-train_data = data.select(selection, 0)
-test_data = data.select(selection, 1)
-
-bayes = orange.BayesLearner(train_data)
-tree = orngTree.TreeLearner(train_data)
-bayes.name = "bayes"
-tree.name = "tree"
-classifiers = [bayes, tree]
-
-# compute accuracies
-acc = accuracy(test_data, classifiers)
-print "Classification accuracies:"
-for i in range(len(classifiers)):
-    print classifiers[i].name, acc[i]
-

docs/tutorial/rst/code/accuracy4.py

-# Description: Estimation of accuracy by random sampling.
-# User can set what proportion of data will be used in training.
-# Demonstration of use for different learners.
-# Category:   evaluation
-# Uses:        voting.tab
-# Classes:     MakeRandomIndices2
-# Referenced:  c_performance.htm
-
-import orange, orngTree
-
-def accuracy(test_data, classifiers):
-    correct = [0.0] * len(classifiers)
-    for ex in test_data:
-        for i in range(len(classifiers)):
-            if classifiers[i](ex) == ex.getclass():
-                correct[i] += 1
-    for i in range(len(correct)):
-        correct[i] = correct[i] / len(test_data)
-    return correct
-
-def test_rnd_sampling(data, learners, p=0.7, n=10):
-    acc = [0.0] * len(learners)
-    for i in range(n):
-        selection = orange.MakeRandomIndices2(data, p)
-        train_data = data.select(selection, 0)
-        test_data = data.select(selection, 1)
-        classifiers = []
-        for l in learners:
-            classifiers.append(l(train_data))
-        acc1 = accuracy(test_data, classifiers)
-        print "%d: %s" % (i + 1, ["%.6f" % a for a in acc1])
-        for j in range(len(learners)):
-            acc[j] += acc1[j]
-    for j in range(len(learners)):
-        acc[j] = acc[j] / n
-    return acc
-
-orange.setrandseed(0)
-# set up the learners
-bayes = orange.BayesLearner()
-tree = orngTree.TreeLearner();
-#tree = orngTree.TreeLearner(mForPruning=2)
-bayes.name = "bayes"
-tree.name = "tree"
-learners = [bayes, tree]
-
-# compute accuracies on data
-data = orange.ExampleTable("voting")
-acc = test_rnd_sampling(data, learners)
-print "Classification accuracies:"
-for i in range(len(learners)):
-    print learners[i].name, acc[i]

docs/tutorial/rst/code/accuracy5.py

-# Category:    evaluation
-# Description: Estimation of accuracy by cross validation. Demonstration of use for different learners.
-# Uses:        voting.tab
-# Classes:     MakeRandomIndicesCV
-# Referenced:  c_performance.htm
-
-import orange, orngTree
-
-def accuracy(test_data, classifiers):
-    correct = [0.0] * len(classifiers)
-    for ex in test_data:
-        for i in range(len(classifiers)):
-            if classifiers[i](ex) == ex.getclass():
-                correct[i] += 1
-    for i in range(len(correct)):
-        correct[i] = correct[i] / len(test_data)
-    return correct
-
-def cross_validation(data, learners, k=10):
-    acc = [0.0] * len(learners)
-    selection = orange.MakeRandomIndicesCV(data, folds=k)
-    for test_fold in range(k):
-        train_data = data.select(selection, test_fold, negate=1)
-        test_data = data.select(selection, test_fold)
-        classifiers = []
-        for l in learners:
-            classifiers.append(l(train_data))
-        acc1 = accuracy(test_data, classifiers)
-        print "%d: %s" % (test_fold + 1, ["%.6f" % a for a in acc1])
-        for j in range(len(learners)):
-            acc[j] += acc1[j]
-    for j in range(len(learners)):
-        acc[j] = acc[j] / k
-    return acc
-
-orange.setrandseed(0)
-# set up the learners
-bayes = orange.BayesLearner()
-tree = orngTree.TreeLearner(mForPruning=2)
-
-bayes.name = "bayes"
-tree.name = "tree"
-learners = [bayes, tree]
-
-# compute accuracies on data
-data = orange.ExampleTable("voting")
-acc = cross_validation(data, learners, k=10)
-print "Classification accuracies:"
-for i in range(len(learners)):
-    print learners[i].name, acc[i]

docs/tutorial/rst/code/accuracy6.py

-# Description: Leave-one-out method for estimation of classification accuracy. Demonstration of use for different learners
-# Category:    evaluation
-# Uses:        voting.tab
-# Referenced:  c_performance.htm
-
-import orange, orngTree
-
-def leave_one_out(data, learners):
-    acc = [0.0]*len(learners)
-    selection = [1] * len(data)
-    last = 0
-    for i in range(len(data)):
-        print 'leave-one-out: %d of %d' % (i, len(data))
-        selection[last] = 1
-        selection[i] = 0
-        train_data = data.select(selection, 1)
-        for j in range(len(learners)):
-            classifier = learners[j](train_data)
-            if classifier(data[i]) == data[i].getclass():
-                acc[j] += 1
-        last = i
-
-    for j in range(len(learners)):
-        acc[j] = acc[j]/len(data)
-    return acc
-
-orange.setrandseed(0)    
-# set up the learners
-bayes = orange.BayesLearner()
-tree = orngTree.TreeLearner(minExamples=10, mForPruning=2)
-bayes.name = "bayes"
-tree.name = "tree"
-learners = [bayes, tree]
-
-# compute accuracies on data
-data = orange.ExampleTable("voting")
-acc = leave_one_out(data, learners)
-print "Classification accuracies:"
-for i in range(len(learners)):
-    print learners[i].name, acc[i]

docs/tutorial/rst/code/accuracy7.py

-# Description: Demostration of use of cross-validation as provided in orngEval module
-# Category:    evaluation
-# Uses:        voting.tab
-# Classes:     orngTest.crossValidation
-# Referenced:  c_performance.htm
-
-import orange, orngTest, orngStat, orngTree
-
-# set up the learners
-bayes = orange.BayesLearner()
-tree = orngTree.TreeLearner(mForPruning=2)
-bayes.name = "bayes"
-tree.name = "tree"
-learners = [bayes, tree]
-
-# compute accuracies on data
-data = orange.ExampleTable("voting")
-results = orngTest.crossValidation(learners, data, folds=10)
-
-# output the results
-print "Learner  CA     IS     Brier    AUC"
-for i in range(len(learners)):
-    print "%-8s %5.3f  %5.3f  %5.3f  %5.3f" % (learners[i].name, \
-        orngStat.CA(results)[i], orngStat.IS(results)[i],
-        orngStat.BrierScore(results)[i], orngStat.AUC(results)[i])

docs/tutorial/rst/code/accuracy8.py

-# Description: Demostration of use of cross-validation as provided in orngEval module
-# Category:    evaluation
-# Uses:        voting.tab
-# Classes:     orngTest.crossValidation
-# Referenced:  c_performance.htm
-
-import orange
-import orngTest, orngStat, orngTree
-
-# set up the learners
-bayes = orange.BayesLearner()
-tree = orngTree.TreeLearner(mForPruning=2)
-bayes.name = "bayes"
-tree.name = "tree"
-learners = [bayes, tree]
-
-# compute accuracies on data
-data = orange.ExampleTable("voting")
-res = orngTest.crossValidation(learners, data, folds=10)
-cm = orngStat.computeConfusionMatrices(res,
-        classIndex=data.domain.classVar.values.index('democrat'))
-
-stat = (('CA', lambda res,cm: orngStat.CA(res)),
-        ('Sens', lambda res,cm: orngStat.sens(cm)),
-        ('Spec', lambda res,cm: orngStat.spec(cm)),
-        ('AUC', lambda res,cm: orngStat.AUC(res)),
-        ('IS', lambda res,cm: orngStat.IS(res)),
-        ('Brier', lambda res,cm: orngStat.BrierScore(res)),
-        ('F1', lambda res,cm: orngStat.F1(cm)),
-        ('F2', lambda res,cm: orngStat.Falpha(cm, alpha=2.0)),
-        ('MCC', lambda res,cm: orngStat.MCC(cm)),
-        ('sPi', lambda res,cm: orngStat.scottsPi(cm)),
-        )
-
-scores = [s[1](res,cm) for s in stat]
-print
-print "Learner  " + "".join(["%-7s" % s[0] for s in stat])
-for (i, l) in enumerate(learners):
-    print "%-8s " % l.name + "".join(["%5.3f  " % s[i] for s in scores])