Commits

Blaz Zupan committed 7b3d84b

tutorial update

  • Participants
  • Parent commits c22077a

Comments (0)

Files changed (114)

File docs/tutorial/rst/association-rules.rst

-.. index:: association rules
-
-Association rules
-=================
-
-Association rules are fun to do in Orange. One reason for this is
-Python, and a particular implementation that lets a list of
-association rules behave just like any list in Python. That is, you
-can select parts of the list, you can remove rules, and you can even add them
-(yes, ``append()`` works on Orange association rules!).
-
-For association rules, Orange provides a straightforward implementation of the APRIORI
-algorithm (see Agrawal et al., Fast discovery of association rules, a
-chapter in Advances in knowledge discovery and data mining, 1996);
-Orange includes an optimized version of the algorithm that works on
-tabular data. For a number of reasons (but mostly for convenience),
-association rules should be constructed and managed through the
-interface provided by :py:mod:`Orange.associate`. As implemented in Orange,
-the association rule construction procedure does not handle continuous
-attributes, so make sure that your data is categorized. Also, class
-variables are treated just like attributes. For the examples in this
-tutorial, we will use data from the data set :download:`imports-85.tab <code/imports-85.tab>`, which
-surveys different types of cars and lists their characteristics. We
-will use only the first ten attributes from this data set and categorize
-them so that three equally populated intervals are created for each
-continuous variable. This is done by the following part of
-the code::
-
-   data = orange.ExampleTable("imports-85")
-   data = orange.Preprocessor_discretize(data, \
-     method=orange.EquiNDiscretization(numberOfIntervals=3))
-   data = data.select(range(10))
-
-Now, to our examples. The first one uses the data set constructed with the
-above script and shows how to build a list of association rules that
-have support of at least 0.4. Next, we select a subset of the first
-five rules, print them out, delete the first three rules and repeat the
-printout. The script that does this is (part of :download:`assoc1.py <code/assoc1.py>`, uses
-:download:`imports-85.tab <code/imports-85.tab>`)::
-
-   minSupport = 0.4
-   rules = orange.AssociationRulesInducer(data, support=minSupport)
-   
-   print "%i rules with support higher than or equal to %5.3f found." % (len(rules), minSupport)
-   
-   orngAssoc.sort(rules, ["support", "confidence"])
-   
-   orngAssoc.printRules(rules[:5], ["support", "confidence"])
-   print
-   
-   del rules[:3]
-   orngAssoc.printRules(rules[:5], ["support", "confidence"])
-   print
-
-The output of this script is::
-
-   87 rules with support higher than or equal to 0.400 found.
-   
-   supp    conf    rule
-   0.888   0.984   engine-location=front -> fuel-type=gas
-   0.888   0.901   fuel-type=gas -> engine-location=front
-   0.805   0.982   engine-location=front -> aspiration=std
-   0.805   0.817   aspiration=std -> engine-location=front
-   0.785   0.958   fuel-type=gas -> aspiration=std
-   
-   supp    conf    rule
-   0.805   0.982   engine-location=front -> aspiration=std
-   0.805   0.817   aspiration=std -> engine-location=front
-   0.785   0.958   fuel-type=gas -> aspiration=std
-   0.771   0.981   fuel-type=gas aspiration=std -> engine-location=front
-   0.771   0.958   aspiration=std engine-location=front -> fuel-type=gas
-   
-Notice that when printing out the rules, the user can specify which
-rule evaluation measures are to be printed. Choose anything from
-``['support', 'confidence', 'lift', 'leverage', 'strength',
-'coverage']``.
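These measures have simple definitions. The sketch below (plain Python, independent of the Orange API; the transaction counts are made up for illustration) shows how support, confidence and lift of a rule X -> Y can be computed:

```python
def rule_measures(n_xy, n_x, n_y, n):
    """Measures for a rule X -> Y over n transactions.

    n_xy: transactions matching both X and Y
    n_x, n_y: transactions matching X / matching Y
    """
    support = n_xy / float(n)                # P(X and Y)
    confidence = n_xy / float(n_x)           # P(Y | X)
    lift = confidence / (n_y / float(n))     # P(Y | X) / P(Y)
    return support, confidence, lift

# hypothetical counts: 205 cars, 185 with fuel-type=gas,
# 182 with engine-location=front, 180 with both
print(rule_measures(180, 185, 182, 205))
```

Support measures how often the rule applies at all, confidence how reliable it is, and lift how much better it does than chance.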
-
-The second example uses the same data set, but first prints out the five
-most confident rules. Then, it shows a rather advanced type of
-filtering: every rule stores measures such as its support,
-confidence, etc. These may be used when constructing your own filter
-functions. The one in our example uses ``confidence`` and ``lift``.
-
-.. note:: 
-   If you have just started with Python: lambda is a compact way to
-   specify a simple function without using the def statement. As a
-   function, it uses its own namespace, so the minimal confidence and lift
-   requested in our example should be passed as function
-   arguments. 
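To make the note concrete, here is a small plain-Python illustration (the rule dictionaries are made up): binding the thresholds as default arguments fixes their values at the moment the lambda is defined.

```python
conf, lift = 0.8, 1.1

# default arguments capture the current values of conf and lift
keep = lambda r, conf=conf, lift=lift: r["confidence"] > conf and r["lift"] > lift

rules = [{"confidence": 0.9, "lift": 1.2},   # passes both thresholds
         {"confidence": 0.7, "lift": 1.5}]   # fails the confidence threshold
print([r for r in rules if keep(r)])
```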
-
-Here goes the code (part of :download:`assoc2.py <code/assoc2.py>`)::
-
-   rules = orange.AssociationRulesInducer(data, support = 0.4)
-   
-   n = 5
-   print "%i most confident rules:" % (n)
-   orngAssoc.sort(rules, ["confidence"])
-   orngAssoc.printRules(rules[0:n], ['confidence','support','lift'])
-   
-   conf = 0.8; lift = 1.1
-   print "\nRules with confidence>%5.3f and lift>%5.3f" % (conf, lift)
-   rulesC=rules.filter(lambda x: x.confidence>conf and x.lift>lift)
-   orngAssoc.sort(rulesC, ['confidence'])
-   orngAssoc.printRules(rulesC, ['confidence','support','lift'])
-   
-Just one rule with the requested confidence and lift is found in our rule set::
-
-   5 most confident rules:
-   conf    supp    lift    rule
-   1.000   0.478   1.015   fuel-type=gas aspiration=std drive-wheels=fwd -> engine-location=front
-   1.000   0.429   1.015   fuel-type=gas aspiration=std num-of-doors=four -> engine-location=front
-   1.000   0.507   1.015   aspiration=std drive-wheels=fwd -> engine-location=front
-   1.000   0.449   1.015   aspiration=std num-of-doors=four -> engine-location=front
-   1.000   0.541   1.015   fuel-type=gas drive-wheels=fwd -> engine-location=front
-   
-   Rules with confidence>0.800 and lift>1.100
-   conf    supp    lift    rule
-   0.898   0.429   1.116   fuel-type=gas num-of-doors=four -> aspiration=std engine-location=front
-   

File docs/tutorial/rst/basic-exploration.rst

-Basic data exploration
-======================
-
-.. index:: basic data exploration
-
-Until now we have looked only at data files that include solely
-nominal (discrete) attributes. Let's make things more interesting now,
-and look at another file with a mixture of attribute types. We will
-first use the adult data set from the UCI ML Repository. The prediction task
-related to this data set is to determine whether a person
-characterized by 14 attributes like education, race, occupation, etc.,
-makes over $50K/year. Because the original set :download:`adult.tab <code/adult.tab>` is
-rather big (32561 data instances, about 4 MBytes), we will first
-create a smaller sample of about 3% of the instances and use it in our
-examples. If you are curious how we do this, here is the code
-(:download:`sample_adult.py <code/sample_adult.py>`)::
-
-   import orange
-   data = orange.ExampleTable("adult")
-   selection = orange.MakeRandomIndices2(data, 0.03)
-   sample = data.select(selection, 0)
-   sample.save("adult_sample.tab")
-
-The script above loads the data and prepares a selection vector of length equal to
-the number of data instances. The vector contains 0's and 1's, with
-about 3% of 0's. Then, those instances are
-selected which have a corresponding 0 in the selection vector, and stored
-in an object called *sample*. The sampled data is then saved to a
-file. Note that ``MakeRandomIndices2`` performs a stratified selection,
-i.e., the class distribution of the original and the sampled data should be
-nearly the same.
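A rough sketch of what stratified selection does, in plain Python and without Orange (the helper name and the toy labels are ours): sample the same proportion from each class, so that class frequencies are preserved.

```python
import random
from collections import defaultdict

def stratified_sample(labels, p, seed=42):
    """Return sorted indices of roughly a proportion p of each class."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    chosen = []
    for idxs in by_class.values():
        k = max(1, round(len(idxs) * p))   # keep the per-class proportion
        chosen.extend(rng.sample(idxs, k))
    return sorted(chosen)

labels = [">50K"] * 25 + ["<=50K"] * 75
sample = stratified_sample(labels, 0.2)
print(len(sample))   # prints 20: 5 from one class, 15 from the other
```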
-
-Basic characteristics of data sets
-----------------------------------
-
-.. index::
-   single: basic data exploration; attributes
-.. index::
-   single: basic data exploration; classes
-.. index::
-   single: basic data exploration; missing values
-
-For classification data sets, the basic data characteristics are most
-often the number of classes, the number of attributes (and of these, how many
-are nominal and how many continuous), whether the data contains missing
-values, and the class distribution. Below is a script that reports all of
-this (:download:`data_characteristics.py <code/data_characteristics.py>`, :download:`adult_sample.tab <code/adult_sample.tab>`)::
-
-   import orange
-   data = orange.ExampleTable("adult_sample")
-   
-   # report on number of classes and attributes
-   print "Classes:", len(data.domain.classVar.values)
-   print "Attributes:", len(data.domain.attributes), ",",
-   
-   # count number of continuous and discrete attributes
-   ncont=0; ndisc=0
-   for a in data.domain.attributes:
-       if a.varType == orange.VarTypes.Discrete:
-           ndisc = ndisc + 1
-       else:
-           ncont = ncont + 1
-   print ncont, "continuous,", ndisc, "discrete"
-   
-   # obtain class distribution
-   c = [0] * len(data.domain.classVar.values)
-   for e in data:
-       c[int(e.getclass())] += 1
-   print "Instances: ", len(data), "total",
-   for i in range(len(data.domain.classVar.values)):
-       print ",", c[i], "with class", data.domain.classVar.values[i],
-   print
-
-The first part is one we know already: the script imports the
-Orange library into Python and loads the data. The information on the
-domain (class and attribute names, types, values, etc.) is stored in
-``data.domain``. Information on the class variable is accessible through the
-``data.domain.classVar`` object, which stores
-a vector of the class's values. Its length is obtained using the function
-``len()``. Similarly, the list of attributes is stored in
-``data.domain.attributes``. Notice that to obtain the information on the i-th
-attribute, this list can be indexed, e.g., ``data.domain.attributes[i]``.
-
-To count the number of continuous and discrete attributes, we
-first initialized two counters (``ncont``, ``ndisc``), and then iterated
-through the attributes (the variable ``a`` is an iteration variable that is in
-each loop associated with a single attribute). The field ``varType``
-contains the type of the attribute; for discrete attributes, ``varType``
-is equal to ``orange.VarTypes.Discrete``, and for continuous attributes it is
-equal to ``orange.VarTypes.Continuous``.
-
-To obtain the number of instances for each class, we first
-initialized a vector ``c`` of length equal to the number of
-different classes. Then, we iterated through the data;
-``e.getclass()`` returns the class of instance ``e``, which is
-turned into a class index (a number in the range from 0 to n-1,
-where n is the number of classes) and used as the index of the
-element of ``c`` that should be incremented.
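The same counting logic can be sketched with the standard library's ``collections.Counter`` on a plain list of class labels (no Orange objects assumed, and the labels below are a made-up miniature of the adult data):

```python
from collections import Counter

classes = ["<=50K", ">50K", "<=50K", "<=50K", ">50K", "<=50K"]
dist = Counter(classes)   # maps each class value to its count

print("Instances:", len(classes), "total,",
      ", ".join("%d with class %s" % (n, v) for v, n in sorted(dist.items())))
```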
-
-Throughout the code, notice that a print statement in Python prints
-whatever items follow it on the line. The items are
-separated with commas, and Python will by default put a blank between
-them when printing. It will also print a new line, unless the print
-statement ends with a comma. It is possible to use the print statement in
-Python with formatting directives, just like in C or C++, but this is
-beyond the scope of this text.
-
-Running the above script, we obtain the following output::
-
-   Classes: 2
-   Attributes: 14 , 6 continuous, 8 discrete
-   Instances:  977 total , 236 with class >50K , 741 with class <=50K
-
-If you would like the class distribution printed as a proportion of
-each class in the data set, then the last part of the script needs
-to be slightly changed. This time, we have used string formatting
-with print as well (part of :download:`data_characteristics2.py <code/data_characteristics2.py>`)::
-
-   # obtain class distribution
-   c = [0] * len(data.domain.classVar.values)
-   for e in data:
-       c[int(e.getclass())] += 1
-   print "Instances: ", len(data), "total",
-   r = [0.] * len(c)
-   for i in range(len(c)):
-       r[i] = c[i]*100./len(data)
-   for i in range(len(data.domain.classVar.values)):
-       print ", %d(%4.1f%s) with class %s" % (c[i], r[i], '%', data.domain.classVar.values[i]),
-   print
-
-The new script outputs the following information::
-
-   Classes: 2
-   Attributes: 14 , 6 continuous, 8 discrete
-   Instances:  977 total , 236(24.2%) with class >50K , 741(75.8%) with class <=50K
-
-As it turns out, there are more people who earn less than people
-who earn more... On a more technical side, such information may
-be important when you build your classifier; the base error for this
-data set is 1 - 0.758 = 0.242, and your constructed models should be
-better than this.
-
-Contingency matrix
-------------------
-
-.. index::
-   single: basic data exploration; class distribution
-
-Another interesting piece of information that we can obtain from the
-data is the distribution of classes for each value of a discrete
-attribute, and the mean for each continuous attribute (we will leave the
-computation of the standard deviation and other statistics to you). Let's
-compute the means of the continuous attributes first (part of :download:`data_characteristics3.py <code/data_characteristics3.py>`)::
-
-   print "Continuous attributes:"
-   for a in range(len(data.domain.attributes)):
-       if data.domain.attributes[a].varType == orange.VarTypes.Continuous:
-           d = 0.; n = 0
-           for e in data:
-               if not e[a].isSpecial():
-                   d += e[a]
-                   n += 1
-           print "  %s, mean=%3.2f" % (data.domain.attributes[a].name, d/n)
-
-This script iterates through the attributes (outer for loop), and for
-attributes that are continuous (first if statement) computes a sum
-over all instances. The single new trick the script uses is that it
-checks if an instance has a defined attribute value. Namely, for
-instance ``e`` and attribute ``a``, ``e[a].isSpecial()`` is true if
-the value is not defined (unknown). The variable ``n`` stores the number of
-instances with a defined value of the attribute. For our sampled adult data
-set, this part of the code outputs::
-
-   Continuous attributes:
-     age, mean=37.74
-     fnlwgt, mean=189344.06
-     education-num, mean=9.97
-     capital-gain, mean=1219.90
-     capital-loss, mean=99.49
-     hours-per-week, mean=40.27
-   
-For nominal attributes, we could now compose code that computes,
-for each attribute, how many times a specific value was used for each
-class. Instead, we use the built-in ``DomainContingency``, which
-does just that. All that our script has to do, then, is to print it
-out in a readable form (part of :download:`data_characteristics3.py <code/data_characteristics3.py>`)::
-
-   print "\nNominal attributes (contingency matrix for classes:", data.domain.classVar.values, ")"
-   cont = orange.DomainContingency(data)
-   for a in data.domain.attributes:
-       if a.varType == orange.VarTypes.Discrete:
-           print "  %s:" % a.name
-           for v in range(len(a.values)):
-               sum = 0
-               for cv in cont[a][v]:
-                   sum += cv
-               print "    %s, total %d, %s" % (a.values[v], sum, cont[a][v])
-           print
-
-Notice that the first part of this script is similar to the one that
-deals with continuous attributes, except that the for loop is a
-little simpler. With continuous attributes, the iterator in the
-loop was an attribute index, whereas in the script above we iterate
-through the members of ``data.domain.attributes``, which are objects that
-represent attributes. Data structures in Orange that are addressed
-by attribute can most often be indexed either by attribute index, by
-attribute name (string), or by the object that represents the attribute.
-
-The output of the code above is rather long (this data set has
-some attributes that have rather large sets of values), so we show
-only the output for two attributes::
-
-   Nominal attributes (contingency matrix for classes: <>50K, <=50K> )
-     workclass:
-       Private, total 729, <170.000, 559.000>
-       Self-emp-not-inc, total 62, <19.000, 43.000>
-       Self-emp-inc, total 22, <10.000, 12.000>
-       Federal-gov, total 27, <10.000, 17.000>
-       Local-gov, total 53, <14.000, 39.000>
-       State-gov, total 39, <10.000, 29.000>
-       Without-pay, total 1, <0.000, 1.000>
-       Never-worked, total 0, <0.000, 0.000>
-   
-     sex:
-       Female, total 330, <28.000, 302.000>
-       Male, total 647, <208.000, 439.000>
-
-First, notice that in these vectors the first number refers to the
-higher income and the second number to the lower income (e.g., from
-this data it looks like women earn less than men). Notice also that
-Orange outputs the tuples. To change this, we would need another loop
-that iterates through the members of the tuples. You may also foresee
-that it would be interesting to compute proportions rather than
-numbers of instances in the above contingency matrix, but we leave that as
-an exercise.
-
-Missing values
---------------
-
-.. index::
-   single: missing values; statistics
-
-It is often interesting to see, for a given attribute, what
-proportion of the instances have that attribute's value unknown. We have
-already learned that the function ``isSpecial()`` can be used to
-determine whether, for a specific instance and attribute, the value is not
-defined. Let us use this function to compute the proportion of missing
-values for each attribute (:download:`report_missing.py <code/report_missing.py>`)::
-
-   import orange
-   data = orange.ExampleTable("adult_sample")
-   
-   natt = len(data.domain.attributes)
-   missing = [0.] * natt
-   for i in data:
-       for j in range(natt):
-           if i[j].isSpecial():
-               missing[j] += 1
-   missing = map(lambda x, l=len(data):x/l*100., missing)
-   
-   print "Missing values per attribute:"
-   atts = data.domain.attributes
-   for i in range(natt):
-       print "  %5.1f%s %s" % (missing[i], '%', atts[i].name)
-
-The integer variable ``natt`` stores the number of attributes in the data set. The
-array ``missing`` stores the number of missing values per attribute;
-its size is therefore equal to ``natt``, and all of its elements are
-initially 0 (in fact, 0.0, since we purposely initialized it with real
-numbers, which helps later when converting counts to percentages).
-
-The only line that possibly looks (very?) strange is ``missing =
-map(lambda x, l=len(data):x/l*100., missing)``. This line could be
-replaced with a for loop, but we include it here to show how
-coding in Python may look strange, yet gain in
-conciseness. The function ``map`` takes a vector (in our case ``missing``) and
-executes a function on each of its elements, thus obtaining a new
-vector. The function it executes is in our case defined inline, and is
-called a lambda expression in Python. You can see that our lambda
-function takes a single argument (when mapped, an element of the vector
-``missing``), and returns its value normalized by the number of
-data instances (``len(data)``), multiplied by 100 to turn it into a
-percentage. Thus, the map function in fact normalizes the elements of
-``missing`` to express the proportion of missing values over the instances
-of the data set.
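The normalization step can be restated in plain Python (the counts below are made up); the explicit loop does exactly the same work as the ``map``/lambda one-liner:

```python
missing = [0, 44, 0, 19]   # hypothetical per-attribute missing-value counts
n = 977                    # hypothetical number of data instances

# one-liner: scale every count to a percentage of the data set size
percentages = list(map(lambda x: x / float(n) * 100.0, missing))

# the equivalent explicit loop
loop_percentages = []
for x in missing:
    loop_percentages.append(x / float(n) * 100.0)

print(["%.1f" % p for p in percentages])
```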
-
-Finally, let us see what the script we have just been working
-on outputs::
-
-   Missing values per attribute:
-       0.0% age
-       4.5% workclass
-       0.0% fnlwgt
-       0.0% education
-       0.0% education-num
-       0.0% marital-status
-       4.5% occupation
-       0.0% relationship
-       0.0% race
-       0.0% sex
-       0.0% capital-gain
-       0.0% capital-loss
-       0.0% hours-per-week
-       1.9% native-country
-
-In our sampled data set, just three attributes contain missing
-values.
-
-Distributions of feature values
--------------------------------
-
-For some of the tasks above, Orange provides a shortcut by means of the
-``orange.DomainDistributions`` function, which returns an object that
-holds averages and mean square errors for continuous attributes, value
-frequencies for discrete attributes, and, for both, the number of instances
-where the specific attribute has a missing value. The use of this object
-is exemplified in the following script (:download:`data_characteristics4.py <code/data_characteristics4.py>`)::
-
-   import orange
-   data = orange.ExampleTable("adult_sample")
-   dist = orange.DomainDistributions(data)
-   
-   print "Average values and mean square errors:"
-   for i in range(len(data.domain.attributes)):
-       if data.domain.attributes[i].varType == orange.VarTypes.Continuous:
-           print "%s, mean=%5.2f +- %5.2f" % \
-               (data.domain.attributes[i].name, dist[i].average(), dist[i].error())
-   
-   print "\nFrequencies for values of discrete attributes:"
-   for i in range(len(data.domain.attributes)):
-       a = data.domain.attributes[i]
-       if a.varType == orange.VarTypes.Discrete:
-           print "%s:" % a.name
-           for j in range(len(a.values)):
-               print "  %s: %d" % (a.values[j], int(dist[i][j]))
-   
-   print "\nNumber of items where attribute is not defined:"
-   for i in range(len(data.domain.attributes)):
-       a = data.domain.attributes[i]
-       print "  %2d %s" % (dist[i].unknowns, a.name)
-
-Check this script out. Its results should match the results we
-have derived with the other scripts in this lesson.

File docs/tutorial/rst/classification.rst

-Classification
-==============
-
-.. index:: classification
-.. index:: 
-   single: data mining; supervised
-
-Much of Orange is devoted to machine learning methods for classification, or supervised data mining. These methods rely on
-data with class-labeled instances, like that of senate voting. Here is code that loads this data set, displays the first data instance and shows its class (``republican``)::
-
-   >>> data = Orange.data.Table("voting")
-   >>> data[0]
-   ['n', 'y', 'n', 'y', 'y', 'y', 'n', 'n', 'n', 'y', '?', 'y', 'y', 'y', 'n', 'y', 'republican']
-   >>> data[0].get_class()
-   <orange.Value 'party'='republican'>
-
-Learners and Classifiers
-------------------------
-
-.. index::
-   single: classification; learner
-.. index::
-   single: classification; classifier
-.. index::
-   single: classification; naive Bayesian classifier
-
-Classification uses two types of objects: learners and classifiers. Learners consider class-labeled data and return a classifier. Given a data instance (a vector of feature values), classifiers return a predicted class::
-
-    >>> import Orange
-    >>> data = Orange.data.Table("voting")
-    >>> learner = Orange.classification.bayes.NaiveLearner()
-    >>> classifier = learner(data)
-    >>> classifier(data[0])
-    <orange.Value 'party'='republican'>
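The learner/classifier protocol itself is easy to sketch without Orange: a learner is a callable that takes labeled data and returns a classifier, which is itself a callable mapping an instance to a class. The toy majority-class learner below (our own example, with made-up data) illustrates only the shape of this protocol, not naive Bayes:

```python
from collections import Counter

def majority_learner(data):
    """A 'learner': takes (features, label) pairs, returns a classifier."""
    majority = Counter(label for _, label in data).most_common(1)[0][0]
    # the 'classifier': ignores the instance and predicts the majority class
    return lambda instance: majority

data = [([1, 0], "republican"), ([0, 1], "democrat"), ([1, 1], "republican")]
classifier = majority_learner(data)
print(classifier([0, 0]))   # prints republican
```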
-
-Above, we read the data, constructed a `naive Bayesian learner <http://en.wikipedia.org/wiki/Naive_Bayes_classifier>`_, gave it the data set to construct a classifier, and used it to predict the class of the first data item. We also use these concepts in the following code that predicts the classes of the first five instances in the data set:
-
-.. literalinclude:: code/classification-classifier1.py
-   :lines: 4-
-
-The script outputs::
-
-    republican; originally republican
-    republican; originally republican
-    republican; originally democrat
-      democrat; originally democrat
-      democrat; originally democrat
-
-The naive Bayesian classifier made a mistake on the third instance, but otherwise predicted correctly. No wonder, since this was also the data it trained from.
-
-Probabilistic Classification
-----------------------------
-
-To find out what probability the classifier assigns
-to, say, the democrat class, we need to call the classifier with an
-additional parameter that specifies the output type. If this is ``Orange.classification.Classifier.GetProbabilities``, the classifier will output class probabilities:
-
-.. literalinclude:: code/classification-classifier2.py
-   :lines: 4-
-
-The output of the script also shows how badly the naive Bayesian classifier missed the class for the third data item::
-
-   Probabilities for democrat:
-   0.000; originally republican
-   0.000; originally republican
-   0.005; originally democrat
-   0.998; originally democrat
-   0.957; originally democrat
-
-Cross-Validation
-----------------
-
-.. index:: cross-validation
-
-Validating the accuracy of classifiers on the training data, as we did above, serves demonstration purposes only. Any performance measure that assesses accuracy should be estimated on an independent test set. One such procedure is `cross-validation <http://en.wikipedia.org/wiki/Cross-validation_(statistics)>`_, which averages performance estimates across several runs, each time considering different training and test subsets sampled from the original data set:
-
-.. literalinclude:: code/classification-cv.py
-   :lines: 3-
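The fold bookkeeping behind cross-validation can be sketched in plain Python (this mirrors the idea, not the Orange evaluation API): assign each instance a fold index, then for each fold train on everything outside it and test on the held-out part.

```python
def cv_folds(n, k):
    """Yield (train_indices, test_indices) for k-fold cross-validation."""
    fold_of = [i % k for i in range(n)]   # a simple, unshuffled assignment
    for test_fold in range(k):
        train = [i for i in range(n) if fold_of[i] != test_fold]
        test = [i for i in range(n) if fold_of[i] == test_fold]
        yield train, test

for train, test in cv_folds(10, 5):
    assert not set(train) & set(test)               # disjoint subsets
    assert sorted(train + test) == list(range(10))  # together they cover the data
print("every instance is tested exactly once")
```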
-
-.. index::
-   single: classification; scoring
-.. index::
-   single: classification; area under ROC
-.. index::
-   single: classification; accuracy
-
-Cross-validation expects a list of learners. The performance estimators also return a list of scores, one for every learner. There was just one learner in the script above, hence a list of size one was used. The script estimates classification accuracy and area under the ROC curve. The latter score is very high, indicating very good performance of the naive Bayesian learner on the senate voting data set::
-
-   Accuracy: 0.90
-   AUC:      0.97
-
-
-Handful of Classifiers
-----------------------
-
-Orange includes a wide range of classification algorithms, including:
-
-- logistic regression (``Orange.classification.logreg``)
-- k-nearest neighbors (``Orange.classification.knn``)
-- support vector machines (``Orange.classification.svm``)
-- classification trees (``Orange.classification.tree``)
-- classification rules (``Orange.classification.rules``)
-
-Some of these are used in the code below, which estimates the probability of a target class on test data. This time, the training and test data sets are disjoint:
-
-.. index::
-   single: classification; logistic regression
-.. index::
-   single: classification; trees
-.. index::
-   single: classification; k-nearest neighbors
-
-.. literalinclude:: code/classification-other.py
-
-For these five data items, there are no major differences between the predictions of the observed classification algorithms::
-
-   Probabilities for republican:
-   original class  tree      k-NN      lr       
-   republican      0.949     1.000     1.000
-   republican      0.972     1.000     1.000
-   democrat        0.011     0.078     0.000
-   democrat        0.015     0.001     0.000
-   democrat        0.015     0.032     0.000
-
-The following code cross-validates several learners. Notice the difference between this and the code above: cross-validation requires learners, while in the script above the learners were immediately given the data and the calls returned classifiers.
-
-.. literalinclude:: code/classification-cv2.py
-
-Logistic regression wins in area under the ROC curve::
-
-            nbc  tree lr  
-   Accuracy 0.90 0.95 0.94
-   AUC      0.97 0.94 0.99
-
-Reporting on Classification Models
-----------------------------------
-
-Classification models are objects, exposing every component of their structure. For instance, one can traverse a classification tree in code and observe the associated data instances, probabilities and conditions. It is often sufficient, however, to provide a textual output of the model. For logistic regression and trees, this is illustrated in the script below:
-
-.. literalinclude:: code/classification-models.py
-
-The logistic regression part of the output is::
-
-   class attribute = survived
-   class values = <no, yes>
-
-         Feature       beta  st. error     wald Z          P OR=exp(beta)
-   
-       Intercept      -1.23       0.08     -15.15      -0.00
-    status=first       0.86       0.16       5.39       0.00       2.36
-   status=second      -0.16       0.18      -0.91       0.36       0.85
-    status=third      -0.92       0.15      -6.12       0.00       0.40
-       age=child       1.06       0.25       4.30       0.00       2.89
-      sex=female       2.42       0.14      17.04       0.00      11.25
-
-Trees can also be rendered in `dot <http://en.wikipedia.org/wiki/DOT_language>`_::
-
-   tree.dot(file_name="0.dot", node_shape="ellipse", leaf_shape="box")
-
-The following figure shows an example of such rendering.
-
-.. image:: files/tree.png
-   :alt: A graphical presentation of a classification tree

File docs/tutorial/rst/code/accuracy.py

-# Description: Learn a naive Bayesian classifier, and measure classification accuracy on the same data set
-# Category:    evaluation
-# Uses:        voting.tab
-# Referenced:  c_performance.htm
-
-import orange
-data = orange.ExampleTable("voting")
-classifier = orange.BayesLearner(data)
-
-# compute classification accuracy
-correct = 0.0
-for ex in data:
-    if classifier(ex) == ex.getclass():
-        correct += 1
-print "Classification accuracy:", correct/len(data)

File docs/tutorial/rst/code/accuracy2.py

-# Description: Set a number of learners, for each build a classifier from the data and determine classification accuracy
-# Category:    evaluation
-# Uses:        voting.tab
-# Referenced:  c_performance.htm
-
-import orange, orngTree
-
-def accuracy(test_data, classifiers):
-    correct = [0.0]*len(classifiers)
-    for ex in test_data:
-        for i in range(len(classifiers)):
-            if classifiers[i](ex) == ex.getclass():
-                correct[i] += 1
-    for i in range(len(correct)):
-        correct[i] = correct[i] / len(test_data)
-    return correct
-
-# set up the classifiers
-data = orange.ExampleTable("voting")
-bayes = orange.BayesLearner(data)
-bayes.name = "bayes"
-tree = orngTree.TreeLearner(data)
-tree.name = "tree"
-classifiers = [bayes, tree]
-
-# compute accuracies
-acc = accuracy(data, classifiers)
-print "Classification accuracies:"
-for i in range(len(classifiers)):
-    print classifiers[i].name, acc[i]

File docs/tutorial/rst/code/accuracy3.py

-# Category:    evaluation
-# Description: Set a number of learners, split data to train and test set, learn models from train set and estimate classification accuracy on the test set
-# Uses:        voting.tab
-# Classes:     MakeRandomIndices2
-# Referenced:  c_performance.htm
-
-import orange, orngTree
-
-def accuracy(test_data, classifiers):
-    correct = [0.0]*len(classifiers)
-    for ex in test_data:
-        for i in range(len(classifiers)):
-            if classifiers[i](ex) == ex.getclass():
-                correct[i] += 1
-    for i in range(len(correct)):
-        correct[i] = correct[i] / len(test_data)
-    return correct
-
-# set up the classifiers
-data = orange.ExampleTable("voting")
-selection = orange.MakeRandomIndices2(data, 0.5)
-train_data = data.select(selection, 0)
-test_data = data.select(selection, 1)
-
-bayes = orange.BayesLearner(train_data)
-tree = orngTree.TreeLearner(train_data)
-bayes.name = "bayes"
-tree.name = "tree"
-classifiers = [bayes, tree]
-
-# compute accuracies
-acc = accuracy(test_data, classifiers)
-print "Classification accuracies:"
-for i in range(len(classifiers)):
-    print classifiers[i].name, acc[i]
-

File docs/tutorial/rst/code/accuracy4.py

-# Description: Estimation of accuracy by random sampling.
-# User can set what proportion of data will be used in training.
-# Demonstration of use for different learners.
-# Category:   evaluation
-# Uses:        voting.tab
-# Classes:     MakeRandomIndices2
-# Referenced:  c_performance.htm
-
-import orange, orngTree
-
-def accuracy(test_data, classifiers):
-    correct = [0.0] * len(classifiers)
-    for ex in test_data:
-        for i in range(len(classifiers)):
-            if classifiers[i](ex) == ex.getclass():
-                correct[i] += 1
-    for i in range(len(correct)):
-        correct[i] = correct[i] / len(test_data)
-    return correct
-
-def test_rnd_sampling(data, learners, p=0.7, n=10):
-    acc = [0.0] * len(learners)
-    for i in range(n):
-        selection = orange.MakeRandomIndices2(data, p)
-        train_data = data.select(selection, 0)
-        test_data = data.select(selection, 1)
-        classifiers = []
-        for l in learners:
-            classifiers.append(l(train_data))
-        acc1 = accuracy(test_data, classifiers)
-        print "%d: %s" % (i + 1, ["%.6f" % a for a in acc1])
-        for j in range(len(learners)):
-            acc[j] += acc1[j]
-    for j in range(len(learners)):
-        acc[j] = acc[j] / n
-    return acc
-
-orange.setrandseed(0)
-# set up the learners
-bayes = orange.BayesLearner()
-tree = orngTree.TreeLearner()
-#tree = orngTree.TreeLearner(mForPruning=2)
-bayes.name = "bayes"
-tree.name = "tree"
-learners = [bayes, tree]
-
-# compute accuracies on data
-data = orange.ExampleTable("voting")
-acc = test_rnd_sampling(data, learners)
-print "Classification accuracies:"
-for i in range(len(learners)):
-    print learners[i].name, acc[i]

File docs/tutorial/rst/code/accuracy5.py

-# Category:    evaluation
-# Description: Estimation of accuracy by cross validation. Demonstration of use for different learners.
-# Uses:        voting.tab
-# Classes:     MakeRandomIndicesCV
-# Referenced:  c_performance.htm
-
-import orange, orngTree
-
-def accuracy(test_data, classifiers):
-    correct = [0.0] * len(classifiers)
-    for ex in test_data:
-        for i in range(len(classifiers)):
-            if classifiers[i](ex) == ex.getclass():
-                correct[i] += 1
-    for i in range(len(correct)):
-        correct[i] = correct[i] / len(test_data)
-    return correct
-
-def cross_validation(data, learners, k=10):
-    acc = [0.0] * len(learners)
-    selection = orange.MakeRandomIndicesCV(data, folds=k)
-    for test_fold in range(k):
-        train_data = data.select(selection, test_fold, negate=1)
-        test_data = data.select(selection, test_fold)
-        classifiers = []
-        for l in learners:
-            classifiers.append(l(train_data))
-        acc1 = accuracy(test_data, classifiers)
-        print "%d: %s" % (test_fold + 1, ["%.6f" % a for a in acc1])
-        for j in range(len(learners)):
-            acc[j] += acc1[j]
-    for j in range(len(learners)):
-        acc[j] = acc[j] / k
-    return acc
-
-orange.setrandseed(0)
-# set up the learners
-bayes = orange.BayesLearner()
-tree = orngTree.TreeLearner(mForPruning=2)
-
-bayes.name = "bayes"
-tree.name = "tree"
-learners = [bayes, tree]
-
-# compute accuracies on data
-data = orange.ExampleTable("voting")
-acc = cross_validation(data, learners, k=10)
-print "Classification accuracies:"
-for i in range(len(learners)):
-    print learners[i].name, acc[i]

File docs/tutorial/rst/code/accuracy6.py

-# Description: Leave-one-out method for estimation of classification accuracy. Demonstration of use for different learners
-# Category:    evaluation
-# Uses:        voting.tab
-# Referenced:  c_performance.htm
-
-import orange, orngTree
-
-def leave_one_out(data, learners):
-    acc = [0.0]*len(learners)
-    selection = [1] * len(data)
-    last = 0
-    for i in range(len(data)):
-        print 'leave-one-out: %d of %d' % (i + 1, len(data))
-        selection[last] = 1
-        selection[i] = 0
-        train_data = data.select(selection, 1)
-        for j in range(len(learners)):
-            classifier = learners[j](train_data)
-            if classifier(data[i]) == data[i].getclass():
-                acc[j] += 1
-        last = i
-
-    for j in range(len(learners)):
-        acc[j] = acc[j]/len(data)
-    return acc
-
-orange.setrandseed(0)
-# set up the learners
-bayes = orange.BayesLearner()
-tree = orngTree.TreeLearner(minExamples=10, mForPruning=2)
-bayes.name = "bayes"
-tree.name = "tree"
-learners = [bayes, tree]
-
-# compute accuracies on data
-data = orange.ExampleTable("voting")
-acc = leave_one_out(data, learners)
-print "Classification accuracies:"
-for i in range(len(learners)):
-    print learners[i].name, acc[i]

File docs/tutorial/rst/code/accuracy7.py

-# Description: Demonstration of use of cross-validation as provided in orngTest module
-# Category:    evaluation
-# Uses:        voting.tab
-# Classes:     orngTest.crossValidation
-# Referenced:  c_performance.htm
-
-import orange, orngTest, orngStat, orngTree
-
-# set up the learners
-bayes = orange.BayesLearner()
-tree = orngTree.TreeLearner(mForPruning=2)
-bayes.name = "bayes"
-tree.name = "tree"
-learners = [bayes, tree]
-
-# compute accuracies on data
-data = orange.ExampleTable("voting")
-results = orngTest.crossValidation(learners, data, folds=10)
-
-# output the results
-print "Learner  CA     IS     Brier    AUC"
-for i in range(len(learners)):
-    print "%-8s %5.3f  %5.3f  %5.3f  %5.3f" % (learners[i].name, \
-        orngStat.CA(results)[i], orngStat.IS(results)[i],
-        orngStat.BrierScore(results)[i], orngStat.AUC(results)[i])

File docs/tutorial/rst/code/accuracy8.py

-# Description: Demonstration of use of cross-validation as provided in orngTest module
-# Category:    evaluation
-# Uses:        voting.tab
-# Classes:     orngTest.crossValidation
-# Referenced:  c_performance.htm
-
-import orange
-import orngTest, orngStat, orngTree
-
-# set up the learners
-bayes = orange.BayesLearner()
-tree = orngTree.TreeLearner(mForPruning=2)
-bayes.name = "bayes"
-tree.name = "tree"
-learners = [bayes, tree]
-
-# compute accuracies on data
-data = orange.ExampleTable("voting")
-res = orngTest.crossValidation(learners, data, folds=10)
-cm = orngStat.computeConfusionMatrices(res,
-        classIndex=data.domain.classVar.values.index('democrat'))
-
-stat = (('CA', lambda res,cm: orngStat.CA(res)),
-        ('Sens', lambda res,cm: orngStat.sens(cm)),
-        ('Spec', lambda res,cm: orngStat.spec(cm)),
-        ('AUC', lambda res,cm: orngStat.AUC(res)),
-        ('IS', lambda res,cm: orngStat.IS(res)),
-        ('Brier', lambda res,cm: orngStat.BrierScore(res)),
-        ('F1', lambda res,cm: orngStat.F1(cm)),
-        ('F2', lambda res,cm: orngStat.Falpha(cm, alpha=2.0)),
-        ('MCC', lambda res,cm: orngStat.MCC(cm)),
-        ('sPi', lambda res,cm: orngStat.scottsPi(cm)),
-        )
-
-scores = [s[1](res,cm) for s in stat]
-print
-print "Learner  " + "".join(["%-7s" % s[0] for s in stat])
-for (i, l) in enumerate(learners):
-    print "%-8s " % l.name + "".join(["%5.3f  " % s[i] for s in scores])