Introduction

The present page is a special tutorial dedicated to Machine Learning with the Lea library. It assumes that you have installed Lea 3 and that you are at least familiar with the techniques presented in the first part of the Lea tutorial.

So far, most of the examples shown demonstrate various ways to calculate probability distributions from known probability distributions. However, apart from the trivial cases of uniform distributions found in gambling, it can be difficult to determine prior probability distributions for a given phenomenon. If no quantitative data are available for the phenomenon at hand, you have to make assumptions based on some theoretical model, provided it is not too complex. On the other hand, if quantitative data are available, then it is possible, with dedicated algorithms, to calculate prior probability distributions that tend to fit the given data. This latter approach belongs to the field of Machine Learning (ML).

In the present page, we shall present a couple of techniques to "learn" probabilities from observed data. There are four sections:

  • Frequentist inference: basic techniques using observed frequencies and Markov chains
  • Bayesian learning: basic Bayesian reasoning to update probability distributions from observed data
  • Maximum-likelihood algorithm: learning parameters of a BN where dependencies are defined, without hidden variables
  • EM algorithm: learning parameters of a probabilistic model, with possibly hidden variables

These sections are mostly independent: you may jump directly to any one of them.

Frequentist inference

Actually, we have already presented a very basic ML approach: counting occurrences of a given phenomenon and deriving probabilities from the calculated frequencies is indeed a way to "learn" from the data.

For instance, frequencies of English letters can be learned from text data:

import lea

lea.set_prob_type('r')
text_data = "to be or not to be: that is the question"
freq1 = lea.vals(*text_data)
print (freq1)
# ->   : 9/40
#    : : 1/40
#    a : 1/40
#    b : 2/40
#    e : 4/40
#    h : 2/40
#    i : 2/40
#    n : 2/40
#    o : 5/40
#    q : 1/40
#    r : 1/40
#    s : 2/40
#    t : 7/40
#    u : 1/40 
Generally speaking, the larger the data set, the more sensible the result will be (this is the assumption behind "Big Data"!). We can now easily produce random sample strings from that distribution.
print (''.join(freq1.random(100)))
# -> 'oshheeo a ttt  ubt  ee o      ontettisboe t: hb nis    t bhoo nee bs o ttahsbhtne ttoa  o  oth   ete'
We get something silly, although it tends to respect letter/punctuation frequencies.

Frequencies of groups of two letters can also be calculated easily, using the pairwise function (see its implementation in Python's "Itertools Recipes").
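For convenience, here is that classic recipe (on Python 3.10+, you may simply use itertools.pairwise instead):

from itertools import tee

def pairwise(iterable):
    # pairwise('to be') -> ('t','o'), ('o',' '), (' ','b'), ('b','e')
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)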

freq2 = lea.vals(*pairwise(text_data))
Then, you could examine the frequencies of transitions from a 't':
print (freq2.given(freq2[0] == 't'))
# -> ('t', ' ') : 2/7
#    ('t', 'h') : 2/7
#    ('t', 'i') : 1/7
#    ('t', 'o') : 2/7
... or to a 't':
print (freq2.given(freq2[1] == 't'))
# -> (' ', 't') : 3/6
#    ('a', 't') : 1/6
#    ('o', 't') : 1/6
#    ('s', 't') : 1/6
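Conditional transition probabilities can also be derived from freq2, by combining indexing with given. For instance, this small additional query (reusing only constructs shown above) should give 2/7 for the probability that a 't' is followed by an 'h', the same value that will appear in the Markov chain built below:
print (lea.P((freq2[1] == 'h').given(freq2[0] == 't')))
# -> 2/7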
To produce random samples again, we first have to transform the tuple pairs into 2-letter strings (e.g. ('t', 'h') -> 'th'). Here is the trick to do that:
freq2s = freq2.map(lambda v: ''.join(v))
Then we are able to produce a random sample as before:
print (''.join(freq2s.random(50)))
# -> 't  b ns thuest: ue t nhais tt tht  tth bhaquioio t t tate: bnoe:uebeo :  t: r e heorbes titht e:uet '
This is still silly, but you can see that things have improved a bit: for instance, there are no longer runs of more than two whitespaces or more than two 't's.

A better approach to study transitions between letters is to build a Markov chain from the given data, as already explained here:

from lea import markov
mc = markov.chain_from_seq(text_data)
print (mc)
# -> 
#      -> b : 2/9
#      -> i : 1/9
#      -> n : 1/9
#      -> o : 1/9
#      -> q : 1/9
#      -> t : 3/9
#    :
#      ->   : 1
#    a
#      -> t : 1
#    b
#      -> e : 1
#    e
#      ->   : 2/4
#      -> : : 1/4
#      -> s : 1/4
#    h
#      -> a : 1/2
#      -> e : 1/2
#    i
#      -> o : 1/2
#      -> s : 1/2
#    n
#      -> o : 1
#    o
#      ->   : 2/5
#      -> n : 1/5
#      -> r : 1/5
#      -> t : 1/5
#    q
#      -> u : 1
#    r
#      ->   : 1
#    s
#      ->   : 1/2
#      -> t : 1/2
#    t
#      ->   : 2/7
#      -> h : 2/7
#      -> i : 1/7
#      -> o : 2/7
#    u
#      -> e : 1
Note the difference from the previous case: freq2 captures the probabilities of two consecutive letters (e.g. P('es') = 1/39), while mc captures the conditional probabilities of transitions from one letter to another (e.g. P('s'|'e') = 1/4). The two approaches are thus very different. Let's now generate a random string from this Markov chain (we arbitrarily start from an 'a'):
print (''.join(mc.get_state('a').random_seq(100)))
# -> 'thato be be tistis nonononor is nor the: or t tiot to que be that thest to t t no tonor que no be qu'
You can see that this is quite different from what was produced before; it now reads more like a baby babbling. The system modestly learned how to babble from the Shakespearean sentence that we gave it. We see, in particular, that it knows that a colon is always followed by a whitespace; we also see that, due to the small size of the input text sample, it suffers from over-generalization, such as following any 'b' by an 'e', etc.

Should you like that kind of linguistic experience, then you are kindly invited to have a look at the Bullshit Generator on the Examples page.

Bayesian learning

Besides the frequentist approach seen above, Bayesian reasoning can also be used for machine learning. Let's consider the "surprise candy" example given in pp. 712-715 of the book Artificial Intelligence: A Modern Approach (2nd edition), by S. Russell and P. Norvig. We'll merely provide a summary of the case, focusing on its translation into Lea. You're invited to read the explanations given in the book to get a thorough understanding.

There are five bags of candies. A bag is chosen randomly and then a sample of several candies is drawn randomly from this bag. The candy's flavor is either cherry or lime. The distribution of flavors in a bag depends on the bag in a known way (see below). Although the chosen bag is unknown, the flavor of the drawn candies can be tasted. In this situation, the chosen bag is an unknown variable while the flavor is an observed variable. Following the AIMA book, we'll interpret the chosen bag as a hypothesis that can be any of the five identifiers h1, h2, h3, h4 and h5. Here is the model in Lea, based on data given by the candy manufacturer:

hyp_0 = lea.pmf({'h1': 0.1, 'h2': 0.2, 'h3': 0.4, 'h4': 0.2, 'h5': 0.1})
flavor_by_hyp = { 'h1': 'cherry',
                  'h2': lea.pmf({'cherry': 0.75, 'lime': 0.25}),
                  'h3': lea.pmf({'cherry': 0.50, 'lime': 0.50}),
                  'h4': lea.pmf({'cherry': 0.25, 'lime': 0.75}),
                  'h5': 'lime'                                  }
flavor = hyp_0.switch(flavor_by_hyp)
hyp_0 is the a priori probability distribution of the hypothesis, considering that there is no evidence yet. flavor is defined as a CPT depending on hyp_0.
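As a quick check, you can display the marginal flavor distribution implied by this CPT before any observation; weighting each bag's flavor distribution by the bag's probability gives 0.5 for each flavor (the displayed values may show a tiny floating-point rounding):
print (flavor)
# -> cherry : 0.5
#    lime   : 0.5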

Let's now draw 10 candies and assume that they all have lime flavor. On the basis of these observations, what is the evolution of the probability of hypothesis h5 (the bag with lime candies only)? We expect that this probability, initially 0.1, will grow as new lime candies are drawn. Let's calculate this with Lea.

A first approach consists in adding observations one by one in a list of evidences and evaluating the conditional probability of h5, given these evidences:

WARNING: requires Lea 3.2 at least (clone method)

P = lea.P
evidences = []
for n in range(11):
   print ("%2d: %f" % (n,P((hyp_0=='h5').given(*evidences))))
   flavor_x = flavor.clone(shared=(hyp_0,))
   evidences.append(flavor_x=='lime')
# -> 0: 0.100000
#    1: 0.200000
#    2: 0.307692
#    3: 0.421053
#    4: 0.528926
#    5: 0.624390
#    6: 0.704749
#    7: 0.770214
#    8: 0.822449
#    9: 0.863566
#   10: 0.895628
We have used the clone method to create new events that are independent of each other. The argument shared=(hyp_0,) expresses that all these events share a dependency on the same hypothesis; in other words, all drawn candies come from the same bag. Also, based on this sample of lime candies, we can easily get the probability that the next drawn candy also has lime flavor:
print (P((flavor=='lime').given(*evidences)))
# -> 0.9796631050967436
If you are just interested in the probabilities after all observations have been made, you may use a simpler technique: the 10 events can be created all at once by adding the argument n=10 to the clone method.
flavors = flavor.clone(n=10,shared=(hyp_0,))
evidences = [flavor_x=='lime' for flavor_x in flavors]
print (hyp_0.given(*evidences))
# -> h2 : 1.7082745402179077e-06
#    h3 : 0.003498546258366275
#    h4 : 0.10087190332532722
#    h5 : 0.8956278421417664
print (P((flavor=='lime').given(*evidences)))
# -> 0.9730314698335799
Before ending this section, let's mention that Lea supports another approach for calculating these results. Actually, after each individual observation, we have a new piece of evidence that allows us to update the probability distribution of the hypotheses. Above, we created independent events (flavor_x), all depending on the same initial hypothesis distribution (hyp_0). Instead of that, we can build these events one by one, each based on the hypothesis distribution updated with the latest observations. Here is how to proceed:
hyp_x = hyp_0
for n in range(11):
   print ("%2d: %f"%(n,P((hyp_x=='h5'))))
   flavor_x = hyp_x.switch(flavor_by_hyp)
   hyp_x = hyp_x.given(flavor_x=='lime').calc()
# -> 0: 0.100000
#    1: 0.200000
#    2: 0.307692
#    3: 0.421053
#    4: 0.528926
#    5: 0.624390
#    6: 0.704749
#    7: 0.770214
#    8: 0.822449
#    9: 0.863566
#   10: 0.895628
We see that we get the very same results as before. Note the usage of calc() to calculate the new hypothesis pmf while getting rid of any dependency on the previous hypothesis distribution. To get the probability of lime flavor for the next draw, we can simply use flavor_x, without calling .given(...), since this variable is based on the up-to-date hypothesis distribution:
print (P((flavor_x=='lime')))
# -> 0.9730314698335799
Again, this result is in line with the conditional probability calculated above.

Maximum-likelihood algorithm for BN

If you reconsider the examples of data mining on a joint table, now interpreting the frequencies as probabilities, then you get another example of ML through the frequentist approach. Admittedly, such a simple statistical approach could be qualified as "Machine Learning for the layman"! A more difficult goal is to calculate the probabilities involved in a Bayesian network. Lea provides a method for this purpose; it can calculate the CPT entries provided that you define the dependency graph of the BN. This method, named Lea.build_bn_from_joint, makes the following assumptions:

  1. you study a given set of N discrete random variables, which are possibly interdependent and for which the pmf are unknown;
  2. you have a joint table with S samples giving the values of each of the N variables (i.e. a table of S rows and N columns); we assume that these data are rich enough to cover all values of the domain of each random variable;
  3. you have defined the dependencies of the BN, by means of a directed acyclic graph linking the random variables.

Note that the second point means that there shall be no hidden variable. For dealing with probabilistic models having hidden variables, a special algorithm is required (see the EM algorithm section below).

Let's illustrate these assumptions on an example reusing the patient joint table:

patient = lea.read_csv_file('patients.csv',col_names=('given_name','surname','gender','title','birthday','blood_type','weight{f}','height{f}','smoker{b}') )
Let's now consider the three assumptions above one by one:

  1. we shall focus on the five random variables gender, blood_type, smoker, height and weight (N = 5);
  2. these five variables are present in the patient joint table, which contains fifty records (S = 50);
  3. we assume a BN with the following dependencies between the five random variables:

patient_bn1.png

The build_bn_from_joint method allows "learning" the different CPTs of this BN. It uses a maximum-likelihood algorithm, calculating each CPT from the observed data (here, the patient joint table). Each argument of Lea.build_bn_from_joint is a tuple giving the names of the antecedent nodes and the name of the dependent node:

bn = patient.build_bn_from_joint(
       ( ('gender','blood_type')     , 'smoker' ),
       ( ('gender','height','smoker'), 'weight' ))
Now, bn is a full-fledged BN on which you can make queries, such as calculating the gender probability of a person given that their weight is at least 110 and below 120:
print (bn.gender.given(110 <= bn.weight, bn.weight < 120))
# -> female : 0.21234352561454817
#    male   : 0.7876564743854518
print (P((bn.gender=='male').given(110 <= bn.weight, bn.weight < 120)))
# -> 0.7876564743854518
You may wonder: what is the gain of building a BN, when the initial joint table already offers the ability to make the same query? There are two answers to this question. The first one concerns the data size: the space required to store the BN data is far smaller than the space required to store the full joint table; to some extent, the BN compresses the joint table. The second answer is that the joint table stores the actual raw data, which entails a phenomenon known as overfitting, while the BN captures the dependency patterns. The overfitting is blatant if we redo the query seen above on the patient joint table instead of the built-up BN:
print (P((patient.gender=='male').given(110 <= patient.weight, patient.weight < 120)))
# -> 1.0
Here, with the joint table, the given evidence on the patient's weight makes the gender certain. This result is dubious but it is in line with the joint table:
print (patient.given(110 <= patient.weight, patient.weight < 120))
# -> given_name, surname   , gender, title, birthday    , blood_type, weight, height, smoker
#    ('Jimmie' , 'Martinez', 'male', 'Mr.', '03-12-1930', 'B+'      ,  114.0,  171.0, False ) : 1.0
There is only one patient in the given weight range, who happens to be a man. This is a trivial instance of overfitting. In the present case, the BN gives a more nuanced answer, which sounds more sensible.

Improving the Bayesian network

Of course, the merits of a BN are directly linked to the quality of the dependency graph designed by the human expert who models the phenomenon. In this regard, the BN graph given above could certainly be improved. To illustrate some advanced techniques, let us revisit the BN by including the patient's age: it seems sensible to add age as an influencer of height, as shown in the following graph:

patient_bn2.png

Now, the problem is that age is not an attribute of the patient joint table. Only birthday is available, which is a string, hence not adequate for a BN variable. As seen previously, assuming the year of the study is 2015, the age can be approximated with the following definition:

age = 2015 - patient.birthday[-4:].map(int)
Based on this definition, the joint table can now be augmented with the new column. Since each patient row is a named tuple, the trick is to convert age into a 1-tuple through the map method, to append it to the patient tuples and then to use the as_joint method to redefine the column names. This can be done with the following one-liner:
patient2 = (patient + age.map(lambda x: (x,))).as_joint('given_name','surname','gender','title','birthday','blood_type','weight','height','smoker','age')
You can check that each row of the patient2 joint table contains a new column age, which is in line with birthday. Now that we have all the data needed, we can build the new BN, as sketched above:
bn2 = patient2.build_bn_from_joint(
        ( ('gender','age')            , 'height'),
        ( ('gender','blood_type')     , 'smoker'),
        ( ('gender','height','smoker'), 'weight'))
This new BN gives different results from the previous one, which ignored the dependency on age:
print (P((bn2.gender=='male').given(110 <= bn2.weight, bn2.weight < 120)))
# -> 0.8397314844204974
We can then see the impact of making height depend on age, even though these two variables are absent from the query.
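Since age is now a variable of the BN, it can also appear explicitly in queries. For instance, the sketch below restricts the previous query to patients aged 60 or more (the result depends on your patients.csv data, so no output is shown here):
print (P((bn2.gender=='male').given(bn2.age >= 60, 110 <= bn2.weight, bn2.weight < 120)))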

The EM Algorithm

WARNING: requires Lea 3.2 at least

So far, we have assumed that all variables of our probabilistic model are observable. In many cases however, models have hidden variables: these cannot be observed directly but they influence other variables, which are observable. The maximum-likelihood algorithm used in the previous section is unable to learn parameters of such models.

For machine learning on such models with hidden variables, Lea provides an implementation of the Expectation-Maximization (EM) algorithm: the learn_by_em method. This method is powerful but also quite hard to master; in particular, it provides many optional arguments. Before giving examples of usage, let's sketch the general process of using the EM algorithm (without detailing how it works internally):

  1. You get some observation data obs_x, by means of a Lea instance representing a frequency table. If several variables are involved, you shall use a joint probability table.
  2. You define a probabilistic model x0 that is capable of producing the observed data obs_x, without yet considering the fitting of probabilities; this model x0 is a Lea instance built with the techniques seen so far (CPT, BN, arithmetic, functions, etc.). It involves several random variables v1, ..., vn, observed or hidden, with arbitrary parameters (or best guesses).
  3. You call x0.learn_by_em(obs_x, ...) with some extra arguments (see later).
  4. You obtain a new model x1, which has the same variables and structure as x0 but with new parameters that better match the observed data obs_x.

We shall demonstrate this process on three use cases. Note that the last two can be found in the literature, so you can cross-check the calculated results.

Case 1 - Learning Bernoulli probability distributions

Let's start with a basic example. Ron is an experimenter in some Las Vegas laboratory. He has patiently repeated the following experiment 10,000 times: throwing two coins and keeping track of the number of heads obtained (0, 1 or 2). Eventually, he gets the following counts:

0: 5,558
1: 3,850
2:   592
After some thinking, Ron understands that at least one of the coins seems to be biased. Unfortunately, he didn't keep track of the individual coin results. The question is now: how can you help Ron find the head probability of each coin?

The probabilistic model is:

X = A + B

where X is the random variable giving the observed count while A and B are random variables giving 0 (Tail) or 1 (Head) for each coin. A and B are hidden variables, with unknown a priori parameters.

em_1.png

So, the EM algorithm is the right tool here! Let's first define the observed frequency distribution:

obs_x = lea.pmf({0: 5558, 1: 3850, 2: 592})
print(obs_x)
# -> 0 : 0.5558
#    1 : 0.385
#    2 : 0.0592
Now, let's define our initial probabilistic model using Bernoulli probability distributions, with arbitrary parameters:
a0 = lea.bernoulli(0.5)
b0 = lea.bernoulli(0.4)
x0 = a0 + b0
We use a "0" suffix to recall that this model is just a first guess, setting initial values for the parameters (here probabilities 0.5 and 0.4). Now, we are almost ready to execute the EM algorithm by calling learn_by_em. Because this algorithm is iterative, hopefully converging to the solution, you need to define a halting condition. Several conditions can be expressed by means of optional arguments; we shall use here the simplest one, fixing the number of algorithm steps to 100. Here is the call:
model_dict = x0.learn_by_em(obs_x,nb_steps=100)
The returned model_dict is a dictionary giving access to the variables of the new probabilistic model. Since the probabilistic model is made up of several variables, the dictionary provides a look-up table to retrieve each variable of the new model (remember that, by design, each Lea instance is immutable, so the internal parameters of a0, b0 and x0 cannot be altered). Here is how you get the variables of the optimized model:
a1 = model_dict[a0]
b1 = model_dict[b0]
x1 = model_dict[x0]
Note that, in the present simple additive model, x1 could also be defined as a1 + b1, which is equivalent. Here are the probability distributions of these variables:
print (a1)
# -> 0 : 0.6845898600927648
#    1 : 0.31541013990723527
print (b1)
# -> 0 : 0.8120101399072353
#    1 : 0.1879898600927648
print (x1)
# -> 0 : 0.5558939080730004
#    1 : 0.38481218385399896
#    2 : 0.059293908073000515
You can verify that x1 is close to obs_x, as expected. Here is how you may interpret these results: given the observations, the most likely hypothesis is that one coin has a 31.5% chance of showing heads and the other coin an 18.8% chance. Since you now have access to all variables of the new model, whether observed or hidden, you may make queries to explain observed results, to make forecasts, etc. For instance:
print (lea.joint(a1,b1).given(x1==1))
# -> (0, 1) : 0.33443835049825915
#    (1, 0) : 0.6655616495017408
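Similarly, you can use the learned model for forecasting. For instance, here is a sketch of a query giving the probability of getting at least one head in a new throw of the two coins; from the x1 distribution above, the value should be about 0.444, i.e. 1 - P(x1 == 0):
print (lea.P(x1 >= 1))
# -> ~0.4441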

Case 2 - Learning binomial probability distributions

We shall now see a more elaborate example, which introduces other arguments of the learn_by_em method. This example is detailed in a course by Stefanos Zafeiriou on Advanced Statistical Machine Learning.

We have two coins, A and B, possibly biased. We want to determine the bias of each coin. We randomly choose one of the coins 5 times and, for each choice, we toss the chosen coin 10 times and note the number of heads obtained. Each value can then go from 0 to 10 heads. The observed results are: 5, 9, 8, 4, 7. We assume that the two coins cannot be distinguished, so the choice made for each experiment is a hidden random variable. Let's first build the frequency table of the observed numbers of heads:

obs_nb_heads = lea.vals(5,9,8,4,7)
Then, let's build an initial model. The result obtained for each experiment can be modeled as a binomial probability distribution with n = 10 and an arbitrary bias probability:
ha0 = 0.6
hb0 = 0.5
nb_heads_A0 = lea.binom(10, ha0)
nb_heads_B0 = lea.binom(10, hb0)
We also assume that, for each of the 5 experiments, the coin has been chosen fairly, with 50-50 chances (although this is not specified in the course above!).

em_2.png

coin_type0 = lea.vals("A","B")
nb_heads0 = coin_type0.switch( { "A" : nb_heads_A0,
                                 "B" : nb_heads_B0})
We are now ready to execute the EM algorithm. To be able to check the results against the ones given in the ML course (see above), we shall execute just one step here.
model_dict = nb_heads0.learn_by_em(obs_nb_heads, fixed_vars=(coin_type0,), nb_steps=1)
The call is similar to the one given in the previous example. The sole difference is the introduction of the fixed_vars argument. By default, the learn_by_em method optimizes all variables of the given model. Here, because we control the process of selecting the coin (50-50 probabilities), we want the EM algorithm to keep this variable unchanged in the new model; fixed_vars=(coin_type0,) expresses this constraint. The argument is any iterable, so in a more complex model several variables could be declared as fixed.

Now, we can display the new bias probability for each of the coins:

nb_heads_A1 = model_dict[nb_heads_A0]
nb_heads_B1 = model_dict[nb_heads_B0]
print (nb_heads_A1.p)
# -> 0.7130122354005161
print (nb_heads_B1.p)
# -> 0.5813393083136627
These values can be cross-checked with the aforementioned course.
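Of course, nothing prevents you from iterating further than this single cross-checking step. Here is a sketch running 100 steps with the same arguments (the converged bias values are not reproduced here):
model_dict = nb_heads0.learn_by_em(obs_nb_heads, fixed_vars=(coin_type0,), nb_steps=100)
print (model_dict[nb_heads_A0].p)
print (model_dict[nb_heads_B0].p)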

Case 3 - Learning Bayesian network

The last example shows how to learn a simple Bayesian network. It's taken from the AIMA book, pp. 727-730. We'll merely provide a summary of the case, focusing on its translation into Lea. You're invited to read the explanations given in the book to get a thorough understanding.

Two bags containing candies have been mixed together. Candies are described by three features: wrapper (red/green), hole (true/false) and flavor (cherry/lime). A sample of 1,000 candies is drawn and the occurrences of the eight combinations of features are counted. Here are the raw data obtained, expressed as a Lea JPD:

obs_candy = lea.pmf({ ('red'  ,True,'cherry'): 273, ('red'  ,False,'cherry'):  93,
                      ('green',True,'cherry'): 104, ('green',False,'cherry'):  90,
                      ('red'  ,True,'lime'  ):  79, ('red'  ,False,'lime'  ): 100,
                      ('green',True,'lime'  ):  94, ('green',False,'lime'  ): 167})
For defining a probabilistic model that is meant to produce such data, we shall assume that

  1. the three candy features are independent given the origin bag,
  2. the conditional probability distribution for each feature depends on the bag.

em_3.png

Such assumptions actually define a so-called naive Bayes model. Here is the initial model, using the arbitrary parameter values presented in the AIMA book (refer to the Bayesian Reasoning section of the present tutorial).

Note: w stands for "wrapper", h for "hole", f for "flavor", wb1 for "wrapper if the candy comes from bag 1", etc.

bag0 = lea.pmf({1: 0.6, 2: 0.4})
wb1_0 = lea.pmf({'red': 0.6, 'green': 0.4})
wb2_0 = lea.pmf({'red': 0.4, 'green': 0.6})
w0 = bag0.switch({1: wb1_0, 2: wb2_0})
hb1_0 = lea.event(0.6)
hb2_0 = lea.event(0.4)
h0 = bag0.switch({1: hb1_0, 2: hb2_0})
fb1_0 = lea.pmf({'cherry': 0.6, 'lime': 0.4})
fb2_0 = lea.pmf({'cherry': 0.4, 'lime': 0.6})
f0 = bag0.switch({1: fb1_0, 2: fb2_0})
candy0 = lea.joint(w0,h0,f0)
At this stage, candy0 is a probability distribution comparable to obs_candy, associating a probability with each of the eight types of candy (you may display the two distributions to convince yourself). To evaluate how well the model fits the given data, there are several possible indicators, which are closely interlinked: the cross-entropy, the log-likelihood and the Kullback-Leibler (KL) divergence (see here for more details).
import math
print (obs_candy.cross_entropy(candy0))
# -> 2.949244290266771
print (-1000 * obs_candy.cross_entropy(candy0) * math.log(2))
# -> -2044.260364580929
print (obs_candy.kl_divergence(candy0))
# -> 0.09363125084952317
Note that the second expression calculates the log-likelihood; the value -2044 can be cross-checked with the AIMA book (the log(2) factor converts from base 2 to the natural logarithm). Whatever the indicator, the smaller its absolute value, the better the model explains the observed data. Even if the indicators are roughly equivalent, we shall prefer the KL divergence here because it has the merit of tending to zero as the model better fits the data. You can convince yourself of this fact by comparing the observed data with themselves, i.e. taking them as their own model:
print (obs_candy.kl_divergence(obs_candy))
# -> 0.0
print (obs_candy.cross_entropy(obs_candy))
# -> 2.8556130394172476
Actually, the value obtained for the cross-entropy is nothing else than the entropy of the observed data, which is the smallest reachable value. Note that there is not much mystery here: the KL divergence is precisely defined by subtracting the entropy from the cross-entropy!
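As a quick check of this relationship, reusing only the two cross-entropy values displayed above (the cross-entropy of obs_candy with itself being its entropy):
print (obs_candy.cross_entropy(candy0) - obs_candy.cross_entropy(obs_candy))
# -> 0.0936312508495... (the KL divergence obtained above, up to floating-point rounding)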

We shall now execute one step of the EM algorithm:

model_dict = candy0.learn_by_em(obs_candy,nb_steps=1)
To extract the variables of the new model, we use the following one-liner:
(candy1,bag1,w1,h1,f1,wb1_1,wb2_1,hb1_1,hb2_1,fb1_1,fb2_1) = \
    tuple(model_dict[k] for k in (candy0,bag0,w0,h0,f0,wb1_0,wb2_0,hb1_0,hb2_0,fb1_0,fb2_0))
Here are the probabilities set in this new model (all the following values can be cross-checked with the AIMA book):
P = lea.P
print (P(bag1==1))
# -> 0.6124306106264868
print (P(fb1_1=='cherry'))
# -> 0.6684082742546373
print (P(wb1_1=='red'))
# -> 0.6483118060276456
print (P(hb1_1))
# -> 0.6558479816127674
print (P(fb2_1=='cherry'))
# -> 0.3886950739168244
print (P(wb2_1=='red'))
# -> 0.38174842702951567
print (P(hb2_1))
# -> 0.38274080515627396
As expected, the log-likelihood and the KL divergence have slightly improved from the candy0 model to the candy1 model:
print (-1000 * obs_candy.cross_entropy(candy1) * math.log(2))
# -> -2021.0262390280009
print (obs_candy.kl_divergence(candy1))
# -> 0.060111493134922256
Here are the indicators after 10 iterations of the EM algorithm:
model_dict = candy0.learn_by_em(obs_candy,nb_steps=10)
candy10 = model_dict[candy0]  
print (-1000 * obs_candy.cross_entropy(candy10) * math.log(2))
# -> -1982.0177851139413
print (obs_candy.kl_divergence(candy10))
# -> 0.0038341901203629014
The model can still be improved by specifying a bigger number of iterations. However, it's almost impossible to predict the number of iterations needed to reach a given level of likelihood. To help the process, it's possible, by means of extra arguments, to specify other halting conditions. In particular, the max_kld argument allows specifying a goal by means of the maximum KL divergence acceptable for the delivered model. For instance:
model_dict = candy0.learn_by_em(obs_candy,max_kld=0.0001)
candyx = model_dict[candy0]
print (obs_candy.kl_divergence(candyx))
# -> 9.809727643439459e-05
Since the required value could be impossible or intractable to reach, it's advisable to add the nb_steps argument as a safeguard. For instance:
model_dict = candy0.learn_by_em(obs_candy,max_kld=0.0001,nb_steps=10000)
The last possible argument is max_delta_kld, which specifies the maximum difference of KL divergence between two consecutive iterations. This allows you to specify a "stall" condition that detects the moment at which subsequent iterations wouldn't bring much added value.
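For instance, here is a sketch of a call combining such a stall condition with the nb_steps safeguard (the 1e-6 threshold is just an illustrative value):
model_dict = candy0.learn_by_em(obs_candy, max_delta_kld=1e-6, nb_steps=10000)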

In any case, whatever the arguments specified, the execution stops as soon as any of the specified halting conditions is met.

Controlling / monitoring the EM algorithm

learn_by_em is meant to be a high-level method: it performs the job until the specified halting condition is met. However, you cannot monitor what's happening inside, like the evolution of the log-likelihood or of some parameters of the model, nor can you control the algorithm execution (pause, step, resume, etc.). For such needs, Lea provides an alternative method, gen_em_steps, which is a generator producing an infinite sequence of EM steps that you control yourself.

Here is an example of usage, revisiting the candy use case to show the evolution of the KL divergence over the first 10 steps.

candy_gen_steps = candy0.gen_em_steps(obs_candy)
for _ in range(10):
    model_dict = next(candy_gen_steps)
    candyx = model_dict[candy0]
    print (obs_candy.kl_divergence(candyx))
# -> 0.060111493134922256
#    0.034141267160715305
#    0.016746236124274727
#    0.009082121028439083
#    0.006414953854867456
#    0.005401348869327194
#    0.004855251952482931
#    0.004458178493857989
#    0.0041265818693028145
#    0.0038341901203629014
You could similarly display other parameters, like the probability of choosing bag 1, the probability of cherry flavor if the candy comes from bag 2, and so on. These data could then be plotted using your favorite graphics package. The sole drawback of gen_em_steps is that it provides no argument for halting: you have to program the halting condition yourself.
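As a sketch of such a hand-made halting condition, here is a loop that stops as soon as the KL divergence drops below an arbitrary threshold of 0.001 (mimicking the max_kld argument of learn_by_em):
candy_gen_steps = candy0.gen_em_steps(obs_candy)
kld = obs_candy.kl_divergence(candy0)
nb_steps = 0
while kld > 0.001:
    model_dict = next(candy_gen_steps)
    kld = obs_candy.kl_divergence(model_dict[candy0])
    nb_steps += 1
print (nb_steps, kld)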

What's next?

Thank you for reading the present tutorial pages! We hope that you found them clear and informative.

You should now have comprehensive information on what Lea can do. You should be able to use it in your projects and research, or as a toolkit for teaching probability theory and probabilistic programming. You can find more examples on the Examples page.

For enhancements or bugs, please create issues on the Lea Bitbucket page.

Feel free also to send an e-mail to pie.denis@skynet.be (English or French spoken). Use cases of Lea in your projects/research are especially appreciated! A "references" page listing the projects using Lea could greatly improve the reputation of the library.
