Lea3_Tutorial_4: Machine Learning
Introduction
The present page is a special tutorial dedicated to Machine Learning using the Lea library. It assumes that you have installed Lea 3 and that you are familiar, at least, with the techniques presented in the first part of the Lea tutorial.
So far, most of the examples shown demonstrate various ways to calculate probability distributions from known probability distributions. However, apart from trivial cases of uniform distributions found in gambling, it can be difficult to determine prior probability distributions for a given phenomenon. If no quantitative data are available for the phenomenon at hand, you have to make assumptions based on some theoretical model, assuming this is not too complex. On the other hand, if quantitative data are available, then it is possible, with dedicated algorithms, to calculate prior probability distributions that tend to fit the given data. This last approach belongs to the field of Machine Learning (ML).
In the present page, we shall cover several techniques to "learn" probabilities from observed data. There are four sections:
- Frequentist inference: basic techniques using observed frequencies and Markov chains
- Bayesian learning: basic Bayesian reasoning to update probability distributions from observed data
- Maximum-likelihood algorithm: learning parameters of a BN where dependencies are defined, without hidden variables
- EM algorithm: learning parameters of a probabilistic model, with possibly hidden variables
These sections are mostly independent: you may jump directly to any one of them.
Frequentist inference
Actually, we have already presented a very basic ML approach: counting occurrences of a given phenomenon and deriving probabilities from the calculated frequencies is indeed a way to "learn" from the data.
For instance, frequencies of English letters can be learned from text data:
    lea.set_prob_type('r')
    text_data = "to be or not to be: that is the question"
    freq1 = lea.vals(*text_data)
    print (freq1)
    # ->   : 9/40
    #    : : 1/40
    #    a : 1/40
    #    b : 2/40
    #    e : 4/40
    #    h : 2/40
    #    i : 2/40
    #    n : 2/40
    #    o : 5/40
    #    q : 1/40
    #    r : 1/40
    #    s : 2/40
    #    t : 7/40
    #    u : 1/40
    print (''.join(freq1.random(100)))
    # -> 'oshheeo a ttt ubt ee o ontettisboe t: hb nis t bhoo nee bs o ttahsbhtne ttoa o oth ete'
Frequencies of groups of two letters can also be calculated easily, using the pairwise function (see implementation in Python's "Itertools Recipes").
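For reference, here is a minimal sketch of that pairwise recipe (with Python 3.10 or later, itertools.pairwise can be used directly):

    from itertools import tee

    def pairwise(iterable):
        # pairwise('to be') -> ('t','o'), ('o',' '), (' ','b'), ('b','e')
        a, b = tee(iterable)
        next(b, None)
        return zip(a, b)

With this helper available, the two-letter frequencies are obtained as follows: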
    freq2 = lea.vals(*pairwise(text_data))
    print (freq2.given(freq2[0] == 't'))
    # -> ('t', ' ') : 2/7
    #    ('t', 'h') : 2/7
    #    ('t', 'i') : 1/7
    #    ('t', 'o') : 2/7
    print (freq2.given(freq2[1] == 't'))
    # -> (' ', 't') : 3/6
    #    ('a', 't') : 1/6
    #    ('o', 't') : 1/6
    #    ('s', 't') : 1/6
The values of freq2 are tuples of two characters; to generate a random string as we did above, each tuple like ('t', 'h') has to be converted into the two-character string 'th'. Here is the trick to do that:
    freq2s = freq2.map(lambda v: ''.join(v))
    print (''.join(freq2s.random(50)))
    # -> 't b ns thuest: ue t nhais tt tht tth bhaquioio t t tate: bnoe:uebeo : t: r e heorbes titht e:uet '
A better approach to study transitions between letters is to build a Markov chain from the given data, as already explained here:
    mc = markov.chain_from_seq(text_data)
    print (mc)
    # ->
    #   -> b : 2/9
    #   -> i : 1/9
    #   -> n : 1/9
    #   -> o : 1/9
    #   -> q : 1/9
    #   -> t : 3/9
    # :
    #   ->   : 1
    # a
    #   -> t : 1
    # b
    #   -> e : 1
    # e
    #   ->   : 2/4
    #   -> : : 1/4
    #   -> s : 1/4
    # h
    #   -> a : 1/2
    #   -> e : 1/2
    # i
    #   -> o : 1/2
    #   -> s : 1/2
    # n
    #   -> o : 1
    # o
    #   ->   : 2/5
    #   -> n : 1/5
    #   -> r : 1/5
    #   -> t : 1/5
    # q
    #   -> u : 1
    # r
    #   ->   : 1
    # s
    #   ->   : 1/2
    #   -> t : 1/2
    # t
    #   ->   : 2/7
    #   -> h : 2/7
    #   -> i : 1/7
    #   -> o : 2/7
    # u
    #   -> e : 1
freq2 captures probabilities of two consecutive letters (e.g. P('es') = 1/39), while mc captures conditional probabilities of transitions from one letter to another (e.g. P('s'|'e') = 1/4). The two approaches are thus very different. Let's now generate a random string from this Markov chain (we arbitrarily start from an 'a'):
    print (''.join(mc.get_state('a').random_seq(100)))
    # -> 'thato be be tistis nonononor is nor the: or t tiot to que be that thest to t t no tonor que no be qu'
Should you like that kind of linguistic experiment, you are kindly invited to have a look at the Bullshit Generator on the Examples page.
Bayesian learning
Besides the frequentist approach seen above, Bayesian reasoning can also be used for machine learning. Let's consider the "surprise candy" example given in pp. 712-715 of the book Artificial Intelligence: A Modern Approach (2nd edition), by S. Russell and P. Norvig. We'll merely provide a summary of the case, focusing on its translation in Lea. You're invited to read the explanations given in the book to get a thorough understanding.
There are five bags of candies. A bag is chosen randomly and then a sample of several candies is drawn randomly from this bag. The candy's flavor is either cherry or lime. The distribution of flavors in a bag depends on the bag in a known way (see below). Although the chosen bag is unknown, the flavor of the drawn candies can be tasted. In this situation, the chosen bag is an unknown variable while the flavor is an observed variable. Following the AIMA book, we'll interpret the chosen bag as a hypothesis that can be any of the five identifiers h1, h2, h3, h4 and h5. Here is the model in Lea, based on data given by the candy manufacturer:
    hyp_0 = lea.pmf({'h1': 0.1, 'h2': 0.2, 'h3': 0.4, 'h4': 0.2, 'h5': 0.1})
    flavor_by_hyp = { 'h1': 'cherry',
                      'h2': lea.pmf({'cherry': 0.75, 'lime': 0.25}),
                      'h3': lea.pmf({'cherry': 0.50, 'lime': 0.50}),
                      'h4': lea.pmf({'cherry': 0.25, 'lime': 0.75}),
                      'h5': 'lime' }
    flavor = hyp_0.switch(flavor_by_hyp)
hyp_0 is the a priori probability distribution of the hypothesis, considering that there is no evidence yet. flavor is defined as a CPT depending on hyp_0.
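As a quick sanity check (not part of the book's exposition), we can display the marginal distribution of flavor before any observation; the five weighted hypotheses average out to a balanced distribution:

    print (flavor)
    # -> cherry : 0.5
    #    lime   : 0.5
    # (the displayed values may show tiny floating-point rounding)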
Let's now draw 10 candies and assume that they all have lime flavor. On the basis of these observations, what is the evolution of the probability of hypothesis h5 (the bag with lime candies only)? For sure, we expect that this probability, initially 0.1, shall grow as new lime candies are drawn. Let's calculate this with Lea.
A first approach consists in adding observations one by one in a list of evidences and evaluating the conditional probability of h5, given these evidences:
WARNING: requires Lea 3.2 at least (clone method)
    P = lea.P
    evidences = []
    for n in range(11):
        print ("%2d: %f" % (n,P((hyp_0=='h5').given(*evidences))))
        flavor_x = flavor.clone(shared=(hyp_0,))
        evidences.append(flavor_x=='lime')
    # ->  0: 0.100000
    #     1: 0.200000
    #     2: 0.307692
    #     3: 0.421053
    #     4: 0.528926
    #     5: 0.624390
    #     6: 0.704749
    #     7: 0.770214
    #     8: 0.822449
    #     9: 0.863566
    #    10: 0.895628
The clone method is used to create 10 new events, independent of one another. The argument shared=(hyp_0,) is meant to express that all these events share the dependency on the same hypothesis; in other words, all drawn candies come from the same bag. Also, based on this sample of 10 lime candies, we can easily get the probability that the next drawn candy also has lime flavor:
    print (P((flavor=='lime').given(*evidences)))
    # -> 0.9796631050967436
Instead of creating the events one by one in a loop, they can be created at once by specifying n=10 in the clone method:

    flavors = flavor.clone(n=10,shared=(hyp_0,))
    evidences = [flavor_x=='lime' for flavor_x in flavors]
    print (hyp_0.given(*evidences))
    # -> h2 : 1.7082745402179077e-06
    #    h3 : 0.003498546258366275
    #    h4 : 0.10087190332532722
    #    h5 : 0.8956278421417664
    print (P((flavor=='lime').given(*evidences)))
    # -> 0.9730314698335799
The approaches above create several events (flavor_x), all depending on the same initial hypothesis distribution (hyp_0). Instead of that, we could build these events one by one, based on a hypothesis distribution updated from the latest observations. Here is how to proceed:
    hyp_x = hyp_0
    for n in range(11):
        print ("%2d: %f"%(n,P((hyp_x=='h5'))))
        flavor_x = hyp_x.switch(flavor_by_hyp)
        hyp_x = hyp_x.given(flavor_x=='lime').calc()
    # ->  0: 0.100000
    #     1: 0.200000
    #     2: 0.307692
    #     3: 0.421053
    #     4: 0.528926
    #     5: 0.624390
    #     6: 0.704749
    #     7: 0.770214
    #     8: 0.822449
    #     9: 0.863566
    #    10: 0.895628
Note the call to calc(), used to calculate the new hypothesis pmf while getting rid of any dependency on the previous hypothesis distribution. To get the probability of lime flavor for the next draw, we can simply use flavor_x, without calling .given(...), since this variable is based on the up-to-date hypothesis distribution:
    print (P((flavor_x=='lime')))
    # -> 0.9730314698335799
Maximum-likelihood algorithm for BN
If you reconsider the examples of data mining on a joint table, now interpreting the frequencies as probabilities, then you get another example of ML through the frequentist approach. Admittedly, such a simple statistical approach could be qualified as "Machine Learning for the layman"! A more difficult goal is to calculate the probabilities involved in a Bayesian network. Lea provides a method for this purpose; it can calculate the CPT entries provided that you define the dependency graph of the BN. This method, named Lea.build_bn_from_joint, makes the following assumptions:
- you study a given set of N discrete random variables, which are possibly interdependent and for which the pmfs are unknown;
- you have got a joint table with S samples giving the values of each of the N variables (i.e. a table of S rows and N columns); we assume that these data are rich enough to discover all values of the domain of each random variable;
- you have defined the dependencies of the BN, by means of a directed acyclic graph linking the random variables.
Note that the second point means that there shall be no hidden variable. For dealing with probabilistic models having hidden variables, a special algorithm is required (see the EM Algorithm section below).
Let's illustrate these assumptions on an example reusing the patient joint table:
    patient = lea.read_csv_file('patients.csv',
                                col_names=('given_name','surname','gender','title','birthday',
                                           'blood_type','weight{f}','height{f}','smoker{b}'))
- we shall focus on the five random variables gender, blood_type, smoker, height and weight (N = 5);
- these five variables are present in the patient joint table, which contains fifty records (S = 50);
- we assume a BN with the following dependencies between the five random variables:
The build_bn_from_joint method allows "learning" the different CPTs of this BN. It uses a maximum-likelihood algorithm, calculating each CPT from the observed data (here, the patient joint table). Each argument of Lea.build_bn_from_joint is a tuple giving the names of the antecedent nodes and the name of the dependent node:
    bn = patient.build_bn_from_joint(
            ( ('gender','blood_type')     , 'smoker' ),
            ( ('gender','height','smoker'), 'weight' ))
bn is a full-fledged BN on which you can make queries, such as calculating the gender probability of a person given that their weight is between 110 (included) and 120 (excluded):
    print (bn.gender.given(110 <= bn.weight, bn.weight < 120))
    # -> female : 0.21234352561454817
    #    male   : 0.7876564743854518
    print (P((bn.gender=='male').given(110 <= bn.weight, bn.weight < 120)))
    # -> 0.7876564743854518
It is interesting to compare this result with the one obtained by making the same query on the patient joint table instead of the built-up BN:
    print (P((patient.gender=='male').given(110 <= patient.weight, patient.weight < 120)))
    # -> 1.0

The joint table gives a certainty here because a single patient of the sample falls in this weight range:

    print (patient.given(110 <= patient.weight, patient.weight < 120))
    # -> given_name, surname   , gender, title, birthday   , blood_type, weight, height, smoker
    #    ('Jimmie' , 'Martinez', 'male', 'Mr.', '03-12-1930', 'B+'      , 114.0 , 171.0 , False ) : 1.0

The BN, in contrast, generalizes beyond the individual records of the sample.
Improving the Bayesian network
Of course, the merits of a BN are directly linked to the quality of the dependency graph designed by the human expert who models the phenomenon. In this regard, the BN graph given above could certainly be improved. To illustrate some advanced techniques, let us revisit the BN by including the patient's age: it seems sensible to add age as an influencer of height, as shown in the following graph:
Now, the problem is that age is not an attribute of the patient joint table. Only birthday is available, which is a string, hence not adequate for a BN variable. As seen previously, assuming the year of the study is 2015, the age could be approximated with the following definition:
    age = 2015 - patient.birthday[-4:].map(int)
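To see what this expression computes, here is the same calculation on a plain Python string (the Lea expression simply applies it element-wise to the birthday column); the sample birthday comes from the record displayed earlier:

    # plain-Python equivalent of the age expression, for one birthday value
    birthday = '03-12-1930'
    print (2015 - int(birthday[-4:]))
    # -> 85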
The trick is then to convert each age value into a 1-tuple with the map method, to append it to the patient tuples and then to use the as_joint method to redefine the column names. This could be done through the following one-liner:
    patient2 = (patient + age.map(lambda x: (x,))).as_joint('given_name','surname','gender','title',
                                                            'birthday','blood_type','weight','height',
                                                            'smoker','age')
The new patient2 joint table contains a new column age, which is in line with birthday. Now that we have all the data needed, we can build the new BN, as sketched above:
    bn2 = patient2.build_bn_from_joint(
             ( ('gender','age')            , 'height'),
             ( ('gender','blood_type')     , 'smoker'),
             ( ('gender','height','smoker'), 'weight'))
    print (P((bn2.gender=='male').given(110 <= bn2.weight, bn2.weight < 120)))
    # -> 0.8397314844204974
Note that the result differs from the one obtained with the first BN; this is due to the new dependency from age to height, even if these two variables are absent from the query.
The EM Algorithm
WARNING: requires Lea 3.2 at least
So far, we have assumed that all variables of our probabilistic model are observable. In many cases however, models have hidden variables: these cannot be observed directly but they influence other variables, which are observable. The maximum-likelihood algorithm used in the previous section is unable to learn parameters of such models.
For machine learning on such models presenting hidden variables, Lea provides an implementation of the Expectation-Maximization (EM) algorithm: the learn_by_em method. This method is powerful but also quite hard to master; in particular, it provides many optional arguments. Before giving examples of usage, let's sketch the general process of using the EM algorithm (without detailing how it works internally):
- You get some observation data obs_x, by means of a Lea instance representing a frequency table. If there are several variables involved, you shall use a joint probability table.
- You define a probabilistic model x0 that is capable of producing the observed data obs_x, without considering yet the fitting of probabilities; this model x0 is a Lea instance that uses techniques seen so far: CPT, BN, arithmetic, functions, etc. This model involves several random variables v1, ..., vn, observed or hidden, with arbitrary parameters (or best guesses).
- You call x0.learn_by_em(obs_x, ...) with some extra arguments (see later).
- You obtain a new model x1, which has the same variables and structure as x0, but with new parameters that better match the observed data obs_x.
We shall demonstrate this process on three use cases. Note that the last two can be found in the literature, so you can cross-check the calculated results.
Case 1 - Learning Bernoulli probability distributions
Let's start with a basic example. Ron is an experimenter in some Las Vegas laboratory. He has patiently repeated 10,000 times the following experiment: throwing two coins and keeping track of the number of heads obtained (0, 1 or 2). Eventually, he gets the following counts:
    0: 5,558
    1: 3,850
    2:   592
The probabilistic model is:
X = A + B
where X is the random variable giving the observed count while A and B are random variables giving 0 (Tail) or 1 (Head) for each coin. A and B are hidden variables, with unknown a priori parameters.
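To make the generative model concrete, here is a small simulation sketch (not part of the tutorial's data): it produces counts of the same kind as Ron's, assuming, for the sake of illustration only, biases of 0.3 and 0.2 for the two coins.

    import random

    def throw(p_head):
        # one coin throw: 1 for Head (with probability p_head), 0 for Tail
        return 1 if random.random() < p_head else 0

    # simulate 10,000 experiments of the model X = A + B
    counts = {0: 0, 1: 0, 2: 0}
    for _ in range(10000):
        counts[throw(0.3) + throw(0.2)] += 1
    print (counts)
    # -> e.g. {0: 5578, 1: 3812, 2: 610} (varies from run to run)

The EM algorithm solves the inverse problem: starting from the observed counts, it estimates the unknown biases.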
So, the EM algorithm is the right tool here! Let's first define the observed frequency distribution:
    obs_x = lea.pmf({0: 5558, 1: 3850, 2: 592})
    print(obs_x)
    # -> 0 : 0.5558
    #    1 : 0.385
    #    2 : 0.0592
Then, let's define the initial probabilistic model, with guessed parameters:

    a0 = lea.bernoulli(0.5)
    b0 = lea.bernoulli(0.4)
    x0 = a0 + b0
The variables are named with a "0" suffix to recall that this model is just a first guess, setting initial values on the parameters (here, probabilities 0.5 and 0.4). Now, we are almost ready to execute the EM algorithm by calling learn_by_em. Because this algorithm is iterative, hopefully converging to the solution, you need to define a halting condition. Several conditions are expressible by means of optional arguments; we shall use here the simplest one, fixing the number of algorithm steps to 100. Here is the call:
    model_dict = x0.learn_by_em(obs_x,nb_steps=100)
model_dict is a dictionary giving access to the variables of the new probabilistic model. Since the probabilistic model is made up of several variables, the dictionary serves as a look-up table to retrieve each of the variables of the new model (remember that, by design, each Lea instance is immutable, so the internal parameters of a0, b0, x0 cannot be altered). Here is how you proceed to get the variables of the optimized model:
    a1 = model_dict[a0]
    b1 = model_dict[b0]
    x1 = model_dict[x0]
Note that x1 could also be defined as a1 + b1, which is equivalent. Here are the probability distributions of these variables:
    print (a1)
    # -> 0 : 0.6845898600927648
    #    1 : 0.31541013990723527
    print (b1)
    # -> 0 : 0.8120101399072353
    #    1 : 0.1879898600927648
    print (x1)
    # -> 0 : 0.5558939080730004
    #    1 : 0.38481218385399896
    #    2 : 0.059293908073000515
x1 is close to obs_x, as expected. Here is how you may interpret these results: given the observations, the most likely hypothesis is that one coin has a 31.5% chance of showing Head and the other coin an 18.8% chance. Since you now have access to all variables of the new model, whether observed or hidden, you may make queries to get explanations of new observed results, to make forecasts, etc. For instance:
    print (lea.joint(a1,b1).given(x1==1))
    # -> (0, 1) : 0.33443835049825915
    #    (1, 0) : 0.6655616495017408
Case 2 - Learning binomial probability distributions
We shall now see a more elaborate example, which will introduce other arguments of the learn_by_em method. This example is detailed in a course of Stefanos Zafeiriou on Advanced Statistical Machine Learning.
We have two coins, A and B, possibly biased. We want to determine the bias of each coin. We randomly choose one of the coins 5 times and, for each choice, we toss the chosen coin 10 times and note the number of heads obtained. Each value can then go from 0 to 10 heads. The observed results are: 5, 9, 8, 4, 7. We assume that the two coins cannot be distinguished, so the choice made for each experiment is a hidden random variable. Let's first build the frequency table of the observed numbers of heads:
    obs_nb_heads = lea.vals(5,9,8,4,7)
Then, the initial model is defined: the number of heads for each coin is a binomial distribution of 10 throws with a guessed bias, and the coin choice is an equiprobable hidden variable:

    ha0 = 0.6
    hb0 = 0.5
    nb_heads_A0 = lea.binom(10, ha0)
    nb_heads_B0 = lea.binom(10, hb0)
    coin_type0 = lea.vals("A","B")
    nb_heads0 = coin_type0.switch( { "A" : nb_heads_A0,
                                     "B" : nb_heads_B0 })
Here is the call to the EM algorithm, limited to a single step:

    model_dict = nb_heads0.learn_by_em(obs_nb_heads, fixed_vars=(coin_type0,), nb_steps=1)
Note the new fixed_vars argument. By default, the learn_by_em method optimizes all variables of the given model. Here, because we control the process of selecting the coin (50-50 probabilities), we want the EM algorithm to keep this variable unchanged in the new model. The argument fixed_vars=(coin_type0,) expresses this constraint. The argument is any iterable so, in a more complex model, several variables could be declared as fixed.
Now, we can display the new bias probability for each of the coins:
    nb_heads_A1 = model_dict[nb_heads_A0]
    nb_heads_B1 = model_dict[nb_heads_B0]
    print (nb_heads_A1.p)
    # -> 0.7130122354005161
    print (nb_heads_B1.p)
    # -> 0.5813393083136627
Case 3 - Learning Bayesian network
The last example shows how to learn a simple Bayesian network. It's taken from the AIMA book, pp. 727-730. We'll merely provide a summary of the case, focusing on its translation in Lea. You're invited to read the explanations given in the book to get a thorough understanding.
Two bags containing candies have been mixed together. Candies are described by three features: wrapper (red/green), hole (true/false) and flavor (cherry/lime). A sample of 1,000 candies is drawn and the occurrences of the eight combinations of features are counted. Here are the raw data obtained, expressed as a Lea JPD:
    obs_candy = lea.pmf({ ('red'  ,True ,'cherry'): 273,
                          ('red'  ,False,'cherry'):  93,
                          ('green',True ,'cherry'): 104,
                          ('green',False,'cherry'):  90,
                          ('red'  ,True ,'lime'  ):  79,
                          ('red'  ,False,'lime'  ): 100,
                          ('green',True ,'lime'  ):  94,
                          ('green',False,'lime'  ): 167 })
We model the unknown bag of origin as a hidden variable and make the following assumptions:
- the three candy features are independent given the origin bag,
- the conditional probability distribution of each feature depends on the bag.
Such assumptions actually define a so-called naive Bayes model. Here is the initial model, using the arbitrary parameter values presented in the AIMA book (refer to the Bayesian Reasoning section of the present tutorial).
Note: w stands for "wrapper", h for "hole", f for "flavor", wb1 for "wrapper if candy comes from bag 1", etc.
    bag0 = lea.pmf({1: 0.6, 2: 0.4})
    wb1_0 = lea.pmf({'red': 0.6, 'green': 0.4})
    wb2_0 = lea.pmf({'red': 0.4, 'green': 0.6})
    w0 = bag0.switch({1: wb1_0, 2: wb2_0})
    hb1_0 = lea.event(0.6)
    hb2_0 = lea.event(0.4)
    h0 = bag0.switch({1: hb1_0, 2: hb2_0})
    fb1_0 = lea.pmf({'cherry': 0.6, 'lime': 0.4})
    fb2_0 = lea.pmf({'cherry': 0.4, 'lime': 0.6})
    f0 = bag0.switch({1: fb1_0, 2: fb2_0})
    candy0 = lea.joint(w0,h0,f0)
candy0 is a probability distribution comparable to obs_candy, associating a probability to each of the eight types of candy (you may display the two distributions to convince yourself). To evaluate the degree of likelihood of the model with respect to the given data, there are several possible indicators, which are closely interlinked: the cross-entropy, the log-likelihood and the Kullback-Leibler divergence (see here for more details).
    import math

    # cross-entropy between observed data and model (in bits)
    print (obs_candy.cross_entropy(candy0))
    # -> 2.949244290266771
    # log-likelihood of the 1,000 observed candies under the model
    print (-1000 * obs_candy.cross_entropy(candy0) * math.log(2))
    # -> -2044.260364580929
    # Kullback-Leibler divergence between observed data and model
    print (obs_candy.kl_divergence(candy0))
    # -> 0.09363125084952317
For reference, the best achievable values are obtained when the model coincides with the observed distribution: the KL divergence is then 0 and the cross-entropy reduces to the entropy of obs_candy:

    print (obs_candy.kl_divergence(obs_candy))
    # -> 0.0
    print (obs_candy.cross_entropy(obs_candy))
    # -> 2.8556130394172476
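The following small check (not part of the AIMA example) shows, on the values printed above, how these indicators relate to one another; it assumes, as the log(2) conversion above suggests, that Lea expresses cross-entropy and KL divergence in bits:

    import math

    ce  = 2.949244290266771     # cross-entropy of the model candy0
    h   = 2.8556130394172476    # entropy of obs_candy (best achievable cross-entropy)
    kld = 0.09363125084952317   # KL divergence of the model candy0

    # KL divergence = cross-entropy - entropy
    print (abs(kld - (ce - h)) < 1e-12)
    # -> True
    # log-likelihood of the 1,000 samples (in nats) = -1000 * cross-entropy * ln(2)
    print (-1000 * ce * math.log(2))
    # -> -2044.260364580929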
We shall now execute one step of the EM algorithm:
    model_dict = candy0.learn_by_em(obs_candy,nb_steps=1)
The variables of the new model can all be retrieved from model_dict:

    (candy1,bag1,w1,h1,f1,wb1_1,wb2_1,hb1_1,hb2_1,fb1_1,fb2_1) = \
        tuple(model_dict[k] for k in (candy0,bag0,w0,h0,f0,wb1_0,wb2_0,hb1_0,hb2_0,fb1_0,fb2_0))
    P = lea.P
    print (P(bag1==1))
    # -> 0.6124306106264868
    print (P(fb1_1=='cherry'))
    # -> 0.6684082742546373
    print (P(wb1_1=='red'))
    # -> 0.6483118060276456
    print (P(hb1_1))
    # -> 0.6558479816127674
    print (P(fb2_1=='cherry'))
    # -> 0.3886950739168244
    print (P(wb2_1=='red'))
    # -> 0.38174842702951567
    print (P(hb2_1))
    # -> 0.38274080515627396
We can check that the likelihood indicators improve when moving from the candy0 model to the candy1 model:
    print (-1000 * obs_candy.cross_entropy(candy1) * math.log(2))
    # -> -2021.0262390280009
    print (obs_candy.kl_divergence(candy1))
    # -> 0.060111493134922256
Executing more steps improves the fit further; here is the result after 10 steps:

    model_dict = candy0.learn_by_em(obs_candy,nb_steps=10)
    candy10 = model_dict[candy0]
    print (-1000 * obs_candy.cross_entropy(candy10) * math.log(2))
    # -> -1982.0177851139413
    print (obs_candy.kl_divergence(candy10))
    # -> 0.0038341901203629014
Instead of fixing the number of steps in advance, the max_kld argument allows specifying a goal by means of a maximum KL divergence acceptable for the delivered model. For instance:

    model_dict = candy0.learn_by_em(obs_candy,max_kld=0.0001)
    candyx = model_dict[candy0]
    print (obs_candy.kl_divergence(candyx))
    # -> 9.809727643439459e-05
Since there is no guarantee on the number of iterations needed to reach such a goal, it may be wise to keep the nb_steps argument as a safeguard. For instance:

    model_dict = candy0.learn_by_em(obs_candy,max_kld=0.0001,nb_steps=10000)
Another halting condition is available through the max_delta_kld argument, which specifies the maximum difference of KL divergence between two consecutive iterations. This allows you to specify a "stall" condition that detects the moment at which subsequent iterations wouldn't bring much added value. In any case, whatever the arguments specified, the execution shall stop as soon as any of the specified halting conditions is verified.
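As an illustration, here is a sketch of a call combining two of these halting conditions; the 1e-6 threshold is an arbitrary value chosen for the example:

    # stop as soon as an EM step improves the KL divergence by less than 1e-6,
    # with an upper bound of 10,000 steps as a safeguard
    model_dict = candy0.learn_by_em(obs_candy, max_delta_kld=1e-6, nb_steps=10000)
    candyx = model_dict[candy0]
    print (obs_candy.kl_divergence(candyx))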
Controlling / monitoring the EM algorithm
learn_by_em is meant to be a high-level method: it performs the job until the specified halting condition is met. You cannot, however, monitor what's happening inside, like the evolution of the log-likelihood or of some parameters of the model; nor can you control the algorithm execution (pause, step, resume, etc.). For such needs, Lea provides an alternative method, gen_em_steps, which is a generator. This method actually generates an infinite sequence of EM steps that you can consume at your own pace.
Here is an example of usage, revisiting the candy use case to show the evolution of KL divergence on the first 10 steps.
    candy_gen_steps = candy0.gen_em_steps(obs_candy)
    for _ in range(10):
        model_dict = next(candy_gen_steps)
        candyx = model_dict[candy0]
        print (obs_candy.kl_divergence(candyx))
    # -> 0.060111493134922256
    #    0.034141267160715305
    #    0.016746236124274727
    #    0.009082121028439083
    #    0.006414953854867456
    #    0.005401348869327194
    #    0.004855251952482931
    #    0.004458178493857989
    #    0.0041265818693028145
    #    0.0038341901203629014
The price to pay with gen_em_steps is that it provides no argument for halting: you have to program the halting condition yourself.
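For example, here is a sketch of a hand-made halting condition built on gen_em_steps, stopping when an EM step improves the KL divergence by less than a given threshold (the 1e-6 value is an arbitrary choice for the example):

    # iterate EM steps until the KL divergence improvement stalls
    candy_gen_steps = candy0.gen_em_steps(obs_candy)
    previous_kld = float('inf')
    while True:
        model_dict = next(candy_gen_steps)
        candyx = model_dict[candy0]
        kld = obs_candy.kl_divergence(candyx)
        if previous_kld - kld < 1e-6:
            break
        previous_kld = kld
    print (kld)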
What's next?
Thank you for reading the present tutorial pages! We hope that you found them clear and informative.
You should now have comprehensive information on what Lea can do. You should be able to use it in your projects, research, or as a toolkit for teaching probability theory and probabilistic programming. You can find more examples on the Examples page.
For enhancements or bugs, please create issues on the Lea Bitbucket page.
Feel free also to send an e-mail to pie.denis@skynet.be (English or French spoken). Use cases of Lea in your projects/research are especially appreciated! A "references" page listing the projects using Lea could greatly improve the reputation of the library.