
Lea Advanced Tutorial - Part 3/3


Introduction

WARNING: Lea 2 is no longer maintained. If you are new to Lea, you are invited to use Lea 3, which has a comprehensive tutorial.

The present page is the third part of the advanced tutorial for the Lea 2 package. It mainly covers new features introduced in Lea 2.2. It assumes that you have installed Lea 2.2.x and that you are familiar with the techniques presented in the previous parts of the tutorial.

New convenience functions: P, Pf, B, V, VP and X

Lea 2.2 provides several convenience functions that shorten the most commonly used Lea methods. They are given in the following table, where we assume that boolLea is a boolean Lea instance, vLea is a Lea instance and argsLea is a sequence of Lea instances.

| expression | shortcut equivalent | what it returns |
|---|---|---|
| boolLea.p(True) | P(boolLea) | a fraction giving the probability that boolLea is true |
| boolLea.P | -same- | -same- |
| boolLea.pmf(True) | Pf(boolLea) | a float giving the probability that boolLea is true |
| boolLea.Pf | -same- | -same- |
| Lea.boolProb(n,d) | B(n,d) | a boolean probability distribution such that P(True) = n/d |
| Lea.fromVals(*args) | V(*args) | a probability distribution with P(v) = frequency of v in args |
| Lea.fromValFreqs(*args) | VP(*args) | a probability distribution with P(v) given by args = a sequence of (v,P(v)) tuples |
| vLea.cprod(*argsLea) | X(*argsLea) | a probability distribution corresponding to the cartesian product of argsLea |

Example:

#!python
>>> from lea import *
>>> rain = B(1,4)
>>> P(rain)
1/4
>>> Pf(rain)
0.25
>>> die1 = V(1,2,3,4,5,6)
>>> P(die1 <= 4)
2/3
>>> Pf(die1 <= 4)
0.6666666666666666
>>> parityBit = die1 % 2
>>> parity = parityBit.map(lambda v: 'even' if v==0 else 'odd ')
>>> P(parity == 'even')
1/2
>>> X(die1,parityBit,parity)
(1, 1, 'odd ') : 1/6
(2, 0, 'even') : 1/6
(3, 1, 'odd ') : 1/6
(4, 0, 'even') : 1/6
(5, 1, 'odd ') : 1/6
(6, 0, 'even') : 1/6
>>> die1.given(parity=='even')
2 : 1/3
4 : 1/3
6 : 1/3
>>> P((die1==2).given(parity=='even'))
1/3
>>> P((die1==1).given(parity=='even'))
0

Note that the P and Pf shortcuts differ slightly from .p(True) and .pmf(True): they check that the probability distribution is actually boolean and raise an exception if it is not.

#!python
>>> die1.p(True)
1/6
>>> P(die1)
lea.Error: found <int> value although <bool> is expected
Note that the first result is due to the fact that Python's True is convertible to the integer value 1. In order to avoid such silent inconsistencies, the P and Pf shortcuts shall be preferred.

'given' method with multiple arguments

As of Lea 2.2, the given method, used for conditional probabilities and Bayesian reasoning, can take multiple evidence conditions as arguments. This is a handy alternative to ANDing the conditions together. Example:

#!python
>>> die1.given((2<=die1) & (die1<5) & (parity=='even'))
2 : 1/2
4 : 1/2
can be replaced by
#!python
>>> die1.given(2<=die1, die1<5, parity=='even')
2 : 1/2
4 : 1/2

This presents several advantages:

  1. The expression is simpler, since parentheses can be removed.
  2. A set of conditions can be passed easily (with the * prefix), which expresses a set of evidences more naturally (see the sketch at the end of this section).
  3. Performance can be improved greatly compared to ANDing evidences, because the given method is optimised with "short-circuit evaluation", i.e. the exploration can be pruned as soon as a condition evaluates to false.

The sole functional difference between the two approaches is that an exception raised by the ANDed form may not occur with the new technique, since some value combinations can be dropped by short-circuit evaluation (just as with Python's and operator).

#!python
>>> die1.given((2<=die1) & (6//(die1-1)==2))
ZeroDivisionError: integer division or modulo by zero
>>> die1.given(2<=die1, 6//(die1-1)==2)
4 : 1
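
To illustrate the second advantage, a set of evidence conditions gathered in a tuple can be passed in one go with the * prefix. Here is a small sketch reusing the example above (conditions is just a name introduced for illustration):

#!python
>>> conditions = (2<=die1, die1<5, parity=='even')
>>> die1.given(*conditions)
2 : 1/2
4 : 1/2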

Other new convenience functions and objects (games)

Lea 2.2 provides convenience functions for common games.

| expression | what it returns |
|---|---|
| die(f) | probability distribution of a fair die with faces numbered 1, …, f |
| dice(n,f) | probability distribution of the total of n fair dice with faces numbered 1, …, f |
| diceSeq(n,f) | probability distribution of combinations of n fair dice with faces numbered 1, …, f, ignoring the order of the dice |
| diceSeq(n,f,False) | probability distribution of combinations of n fair dice with faces numbered 1, …, f, taking the order of the dice into account |

Examples:

#!python

>>> from lea import *
>>> die(6)
1 : 1/6
2 : 1/6
3 : 1/6
4 : 1/6
5 : 1/6
6 : 1/6
>>> dice(2,6)
 2 : 1/36
 3 : 2/36
 4 : 3/36
 5 : 4/36
 6 : 5/36
 7 : 6/36
 8 : 5/36
 9 : 4/36
10 : 3/36
11 : 2/36
12 : 1/36
>>> diceSeq(2,6)
(1, 1) : 1/36
(1, 2) : 2/36
(1, 3) : 2/36
(1, 4) : 2/36
(1, 5) : 2/36
(1, 6) : 2/36
(2, 2) : 1/36
(2, 3) : 2/36
(2, 4) : 2/36
(2, 5) : 2/36
(2, 6) : 2/36
(3, 3) : 1/36
(3, 4) : 2/36
(3, 5) : 2/36
(3, 6) : 2/36
(4, 4) : 1/36
(4, 5) : 2/36
(4, 6) : 2/36
(5, 5) : 1/36
(5, 6) : 2/36
(6, 6) : 1/36
>>> diceSeq(2,6,False)
(1, 1) : 1/36
(1, 2) : 1/36
(1, 3) : 1/36
(1, 4) : 1/36
(1, 5) : 1/36
(1, 6) : 1/36
(2, 1) : 1/36
(2, 2) : 1/36
(2, 3) : 1/36
(2, 4) : 1/36
(2, 5) : 1/36
(2, 6) : 1/36
(3, 1) : 1/36
(3, 2) : 1/36
(3, 3) : 1/36
(3, 4) : 1/36
(3, 5) : 1/36
(3, 6) : 1/36
(4, 1) : 1/36
(4, 2) : 1/36
(4, 3) : 1/36
(4, 4) : 1/36
(4, 5) : 1/36
(4, 6) : 1/36
(5, 1) : 1/36
(5, 2) : 1/36
(5, 3) : 1/36
(5, 4) : 1/36
(5, 5) : 1/36
(5, 6) : 1/36
(6, 1) : 1/36
(6, 2) : 1/36
(6, 3) : 1/36
(6, 4) : 1/36
(6, 5) : 1/36
(6, 6) : 1/36

Several convenience objects are also defined.

| expression | what it is |
|---|---|
| D6 | probability distribution of a fair die with 6 faces (same as die(6)) |
| cardRank | probability distribution of the rank of a card, among 'A', '2', '3', '4', '5', '6', '7', '8', '9', 'T', 'J', 'Q' and 'K' |
| cardSuite | probability distribution of the suit of a card, among 'S' (Spades), 'H' (Hearts), 'D' (Diamonds) and 'C' (Clubs) |
| card | probability distribution of a card drawn from a 52-card deck (same as cardRank + cardSuite) |

Examples:

#!python
>>> cardSuite
C : 1/4
D : 1/4
H : 1/4
S : 1/4
>>> card.given(cardRank=='A')
AC : 1/4
AD : 1/4
AH : 1/4
AS : 1/4
>>> card.given(cardSuite=='S')
2S : 1/13
3S : 1/13
4S : 1/13
5S : 1/13
6S : 1/13
7S : 1/13
8S : 1/13
9S : 1/13
AS : 1/13
JS : 1/13
KS : 1/13
QS : 1/13
TS : 1/13
>>> card.randomDraw(13)
('AD', 'JC', 'TD', 'JS', '9C', 'KC', 'KD', 'JD', '5D', 'TH', '6S', 'AC', '3C')
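
Since card combines ranks and suits into 52 equiprobable strings, elementary card probabilities can be checked directly; a small sketch:

#!python
>>> P(card == 'AS')
1/52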

PMF histograms using matplotlib

As of version 2.2, Lea can plot the probability mass function of a given Lea instance. This requires the matplotlib package to be installed. Then, all you have to do is invoke the plot() method on any Lea instance. Example:

#!python
dice(3,6).plot()
[figure_1.png: the resulting bar chart of the pmf]

By default the plot() method draws the pmf in a dedicated window. Several arguments can be specified to customize this behavior, including saving the chart in an image file. Here is the full signature, with default values:

#!python
leaInstance.plot(title=None,fname=None,savefigArgs=dict(),**barArgs)
| argument | description |
|---|---|
| title | a string giving the title to display (no title if None) |
| fname | if not None, the chart is saved in the file specified by fname, as described for the matplotlib.pyplot.savefig function |
| savefigArgs | (used only if fname is not None) a dictionary of named arguments relayed to the matplotlib.pyplot.savefig function |
| barArgs | named arguments relayed to the matplotlib.pyplot.bar function |

Example:

#!python
dice(3,6).plot(title='3D6',fname='pmf_3d6.png',savefigArgs=dict(bbox_inches='tight'),color='green')

Drawing objects

In several situations, we are interested in a sequence of values obtained from the same random source: lottery, card drawing, dice throws, etc. These sequences can be defined using different protocols:

  • with or without replacement: whether the drawn value is withdrawn from the remaining values (no duplicates)
  • sorted or not: whether each drawn sequence is sorted (i.e. whether the order of drawings is irrelevant).

The draw method addresses this need, supporting the different protocols through two boolean arguments. Its signature is as follows:

#!python
leaInstance.draw(n,sorted=False,replacement=False)

| argument | description |
|---|---|
| n | the number of objects to draw; each value of the resulting distribution is an n-tuple |
| sorted | boolean: True means that the order of drawing is irrelevant and that the tuples are sorted in increasing order (default: False) |
| replacement | boolean: True means that the drawing is made WITH replacement, so the same element may occur several times in each tuple (default: False) |

The two boolean arguments sorted and replacement can be used independently of each other, leading to four cases. The best way to understand the role of these arguments is to show the result obtained for each of the four combinations. Let's define a non-uniform distribution with 3 objects:

#!python
>>> d = VP(('A',3),('B',2),('C',1))
>>> d
A : 3/6
B : 2/6
C : 1/6

case 1. No sorting, without replacement

#!python
>>> d.draw(2)
('A', 'B') : 20/60
('A', 'C') : 10/60
('B', 'A') : 15/60
('B', 'C') :  5/60
('C', 'A') :  6/60
('C', 'B') :  4/60

case 2. No sorting, with replacement

#!python
>>> d.draw(2,replacement=True)
('A', 'A') : 9/36
('A', 'B') : 6/36
('A', 'C') : 3/36
('B', 'A') : 6/36
('B', 'B') : 4/36
('B', 'C') : 2/36
('C', 'A') : 3/36
('C', 'B') : 2/36
('C', 'C') : 1/36

case 3. Sorting, without replacement

#!python
>>> d.draw(2,sorted=True)
('A', 'B') : 35/60
('A', 'C') : 16/60
('B', 'C') :  9/60

case 4. Sorting, with replacement

#!python
>>> d.draw(2,sorted=True,replacement=True)
('A', 'A') :  9/36
('A', 'B') : 12/36
('A', 'C') :  6/36
('B', 'B') :  4/36
('B', 'C') :  4/36
('C', 'C') :  1/36

Note: cases 3 and 4 (sorted=True) can be processed far more efficiently than the other cases, thanks to a special combinatorial algorithm due to Paul Moore. Note however that, in case 3 (replacement=False), this algorithm may be used ONLY if the probability distribution is uniform; this is not the case in the example above, but it fortunately applies in most "fair" games.
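
As a small usage sketch of the distributions above, the probability of drawing a pair (two identical letters) with replacement can be derived from case 4, using the map method seen earlier:

#!python
>>> P(d.draw(2,sorted=True,replacement=True).map(lambda t: t[0] == t[1]))
7/18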

Likelihood ratio (LR)

The likelihood ratio of a piece of evidence E for a hypothesis H is defined as

       P(E | H)
  LR = ------------
       P(E | not H)
It indicates how far the given evidence E confirms the given hypothesis H. It is traditionally used in medical research and forensic sciences.

As a toy problem, imagine that a die has been thrown but we have very limited means to gather information on the result. Let us assume that our hypothesis is "the die result is even". After investigation, we get a piece of evidence: the result is not 1. What is the LR of this evidence for our hypothesis? The Lea.lr(E,H) syntax performs this calculation:

#!python
>>> Lea.lr(die1 != 1, parity == 'even')
1.5
You can check this result by translating the LR definition above into other Lea primitives:
#!python
>>> Pf((die1 != 1).given(parity == 'even')) / Pf((die1 != 1).given(parity != 'even'))
1.5
Since LR > 1, the given evidence indeed supports the hypothesis to some degree (as our intuition suggests!).

Notes: the following rules are enforced, in this order:

  1. if E is impossible, then an exception is raised (LR = 0/0)
  2. if H is impossible, then an exception is raised (P(E | H) undefined)
  3. if H is certain, then an exception is raised (P(E | not H) undefined)
  4. if H is impossible given E, then LR = 0.0
  5. if H is certain given E, then LR = +infinity (Python's math.inf)

Examples:

#!python
>>> Lea.lr(die1 == 3, parity == 'even')
0.0
>>> Lea.lr(die1 == 2, parity == 'even')
inf

There is another syntax to calculate the LR: E.given(H).lr(); this flavour emphasises the numerator of the LR expression and provides the same result as the syntax seen above. It may be handier when the hypothesis is a conjunction of several sub-hypotheses, which can be put as arguments of given(…):

#!python
>>> (die1 >= 2).given(parity == 'even').lr()
1.5
>>> (die1 >= 2).given(parity == 'even', die1 <= 5).lr()
1.3333333333333333

Lea calculation and cloning methods

This section reveals some important inner workings of Lea.

The getAlea() method

Consider the following statements:

#!python
>>> die1 = V(1,2,3,4,5,6)
>>> die2 = V(1,2,3,4,5,6)
>>> dice = die1 + die2

die1 and die2 are instances of the Alea class: these define explicit probability distributions, stored as sequences of (value, probability) pairs. dice is an instance of the Flea2 class, which builds a tree structure having the Alea instances die1 and die2 as leaves and remembering that the function to apply is addition. At this stage, however, no calculation (i.e. no convolution) has been made. The dice probability distribution is calculated only when it is needed, typically when it has to be displayed (this is called "lazy evaluation"); at that point, the getAlea() method is called: it performs the required calculations (combining all the values of the Alea leaves and computing each resulting value with its associated probability) and returns the resulting distribution as a new Alea instance. So, when you execute

#!python
print (dice)
what happens behind the scenes is
#!python
_alea1 = dice.getAlea() 
print (_alea1)

The getAlea() method is implicitly called for displaying the probability distribution, getting the probability of a given value or calculating indicators (mean, variance, etc.). As a Lea user, you may occasionally need to call getAlea() yourself, should you need to calculate a probability distribution without displaying it. Note that calling getAlea() on an Alea instance requires no calculation: it returns the Alea instance itself; for example, die1.getAlea() simply returns die1.

Note that, to improve performance, the result of dice.getAlea() is cached in dice itself. So, even if displaying a complex variable takes a long time the very first time, the second and subsequent displays will be instantaneous.
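
A small sketch relying on the two behaviours just described (the caching and the pass-through on Alea instances):

#!python
>>> dice.getAlea() is dice.getAlea()
True
>>> die1.getAlea() is die1
True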

Now, you may ask: fine, but what is the point of this "lazy evaluation"? Why not make the calculations as soon as the dice are added? Why shouldn't dice be an Alea instance?

Here is the answer. In many simple cases, a direct evaluation would indeed do the job perfectly (the prototype version of Lea actually worked that way). HOWEVER, there are expressions that mix several interdependent variables, for which the cached result of dice.getAlea() is of no use. Consider for example:

#!python
print (dice - die1)
''' displays:
1 : 1/6
2 : 1/6
3 : 1/6
4 : 1/6
5 : 1/6
6 : 1/6
'''
print (dice - die1 - die2)
''' displays:
0 : 1
'''
print (P(dice == die1 + die2))
''' displays:
1
'''
print (dice.given(die1==1))
''' displays:
2 : 1/6
3 : 1/6
4 : 1/6
5 : 1/6
6 : 1/6
7 : 1/6
'''
In all these expressions, .getAlea() is NOT called on dice but on each printed expression as a whole. These can be correctly evaluated because dice has "remembered" its dependencies on die1 and die2.

The new() method

We have seen how easily you can create dependent variables in Lea. On some occasions, however, you may want to calculate a probability distribution from other variables and, for future uses, keep it independent of these variables. For example, you may want to throw several dice, or the same die several times. Assume that leaExpr is a Lea expression that depends on some variables; what you need is to build a new Alea instance from leaExpr, cutting off any dependency on leaExpr or on the variables leaExpr refers to.

Note first that the leaExpr.getAlea() call is NOT fully adequate for this purpose: as seen above, it creates a new Alea instance ONLY if leaExpr is not an Alea instance AND getAlea() has not been called on it yet (cache). For example:

#!python
dice2 = dice.getAlea()
dice3 = dice.getAlea()
print (P(dice2 == dice3))
''' displays:
1
'''
dice, dice2 and dice3 refer to the same event. To circumvent this problem, Lea provides the special method new(): leaExpr.new() produces a new, independent Alea instance at each call.
#!python
dice4 = dice.new()
dice5 = dice.new()
print (P(dice4 == dice5))
''' displays:
73/648
'''
Here, dice, dice4 and dice5 refer to independent events; it is equivalent to throwing a pair of dice three times.

The clone() method

The clone() method is kept for historical reasons and should normally no longer be used: call the aforementioned new() instead. clone() provides the same effect of creating an independent random variable from an existing one, but it does so by making a deep copy. The difference appears behind the scenes when the source object is an expression (i.e. not an Alea instance): new() evaluates the probability distribution and returns an Alea instance, while clone() returns an instance of the same class as the object on which it is applied, without evaluation. As an end user, you should not see any difference, except that calling clone() is always very fast, since it delays the evaluation (see the discussion of the getAlea() method above).
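
As a minimal sketch, clone() can replay the independence experiment made above with new(); the resulting fraction is the same:

#!python
diceC = dice.clone()
print (P(dice == diceC))
''' displays:
73/648
'''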

Monte-Carlo sampling

Random generation through MC (randomMC method)

We have seen in the first tutorial how to generate random samples with the .random(…) method. An important point must be understood about the way this method works. Consider a set of 9 different "dice", with numbers of faces ranging from 2 to 10. Let us make 20 throws of these 9 dice and get the sum:

#!python
>>> dice = tuple(Lea.interval(1,n) for n in range(2,11))
>>> sumDice = sum(dice)
>>> sumDice.random(20)
(32, 26, 33, 31, 28, 39, 32, 31, 38, 26, 26, 33, 33, 27, 26, 27, 33, 28, 47, 25)
You will note that the generation of the random sample takes some time. Why? Actually, behind the scenes, the actual probability distribution of sumDice is calculated (a heavy combinatorial process), which produces an Alea instance (see the previous section); then, from this instance, the 20 random samples are drawn. So, what actually happens is equivalent to this:
#!python
>>> _alea1 = sumDice.getAlea()
>>> _alea1.random(20)
(32, 26, 33, 31, 28, 39, 32, 31, 38, 26, 26, 33, 33, 27, 26, 27, 33, 28, 47, 25)
Note that, thanks to the cache mechanism, any further call to sumDice.random(...) will be very fast.

In some cases, using the .random(…) method becomes intractable because of combinatorial explosion. The randomMC(…) method offers an alternative implementation, which overcomes this limitation. Redoing the same example

#!python
>>> sumDice2 = sum(dice)
>>> sumDice2.randomMC(20)
(37, 35, 38, 34, 29, 31, 32, 21, 31, 30, 35, 39, 35, 40, 31, 22, 30, 29, 39, 31)
now produces random samples without any latency. How does it work? It does not try to calculate the final probability distribution; it just makes a random sample on each elementary distribution, then aggregates the results together. In the present case, for each of the 20 trials, it throws the 9 dice and sums up the results. This is closer to the real process, since it is basically a simulation of it.

Take care that this example does NOT entail that the randomMC method is always the better choice. The random and randomMC methods are complementary (see the sketch after this list):

  • random tends to be better when the end values result from long calculations and/or the combinatorics are affordable and/or the random sample to produce is large;
  • randomMC tends to be better when the end values result from short calculations and/or the combinatorics are voluminous and/or the random sample to produce is small.
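
A rough way to feel the difference on the 9-dice example is to time the first call of each method on a fresh, uncached instance. The sketch below makes no claim about actual figures, which depend on your machine; it merely shows that randomMC avoids the up-front combinatorial work:

#!python
import time

def chrono(f):
    # return the time, in seconds, taken by calling f()
    t0 = time.time()
    f()
    return time.time() - t0

freshSum = sum(Lea.interval(1,n) for n in range(2,11))
print (chrono(lambda: freshSum.randomMC(20)))  # fast: just simulates 20 throws
print (chrono(lambda: freshSum.random(20)))    # slower on first call: builds the full pmf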

Estimating probability distribution by MC sampling (estimateMC method)

The randomMC(…) method may serve another purpose than pure random sample generation. For calculations with a large combinatorial size, it may be used to estimate the resulting probability distribution from a random sample of a given size. It then offers an alternative to the getAlea() method, for which the exact result may be prohibitively long to obtain.

For instance, here is how to estimate the probability distribution of the sum of 9 dice seen above, based on a sample of 10000 trials:

#!python
>>> V(*sumDice2.randomMC(10000))
13 :   2/10000
14 :   1/10000
15 :   5/10000
16 :  16/10000
17 :  25/10000
18 :  36/10000
19 :  63/10000
20 :  87/10000
21 : 127/10000
22 : 174/10000

One benefit is that you can tune the sample size according to the precision you want. Since this approach is very useful, Lea provides a convenience method to do the job more easily, namely .estimateMC(…). The following

#!python
>>> sumDice2.estimateMC(10000)
produces an estimation similar to the one shown above.

This method of estimating a probability distribution by MC sampling is very useful in many situations with a large combinatorial size, like querying a complex Bayesian network.
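
Since the estimate is a regular Alea instance, the usual indicators and queries apply to it. A small sketch, assuming that estimateMC applies to boolean distributions just as to any other Lea instance:

#!python
est = sumDice2.estimateMC(10000)
print (est.mean)
# -> close to 31.5, the exact mean of sumDice2
print (Pf((sumDice2 >= 40).estimateMC(10000)))
# -> an estimate of the probability that the sum is at least 40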

More about JPD (Joint Probability Distributions)

With the asJoint method (see advanced tutorial - part 1), we have seen how to build a joint probability table from a probability distribution having, as values, tuples of equal size.

We shall elaborate here on the topic of joint probability distributions, showing how Lea can be used to read larger sets of data and compute statistics on them. First we shall see how to build a Lea JPD from a given CSV file or from a given pandas dataframe. Then, we shall show advanced techniques to extract data from the JPD, including filtering, marginalisation, grouping, etc.

Building JPD from a CSV file

For this example, we shall assume that we have a database of patients, from which we have extracted the data of 50 patients into a CSV file (the fields are: given name, surname, gender, title, birthday, bloodType, weight in kg, height in cm, smoker (Y/N)):

patients.csv

#!csv
Elaine,McLaughlin,female,Ms.,12-01-1984,O+,44.9,141.0,Y
Alan,Music,male,Mr.,30-12-1966,A-,61.3,181.0,Y
Debra,Roberts,female,Mrs.,01-12-1927,O+,79.6,168.0,N
Martine,Albright,female,Mrs.,05-01-1975,O+,46.0,156.0,N
Terry,Jordan,male,Mr.,28-12-1963,A-,96.9,181.0,N
Joshua,Spinney,male,Mr.,19-12-1952,O+,106.4,183.0,N
Tawanna,Short,female,Ms.,02-02-2012,O+,93.4,175.0,N
Frances,Moore,female,Ms.,07-01-1978,B+,91.6,164.0,N
Jon,Overbey,male,Mr.,12-01-1985,A+,60.9,164.0,N
Milton,Silva,male,Mr.,04-12-1932,O+,88.9,177.0,N
Damon,Kiser,male,Mr.,09-12-1938,A+,97.2,181.0,N
Ruby,Nunez,female,Mrs.,30-12-1966,O-,23.6,126.0,Y
David,Seguin,male,Mr.,21-01-1996,A+,28.8,118.0,N
Lillian,Wooley,female,Ms.,28-12-1964,A-,54.5,156.0,N
Lena,Sharp,female,Ms.,02-01-1971,AB+,58.4,170.0,N
Felecia,Decker,female,Mrs.,20-12-1953,O+,95.8,172.0,N
Belva,Dry,female,Mrs.,07-12-1936,AB+,76.9,151.0,N
Lawrence,McGovern,male,Mr.,06-01-1976,A+,16.8,91.0,Y
Shelley,Bailey,female,Ms.,14-12-1945,A+,91.2,157.0,N
Mildred,Heaps,female,Mrs.,01-01-1970,O+,67.8,162.0,N
Melinda,Edie,female,Ms.,20-12-1953,A+,34.5,145.0,N
Angela,Vaz,female,Mrs.,11-12-1941,AB+,30.3,126.0,N
Gilberto,Schultz,male,Mr.,02-01-1971,B+,63.5,175.0,N
Dianne,Zubia,female,Ms.,28-01-2006,O+,69.1,157.0,N
Truman,Conner,male,Mr.,10-12-1940,O-,105.4,176.0,N
Dorothy,Keegan,female,Mrs.,14-12-1945,A+,54.7,174.0,N
Pamela,Guzman,female,Ms.,11-12-1941,B-,70.9,154.0,Y
John,Davenport,male,Mr.,04-01-1974,B+,96.9,177.0,N
Dorothy,Christen,female,Ms.,05-12-1933,B+,98.0,171.0,N
Mary,Baird,female,Mrs.,26-01-2003,O+,61.3,170.0,N
Luis,Boyett,male,Mr.,17-01-1991,O+,103.4,170.0,N
Tricia,Lombardi,female,Mrs.,13-12-1944,A+,50.5,155.0,Y
Joshua,West,male,Mr.,17-01-1991,A+,14.9,93.0,Y
Jimmie,Martinez,male,Mr.,03-12-1930,B+,114.0,171.0,N
Carla,Moore,female,Ms.,24-12-1958,B-,61.5,154.0,N
Megan,Johanson,female,Mrs.,16-12-1948,B+,85.3,159.0,N
Jenna,Tipton,female,Ms.,06-01-1977,O+,80.1,155.0,Y
Goldie,Maestas,female,Mrs.,06-12-1934,O-,83.4,161.0,N
Jeffrey,Evatt,male,Mr.,26-12-1961,O+,99.6,183.0,N
Lloyd,Carroll,male,Mr.,16-12-1947,B-,98.6,189.0,N
Dennis,Akins,male,Mr.,07-01-1978,B+,66.2,174.0,N
Naomi,Paulus,female,Mrs.,04-02-2015,B+,73.4,167.0,N
Dallas,Iverson,male,Mr.,03-01-1973,A+,66.5,177.0,N
Thomas,Boren,male,Mr.,10-12-1939,A+,67.9,188.0,N
Daniel,Helms,male,Mr.,22-12-1956,A+,87.0,174.0,N
Muriel,Labrecque,female,Ms.,30-01-2009,B+,48.3,155.0,Y
Bernice,Humphries,female,Mrs.,26-01-2003,O+,23.2,123.0,N
Kimberly,Victory,female,Ms.,10-01-1982,A+,100.8,167.0,N
Pam,Beach,female,Mrs.,12-12-1942,O+,79.8,155.0,N
Rita,Hines,female,Ms.,06-01-1977,A+,51.5,166.0,N
To be able to execute the statements below, you should create this file by copy-pasting the data above. Note that these data are fictional; they have been created thanks to the excellent online tool Fake Name Generator.

Here is how you can read this file and build a joint probability distribution from it:

#!python
>>> patient = Lea.fromCSVFilename('patients.csv',colNames=('givenName','surname','gender','title','birthday','bloodType','weight{f}','height{f}','smoker{b}'))

Note that the given column names shall be valid Python identifiers (starting with a letter, no blanks, etc.). The special curly-brace suffix used for the last three names is NOT part of the name. By default (no suffix), all read values are stored as strings in the built JPD; a curly-brace suffix indicates that the corresponding read values shall be converted as follows:

| suffix | conversion type |
|---|---|
| {s} | string (this suffix can be omitted) |
| {i} | integer |
| {f} | float |
| {b} | boolean |

For converting strings to booleans, the following rules are applied (case insensitive):

  • '1', 't', 'true', 'y', 'yes' --> Python's True,
  • '0', 'f', 'false', 'n', 'no' --> Python's False

As we shall see, putting {f} on weight and height allows making comparisons and calculations on these attributes; putting {b} on smoker eases the queries by removing references to 'Y'/'N' (see below).
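
For instance, once smoker is converted to a boolean, its probability can be queried directly. A small sketch (the fraction follows from the 9 smokers among the 50 records above):

#!python
>>> P(patient.smoker)
9/50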

Once the JPD is built, you can access the attributes using standard Python syntax. For example, to get the distribution of genders and titles (marginalisation):

#!python
>>> patient.gender
female : 3/5
  male : 2/5
>>> patient.title
 Mr. : 4/10
Mrs. : 3/10
 Ms. : 3/10

We'll see more advanced queries in a later section.

Handling of CSV formats

The Lea.fromCSVFilename method as used above works fine provided that your CSV file has a format similar to that of our patients.csv example. Here are some indications on how to handle other CSV formats.

Some CSV files contain a first line to be interpreted as a header containing field names.

patients2.csv

#!csv
givenName,surname,gender,title,birthday,bloodType,weight,height,smoker
Elaine,McLaughlin,female,Ms.,12-01-1984,O+,44.9,141.0,Y
Alan,Music,male,Mr.,30-12-1966,A-,61.3,181.0,Y

In such a case, all you have to do is omit the colNames argument in the call to Lea.fromCSVFilename:

#!python
>>> patient2 = Lea.fromCSVFilename('patients2.csv')
Then, the names found in the first row are used to set the joint attributes. Note that the curly-brace suffixes can be added in the CSV file, with the same semantics as described above.

Another thing to know is that Lea.fromCSVFilename relies on Python's standard csv module and supports the same arguments (with the same defaults) as the csv.reader function, namely dialect and **fmtparams. So, assuming for instance that your CSV file is tab-delimited, you can use:

#!python
>>> myData = Lea.fromCSVFilename('data.csv',delimiter='\t')

A last point concerns possible problems with the CSV file's encoding. If such a problem occurs, you'll have to open the CSV file yourself with the right encoding and use the Lea.fromCSVFile method, which accepts the same arguments as Lea.fromCSVFilename except the first one, which is a file object instead of a filename. For example, in Python 3,

#!python
>>> with open('exoticData.csv','r',encoding='utf-8') as f:
...     myExoticData = Lea.fromCSVFile(f)

Building JPD from a pandas dataframe

If you do data analysis with the pandas library, you may want to build a Lea JPD from a pandas "dataframe". To do this, you could manage to extract raw tuples from the dataframe and use Lea's fromJoint method. Actually, Lea provides a dedicated method to do that, Lea.fromPandasDF. In the following example, we first build a pandas dataframe by reading the above-mentioned patients2.csv file (this time using a pandas method, not Lea's). Then, we build a JPD from that dataframe.

#!python
>>> import pandas as pd
>>> df = pd.DataFrame.from_csv('patients2.csv')
>>> patient2 = Lea.fromPandasDF(df)
The resulting JPD is the same as before, except that all the fields of the JPD are strings, since that is the field format used in the present dataframe. For other dataframe formats, the fields of the Lea JPD will mimic the dataframe's.
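
You can check the resulting JPD with the same marginalisation query as before (a sketch, assuming the dataframe kept all nine columns):

#!python
>>> patient2.gender
female : 3/5
  male : 2/5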

Creating variables from JPD attributes

As explained above, all JPD attributes are accessed using the standard dot syntax. To get lighter expressions, you can assign variables:

#!python
>>> givenName = patient.givenName
>>> surname = patient.surname
>>> gender = patient.gender
>>> 
This is perfectly fine, and the created variables act as handy aliases. Now, to avoid such boilerplate statements, Lea provides the makeVars method, which creates the variables for you:

#!python
>>> Lea.makeVars(patient,globals())
From now on, you can use the patient's attributes without the patient. prefix, viz. givenName, surname, gender, title, birthday, bloodType, weight, height and smoker. That's what we'll assume in the following sections.

Data mining on Lea JPD

Assuming that you have managed to get your data into a Lea JPD using the techniques explained in the previous sections, you can now start to query the data. In all the examples below, the returned fractions have to be interpreted primarily as frequencies in the given 50-patient sample; if the sample is large enough and sufficiently representative of the whole population, then these fractions can be interpreted as probabilities.

In what follows, it's important to understand that the attribute variables givenName, surname, etc. extracted from the patient JPD don't refer to objects calculated once and for all (in Lea's words, they are not Alea instances); they are special interdependent Lea objects that are lazily evaluated in the context of the expressions in which they appear.

Let's start with a basic example: the distribution of genders and titles.

#!python
>>> gender
female : 3/5
  male : 2/5
>>> title
 Mr. : 4/10
Mrs. : 3/10
 Ms. : 3/10

Using the cartesian product, we can see that genders and titles are dependent on each other, as we would expect:

#!python
>>> X(gender,title)
('female', 'Mrs.') : 3/10
 ('female', 'Ms.') : 3/10
   ('male', 'Mr.') : 4/10

Let's calculate the distribution of heights. To avoid a too lengthy display, we group the heights in bins of 10 cm, using Python's integer division.

#!python
>>> heightByTen = (height//10) * 10
>>> heightByTen
 90.0 :  2/50
110.0 :  1/50
120.0 :  3/50
140.0 :  2/50
150.0 : 12/50
160.0 :  8/50
170.0 : 15/50
180.0 :  7/50

Then, let's map these values into category labels:

#!python
>>> heightCat = heightByTen.map(lambda h: 'H-%03d-%03d'%(h,h+9))
>>> heightCat
H-090-099 :  2/50
H-110-119 :  1/50
H-120-129 :  3/50
H-140-149 :  2/50
H-150-159 : 12/50
H-160-169 :  8/50
H-170-179 : 15/50
H-180-189 :  7/50

We can check who is in the category 'H-120-129' and how tall they are (building English sentences, as an academic exercise!):

#!python
>>> (givenName+' is '+(height/100.).map(str)+' m high').given(heightCat=='H-120-129')
 Angela is 1.26 m high : 1/3
Bernice is 1.23 m high : 1/3
   Ruby is 1.26 m high : 1/3

Note that there seem to be only women in this category... This raises another question: how are heights distributed by gender?

#!python
>>> female = gender=='female'
>>> male = gender=='male'
>>> heightCat.given(female)
H-120-129 :  3/30
H-140-149 :  2/30
H-150-159 : 12/30
H-160-169 :  7/30
H-170-179 :  6/30
>>> heightCat.given(male)
H-090-099 : 2/20
H-110-119 : 1/20
H-160-169 : 1/20
H-170-179 : 9/20
H-180-189 : 7/20

Note in this example how the heightCat, male and female variables are interdependent. Another, more integrated, approach is to use the cartesian product:

#!python
>>> X(gender,heightCat)
('female', 'H-120-129') :  3/50
('female', 'H-140-149') :  2/50
('female', 'H-150-159') : 12/50
('female', 'H-160-169') :  7/50
('female', 'H-170-179') :  6/50
  ('male', 'H-090-099') :  2/50
  ('male', 'H-110-119') :  1/50
  ('male', 'H-160-169') :  1/50
  ('male', 'H-170-179') :  9/50
  ('male', 'H-180-189') :  7/50

Imagine now that you have to calculate the body mass index (BMI) of the patients. Lea's arithmetic does the job quite easily (note that we round the BMI to limit the number of distinct values):

#!python
>>> BMI = (weight/(height/100.)**2).map(round)
>>> BMI
15.0 : 2/50
16.0 : 1/50
17.0 : 1/50
18.0 : 1/50
19.0 : 5/50
20.0 : 3/50
21.0 : 5/50
22.0 : 2/50
23.0 : 2/50
26.0 : 3/50
28.0 : 4/50
29.0 : 1/50
30.0 : 5/50
31.0 : 1/50
32.0 : 3/50
33.0 : 2/50
34.0 : 5/50
36.0 : 2/50
37.0 : 1/50
39.0 : 1/50
>>> BMI.mean
26.22

We want to split the patients into "underweight", "normal" and "obese" categories. For this purpose, we define 3 mutually exclusive boolean variables.

#!python
>>> underweight = BMI < 18.5
>>> obese = BMI >= 30
>>> normal = ~underweight & ~obese
>>> P(underweight)
1/10
>>> P(obese)
2/5
>>> P(normal)
1/2
Note that we could also have defined a variable catBMI, mapping the BMI values to strings, using an if/elif construct or, more simply, Python's bisect module.
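
As a minimal sketch of this idea (catBMI is a name introduced here for illustration; the thresholds are those used above):

#!python
from bisect import bisect

# map each BMI value to its category, using the thresholds 18.5 and 30
catBMI = BMI.map(lambda b: ('underweight','normal','obese')[bisect((18.5,30.0),b)])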

As a final example, let's select the female patients who are underweight and aged 18 years or more; for these patients, we want to display the full name, age, height, weight and BMI. Let's first build the full name and age; to ease things, we calculate the approximate age for the year 2015, without considering the day and month of birth.

#!python
>>> fullName = title + ' ' + givenName + ' ' + surname
>>> age = 2015 - birthday[-4:].map(int)

Now, using these two new variables, we can answer the initial request:

#!python
>>> X(fullName,age,height,weight,BMI).given(female,underweight,age>=18)
('Mrs. Dorothy Keegan', 70, 174.0, 54.7, 18.0) : 1/3
    ('Mrs. Ruby Nunez', 49, 126.0, 23.6, 15.0) : 1/3
   ('Ms. Melinda Edie', 62, 145.0, 34.5, 16.0) : 1/3

Machine Learning with Lea

So far, most of the examples have demonstrated various ways to calculate probability distributions from known probability distributions. However, apart from the trivial cases of uniform distributions found in gambling, it can be difficult to determine prior probability distributions for a given phenomenon. If no quantitative data are available for the phenomenon at hand, you have to make assumptions based on some theoretical model, provided it is not too complex. On the other hand, if quantitative data are available, then it is possible, with dedicated algorithms, to calculate prior probability distributions that tend to fit the given data. This latter approach belongs to the field of Machine Learning (ML). In the present section, we present a couple of basic techniques to "learn" probabilities from data.

Frequentist inference

Actually, we have already presented a very basic ML approach: counting occurrences of a given phenomenon and deriving probabilities as calculated frequencies is indeed a way to "learn" from the data.

For instance, frequencies of English letters can be learned from text data:

#!python
>>> data = "to be or not to be: that is the question"
>>> freq1 = Lea.fromSeq(data)
>>> freq1
  : 9/40
: : 1/40
a : 1/40
b : 2/40
e : 4/40
h : 2/40
i : 2/40
n : 2/40
o : 5/40
q : 1/40
r : 1/40
s : 2/40
t : 7/40
u : 1/40 
Hint: as seen before, a shorter equivalent expression is freq1 = V(*data).

Generally speaking, the larger the data, the more meaningful the result (this is the assumption made with Big Data!). Frequencies of groups of two letters can easily be calculated as well, using the pairwise function (see the implementation in Python's "Itertools Recipes").
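
The following is a minimal variant of that recipe, returning a tuple so that it can be passed to Lea.fromSeq:

#!python
from itertools import tee

def pairwise(iterable):
    # pairwise('ABC') -> ('A','B'), ('B','C')
    a, b = tee(iterable)
    next(b, None)
    return tuple(zip(a, b))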

#!python
>>> freq2 = Lea.fromSeq(pairwise(data))
Then, you can examine the frequencies of transitions from a 't':
#!python
>>> freq2.given(freq2[0] == 't')
('t', ' ') : 2/7
('t', 'h') : 2/7
('t', 'i') : 1/7
('t', 'o') : 2/7
... or to a 't':
#!python
>>> freq2.given(freq2[1] == 't')
(' ', 't') : 3/6
('a', 't') : 1/6
('o', 't') : 1/6
('s', 't') : 1/6

Another approach to studying transitions between letters is to build a Markov chain from the given data, as already explained in the second part of this tutorial:

#!python
>>> mc = markov.Chain.fromSeq(data)
>>> mc

  -> b : 2/9
  -> i : 1/9
  -> n : 1/9
  -> o : 1/9
  -> q : 1/9
  -> t : 3/9
:
  ->   : 1
a
  -> t : 1
b
  -> e : 1
e
  ->   : 2/4
  -> : : 1/4
  -> s : 1/4
h
  -> a : 1/2
  -> e : 1/2
i
  -> o : 1/2
  -> s : 1/2
n
  -> o : 1
o
  ->   : 2/5
  -> n : 1/5
  -> r : 1/5
  -> t : 1/5
q
  -> u : 1
r
  ->   : 1
s
  ->   : 1/2
  -> t : 1/2
t
  ->   : 2/7
  -> h : 2/7
  -> i : 1/7
  -> o : 2/7
u
  -> e : 1

Building a Bayesian network from a joint table

If you reconsider the examples of data mining on a joint table seen above, now interpreting the frequencies as probabilities, then you get another example of ML through the frequentist approach. Admittedly, such a simple statistical approach could be qualified as "Machine Learning for the layman"! A more ambitious goal is to calculate the probabilities involved in a Bayesian network. Lea provides a method for this purpose: it can calculate the CPT entries, provided that you define the dependency graph of the BN. This method, named Lea.buildBNfromJoint, makes the following assumptions:

  1. you study a set of N discrete random variables, which are possibly interdependent and for which the pmf are unknown;
  2. you have a joint table with S samples giving the values of each of the N variables (i.e. a table of S rows and N columns); we assume that these data are rich enough to cover all values of the domain of each random variable;
  3. you have defined the dependencies of the BN, by means of a directed acyclic graph linking the random variables.

Let us illustrate these assumptions with an example that reuses the patient joint table seen above. Let us consider the three assumptions one by one:

  1. we focus on the five random variables gender, bloodType, smoker, height and weight (N = 5);
  2. these 5 attributes are present in the patient joint table, which contains fifty records (S = 50);
  3. we assume a BN with the following dependencies between the five random elements:
                   gender  bloodType
                     |  \      /
                     |   \    /
             height  |   smoker      
                 \   |    /
                  \  |   /
                   weight

The Lea.buildBNfromJoint method allows "learning" the different CPTs of this BN from the data present in the patient table. Each argument of buildBNfromJoint is a tuple giving the names of the antecedent nodes and the name of the dependent node:

#!python
bn = patient.buildBNfromJoint(
  (('gender','bloodType'), 'smoker'),
  (('gender','height','smoker'), 'weight'))
Now, bn is a full-fledged BN on which you can make queries, such as calculating the gender probability of a person whose weight is at least 110 and less than 120:
#!python
print (bn.gender.given(110 <= bn.weight, bn.weight < 120))
# -> female : 166052/781997
#      male : 615945/781997
print (Pf((bn.gender=='male').given(110 <= bn.weight, bn.weight < 120)))
# -> 0.7876564743854516
You may wonder: what is the gain of building a BN, when the initial joint table already offers the ability to make the very same query? There are two answers to this question. The first one concerns data size: the space required to store the BN data is far smaller than the space required to store the full joint table; to some extent, the BN compresses the joint table. The second answer is that the joint table stores the actual raw data, while the BN captures general dependency patterns; the joint table thus tends to reproduce accidental peculiarities of the sample. This phenomenon, known as overfitting, is blatant if we redo the query seen before on the patient joint table instead of the built-up BN:
#!python
print (Pf((patient.gender=='male').given(110 <= patient.weight, patient.weight < 120)))
# -> 1.0
Here, with the joint table, the given evidence on the patient's weight makes the gender certain. This result is dubious, but it is in line with the joint table:
#!python
print (patient.given(110 <= patient.weight, patient.weight < 120))
# -> _(givenName='Jimmie', surname='Martinez', gender='male', title='Mr.', birthday='03-12-1930', bloodType='B+', weight=114.0, height=171.0, smoker=False) : 1
There is only one patient in the given weight range, who happens to be a man. This is a trivial instance of overfitting. In the present case, the BN provides a more nuanced answer, which sounds more sensible.

Improving the Bayesian network

Of course, the merits of a BN are directly linked to the quality of the dependency graph designed by the human expert who models the phenomenon. In this regard, the BN graph given above can certainly be improved. To illustrate some advanced techniques, let us revisit the BN by including the patient's age: it seems sensible to add age as an influencer of height, as shown in the following graph:

         age       gender  bloodType
           \      /  |  \      /
            \    /   |   \    /
            height   |   smoker      
                \    |    /
                 \   |   /
                   weight
Now, the problem is that age is not an attribute of the patient joint table. Only birthday is available, which is a string, hence not adequate for a BN variable. As seen previously, assuming the year of the study is 2015, the age can be approximated with the following definition:
#!python
age = 2015 - patient.birthday[-4:].map(int)
Based on this definition, the joint table can now be augmented with the new column. Since each patient row is a named tuple, the trick is to convert age to a 1-tuple through the map method, append it to the patient tuples and then use the asJoint method to redefine the column names. This can be done with the following one-liner:
#!python
patient2 = (patient + age.map(lambda x:(x,))).asJoint('givenName','surname','gender','title','birthday','bloodType','weight','height','smoker','age')
You can check that each row of the patient2 joint table contains a new column age, which is in line with birthday. Now that we have all the data needed, we can build the new BN sketched above:
#!python
bn2 = patient2.buildBNfromJoint(
  (('gender','age'), 'height'),
  (('gender','bloodType'), 'smoker'),
  (('gender','height','smoker'), 'weight'))
This new BN gives different results from the previous BN, which ignored the dependency on age:
#!python
print (Pf((bn2.gender=='male').given(110 <= bn2.weight, bn2.weight < 120)))
# -> 0.8397314844204872
We can thus see the impact of the added dependency of height on age, even though these two variables are absent from the query.
