Lea Advanced Tutorial - Part 3/3
Table of Contents
- Introduction
- New convenience functions: P, Pf, B, V, VP and X
- 'given' method with multiple arguments
- Other new convenience functions and objects (games)
- PMF histograms using matplotlib
- Drawing objects
- Likelihood ratio (LR)
- Lea calculation and cloning methods
- Monte-Carlo sampling
- More about JPD (Joint Probability Distributions)
- Machine Learning with Lea
Introduction
WARNING: Lea 2 is no longer maintained. If you are new to Lea, you are invited to use Lea 3, which has a comprehensive tutorial.
The present page is the third part of the advanced tutorial for the Lea 2 package. It mainly covers new features introduced in Lea 2.2. It assumes that you have installed Lea 2.2.x and that you are familiar with the techniques presented in the previous parts of the tutorial.
New convenience functions: P, Pf, B, V, VP and X
Lea 2.2 provides several convenience functions as shortcuts for the most commonly used Lea methods. They are given in the following table, where we assume that boolLea is a boolean Lea instance and argsLea is a sequence of Lea instances.
expression | shortcut equivalent | what it returns |
---|---|---|
boolLea.p(True) | P(boolLea) | a fraction giving the probability that boolLea is true |
boolLea.P | -same- | -same- |
boolLea.pmf(True) | Pf(boolLea) | a float giving the probability that boolLea is true |
boolLea.Pf | -same- | -same- |
Lea.boolProb(n,d) | B(n,d) | a boolean probability distribution such that P(True) = n/d |
Lea.fromVals(*args) | V(*args) | a probability distribution with P(v) = frequency of v in args |
Lea.fromValFreqs(*args) | VP(*args) | a probability distribution with P(v) given by args = sequence of tuples (v,P(v)) |
vLea.cprod(*argsLea) | X(*argsLea) | a probability distribution corresponding to the cartesian product of argsLea |
Example:
```
#!python
>>> from lea import *
>>> rain = B(1,4)
>>> P(rain)
1/4
>>> Pf(rain)
0.25
>>> die1 = V(1,2,3,4,5,6)
>>> P(die1 <= 4)
2/3
>>> Pf(die1 <= 4)
0.6666666666666666
>>> parityBit = die1 % 2
>>> parity = parityBit.map(lambda v: 'even' if v==0 else 'odd ')
>>> P(parity == 'even')
1/2
>>> X(die1,parityBit,parity)
(1, 1, 'odd ') : 1/6
(2, 0, 'even') : 1/6
(3, 1, 'odd ') : 1/6
(4, 0, 'even') : 1/6
(5, 1, 'odd ') : 1/6
(6, 0, 'even') : 1/6
>>> die1.given(parity=='even')
2 : 1/3
4 : 1/3
6 : 1/3
>>> P((die1==2).given(parity=='even'))
1/3
>>> P((die1==1).given(parity=='even'))
0
```
Note that the P and Pf shortcuts present a small difference with .p(True) and .pmf(True): the shortcuts check that the probability distribution is actually boolean and raise an exception if it is not.
```
#!python
>>> die1.p(True)
1/6
>>> P(die1)
lea.Error: found <int> value although <bool> is expected
```
The first result is misleading: die1.p(True) returns 1/6 because Python's True is convertible to the integer value 1. To avoid such silent inconsistency, the P and Pf shortcuts should be preferred.
'given' method with multiple arguments
As of Lea 2.2, the given method, used for conditional probabilities and Bayesian reasoning, can take multiple evidence conditions as arguments. This is a handy alternative to ANDing the conditions together. Example:
```
#!python
>>> die1.given((2<=die1) & (die1<5) & (parity=='even'))
2 : 1/2
4 : 1/2
```

The same result is obtained by passing the conditions as separate arguments:

```
#!python
>>> die1.given(2<=die1, die1<5, parity=='even')
2 : 1/2
4 : 1/2
```
This presents several advantages:
- The expression is simpler since parentheses can be removed.
- A set of conditions can be passed easily (with * prefix), which expresses more naturally a set of evidences.
- Performance can be greatly improved compared to ANDing the evidences, because the given method is optimised with "short-circuit evaluation", i.e. the exploration can be pruned as soon as a condition evaluates to false.
The sole functional difference between the two approaches is that an exception raised by the ANDed form may not occur with the new technique, since some value combinations may be dropped by short-circuit evaluation (the same happens with Python's and).
```
#!python
>>> die1.given((2<=die1) & (6//(die1-1)==2))
ZeroDivisionError: integer division or modulo by zero
>>> die1.given(2<=die1, 6//(die1-1)==2)
4 : 1
```
Other new convenience functions and objects (games)
Lea 2.2 provides convenience functions for common games.
expression | what it returns |
---|---|
die(f) | probability distribution of a fair die with faces numbered 1, … , f |
dice(n,f) | probability distribution of the total of n fair dice with faces numbered 1, … , f |
diceSeq(n,f) | probability distribution of each combination of n fair dice with faces numbered 1, … , f, irrespective of the order of the dice |
diceSeq(n,f,False) | probability distribution of each combination of n fair dice with faces numbered 1, … , f, taking into account the order of the dice |
Examples:
```
#!python
>>> from lea import *
>>> die(6)
1 : 1/6
2 : 1/6
3 : 1/6
4 : 1/6
5 : 1/6
6 : 1/6
>>> dice(2,6)
2 : 1/36
3 : 2/36
4 : 3/36
5 : 4/36
6 : 5/36
7 : 6/36
8 : 5/36
9 : 4/36
10 : 3/36
11 : 2/36
12 : 1/36
>>> diceSeq(2,6)
(1, 1) : 1/36
(1, 2) : 2/36
(1, 3) : 2/36
(1, 4) : 2/36
(1, 5) : 2/36
(1, 6) : 2/36
(2, 2) : 1/36
(2, 3) : 2/36
(2, 4) : 2/36
(2, 5) : 2/36
(2, 6) : 2/36
(3, 3) : 1/36
(3, 4) : 2/36
(3, 5) : 2/36
(3, 6) : 2/36
(4, 4) : 1/36
(4, 5) : 2/36
(4, 6) : 2/36
(5, 5) : 1/36
(5, 6) : 2/36
(6, 6) : 1/36
>>> diceSeq(2,6,False)
(1, 1) : 1/36
(1, 2) : 1/36
(1, 3) : 1/36
(1, 4) : 1/36
(1, 5) : 1/36
(1, 6) : 1/36
(2, 1) : 1/36
(2, 2) : 1/36
(2, 3) : 1/36
(2, 4) : 1/36
(2, 5) : 1/36
(2, 6) : 1/36
(3, 1) : 1/36
(3, 2) : 1/36
(3, 3) : 1/36
(3, 4) : 1/36
(3, 5) : 1/36
(3, 6) : 1/36
(4, 1) : 1/36
(4, 2) : 1/36
(4, 3) : 1/36
(4, 4) : 1/36
(4, 5) : 1/36
(4, 6) : 1/36
(5, 1) : 1/36
(5, 2) : 1/36
(5, 3) : 1/36
(5, 4) : 1/36
(5, 5) : 1/36
(5, 6) : 1/36
(6, 1) : 1/36
(6, 2) : 1/36
(6, 3) : 1/36
(6, 4) : 1/36
(6, 5) : 1/36
(6, 6) : 1/36
```
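If you want to cross-check such results without Lea, the distribution computed by dice(2,6) can be reproduced by a short plain-Python enumeration using exact fractions (a sketch independent of Lea's implementation):

```python
from fractions import Fraction
from itertools import product

def dice_total_pmf(n, f):
    # Enumerate all f**n ordered outcomes of n fair dice and tally the totals.
    counts = {}
    for outcome in product(range(1, f + 1), repeat=n):
        total = sum(outcome)
        counts[total] = counts.get(total, 0) + 1
    weight = Fraction(1, f ** n)
    return {total: count * weight for total, count in counts.items()}

pmf = dice_total_pmf(2, 6)
print(pmf[7])   # prints 1/6, i.e. the 6/36 shown above
```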
Several convenience objects are also defined.
expression | what it is |
---|---|
D6 | probability distribution of a fair die with 6 faces (same as die(6)) |
cardRank | probability distribution of the rank of a card, among 'A', '2', '3', '4', '5', '6', '7', '8', '9', 'T', 'J', 'Q' and 'K' |
cardSuite | probability distribution of the suit of a card, among 'S' (Spades), 'H' (Hearts), 'D' (Diamonds) and 'C' (Clubs) |
card | probability distribution of a card drawn from a 52-card deck (same as cardRank + cardSuite) |
Examples:
```
>>> cardSuite
C : 1/4
D : 1/4
H : 1/4
S : 1/4
>>> card.given(cardRank=='A')
AC : 1/4
AD : 1/4
AH : 1/4
AS : 1/4
>>> card.given(cardSuite=='S')
2S : 1/13
3S : 1/13
4S : 1/13
5S : 1/13
6S : 1/13
7S : 1/13
8S : 1/13
9S : 1/13
AS : 1/13
JS : 1/13
KS : 1/13
QS : 1/13
TS : 1/13
>>> card.randomDraw(13)
('AD', 'JC', 'TD', 'JS', '9C', 'KC', 'KD', 'JD', '5D', 'TH', '6S', 'AC', '3C')
```
PMF histograms using matplotlib
As of version 2.2, Lea can plot the probability mass function of a given Lea instance. This requires the matplotlib package to be installed. Then, all you have to do is invoke the plot() method on any Lea instance. Example:
```
#!python
dice(3,6).plot()
```
By default, the plot() method draws the pmf in a dedicated window. Several arguments can be specified to customise this behaviour, including saving the chart in an image file. Here is the full signature, with default values:
```
#!python
leaInstance.plot(title=None,fname=None,savefigArgs=dict(),**barArgs)
```
argument | description |
---|---|
title | a string giving the title to display (no title if None) |
fname | if not None, the chart is saved in the file specified by fname, as described for the matplotlib.pyplot.savefig function |
savefigArgs | (used only if fname is not None) a dictionary of named arguments relayed to the matplotlib.pyplot.savefig function |
barArgs | named arguments relayed to the matplotlib.pyplot.bar function |
Example:
```
#!python
dice(3,6).plot(title='3D6',fname='pmf_3d6.png',savefigArgs=dict(bbox_inches='tight'),color='green')
```
Drawing objects
In several situations, we are interested in a sequence of values obtained from the same random source: lottery, card drawing, dice throws, etc. Such sequences can be defined using different protocols:
- with or without replacement: whether each drawn value is withdrawn from the remaining values (no duplicates)
- sorted or not: whether each sequence shall be sorted (i.e. whether the order of drawings is irrelevant)
The draw method addresses this need, supporting the different protocols through two boolean arguments. The signature is as follows:
```
#!python
leaInstance.draw(n,sorted=False,replacement=False)
```
argument | description |
---|---|
n | the number of objects to draw; each value of the resulting distribution will be an n-tuple |
sorted | boolean: True means that the order of drawing is irrelevant and the tuples are sorted in increasing order (default: False) |
replacement | boolean: True means that the drawing is made WITH replacement, so the same element may occur several times in each tuple (default: False) |
The two boolean arguments sorted and replacement can be used independently of each other, leading to four cases. The best way to understand the role of these arguments is to show the result obtained for each of the four combinations. Let's define a non-uniform distribution with 3 objects:
```
#!python
>>> d = VP(('A',3),('B',2),('C',1))
>>> d
A : 3/6
B : 2/6
C : 1/6
```
case 1. No sorting, without replacement
```
#!python
>>> d.draw(2)
('A', 'B') : 20/60
('A', 'C') : 10/60
('B', 'A') : 15/60
('B', 'C') : 5/60
('C', 'A') : 6/60
('C', 'B') : 4/60
```
case 2. No sorting, with replacement
```
#!python
>>> d.draw(2,replacement=True)
('A', 'A') : 9/36
('A', 'B') : 6/36
('A', 'C') : 3/36
('B', 'A') : 6/36
('B', 'B') : 4/36
('B', 'C') : 2/36
('C', 'A') : 3/36
('C', 'B') : 2/36
('C', 'C') : 1/36
```
case 3. Sorting, without replacement
```
#!python
>>> d.draw(2,sorted=True)
('A', 'B') : 35/60
('A', 'C') : 16/60
('B', 'C') : 9/60
```
case 4. Sorting, with replacement
```
#!python
>>> d.draw(2,sorted=True,replacement=True)
('A', 'A') : 9/36
('A', 'B') : 12/36
('A', 'C') : 6/36
('B', 'B') : 4/36
('B', 'C') : 4/36
('C', 'C') : 1/36
```
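These figures can be verified outside Lea with a small enumeration over ordered pairs; here is a plain-Python sketch (not Lea's code) reproducing case 1, where a drawn object's weight is removed from the pool before the second draw:

```python
from fractions import Fraction
from itertools import permutations

def draw2_without_replacement(weights):
    # Ordered draws of 2 distinct objects from a weighted pool;
    # the first object's weight is removed before the second draw.
    total = sum(weights.values())
    pmf = {}
    for a, b in permutations(weights, 2):
        pmf[(a, b)] = (Fraction(weights[a], total)
                       * Fraction(weights[b], total - weights[a]))
    return pmf

pmf = draw2_without_replacement({'A': 3, 'B': 2, 'C': 1})
print(pmf[('A', 'B')])   # prints 1/3, i.e. the 20/60 shown above
```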
Note: the processing of cases 3 and 4 (sorted=True) can be far more efficient than the other cases, thanks to a special combinatorial algorithm due to Paul Moore. Note however that, in case 3 (replacement=False), this algorithm can be used ONLY if the probability distribution is uniform; this is not the case in the example above, but it fortunately applies in most "fair" games.
Likelihood ratio (LR)
The likelihood ratio of an evidence E given an hypothesis H is defined as
```
        P(E | H)
LR = --------------
       P(E | not H)
```
As a toy problem, imagine that a die has been thrown but we have very limited ways to gather information on the result. Let us assume that our hypothesis is "the die result is even". After investigation, we get an evidence: the result is not 1. What is the LR of this evidence for our hypothesis? The Lea.lr(E,H) syntax makes this calculation:
```
#!python
>>> Lea.lr(die1 != 1, parity == 'even')
1.5
```
This is equivalent to evaluating the defining ratio explicitly:

```
#!python
>>> Pf((die1 != 1).given(parity == 'even')) / Pf((die1 != 1).given(parity != 'even'))
1.5
```
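As a cross-check independent of Lea, the same LR can be computed by direct counting over the six faces of a fair die (a plain-Python sketch):

```python
from fractions import Fraction

faces = range(1, 7)
even = [f for f in faces if f % 2 == 0]   # hypothesis H: result is even
odd = [f for f in faces if f % 2 == 1]    # not H
# Evidence E: the result is not 1.
p_E_given_H = Fraction(sum(1 for f in even if f != 1), len(even))
p_E_given_not_H = Fraction(sum(1 for f in odd if f != 1), len(odd))
lr = p_E_given_H / p_E_given_not_H
print(float(lr))   # prints 1.5
```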
Notes: the following rules are enforced, in this sequence:
- if E is impossible, then an exception is raised (LR = 0/0)
- if H is impossible, then an exception is raised (P(E | H) undefined)
- if H is certain, then an exception is raised (P(E | not H) undefined)
- if H is impossible given E, then LR = 0.0
- if H is certain given E, then LR = +infinity (Python's math.inf)
Examples:
```
#!python
>>> Lea.lr(die1 == 3, parity == 'even')
0.0
>>> Lea.lr(die1 == 2, parity == 'even')
inf
```
There is another syntax to calculate the LR: E.given(H).lr(); this flavour emphasises the numerator of the LR expression and provides the same result as the syntax seen above. It may be handier for expressing a hypothesis that is a conjunction of several sub-hypotheses, which can be passed as arguments of given(…):
```
#!python
>>> (die1 >= 2).given(parity == 'even').lr()
1.5
>>> (die1 >= 2).given(parity == 'even', die1 <= 5).lr()
1.3333333333333333
```
Lea calculation and cloning methods
The present section reveals some important inner workings of Lea.
The getAlea() method
Consider the following statements:
```
#!python
>>> die1 = V(1,2,3,4,5,6)
>>> die2 = V(1,2,3,4,5,6)
>>> dice = die1 + die2
```
die1 and die2 are instances of the Alea class: these define explicit probability distributions, stored as sequences of (value, probability) pairs. dice is an instance of the Flea2 class, which builds a tree structure having the Alea instances die1 and die2 as leaves and remembers that the function to apply is addition. However, no calculation (i.e. no convolution) is made so far. The dice probability distribution is calculated only at the time it is needed, typically when its display is required (this is called "lazy evaluation"); then, the method getAlea() is called: it performs the required calculations (combining all the values of the Alea leaves, computing each resulting value with its associated probability) and returns the resulting distribution as a new Alea instance.
So, when you execute

```
#!python
print (dice)
```

what actually happens behind the scenes is equivalent to:

```
#!python
_alea1 = dice.getAlea()
print (_alea1)
```
The getAlea() method is implicitly called when displaying a probability distribution, getting the probability of a given value or calculating indicators (mean, variance, etc.). As a Lea user, you may need on some occasions to call getAlea() explicitly, should you need to calculate a probability distribution without displaying it. Note that calling getAlea() on an Alea instance requires no calculation: it returns the Alea instance itself; for example, die1.getAlea() simply returns die1.
Note that, to improve performance, the result of dice.getAlea() is cached in dice itself. So, should displaying a complex variable take a long time the very first time, the second and subsequent displays will be instantaneous.
Now, you may ask: fine, but what is the aim of this "lazy evaluation" thing? Why not make the calculations as soon as the dice are added? Why should dice not be an Alea instance?
Here is the answer. Indeed, in many simple cases, a direct evaluation would do the job perfectly (the prototype version of Lea actually did that). HOWEVER, there are expressions that mix several interdependent variables, for which the cached result of dice.getAlea() is of no use. Consider for example:
```
#!python
print (dice - die1)
''' displays:
1 : 1/6
2 : 1/6
3 : 1/6
4 : 1/6
5 : 1/6
6 : 1/6
'''
print (dice - die1 - die2)
''' displays:
0 : 1
'''
print (P(dice == die1 + die2))
''' displays:
1
'''
print (dice.given(die1==1))
''' displays:
2 : 1/6
3 : 1/6
4 : 1/6
5 : 1/6
6 : 1/6
7 : 1/6
'''
```
In these expressions, .getAlea() is NOT called on dice but on each printed expression as a whole. These can be correctly evaluated because dice has "remembered" its dependencies on die1 and die2.
The new() method
We have seen how easily you can create dependent variables in Lea. On some occasions, however, you may want to calculate a probability distribution from other variables and, for future uses, keep it independent of these variables. For example, you may want to throw several dice, or the same die several times. Assume that leaExpr is a Lea expression that depends on some variables; what you need now is to build a new Alea instance from leaExpr, cutting off any dependency on leaExpr or on the variables leaExpr refers to.
Note first that the leaExpr.getAlea() call is NOT fully adequate for this purpose: as seen above, it creates a new Alea instance ONLY if leaExpr is not an Alea instance AND leaExpr.getAlea() has not been called yet (cache). For example:
```
#!python
dice2 = dice.getAlea()
dice3 = dice.getAlea()
print (P(dice2 == dice3))
''' displays:
1
'''
```
Here, dice, dice2 and dice3 refer to the same event. To circumvent this problem, Lea provides the special method .new(): leaExpr.new() produces a new independent Alea instance at each call.
```
#!python
dice4 = dice.new()
dice5 = dice.new()
print (P(dice4 == dice5))
''' displays:
73/648
'''
```
Now, dice, dice4 and dice5 refer to independent events; it is equivalent to throwing a pair of dice three times.
The clone() method
The clone() method is historical and should normally not be used anymore: call the aforementioned new() instead. clone() provides the same effect of creating an independent random variable from an existing one, but it does so by making a deep copy. The difference appears behind the scenes when the source object is an expression (i.e. not an Alea instance): new() evaluates the probability distribution and returns an Alea instance, while clone() returns an instance of the same class as the object on which it is applied, without evaluation. As an end user, you shall not see any difference, except that calling clone() is always very fast since it delays the evaluation (see the discussion of the getAlea() method above).
Monte-Carlo sampling
Random generation through MC (randomMC method)
We have seen in the first tutorial how to generate random samples with the .random(…) method. An important point must be understood about the way this method works. Consider a set of 9 different "dice", with numbers of faces ranging from 2 to 10. Let us make 20 throws of these 9 dice and get the sum:
```
#!python
>>> dice = tuple(Lea.interval(1,n) for n in range(2,11))
>>> sumDice = sum(dice)
>>> sumDice.random(20)
(32, 26, 33, 31, 28, 39, 32, 31, 38, 26, 26, 33, 33, 27, 26, 27, 33, 28, 47, 25)
```
Behind the scenes, the probability distribution of sumDice is calculated first (a heavy combinatorial process), producing an Alea instance (see the previous section); then, from this instance, the 20 random samples are obtained. So, what actually happens is equivalent to this:
```
#!python
>>> _alea1 = sumDice.getAlea()
>>> _alea1.random(20)
(32, 26, 33, 31, 28, 39, 32, 31, 38, 26, 26, 33, 33, 27, 26, 27, 33, 28, 47, 25)
```
Thanks to the caching of this Alea instance, any subsequent call to sumDice.random(...) will be very fast.
In some cases, using the .random(…) method becomes intractable because of combinatorial explosion. The randomMC(…) method provides an alternative implementation, which tackles this limitation. Let's redo the same example:
```
#!python
>>> sumDice2 = sum(dice)
>>> sumDice2.randomMC(20)
(37, 35, 38, 34, 29, 31, 32, 21, 31, 30, 35, 39, 35, 40, 31, 22, 30, 29, 39, 31)
```
Take care that this example does NOT entail that the randomMC method is the real McCoy. The random and randomMC methods are complementary:
- random tends to be better when the end values result from long calculations, when the combinatorics are affordable, and/or when the random sample to produce is large;
- randomMC tends to be better when the end values result from short calculations, when the combinatorics are voluminous, and/or when the random sample to produce is small.
Estimating probability distribution by MC sampling (estimateMC method)
The randomMC(…) method may serve another purpose than pure random sample generation. In cases with large combinatorics, it may be used to estimate the resulting probability distribution from a random sample of a given size. It then offers an alternative to the getAlea() method, for which the exact result may be prohibitively long to compute.
For instance, here is how to estimate the probability distribution of the sum of 9 dice seen above, based on a sample of 10000 trials:
```
#!python
>>> V(*sumDice2.randomMC(10000))
13 : 2/10000
14 : 1/10000
15 : 5/10000
16 : 16/10000
17 : 25/10000
18 : 36/10000
19 : 63/10000
20 : 87/10000
21 : 127/10000
22 : 174/10000
…
```
One benefit is that you can tune the sample size according to the precision you want. Since this approach is very useful, Lea provides a convenience method to do the job more easily, namely .estimateMC(…). The following call produces the same kind of estimate:

```
#!python
>>> sumDice2.estimateMC(10000)
```
This method of estimating a probability distribution by MC sampling is very useful in many situations with a large combinatorial size, like querying a complex Bayesian network.
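Such an MC estimate amounts to counting sample frequencies; outside Lea, the same principle can be sketched with collections.Counter (illustrative code, not Lea's):

```python
import random
from collections import Counter
from fractions import Fraction

# Draw 10000 trials of the 9-dice sum (faces 2..10), then turn the
# sample counts into an estimated PMF of exact fractions over 10000.
rng = random.Random(42)
samples = [sum(rng.randint(1, f) for f in range(2, 11))
           for _ in range(10000)]
counts = Counter(samples)
est = {total: Fraction(c, len(samples)) for total, c in counts.items()}
```

Increasing the sample size tightens the estimate around the exact distribution.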
More about JPD (Joint Probability Distributions)
With the asJoint method (see advanced tutorial, part 1), we have seen how to build a joint probability distribution from a probability distribution having, as values, tuples of equal size. We shall elaborate here on the topic of joint probability distributions (JPD), showing how Lea can be used to read larger sets of data and compute statistics on them. First we shall see how to build a Lea JPD from a given CSV file or from a given pandas dataframe. Then, we shall show advanced techniques to extract data from the JPD, including filtering, marginalisation, grouping, etc.
Building JPD from a CSV file
For the example, we shall assume that we have a database of patients, from which we have extracted the data of 50 patients into a CSV file (the fields are: given name, surname, gender, title, birthday, blood type, weight in kg, height in cm, smoker (Y/N)):
patients.csv
```
#!csv
Elaine,McLaughlin,female,Ms.,12-01-1984,O+,44.9,141.0,Y
Alan,Music,male,Mr.,30-12-1966,A-,61.3,181.0,Y
Debra,Roberts,female,Mrs.,01-12-1927,O+,79.6,168.0,N
Martine,Albright,female,Mrs.,05-01-1975,O+,46.0,156.0,N
Terry,Jordan,male,Mr.,28-12-1963,A-,96.9,181.0,N
Joshua,Spinney,male,Mr.,19-12-1952,O+,106.4,183.0,N
Tawanna,Short,female,Ms.,02-02-2012,O+,93.4,175.0,N
Frances,Moore,female,Ms.,07-01-1978,B+,91.6,164.0,N
Jon,Overbey,male,Mr.,12-01-1985,A+,60.9,164.0,N
Milton,Silva,male,Mr.,04-12-1932,O+,88.9,177.0,N
Damon,Kiser,male,Mr.,09-12-1938,A+,97.2,181.0,N
Ruby,Nunez,female,Mrs.,30-12-1966,O-,23.6,126.0,Y
David,Seguin,male,Mr.,21-01-1996,A+,28.8,118.0,N
Lillian,Wooley,female,Ms.,28-12-1964,A-,54.5,156.0,N
Lena,Sharp,female,Ms.,02-01-1971,AB+,58.4,170.0,N
Felecia,Decker,female,Mrs.,20-12-1953,O+,95.8,172.0,N
Belva,Dry,female,Mrs.,07-12-1936,AB+,76.9,151.0,N
Lawrence,McGovern,male,Mr.,06-01-1976,A+,16.8,91.0,Y
Shelley,Bailey,female,Ms.,14-12-1945,A+,91.2,157.0,N
Mildred,Heaps,female,Mrs.,01-01-1970,O+,67.8,162.0,N
Melinda,Edie,female,Ms.,20-12-1953,A+,34.5,145.0,N
Angela,Vaz,female,Mrs.,11-12-1941,AB+,30.3,126.0,N
Gilberto,Schultz,male,Mr.,02-01-1971,B+,63.5,175.0,N
Dianne,Zubia,female,Ms.,28-01-2006,O+,69.1,157.0,N
Truman,Conner,male,Mr.,10-12-1940,O-,105.4,176.0,N
Dorothy,Keegan,female,Mrs.,14-12-1945,A+,54.7,174.0,N
Pamela,Guzman,female,Ms.,11-12-1941,B-,70.9,154.0,Y
John,Davenport,male,Mr.,04-01-1974,B+,96.9,177.0,N
Dorothy,Christen,female,Ms.,05-12-1933,B+,98.0,171.0,N
Mary,Baird,female,Mrs.,26-01-2003,O+,61.3,170.0,N
Luis,Boyett,male,Mr.,17-01-1991,O+,103.4,170.0,N
Tricia,Lombardi,female,Mrs.,13-12-1944,A+,50.5,155.0,Y
Joshua,West,male,Mr.,17-01-1991,A+,14.9,93.0,Y
Jimmie,Martinez,male,Mr.,03-12-1930,B+,114.0,171.0,N
Carla,Moore,female,Ms.,24-12-1958,B-,61.5,154.0,N
Megan,Johanson,female,Mrs.,16-12-1948,B+,85.3,159.0,N
Jenna,Tipton,female,Ms.,06-01-1977,O+,80.1,155.0,Y
Goldie,Maestas,female,Mrs.,06-12-1934,O-,83.4,161.0,N
Jeffrey,Evatt,male,Mr.,26-12-1961,O+,99.6,183.0,N
Lloyd,Carroll,male,Mr.,16-12-1947,B-,98.6,189.0,N
Dennis,Akins,male,Mr.,07-01-1978,B+,66.2,174.0,N
Naomi,Paulus,female,Mrs.,04-02-2015,B+,73.4,167.0,N
Dallas,Iverson,male,Mr.,03-01-1973,A+,66.5,177.0,N
Thomas,Boren,male,Mr.,10-12-1939,A+,67.9,188.0,N
Daniel,Helms,male,Mr.,22-12-1956,A+,87.0,174.0,N
Muriel,Labrecque,female,Ms.,30-01-2009,B+,48.3,155.0,Y
Bernice,Humphries,female,Mrs.,26-01-2003,O+,23.2,123.0,N
Kimberly,Victory,female,Ms.,10-01-1982,A+,100.8,167.0,N
Pam,Beach,female,Mrs.,12-12-1942,O+,79.8,155.0,N
Rita,Hines,female,Ms.,06-01-1977,A+,51.5,166.0,N
```
Here is how you can read this file and build a joint probability distribution from it:
```
#!python
>>> patient = Lea.fromCSVFilename('patients.csv',colNames=('givenName','surname','gender','title','birthday','bloodType','weight{f}','height{f}','smoker{b}'))
```
Note that the given column names shall be valid Python identifiers (starting with a letter, no blanks, etc.). The special suffixes in curly braces used for the last three names are NOT part of the names. By default (no suffix), all read values are stored as strings in the built JPD; a curly-brace suffix indicates that the corresponding read values shall be converted as follows:
suffix | conversion type |
---|---|
{s} | string (this suffix can be omitted) |
{i} | integer |
{f} | float |
{b} | boolean |
For converting strings to booleans, the following rules are applied (case-insensitive):
- '1', 't', 'true', 'y', 'yes' --> Python's True
- '0', 'f', 'false', 'n', 'no' --> Python's False
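These rules could be implemented by a small helper like the following (a sketch of the rules above, not Lea's actual code):

```python
TRUE_STRINGS = {'1', 't', 'true', 'y', 'yes'}
FALSE_STRINGS = {'0', 'f', 'false', 'n', 'no'}

def str_to_bool(s):
    # Case-insensitive conversion following the rules listed above.
    v = s.strip().lower()
    if v in TRUE_STRINGS:
        return True
    if v in FALSE_STRINGS:
        return False
    raise ValueError("cannot interpret %r as a boolean" % s)

print(str_to_bool('Y'))   # prints True
```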
As we shall see, putting {f} on weight and height allows making comparisons and calculations on these attributes; putting {b} on smoker eases the queries by removing references to 'Y'/'N' (see below).
Once the JPD is built, you can access the attributes using standard Python syntax. For example, to get the distributions of genders and titles (marginalisation):
```
#!python
>>> patient.gender
female : 3/5
male : 2/5
>>> patient.title
Mr. : 4/10
Mrs. : 3/10
Ms. : 3/10
```
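Marginalisation itself simply amounts to counting one field of the joint records; with plain-Python tuples (a hypothetical mini-sample, not the actual patients data), it could be sketched as:

```python
from collections import Counter
from fractions import Fraction

# Hypothetical (gender, title) records for illustration only.
records = [('female', 'Ms.'), ('female', 'Mrs.'), ('male', 'Mr.'),
           ('female', 'Ms.'), ('male', 'Mr.')]
counts = Counter(gender for gender, _ in records)
# Marginal distribution of gender, as exact fractions of the sample size.
marginal = {g: Fraction(c, len(records)) for g, c in counts.items()}
print(marginal['female'])   # prints 3/5
```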
We'll see more advanced queries in a later section.
Handling of CSV formats
The Lea.fromCSVFilename method as used above works fine provided that your CSV file has a format similar to that of our patients.csv example. Here are indications on how to handle other CSV formats.
Some CSV files contain a first line to be interpreted as a header containing field names.
patients2.csv
```
#!csv
givenName,surname,gender,title,birthday,bloodType,weight,height,smoker
Elaine,McLaughlin,female,Ms.,12-01-1984,O+,44.9,141.0,Y
Alan,Music,male,Mr.,30-12-1966,A-,61.3,181.0,Y
…
```
In such a case, all you have to do is omit the colNames argument in the call to Lea.fromCSVFilename:
```
#!python
>>> patient2 = Lea.fromCSVFilename('patients2.csv')
```
Another thing to know is that Lea.fromCSVFilename relies on Python's standard csv module and supports the same arguments (with the same defaults) as the csv.reader function, namely dialect and **fmtparams. So, assuming for instance that your CSV file is tab-delimited, you can use:
```
#!python
>>> myData = Lea.fromCSVFilename('data.csv',colNames=…,delimiter='\t')
```
A last point concerns possible problems with the CSV file's encoding. If such a problem occurs, you'll have to open the CSV file yourself with the right encoding and use the Lea.fromCSVFile method, which accepts the same arguments as Lea.fromCSVFilename except the first one, which is a file object instead of a filename. For example, in Python 3:
```
#!python
>>> with open('exoticData.csv','r',encoding='utf-8') as f:
...     myExoticData = Lea.fromCSVFile(f,…)
```
Building JPD from a pandas dataframe
If you do data analysis with the pandas library, you may be tempted to build a Lea JPD from a pandas "dataframe". To do this, you could extract raw tuples from the dataframe and use Lea's fromJoint method; actually, Lea provides a dedicated method to do just that, Lea.fromPandasDF. In the following example, we first build a pandas dataframe by reading the above-mentioned patients2.csv file (this time using a pandas method, not Lea's); then, we build a JPD from that dataframe.
```
#!python
>>> import pandas as pd
>>> df = pd.DataFrame.from_csv('patients2.csv')
>>> patient2 = Lea.fromPandasDF(df)
```
Creating variables from JPD attributes
As explained above, all JPD attributes are accessed by using the standard dot syntax. To have lighter expressions, you could assign variables:
```
#!python
>>> givenName = patient.givenName
>>> surname = patient.surname
>>> gender = patient.gender
>>> …
```
A handier alternative is the makeVars method, which creates these variables for you:
```
#!python
>>> Lea.makeVars(patient,globals())
```
This statement creates variables named after patient's attributes, without the patient. prefix, viz. givenName, surname, gender, title, birthday, bloodType, weight, height and smoker. That's what we'll assume in the following sections.
Data mining on Lea JPD
Assuming that you have managed to get your data into a Lea JPD using the techniques explained in the previous sections, you can now start to query the data. In all the examples below, the returned fractions have to be interpreted primarily as frequencies in the given 50-patient sample; if the sample is large enough and representative enough of the whole population, then these fractions can be interpreted as probabilities.
In what follows, it is important to understand that the attribute variables givenName, surname, etc. extracted from the patient JPD do not refer to objects calculated once and for all (in Lea's words, they are not Alea instances); they are special interdependent Lea objects that are lazily evaluated in the context of the expressions in which they appear.
Let's start with basic examples, calculating the distributions of genders and titles:
```
#!python
>>> gender
female : 3/5
male : 2/5
>>> title
Mr. : 4/10
Mrs. : 3/10
Ms. : 3/10
```
By using the cartesian product, we can see that genders and titles are dependent on each other, as we expect:
```
#!python
>>> X(gender,title)
('female', 'Mrs.') : 3/10
('female', 'Ms.') : 3/10
('male', 'Mr.') : 4/10
```
Let's calculate the distribution of heights. To avoid a too-long display, we shall group the heights with a precision of 10 cm, using Python's integer division.
```
#!python
>>> heightByTen = (height//10) * 10
>>> heightByTen
90.0 : 2/50
110.0 : 1/50
120.0 : 3/50
140.0 : 2/50
150.0 : 12/50
160.0 : 8/50
170.0 : 15/50
180.0 : 7/50
```
Then, let's map these values into category labels:
```
#!python
>>> heightCat = heightByTen.map(lambda h: 'H-%03d-%03d'%(h,h+9))
>>> heightCat
H-090-099 : 2/50
H-110-119 : 1/50
H-120-129 : 3/50
H-140-149 : 2/50
H-150-159 : 12/50
H-160-169 : 8/50
H-170-179 : 15/50
H-180-189 : 7/50
```
We could check who is in the category 'H-120-129' and how tall they are (using English sentences, as an academic exercise!):
```
#!python
>>> (givenName+' is '+(height/100.).map(str)+' m high').given(heightCat=='H-120-129')
Angela is 1.26 m high : 1/3
Bernice is 1.23 m high : 1/3
Ruby is 1.26 m high : 1/3
```
Note that there seem to be only women in this category... This raises another question: how are heights distributed by gender?
```
#!python
>>> female = gender=='female'
>>> male = gender=='male'
>>> heightCat.given(female)
H-120-129 : 3/30
H-140-149 : 2/30
H-150-159 : 12/30
H-160-169 : 7/30
H-170-179 : 6/30
>>> heightCat.given(male)
H-090-099 : 2/20
H-110-119 : 1/20
H-160-169 : 1/20
H-170-179 : 9/20
H-180-189 : 7/20
```
You can note in this example how the heightCat, male and female variables are interdependent. Another, more integrated, approach is to use the cartesian product:
```
#!python
>>> X(gender,heightCat)
('female', 'H-120-129') : 3/50
('female', 'H-140-149') : 2/50
('female', 'H-150-159') : 12/50
('female', 'H-160-169') : 7/50
('female', 'H-170-179') : 6/50
('male', 'H-090-099') : 2/50
('male', 'H-110-119') : 1/50
('male', 'H-160-169') : 1/50
('male', 'H-170-179') : 9/50
('male', 'H-180-189') : 7/50
```
Imagine now that you have to calculate the body mass index (BMI) of the patients. Lea's arithmetic does the job quite easily (note that we round the BMI to limit the number of distinct values):
```
#!python
>>> BMI = (weight/(height/100.)**2).map(round)
>>> BMI
15.0 : 2/50
16.0 : 1/50
17.0 : 1/50
18.0 : 1/50
19.0 : 5/50
20.0 : 3/50
21.0 : 5/50
22.0 : 2/50
23.0 : 2/50
26.0 : 3/50
28.0 : 4/50
29.0 : 1/50
30.0 : 5/50
31.0 : 1/50
32.0 : 3/50
33.0 : 2/50
34.0 : 5/50
36.0 : 2/50
37.0 : 1/50
39.0 : 1/50
>>> BMI.mean
26.22
```
We want to split patients into "underweight", "normal" and "obese" categories. For this purpose, we shall define 3 boolean variables, which are mutually exclusive.
```
>>> underweight = BMI < 18.5
>>> obese = BMI >= 30
>>> normal = ~underweight & ~obese
>>> P(underweight)
1/10
>>> P(obese)
2/5
>>> P(normal)
1/2
```
(You could instead define a single variable catBMI, mapping the BMI values into strings, using an if/elif construct or, simpler, Python's bisect module.)
As a final example, let's select the female patients who are underweight and aged 18 years or more; for these patients, we want to display the full name, age, height, weight and BMI. Let's first build the full name and the age; to ease things, we shall calculate the approximate age for the year 2015, without considering the birth day and month.
```
#!python
>>> fullName = title + ' ' + givenName + ' ' + surname
>>> age = 2015 - birthday[-4:].map(int)
```
Now, using these two new variables, we can answer the initial request:
```
#!python
>>> X(fullName,age,height,weight,BMI).given(female,underweight,age>=18)
('Mrs. Dorothy Keegan', 70, 174.0, 54.7, 18.0) : 1/3
('Mrs. Ruby Nunez', 49, 126.0, 23.6, 15.0) : 1/3
('Ms. Melinda Edie', 62, 145.0, 34.5, 16.0) : 1/3
```
Machine Learning with Lea
So far, most of the examples shown demonstrate various ways to calculate probability distributions from known probability distributions. However, apart from trivial cases like the uniform distributions found in gambling, it can be difficult to determine prior probability distributions for a given phenomenon. If no quantitative data are available for the phenomenon at hand, you have to make assumptions based on some theoretical model, assuming this is not too complex. On the other hand, if quantitative data are available, then it is possible, with dedicated algorithms, to calculate prior probability distributions that tend to fit the given data. This last approach belongs to the field of Machine Learning (ML). In the present section, we shall present a couple of basic techniques to "learn" probabilities from the data.
Frequentist inference
Actually, we have already presented a very basic ML approach: counting occurrences of a given phenomenon and deriving probabilities from the calculated frequencies is indeed a way to "learn" from the data.
For instance, frequencies of English letters can be learned from text data:
```
>>> data = "to be or not to be: that is the question"
>>> freq1 = Lea.fromSeq(data)
>>> freq1
  : 9/40
: : 1/40
a : 1/40
b : 2/40
e : 4/40
h : 2/40
i : 2/40
n : 2/40
o : 5/40
q : 1/40
r : 1/40
s : 2/40
t : 7/40
u : 1/40
```
Note that, using the convenience function seen above, this could have been written more concisely as `freq1 = V(*data)`.
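These frequencies can be cross-checked with the standard library's `collections.Counter`:

```python
from collections import Counter

# Count character occurrences in the same sample sentence
data = "to be or not to be: that is the question"
counts = Counter(data)
print(len(data), counts['t'], counts['o'], counts[' '])
# -> 40 7 5 9
```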
Generally speaking, the larger the data, the more sensible the result will be (this is the assumption made with Big Data!). Frequencies of groups of two letters can easily be calculated as well, using the `pairwise` function (see its implementation in Python's "Itertools Recipes"):
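For reference, `pairwise` is the classic itertools recipe (Python 3.10+ even ships it as `itertools.pairwise`):

```python
from itertools import tee

def pairwise(iterable):
    "s -> (s0, s1), (s1, s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)     # advance the second iterator by one element
    return zip(a, b)

print(list(pairwise('abcd')))
# -> [('a', 'b'), ('b', 'c'), ('c', 'd')]
```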
```
>>> freq2 = Lea.fromSeq(pairwise(data))
>>> freq2.given(freq2[0] == 't')
('t', ' ') : 2/7
('t', 'h') : 2/7
('t', 'i') : 1/7
('t', 'o') : 2/7
>>> freq2.given(freq2[1] == 't')
(' ', 't') : 3/6
('a', 't') : 1/6
('o', 't') : 1/6
('s', 't') : 1/6
```
Another approach to study transitions between letters is to build a Markov chain from the given data, as already explained here:
```
>>> mc = markov.Chain.fromSeq(data)
>>> mc
 
  -> b : 2/9
  -> i : 1/9
  -> n : 1/9
  -> o : 1/9
  -> q : 1/9
  -> t : 3/9
:
  ->   : 1
a
  -> t : 1
b
  -> e : 1
e
  ->   : 2/4
  -> : : 1/4
  -> s : 1/4
h
  -> a : 1/2
  -> e : 1/2
i
  -> o : 1/2
  -> s : 1/2
n
  -> o : 1
o
  ->   : 2/5
  -> n : 1/5
  -> r : 1/5
  -> t : 1/5
q
  -> u : 1
r
  ->   : 1
s
  ->   : 1/2
  -> t : 1/2
t
  ->   : 2/7
  -> h : 2/7
  -> i : 1/7
  -> o : 2/7
u
  -> e : 1
```
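Conceptually, `markov.Chain.fromSeq` counts, for each symbol, the frequencies of its successors; a plain-Python sketch of that counting step:

```python
from collections import Counter, defaultdict

def transition_counts(seq):
    # For each symbol, count how often each successor symbol follows it
    counts = defaultdict(Counter)
    for cur, nxt in zip(seq, seq[1:]):
        counts[cur][nxt] += 1
    return counts

data = "to be or not to be: that is the question"
print(dict(transition_counts(data)['t']))
# -> {'o': 2, ' ': 2, 'h': 2, 'i': 1}
```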
Building a Bayesian network from a joint table
If you reconsider the examples of data mining on a joint table seen above, now interpreting the frequencies as probabilities, then you get another example of ML through the frequentist approach. Admittedly, such a simple statistical approach could be qualified as "Machine Learning for the layman"! A more ambitious goal is to calculate the probabilities involved in a Bayesian network. Lea provides a method for this purpose: it can calculate the CPT entries, provided that you define the dependency graph of the BN. This method, named `Lea.buildBNfromJoint`, makes the following assumptions:
- you study a given set of N discrete random variables, which are possibly interdependent and for which the pmfs are unknown;
- you have got a joint table with S samples giving the values of each of the N variables (i.e. a table of S rows and N columns); we assume that these data are rich enough to cover all values of the domain of each random variable;
- you have defined the dependencies of the BN, by means of a directed acyclic graph linking the random variables.
Let us illustrate these assumptions on an example that reuses the `patient` joint table seen above. Let us consider the three assumptions one by one:
- we shall focus on the five random variables `gender`, `bloodType`, `smoker`, `height` and `weight` (N = 5);
- these five attributes are present in the `patient` joint table, which contains fifty records (S = 50);
- we assume a BN with the following dependencies between the five random variables:
```
gender   bloodType
  |  \      /
  |   \    /
height | smoker
   \   |   /
    \  |  /
     weight
```
The `Lea.buildBNfromJoint` method allows "learning" the different CPTs of this BN from the data present in the patient table. Each argument of `Lea.buildBNfromJoint` is a tuple giving the names of the antecedent nodes and the name of the dependent node:
```
bn = patient.buildBNfromJoint(
        (('gender','bloodType'), 'smoker'),
        (('gender','height','smoker'), 'weight'))
```
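Under the hood, each declared dependency yields a CPT that can be estimated by conditional frequency counting. Here is a minimal plain-Python sketch of that idea, using toy samples (hypothetical data, not the actual patient table):

```python
from collections import Counter, defaultdict
from fractions import Fraction

def learn_cpt(samples, parents, child):
    """Estimate P(child | parents) by frequency counting over dict samples."""
    counts = defaultdict(Counter)
    for s in samples:
        counts[tuple(s[p] for p in parents)][s[child]] += 1
    return {key: {v: Fraction(n, sum(c.values())) for v, n in c.items()}
            for key, c in counts.items()}

# Toy samples (hypothetical, for illustration only)
samples = [{'gender': 'male',   'smoker': True},
           {'gender': 'male',   'smoker': False},
           {'gender': 'female', 'smoker': False},
           {'gender': 'female', 'smoker': False}]
cpt = learn_cpt(samples, ('gender',), 'smoker')
print(cpt[('male',)][True])
# -> 1/2
```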
`bn` is a full-fledged BN on which you can make queries, such as calculating the gender probability of a person given that their weight is between 110 and 120 (not included):
```
print (bn.gender.given(110 <= bn.weight, bn.weight < 120))
# -> female : 166052/781997
#    male   : 615945/781997
print (Pf((bn.gender=='male').given(110 <= bn.weight, bn.weight < 120)))
# -> 0.7876564743854516
```
It is interesting to compare these results with those obtained by making the same query on the `patient` joint table instead of the built-up BN:
```
print (Pf((patient.gender=='male').given(110 <= patient.weight, patient.weight < 120)))
# -> 1.0
print (patient.given(110 <= patient.weight, patient.weight < 120))
# -> _(givenName='Jimmie', surname='Martinez', gender='male', title='Mr.', birthday='03-12-1930', bloodType='B+', weight=114.0, height=171.0, smoker=False) : 1
```
One single record matches the condition, hence the categorical probability of 1 given by the joint table; the BN, in contrast, generalizes beyond the observed samples.
Improving the Bayesian network
Of course, the merits of a BN are directly linked to the quality of the dependency graph designed by the human expert who models the phenomenon. In this regard, the BN graph given above could surely be improved. To illustrate some advanced techniques, let us revisit the BN by including the patient's `age`: it seems sensible to add `age` as an influencer of `height`, as shown in the following graph:
```
age  gender  bloodType
  \   /  |  \    /
   \ /   |   \  /
height   | smoker
     \   |   /
      \  |  /
      weight
```
There is a catch, however: `age` is not an attribute of the `patient` joint table. Only `birthday` is available, which is a string, hence not adequate for a BN variable. As seen previously, assuming the year of the study is 2015, the age can be approximated with the following definition:
```
age = 2015 - patient.birthday[-4:].map(int)
```
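In plain Python, the same extraction applied to a single birthday string looks like this (the sample string follows the day-month-year format used in the table):

```python
# Extract the year from one birthday string and derive the approximate age
birthday = '03-12-1930'          # format used in the patient table
age = 2015 - int(birthday[-4:])  # the last four characters hold the year
print(age)
# -> 85
```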
The trick is then to convert each `age` value into a one-element tuple with the `map` method, append it to the `patient` tuples, and use the `asJoint` method to redefine the column names. This can be done through the following one-liner:
```
patient2 = (patient + age.map(lambda x: (x,))).asJoint(
               'givenName','surname','gender','title','birthday',
               'bloodType','weight','height','smoker','age')
```
The `patient2` joint table contains a new column `age`, which is consistent with `birthday`. Now that we have all the data needed, we can build the new BN, as sketched above:
```
bn2 = patient2.buildBNfromJoint(
        (('gender','age'), 'height'),
        (('gender','bloodType'), 'smoker'),
        (('gender','height','smoker'), 'weight'))
```
```
print (Pf((bn2.gender=='male').given(110 <= bn2.weight, bn2.weight < 120)))
# -> 0.8397314844204872
```
Note that the probability differs slightly from the one calculated with the first BN: the new BN takes into account the influence of `age` on `height`, even if these two variables are absent from the query.