Wiki

Clone wiki

Lea / Examples

Table of Content

introduction

WARNING: Lea 2 is no longer maintained. If you are new to Lea, you are invited to use Lea 3, which has a comprehensive tutorial and examples.

The following page provides several examples using Lea 2, Python flavor.

Unless otherwise specified, all the examples just require Lea module on top of a standard Python installation.

If some Lea statements are unclear, please refer to the basic tutorial or the advanced tutorial. Note that some of the use cases assumes a good knowledge of the Python language!

Sicherman dice

I discovered the strange Sicherman dice in the Scientific American's column of Martin Gardner.

Here is how you can verify that they result in the same probability distribution as two normal dice.

>>> dieS1 = Lea.fromVals(1,2,2,3,3,4)
>>> dieS2 = Lea.fromVals(1,3,4,5,6,8)
>>> dieS1 + dieS2
 2 : 1/36
 3 : 2/36
 4 : 3/36
 5 : 4/36
 6 : 5/36
 7 : 6/36
 8 : 5/36
 9 : 4/36
10 : 3/36
11 : 2/36
12 : 1/36

nontransitive dice

Nontransitive dice are an intriguing mathematical curiosity. It is another counterintuitive fact in the domain of probabilities.

In the following example, we consider the 3 special dice A, B and C (see face labels below).

>>> dieA = Lea.fromVals(3,3,5,5,7,7)
>>> dieB = Lea.fromVals(2,2,4,4,9,9)
>>> dieC = Lea.fromVals(1,1,6,6,8,8)

The calculated probabilities associated to inequalities show that

  • die A beats die B, with probability 5/9
  • die B beats die C, with probability 5/9
  • die C beats die A, with probability 5/9
>>> dieA > dieB
False : 4/9
 True : 5/9
>>> dieB > dieC
False : 4/9
 True : 5/9
>>> dieC > dieA
False : 4/9
 True : 5/9

This forms a strange cycle where, on the average, each die beats another die while it is itself beaten by the remaining die.

Other examples (Grime's dice), with thorough explanation and video, can be found here.

time management

Suppose you have to do 2 tasks T1 and T2, which can last uncertain durations. After careful evaluations, you estimate the following probability distributions for the duration of each task:

  • T1 : 2 days (20%), 3 days (60%), 4 days (20%)
  • T2 : 3 days (20%), 4 days (80%)

To handle time calculation, we shall use the handy datetime built-in Python module. For convenience, we define the constant DAY representing the duration of one day:

>>> from datetime import *
>>> DAY = timedelta(1)
>>> T1 = Lea.fromValFreqs((2,20),(3,60),(4,20)) * DAY
>>> T2 = Lea.fromValFreqs((3,20),(4,80)) * DAY
>>> T1
2 days, 0:00:00 : 1/5
3 days, 0:00:00 : 3/5
4 days, 0:00:00 : 1/5
>>> T2
3 days, 0:00:00 : 1/5
4 days, 0:00:00 : 4/5

Now assuming that T1 could start on 28th September 2020 and that T1 shall be completed before starting T2, what is the distribution of the end time?

>>> startTime = datetime(2020,9,28)
>>> endTime = startTime + T1 + T2
>>> endTime
2020-10-03 00:00:00 :  1/25
2020-10-04 00:00:00 :  7/25
2020-10-05 00:00:00 : 13/25
2020-10-06 00:00:00 :  4/25

For an "executive summary" point of view, the mean time of completion is

>>> print (endTime.mean)
2020-10-04 19:12:00

Note for Lea 1.x users: use the syntax .mean() !

Note (for the curious): How can Lea calculate a mean time, although it is impossible (and meaningless) to add dates together? Actually, the implementation of Lea's mean method works by calculating the mean of the differences between each values and the first value; then this mean difference is added to the first value. With numbers, this provides the same results as the simple algorithm, with the advantage of avoiding any risk of overflow; in the case of datetime objects, it sums actually timedelta objects (i.e. durations), thus avoiding a raised exception.

Here is how to produce figures on the overall duration of the tasks:

>>> endTime - startTime
5 days, 0:00:00 :  1/25
6 days, 0:00:00 :  7/25
7 days, 0:00:00 : 13/25
8 days, 0:00:00 :  4/25
>>> print ((endTime - startTime).mean)
6 days, 19:12:00

Assume now that there is no dependency between T1 and T2 and that you have enough resources to start T1 and T2 in parallel. Then, the distribution of the overall duration is no more T1+T2 but it is max(T1,T2), i.e. it is the duration of the longest task. Unfortunately, you can not write such simple expression because the Python max(...) function does not know how to handle Lea instances (actually, it produces a wrong result) ; you have to use the Lea.max(...) method:

>>> endTimeP = startTime + Lea.max(T1,T2)
>>> endTimeP
2020-10-01 00:00:00 :  4/25
2020-10-02 00:00:00 : 21/25
>>> print (endTimeP.mean)
2020-10-01 20:09:36

We see, as expected, that the mean end time here is earliest than in the sequential scenario.

the 3 coins problem

(taken from Visual Explanations of Probabilistic Reasoning)

Note: we use here the convenience functions V, Pf and X introduced in Lea 2.2.

Three coins are flipped. Given that two of the coins have come up heads, what is the probability that the third has come up tails? Many people respond that the probability is 50% but...

>>> flip = V('Head','Tail')
>>> flip3 = flip.timesTuple(3)
>>> nbHeads = flip3.count('Head')
>>> Pf((nbHeads==2).given(nbHeads>=2))
0.75

... it is, in fact, 75%!

In order to convince yourself that this result is right, you can make several queries to get clues on intermediate steps.

>>> flip3
('Head', 'Head', 'Head') : 1/8
('Head', 'Head', 'Tail') : 1/8
('Head', 'Tail', 'Head') : 1/8
('Head', 'Tail', 'Tail') : 1/8
('Tail', 'Head', 'Head') : 1/8
('Tail', 'Head', 'Tail') : 1/8
('Tail', 'Tail', 'Head') : 1/8
('Tail', 'Tail', 'Tail') : 1/8
>>> X(nbHeads,flip3)
(0, ('Tail', 'Tail', 'Tail')) : 1/8
(1, ('Head', 'Tail', 'Tail')) : 1/8
(1, ('Tail', 'Head', 'Tail')) : 1/8
(1, ('Tail', 'Tail', 'Head')) : 1/8
(2, ('Head', 'Head', 'Tail')) : 1/8
(2, ('Head', 'Tail', 'Head')) : 1/8
(2, ('Tail', 'Head', 'Head')) : 1/8
(3, ('Head', 'Head', 'Head')) : 1/8
>>> X(nbHeads,flip3).given(nbHeads>=2)
(2, ('Head', 'Head', 'Tail')) : 1/4
(2, ('Head', 'Tail', 'Head')) : 1/4
(2, ('Tail', 'Head', 'Head')) : 1/4
(3, ('Head', 'Head', 'Head')) : 1/4
>>> nbHeads.given(nbHeads>=2)
2 : 3/4
3 : 1/4
>>> (nbHeads==2).given(nbHeads>=2)
False : 1/4
 True : 3/4

lottery

>>> lottery = Lea.fromSeq(range(1,46))
>>> lottery.random()
13
>>> lottery2 = lottery.given(lottery!=13)
>>> lottery2.random()
24
>>> lottery3 = lottery2.given(lottery2!=24)
>>> lottery3.random()
25
...

Same as (far more efficient)...

>>> lottery.randomDraw(6)
(13, 24, 25, 10, 1, 26)

character frequency counter

In the following example, we count the number of occurrences of each letter in a famous Latin sentence. This shows how a frequency analysis can be done.

>>> freqLatin = Lea.fromVals(*'ALEA JACTA EST')
>>> freqLatin
  : 2/14
A : 4/14
C : 1/14
E : 2/14
J : 1/14
L : 1/14
S : 1/14
T : 2/14

From this basic frequency table, we can generate a random message that tends to obey the same frequency.

>>> ''.join(freqLatin.random(20))
'AJ  LTC TAJA ATLEALJ'

You can notice how the 'A' occurs more than the other letters.

the lifetime problem

Suppose we have a device that has 1% to break every time it is switched on. Let u represents the probability that it is not broken after one switch on:

>>> u = Lea.boolProb(99,100)
>>> u
False :  1/100
 True : 99/100

To determine the probability that the device is OK after two switches on, you may write:

>>> u & u.clone()
False :  199/10000
 True : 9801/10000

To determine the probability that it is OK after ten switches on, you may write:

>>> from operator import and_
>>> u.times(10,and_)
False :  9561792499119550999/100000000000000000000
 True : 90438207500880449001/100000000000000000000

You can verify the result as follows:

>>> u.times(10,and_).pf(True)
0.9043820750088044
>>> 0.99 ** 10
0.9043820750088044

2D random walk

We can use Python complex numbers to model a body (e.g. a drunk rabbit) that moves randomly in a plane : in a complex plane, the North direction is the imaginary 1j value, the East direction is the value 1, etc. Let us assume that the body moves at each time step East, North, West or South, with equal probabilities:

>>> b = Lea.fromVals(1,1j,-1,-1j)
>>> b
  1 : 1/4
-1j : 1/4
 1j : 1/4
 -1 : 1/4

Then, assuming that the body is initially located at the origin of the complex plane, the probability distribution of the body's position after two time steps is given by

>>> b + b.clone()
      0 : 4/16
(-1+1j) : 2/16
      2 : 1/16
 (1+1j) : 2/16
     2j : 1/16
(-1-1j) : 2/16
    -2j : 1/16
 (1-1j) : 2/16
     -2 : 1/16

After four time steps,

>>> b.times(4)
      0 : 36/256
      2 : 16/256
      4 :  1/256
     2j : 16/256
 (2+2j) :  6/256
     4j :  1/256
(-2+2j) :  6/256
(-1-3j) :  4/256
 (1-3j) :  4/256
(-3+1j) :  4/256
(-1+1j) : 24/256
 (1+1j) : 24/256
 (3+1j) :  4/256
(-1+3j) :  4/256
 (1+3j) :  4/256
(-1-1j) : 24/256
     -4 :  1/256
    -4j :  1/256
(-3-1j) :  4/256
(-2-2j) :  6/256
    -2j : 16/256
 (1-1j) : 24/256
 (2-2j) :  6/256
 (3-1j) :  4/256
     -2 : 16/256

Here is how to get the probability that, after four time steps, the body is inside a circle of radius 2, centred on the origin (remind that abs(c) gives the modulus of the complex number c):

>>> abs(b.times(4)) <= 2
False : 15/64
 True : 49/64

To get the full distribution of the body's distance from the origin:

>>> abs(b.times(4))
            0 :  9/64
1.41421356237 : 24/64
            2 : 16/64
2.82842712475 :  6/64
3.16227766017 :  8/64
            4 :  1/64

You can check the consistency of the last results since 49/64 = (9+24+16)/64.

RPG combat

Imagine a Role-Playing Game combat situation opposing you, a warrior, against a giant troll. We shall use here a simplified version of the "d20 System" (Open Game License). To know whether you can hit the troll, you have to calculate your "attack roll" (AR) and the "armour class" (AC) of the troll; if your AR is greater or equal to the AC of the troll, then you hit him and causes him some "damage".

Let us first define the dice we need (using RPG standards: "D6", "2D6" and "D20"):

>>> D6 = Lea.fromVals(*range(1,7))
>>> _2D6 = D6.times(2)
>>> D20 = Lea.fromVals(*range(1,21))

On a given combat round, you hit the troll if your attack roll (D20 + bonus of 8) is greater or equal to the troll's armour class (24):

>>> targetArmorClass = 24
>>> attackRoll = D20 + 8
>>> hit0 = attackRoll >= targetArmorClass
>>> hit0
False : 3/4
 True : 1/4

Well,... but we forget exceptions to the the rule! It says that if the D20 throw is 1, then there is a certain miss; also, if the D20 throw is 20, then there is a certain hit. Let's make the correction with a CPT (Conditional Probability Table) :

>>> hit = Lea.buildCPT( ( D20 ==  1, False ),
                        ( D20 == 20, True  ),
                        ( None     , hit0  ))

If you use Lea 1.x (no CPT), you can have the same result using a boolean expression:

>>> hit = (hit0 & (D20 != 1)) | (D20 == 20)
Let us see the new odds:
>>> hit
False : 3/4
 True : 1/4
Actually, that does not change anything compared to hit0 but it could have on a more unbalanced combat.

Now, if you can hit the troll, your magic sword shall cause him a damage of 2D6 + a bonus of 3:

>>> damageIfHit = _2D6 + 3
>>> damageIfHit
 5 : 1/36
 6 : 2/36
 7 : 3/36
 8 : 4/36
 9 : 5/36
10 : 6/36
11 : 5/36
12 : 4/36
13 : 3/36
14 : 2/36
15 : 1/36
>>> damageIfHit.mean
10.0

Note for Lea 1.x users: use the syntax .mean() !

If you are lucky to hit the troll, then the damage average is 10 points.

If you want to know the overall damage distribution considering that the hit is uncertain, then we could define a CPT as above (Lea.buildCPT) or use simply the Lea.if_ convenience method:

>>> damage = Lea.if_( hit, damageIfHit, 0)
Consider this as the probabilistic version of an if-then-else! An alternative, needed if using Lea 1.x, is to the following expression, which uses the "absorbing" property of zero :
>>> damage = hit.map(lambda h: 1 if h else 0) * damageIfHit

In both cases, the resulting distribution of damage is as follows:

>>> damage 
 0 : 108/144
 5 :   1/144
 6 :   2/144
 7 :   3/144
 8 :   4/144
 9 :   5/144
10 :   6/144
11 :   5/144
12 :   4/144
13 :   3/144
14 :   2/144
15 :   1/144
>>> damage.mean
2.5

We see that the damage average has lowered to 2.5 points, since the chance to hit is 1/4. Let us evaluate total damage distribution after 2 rounds of combat (with no reaction from the troll!):

>>> damage.times(2)
 0 : 11664/20736
 5 :   216/20736
 6 :   432/20736
 7 :   648/20736
 8 :   864/20736
 9 :  1080/20736
10 :  1297/20736
11 :  1084/20736
12 :   874/20736
13 :   668/20736
14 :   467/20736
15 :   272/20736
16 :    80/20736
17 :   104/20736
18 :   125/20736
19 :   140/20736
20 :   146/20736
21 :   140/20736
22 :   125/20736
23 :   104/20736
24 :    80/20736
25 :    56/20736
26 :    35/20736
27 :    20/20736
28 :    10/20736
29 :     4/20736
30 :     1/20736

If current troll's health point is 4, the probability to kill him in the next two rounds is calculated as follows:

>>> healthPoints = 4
>>> healthPoints - damage.times(2) <= 0
False : 9/16
 True : 7/16

To end, let's simulate 20 "real" rounds of combat:

>>> damage.random(20)
(0, 0, 0, 0, 11, 0, 9, 7, 0, 0, 0, 10, 0, 0, 13, 0, 0, 9, 0, 0)

The reader can develop the case to calculate troll's attacks, to evaluate health points evolution after damages, simulate combats until dead, etc.

Bayesian network - the Rain-Sprinkler-Grass example

We refer here to the example of bayesian network on Wikipedia. We show hereafter one possible translation of this model in Lea:

# -- rsg_net.py --------

from lea import *

rain = B(20,100)

sprinkler = Lea.if_(rain, B(1,100),
                          B(40,100))

grassWet = X(sprinkler,rain).switch({ (False,False): False,
                                      (False,True ): B(80,100),
                                      (True ,False): B(90,100),
                                      (True ,True ): B(99,100) })
Then, the model can be queried in different ways.
>>> from rsg_net import *
>>> grassWet & sprinkler & rain
False : 49901/50000
 True :    99/50000
>>> Pf(grassWet & sprinkler & rain)
0.00198
>>> rain.given(grassWet)
False : 1600/2491
 True :  891/2491
>>> Pf(rain.given(grassWet))
0.3576876756322762
>>> Pf(grassWet.given(rain))
0.8019
>>> Pf(grassWet.given(sprinkler&~rain))
0.9
>>> Pf(grassWet.given(~sprinkler&~rain))
0.0
You can check the first resulting probabilities (floating-point values) on the afore-mentioned Wikipedia page. Note that the results are exact (i.e. no MCMC estimation).

To know more about bayesian reasoning in Lea, including the handling of non-boolean variables, have a look in the advanced tutorial: part 2.

A murder case

This one is a simple fictional example of using Lea's bayesian reasoning for forensic science.

Here are the facts, which are certain:

  • Dr. Black has been murdered last night.
  • There are three suspects: Colonel Mustard, Mrs. White and Mrs. Peacock. The killer is one of these and only one.

Let us assume that we have the following respective prior probabilities for each: 60%, 20%, 10%. Let us define then the killer variable.

>>> killer = Lea.fromValFreqs(("Colonel Mustard",60), ("Mrs. White",30), ("Mrs. Peacock",10))
>>> print (killer.asPct())
Colonel Mustard :  60.0 %
   Mrs. Peacock :  10.0 %
     Mrs. White :  30.0 %

Assume now that investigations allow us establish the following three probabilistic rules:

if Mrs. White is the killer,
then she’ll be absent with prob. 90 %
else she’ll be absent with prob. 20 %

if Mrs. Peacock is innocent,
then she knows who’s the killer with prob. 75 %

if Mrs. Peacock is the killer or if she knows who’s the killer
then she’ll be absent with prob. 99 %
else she’ll be absent with prob. 10 %

>>> mrsWhiteIsAbsent = \
      Lea.if_( killer == "Mrs. White",
               Lea.boolProb(90,100)  ,
               Lea.boolProb(20,100)  )

>>> mrsPeacockKnowsWhoIsKiller = \
      Lea.if_( killer != "Mrs. Peacock",
               Lea.boolProb(75,100)    ,
               True                    )

>>> mrsPeacockIsAbsent = \
      Lea.if_( (killer == "Mrs. Peacock") | mrsPeacockKnowsWhoIsKiller,
               Lea.boolProb(99,100),
               Lea.boolProb(10,100))

Note how we state with the True value, that Mrs. Peacock CERTAINLY knows who's the killer assuming she's HERSELF the killer (an assumption of sanity, missing in the given rule)!

Let's imagine we get now new evidences, one by one. Let us examine how the culpability probability distribution evolves as the evidences are added.

EVIDENCE 1: Mrs. Peacock is absent.

>>> evidences = mrsPeacockIsAbsent
>>> print (killer.given(evidences).asPct())
Colonel Mustard :  58.3 %
   Mrs. Peacock :  12.5 %
     Mrs. White :  29.2 %

EVIDENCE 2: Mrs. White is present.

>>> mrsWhiteIsPresent = ~mrsWhiteIsAbsent
>>> evidences &= mrsWhiteIsPresent
>>> print (killer.given(evidences).asPct())
Colonel Mustard :  78.3 %
   Mrs. Peacock :  16.8 %
     Mrs. White :   4.9 %

EVIDENCE 3: The killer is a woman.

>>> killerIsWoman = killer[:4] == "Mrs."
>>> evidences &= killerIsWoman
>>> print (killer.given(evidences).asPct())
Mrs. Peacock :  77.5 %
  Mrs. White :  22.5 %
We see on this example how new evidences, added to known evidences, can dramatically change the prior probabilities. This example may sound simplistic, far from true cases. Be aware however, that even simpler reasoning have been mistakenly used to famous cases in the past, leading to wrong results and tragic consequences (see Dreyfus affair and, more recently, Sally Clarke affair).

rock-paper-scissors game

Let us play the well-known rock-paper-scissors game! First, let us define a class that allows us to represent the three objects and the nontransitive win relationship:

class RPS(object):

    _curVal = 0

    def __init__(self,name):
        object.__init__(self)
        self.name = name
        self.val = RPS._curVal
        RPS._curVal += 1

    def __str__(self):
        return self.name

    __repr__ = __str__

    def __cmp__(self,other):
        delta = other.val - self.val
        if delta == 0:
            return 0
        if delta == 1 or delta == -2:
            return -1
        return +1 

r = RPS('rock')
p = RPS('paper')
s = RPS('scissors')

We can verify the win conditions by comparing the objects together:

>>> (r,p,s)
(rock, paper, scissors)
>>> r < p
True
>>> p < s
True
>>> s < r
True

Assuming that players follow a random, independent choice at each turn of play, let us define 3 behaviours:

  1. balanced : the 3 objects have equal probabilities
  2. morerock : the rock has twice more chances to be chosen than the others
  3. morepaper : the paper has twice more chances to be chosen than the others
>>> balanced  = Lea.fromValFreqs((r,1),(p,1),(s,1))
>>> morerock  = Lea.fromValFreqs((r,2),(p,1),(s,1))
>>> morepaper = Lea.fromValFreqs((r,1),(p,2),(s,1))
>>> balanced
    rock : 1/3
   paper : 1/3
scissors : 1/3
>>> morerock
    rock : 2/4
   paper : 1/4
scissors : 1/4
>>> morepaper
    rock : 1/4
   paper : 2/4
scissors : 1/4

Now, we are able to compare these behaviours to get the probabilities of win. The easier approach is to use the usual comparison operators. For example, the chances of win of morepaper over morerock are given by

>>> morepaper > morerock
False : 5/8
 True : 3/8

The advantage of morepaper is not blatant because this result does not show the probability of tied game. This can be found as follows

>>> morepaper == morerock
False : 11/16
 True :  5/16

A better approach consists in using the cmp(a,b) method, defined in Python 2.x : it returns -1, 0 or +1 depending whether a is less than, equal to or greater than b, respectively.

For Python 3.x, since cmp is not defined, you may define it yourself or use the following trick:

>>> cmp = RPS.__cmp__

The new approach requires to use the map method to be able to apply the cmp function on morepaper's values, providing morerock's values as second argument. We can then have the full picture of the results at once:

>>> morepaper.map(cmp,(morerock,))
-1 : 5/16
 0 : 5/16
 1 : 6/16

We see now clearly the advantage of choosing more rocks against an opponent that choose more papers. In the present case, it wins with probability 6/16 and loose with probability 5/16.

The skeptical reader can use the cprod method (cartesian product) to verify this results by consulting each possible draws of the game and add the atomic probabilities - provided that such reader is not skeptical on the cprod method! -:

>>> morepaper.cprod(morerock)
      (paper, paper) : 2/16
    (scissors, rock) : 2/16
   (scissors, paper) : 1/16
(scissors, scissors) : 1/16
        (rock, rock) : 2/16
       (rock, paper) : 1/16
    (rock, scissors) : 1/16
   (paper, scissors) : 2/16
       (paper, rock) : 4/16

An even more practical way to check atomic results consists in letting Lea provides directly game results by injecting it in the cartesian product (as first position, so the ordering makes a smart grouping):

>>> morepaper.map(cmp,(morerock,)).cprod(morerock,morepaper)
   (-1, rock, scissors) : 2/16
      (-1, paper, rock) : 1/16
  (-1, scissors, paper) : 2/16
      (0, paper, paper) : 2/16
(0, scissors, scissors) : 1/16
        (0, rock, rock) : 2/16
   (1, paper, scissors) : 1/16
    (1, scissors, rock) : 1/16
       (1, rock, paper) : 4/16

Since the map(cmp,...) method provides -1,0,+1 values, it is easy to get the distribution of fails, ties and wins after a given number of games by using the times(n) method. For example, here are the scores after 3 games, where values indicates the difference won games - lost games from the "morerock" player's point of view:

>>> morepaper.map(cmp,(morerock,)).times(3)
-3 :  125/4096
-2 :  375/4096
-1 :  825/4096
 0 : 1025/4096
 1 :  990/4096
 2 :  540/4096
 3 :  216/4096 

Now, are there better strategies against morepaper?

Yes! In particular,

  • choose paper with 8 more chances than the others:
>>> evenmorepaper = Lea.fromValFreqs((r,1),(p,8),(s,1))
>>> evenmorepaper.map(cmp,(morerock,))
-1 : 11/40
 0 : 11/40
 1 : 18/40
  • choose always paper:
>>> alwayspaper = Lea.fromVals(p)
>>> alwayspaper.map(cmp,(morerock,))
-1 : 1/4
 0 : 1/4
 1 : 2/4

Now, what about the balanced behaviour?

  • against morerock:
>>> balanced.map(cmp,(morerock,))
-1 : 1/3
 0 : 1/3
 1 : 1/3
  • against a fixed choice (e.g. paper):
>>> balanced.map(cmp,(p,))
-1 : 1/3
 0 : 1/3
 1 : 1/3
  • against itself:
>>> balanced.map(cmp,(balanced.clone(),))
-1 : 1/3
 0 : 1/3
 1 : 1/3

We demonstrate in these last few examples the known (and proven) fact that the balanced strategy gives a tied game on the long term, whatever opponent's behaviour. It is actually the best strategy when nothing is known about the opponent's behaviour.

von Neumann extractor

Imagine you have a physical device able to produce a "true" random stream of bits. Great! Unfortunately, this device has a bias : the probability of having 1 or 0, although constant in time, is not 1/2; furthermore, this probability is unknown.

Question: Can you find a mechanism to remove this bias and produce a stream of true random bits, with balanced probabilities (1/2)?

John von Neumann has invented a simple yet awesome trick to do the job. See von Neumann extractor.

Let us demonstrate the algorithm of von Neumann. Since we don't have the physical random device at our disposal, we shall first simulate the biased stream of bits. We use Lea to build an infinite random bits iterator, assuming a probability of 80% to have a 0:

biasedRandomIter = Lea.fromValFreqs((0,80),(1,20)).randomIter()

Note that the previous statement is equivalent to this:

def genBiasedRandom():
    source = Lea.fromValFreqs((0,80),(1,20))
    while True:
        yield source.random()
biasedRandomIter = genBiasedRandom()

We can make a trial by taking a sample of 500 bits on this biased random source and control the frequency:

>>> from itertools import islice
>>> biasedSample = tuple(islice(biasedRandomIter,500))
>>> biasedSample 
(0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0)
>>> print (Lea.fromVals(*biasedSample).asPct())
0 :  79.4 %
1 :  20.6 %

We see that the frequency of 0 is close to 80 %, as expected.

Now, we build the von Neumann extractor, as an infinite random bits iterator that takes the biased source as input:

from itertools import islice

def genVonNeumannExtractor(randomIter):
    while True:
        (b0,b1) = islice(randomIter,2)
        if b0 != b1:
            yield b0

vonNeumannExtractorIter = genVonNeumannExtractor(biasedRandomIter)

Let us make a try, taking a sample of 500 bits from this device (note that the process shall likely consume a lot more bits on the biased source!) :

>>> unbiasedSample = tuple(islice(vonNeumannExtractorIter,500))
>>> unbiasedSample
(0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1)
>>> print (Lea.fromVals(*unbiasedSample).asPct())
0 :  47.2 %
1 :  52.8 %

We can see that the frequencies are now close to 50%, as expected.

Thank you, Doctor Miraculis !

central limit law

a contribution of Nicky van Foreest

Requires: SciPy, numpy and matplotlib (see SciPy Stack)

From theory we know that the distribution of the sum of identical and independent random variables converges to the normal distribution. We can test this with Lea.

First we need some basic inputs.

>>> import numpy as np
>>> from scipy.stats import norm
>>> import matplotlib.pylab as plt
>>> from lea import Lea

Now make a uniform die with four sides and take the 5 fold sum.

>>> die = Lea.fromVals(0, 1, 2, 3)
>>> dieN = die.times(5)

We need the support and the cumulative distribution of dieN.

>>> T = np.array(zip(*dieN.integral()))
>>> # the support is stored in T[0,:]
>>> supp = T[0,:]
>>> # and the distribution in T[1,:]
>>> F = T[1,:]

As yet, F is a distribution function, but not a probability distribution function, that is bounded between 0 and 1. For this we need to normalize F.

>>> F = F/float(F[-1])

The last step is to compare F to the cumulative distribution of the normal distribution. We need a continuity correction, for otherwise the normal distribution seems to be shifted to the right.

>>> mu = dieN.mean
>>> std = dieN.stdev
>>> FN = norm(loc = mu , scale = std).cdf(supp + 0.5)

Note for Lea 1.x users: use the syntax .mean() and .stdev() !

Let's compare.

>>> np.max(np.abs(F-FN))
0.0049414764166033631

This is truly astonishing. A 5-fold convolution is already this good.

bullshit generator

Some years ago, out of desperation, I wrote a bullshit generator in Python.

Here is a typical sample of what it produces:


As a matter of fact, the engine is provided by the interface where the actor on top of a peer-to-peer interoperability under the specification derivation rule is performed by the available SSO. The timing thread is generated by a full XML when the Web 2.0 that executes the issuer is generated by the artifact. The world-leading Web browser orchestrates a SOAP message. An aware issuer is envisioned in the generic design operator, which competes with the factory, in the aggregator that is unaffected. The authorized entity provides an interface to the fat client. The engine updates the integration. The algorithm of the model that is efficient sent to a technology thin client, which triggers a cluster, is provided by the version. The approach is repeatedly required by the action timing security under a space when the ancillary owner generates the XML-based UML model.


The program works with a a small dictionary of words and a simplified English grammar (very close to the one wired in my brain, since I am not a native English speaker!). The words are classified in categories -nouns, verbs, etc-; to simplify, these words are chosen with equal probability, which was easily done with the built-in random.choice of Python (see TerminalNode class). For the grammar rules however, in order to avoid too long sentences, I put a probability weight on each derivation rule, so to make the simplest constructions the likeliest. Part of the complexity of the program was then in the algorithm to make the random choices according to the probability weights (see Node class).

In what follows, I show how to use Lea to greatly simplify the Node and TerminalNode classes. The statements to define English words and grammars are unchanged from the original version, excepting the swapping of probability weights and grammar rules to accommodate the Lea.fromValFreqs method. Note that Python's random module is no longer imported (although used by Lea internally).

'''
======================================================================
 Bullshit Generator 
    by Pierre Denis, 2009, 2014
======================================================================
'''

# --------------------------------------------------
# grammar engine
# --------------------------------------------------

from lea import Lea

class Node(object):

    def setTermsChoices(self,*termsChoices):
        self.termsChoices = Lea.fromValFreqs(*termsChoices)

    def getWords(self):
        terms = self.termsChoices.random()
        for term in terms:
            if isinstance(term,str):
                yield term
            else:
                for word in term.getWords():
                    yield word

    def getString(self):
        res = " ".join(self.getWords())
        res = ", ".join(w.strip() for w in res.split(",") if w.strip())
        if res.endswith(", "):
            res = res[:-2]
        return res[0].upper() + res[1:] + "."


class TerminalNode(object):

    def __init__(self,*words):
        self.words = Lea.fromVals(*words)

    def getWords(self):
        yield self.words.random()

# --------------------------------------------------
# grammar
# --------------------------------------------------

verb = TerminalNode(
    "accesses", "activates", "administrates", "aggregates", "builds",
    "calculates", "checks", "competes with", "completes", "complies with",
    "controls", "covers", "delivers", "dispatches", "eases", "encapsulates",
    "encompasses", "executes", "extracts", "features",
    "generates", "gets", "governs", "guides", "has", "increases",
    "inherits from", "is", "keeps track of", "leverages", "makes",
    "manages",
    "manages", "maximizes", "mitigates", "monitors", "must have", "needs",
    "offers", "opens", "operates on", "optimizes", "orchestrates",
    "overwrites", "performs", "populates", "precludes", "provides",
    "provides",
    "provides an interface to", "reads", "receives", "reduces",
    "reduces the need of", "registers", "regulates", "relies on",
    "requires",
    "resides on", "resides within", "retrieves", "retrieves the data in",
    "runs on",
    "schedules", "integrates with", "sends", "shall be",
    "shall have", "should be", "should have", "starts", "stores",
    "streamlines", "subscribes to", "subscribes to", "supersedes", "takes",
    "targets", "triggers", "updates", "validates", "writes")

passiveVerb = TerminalNode(
    "accessed by", "achieved by", "aggregated by", "applicable for",
    "asserted by", "authorized by",
    "based upon", "built from", "built upon", "collected by",
    "controlled by",
    "dedicated to", "deployed on", "derived from", "dispatched by",
    "driven by", "eased by", "enabled by", "envisioned in",
    "extracted from", "generated by", "in the scope of", "installed on",
    "integrated in",
    "located in", "managed by", "maximized by", "monitored by", "opened by",
    "optimized by", "orchestrated by", "packaged in", "performed by",
    "populated by", "processed by", "provided by", "provided by",
    "received by", "refreshed by", "registered in", "related to",
    "required by",
    "responsible for", "scheduled by", "sent to", "serialized by",
    "serialized in", "started in", "stored by", "stored in", "stored on",
    "the interface of", "updated by", "validated by")

aSimpleName = TerminalNode(
    "COTS", "GRID processing",
    "Java program", "LDAP registry", "Portal", "RSS feed", "SAML token",
    "SOAP message", "SSO", "TCP/IP", "UML model", "URL",
    "W3C", "Web", "Web 2.0", "Web browser", "Web page",
    "Web service", "back-end", "backbone", "bandwidth", "bean",
    "bridge", "browser", "bus", "business", "business model", "call",
    "catalogue", "class", "client", "cluster", "collection",
    "communication", "component", "compression",
    "concept", "conceptualization", "connexion", "console", "content",
    "context", "cookie", "customization", "data", "database",
    "datastore", "deployment",
    "derivation rule", "design", "development", "device", "directory",
    "discovery", "dispatcher", "document", "domain", "factory",
    "fat client", "feature", "file", "form", "frame", "framework",
    "function", "gateway", "genericity", "geomanagement", "goal",
    "governance", "granularity", "guideline", "header", "key", "layer",
    "leader", "library", "link", "list", "log file", "logic",
    "look-and-feel",
    "manager", "market", "mechanism", "message", "meta-model",
    "metadata", "model", "modeling", "module", "network", "performance",
    "persistence", "personalization", "plug-in", "policy", "port",
    "portal", "practice",
    "presentation layer", "privacy", "private key", "procedure",
    "process", "processor", "processing", "product", "protocol",
    "recommendation",
    "registration", "registry", "relationship", "resource",
    "responsibility", "role",
    "rule", "scenario", "scenario", "scheduler", "schema", "security",
    "server", "service", "service provider", "servlet", "session",
    "skeleton", "software", "solution", "source", "space",
    "specification", "standard", "state", "statement", "streaming",
    "style sheet", "subscriber", "subsystem", "system", "system",
    "table", "target", "task", "taxonomy", "technique", "technology",
    "template", "thin client", "thread", "throughput", "timing", "tool",
    "toolkit", "topic", "unit", "usage", "use case", "user",
    "user experience", "validation", "value", "version", "vision", "work",
    "workflow")

anSimpleName = TerminalNode(
    "API", "IP address", "Internet", "UDDI", "XML", "XML file",
    "abstraction", "access", "acknowledgment", "action", "actor",
    "administrator", "aggregator", "algorithm", "application", "approach",
    "architecture", "artifact", "aspect", "authentication", "availability",
    "encapsulation", "end-point", "engine", "engine", "entity",
    "entity", "environment", "event", "identifier", "information",
    "integration", "interface", "interoperability", "issuer", "object",
    "ontology", "operation", "operator", "operator", "opportunity",
    "orchestration", "owner")

aAdjective = TerminalNode(
    "BPEL",  "DOM", "DTD", "GRID", "HTML", "J2EE",
    "Java", "Java-based", "Java-based", "UML", "SAX", "WFS", "WSDL",
    "basic", "broad", "bug-free",
    "business-driven", "client-side", "coarse", "coherent", "compatible",
    "complete", "compliant", "comprehensive", "conceptual", "consistent",
    "control", "controller", "cost-effective",
    "custom", "data-driven", "dedicated", "distributed", 
    "dynamic", "encrypted", "event-driven", "fine-grained", "first-class",
    "free", "full",
    "generic", "geo-referenced", "global", "global", "graphical",
    "high-resolution", "high-level", "individual", "invulnerable",
    "just-in-time", "key",
    "layered", "leading", "lightweight", "logical", "main", "major",
    "message-based",
    "most important", "multi-tiers", "narrow", "native", "next",
    "next-generation",
    "normal", "password-protected", "operational", "peer-to-peer",
    "performant", "physical",
    "point-to-point", "polymorphic", "portable", "primary", "prime",
    "private", "proven", "public", "raw", "real-time", "registered",
    "reliable", "remote",
    "respective", "right", "robust", "rule-based", "scalable", "seamless",
    "secondary", "semantic", 
    "server-side", "service-based", "service-oriented", "simple", "sole",
    "specific", "state-of-the-art", "stateless", "storage", "sufficient",
    "technical", "thread-safe", "uniform", "unique", "used", "useful",
    "user-friendly", "virtual", "visual", "web-based", "web-centric",
    "well-documented", "wireless", "world-leading", "zero-default")

anAdjective = TerminalNode(
    "AJAX", "OO", "XML-based", "abstract", "ancillary", "asynchronous",
    "authenticated", "authorized", "auto-regulated", "available", "aware",
    "efficient",
    "international", "interoperable", "off-line", "official", "online",
    "open", "operational",
    "other", "own", "unaffected", "up-to-date")

adverb = TerminalNode(
    "basically", "comprehensively", "conceptually", "consistently",
    "definitely", "dramatically",
    "dynamically", "expectedly", "fully", "generally", "generically",
    "globally", "greatly", "individually", "locally", "logically",
    "mainly", "mostly", "natively",
    "officially", "physically", "practically", "primarily",
    "repeatedly", "roughly", "sequentially", "simply", "specifically", 
    "surely", "technically", "undoubtly", "usefully", "virtually")

sentenceHead = TerminalNode(
    "actually", "as a matter of fact", "as said before", "as stated before",
    "basically", "before all", "besides this", "beyond that point",
    "clearly",
    "conversely", "despite these facts", "for this reason",
    "generally speaking",
    "if needed", "in essence", "in other words", "in our opinion",
    "in the long term", "in the short term", "in this case", "incidentally",
    "moreover", "nevertheless", "now", "otherwise", "periodically",
    "roughly speaking", "that being said", "then", "therefore",
    "to summarize", "up to here", "up to now", "when this happens")

(name, aName, anName, nameTail, adjective, nameGroup,
 simpleNameGroup, verbalGroup, simpleVerbalGroup, sentence,
 sentenceTail) = [Node() for i in range(11)]

aName.setTermsChoices(
    (( aSimpleName,      ), 50 ),
    (( aSimpleName, name ),  5 ),
    (( aSimpleName, name ),  8 ),
    (( aName, nameTail   ),  5 ))

anName.setTermsChoices(
    (( anSimpleName,      ), 50 ),
    (( anSimpleName, name ),  8 ),
    (( anName, nameTail   ),  5 ))

nameTail.setTermsChoices(
    (( "of", nameGroup        ), 2 ),
    (( "from", nameGroup      ), 2 ),
    (( "under", nameGroup     ), 1 ),
    (( "on top of", nameGroup ), 1 ))

name.setTermsChoices(
    (( aName,  ), 1 ),
    (( anName, ), 1 ))

adjective.setTermsChoices(
    (( aAdjective,  ), 1 ),
    (( anAdjective, ), 1 ))

nameGroup.setTermsChoices(
    (( simpleNameGroup,                                   ), 10 ),
    (( simpleNameGroup, passiveVerb, nameGroup            ),  1 ),
    (( simpleNameGroup, "that", simpleVerbalGroup         ),  1 ),
    (( simpleNameGroup, ", which", simpleVerbalGroup, "," ),  1 ))

simpleNameGroup.setTermsChoices(
    (( "the", name             ), 40 ),
    (( "the", adjective, name  ), 20 ),
    (( "a", aName              ), 10 ),
    (( "an", anName            ), 10 ),
    (( "a", aAdjective, name   ),  5 ),                
    (( "an", anAdjective, name ),  5 ))  

verbalGroup.setTermsChoices(
    (( verb, nameGroup                      ), 10 ),
    (( adverb, verb, nameGroup              ),  1 ),
    (( "is", passiveVerb, nameGroup         ), 10 ),
    (( "is", adverb, passiveVerb, nameGroup ),  1 ),
    (( "is", adjective                      ),  1 ),
    (( "is", adverb, adjective              ),  1 ))

simpleVerbalGroup.setTermsChoices(
    (( verb, simpleNameGroup ), 2 ),
    (( "is", adjective       ), 1 ))

sentence.setTermsChoices(
    (( nameGroup, verbalGroup                     ), 20 ),
    (( sentenceHead, "," , nameGroup, verbalGroup ),  4 ),
    (( sentence, sentenceTail                     ),  4 ))

sentenceTail.setTermsChoices(
    (( "in", nameGroup                 ), 12 ),
    (( "within", nameGroup             ),  5 ),
    (( "where", nameGroup, verbalGroup ),  5 ),
    (( "when", nameGroup, verbalGroup  ),  5 ),
    (( "because it", verbalGroup       ),  2 ),
    (( "; that's why it", verbalGroup  ),  1 ))

# --------------------------------------------------
# main program
# --------------------------------------------------
try:
    import win32com.client
    voice = win32com.client.Dispatch("sapi.SPVoice")
except:
    voice = None

print ("press <enter> to resume, 'q'+<enter> to quit\n")

while True:
    print ('')
    for i in range(8):
        generatedSentence = sentence.getString()
        # PY3: print (generatedSentence,end='')
        print generatedSentence,
        if voice:
            voice.speak(generatedSentence)
    # PY3: cmd = input()
    cmd = raw_input()        
    if cmd.strip().lower() == "q":
        break

Note: for Python 3 users, just replace the two statements marked with # PY3:

You can run this program and verify that it produces the same kind of gobbledygook than the original version. Let him the last word:


The official operator is performed by the most important actor. The interoperability is off-line. The environment is authorized by the rule-based identifier that needs the availability thread. The W3C is located in a high-resolution style sheet because it is surely point-to-point. A session sends the use case owner that guides the asynchronous content. The approach aggregator, which shall have the algorithm, shall be an identifier. The OO link administrates a controller information that executes an identifier when the thread-safe opportunity engine user experience that populates a bridge roughly extracts the most important API. The environment is derived from the dispatcher from a unit acknowledgment of an interface.


Updated