Repr of lea instances with large fractions is very slow to compute

Issue #70 resolved
Paul Moore created an issue

Creating a lea instance with probabilities that are fractions with large numerator/denominator values is very quick:

from fractions import Fraction
import lea

x = [
 (1, Fraction(0, 1)),
 (2, Fraction(35, 648)),
 (3, Fraction(9065, 15552)),
 (4, Fraction(10085, 31104)),
 (5, Fraction(1130389, 30233088)),
 (6, Fraction(16523, 11337408)),
 (7, Fraction(49909, 2176782336)),
 (8, Fraction(139213, 940369969152)),
 (9, Fraction(34865, 101559956668416)),
 (10, Fraction(19, 101559956668416)),
 (11, Fraction(0, 1)),
 (-1, Fraction(0, 1))]

# This takes about 70 microseconds
l = lea.pmf(x)

However, displaying such an instance is extremely slow (20 seconds or more).

Calculating l.as_string("%") is very fast, though, so it isn’t the fraction calculations themselves that are slow.
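For reference, the timing gap can be reproduced along these lines (a rough sketch; the exact numbers above are from my machine and will vary):

import timeit

# formatting the probabilities as percentages is fast...
print(timeit.timeit(lambda: l.as_string("%"), number=1))
# ...while computing the repr takes tens of seconds
print(timeit.timeit(lambda: repr(l), number=1))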

After some debugging, it appears that the slowdown occurs in get_alea() via get_leaves_set(), and ultimately with frozenset((l,)) (or equivalently, {l: 0}). But I cannot determine what method(s) of l get called when creating a frozenset or dict that might cause such a massive slowdown. It doesn’t seem to be hash().

After a bit more digging, the issue seems to be in Lea.__getattribute__, called with an attribute name of _ps. But that’s as far as I can get, sorry. I suspect there’s an infinite loop somewhere; __getattribute__ is tricky to get right…

Comments (10)

  1. Paul Moore reporter

    I’ve just hit this again, in a different context, and I ran the code with a profiler. The __getattribute__ implementation is definitely a significant performance bottleneck - it’s not an infinite loop as I suggested above; it’s just very slow (it’s called far more often than any other function in the code, and while each call is fast enough, the times add up).
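    Roughly, the profiling I did looked like this (a sketch using cProfile, applied here to the repr(l) reproduction from the issue description; output details omitted):

    import cProfile
    import pstats

    # profile the slow repr() call and list the most frequently called functions;
    # Lea.__getattribute__ dominates the call counts
    cProfile.run("repr(l)", "repr_profile.out")
    pstats.Stats("repr_profile.out").sort_stats("ncalls").print_stats(10)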

    I understand that the point here is to make Lea instances look as similar to “normal” Python values as possible, but the convenience has a significant cost in terms of performance. In my case, getting the absorbing MC info for a 120-state Markov chain takes around 25 seconds - slow enough that I’d rather do the calculations manually, using numpy directly…

  2. Pierre Denis repo owner
    • changed status to open

    I’ll answer the primary issue first.

    Well spotted! This is indeed very surprising, since no probabilistic algorithm should be needed for such a simple operation.

    Now, after some tests and analysis, I think I’ve found the culprit. It lies in the calculation of the common denominator of the given probability fractions. This is done in Lea by the homemade method ExtFraction.calc_lcm. For some input arguments, this turns out to be very inefficient, as you’ve experienced.

    The fix is quite easy in Python 3.9+: there is a native math.lcm function that performs the job efficiently, whatever the arguments. For earlier Python versions, there are also fallback approaches, relying on the native gcd function and the identity lcm(a, b) = a*b / gcd(a, b). More details here.
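    For illustration, the replacement boils down to something like the following (a sketch only, not the actual Lea code; the helper name is made up):

    import sys
    from functools import reduce
    from math import gcd

    if sys.version_info >= (3, 9):
        from math import lcm  # native and fast, whatever the arguments
    else:
        def lcm(a, b):
            # lcm(a, b) = a*b // gcd(a, b); math.gcd is available since Python 3.5
            return a * b // gcd(a, b)

    def common_denominator(fractions):
        # fold lcm over all denominators to get their least common multiple
        return reduce(lcm, (f.denominator for f in fractions), 1)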

    The tests I’ve done show that replacing ExtFraction.calc_lcm with the native lcm function (or the native gcd for Python < 3.9) eliminates the slowdown.

  3. Pierre Denis repo owner

    Here I answer your latest comment.

    Following my previous answer, I think your Markov chain case is probably a separate issue. Reading your comment, I concede that this Lea.__getattribute__ thing was ultimately a questionable design choice. One of the rationales was to make it easy to build probability distributions based on attributes of the objects used as the support of a distribution. See for example character.race here.

    I simply hadn’t anticipated such side effects. The best option would indeed be to replace this syntactic sugar with something more conventional, e.g. character.get_attr("race"), as sketched below. This redesign is a bit heavier and also requires changes to the tutorial pages.
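    To make the trade-off concrete, the idea behind the sugar can be sketched independently of Lea as follows (purely illustrative; the class and names are made up):

    class Dist:
        """Toy discrete distribution over arbitrary objects."""

        def __init__(self, support):
            self._support = list(support)

        def get_attr(self, name):
            # explicit form: map the attribute access over the support
            return Dist(getattr(obj, name) for obj in self._support)

        def __getattr__(self, name):
            # sugared form: dist.race behaves like dist.get_attr("race");
            # __getattr__ only fires for missing attributes, whereas an
            # override of __getattribute__ intercepts every lookup
            if name.startswith("_"):
                raise AttributeError(name)
            return self.get_attr(name)

    The difference is that Lea hooks __getattribute__, which Python invokes on every attribute lookup (hence the call counts reported above), whereas the explicit get_attr form leaves ordinary lookups untouched.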

    Since this is not backward-compatible, I think it’s sensible to put this in a new major version (Lea 4). This could be an opportunity to drop support for Python 2, which would simplify several things along the way.

    If possible, could you please send your Markov chain example, so I can check the issue later?

  4. Paul Moore reporter

    Nice fix! I had originally thought it was related to large fractions (I was quite worried about the size of the numbers involved), but I couldn’t pin down where the slowdown was, and profiling showed __getattribute__ as a big contributor, so I was misled.

    I’ll send the Markov chain, probably tomorrow if that’s OK.

    Regarding the __getattribute__ design, I agree it’s not ideal - it’s convenient in many cases, but overall it can end up being confusing. I assumed it was too deeply ingrained to be fixed, so I simply put up with it. But if you are doing a redesign for a Lea 4, there are a few other suggestions I’d make; I’ll put them in a separate issue for you to consider.

  5. Pierre Denis repo owner

    Great! A separate ticket is indeed best. Thanks for reporting these issues. I already see an overall improvement when dealing with fractions.

    I will release a minor version (3.4.4) soon for the present issue.

  6. Paul Moore reporter

    Here's the primes example. Run it with an argument of 120 and it takes 10 seconds or so on my PC. Run it with 240 and it takes nearly a minute and a half.

  7. Pierre Denis repo owner

    Well received, thanks! I made a quick test removing the __getattribute__ method. For the specific case of 240, processing time decreases by a factor of 3 at least. So this is definitely the way forward!

    I’ll create a dedicated ticket for this enhancement. -> #72

    Also, I’ll probably answer #71 tomorrow.
