Memoization decorator

Issue #22 resolved
John Sivak created an issue

There is a subtle bug in function memoize(f) in savReader.py:443.

Once MAXCACHE is reached, "cache.popitem()" is called to remove an item so the cache does not exceed MAXCACHE entries. The problem is that "popitem()" removes an arbitrary item from the dictionary, and when processing a large file it can remove the very item that was just added. When that happens, the subsequent cache lookup crashes with a KeyError.
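
For reference, the memoize in savReader.py presumably follows the ActiveState recipe cited in its docstring, roughly like this (a sketch, not a verbatim copy):

    def memoize(f):
        """Memoization decorator
        http://code.activestate.com/recipes/577219-minimalistic-memoization/"""
        cache = {}
        MAXCACHE = 10**4

        def memf(*x):
            if x not in cache:
                cache[x] = f(*x)
            if len(cache) > MAXCACHE:
                cache.popitem()  # may evict the entry for x itself...
            return cache[x]      # ...in which case this raises the KeyError
        return memf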

So here is a crude workaround that I have in place:

    def memoize(f):
        """Memoization decorator
        http://code.activestate.com/recipes/577219-minimalistic-memoization/"""
        cache = {}
        MAXCACHE = 10**4

        def memf(*x):
            if x not in cache:
                cache[x] = f(*x)
            if len(cache) > MAXCACHE:
                # popitem() evicts an arbitrary entry; if it happens to be the
                # entry for x itself, evict a different one and put x back, so
                # the lookup below cannot raise a KeyError
                popped_key, popped_value = cache.popitem()
                if popped_key == x:
                    cache.popitem()
                    cache[popped_key] = popped_value
            return cache[x]
        return memf  

Comments (14)

  1. Albert-Jan Roskam repo owner

    Hi John,

    Thanks for the time you are putting into this. I knew about functools.lru_cache, but I did not know about the Python 2 backport. The reason I used memoization was (of course) to speed up working with dates, so I am curious about the overhead of lru_cache.
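
    For the record, the backport can presumably be used as a drop-in replacement, something like this (untested; backports.functools_lru_cache is the PyPI package name):

    try:
        from functools import lru_cache  # stdlib on Python 3.2+
    except ImportError:
        # pip install backports.functools_lru_cache
        from backports.functools_lru_cache import lru_cache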

    I was thinking about simply capping the cache size. This is what I cobbled together. Given the sample data (and this really matters, because depending on the set of dates memoization could also slow things down!), the memoized run takes about one third of the time of the "unmemoized" run. I'll try lru_cache later today (gotta go now, sorry):

    from __future__ import division
    import sys
    import datetime
    import random
    import profile
    
    def memoize(f):
        """Memoization decorator
        http://code.activestate.com/recipes/577219-minimalistic-memoization/"""
        cache = {}
        MAXCACHE = 10 ** 7
    
        def memf(*x):
            if x in cache:
                return cache[x]
            elif len(cache) < MAXCACHE:
                result = f(*x)
                cache[x] = result
                return result
            return f(*x)
        return memf 
    
    def spss_to_iso(spssdate):
        gregorianEpoch = datetime.datetime(1582, 10, 14, 0, 0, 0)
        theDate = (gregorianEpoch + datetime.timedelta(seconds=spssdate))
        return datetime.datetime.strftime(theDate, "%Y-%m-%d %H:%M:%S")
    
    # create some test data to work with
    SOME_DATE = 12121574400  # 1966-11-26 00:00:00
    ONE_DAY = 24 * 60 * 60
    SAMPLE_SIZE = 1000000
    epoch = xrange(0, 80 * 365 * ONE_DAY, ONE_DAY)  # suppose these are birthdates in an 80-year range
    random.seed(54321)
    random_dates = [SOME_DATE + random.choice(epoch) for i in xrange(SAMPLE_SIZE)]
    
    # how big is a filled cache anyway?
    print "Cache size".center(79, "-")
    MAXCACHE = 10 ** 7
    cache = {i: spss_to_iso(i) for i in xrange(SOME_DATE, SOME_DATE + MAXCACHE)}
    cache_size = sys.getsizeof(cache)
    print "%d items, %d bytes, %5.3f megabytes" % (len(cache), cache_size, (cache_size / 2 ** 20))
    
    ######
    def slow(spssdate):
        return spss_to_iso(spssdate)
    
    def without_memoize():
        for spssdate in random_dates:
            slow(spssdate)
    
    print "without memoize".center(79, "-")
    profile.run("without_memoize()")
    
    ######
    @memoize
    def fast(spssdate):
        return spss_to_iso(spssdate)
    
    def with_memoize():
        for spssdate in random_dates:
            fast(spssdate)
    
    print "with memoize".center(79, "-")
    profile.run("with_memoize()")
    
    -----------------------------------Cache size----------------------------------
    10000000 items, 402653464 bytes, 384.000 megabytes
    --------------------------------without memoize--------------------------------
             2000004 function calls in 15.833 seconds
    
       Ordered by: standard name
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.000    0.000    0.000    0.000 :0(setprofile)
            1    0.000    0.000   15.833   15.833 <string>:1(<module>)
      1000000    8.809    0.000    8.809    0.000 memooize_stuff.py:23(spss_to_iso)
      1000000    4.660    0.000   13.469    0.000 memooize_stuff.py:45(slow)
            1    2.364    2.364   15.833   15.833 memooize_stuff.py:48(without_memoize)
            0    0.000             0.000          profile:0(profiler)
            1    0.000    0.000   15.833   15.833 profile:0(without_memoize())
    
    
    ----------------------------------with memoize---------------------------------
             1087604 function calls in 5.668 seconds
    
       Ordered by: standard name
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        29200    0.056    0.000    0.056    0.000 :0(len)
            1    0.000    0.000    0.000    0.000 :0(setprofile)
            1    0.000    0.000    5.668    5.668 <string>:1(<module>)
      1000000    2.708    0.000    3.164    0.000 memooize_stuff.py:13(memf)
        29200    0.280    0.000    0.280    0.000 memooize_stuff.py:23(spss_to_iso)
        29200    0.120    0.000    0.400    0.000 memooize_stuff.py:57(fast)
            1    2.504    2.504    5.668    5.668 memooize_stuff.py:61(with_memoize)
            0    0.000             0.000          profile:0(profiler)
            1    0.000    0.000    5.668    5.668 profile:0(with_memoize())
    
  2. Albert-Jan Roskam repo owner

    OK, I tested everything with Python 3, including lru_cache. I also tried running the lru_cache version with Python 2, but it throws an error.
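
    The lru_cache part of memooize_stuff.py is not in the listing above; it was presumably something along these lines (the names uses_lru_decorator and with_lru_memoize are taken from the profile output and the traceback, the rest is a guess). Incidentally, the per-call update_wrapper/getattr activity in the Python 3 profile and the AttributeError below look like what a bare @lru_cache (rather than @lru_cache(maxsize=...)) would produce, because in these versions the bare form passes the decorated object in as maxsize.

    from functools import lru_cache  # backports.functools_lru_cache on Python 2

    @lru_cache(maxsize=MAXCACHE)
    def uses_lru_decorator(spssdate):
        return spss_to_iso(spssdate)

    def with_lru_memoize():
        for spssdate in random_dates:
            uses_lru_decorator(spssdate)

    print("with lru_cache decorator".center(79, "-"))
    profile.run("with_lru_memoize()")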

    albertjan@debian:~$ uname -a
    Linux debian 3.2.0-4-amd64 #1 SMP Debian 3.2.60-1+deb7u3 x86_64 GNU/Linux
    albertjan@debian:~$ python3 -c "import sys; print(sys.version_info)"
    sys.version_info(major=3, minor=3, micro=4, releaselevel='final', serial=0)
    albertjan@debian:~$ python3 '/home/albertjan/Downloads/memooize_stuff.py' 
    --------------------------------without memoize--------------------------------
             2000005 function calls in 16.031 seconds
    
       Ordered by: standard name
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.000    0.000   16.024   16.024 :0(exec)
            1    0.008    0.008    0.008    0.008 :0(setprofile)
            1    0.000    0.000   16.024   16.024 <string>:1(<module>)
      1000000   12.323    0.000   12.323    0.000 memooize_stuff.py:33(spss_to_iso)
      1000000    2.381    0.000   14.704    0.000 memooize_stuff.py:55(slow)
            1    1.319    1.319   16.024   16.024 memooize_stuff.py:58(without_memoize)
            0    0.000             0.000          profile:0(profiler)
            1    0.000    0.000   16.031   16.031 profile:0(without_memoize())
    
    
    ----------------------------------with memoize---------------------------------
             1087605 function calls in 3.548 seconds
    
       Ordered by: standard name
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.000    0.000    3.548    3.548 :0(exec)
        29200    0.034    0.000    0.034    0.000 :0(len)
            1    0.000    0.000    0.000    0.000 :0(setprofile)
            1    0.000    0.000    3.548    3.548 <string>:1(<module>)
      1000000    1.622    0.000    2.112    0.000 memooize_stuff.py:23(memf)
        29200    0.388    0.000    0.388    0.000 memooize_stuff.py:33(spss_to_iso)
        29200    0.069    0.000    0.457    0.000 memooize_stuff.py:67(fast)
            1    1.436    1.436    3.548    3.548 memooize_stuff.py:71(with_memoize)
            0    0.000             0.000          profile:0(profiler)
            1    0.000    0.000    3.548    3.548 profile:0(with_memoize())
    
    
    ----------------------------with lru_cache decorator---------------------------
             11000005 function calls in 37.566 seconds
    
       Ordered by: standard name
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.000    0.000   37.566   37.566 :0(exec)
      7000000   10.588    0.000   10.588    0.000 :0(getattr)
      1000000    1.276    0.000    1.276    0.000 :0(setattr)
            1    0.000    0.000    0.000    0.000 :0(setprofile)
      1000000    1.085    0.000    1.085    0.000 :0(update)
            1    0.000    0.000   37.565   37.565 <string>:1(<module>)
      1000000    6.128    0.000   34.314    0.000 functools.py:222(decorating_function)
      1000000   15.237    0.000   28.186    0.000 functools.py:35(update_wrapper)
            1    3.251    3.251   37.565   37.565 memooize_stuff.py:83(with_lru_memoize)
            0    0.000             0.000          profile:0(profiler)
            1    0.000    0.000   37.566   37.566 profile:0(with_lru_memoize())
    
    albertjan@debian:~$ python2 -c "import sys; print(sys.version_info)"
    sys.version_info(major=2, minor=7, micro=3, releaselevel='final', serial=0)
    albertjan@debian:~$ python2 '/home/albertjan/Downloads/memooize_stuff.py' 
    --------------------------------without memoize--------------------------------
             2000004 function calls in 12.401 seconds
    ...
    ----------------------------------with memoize---------------------------------
             1087604 function calls in 4.496 seconds
    ...
    ----------------------------with lru_cache decorator---------------------------
    Traceback (most recent call last):
      File "/home/albertjan/Downloads/memooize_stuff.py", line 88, in <module>
        profile.run("with_lru_memoize()")
      File "/usr/lib/python2.7/profile.py", line 61, in run
        prof = prof.run(statement)
      File "/usr/lib/python2.7/profile.py", line 438, in run
        return self.runctx(cmd, dict, dict)
      File "/usr/lib/python2.7/profile.py", line 444, in runctx
        exec cmd in globals, locals
      File "<string>", line 1, in <module>
      File "/home/albertjan/Downloads/memooize_stuff.py", line 85, in with_lru_memoize
        uses_lru_decorator(spssdate)
      File "/usr/local/lib/python2.7/dist-packages/backports/functools_lru_cache.py", line 162, in decorating_function
        return update_wrapper(wrapper, user_function)
      File "/usr/lib/python2.7/functools.py", line 33, in update_wrapper
        setattr(wrapper, attr, getattr(wrapped, attr))
    AttributeError: 'int' object has no attribute '__module__'
    
  3. John Sivak reporter

    lru_cache does seem to add quite a bit more time.. My "hack" fix to the code doesn't look so bad now.. :)

    Also, forgot to mention we're using Python 2.7.5.

  4. Albert-Jan Roskam repo owner

    Ah, of course, shame on me, I should have included that in the comparison too. The only thing I changed in your code above is that I set the cache size to 10 ** 7. Maybe it would be useful to have an environment variable that lets the user set the cache size to zero (no memoizing, useful for files without duplicate datetimes, e.g. logfiles), None (boundless cache --> no need to check the cache size --> faster), or an integer > 0. Anyway, here are the results for your implementation:

    albertjan@debian:~$ python2 '/home/albertjan/Downloads/memooize_stuff.py' 
    -------------------------------with john memoize-------------------------------
             2058404 function calls in 8.061 seconds
    ...
    albertjan@debian:~$ python3 '/home/albertjan/Downloads/memooize_stuff.py' 
    -------------------------------with john memoize-------------------------------
             2058405 function calls in 6.305 seconds
    ...
    
  5. Albert-Jan Roskam repo owner

    something like this (untested)

    def get_memoizer(self):
        # os and re are assumed to be imported at module level
        maxcache = os.environ.get("SAVRW_MAXCACHE", "10000000")
        m = re.match("(None|[0-9]+)$", maxcache, re.I)
        if m:
            maxcache = eval(m.group(1).title())  # None or a non-negative int
        else:
            raise ValueError("SAVRW_MAXCACHE has invalid value")
        if maxcache is None:
            # boundless caching
            def memoize(f):
                cache = {}
                def memf(*x):
                    if x in cache:
                        return cache[x]
                    result = f(*x)
                    cache[x] = result
                    return result
                return memf
        elif maxcache == 0:
            # no caching/memoizing at all
            def memoize(f):
                def memf(*x):
                    return f(*x)
                return memf
        elif maxcache > 0:
            # bounded caching: stop adding entries once maxcache is reached
            def memoize(f):
                cache = {}
                def memf(*x):
                    if x in cache:
                        return cache[x]
                    elif len(cache) < maxcache:
                        result = f(*x)
                        cache[x] = result
                        return result
                    return f(*x)
                return memf
        memoize.__doc__ = """Memoization decorator
        http://code.activestate.com/recipes/577219-minimalistic-memoization/"""
        return memoize
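
    From the caller's side that would boil down to setting the environment variable before the reader goes to work, e.g. (hypothetical, just to illustrate the SAVRW_MAXCACHE values from the sketch above):

    import os
    os.environ["SAVRW_MAXCACHE"] = "0"       # no memoizing (e.g. logfile-like data)
    # os.environ["SAVRW_MAXCACHE"] = "None"  # boundless cache
    # os.environ["SAVRW_MAXCACHE"] = "50000" # cap the cache at 50000 entries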
    
  6. John Sivak reporter

    I'm all for speed in the code (we're using savReaderWriter to export files from a data warehouse), so whatever choice you want to pursue for the memoization works for me. It seems that using dict.popitem() is still the "fastest" and "simplest", as long as we correct for the somewhat rare occurrence of .popitem() removing the value we just added to the cache. I've not needed to adjust the cache size, so I'm not sure of the value of providing a different cache size via an environment variable, unless you want to allow users to tweak the values.

    Also, based on your tests, it appears that any memoization is better than none, though the lru_cache approach may be too fancy/expensive..

    After re-reading your previous post: I personally wouldn't use a "boundless" cache, since I'm dealing with large amounts of data, so I'd lean towards a bounded or "self-cleaning" cache.

  7. John Sivak reporter

    Were you waiting for any further feedback from me? Just hit this bug again in a different virtualenv and was reminded to check the bug report.

    Thanks

  8. Albert-Jan Roskam repo owner

    hi John,

    Thanks for reminding me. I played with a bunch of implementations, but in the end I chose the fastest version (well, there was one faster implementation from the Python Cookbook, but its cache could grow infinitely). I hope this is fixed now. The unittests run ok, but that does not say much in this case. I should generate a humongous dataset to see how it behaves, but I think it works ok now. Btw, I also just added a GUI-based file viewer for .sav, .csv and .xls files. Much better than starting up SPSS or PSPP. It is under util.

    Best wishes, Albert-Jan

    PS: are you on LinkedIn?

  9. John Sivak reporter

    Are you going to push a new release to PYPI soon, or should I test/check with the latest code on the master branch?

    Yeah, I'm jsivak@winterjewel.com on LinkedIn.

  10. Albert-Jan Roskam repo owner

    Hi,

    Thanks. I don't think I am going to push a release any time soon. I still need to find the time to fix some unittests (striding and fancy indexing). Also, I would like to do some profiling (something I haven't done since I made the code usable for Python 2 and 3). Better to pip install from the repo, as outlined in the readme.

    Best wishes, Albert-Jan
