Memoization decorator

Issue #22 resolved
John Sivak created an issue

There is a subtle bug in function memoize(f) in savReader.py:443.

Once MAXCACHE is reached, "cache.popitem()" is called to remove an item so the cache does not exceed MAXCACHE entries. The problem is that "popitem()" removes an arbitrary item from the dictionary, and when processing a large file it can remove the very item that was just added. When that happens, the subsequent cache lookup crashes with a KeyError.
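
For reference, the memoize in savReader.py presumably follows the ActiveState recipe cited in its docstring, roughly like this (a sketch, not a verbatim copy):

    def memoize(f):
        """Memoization decorator
        http://code.activestate.com/recipes/577219-minimalistic-memoization/"""
        cache = {}
        MAXCACHE = 10**4

        def memf(*x):
            if x not in cache:
                cache[x] = f(*x)
            if len(cache) > MAXCACHE:
                cache.popitem()  # may evict the entry for x itself...
            return cache[x]      # ...in which case this raises the KeyError
        return memf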

So here is a crude workaround that I have in place:

    def memoize(f):
        """Memoization decorator
        http://code.activestate.com/recipes/577219-minimalistic-memoization/"""
        cache = {}
        MAXCACHE = 10**4

        def memf(*x):
            if x not in cache:
                cache[x] = f(*x)
            if len(cache) > MAXCACHE:
                # popitem() evicts an arbitrary entry; if it happens to be the
                # entry for x itself, evict a different one and put x back, so
                # the lookup below cannot raise a KeyError
                popped_key, popped_value = cache.popitem()
                if popped_key == x:
                    cache.popitem()
                    cache[popped_key] = popped_value
            return cache[x]
        return memf  

Comments (14)

  1. Albert-Jan Roskam repo owner

    Hi John,

    Thanks for the time you are putting into this. I knew about functools.lru_cache, but I did not know about the Python 2 backport. The reason I used memoization was (of course) to speed up working with dates, so I am curious about the overhead of lru_cache.
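
    For the record, the backport can presumably be used as a drop-in replacement, something like this (untested; backports.functools_lru_cache is the PyPI package name):

    try:
        from functools import lru_cache  # stdlib on Python 3.2+
    except ImportError:
        # pip install backports.functools_lru_cache
        from backports.functools_lru_cache import lru_cache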

    I was thinking about simply capping the cache size. This is what I cobbled together. Given the sample data (and this really matters, because depending on the set of dates memoization could also slow things down!), the memoized run takes about one third of the time of the "unmemoized" run. I'll try lru_cache later today (gotta go now, sorry):

    from __future__ import division
    import sys
    import datetime
    import random
    import profile
    
    def memoize(f):
        """Memoization decorator
        http://code.activestate.com/recipes/577219-minimalistic-memoization/"""
        cache = {}
        MAXCACHE = 10 ** 7
    
        def memf(*x):
            if x in cache:
                return cache[x]
            elif len(cache) < MAXCACHE:
                result = f(*x)
                cache[x] = result
                return result
            return f(*x)
        return memf 
    
    def spss_to_iso(spssdate):
        gregorianEpoch = datetime.datetime(1582, 10, 14, 0, 0, 0)
        theDate = (gregorianEpoch + datetime.timedelta(seconds=spssdate))
        return datetime.datetime.strftime(theDate, "%Y-%m-%d %H:%M:%S")
    
    # create some test data to work with
    SOME_DATE = 12121574400  # 1966-11-26 00:00:00
    ONE_DAY = 24 * 60 * 60
    SAMPLE_SIZE = 1000000
    epoch = xrange(0, 80 * 365 * ONE_DAY, ONE_DAY)  # suppose these are birthdates in an 80-year range
    random.seed(54321)
    random_dates = [SOME_DATE + random.choice(epoch) for i in xrange(SAMPLE_SIZE)]
    
    # how big is a filled cache anyway?
    print "Cache size".center(79, "-")
    MAXCACHE = 10 ** 7
    cache = {i: spss_to_iso(i) for i in xrange(SOME_DATE, SOME_DATE + MAXCACHE)}
    cache_size = sys.getsizeof(cache)
    print "%d items, %d bytes, %5.3f megabytes" % (len(cache), cache_size, (cache_size / 2 ** 20))
    
    ######
    def slow(spssdate):
        return spss_to_iso(spssdate)
    
    def without_memoize():
        for spssdate in random_dates:
            slow(spssdate)
    
    print "without memoize".center(79, "-")
    profile.run("without_memoize()")
    
    ######
    @memoize
    def fast(spssdate):
        return spss_to_iso(spssdate)
    
    def with_memoize():
        for spssdate in random_dates:
            fast(spssdate)
    
    print "with memoize".center(79, "-")
    profile.run("with_memoize()")
    
    -----------------------------------Cache size----------------------------------
    10000000 items, 402653464 bytes, 384.000 megabytes
    --------------------------------without memoize--------------------------------
             2000004 function calls in 15.833 seconds
    
       Ordered by: standard name
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.000    0.000    0.000    0.000 :0(setprofile)
            1    0.000    0.000   15.833   15.833 <string>:1(<module>)
      1000000    8.809    0.000    8.809    0.000 memooize_stuff.py:23(spss_to_iso)
      1000000    4.660    0.000   13.469    0.000 memooize_stuff.py:45(slow)
            1    2.364    2.364   15.833   15.833 memooize_stuff.py:48(without_memoize)
            0    0.000             0.000          profile:0(profiler)
            1    0.000    0.000   15.833   15.833 profile:0(without_memoize())
    
    
    ----------------------------------with memoize---------------------------------
             1087604 function calls in 5.668 seconds
    
       Ordered by: standard name
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        29200    0.056    0.000    0.056    0.000 :0(len)
            1    0.000    0.000    0.000    0.000 :0(setprofile)
            1    0.000    0.000    5.668    5.668 <string>:1(<module>)
      1000000    2.708    0.000    3.164    0.000 memooize_stuff.py:13(memf)
        29200    0.280    0.000    0.280    0.000 memooize_stuff.py:23(spss_to_iso)
        29200    0.120    0.000    0.400    0.000 memooize_stuff.py:57(fast)
            1    2.504    2.504    5.668    5.668 memooize_stuff.py:61(with_memoize)
            0    0.000             0.000          profile:0(profiler)
            1    0.000    0.000    5.668    5.668 profile:0(with_memoize())
    
  2. Albert-Jan Roskam repo owner

    OK, I tested everything with Python 3, including lru_cache. I also tried running the lru_cache version with Python 2, but it throws an error.
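
    The lru_cache part of memooize_stuff.py is not in the listing above; it was presumably something along these lines (the names uses_lru_decorator and with_lru_memoize are taken from the profile output and the traceback, the rest is a guess). Incidentally, the per-call update_wrapper/getattr activity in the Python 3 profile and the AttributeError below look like what a bare @lru_cache (rather than @lru_cache(maxsize=...)) would produce, because in these versions the bare form passes the decorated object in as maxsize.

    from functools import lru_cache  # backports.functools_lru_cache on Python 2

    @lru_cache(maxsize=MAXCACHE)
    def uses_lru_decorator(spssdate):
        return spss_to_iso(spssdate)

    def with_lru_memoize():
        for spssdate in random_dates:
            uses_lru_decorator(spssdate)

    print("with lru_cache decorator".center(79, "-"))
    profile.run("with_lru_memoize()")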

    albertjan@debian:~$ uname -a
    Linux debian 3.2.0-4-amd64 #1 SMP Debian 3.2.60-1+deb7u3 x86_64 GNU/Linux
    albertjan@debian:~$ python3 -c "import sys; print(sys.version_info)"
    sys.version_info(major=3, minor=3, micro=4, releaselevel='final', serial=0)
    albertjan@debian:~$ python3 '/home/albertjan/Downloads/memooize_stuff.py' 
    --------------------------------without memoize--------------------------------
             2000005 function calls in 16.031 seconds
    
       Ordered by: standard name
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.000    0.000   16.024   16.024 :0(exec)
            1    0.008    0.008    0.008    0.008 :0(setprofile)
            1    0.000    0.000   16.024   16.024 <string>:1(<module>)
      1000000   12.323    0.000   12.323    0.000 memooize_stuff.py:33(spss_to_iso)
      1000000    2.381    0.000   14.704    0.000 memooize_stuff.py:55(slow)
            1    1.319    1.319   16.024   16.024 memooize_stuff.py:58(without_memoize)
            0    0.000             0.000          profile:0(profiler)
            1    0.000    0.000   16.031   16.031 profile:0(without_memoize())
    
    
    ----------------------------------with memoize---------------------------------
             1087605 function calls in 3.548 seconds
    
       Ordered by: standard name
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.000    0.000    3.548    3.548 :0(exec)
        29200    0.034    0.000    0.034    0.000 :0(len)
            1    0.000    0.000    0.000    0.000 :0(setprofile)
            1    0.000    0.000    3.548    3.548 <string>:1(<module>)
      1000000    1.622    0.000    2.112    0.000 memooize_stuff.py:23(memf)
        29200    0.388    0.000    0.388    0.000 memooize_stuff.py:33(spss_to_iso)
        29200    0.069    0.000    0.457    0.000 memooize_stuff.py:67(fast)
            1    1.436    1.436    3.548    3.548 memooize_stuff.py:71(with_memoize)
            0    0.000             0.000          profile:0(profiler)
            1    0.000    0.000    3.548    3.548 profile:0(with_memoize())
    
    
    ----------------------------with lru_cache decorator---------------------------
             11000005 function calls in 37.566 seconds
    
       Ordered by: standard name
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.000    0.000   37.566   37.566 :0(exec)
      7000000   10.588    0.000   10.588    0.000 :0(getattr)
      1000000    1.276    0.000    1.276    0.000 :0(setattr)
            1    0.000    0.000    0.000    0.000 :0(setprofile)
      1000000    1.085    0.000    1.085    0.000 :0(update)
            1    0.000    0.000   37.565   37.565 <string>:1(<module>)
      1000000    6.128    0.000   34.314    0.000 functools.py:222(decorating_function)
      1000000   15.237    0.000   28.186    0.000 functools.py:35(update_wrapper)
            1    3.251    3.251   37.565   37.565 memooize_stuff.py:83(with_lru_memoize)
            0    0.000             0.000          profile:0(profiler)
            1    0.000    0.000   37.566   37.566 profile:0(with_lru_memoize())
    
    albertjan@debian:~$ python2 -c "import sys; print(sys.version_info)"
    sys.version_info(major=2, minor=7, micro=3, releaselevel='final', serial=0)
    albertjan@debian:~$ python2 '/home/albertjan/Downloads/memooize_stuff.py' 
    --------------------------------without memoize--------------------------------
             2000004 function calls in 12.401 seconds
    ...
    ----------------------------------with memoize---------------------------------
             1087604 function calls in 4.496 seconds
    ...
    ----------------------------with lru_cache decorator---------------------------
    Traceback (most recent call last):
      File "/home/albertjan/Downloads/memooize_stuff.py", line 88, in <module>
        profile.run("with_lru_memoize()")
      File "/usr/lib/python2.7/profile.py", line 61, in run
        prof = prof.run(statement)
      File "/usr/lib/python2.7/profile.py", line 438, in run
        return self.runctx(cmd, dict, dict)
      File "/usr/lib/python2.7/profile.py", line 444, in runctx
        exec cmd in globals, locals
      File "<string>", line 1, in <module>
      File "/home/albertjan/Downloads/memooize_stuff.py", line 85, in with_lru_memoize
        uses_lru_decorator(spssdate)
      File "/usr/local/lib/python2.7/dist-packages/backports/functools_lru_cache.py", line 162, in decorating_function
        return update_wrapper(wrapper, user_function)
      File "/usr/lib/python2.7/functools.py", line 33, in update_wrapper
        setattr(wrapper, attr, getattr(wrapped, attr))
    AttributeError: 'int' object has no attribute '__module__'
    
  3. John Sivak reporter

    lru_cache does seem to add quite a bit more time.. My "hack" fix to the code doesn't look so bad now.. :)

    Also, forgot to mention we're using Python 2.7.5.

  4. Albert-Jan Roskam repo owner

    Ah, of course, shame on me, I should have included that in the comparison too. The only thing I changed in your code above is that I set the cache size to 10 ** 7. Maybe it would be useful to have an environment variable that lets the user set the cache size to zero (no memoizing, useful for files without duplicate datetimes, e.g. logfiles), None (boundless cache --> no need to check the cache size --> faster), or an integer > 0. Anyway, here are the results for your implementation:

    albertjan@debian:~$ python2 '/home/albertjan/Downloads/memooize_stuff.py' 
    -------------------------------with john memoize-------------------------------
             2058404 function calls in 8.061 seconds
    ...
    albertjan@debian:~$ python3 '/home/albertjan/Downloads/memooize_stuff.py' 
    -------------------------------with john memoize-------------------------------
             2058405 function calls in 6.305 seconds
    ...
    
  5. Albert-Jan Roskam repo owner

    something like this (untested)

    def get_memoizer(self):
        # os and re are assumed to be imported at module level
        maxcache = os.environ.get("SAVRW_MAXCACHE", "10000000")
        m = re.match("(None|[0-9]+)$", maxcache, re.I)
        if m:
            maxcache = eval(m.group(1).title())  # None or a non-negative int
        else:
            raise ValueError("SAVRW_MAXCACHE has invalid value")
        if maxcache is None:
            # boundless caching
            def memoize(f):
                cache = {}
                def memf(*x):
                    if x in cache:
                        return cache[x]
                    result = f(*x)
                    cache[x] = result
                    return result
                return memf
        elif maxcache == 0:
            # no caching/memoizing at all
            def memoize(f):
                def memf(*x):
                    return f(*x)
                return memf
        elif maxcache > 0:
            # bounded caching: stop adding entries once maxcache is reached
            def memoize(f):
                cache = {}
                def memf(*x):
                    if x in cache:
                        return cache[x]
                    elif len(cache) < maxcache:
                        result = f(*x)
                        cache[x] = result
                        return result
                    return f(*x)
                return memf
        memoize.__doc__ = """Memoization decorator
        http://code.activestate.com/recipes/577219-minimalistic-memoization/"""
        return memoize
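
    From the caller's side that would boil down to setting the environment variable before the reader goes to work, e.g. (hypothetical, just to illustrate the SAVRW_MAXCACHE values from the sketch above):

    import os
    os.environ["SAVRW_MAXCACHE"] = "0"       # no memoizing (e.g. logfile-like data)
    # os.environ["SAVRW_MAXCACHE"] = "None"  # boundless cache
    # os.environ["SAVRW_MAXCACHE"] = "50000" # cap the cache at 50000 entries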
    
  6. John Sivak reporter

    I'm all for speed in the code (we're using savReaderWriter to export files from a data warehouse), so whatever choice you want to pursue for the memoization works for me. It seems that using dict.popitem() is still the "fastest" and "simplest", as long as we correct for the somewhat rare occurrence of .popitem() removing the value we just added to the cache. I've not needed to adjust the cache size, so I'm not sure of the value of providing a different cache size via an environment variable, unless you want to allow users to tweak the values.

    Also, based on your tests, it appears that any memoization is better than none, though the lru_cache approach may be too fancy/expensive..

    After re-reading your previous post: I personally wouldn't use a "boundless" cache, since I'm dealing with large amounts of data, so I'd lean towards a bounded or "self-cleaning" cache.

  7. John Sivak reporter

    Were you waiting for any further feedback from me? Just hit this bug again in a different virtualenv and was reminded to check the bug report.

    Thanks

  8. Albert-Jan Roskam repo owner

    hi John,

    Thanks for reminding me. I played with a bunch of implementations, but in the end I chose the fastest version (well, there was one faster implementation from the Python Cookbook, but its cache could grow infinitely). I hope this is fixed now. The unittests run ok, but that does not say much in this case. I should generate a humongous dataset to see how it behaves, but I think it works ok now. Btw, I also just added a GUI-based file viewer for .sav, .csv and .xls files. Much better than starting up SPSS or PSPP. It is under util.

    Best wishes, Albert-Jan

    PS: are you on LinkedIn?

  9. John Sivak reporter

    Are you going to push a new release to PYPI soon, or should I test/check with the latest code on the master branch?

    Yeah, I'm jsivak@winterjewel.com on LinkedIn.

  10. Albert-Jan Roskam repo owner

    Hi,

    Thanks. I don't think I am going to push a release any time soon. I still need to find the time to fix some unittests (striding and fancy indexing). Also, I would like to do some profiling (something I haven't done since I made the code usable for Python 2 and 3). Better to pip install from the repo, as outlined in the readme.

    Best wishes, Albert-Jan
