enron benchmark fails due to an outdated dataset URL

Issue #458 new
Thomas Koch created an issue

When trying to run the enron benchmark, the download fails and the subsequent untar step fails as well:

$ cd benchmark
$ python enron.py -d test --setup
Downloading Enron email archive to '/Users/koch/Projekte/whoosh/whoosh-src/whoosh/benchmark/test/enron_mail_082109.tar.gz'...
('Downloaded in ', 1.1230289936065674, 'seconds')
Caching messages in /Users/koch/Projekte/whoosh/whoosh-src/whoosh/benchmark/test/enron_cache.pickle...
Traceback (most recent call last):
  File "enron.py", line 185, in <module>
  File "build/bdist.macosx-10.11-intel/egg/whoosh/support/bench.py", line 601, in run
  File "enron.py", line 104, in setup
    self.cache_messages(archive, cache)
  File "enron.py", line 87, in cache_messages
    for d in self.get_messages(archive):
  File "enron.py", line 62, in get_messages
    for text in Enron.get_texts(archive):
  File "enron.py", line 48, in get_texts
    archive = tarfile.open(archive, "r:gz")
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/tarfile.py", line 1685, in open
    return func(name, filemode, fileobj, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/tarfile.py", line 1743, in gzopen
    raise ReadError("not a gzip file")
tarfile.ReadError: not a gzip file

The enron dataset was updated in 2015 and the old download URL no longer works. I have attached a simple patch for enron.py to this ticket that fixes the issue.

There were also some issues with the pickled cache file when the dir option (-d) was used - that's fixed too.
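
For anyone hitting the same traceback: the very short download time suggests that the server returned something other than the archive (presumably a small error page), which tarfile then rejects. The following is not the attached patch, only a minimal sketch of how a download could be checked before it is passed to tarfile.open. The names looks_like_gzip and check_archive are hypothetical and do not exist in enron.py; the demo files are made up for illustration.

```python
import gzip
import os
import tempfile

# Every gzip stream starts with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def looks_like_gzip(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def check_archive(path):
    """Hypothetical guard: fail early with a clearer message than tarfile's."""
    if not looks_like_gzip(path):
        raise IOError("%s is not a gzip file; the download URL may be stale" % path)


# Demo with two throwaway files: a fake error page and a real gzip file.
tmpdir = tempfile.mkdtemp()

bad = os.path.join(tmpdir, "error_page.tar.gz")
with open(bad, "wb") as f:
    f.write(b"<html><body>404 Not Found</body></html>")

good = os.path.join(tmpdir, "real.tar.gz")
with gzip.open(good, "wb") as f:
    f.write(b"payload")

print(looks_like_gzip(bad))   # False
print(looks_like_gzip(good))  # True
```

Calling check_archive right after the download step would turn the "not a gzip file" traceback into an error that points at the URL instead of at tarfile.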
