Issue #366 resolved

Updating the index yields RuntimeError: maximum recursion depth exceeded

Anonymous created an issue

Hi,

I'm using Whoosh 2.5.4 in combination with django-haystack to index text extracted by pyPdf from PDF files. Creating the index for the first time and searching within it works, but updating the index fails. I narrowed it down to one single document, where text extraction fails and the extracted text has almost no spaces. A shortened command-line output is below. The complete command-line output and the text that I put into the index are attached.

ERROR:root:Error updating myapp using default 
Traceback (most recent call last):
  File "/home/me/python/django-apps/haystack/management/commands/update_index.py", line 223, in handle_label
    self.update_backend(label, using)
  File "/home/me/python/django-apps/haystack/management/commands/update_index.py", line 269, in update_backend
    do_update(backend, index, qs, start, end, total, self.verbosity)
  File "/home/me/python/django-apps/haystack/management/commands/update_index.py", line 91, in do_update
    backend.update(index, current_qs)
  File "/home/me/python/django-apps/haystack/backends/whoosh_backend.py", line 208, in update
    writer.commit()
  File "/usr/local/lib/python2.7/dist-packages/Whoosh-2.5.4-py2.7.egg/whoosh/writing.py", line 1037, in commit
    self.writer.commit(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/Whoosh-2.5.4-py2.7.egg/whoosh/writing.py", line 922, in commit
    finalsegments = self._merge_segments(mergetype, optimize, merge)
  File "/usr/local/lib/python2.7/dist-packages/Whoosh-2.5.4-py2.7.egg/whoosh/writing.py", line 827, in _merge_segments
    return mergetype(self, self.segments)
  File "/usr/local/lib/python2.7/dist-packages/Whoosh-2.5.4-py2.7.egg/whoosh/writing.py", line 88, in MERGE_SMALL
    writer.add_reader(reader)
  File "/usr/local/lib/python2.7/dist-packages/Whoosh-2.5.4-py2.7.egg/whoosh/writing.py", line 707, in add_reader
    self.add_postings_to_pool(reader, basedoc, docmap)
  File "/usr/local/lib/python2.7/dist-packages/Whoosh-2.5.4-py2.7.egg/whoosh/writing.py", line 642, in add_postings_to_pool
    for word in reader.word_graph(fieldname).flatten():
  File "/usr/local/lib/python2.7/dist-packages/Whoosh-2.5.4-py2.7.egg/whoosh/automata/fst.py", line 410, in flatten
    for result in node.flatten(sofar + key):

... (the last message repeats many times) ...

Comments (3)

  1. Daniel Black

    Ha, I was doing the same thing with PyPDF2, though I suspect what you have is a content error in processing, unrelated to the extraction.

            from datetime import datetime
            from PyPDF2.pdf import PdfFileReader

            # path, name and ixwr (an open index writer) are defined elsewhere
            pdf = PdfFileReader(open(path, 'rb'))
            # Replace non-breaking spaces on the unicode text; doing it on the
            # UTF-8 bytes would leave a stray "\xc2" byte behind
            content = u"\n".join(
                p.extractText().replace(u"\xa0", u" ") for p in pdf.pages
            ) + u"\n"

            info = pdf.documentInfo
            # fallback option if no title in document
            title = info.get('/Title', name)
            # Date format  '/ModDate': u'D:20090906232935'
            # or u"20130227144103+11'00'"
            update_text = info.get('/ModDate') or info.get('/CreationDate')
            update = None
            if update_text:
                # The "D:" prefix is optional, so strip it only when present
                if update_text.startswith('D:'):
                    update_text = update_text[2:]
                try:
                    update = datetime.strptime(update_text[:14], '%Y%m%d%H%M%S')
                except ValueError:
                    pass
            ixwr.update_document(path=name,
                        title=unicode(title),
                        content=content,
                        update=update)
    
  2. Matt Chaput repo owner

    Fixed add_postings_to_pool() to use Cursor.flatten() instead of Node.flatten(). Node.flatten() caused recursion error on very long words. Cursor.flatten() is iterative instead of recursive. Fixes issue #366.

    → <<cset c2f1f0544b65>>
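
The difference between the two traversal styles can be sketched with a toy trie. This is only an illustration under stated assumptions, not Whoosh's actual FST code: `Node`, `insert`, `flatten_recursive` and `flatten_iterative` are hypothetical names invented for this example. The recursive generator needs one Python frame per character of a word, so a word longer than the interpreter's recursion limit fails, while the explicit-stack version is bounded only by memory.

```python
import sys


class Node(object):
    """Toy trie node (hypothetical; NOT Whoosh's FST node class)."""

    def __init__(self):
        self.edges = {}     # maps a single character to a child Node
        self.final = False  # True if a word ends at this node


def insert(root, word):
    """Add `word` to the trie rooted at `root`."""
    node = root
    for ch in word:
        node = node.edges.setdefault(ch, Node())
    node.final = True


def flatten_recursive(node, sofar=""):
    """Yield all words in sorted order; nests one generator per character."""
    if node.final:
        yield sofar
    for key in sorted(node.edges):
        for result in flatten_recursive(node.edges[key], sofar + key):
            yield result


def flatten_iterative(root):
    """Yield the same words in the same order using an explicit stack."""
    stack = [(root, "")]
    while stack:
        node, sofar = stack.pop()
        if node.final:
            yield sofar
        # Push children in reverse order so the smallest key is popped first
        for key in sorted(node.edges, reverse=True):
            stack.append((node.edges[key], sofar + key))
```

With a word of a few thousand characters (such as text extracted without spaces), `flatten_recursive` would be expected to hit the recursion limit, while `flatten_iterative` walks the same trie without growing the call stack.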
