Incremental index keeps reindexing same file

Issue #453 resolved
Eric Blood created an issue

The incremental index keeps reindexing the same files each time it is run. To reproduce:

1. Rebuild the index from scratch.
2. Make a change in a repository.
3. Run the incremental index with debug output (it will indicate the changed file is being reindexed).
4. Run the incremental index again immediately (it will again indicate that the changed file is being reindexed).

It looks like this happens because the path field in the index is of type TEXT rather than ID. The incremental index tries to remove the old document by looking up the full path name, but it never finds it, because a TEXT field is indexed by the individual parts of the path rather than by the entire path. Because the old document isn't removed, the indexer keeps thinking that the file is out of date.
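To illustrate the suspected mechanism, here is a minimal sketch using a toy inverted index in plain Python. It is not Whoosh's implementation: the `ToyIndex` class, the `tokenize` helper, and the `reindex` function are hypothetical names made up for this example, and the tokenizer only approximates what a text analyzer does. The path is taken from the debug output later in this thread.

```python
import re


def tokenize(path):
    """Split a path into lowercase word tokens, roughly as a text analyzer might."""
    return [t.lower() for t in re.findall(r"\w+", path)]


class ToyIndex:
    """Toy inverted index (hypothetical, for illustration only)."""

    def __init__(self, tokenized):
        # tokenized=True models a TEXT-like field, False an ID-like field.
        self.tokenized = tokenized
        self.docs = {}      # document number -> set of indexed terms
        self.next_doc = 0

    def _terms(self, path):
        return set(tokenize(path)) if self.tokenized else {path}

    def add(self, path):
        self.docs[self.next_doc] = self._terms(path)
        self.next_doc += 1

    def delete_by_term(self, term):
        """Delete documents that contain `term` as an exact indexed term."""
        hits = [n for n, terms in self.docs.items() if term in terms]
        for n in hits:
            del self.docs[n]
        return len(hits)


def reindex(index, path):
    """Incremental update: drop the old entry by full path, then add the new one."""
    removed = index.delete_by_term(path)
    index.add(path)
    return removed


path = "/home/marcink/hg_repos/a2/readme.md"

text_like = ToyIndex(tokenized=True)
text_like.add(path)
removed_text = reindex(text_like, path)

id_like = ToyIndex(tokenized=False)
id_like.add(path)
removed_id = reindex(id_like, path)

print(removed_text, len(text_like.docs))  # nothing removed, a duplicate remains
print(removed_id, len(id_like.docs))      # old entry removed, one document remains
```

In the tokenized case the full path is never one of the indexed terms, so the delete matches nothing and every run leaves another stale document behind.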

I think that there should actually be two fields in the index: (1) full path which is an ID and contains the entire path and (2) the relative path which is of type TEXT and only contains the part of the path relative to the repo root. The file name search should be on the relative path. The full path should be used to remove documents during incremental indexing.

While investigating this, I also noticed that if you do a file name search on the repositories' location path, you will get every single file in all repositories (for example, my repositories are located in c:\hg\repos; if I search on `hg` or `repos`, I get all files in the results). This is because it's searching the full file path rather than just the portion relative to the repository.
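The difference between searching the full path and searching the repository-relative path can be sketched as follows. This is only an illustration under assumptions: the repository name `test`, the file `docs\readme.txt`, and the `tokenize` and `matches` helpers are hypothetical, and the tokenizer is a rough stand-in for a real text analyzer. `PureWindowsPath` is used so the example behaves the same on any platform.

```python
import re
from pathlib import PureWindowsPath


def tokenize(path):
    """Lowercase word tokens of a path (rough stand-in for a text analyzer)."""
    return {t.lower() for t in re.findall(r"\w+", str(path))}


def matches(query, path):
    """True if the query word is one of the path's tokens."""
    return query.lower() in tokenize(path)


repos_root = PureWindowsPath(r"c:\hg\repos")   # location from the report
repo_root = repos_root / "test"                # hypothetical repository
full_path = repo_root / "docs" / "readme.txt"  # hypothetical file

# The proposed second field: the path relative to the repository root.
relative_path = full_path.relative_to(repo_root)

full_hit = matches("repos", full_path)       # location leaks into the results
rel_hit = matches("repos", relative_path)    # location no longer matches
name_hit = matches("readme", relative_path)  # file name search still works

print(full_hit, rel_hit, name_hit)
```

Keeping the full path as an untokenized identifier and searching only the relative path separates the two jobs: one field identifies a document for deletion, the other serves file name queries.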

I'm running on Windows but I'm pretty sure this would apply to Linux too.

Comments (3)

  1. Marcin Kuzminski repo owner
    • changed status to open

    I'm having trouble reproducing this.

    I committed a file `readme.md` and fully indexed this repository:

    DEBUG [whoosh_indexer] building index @ /home/marcink/hg_repos/a2
    DEBUG [whoosh_indexer]     >> /home/marcink/hg_repos/a2/2phingerScroll
    DEBUG [whoosh_indexer]     >> /home/marcink/hg_repos/a2/readme.md
    DEBUG [whoosh_indexer] added 2 files 0 with content for repo /home/marcink/hg_repos/a2
    DEBUG [whoosh_indexer] >> COMMITING CHANGES <<
    DEBUG [whoosh_indexer] >>> FINISHED BUILDING INDEX <<<
    

    Then I moved the file from readme.md to readme.rst and reindexed:

    DEBUG [whoosh_indexer]     >> /home/marcink/hg_repos/a2/readme.rst [WITH CONTENT]
    DEBUG [whoosh_indexer] re indexing /home/marcink/hg_repos/a2/readme.rst
    DEBUG [whoosh_indexer] added 1 files 1 with content for repo /home/marcink/hg_repos/a2
    DEBUG [whoosh_indexer] >> COMMITING CHANGES <<
    DEBUG [whoosh_indexer] >>> FINISHED REBUILDING INDEX <<<
    

    Then I reindexed again, and nothing happened:

    DEBUG [whoosh_indexer] added 0 files 0 with content for repo /home/marcink/hg_repos/a2
    DEBUG [whoosh_indexer] >> COMMITING CHANGES <<
    DEBUG [whoosh_indexer] >>> FINISHED REBUILDING INDEX <<<
    

    Actually, whoosh stores the full path to a file, not only the name. I'm not sure if this is related, but https://bitbucket.org/marcinkuzminski/rhodecode/changeset/95bea8088213 did fix one issue with whoosh re-indexing things constantly, although that was caused by non-ASCII chars in names.

    Could you provide a more detailed example of what changes need to be made to reproduce that?

  2. Eric Blood reporter

    I don't see the re-indexing behavior when a new file is added or renamed. But I do see it anytime an existing file is modified (without renaming).

    I did wonder if this might be fixed in a new version of whoosh or if it might be a Windows only issue so I started from scratch with Linux but with the same results:

    1. Installed Ubuntu 11.04 on a clean VM
    2. Cloned Rhodecode and updated to tip.
    3. Installed Rhodecode from the clone (python setup.py install)
    4. Ran paster make-config and paster setup-rhodecode (using all defaults)
    5. Edited production.ini to turn on debug logging to the console
    6. Ran paster serve
    7. In the web interface created a new repository 'test'
    8. Pushed changes over http from a small repository I had on the host computer (about 6 text files)
    9. Ran paster make-index
    10. From the web interface edited a file and committed the change
    11. Ran paster make-index --update-only test
    12. Ran paster make-index --update-only test

    The edited file was reindexed both times make-index was run.

    Note that there is another bug with the new --update-only option. If you don't pass the --update-only option it doesn't index any repositories (rather than updating all repositories).

  3. Marcin Kuzminski repo owner

    OK, I was able to reproduce that. You were very close with your initial conclusion: it seems that delete_by_term stopped working sometime after upgrading whoosh versions.

    Bottom line: each run made (!!!!) a new entry for that file in the index.

    I changed SCHEMA to also store an ID, which is again the absolute path, and made it unique, so it should prevent duplicates from being created.

    I'm also now doing delete_by_term using this ID field, which fixes the problem (and should be much faster too).

    Thanks for posting all the info to solve this.
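The effect of the fix described above can be modelled with a short sketch in plain Python. It is not the actual RhodeCode or Whoosh code: the document layout, the field names `fileid` and `modtime`, and both functions are assumptions made for illustration. The point is only that deleting by an exact, unique identifier lets the second incremental run find an up-to-date entry and skip the file.

```python
def delete_by_term(docs, field, value):
    """Remove every document whose `field` equals `value` exactly (toy version)."""
    docs[:] = [d for d in docs if d[field] != value]


def incremental_update(docs, files_on_disk):
    """Reindex only files whose stored modification time is out of date."""
    indexed = {d["fileid"]: d["modtime"] for d in docs}
    reindexed = []
    for path, mtime in files_on_disk.items():
        if indexed.get(path) == mtime:
            continue  # entry is current, nothing to do
        # Exact match on the unique identifier removes the stale entry.
        delete_by_term(docs, "fileid", path)
        docs.append({"fileid": path, "modtime": mtime})
        reindexed.append(path)
    return reindexed


path = "/home/marcink/hg_repos/a2/readme.rst"
docs = [{"fileid": path, "modtime": 100}]  # state after a full index
files_on_disk = {path: 200}                # the file was modified afterwards

first_run = incremental_update(docs, files_on_disk)
second_run = incremental_update(docs, files_on_disk)

print(first_run)   # the modified file is reindexed once
print(second_run)  # nothing left to reindex
print(len(docs))   # a single entry for the file
```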
