Issue #109 on hold

Incremental indexing

Lars Thon
created an issue

I have been using recollindex -i to do a sort of "incremental" indexing, for example to update the index with files modified within the last 7 days:

find . -mtime -7 -type f -print | recollindex -i

I then realized that this trick does not account for files that have been REMOVED (deleted) or simply moved. Also, when I run a full "Update index" from the GUI, files that have not changed (same date and size, for example) appear to get re-indexed all over again. Please correct me if I am wrong, but I see filenames going by slowly, which suggests they are being re-indexed.

My question is the following: Would it not make sense to keep track of the name, timestamp and size of all indexed files, and (optionally) skip them (if they have already been indexed) or erase them from the index (if they no longer exist)?

Comments (6)

  1. medoc repo owner

    Removed or moved files are purged from the index at the end of a normal incremental pass. You can also do it explicitly with recollindex -e.
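A minimal sketch of the explicit purge for anyone scripting it by hand. The helper name `missing_paths` and the saved list `indexed-paths.txt` are hypothetical, not part of Recoll; the sketch only finds paths that have disappeared and leaves the purge itself to recollindex -e.

```shell
# Hypothetical helper (not part of Recoll): read a list of paths,
# one per line, and print those that no longer exist on disk.
missing_paths() {
    while IFS= read -r path; do
        [ -e "$path" ] || printf '%s\n' "$path"
    done < "$1"
}

# Possible usage, assuming indexed-paths.txt was saved by an earlier run
# and that recollindex -e reads paths on stdin the way the -i pipeline
# in the report does (otherwise pass the paths as arguments):
#   missing_paths indexed-paths.txt | recollindex -e
```

A normal incremental pass already does this purge, so a script like this would only matter when indexing exclusively through recollindex -i.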

    Files which have been indexed with recollindex -i have their reference signature (mtime and size in practice) recorded just like during a normal incremental pass.
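To illustrate the idea of a reference signature, here is a rough sketch of an mtime-and-size check. It is not Recoll's actual implementation: the function names and the tab-separated store file are assumptions made for the example, and `stat -c` assumes GNU coreutils.

```shell
# Print "mtime:size" for a file.
# Assumes GNU stat (coreutils); BSD/macOS stat takes different options.
file_sig() {
    stat -c '%Y:%s' -- "$1"
}

# Exit 0 when the file needs (re)indexing: either no signature is
# recorded for it, or the recorded one differs from the current one.
# The store is a hypothetical file of "path<TAB>signature" lines.
needs_reindex() {
    store=$1
    file=$2
    recorded=$(P="$file" awk -F '\t' '$1 == ENVIRON["P"] { print $2; exit }' "$store" 2>/dev/null)
    [ "$recorded" != "$(file_sig "$file")" ]
}
```

With a check like this, an unchanged file costs one stat call and one lookup instead of a full extraction, which is why a second pass over the same file should be cheaper.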

    You can verify this by using recollindex -i twice on the same file. It should be faster the second time.

    If things are not working this way, we need to investigate, perhaps by raising the verbosity level in the log to see why files are reindexed.

    There is more information about the log file here: https://bitbucket.org/medoc/recoll/wiki/ProblemSolvingData

    If you have trouble interpreting what it says, please attach it here or send it to me: jfd at recoll.org.

    jf

  2. Lars Thon reporter

    I do not have clear evidence that there is a bug. It is just the appearance of slowness that makes me wonder.

    Please keep this request open for a while, I don't have time to investigate the details today.

  3. medoc repo owner

    Had another quick look at this. For a small file, you really need to look at the log: the elapsed time is almost the same in both cases because the constant startup/closedown time dominates. Closing this for now until more data arrives.
