Issue #279 resolved

Method to clear index

Thomas Waldmann
created an issue

For some applications (those with lots of data to index), it would be nice to be able to build a new index separately while the application keeps using the current index, undisturbed.

When the new index is fully built and fully up-to-date (that might need one full build + some incremental builds until nothing changes any more), there needs to be a way to quickly switch over to the new storage.

Here is some application code that supports this:

http://hg.moinmo.in/moin/2.0/file/082581e8688c/MoinMoin/storage/middleware/indexing.py#l326

http://hg.moinmo.in/moin/2.0/file/082581e8688c/MoinMoin/storage/middleware/indexing.py#l377

Problem: this is specific to Whoosh storage, so that code should live in Whoosh, not in the application.

Or is there another way to reach this goal?

Comments (10)

  1. Matt Chaput repo owner

    Sorry, but could you please tell me more about the problem you're trying to solve? What you're asking for sounds to me to be how Whoosh works now.

    The way Whoosh works, readers always see the version of the index that existed when they were opened. Another thread/process can update the index and it doesn't affect existing readers, unless they call searcher.refresh() to reopen the reader.

    So, the application can "still use the current index", and only re-open or refresh the searcher object if/when it wants to pick up new changes.
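
The snapshot behaviour described in this comment can be illustrated with a toy in-memory model. This is a minimal sketch, not Whoosh code: `ToyIndex`, `ToyReader`, and their methods are hypothetical names chosen for illustration, and the only assumption taken from the thread is that a reader keeps the version it opened until it is explicitly refreshed.

```python
class ToyIndex:
    """Toy index: each commit produces a new immutable snapshot."""

    def __init__(self):
        self.committed = ()  # latest committed snapshot (a tuple of docs)

    def commit(self, new_docs):
        # Build a new tuple instead of mutating the old one, so snapshots
        # already handed out to readers are never changed underneath them.
        self.committed = self.committed + tuple(new_docs)

    def reader(self):
        return ToyReader(self)


class ToyReader:
    """Toy reader: sees the snapshot that existed when it was opened."""

    def __init__(self, index):
        self._index = index
        self.docs = index.committed  # snapshot taken at open time

    def refresh(self):
        # Return a new reader on the latest commit; the old one is unchanged.
        return ToyReader(self._index)


index = ToyIndex()
index.commit(["doc1"])

reader = index.reader()      # opened when only doc1 was committed
index.commit(["doc2"])       # a later commit by "another writer"

stale = reader.docs          # still the snapshot from open time
fresh = reader.refresh().docs
```

Here `stale` holds only the first document, while `fresh` holds both: the application decides when to pick up changes.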

  2. Thomas Waldmann reporter

    If the web app process terminates and restarts, wouldn't it also try to use the new (still unfinished, only partially built) indexes? I think we can't assume an endlessly running process that never reopens the index.

    (In case it was not clear: by "a lot of data", I mean an amount that could take hours or days to index.)

  3. Matt Chaput repo owner

    No, the new segment only becomes visible to new readers once it's finished (the very last thing a writer does is write a new TOC file with the updated segments, and subsequent readers will then use that file instead of the old one). If a writer fails, the segment it was writing is never visible (and any files it created will eventually be purged by a later writer).

    If the app restarts, it will pick up new, finished commits, but won't see anything "in progress".
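
The "write the TOC last" idea can be sketched with plain files. This is an illustration under stated assumptions, not Whoosh's actual file layout: the single `toc.json` file, the segment names, and the `fail_before_toc` flag are all hypothetical. The sketch relies on `os.replace`, which swaps the TOC in one step, so a restarted application reads either the old list of segments or the new one.

```python
import json
import os
import tempfile


def read_toc(index_dir):
    """Return the segment names a newly opened reader would use."""
    toc_path = os.path.join(index_dir, "toc.json")
    if not os.path.exists(toc_path):
        return []
    with open(toc_path) as f:
        return json.load(f)


def write_segment(index_dir, name, docs, fail_before_toc=False):
    """Write a segment file, then publish it by rewriting the TOC last."""
    with open(os.path.join(index_dir, name), "w") as f:
        json.dump(docs, f)
    if fail_before_toc:
        # Simulates a writer that dies after writing data but before commit.
        raise RuntimeError("writer died before publishing")
    segments = read_toc(index_dir) + [name]
    tmp_path = os.path.join(index_dir, "toc.json.tmp")
    with open(tmp_path, "w") as f:
        json.dump(segments, f)
    # Swap the finished TOC into place in a single step.
    os.replace(tmp_path, os.path.join(index_dir, "toc.json"))


index_dir = tempfile.mkdtemp()
write_segment(index_dir, "seg1.json", ["doc1"])

try:
    write_segment(index_dir, "seg2.json", ["doc2"], fail_before_toc=True)
except RuntimeError:
    pass

# What an application restarted at this point would open:
visible_after_crash = read_toc(index_dir)
orphan_exists = os.path.exists(os.path.join(index_dir, "seg2.json"))

write_segment(index_dir, "seg3.json", ["doc3"])
visible_after_commit = read_toc(index_dir)
```

The file left behind by the failed writer still exists on disk, but no TOC ever lists it, so no reader opens it.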

  4. Thomas Waldmann reporter

    OK, maybe this is just an issue of me being too careful (and not knowing enough details). :)

    Did I miss something in the documentation, or does it maybe need to be said more clearly?

    commit()
    
        Finishes writing and unlocks the index.
    

    If this is a feature of "commit", maybe the documentation could say in more detail what "unlocks" means for the newly built index (becoming visible to readers) and for the old index files (getting deleted?).

    Does this nice behaviour hold for all writers, including the multisegment / multiprocess ones?

  5. Matt Chaput repo owner

    The thread/process doing the writing grabs a file lock for the duration of the write to prevent code from opening two writers at the same time. Before the writer releases the lock it writes the new table of contents (TOC) file. New readers will see the new TOC and use the updated list of files in it. The writer will then try to delete old files no longer referenced by the new TOC, which is essentially garbage collection.

    I will try to explain this append-only MVCC design better in the writer documentation.

    It's possible to design a backend (codec) that doesn't use MVCC... for example, a native GAE implementation that uses AppEngine's native locking, and so readers don't have to refresh to see new changes. But a writer should always be atomic, i.e. the changes made in a writer shouldn't become visible until commit() finishes successfully.
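
The two mechanisms mentioned in this comment, an exclusive writer lock and the purge of files the TOC no longer references, can be sketched as follows. This is a toy model under stated assumptions: the `WRITELOCK` file name, the `.seg` suffix, and both helpers are hypothetical, and the lock is approximated with an `O_EXCL` lock file, which is not necessarily how Whoosh locks.

```python
import os
import tempfile


class WriterLock:
    """Exclusive lock approximated by a lock file created with O_EXCL."""

    def __init__(self, index_dir):
        self.path = os.path.join(index_dir, "WRITELOCK")

    def __enter__(self):
        # O_CREAT | O_EXCL raises FileExistsError if the file already exists,
        # i.e. if another writer currently holds the lock.
        fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return self

    def __exit__(self, *exc_info):
        os.remove(self.path)


def purge_unreferenced(index_dir, referenced):
    """Delete segment files that the current TOC no longer lists."""
    removed = []
    for name in sorted(os.listdir(index_dir)):
        if name.endswith(".seg") and name not in referenced:
            os.remove(os.path.join(index_dir, name))
            removed.append(name)
    return removed


index_dir = tempfile.mkdtemp()
for name in ("old1.seg", "old2.seg", "merged.seg"):
    open(os.path.join(index_dir, name), "w").close()

with WriterLock(index_dir):
    try:
        with WriterLock(index_dir):
            second_writer_opened = True
    except FileExistsError:
        second_writer_opened = False
    # Suppose this commit merged old1 + old2 into merged.seg and the new
    # TOC lists only merged.seg; the old files are now garbage.
    removed = purge_unreferenced(index_dir, referenced={"merged.seg"})

lock_released = not os.path.exists(os.path.join(index_dir, "WRITELOCK"))
```

The second writer is rejected while the first holds the lock, and the purge runs only after the new list of referenced files is known.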

  6. Thomas Waldmann reporter

    OK, I think the visibility of new index segments that UPDATE an existing index is now clear enough for me.

    But it is still unclear to me how I implement a "rebuild from scratch" NOT using a separate storage.

    Situation at the start: storage S has some (more or less) working index I, and the app is using that index 24/7. The app gets restarted now and then and shall keep using that index as long as there is nothing better.

    Now, there is somehow the suspicion (or knowledge) that the index might not be complete, or is a little damaged or whatever, so we'd like to get a new index N rebuilt from scratch (in the same storage S), starting from 0 (== NOT using I in any way).

    Situation at the end should be: a completely rebuilt-from-scratch new index N shall be used by the app, and the old index I should be disposed of as soon as it is no longer used.

    How do I do that?

  7. Matt Chaput repo owner

    Sorry, I lost track of this thread when my work on Whoosh went off the rails for so long. Your last question is a good one. Deleting every document and then committing with optimize=True is the current answer but not very efficient... I can add some sort of "clear" functionality to the writer to support this.
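
One way to picture what a "clear" on commit could do, given the TOC mechanism discussed earlier in the thread: build the new index N next to the old index I in the same storage, then publish a TOC that lists only N. This is a speculative stdlib-only sketch, not a Whoosh API; `toc.json`, the `.seg` names, and all three functions are hypothetical, and it deletes the old files immediately, whereas a real implementation would have to consider readers still using them.

```python
import json
import os
import tempfile

TOC = "toc.json"  # hypothetical TOC file name


def publish(index_dir, segments):
    """Replace the TOC in one step with a new list of segment names."""
    tmp = os.path.join(index_dir, TOC + ".tmp")
    with open(tmp, "w") as f:
        json.dump(segments, f)
    os.replace(tmp, os.path.join(index_dir, TOC))


def visible_docs(index_dir):
    """What a freshly (re)started application would see."""
    with open(os.path.join(index_dir, TOC)) as f:
        segments = json.load(f)
    docs = []
    for name in segments:
        with open(os.path.join(index_dir, name)) as f:
            docs.extend(json.load(f))
    return docs


def rebuild_from_scratch(index_dir, source_docs, new_name):
    """Build index N beside index I, switch over, then dispose of I."""
    with open(os.path.join(index_dir, TOC)) as f:
        old_segments = json.load(f)
    with open(os.path.join(index_dir, new_name), "w") as f:
        json.dump(source_docs, f)       # the slow part; I stays visible
    publish(index_dir, [new_name])      # the switch: TOC lists only N
    for name in old_segments:           # simplification: no reader needs I
        os.remove(os.path.join(index_dir, name))


index_dir = tempfile.mkdtemp()
with open(os.path.join(index_dir, "I.seg"), "w") as f:
    json.dump(["doc1", "damaged"], f)
publish(index_dir, ["I.seg"])

before = visible_docs(index_dir)
rebuild_from_scratch(index_dir, ["doc1", "doc2"], "N.seg")
after = visible_docs(index_dir)
remaining = sorted(os.listdir(index_dir))
```

An application restarted during the rebuild would still read I; only after `publish` does it read N, and nothing from I is carried over.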
