Search index contains a lot of duplicates

Issue #97 invalid
Lukasz Dziedzia created an issue

I have a problem with my Whoosh + Django Haystack installation.

I run the Haystack command update_index at scheduled intervals, and after each update my search index contains new duplicates of previously indexed entries. After X index updates I get X results on my search page for a given entry.

Any idea about this problem?

I use whoosh 1.5.5

Comments (23)

  1. Matt Chaput repo owner

    It sounds like Whoosh might be getting the wrong document numbers somewhere. Is it possible to attach a (small) zipped index which demonstrates the problem? Thanks!

  2. Lukasz Dziedzia reporter

    What exactly do you need? My Whoosh directory is about 30MB after zipping. Inside I have a number of files named _MAIN_X, where X is a number. Current values of X: 1, 286, 371, 428, 541, 591, 646, 565, 661, 791, 821, 830.

    I don't know much about Whoosh's storage system, but this number of files looks suspicious to me.

  3. Matt Chaput repo owner

    Is it possible to generate a new, small index that demonstrates the problem? And/or can you give me some details of the usage that leads to the problem? Thanks!

  4. Lukasz Dziedzia reporter

    I generated a small index as you suggested. I also ran some tests:

    1. I updated the index several times for 1 entry in the database. That was OK.
    2. I modified an entry and updated the index. This was fine as well.
    3. I ran a script that loaded a larger number of entries (about 150), and noticed that after the second update_index call I could spot some duplicated entries. The index after this update is attached to this comment.

    Thanks in advance for any hints.

  5. Matt Chaput repo owner

    Removed SegmentReader.first_id() because it was stupid (creating own FilePostingReader bypassed checks). ref #97. Removed custom caching SegmentWriter.update_document(). It had too much duplicated code and should be replaced by a more robust implementation of unique caches. Fixed bugs in IndexReader.first_id(), Searcher.document_number().


  6. Matt Chaput repo owner
    • changed status to open

    Hi, I fixed some bugs related to updating and pushed the updated code to PyPI as Whoosh 1.5.6. I hope it will fix the issues you're seeing. Sorry about the problems!

  7. mikolune

    Unfortunately, I am experiencing this duplicate issue with both Whoosh 2.4.1 and 2.3.2 (I haven't tried other versions).

    Any ideas?

  8. Thomas Waldmann

    mikolune, can you give some minimal example code that reproduces the problem?

    Duplicate index entries can easily happen if you only ever add documents, or if update_document() can't work correctly because none of the unique field values match.
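
To make the second case concrete, here is a minimal sketch in plain Python. It is a toy model, not Whoosh's implementation: `ToyIndex` and its methods are hypothetical names that only mimic the idea that an update first removes documents matching a unique field value and then adds the new document. When no field is marked unique, nothing is removed and every update behaves like a plain add.

```python
class ToyIndex:
    """Hypothetical in-memory stand-in for a search index (not Whoosh's API)."""

    def __init__(self, unique_fields):
        # Names of the fields whose values identify a document.
        self.unique_fields = set(unique_fields)
        self.docs = []

    def add_document(self, **fields):
        # A plain add never checks for existing copies.
        self.docs.append(dict(fields))

    def update_document(self, **fields):
        # Collect the unique field values supplied with this document.
        keys = {n: fields[n] for n in self.unique_fields if n in fields}
        if keys:
            # Drop every stored document that matches one of those values.
            self.docs = [
                d for d in self.docs
                if not any(d.get(n) == v for n, v in keys.items())
            ]
        # With no unique keys, this degenerates into a plain add.
        self.add_document(**fields)


# A unique "path" field lets the second update replace the first document.
ix = ToyIndex(unique_fields=["path"])
ix.update_document(path="/a", body="v1")
ix.update_document(path="/a", body="v2")

# Without any unique field, the same two calls leave two copies behind.
dup = ToyIndex(unique_fields=[])
dup.update_document(path="/a", body="v1")
dup.update_document(path="/a", body="v2")

print(len(ix.docs), len(dup.docs))
```

Under this model, checking which schema fields are actually marked unique is a reasonable first step when updates produce duplicates.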

  9. mikolune

    Thanks Thomas.

    Before I go into code, I should say that I am using Whoosh through Haystack, to index Django models. So it may be a problem with Haystack, but when I searched for this problem, all the links led here.

    Building the index from scratch works and there are no duplicates:

    ./ rebuild_index

    Duplicates appear when I only update the index:

    ./ update_index

    I will dig into the Haystack code to see if there is something obvious, but if you have additional comments I would be glad to hear them.

  10. Paul Nichols

    I'm having the exact same problem as mikolune above. Using Django, I am updating some fields on some records. If I run update_index, I end up with both the old and new records. rebuild_index works fine. I'm on Whoosh 2.5.4, Django 1.5.1, Python 2.7.2.

    records_to_change = Item.objects.filter(**{role:old_owner}).update(**{role:new_owner})

  11. Matt Chaput repo owner

    Unfortunately I can't really diagnose problems with Haystack -- You'll have to either file a bug there, or try to reproduce the problem using Whoosh's API.

  12. Matt Chaput repo owner

    I'm going to shut this bug down because as I said above, I can't diagnose problems with Haystack.

  13. Paul Nichols

    Can I ask how you know it's Haystack? I'm not disagreeing, but being new to both I'm not sure how to go about this.

  14. Matt Chaput repo owner

    ETA: if anyone can come up with a reproducible test case, then please reopen and I'll try to find the problem whether it's in Whoosh or Haystack.

  15. Matt Chaput repo owner

    Paul, I don't know that it's in Haystack, but I have no way of even starting to deal with a bug that A. I can't reproduce, B. has a very vague description, and C. is only reported by users who aren't using my API.

  16. Edward Lee

    I had added an 'id' field in my SearchIndex, which was hiding HAYSTACK_ID_FIELD (defaults to 'id'), and that was causing the duplication whenever I updated the index through Haystack. Once I removed the field and rebuilt the index, I no longer get any duplicates.
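
One plausible way such a name clash leads to duplicates can be sketched in plain Python. This is an assumption about the mechanism, not Haystack's actual code: `build_schema` and `unique_fields` are hypothetical helpers that model a backend which registers its reserved identifier field as unique and then merges user-declared fields into the same mapping, so a user field with the same name silently replaces the unique one.

```python
def build_schema(user_fields, id_field="id"):
    """Hypothetical schema builder (assumed behaviour, not Haystack's API)."""
    # The reserved identifier field is marked unique so updates can find old copies.
    schema = {id_field: {"unique": True}}
    # User-declared fields are merged afterwards; a clashing name overwrites the entry.
    for name in user_fields:
        schema[name] = {"unique": False}
    return schema


def unique_fields(schema):
    # Names of all fields still marked unique after merging.
    return sorted(name for name, opts in schema.items() if opts["unique"])


clean = build_schema(["text", "title"])
shadowed = build_schema(["text", "title", "id"])

print(unique_fields(clean))     # ['id']
print(unique_fields(shadowed))  # []
```

If the real backend behaves like this model, an index built from the shadowed schema has no unique field left, so every update adds a new copy instead of replacing the old one. That would match the behaviour described above, including the need to rebuild the index after removing the field.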
