Search index contains a lot of duplicates

Lukasz Dziedzia created an issue

I have a problem with my Whoosh + Django Haystack installation.

I run the Haystack update_index command at scheduled intervals, and after each update my search index contains new duplicates of previously indexed entries. After X index updates, my search page shows X results for a given entry.

Any idea about this problem?

I use whoosh 1.5.5

Comments (22)

  1. Matt Chaput

    It sounds like Whoosh might be getting the wrong document numbers somewhere. Is it possible to attach a (small) zipped index which demonstrates the problem? Thanks!

  2. Lukasz Dziedzia

    What exactly do you need? My whoosh directory (after zipping) is about 30MB. Inside I have a number of files: _MAIN_X

    _MAIN_X.trm

    _MAIN_X.sto

    _MAIN_X.fln

    where X is a number, current values of X: 1, 286, 371, 428, 541, 591, 646, 565, 661, 791, 821, 830.

    I don't know much about the Whoosh storage system, but this number of files looks suspicious to me.

  3. Matt Chaput

    Is it possible to generate a new, small index that demonstrates the problem? And/or can you give me some details of the usage that leads to the problem? Thanks!

  4. Lukasz Dziedzia

    I generated a small index as you suggested. I also ran some tests: 1. I updated the index several times for one entry in the database; that was OK. 2. I modified an entry and updated the index; this was fine as well. 3. When I ran a script that loaded a larger number of entries (about 150), I noticed that after the second update_index call I could spot some duplicated entries. The index after this update is attached to this comment.

    Thanks in advance for any hints.

  5. Matt Chaput

    Removed SegmentReader.first_id() because it was stupid (creating own FilePostingReader bypassed checks). ref #97. Removed custom caching SegmentWriter.update_document(). It had too much duplicated code and should be replaced by a more robust implementation of unique caches. Fixed bugs in IndexReader.first_id(), Searcher.document_number().

    97fa36784fce

  6. Matt Chaput
    • changed status to open

    Hi, I fixed some bugs related to updating and pushed the updated code to PyPI as Whoosh 1.5.6. I hope it will fix the issues you're seeing. Sorry about the problems!

  7. mikolune

    Hi,

    Unfortunately, I am experiencing this duplicate issue - this is both with Whoosh 2.4.1 and 2.3.2 (I haven't tried with others) ...

    Any ideas ?

  8. Thomas Waldmann

    mikolune, can you give some minimal example code that reproduces the problem?

    Duplicate index entries can easily happen if you just add documents, or if update_document() can't work correctly because none of the unique field values match.
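
As an illustration of the point above, here is a minimal sketch of "update means delete matching documents, then add". It is a toy model in plain Python, not Whoosh's implementation: the `ToyIndex` class and its `unique_fields` argument are hypothetical, and only the method names `add_document` and `update_document` mirror the Whoosh writer. It assumes that an update deletes any stored document sharing a value in a unique field with the incoming one.

```python
class ToyIndex:
    """Toy stand-in for an index writer (hypothetical, not Whoosh code)."""

    def __init__(self, unique_fields):
        # Names of the fields the schema marks as unique (assumption).
        self.unique_fields = set(unique_fields)
        self.docs = []

    def add_document(self, **fields):
        # A plain add always appends; it never looks for an existing entry.
        self.docs.append(dict(fields))

    def update_document(self, **fields):
        # Drop every stored document that shares a value in some unique
        # field with the incoming document, then add the new version.
        keys = [k for k in fields if k in self.unique_fields]
        self.docs = [
            d for d in self.docs
            if not any(k in d and d[k] == fields[k] for k in keys)
        ]
        self.add_document(**fields)


# Schema with a unique "id" field: repeated updates replace the entry.
good = ToyIndex(unique_fields=["id"])
for _ in range(3):
    good.update_document(id="app.item.1", text="hello")

# Schema with no unique field: each update silently becomes a plain add.
bad = ToyIndex(unique_fields=[])
for _ in range(3):
    bad.update_document(id="app.item.1", text="hello")

print(len(good.docs), len(bad.docs))
```

In this model the second index ends up with one copy per update run, which matches the "X updates, X results" symptom in the original report. A full rebuild starts from an empty index, so it would not show the problem.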

  9. mikolune

    Thanks Thomas.

    Before I go into code, I would like to say I am using Whoosh through Haystack, for indexing of Django. So it may be a problem with Haystack, but when I searched about this problem, all links came to here.

    Building the index from scratch works and there are no duplicates:

    ./manage.py rebuild_index
    

    What does generate duplicates is when I just want to update the index:

    ./manage.py update_index
    

    I will dig into the Haystack code to see if there is something obvious, but if you have additional comments I will be glad.

  10. Paul Nichols

    I'm having the exact same problem as mikolune above. Using Django, I am updating some fields on some records. If I update_index, I will have both the old and new records. Rebuild_index works fine. I'm on Whoosh 2.5.4, django 1.5.1, Python 2.7.2

    records_to_change = Item.objects.filter(**{role:old_owner}).update(**{role:new_owner})

  11. Matt Chaput

    Unfortunately I can't really diagnose problems with Haystack -- You'll have to either file a bug there, or try to reproduce the problem using Whoosh's API.

  12. Matt Chaput

    ETA: if anyone can come up with a reproducible test case, then please reopen and I'll try to find the problem whether it's in Whoosh or Haystack.

  13. Matt Chaput

    Paul, I don't know that it's in Haystack, but I have no way of even starting to deal with a bug that A. I can't reproduce, B. has a very vague description, and C. is only reported by users who aren't using my API.

  14. Edward Lee

    I had added an 'id' field in my SearchIndex, which was hiding HAYSTACK_ID_FIELD (defaults to 'id'), and that was causing the duplication whenever I updated the index through Haystack. Once I removed the field and rebuilt the index, I no longer get any duplicates.
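
A quick way to catch this kind of collision is to compare the field names declared on a SearchIndex against the names Haystack reserves for itself. The sketch below is a hypothetical helper in plain Python, not part of Haystack. It assumes the reserved names are `id` (the HAYSTACK_ID_FIELD default mentioned above), plus `django_ct` and `django_id`; those last two are my assumption about the other defaults, so check them against your own settings.

```python
# Field names assumed to be reserved by Haystack (see the caveat above).
RESERVED_FIELD_NAMES = {"id", "django_ct", "django_id"}


def shadowed_reserved_fields(declared_field_names, reserved=RESERVED_FIELD_NAMES):
    """Return the declared field names that collide with reserved ones."""
    return sorted(set(declared_field_names) & set(reserved))


# Hypothetical field list for a SearchIndex that declares its own "id".
print(shadowed_reserved_fields(["text", "title", "id"]))
```

An empty result means none of the declared names shadow a reserved one. If a name is reported, renaming or removing that field and then running rebuild_index follows the fix described in the comment above.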
