Piler handling diacritics
Hello,
I have noticed that my piler install has problems handling diacritics (example German umlauts: ö ü ä). The email search does not show any results if the search string contains diacritics. The search works perfectly fine if there are no diacritics in the search string.
Also the email subject which is shown in the search result list can not show diacritics, it only shows � instead of ä ö ü for example.
In the email detail box below the search result list, everything shows up correctly.
Fresh install
$ piler -v
1.3.5 build 997
Best regards,
Plastikschnitzer
P.S. Piler is a great tool!
Comments (7)
-
reporter -
reporter - attached 400000005c603c6013e8d7040047775605b4.eml
- marked as major
I can reproduce the issue on demo.mailpiler.com. There are emails in the archive which contain diacritics like ÅÄÖÆØåäöøæ; just search for them and you will get no result. The email where these characters are mentioned is attached to this ticket, or you can search the mailpiler demo for “This UTF-8 business i becoming a subject for this years kernel” and see it for yourself.
As the text is correct in the email itself but not in the search, I guess it must be a database-related encoding issue or a search issue. I could not find an email with diacritics in the header yet, but I tried to send a prepared email to the archive, and I hope it will show up so you can see what it looks like in the subject line of the search result list.
Any ideas how to fix this? If you deal with English emails only, it's not visible, but if you use Spanish, German, French, Swedish, Czech, Slovak, Slovenian or others, it's pretty obvious.
-
repo owner Try adding your UTF-8 character range to the index sections of sphinx.conf, e.g.:
charset_table = 0..9, english, U+00CF..U+10316
Then reindex the problematic emails.
Btw., I’ve reindexed this specific email on the demo site, so if you search for “ÅÄÖÆØåäöøæ” it now returns this message.
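To illustrate why the search comes back empty, here is a minimal Python sketch of the behaviour the Sphinx manual describes for charset_table: characters that are not listed are treated as word separators. This is not Sphinx code. The `tokenize` function and the two character sets are hypothetical, and the blanket `lower()` call is a simplification, since Sphinx folds case only through the mappings in the table.

```python
def tokenize(text, allowed):
    """Split text into tokens; any character outside `allowed` acts as a separator."""
    tokens, current = [], []
    for ch in text.lower():  # simplification: real case folding comes from the table
        if ch in allowed:
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


# Hypothetical stand-ins for an ASCII-only table and one extended with German letters.
ASCII_ONLY = set("abcdefghijklmnopqrstuvwxyz0123456789_")
WITH_UMLAUTS = ASCII_ONLY | set("äöüß")

print(tokenize("Grüße von Müller", ASCII_ONLY))    # ['gr', 'e', 'von', 'm', 'ller']
print(tokenize("Grüße von Müller", WITH_UMLAUTS))  # ['grüße', 'von', 'müller']
```

With the ASCII-only set, “Müller” is never stored as a single keyword, so a query for it has nothing to match. Because the split happens at indexing time, already archived emails need to be reindexed after the table changes.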
-
repo owner Also, check out this page: http://sphinxsearch.com/wiki/doku.php?id=charset_tables
-
repo owner - changed status to resolved
Set the following for all index blocks:
charset_table = 0..9, english, _, \
    U+C1->U+E1, U+C4->U+E4, U+C5->U+E5, U+C6->U+E6, U+C9->U+E9, U+CD->U+ED, U+D3->U+F3, U+D6->U+F6, U+D8->U+F8, \
    U+DA->U+FA, U+DC->U+FC, U+0150->U+0151, U+0152->U+0153, U+0170->U+0171, U+01E2->U+E6, U+01E3->U+E6, U+01FC->U+E6, \
    U+01FD->U+E6, U+1D01->U+E6, U+1D02->U+E6, U+1D2D->U+E6, U+1D46->U+E6, \
    U+DF, U+E1, U+E4, U+E5, U+E6, U+E9, U+ED, U+00F3, U+F6, U+F8, U+FA, U+FC, U+0151, U+0153, U+0171
This should cover most Latin-based European languages.
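For readers unfamiliar with the notation: an entry such as U+C4->U+E4 maps one code point onto another (here Ä onto ä), while a bare entry such as U+DF accepts the character as is. The following Python sketch parses a short excerpt of the table and applies the mappings. It is only an illustration. `parse_charset_table`, `codepoint` and `fold` are hypothetical helpers, entries that do not start with `U+` (such as `english` or `0..9`) are skipped, and unlisted characters are passed through unchanged instead of being treated as separators.

```python
def codepoint(token):
    """Convert a 'U+XXXX' token into an integer code point."""
    return int(token.strip()[2:], 16)


def parse_charset_table(spec):
    """Return {source: target} code points for the U+ entries of a charset_table value."""
    table = {}
    for entry in spec.replace("\\", " ").split(","):
        entry = entry.strip()
        if not entry.startswith("U+"):
            continue  # aliases and literal characters are out of scope for this sketch
        src, _, dst = entry.partition("->")
        if ".." in src:
            lo, hi = (codepoint(t) for t in src.split(".."))
            base = codepoint(dst.split("..")[0]) if dst else lo
            for offset in range(hi - lo + 1):
                table[lo + offset] = base + offset
        else:
            table[codepoint(src)] = codepoint(dst) if dst else codepoint(src)
    return table


def fold(text, table):
    """Apply the mappings; characters missing from the table are left untouched here."""
    return "".join(chr(table.get(ord(ch), ord(ch))) for ch in text)


excerpt = "U+C4->U+E4, U+D6->U+F6, U+DC->U+FC, U+DF, U+E4, U+F6, U+FC"
table = parse_charset_table(excerpt)
print(fold("Übergröße", table))  # übergröße
```

Mapping the uppercase forms onto the lowercase ones is what makes a search for “ÜBERGRÖSSE”-style capitalised umlauts and their lowercase forms land on the same keyword.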
-
reporter Thank you very much for your reply, @Janos SUTO. I looked into sphinx.conf and found the commented-out sections for CJK support, which are also mentioned in the FAQ. I was not aware that this is more or less the same question, if I understood correctly.
Looking at the charset table and my emails with languages all around the world and lots of special characters, I guess I need full UTF-8 support, not just European languages to be 100% bulletproof.
Is there any reason, such as performance, indexing speed or other downsides, not to just set the charset blocks to:
U+0000..U+1000FF
If I am correct, this covers the whole UTF-8 range of characters. Please correct me if I misunderstood the concept of sphinx.conf / charset selection here.
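As a quick sanity check on the proposed range, the Python sketch below compares its upper bound with the last Unicode code point, U+10FFFF, and looks at what the lower end of the range contains. Whether Sphinx accepts code points this high, and how it treats spaces or punctuation listed in charset_table, is not confirmed anywhere in this thread. The sketch only inspects the range itself, and `is_word_char` is a hypothetical helper based on Unicode general categories.

```python
import sys
import unicodedata

PROPOSED_END = 0x1000FF       # upper bound of the range proposed above
UNICODE_END = sys.maxunicode  # 0x10FFFF, the last valid Unicode code point

# The proposed range stops short of the end of the Unicode code space.
print(hex(UNICODE_END), UNICODE_END - PROPOSED_END)


def is_word_char(cp):
    """True if the code point is a letter or a number by Unicode general category."""
    return unicodedata.category(chr(cp))[0] in ("L", "N")


# A range starting at U+0000 also lists controls, the space and ASCII punctuation.
non_word = [cp for cp in range(0x80) if not is_word_char(cp)]
print(0x20 in non_word, 0x2C in non_word)  # space and comma are included
```

If listed characters are all treated as word characters, a range that begins at U+0000 would presumably stop spaces and punctuation from separating words, so starting the range above the ASCII block and keeping the `0..9, english, _` part may be the safer variant to test.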
-
repo owner Well, I’d say these issues are related. Anyway, I don’t think there are (read: I’m not aware of) any drawbacks if you allow the complete UTF-8 space. That being said, be sure to read http://sphinxsearch.com/docs/manual-2.3.2.html#conf-charset-table, and perhaps it’s worth discussing this on the sphinx forum: http://sphinxsearch.com/forum/forum.html?id=1.