Piler handling diacritics

Issue #993 resolved
Plastikschnitzer created an issue

Hello,

I have noticed that my piler install has problems handling diacritics (for example, the German umlauts ö, ü, ä). The email search returns no results if the search string contains diacritics, but works perfectly fine if there are none.

Also, the email subject shown in the search result list cannot display diacritics; it shows � instead of ä, ö, ü, for example.

In the email detail box below the search result list, everything shows up correctly.

Fresh install

$ piler -v
1.3.5 build 997

Best regards,

Plastikschnitzer

P.S. Piler is a great tool!

Comments (7)

  1. Plastikschnitzer reporter

    I can reproduce the issue on demo.mailpiler.com: there are emails in the archive which contain diacritics like ÅÄÖÆØåäöøæ. Just search for them and you will get no results. The email where these characters are mentioned is attached to this ticket, or you can search the mailpiler demo for “This UTF-8 business i becoming a subject for this years kernel” and see for yourself.
    As the text is correct in the email itself but not in the search, I guess it must be a database-related encoding issue or a search issue.

    I could not find an email with diacritics in the header yet, but I tried to send a prepared email to the archive, and I hope it will show up to demonstrate how it looks in the subject line of the search result list.

    Any ideas how to fix this? If you deal with English emails only, it's not visible, but if you use Spanish, German, French, Swedish, Czech, Slovak, Slovenian or others, it's pretty obvious.

  2. Janos SUTO repo owner

    Try adding your UTF-8 character set range to the index sections of sphinx.conf, e.g.

    charset_table = 0..9, english, U+00CF..U+10316
    

    Then reindex the problematic emails.

    Btw, on the demo site I’ve reindexed this specific email, so if you search for “ÅÄÖÆØåäöøæ”, it returns this message.

  3. Janos SUTO repo owner

    Set the following for all index blocks:

    charset_table = 0..9, english, _, \
                   U+C1->U+E1, U+C4->U+E4, U+C5->U+E5, U+C6->U+E6, U+C9->U+E9, U+CD->U+ED, U+D3->U+F3, U+D6->U+F6, U+D8->U+F8, \
                   U+DA->U+FA, U+DC->U+FC, U+0150->U+0151, U+0152->U+0153, U+0170->U+0171, U+01E2->U+E6, U+01E3->U+E6, U+01FC->U+E6, \
                   U+01FD->U+E6, U+1D01->U+E6, U+1D02->U+E6, U+1D2D->U+E6, U+1D46->U+E6, \
                   U+DF, U+E1, U+E4, U+E5, U+E6, U+E9, U+ED, U+00F3, U+F6, U+F8, U+FA, U+FC, U+0151, U+0153, U+0171
    

    This should cover most Latin-based European languages.
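
To see what a table like the one above does to a search term, here is a minimal Python sketch that parses a simplified subset of the charset_table syntax (single characters, `a..b` ranges, and `->` mappings) and folds a string the way an indexer would. It is an illustration only, not Sphinx's own parser: the names `parse_charset_table` and `fold` are hypothetical, and the expansion of the `english` alias to `A..Z->a..z, a..z` is an assumption made for this example.

```python
# Assumption: `english` is treated as this expansion for illustration only.
ALIASES = {"english": "A..Z->a..z, a..z"}

# The table from the comment above, kept with its line continuations.
THREAD_TABLE = r"""0..9, english, _, \
U+C1->U+E1, U+C4->U+E4, U+C5->U+E5, U+C6->U+E6, U+C9->U+E9, U+CD->U+ED, U+D3->U+F3, U+D6->U+F6, U+D8->U+F8, \
U+DA->U+FA, U+DC->U+FC, U+0150->U+0151, U+0152->U+0153, U+0170->U+0171, U+01E2->U+E6, U+01E3->U+E6, U+01FC->U+E6, \
U+01FD->U+E6, U+1D01->U+E6, U+1D02->U+E6, U+1D2D->U+E6, U+1D46->U+E6, \
U+DF, U+E1, U+E4, U+E5, U+E6, U+E9, U+ED, U+00F3, U+F6, U+F8, U+FA, U+FC, U+0151, U+0153, U+0171"""


def _cp(token):
    """Turn 'U+E4' or a single literal character into a code point."""
    token = token.strip()
    if token.upper().startswith("U+"):
        return int(token[2:], 16)
    if len(token) == 1:
        return ord(token)
    raise ValueError(f"unsupported token: {token!r}")


def _span(text):
    """Return (low, high) for 'a..b', or (c, c) for a single character."""
    if ".." in text:
        lo, hi = text.split("..", 1)
        return _cp(lo), _cp(hi)
    c = _cp(text)
    return c, c


def _entries(table):
    """Yield comma-separated entries, joining continuation lines and expanding aliases."""
    for raw in table.replace("\\\n", " ").split(","):
        item = raw.strip()
        if not item:
            continue
        if item in ALIASES:
            yield from _entries(ALIASES[item])
        else:
            yield item


def parse_charset_table(table):
    """Build {source code point: folded code point} from the table text."""
    mapping = {}
    for item in _entries(table):
        src, sep, dst = item.partition("->")
        s_lo, s_hi = _span(src)
        d_lo, d_hi = _span(dst) if sep else (s_lo, s_hi)
        if s_hi - s_lo != d_hi - d_lo:
            raise ValueError(f"range sizes differ in {item!r}")
        for offset in range(s_hi - s_lo + 1):
            mapping[s_lo + offset] = d_lo + offset
    return mapping


def fold(text, mapping):
    """Map listed characters to their folded form; anything else splits the word."""
    folded = "".join(
        chr(mapping[ord(ch)]) if ord(ch) in mapping else " " for ch in text
    )
    return folded.split()


if __name__ == "__main__":
    print(fold("Größe ÖL", parse_charset_table("0..9, english, _")))
    print(fold("Größe ÖL", parse_charset_table(THREAD_TABLE)))
```

With only ASCII characters in the table, a word containing ö or ß is broken into fragments at those characters, which is consistent with searches for such words returning nothing. With the extended table, the word stays whole and uppercase letters fold to lowercase.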

  4. Plastikschnitzer reporter

    Thank you very much for your reply, @Janos SUTO. I looked into sphinx.conf and found the commented-out sections for CJK support, which are also mentioned in the FAQ. I was not aware that this is more or less the same question, if I understood correctly.

    Looking at the charset table and my emails with languages all around the world and lots of special characters, I guess I need full UTF-8 support, not just European languages to be 100% bulletproof.

    Is there a reason, such as performance, indexing speed or other downsides, against just setting the charset blocks to:

    U+0000..U+1000FF
    

    If I am correct, this covers the whole UTF-8 range of characters. Please correct me if I misunderstood the concept of sphinx.conf / charset selection here.
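
As a quick check on the range itself, independent of how Sphinx treats it: Unicode code points run from U+0000 to U+10FFFF, so an upper bound of U+1000FF stops short of the end. The short Python sketch below only does that arithmetic; the helper name `uncovered_above` is hypothetical, and nothing here says whether sphinx.conf accepts such a wide range or how it performs.

```python
import sys

# Highest code point defined by Unicode (and reachable through UTF-8).
UNICODE_MAX = 0x10FFFF


def uncovered_above(upper):
    """Count the code points between `upper` and the end of Unicode."""
    return max(0, UNICODE_MAX - upper)


if __name__ == "__main__":
    proposed_upper = 0x1000FF  # upper bound of the range quoted above
    print(hex(sys.maxunicode))
    print(uncovered_above(proposed_upper))
```

Note also that a bare range such as this one only marks characters as indexable; it carries no `->` mappings, so it would not fold uppercase letters to lowercase the way the mapped entries earlier in the thread do.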
