Character encoding bug affecting search terms

Issue #20 new
Björn Persson
created an issue

Search terms containing non-English letters get mangled. Try this query to see an example:

The query IRI is encoded in UTF-8, and the HTTP header, the XML declaration and the meta element all specify UTF-8, but the contents of the title element and the h1 element aren't valid UTF-8. Somewhere in the processing the input gets transformed by a routine that treats it as something other than UTF-8.
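A minimal sketch of the suspected failure mode (the Latin-1 assumption and all names here are illustrative; the report does not identify the actual code path): if the UTF-8 bytes of the query are treated as Latin-1 and re-encoded to UTF-8 on output, every non-ASCII character is doubled into mojibake.

```ocaml
(* Encode a single Latin-1 code point (0-255) as UTF-8 bytes. *)
let utf8_of_latin1_char c =
  let code = Char.code c in
  if code < 0x80 then String.make 1 c
  else
    let b1 = Char.chr (0xC0 lor (code lsr 6)) in
    let b2 = Char.chr (0x80 lor (code land 0x3F)) in
    Printf.sprintf "%c%c" b1 b2

(* Re-encode every byte of [s] as if it were a Latin-1 character:
   this is the kind of mistaken transformation the issue describes. *)
let latin1_to_utf8 s =
  String.concat ""
    (List.map utf8_of_latin1_char
       (List.init (String.length s) (String.get s)))

let () =
  let query = "Bj\xc3\xb6rn" in          (* "Björn" in UTF-8 *)
  let mangled = latin1_to_utf8 query in  (* double-encoded *)
  Printf.printf "%s -> %s\n" query mangled
```

Running this prints `Björn -> BjÃ¶rn`: each of the two bytes of "ö" has been re-encoded as if it were its own character.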

Comments (9)

  1. Kristian Fiskerstrand

    To follow up on this: my (admittedly somewhat limited) knowledge of OCaml is that it has no built-in functionality for converting between character sets; the string type simply stores the byte array it is given. Indeed, this gets messed up in the output because we declare UTF-8 there.

    If we wanted to get this right, we would probably have to use a library such as Camomile. As this is not a bug in core functionality, I'm pushing it to the bottom of my stack for now.
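    The point about OCaml strings being plain byte arrays can be seen directly with the standard library (a standalone sketch, not SKS code):

    ```ocaml
    let () =
      (* "ö" is one character but two bytes; String.length counts bytes *)
      assert (String.length "\xc3\xb6" = 2);
      (* indexing yields raw bytes, not characters *)
      assert ("\xc3\xb6".[0] = '\xc3')
    ```

    No charset is attached to the value anywhere, so whatever bytes come in are what the output stage labels as UTF-8.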

  2. Johan van Selst

    The patch fixes both the title and the h1 element.

    As I understand the code, most UTF-8 characters are currently not stored in the searchable words in the key database's metadata, and the same characters are likewise not used when searching. Correcting this would break the existing database format and would mean that every keyserver has to completely regenerate its database. I am testing a patch for this as well, but rolling it out may be non-trivial: changing the lookup function without recreating the database would mean that some keys can no longer be found.

    This is unrelated to the patch I proposed earlier. That patch may be included regardless.

  3. Kim Minh Kaplan

    Right, it also fixes the title. But that is not the important point.

    Your patch does not fix the real bug; it's just an illusionist's trick (see the example I gave earlier): it does not show the real words that were searched. Once SKS has been upgraded for UTF-8, your patch will no longer be needed.

    Regarding the database upgrade, I once wrote a small program to do such an upgrade. You may have a look at it to get an idea. You will probably want to truncate the word database and loop over the key database.
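    The rebuild described here could be sketched as follows. The in-memory representation, the `rebuild_word_index` name, and the space-splitting word extraction are all made up for illustration; the real SKS database structures differ.

    ```ocaml
    (* Hypothetical sketch: empty the word index, then re-derive it by
       looping over every key and re-extracting its searchable words.
       [keydb] is a stand-in list of (key id, user id) pairs. *)
    let rebuild_word_index (keydb : (string * string) list) =
      let worddb : (string, string) Hashtbl.t = Hashtbl.create 1024 in
      (* "truncate the word database" *)
      Hashtbl.reset worddb;
      (* "loop over the key database", re-indexing each word *)
      List.iter
        (fun (keyid, uid) ->
          List.iter
            (fun w -> Hashtbl.add worddb w keyid)
            (String.split_on_char ' ' uid))
        keydb;
      worddb
    ```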

  4. Johan van Selst

    Right, a trivial patch to preserve UTF-8 characters is up on

    DB loaded with updated keys, see e.g.

    Note that even though it now returns proper results, it still does not make the backend UTF-8 aware. This means that functions like lowercase do not work for UTF-8 characters, and spaces in non-ASCII encodings are not considered word separators. Rewriting the backend to actually use multibyte characters and UTF-8-compatible parsing would be better.

    But IMHO the previous patch, which prints the actual search query, and this patch, which returns better results, are still a good improvement over what we had.
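    The two limitations mentioned above (lowercasing and word separators) can be seen directly with the standard library; the byte sequences below are ordinary UTF-8:

    ```ocaml
    let () =
      (* ASCII-only lowercasing leaves "Ö" (0xC3 0x96) unchanged,
         so "Ö" and "ö" would index as different words *)
      assert (String.lowercase_ascii "\xc3\x96" = "\xc3\x96");
      (* a no-break space (U+00A0, bytes 0xC2 0xA0) is not an ASCII
         space, so byte-level splitting never treats it as a separator *)
      assert (String.split_on_char ' ' "a\xc2\xa0b" = ["a\xc2\xa0b"])
    ```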
