everything in ODB must be valid UTF-8 unicode

Issue #215 resolved
dd1 created an issue

To avoid web browser errors, all JSON encoded strings returned by MIDAS must be valid UTF-8 unicode. By extension, this means that all ODB strings (TID_STRING values, key names, etc) must be valid UTF-8 unicode.

I believe ODB key names are already checked for this at creation time (db_create() & co), but I am not sure db_validate() checks for this.

I think ODB TID_STRING values are not checked right now. db_validate() should have this check. db_set_value() & co probably should have this check.

If invalid UTF-8 sequences are found, we should at least complain about it. But I am not sure if we can fix them automatically.
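For illustration, the kind of check discussed above could look like the sketch below. It is not the MIDAS implementation; the function name `utf8_is_valid_sketch` is hypothetical. It accepts only well-formed UTF-8 and rejects stray continuation bytes, truncated sequences, overlong encodings, surrogates and code points above U+10FFFF.

```c
#include <stddef.h>

/* Hypothetical helper, not taken from MIDAS.
 * Returns 1 if the NUL-terminated string s is well-formed UTF-8, 0 otherwise. */
static int utf8_is_valid_sketch(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    while (*p) {
        unsigned int cp;   /* decoded code point */
        unsigned int min;  /* smallest code point allowed for this length */
        int n;             /* number of continuation bytes expected */

        if (*p < 0x80) {   /* plain ASCII */
            p++;
            continue;
        } else if ((*p & 0xE0) == 0xC0) {
            n = 1; cp = *p & 0x1F; min = 0x80;
        } else if ((*p & 0xF0) == 0xE0) {
            n = 2; cp = *p & 0x0F; min = 0x800;
        } else if ((*p & 0xF8) == 0xF0) {
            n = 3; cp = *p & 0x07; min = 0x10000;
        } else {
            return 0;      /* stray continuation byte or invalid lead byte */
        }
        p++;

        for (int i = 0; i < n; i++, p++) {
            /* a NUL here also fails this test, so truncated
             * sequences are rejected without reading past the end */
            if ((*p & 0xC0) != 0x80)
                return 0;
            cp = (cp << 6) | (*p & 0x3F);
        }

        if (cp < min)                     return 0; /* overlong encoding */
        if (cp > 0x10FFFF)                return 0; /* outside Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* UTF-16 surrogate */
    }
    return 1;
}
```

A Latin-1 string such as "caf\xE9" is rejected by this sketch, which is the typical way non-UTF-8 data ends up in a string value. Whether such a value can be repaired automatically depends on knowing the original encoding, which the check alone cannot tell.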

K.O.

Comments (9)

  1. dd1 reporter

    ODB now checks and will complain if TID_STRING and TID_LINK values are not valid UTF-8 unicode. K.O.

  2. dd1 reporter

    db_validate_and_repair_db() already checks ODB key names: if a name is not valid UTF-8 unicode (or fails some other checks), it is replaced by a unique hex number. This code has been there for some time now. So check and automatic repair are confirmed. K.O.
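As a rough illustration of "replaced by a unique hex number", the sketch below picks the first hex-formatted counter that does not collide with any sibling key name. Everything here is an assumption made for the example: the function names, the 8-digit format and the sibling list are hypothetical, and the actual repair code in db_validate_and_repair_db() may choose names differently.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if name equals one of the n sibling names. */
static int name_in_use_sketch(const char *name,
                              const char *const *siblings, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(name, siblings[i]) == 0)
            return 1;
    return 0;
}

/* Hypothetical helper: writes a replacement key name made of hex digits
 * into out, chosen so that it does not collide with any sibling name.
 * Returns 1 on success, 0 if out is too small to hold the name. */
static int make_unique_hex_name_sketch(char *out, size_t out_size,
                                       const char *const *siblings, size_t n)
{
    /* at most n + 1 candidates are needed, since only n names are taken */
    for (unsigned long seq = 0; ; seq++) {
        int len = snprintf(out, out_size, "%08lX", seq);
        if (len < 0 || (size_t)len >= out_size)
            return 0;  /* encoding error or name would be truncated */
        if (!name_in_use_sketch(out, siblings, n))
            return 1;
    }
}
```

The replacement name is pure ASCII, so it is valid UTF-8 by construction and passes the same check that rejected the original name.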

  3. dd1 reporter

    db_create_key() and db_rename_key() check for valid UTF-8. I do not see any other function that can create ODB keys.

  4. dd1 reporter

    so all that remains

    • db_set_data(TID_STRING|TID_LINK) should complain or refuse if given non-utf8 string
    • midas json encoder and decoder do not care: json odb save files with invalid utf8 are ok
    • mjsonrpc should always return valid utf8 data from odb. add a “must be utf8” flag to json encoder?

    K.O.

  5. dd1 reporter

    This is what I will do:

    • db_set_data(TID_STRING|TID_LINK) will be changed to complain about non-utf8 data, and leave it up to the user to fix it. (TODO)
    • midas json encoder and decoder do not care about utf8, json odb save files with invalid utf8 are ok. (already works this way)
    • mjsonrpc should always return valid utf8 data from odb. No change is needed for this: odb validation already complains about non-utf8 data and it is up to the user to fix it. If everything in odb is utf8 (as confirmed by the odb validation), there is no need to do anything in the mjsonrpc code. (already works this way)

    So only one TODO item remains.

    K.O.

  6. dd1 reporter

    Added utf8 checks:

    • db_set_value()
    • db_set_data(), db_set_link_data() (incomplete array check)
    • db_set_data_index() and db_set_value_index()

    K.O.

  7. dd1 reporter

    After refactoring of the db_set_xxx() code, utf8 checks are done everywhere, except for db_set_data() of a string array. K.O.
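The remaining string array case could be handled along the lines of the sketch below. It assumes the array is stored as num_values fixed-size slots of item_size bytes each, with every element NUL-terminated inside its slot; that layout and all function names here are assumptions made for the example, not code from MIDAS. Because a slot may lack a terminator, the validator is length-bounded rather than relying on a NUL.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the len bytes at p are well-formed UTF-8. */
static int utf8_valid_n_sketch(const unsigned char *p, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = p[i];
        unsigned int cp, min;
        size_t n;  /* continuation bytes expected after the lead byte */

        if (c < 0x80) { i++; continue; }
        if ((c & 0xE0) == 0xC0)      { n = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;               /* invalid lead byte */

        if (len - i <= n) return 0;  /* sequence runs past the end */
        for (size_t k = 1; k <= n; k++) {
            if ((p[i + k] & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (p[i + k] & 0x3F);
        }
        /* reject overlong forms, out-of-range values and surrogates */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        i += n + 1;
    }
    return 1;
}

/* Hypothetical helper: checks every element of a string array.
 * Returns the index of the first bad element, or -1 if all are ok.
 * An element is bad if it has no NUL inside its slot or is not UTF-8. */
static int first_bad_string_element_sketch(const char *data,
                                           size_t item_size,
                                           size_t num_values)
{
    for (size_t j = 0; j < num_values; j++) {
        const char *elem = data + j * item_size;
        const char *nul = (const char *)memchr(elem, 0, item_size);
        if (nul == NULL)
            return (int)j;
        if (!utf8_valid_n_sketch((const unsigned char *)elem,
                                 (size_t)(nul - elem)))
            return (int)j;
    }
    return -1;
}
```

Returning the index of the offending element lets the caller name it in the complaint, which fits the approach above of reporting the problem and leaving the fix to the user.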
