everything in ODB must be valid UTF-8 unicode
To avoid web browser errors, all JSON encoded strings returned by MIDAS must be valid UTF-8 unicode. By extension, this means that all ODB strings (TID_STRING values, key names, etc) must be valid UTF-8 unicode.
I believe ODB key names are already checked for this at creation time (db_create() & co), but I am not sure db_validate() checks for this.
I think ODB TID_STRING values are not checked right now. db_validate() should have this check. db_set_value() & co probably should have this check.
If invalid UTF-8 sequences are found, we should at least complain about it. But I am not sure if we can fix them automatically.
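For reference, here is a minimal sketch of what a "valid UTF-8" check could look like. It is not the MIDAS implementation; the name `is_valid_utf8()` is hypothetical. It follows the well-formedness rules of RFC 3629: correct lead and continuation bytes, no overlong encodings, no surrogates, nothing above U+10FFFF.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (an assumption, not the MIDAS code): returns true if
// the byte range [s, s+len) is well-formed UTF-8.
static bool is_valid_utf8(const unsigned char* s, size_t len)
{
   size_t i = 0;
   while (i < len) {
      unsigned char c = s[i];
      size_t n;         // number of continuation bytes expected
      unsigned int cp;  // code point being decoded
      if (c < 0x80)                { n = 0; cp = c; }
      else if ((c & 0xE0) == 0xC0) { n = 1; cp = c & 0x1F; }
      else if ((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0F; }
      else if ((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07; }
      else return false; // stray continuation byte or invalid lead byte

      if (n > len - i - 1)
         return false;   // sequence is truncated

      for (size_t k = 1; k <= n; k++) {
         unsigned char cc = s[i + k];
         if ((cc & 0xC0) != 0x80)
            return false; // not a continuation byte
         cp = (cp << 6) | (cc & 0x3F);
      }

      // reject overlong encodings, UTF-16 surrogates and out-of-range values
      if (n == 1 && cp < 0x80) return false;
      if (n == 2 && cp < 0x800) return false;
      if (n == 3 && cp < 0x10000) return false;
      if (cp >= 0xD800 && cp <= 0xDFFF) return false;
      if (cp > 0x10FFFF) return false;

      i += n + 1;
   }
   return true;
}

// Convenience overload for std::string
static bool is_valid_utf8(const std::string& str)
{
   return is_valid_utf8(reinterpret_cast<const unsigned char*>(str.data()), str.size());
}
```

A check like this can only detect bad data. It cannot tell which encoding the bytes were meant to be in, which is why automatic repair of string values is doubtful.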
K.O.
Comments (9)
reporter db_validate_and_repair_db() has a check for ODB key names: if a name is not valid UTF-8 unicode (or fails some other checks), it is replaced by a unique hex number. This code has been there for some time now. So check and automatic repair of key names are confirmed. K.O.
-
reporter db_create_key() and db_rename_key() check for valid utf8; I do not see any other function that can create ODB keys.
-
reporter So all that remains:
- db_set_data(TID_STRING|TID_LINK) should complain or refuse if given non-utf8 string
- midas json encoder and decoder do not care: json odb save files with invalid utf8 are ok
- mjsonrpc should always return valid utf8 data from odb. add a “must be utf8” flag to json encoder?
K.O.
-
reporter This is what I will do:
- db_set_data(TID_STRING|TID_LINK) will be changed to complain about non utf8 data, and leave it up to the user to fix it. (TODO)
- midas json encoder and decoder do not care about utf8, json odb save files with invalid utf8 are ok. (already works this way)
- mjsonrpc should always return valid utf8 data from odb. This is not needed. odb validation already complains about non-utf8 data and it is up to the user to fix it. if everything in odb is utf8 (as confirmed by the odb validation), no need to do anything in the mjsonrpc code. (already works this way).
So only one TODO item remains.
K.O.
-
reporter Added utf8 checks:
- db_set_value()
- db_set_data(), db_set_link_data() (incomplete array check)
- db_set_data_index() and db_set_value_index()
K.O.
-
reporter After refactoring of the db_set_xxx() code, utf8 checks are done everywhere, except for db_set_data() of a string array. K.O.
-
reporter commit e6050b7 adds utf8 check for string arrays. K.O.
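As an illustration only (this is not the code from the commit), checking a string array could look roughly like the sketch below. It assumes the array is stored as `num_values` fixed-size slots of `item_size` bytes each, with each string NUL-terminated inside its slot. The function name `first_bad_string_element()` is hypothetical, and the per-string check is passed in as a callback so the sketch does not depend on any particular UTF-8 validator.

```cpp
#include <cstddef>
#include <cstring>
#include <functional>

// Hypothetical sketch: walk an array of fixed-size string slots and return
// the index of the first element that fails the check, or -1 if all pass.
static int first_bad_string_element(const char* data, size_t item_size, size_t num_values,
                                    const std::function<bool(const char*, size_t)>& check)
{
   for (size_t i = 0; i < num_values; i++) {
      const char* item = data + i * item_size;
      // only the bytes before the NUL terminator belong to the string;
      // if there is no terminator, the whole slot is checked
      const void* nul = std::memchr(item, 0, item_size);
      size_t len = nul ? (size_t)((const char*)nul - item) : item_size;
      if (!check(item, len))
         return (int)i;
   }
   return -1;
}
```

Returning the index of the offending element lets the caller name it in the error message, so the user knows which array entry to fix.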
-
reporter - changed status to resolved
fixed as of commit e6050b7 (branch odb-refactor). odb validation checks existing data in odb, db_set_xxx() methods check new data. K.O.
ODB now checks and will complain if TID_STRING and TID_LINK values are not valid UTF-8 unicode. K.O.