Duplicated data returned by hs_read_buffer when using MySQL

Issue #189 resolved
Ben Smith created an issue

Symptom: calling mjsonrpc's hs_read_arraybuffer can sometimes return “duplicated” data: you are told that there are twice as many data points as there should be for a variable, and you get the data in the order T0,T1…TN,T0,T1…TN.

The problem appears for data coming from the “System” event (e.g. links defined in /History/Links/System), and I think it only affects MySQL history systems.

The main function to study is SchemaHistoryBase::hs_read_buffer in history_schema.cxx.

The underlying cause is that fSchema (populated by SqlHistoryBase::read_schema calling MysqlHistory::read_table_and_event_names) contains schemas where the event_name is system AND schemas where the event_name is System (note the different capitalisation). The first is created directly in MysqlHistory::read_table_and_event_names, has a time_from of 0, and sets the event_name to be the same as the table_name; the second is created in ReadMysqlTableNames, has a time_from that depends on when the schema was last changed, and sets the event_name to be the “real” MIDAS event name.

The problem is that SchemaHistoryBase::hs_read_buffer reads data from BOTH of these schemas! Reading from multiple schemas normally makes sense: if the schema changed during your period of interest, you still want all the data. But here the schemas have overlapping validity periods (and in my case we read the full set of data twice).

I don't know enough about the MySQL history system to suggest the correct resolution. Are both versions of the schema required (the one based on the table name and the one based on the event name)? If not, the call to ReadMysqlTableNames could probably be removed. If both are required, then perhaps some extra logic needs to be added to hs_read_buffer so that when a variable matches multiple schemas, we don't re-read data for periods we've already read. Or a "de-duplication" pass at the end of hs_read_buffer, if that's easier.

Comments (11)

  1. dd1

    I think I see trouble when history events are renamed, i.e. /equipment/slow becomes /equipment/Slow.

sqlite history completely broke from this because the sqlite database name “mh_slow_slow.sqlite3” is not case-sensitive on a Mac, so “mh_Slow_slow” is the same as “mh_slow_slow”.

    But this does not seem to create duplicated data.

    So to return duplicate data, either there is duplicate data in the database or we read the same data twice.

Also, in SchemaHistoryBase::hs_read_buffer(), we select the schemas we will read, then we read them one at a time, but we do not keep track of time progress - so we trust that the schemas are already time-ordered and we trust that we do not have duplicate/aliased schemas.

    I think I will add a check there - keep track of time going forward, and complain if data is not time-ordered. This will also catch duplicate/aliased schema.

I think I can also catch aliased schemas - if two schemas refer to the same SQL table and their time ranges overlap, we have aliasing.

    But no way to duplicate the problem to confirm it is fixed…

    K.O.

  2. dd1

in history_schema.cxx there is confusion between case-sensitive and case-insensitive things. in some places I use std::string operator==(), which is case-sensitive, in other places I use strcasecmp(), which is case-insensitive.

    since we now use utf-8 strings, I think we should bite the bullet and ditch all the case-insensitive stuff. use case-sensitive string comparisons everywhere.

    K.O.

  3. dd1

did the opposite, made all event names and variable names case-insensitive. This fixes the problem with partially-case-insensitive sqlite and the confusion with case sensitivity in mysql. K.O.

  4. dd1

    Initial attempt to implement protection against duplicate data was unsuccessful. Will try again… K.O.

  5. dd1

Ok, I see the problem with js_hs_read_arraybuffer() - in history_schema.cxx, “class ReadBuffer” is protected against non-time-monotonic and duplicate data, but in mjsonrpc.cxx “class ReadBuffer” does not have this protection, so duplicate data is possible. I think the protection is best done in the history_schema base class, where we know what is happening and are in a better position to complain about non-monotonic data and to detect duplicate data and duplicate schemas. K.O.

  6. dd1

mjsonrpc.cxx “class ReadBuffer” is now protected against duplicate and non-monotonic data by code in HsSqlSchema::read_data() and HsFileSchema::read_data(). K.O.
