watchdog timeout during sync run transitions

Issue #192 closed
dd1 created an issue

We have a report of msequencer being killed by the watchdog timeout. msequencer runs synchronous transitions, and if they take longer than the watchdog timeout there will be trouble, because cm_yield() is not called periodically. mhttpd, mlogger & co. run multithreaded transitions and are not affected by this problem. K.O.

Comments (10)

  1. dd1 reporter

    The watchdog timeout is happening inside the RPC call: while waiting for the RPC reply, we are not updating the watchdog. So it is not just run transitions that are affected; any RPC call can cause a watchdog timeout if the RPC timeout is longer than the watchdog timeout. One fix is to have the RPC “wait for reply” loop update the watchdog periodically. This will not defeat the watchdog (“program runs, but does not do anything”) because the wait loop will eventually finish when the RPC timeout is exhausted. K.O.
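
A minimal sketch of the bounded wait loop described in this comment, written as standalone C++ rather than MIDAS code. The function `wait_for_reply` and its callback parameters `reply_ready` and `update_watchdog` are hypothetical names invented for illustration; they are not MIDAS APIs. The point is only that the loop refreshes the watchdog on each pass, yet still ends once the RPC timeout has elapsed.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical stand-in, not a MIDAS API: waits for a reply for at most
// rpc_timeout and refreshes the watchdog every poll_interval while waiting.
// Returns true if the reply arrived, false if the RPC timeout expired.
bool wait_for_reply(const std::function<bool()>& reply_ready,
                    const std::function<void()>& update_watchdog,
                    std::chrono::milliseconds rpc_timeout,
                    std::chrono::milliseconds poll_interval)
{
   using clock = std::chrono::steady_clock;
   const auto deadline = clock::now() + rpc_timeout;

   while (clock::now() < deadline) {
      if (reply_ready())
         return true;
      update_watchdog();  // signal "alive, but waiting" to the watchdog
      std::this_thread::sleep_for(poll_interval);
   }
   return reply_ready();  // one last check after the deadline
}
```

Because the deadline is fixed before the loop starts, a peer that never replies can keep the watchdog quiet only for the length of the RPC timeout.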

  2. dd1 reporter

    There are other places in MIDAS where we are in a loop and maybe we are not updating the watchdog timeout: write to event buffer waits for free space in the buffer, and read from event buffer waits for new data. (Both cases are NOT protected by a timeout). Any other places like this? K.O.

  3. Stefan Ritt

    Actually the problem lies inside cm_transition when called with both flags TR_MTHREAD | TR_SYNC, like the sequencer does. This function creates separate threads for all clients, and then runs a while loop until all threads have finished (_trp.finished, midas.cxx, line 4949). Inside the loop we have only an ss_sleep(10). By replacing the ss_sleep(10) with cm_yield(10), we should fix the problem independently of the sequencer watchdog timeout. If KO does not see a problem there, I’m ready to commit this change.

    Stefan
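
The shape of that wait loop can be sketched in standalone C++ as follows. This is an illustration under stated assumptions, not the code in midas.cxx: `run_and_wait`, `client_work` and `periodic` are hypothetical names, and the `periodic` callback merely marks the spot where the proposal would call cm_yield(10) where the loop currently only sleeps.

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for the wait loop: starts one worker thread per
// client, then polls until every worker has reported that it finished.
// periodic() runs on each pass, where the original loop only slept.
// Returns the number of passes through the wait loop.
int run_and_wait(const std::vector<std::function<void()>>& client_work,
                 const std::function<void()>& periodic)
{
   std::atomic<std::size_t> finished{0};
   std::vector<std::thread> workers;
   workers.reserve(client_work.size());
   for (const auto& work : client_work)
      workers.emplace_back([&finished, &work] { work(); ++finished; });

   int passes = 0;
   while (finished.load() < client_work.size()) {
      periodic();  // housekeeping hook (watchdog refresh in the real case)
      std::this_thread::sleep_for(std::chrono::milliseconds(10));
      ++passes;
   }

   for (auto& t : workers)
      t.join();
   return passes;
}
```

Note that this loop has no deadline of its own: it ends only when all workers finish, so the housekeeping call keeps the process looking alive for as long as any worker is stuck.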

  4. dd1 reporter

    The problem with cm_yield() is that we have to catch the return value and, if it is RPC_SHUTDOWN, SS_ABORT, etc., return it to the caller, who then has to check for it, and so on. If we “eat” the return value, things like odbedit “sh client” stop working because we “eat” the shutdown message.

    Replacing cm_yield() with cm_periodic_tasks() also works; I have now tested it.

    K.O.
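
To illustrate the return-value concern, here is a small standalone C++ sketch of a wait loop that passes a shutdown or abort status back to its caller instead of dropping it. Everything in it is hypothetical: the `Status` enum and its values are placeholders for this example (the real RPC_SHUTDOWN and SS_ABORT codes come from the MIDAS headers), and `wait_until_done` and `yield` are illustrative names.

```cpp
#include <functional>

// Placeholder status codes for this sketch only; not the MIDAS values.
enum Status { kSuccess = 1, kShutdown = 2, kAbort = 3 };

// Polls until done() is true. If yield() reports a shutdown or abort,
// the loop stops and hands that status to the caller rather than
// swallowing it.
Status wait_until_done(const std::function<bool()>& done,
                       const std::function<Status()>& yield)
{
   while (!done()) {
      const Status s = yield();
      if (s == kShutdown || s == kAbort)
         return s;  // the caller must check for this and unwind
   }
   return kSuccess;
}
```

Every caller up the chain then has to check for the non-success status as well, which is the extra burden this comment points out.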

  5. dd1 reporter

    I do not like this solution - we put cm_yield() or cm_periodic_tasks() inside an infinite loop - and defeat the purpose of the watchdog timeout. K.O.
