watchdog timeout during sync run transitions
We have a report of msequencer being killed by the watchdog timeout. msequencer runs synchronous transitions, and if they take longer than the watchdog timeout, the watchdog kills it because cm_yield() is not called periodically. mhttpd, mlogger & co run multithreaded transitions and are not affected by this problem. K.O.
Comments (10)
-
reporter There are other places in MIDAS where we are in a loop and maybe we are not updating the watchdog timeout: write to event buffer waits for free space in the buffer, and read from event buffer waits for new data. (Both cases are NOT protected by a timeout). Any other places like this? K.O.
-
Actually the problem lies inside cm_transition when called with both flags TR_MTHREAD | TR_SYNC, like the sequencer does. This function creates separate threads for all clients, and then runs a while loop until all threads have finished (_trp.finished, midas.cxx, line 4949). Inside the loop we only have an ss_sleep(10). Replacing the ss_sleep(10) with cm_yield(10) should fix the problem independent of the sequencer watchdog timeout. If KO does not see a problem there, I’m ready to commit this change.
Stefan
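For readers who do not have midas.cxx open, here is a rough standalone sketch of the wait-loop pattern described above. It is not the MIDAS code: Watchdog, wait_for_workers and run_demo are hypothetical names, and touch() only stands in for whatever cm_yield() does to keep the watchdog alive.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

// Hypothetical stand-in for the client watchdog; it only counts refreshes.
struct Watchdog {
    std::atomic<int> updates{0};
    void touch() { ++updates; }
};

// Poll until every worker has set its "finished" flag.
// Each pass sleeps poll_ms and then refreshes the watchdog, instead of
// sleeping only. Returns the number of polling passes.
int wait_for_workers(std::vector<std::atomic<bool>>& finished,
                     Watchdog& wd, int poll_ms) {
    int passes = 0;
    for (;;) {
        bool all_done = true;
        for (auto& f : finished) {
            if (!f.load()) { all_done = false; break; }
        }
        if (all_done) return passes;
        std::this_thread::sleep_for(std::chrono::milliseconds(poll_ms));
        wd.touch();  // the step a bare sleep would skip
        ++passes;
    }
}

// Start n_workers threads that each "work" for work_ms, wait for them,
// and report how often the watchdog was refreshed while waiting.
int run_demo(int n_workers, int work_ms) {
    std::vector<std::atomic<bool>> finished(n_workers);
    for (auto& f : finished) f.store(false);
    Watchdog wd;
    std::vector<std::thread> threads;
    for (int i = 0; i < n_workers; i++) {
        threads.emplace_back([&finished, i, work_ms] {
            std::this_thread::sleep_for(std::chrono::milliseconds(work_ms));
            finished[i].store(true);
        });
    }
    wait_for_workers(finished, wd, 10);
    for (auto& t : threads) t.join();
    return wd.updates.load();
}
```

Calling run_demo(3, 50) returns a count of at least 1, i.e. the watchdog was refreshed while the transition threads were still running.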
-
- changed status to resolved
The change seems to work ok, so I close this issue.
-
reporter - changed status to open
The fix is in this commit, and I think it does fix the crash, but I think this fix is wrong - cm_yield() does too much stuff. Plus the watchdog timeout while waiting for the RPC reply is still unfixed. I will reopen this bug report while I think about it. https://bitbucket.org/tmidas/midas/commits/de001f0d8b2d70faebfaae5a416ea941044e4350
-
Sure, you can put cm_periodic_tasks() there if you like.
-
reporter The problem with cm_yield() is that we have to catch the return value, and if it is RPC_SHUTDOWN, SS_ABORT, etc., return it to the caller, who then has to check for it, and so on. If we “eat” the return value, things like odbedit “sh client” stop working because we “eat” the shutdown message.
Replacing cm_yield() with cm_periodic_tasks() also works; I have now tested it.
K.O.
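A minimal sketch of the return-value concern, under stated assumptions: Status, kSuccess, kShutdown and wait_propagating are hypothetical names (the numeric values are not the MIDAS ones), and yield_once stands in for a call like cm_yield(). The point is only that the loop has to hand any non-success status back to its caller instead of dropping it.

```cpp
#include <cassert>
#include <functional>

// Hypothetical status codes for illustration; not the MIDAS constants.
enum Status { kSuccess = 1, kShutdown = 2 };

// Loop until done() is true. Any non-success status from yield_once() is
// returned to the caller immediately, so a shutdown request is not "eaten".
Status wait_propagating(const std::function<bool()>& done,
                        const std::function<Status()>& yield_once) {
    while (!done()) {
        Status s = yield_once();
        if (s != kSuccess) return s;  // caller must check and act on this
    }
    return kSuccess;
}
```

If the status check inside the loop were left out, a shutdown request arriving during the wait would be lost, which is the odbedit “sh client” failure described above.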
-
reporter I do not like this solution: putting cm_yield() or cm_periodic_tasks() inside an infinite loop defeats the purpose of the watchdog timeout. K.O.
-
reporter This problem is fixed by fix to issue 207 https://bitbucket.org/tmidas/midas/issues/207/watchdog-timeout-during-rpc-calls
K.O.
-
reporter - changed status to closed
Fixed by fix to issue 207, commit 68c69a4. K.O.
The watchdog timeout is happening inside the RPC call: while waiting for the RPC reply, we are not updating the watchdog. So it is not just run transitions that are affected; any RPC call can cause a watchdog timeout if the RPC timeout is longer than the watchdog timeout. One fix for this is to have the RPC “wait for reply” loop update the watchdog timeout periodically. This will not defeat the watchdog (“program runs, but does not do anything”) because the wait loop will eventually finish when the RPC timeout is exhausted. K.O.
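A rough sketch of the bounded “wait for reply” loop described above, not the actual fix from issue 207. WaitResult, wait_for_reply, reply_ready and update_watchdog are hypothetical names; the real RPC layer waits on a socket rather than polling a callback.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Outcome of one wait: whether a reply arrived, and how many watchdog
// refreshes happened while waiting.
struct WaitResult {
    bool got_reply;
    int watchdog_updates;
};

// Wait for a reply in short slices. The watchdog is refreshed after every
// slice, but the loop is bounded by rpc_timeout, so a peer that never
// answers cannot keep this client "alive" forever.
WaitResult wait_for_reply(const std::function<bool()>& reply_ready,
                          std::chrono::milliseconds rpc_timeout,
                          std::chrono::milliseconds slice,
                          const std::function<void()>& update_watchdog) {
    using clock = std::chrono::steady_clock;
    const auto deadline = clock::now() + rpc_timeout;
    int updates = 0;
    while (clock::now() < deadline) {
        if (reply_ready()) return {true, updates};
        std::this_thread::sleep_for(slice);
        update_watchdog();
        ++updates;
    }
    // Deadline reached: one last check, then give up.
    return {reply_ready(), updates};
}
```

With a 50 ms RPC timeout, 10 ms slices and a reply that never comes, the loop refreshes the watchdog a handful of times and then returns with got_reply set to false, so the refreshes stop when the RPC timeout does.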