tmidas / midas, issue #192: watchdog timeout during sync run transitions

Issue #192 closed

dd1 created an issue 2019-09-27

We have a report of msequencer being killed by a watchdog timeout. msequencer runs synchronous transitions, and if a transition takes longer than the watchdog timeout, there will be trouble because cm_yield() is not called periodically in the meantime. mhttpd, mlogger & co. run multithreaded transitions and are not affected by this problem. K.O.

Comments (10)

dd1 reporter
The watchdog timeout is happening inside the RPC call: while waiting for the RPC reply, we are not updating the watchdog. So it is not just run transitions that are affected; any RPC call can cause a watchdog timeout if the RPC timeout is longer than the watchdog timeout. One fix for this is to have the RPC “wait for reply” loop update the watchdog periodically. This will not defeat the watchdog (“program runs, but does not do anything”) because the wait loop will eventually finish when the RPC timeout is exhausted. K.O.
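A minimal sketch of the idea described above, not the actual MIDAS code: the wait loop refreshes the watchdog on every poll, but only until the RPC deadline, so the watchdog is never kept alive indefinitely. The names `wait_for_reply`, `reply_ready` and `touch_watchdog` are hypothetical stand-ins and do not exist in MIDAS.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical result type, for illustration only.
enum class WaitResult { Reply, Timeout };

// reply_ready:    hypothetical poll, returns true once the RPC reply is in.
// touch_watchdog: hypothetical hook that refreshes this client's watchdog.
WaitResult wait_for_reply(const std::function<bool()>& reply_ready,
                          const std::function<void()>& touch_watchdog,
                          std::chrono::milliseconds rpc_timeout,
                          std::chrono::milliseconds poll_interval)
{
    assert(poll_interval.count() > 0);
    using clock = std::chrono::steady_clock;
    const auto deadline = clock::now() + rpc_timeout;

    while (clock::now() < deadline) {
        if (reply_ready())
            return WaitResult::Reply;
        // Keep the watchdog alive, but only while the RPC timeout has not
        // expired: the loop is bounded, so a hung peer still ends the wait.
        touch_watchdog();
        std::this_thread::sleep_for(poll_interval);
    }
    return WaitResult::Timeout;
}
```

The bound comes from `rpc_timeout`: once the deadline passes, the loop returns `WaitResult::Timeout` and stops refreshing the watchdog.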

- 2019-09-27T16:54:57+00:00
dd1 reporter
There are other places in MIDAS where we are in a loop and maybe we are not updating the watchdog timeout: write to event buffer waits for free space in the buffer, and read from event buffer waits for new data. (Both cases are NOT protected by a timeout). Any other places like this? K.O.

- 2019-09-27T16:57:12+00:00
Stefan Ritt
Actually, the problem lies inside cm_transition when it is called with both flags TR_MTHREAD | TR_SYNC, as the sequencer does. This function creates separate threads for all clients and then runs a while loop until all threads have finished (_trp.finished, midas.cxx, line 4949). Inside the loop we have only an ss_sleep(10). Replacing the ss_sleep(10) with cm_yield(10) should fix the problem independently of the sequencer watchdog timeout. If KO does not see a problem there, I’m ready to commit this change.
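The shape of that wait loop can be modelled roughly as follows. This is a simplified sketch, not the code in midas.cxx: `run_and_wait`, `client_work` and `periodic` are hypothetical names, and `periodic` stands in for whatever is called in place of the bare sleep.

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Starts one thread per client, then waits for all of them to finish.
// Returns the number of wait-loop iterations that were executed.
int run_and_wait(const std::vector<std::function<void()>>& client_work,
                 const std::function<void()>& periodic)
{
    std::atomic<std::size_t> finished{0};
    std::vector<std::thread> threads;
    for (const auto& work : client_work)
        threads.emplace_back([&finished, work] { work(); ++finished; });

    int iterations = 0;
    while (finished.load() < client_work.size()) {
        // With only a sleep here, nothing services the watchdog while the
        // transition threads run; the hook gives the loop a chance to do so.
        periodic();
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
        ++iterations;
    }
    for (auto& t : threads)
        t.join();
    return iterations;
}
```

In this model the hook runs once per iteration on the calling thread, which is where the watchdog would need to be serviced.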

Stefan
- 2019-10-15T11:19:26+00:00
Stefan Ritt
- changed status to resolved
The change seems to work OK, so I am closing this issue.
- 2019-10-23T10:32:02+00:00
dd1 reporter
- changed status to open
The fix is in this commit, and I think it does fix the crash, but I think this fix is wrong - cm_yield() does too much stuff. Plus the watchdog timeout while waiting for the RPC reply is still unfixed. I will reopen this bug report while I think about it. https://bitbucket.org/tmidas/midas/commits/de001f0d8b2d70faebfaae5a416ea941044e4350
- 2019-11-30T03:02:49+00:00
Stefan Ritt
Sure, you can put cm_periodic_tasks() there if you like.
- 2019-12-02T12:27:32+00:00
dd1 reporter
The problem with cm_yield() is that we have to catch its return value and, if it is RPC_SHUTDOWN, SS_ABORT, etc., return it to the caller, who in turn has to check for it, and so on. If we “eat” the return value, things like odbedit “sh client” stop working because we “eat” the shutdown message.

Replacing cm_yield() with cm_periodic_tasks() also works; I have now tested it.
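The return-value problem can be illustrated with a small sketch. The status values and the names `wait_until_done`, `done` and `yield` are hypothetical and only mimic the pattern; they are not the MIDAS constants or functions.

```cpp
#include <functional>

// Hypothetical status codes, for illustration only.
enum class WaitStatus { Ok, Shutdown, Abort };

// Waits until `done` returns true, calling `yield` on each iteration.
// Any status other than Ok is handed back to the caller instead of being
// dropped, so a shutdown request seen during the wait is not lost.
WaitStatus wait_until_done(const std::function<bool()>& done,
                           const std::function<WaitStatus()>& yield)
{
    while (!done()) {
        const WaitStatus s = yield();
        if (s != WaitStatus::Ok)
            return s;  // propagate; the caller must check for this
    }
    return WaitStatus::Ok;
}
```

Every caller up the chain then has to check the returned status, which is the extra burden described above.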

K.O.
- 2019-12-02T19:38:21+00:00
dd1 reporter
I do not like this solution: we put cm_yield() or cm_periodic_tasks() inside an infinite loop and defeat the purpose of the watchdog timeout. K.O.
- 2019-12-02T19:48:55+00:00
dd1 reporter
This problem is fixed by fix to issue 207 https://bitbucket.org/tmidas/midas/issues/207/watchdog-timeout-during-rpc-calls

K.O.

- 2019-12-02T19:50:21+00:00
dd1 reporter
- changed status to closed
Fixed by fix to issue 207, commit 68c69a4. K.O.
- 2019-12-02T19:51:01+00:00

Assignee: –

Type: bug

Priority: major

Status: closed

Votes: 0

Watchers: 1
