#161 - alarm "logger is not running" does not trigger

Issue #161 resolved

dd1 created an issue 2019-01-22

In the agdaq system, the alarm for "logger is not running" does not trigger (and so "auto restart" does not happen). The alarms for other programs, e.g. "feevb is not running", do trigger correctly and do cause the run to stop. Strange... K.O.

Comments (14)

Stefan Ritt
Have you checked "/Programs/Logger/Alarm class" to be nonzero and "/Programs/Logger/Required = y"?
- 2019-01-23T09:52:25+00:00
dd1 reporter
Sure. the alarms for "feevb not running" or "mserver is not running" work, but "mlogger ..." does not. Weird. K.O.
- 2019-01-25T18:44:28+00:00
dd1 reporter
- changed status to closed
I do not see this problem anymore. K.O.
- 2020-08-06T14:20:51+00:00
dd1 reporter
- changed status to open
- 2020-11-22T00:00:18+00:00
dd1 reporter
- assigned issue to dd1
- 2020-11-22T00:00:36+00:00
dd1 reporter
ok, I see the problem. this time with PWB_A_UDP. the program is running (fenudp.exe), but on the status page it shows red (“frontend stopped”), and it does not show up in “odbedit scl”. The “program not running” alarm is not triggering.

the reason the alarm is not triggering is because “first_failed” keeps changing, it is always within 7 seconds from current time.

“first_failed” is only referenced in alarm.cxx.

most likely it is fenudp.exe that keeps updating it (resetting it to zero) inside its own al_check().

so we have an inconsistency. looks like PWB_A_UDP was removed from /System/Clients but not from ODB and not from its output event buffer (usually SYSTEM, but BUFUDP in this case).

K.O.
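
The “first_failed” behaviour described in this comment can be illustrated with a small simulation. This is a sketch under assumptions, not the code in alarm.cxx: `ProgramState`, `check_program` and `alarm_ever_fires` are hypothetical names, and the threshold and the 5-second reset period are made-up values. It assumes one checker that does not find the client and records the failure time, and a second checker (the program itself) that sees the program as running and clears the timestamp.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the per-program alarm state kept in ODB.
struct ProgramState {
    std::int64_t first_failed = 0;  // 0 means "no failure currently recorded"
};

// One checker's pass over one program. Returns true if the alarm should fire.
// `client_found` is this checker's own view of whether the program is running.
bool check_program(ProgramState& st, std::int64_t now, bool client_found,
                   std::int64_t threshold_sec) {
    assert(threshold_sec > 0);  // sanity check on the hypothetical parameter
    if (client_found) {
        st.first_failed = 0;    // program looks alive: clear the failure timestamp
        return false;
    }
    if (st.first_failed == 0)
        st.first_failed = now;  // first time this failure is seen
    return now - st.first_failed >= threshold_sec;
}

// Simulate `seconds` of checks, one per second, by an observer that does not
// see the client. If `self_resets` is true, the program itself also runs the
// check every 5 seconds (assumed period) and sees itself as running.
bool alarm_ever_fires(bool self_resets, std::int64_t seconds,
                      std::int64_t threshold_sec) {
    ProgramState st;
    for (std::int64_t now = 1000; now < 1000 + seconds; ++now) {
        if (check_program(st, now, /*client_found=*/false, threshold_sec))
            return true;
        if (self_resets && now % 5 == 0)
            check_program(st, now, /*client_found=*/true, threshold_sec);
    }
    return false;
}
```

With these assumed numbers, `alarm_ever_fires(false, 120, 60)` is true while `alarm_ever_fires(true, 120, 60)` is false: as long as something keeps clearing the timestamp, the elapsed time never reaches the threshold.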

- 2020-11-22T00:09:53+00:00
dd1 reporter
ok, now I understand the malfunction of the agdaq system. because PWB_A_UDP was removed from /System/Clients, it was not getting the run start transitions. K.O.
- 2020-11-22T00:13:24+00:00
dd1 reporter
somehow I now see this very often. A client’s record is removed from /System/Clients, but the program is still running, connected to ODB and the event buffer. I think as part of “am I alive” checks, in addition to “am I connected to ODB?” and “are my handles into the event buffer still valid?” also check that my entry in /System/Clients still exists. Hmm…. K.O.
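
The extended “am I alive” check proposed in this comment could look roughly like the following. This is only a sketch: `ClientView`, `SelfCheck` and `self_check` are hypothetical names and not part of the MIDAS API, and a `std::set` of names stands in for the entries under /System/Clients. What a client should do once it finds its registration missing (re-register, exit, raise an alarm) is not decided in this thread and is left out.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical snapshot of what a client can observe about itself.
struct ClientView {
    bool odb_connected;         // "am I connected to ODB?"
    bool buffer_handles_valid;  // "are my handles into the event buffer still valid?"
};

enum class SelfCheck { Ok, LostOdb, LostBuffer, LostRegistration };

// `registered_clients` stands in for the client entries under /System/Clients.
SelfCheck self_check(const ClientView& view,
                     const std::set<std::string>& registered_clients,
                     const std::string& my_name) {
    assert(!my_name.empty());
    if (!view.odb_connected)
        return SelfCheck::LostOdb;
    if (!view.buffer_handles_valid)
        return SelfCheck::LostBuffer;
    // The additional check: my own entry must still be present.
    if (registered_clients.count(my_name) == 0)
        return SelfCheck::LostRegistration;
    return SelfCheck::Ok;
}
```

A client in the state described here (still attached to ODB and the event buffer, but no longer listed) would get `SelfCheck::LostRegistration` instead of passing silently.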
- 2020-11-23T06:01:47+00:00
Stefan Ritt
Well, originally we had the watchdog check done via ss_alarm(). In those days, the problem you describe did not happen. Now you put this into cm_periodic_tasks(). If this function is not called for some reason, the behaviour described above will happen. So before curing the symptoms, I would rather check the cause, i.e. why is cm_periodic_tasks() not called periodically? I guess the logger gets stuck writing to NFS-mounted disks, which can sometimes take very long, but this is just a suspicion.
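
The suspicion in this comment can be put in numbers with a toy model. It is a sketch under assumptions, not MIDAS code: it assumes a single-threaded main loop that calls the periodic task once at the end of each iteration, so a blocking write stretches the gap between two calls. `longest_gap_ms` and `would_time_out` are hypothetical helpers, and the timeout passed in is whatever value the caller assumes.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Longest interval (ms) between two consecutive periodic-task calls, assuming
// exactly one call at the end of each main-loop iteration.
long longest_gap_ms(const std::vector<long>& iteration_ms) {
    long longest = 0;
    for (long d : iteration_ms) {
        assert(d >= 0);  // durations cannot be negative
        longest = std::max(longest, d);
    }
    return longest;
}

// True if some iteration blocked for longer than the assumed watchdog timeout.
bool would_time_out(const std::vector<long>& iteration_ms,
                    long watchdog_timeout_ms) {
    return longest_gap_ms(iteration_ms) > watchdog_timeout_ms;
}
```

For example, with an assumed 10000 ms timeout, iterations of `{100, 250, 100}` ms are fine, while a single 25000 ms stall in `{100, 25000, 100}` exceeds it.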
- 2020-11-23T07:12:50+00:00
dd1 reporter
as I said, the offending programs (fenudp.exe) are still connected to ODB and to the event buffers, so cm_watchdog() is definitely running. I see an even stranger case: I see 4 copies of mhttpd running (odb “autorestart” is set to “yes”), I look with the debugger, and I see all threads are gone except for the cm_watchdog() thread. So as best I can tell, all this cm_periodic_tasks() stuff and the cm_watchdog() thread seem to be working correctly. K.O.
- 2020-11-24T17:59:50+00:00
dd1 reporter
what I do not understand is this. when we detect a watchdog timeout, we remove the offending client from ODB and remove it from the event buffers and remove it from /System/Clients. As a chaser, we send them a SIGKILL signal as the proverbial wooden stake through the heart, so no zombie is left behind. What I see is clients removed from /System/Clients, but still running, still connected to ODB, etc. K.O.
- 2020-11-24T18:03:42+00:00
dd1 reporter
the problem is definitely connected to access to files in the home directory. I did not see these problems before the home directory was moved from a local SSD (failed, unexpectedly switched to read-only mode) to an NFS mounted SSD from another machine. What I see is periodic hangs of NFS, switching from NFSv4 to NFSv3 seems to help, but not 100%. K.O.
- 2020-11-24T18:06:58+00:00
Stefan Ritt
I had cases where a process writing to NFS blocked for 20-30 seconds, and sometimes in a way that even alarms were blocked. Moving everything to local storage fixed that. Since then, I never run for example elog on an NFS drive.
- 2020-11-26T16:35:14+00:00
dd1 reporter
- changed status to resolved
have not seen this problem in a while, closing this bug again. K.O.
- 2022-03-29T01:05:08+00:00

Assignee: dd1

Type: bug

Priority: major

Status: resolved

Votes: 0

Watchers: 1
