alarm "logger is not running" does not trigger

Issue #161 resolved
dd1 created an issue

In the agdaq system, the alarm for "logger is not running" does not trigger (and so "auto restart" does not happen). The alarms for other programs, e.g. "feevb is not running", do trigger correctly and do cause the run to stop. Strange... K.O.

Comments (14)

  1. Stefan Ritt

    Have you checked that "/Programs/Logger/Alarm class" is nonzero and "/Programs/Logger/Required = y"?

  2. dd1 reporter

    Sure. the alarms for "feevb not running" or "mserver is not running" work, but "mlogger ..." does not. Weird. K.O.

  3. dd1 reporter

    ok, I see the problem. this time with PWB_A_UDP. the program is running (fenudp.exe), but on the status page it shows red (“frontend stopped”), and it does not show up in “odbedit scl”. The “program not running” alarm is not triggering.

    the reason the alarm is not triggering is that “first_failed” keeps changing: it is always within 7 seconds of the current time.

    “first_failed” is only referenced in alarm.cxx.

    most likely it is fenudp.exe that keeps updating it (resetting it to zero) inside its own al_check().

    so we have an inconsistency. looks like PWB_A_UDP was removed from /System/Clients but not from ODB and not from its output event buffer (usually SYSTEM, but BUFUDP in this case).

    K.O.
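
The effect described in comment 3 can be illustrated with a minimal sketch. It is a simplified model, not the code in alarm.cxx: it assumes the alarm fires only after the program has been seen as missing for a full grace period, and that any checker which sees the program as alive resets `first_failed` to zero. The names `ProgramAlarmState`, `check_missing`, `check_alive` and `alarm_fires_within` are hypothetical.

```cpp
#include <cstdint>

// Hypothetical, simplified model of a "program is not running" check.
// This is NOT the code from alarm.cxx; names and logic are assumptions.
struct ProgramAlarmState {
    std::uint32_t first_failed = 0;  // time the program was first seen missing; 0 means "not failing"
};

// A checker that sees the program as missing. Returns true when the alarm
// should fire, i.e. the program has been missing for at least grace_sec seconds.
bool check_missing(ProgramAlarmState& s, std::uint32_t now, std::uint32_t grace_sec) {
    if (s.first_failed == 0)
        s.first_failed = now;
    return now - s.first_failed >= grace_sec;
}

// A checker that sees the program as alive (for example the program itself)
// clears the timer.
void check_alive(ProgramAlarmState& s) {
    s.first_failed = 0;
}

// Simulate duration_sec seconds with one check per second. Every
// reset_period_sec seconds (0 = never) the "alive" checker runs instead of
// the "missing" checker.
bool alarm_fires_within(std::uint32_t duration_sec, std::uint32_t grace_sec,
                        std::uint32_t reset_period_sec) {
    ProgramAlarmState s;
    for (std::uint32_t now = 1; now <= duration_sec; ++now) {
        if (reset_period_sec != 0 && now % reset_period_sec == 0) {
            check_alive(s);
            continue;
        }
        if (check_missing(s, now, grace_sec))
            return true;
    }
    return false;
}
```

In this model, `alarm_fires_within(60, 10, 0)` returns true, while `alarm_fires_within(60, 10, 5)` returns false: a reset every 5 seconds keeps `first_failed` closer to the current time than the 10-second grace period, so the alarm never fires.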

  4. dd1 reporter

    ok, now I understand the malfunction of the agdaq system. because PWB_A_UDP was removed from /System/Clients, it was not getting the run start transitions. K.O.

  5. dd1 reporter

    somehow I now see this very often. A client’s record is removed from /System/Clients, but the program is still running, connected to ODB and the event buffer. I think that as part of the “am I alive” checks, in addition to “am I connected to ODB?” and “are my handles into the event buffer still valid?”, we should also check that my entry in /System/Clients still exists. Hmm…. K.O.
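
The extended “am I alive” check proposed in comment 5 could look roughly like the sketch below. It models the three registries as plain sets of client names and does not use the MIDAS API; `ExperimentState`, `Liveness` and `am_i_alive` are hypothetical names introduced only for illustration.

```cpp
#include <set>
#include <string>

// Hypothetical model of the three places a client could verify.
// This does not use the MIDAS API; all names are assumptions.
struct ExperimentState {
    std::set<std::string> odb_connections;  // clients attached to ODB
    std::set<std::string> buffer_clients;   // clients holding valid event buffer handles
    std::set<std::string> system_clients;   // entries under /System/Clients
};

enum class Liveness { Ok, NoOdbConnection, BadBufferHandle, MissingClientsEntry };

// Run the two existing checks, then the proposed third one.
Liveness am_i_alive(const ExperimentState& st, const std::string& me) {
    if (st.odb_connections.count(me) == 0)
        return Liveness::NoOdbConnection;
    if (st.buffer_clients.count(me) == 0)
        return Liveness::BadBufferHandle;
    if (st.system_clients.count(me) == 0)
        return Liveness::MissingClientsEntry;  // the new check from comment 5
    return Liveness::Ok;
}
```

A client that is attached to ODB and to its event buffer but has lost its /System/Clients entry would get `Liveness::MissingClientsEntry` and could then re-register or exit, instead of silently missing run transitions.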

  6. Stefan Ritt

    Well, originally we had the watchdog check done via ss_alarm(). In those days, the problem you describe did not happen. Now you have put this into cm_periodic_tasks(). If this function is not called for some reason, the behaviour described above will happen. So before curing the symptoms, I would rather check the cause, i.e. why is cm_periodic_tasks() not called periodically? I guess the logger gets stuck in some write to an NFS-mounted disk, which can sometimes take very long, but this is just a suspicion.

  7. dd1 reporter

    as I said, the offending programs (fenudp.exe) are still connected to ODB and to the event buffers, so cm_watchdog() is definitely running. I see an even stranger case: 4 copies of mhttpd are running (odb “autorestart” is set to “yes”), and when I look with the debugger, I see that all threads are gone except for the cm_watchdog() thread. So as best I can tell, all this cm_periodic_tasks() stuff and the cm_watchdog() thread seem to be working correctly. K.O.

  8. dd1 reporter

    what I do not understand is this. when we detect a watchdog timeout, we remove the offending client from ODB and remove it from the event buffers and remove it from /System/Clients. As a chaser, we send them a SIGKILL signal as the proverbial wooden stake through the heart, so no zombie is left behind. What I see is clients removed from /System/Clients, but still running, still connected to ODB, etc. K.O.
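
The inconsistent state described in comment 8 (a client gone from /System/Clients but still attached elsewhere) can be detected from the outside with a simple audit. The sketch below is a model using plain sets of client names, not MIDAS calls; the function name `find_half_removed` is hypothetical.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical audit: list clients that are still attached to ODB or to an
// event buffer but have no entry under /System/Clients. Not MIDAS code.
std::vector<std::string> find_half_removed(const std::set<std::string>& system_clients,
                                           const std::set<std::string>& odb_connections,
                                           const std::set<std::string>& buffer_clients) {
    // Union of everything that is still attached somewhere.
    std::set<std::string> attached(odb_connections);
    attached.insert(buffer_clients.begin(), buffer_clients.end());

    std::vector<std::string> result;
    for (const auto& name : attached) {
        if (system_clients.count(name) == 0)
            result.push_back(name);
    }
    return result;  // in sorted order, because std::set iterates in order
}
```

An empty result means the three views agree; any name returned is a client that a watchdog cleanup removed only partially.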

  9. dd1 reporter

    the problem is definitely connected to access to files in the home directory. I did not see these problems before the home directory was moved from a local SSD (failed, unexpectedly switched to read-only mode) to an NFS-mounted SSD on another machine. What I see is periodic hangs of NFS; switching from NFSv4 to NFSv3 seems to help, but not 100%. K.O.

  10. Stefan Ritt

    I had cases where a process writing to NFS blocked for 20-30 seconds, sometimes in a way that even alarms were blocked. Moving everything to local storage fixed that. Since then, I have never run, for example, elog on an NFS drive.
