service not monitored on device without rtc and ntp due to unwanted state restore

Issue #1093 open
Alexander Dahl created an issue

We noticed monit putting a certain service foo into “not monitored” under certain conditions, which we did not want and did not expect. There’s no “onreboot” statement in our monitrc files, so this should be default behavior (start mode) and that’s what we want/expect. Every service is started on boot through its own init script and after all services have started monit is started and expected to monitor all those. The state file was configured to be stored in /var/lib/monit/state and saved to the rootfs persistently.

Now for a new device we do a special test procedure where we call monit stop foo, test some things which would interfere with the service foo, and shut off the device afterwards, in the expectation everything would be started as usual again on the next boot.

Alas after a power cycle (hard switch off, wait, switch on) that (and only that service) was shown by monit as “not monitored”. The service itself was running however, it was correctly started through init before. We think we found the cause for that and share our investigation below.

We are using monit 5.32.0 on an embedded device without RTC and without NTP. The system clock of the device always starts at 2012-01-01 00:00 UTC after a power on. (When doing a soft reboot the device keeps power and the clock does not jump back. Problem really only happens after a hard power cycle.)

When stopping the service through monit, monit writes a 0 to the monitor field of the particular array member in its state file and writes that file to disk. We proved that by using GNU poke. The pickle file we used for that is attached. You can also spot it with hexdump -C /var/lib/monit/state if you know where to look.

Then after a power cycle monit first remembers the current system time in the variable systeminfo.booted and later reads the previously recorded time from the state v4 file into the variable booted. On our system the duration of the boot is quite deterministic and because after a power cycle the system time always starts with the same value these two times happen to always have the same value: 1325376008 (2012-01-01 00:08).

Later in its startup process monit reads each service entry from the state file and for each service it calls _updateMonitor() ... also for our service which got that 0 stored previously. In that function monit tries to determine to which monitor state a service should be set now, which depends on the “onreboot” setting, whether the system actually rebooted, and the desired and actual monitor state. The problem in our case is the function State_reboot() which falsely detects no reboot here, because it only determines this based on the timestamp which happens to be the same in our case due to missing battery powered RTC and missing NTP, so it’s just like in Groundhog Day.

You can see this from the following debug output (I’ll attach the patch used for that):

[2012-01-01T00:00:28+0000] error    : Error reading id from file '/mnt/data/var/monit/id'
[2012-01-01T00:00:29+0000] debug    : do_init:330: systeminfo.booted set to: 1325376008
[2012-01-01T00:00:29+0000] debug    : pidfile '/run/monit.pid' does not exist
[2012-01-01T00:00:29+0000] info     : Starting Monit 5.32.0 daemon with http interface at [*]:2812
[2012-01-01T00:00:29+0000] debug    : _restoreV4:336: read 'booted' 1325376008 from state file
[2012-01-01T00:00:29+0000] debug    : State_reboot:690: State_reboot() returning false
[2012-01-01T00:00:29+0000] debug    : State_reboot:690: State_reboot() returning false
[2012-01-01T00:00:29+0000] debug    : State_reboot:690: State_reboot() returning false
[2012-01-01T00:00:29+0000] debug    : State_reboot:690: State_reboot() returning false
[2012-01-01T00:00:29+0000] debug    : State_reboot:690: State_reboot() returning false
[2012-01-01T00:00:29+0000] debug    : State_reboot:690: State_reboot() returning false
[2012-01-01T00:00:29+0000] debug    : State_reboot:690: State_reboot() returning false
[2012-01-01T00:00:29+0000] debug    : State_reboot:690: State_reboot() returning false
[2012-01-01T00:00:29+0000] debug    : State_reboot:690: State_reboot() returning false
[2012-01-01T00:00:29+0000] debug    : State_reboot:690: State_reboot() returning false
[2012-01-01T00:00:29+0000] debug    : State_reboot:690: State_reboot() returning false
[2012-01-01T00:00:29+0000] debug    : State_reboot:690: State_reboot() returning false
[2012-01-01T00:00:29+0000] debug    : State_reboot:690: State_reboot() returning false
[2012-01-01T00:00:29+0000] debug    : State_reboot:690: State_reboot() returning false
[2012-01-01T00:00:29+0000] debug    : State_reboot:690: State_reboot() returning false
[2012-01-01T00:00:29+0000] debug    : State_reboot:690: State_reboot() returning false
[2012-01-01T00:00:29+0000] debug    : State_reboot:690: State_reboot() returning false
[2012-01-01T00:00:29+0000] debug    : State_reboot:690: State_reboot() returning false
[2012-01-01T00:00:29+0000] debug    : State_reboot:690: State_reboot() returning false
[2012-01-01T00:00:29+0000] debug    : State_reboot:690: State_reboot() returning false
[2012-01-01T00:00:29+0000] debug    : State_reboot:690: State_reboot() returning false
[2012-01-01T00:00:29+0000] debug    : _updateMonitor:250: 'foo' setting monitor state to NOT
[2012-01-01T00:00:29+0000] debug    : State_reboot:690: State_reboot() returning false
[2012-01-01T00:00:29+0000] debug    : State_reboot:690: State_reboot() returning false
[2012-01-01T00:00:29+0000] debug    : State_reboot:690: State_reboot() returning false
[2012-01-01T00:00:29+0000] debug    : do_default:600: startdelay: 0
[2012-01-01T00:00:29+0000] debug    : Starting Monit HTTP server at [*]:2812
[2012-01-01T00:00:29+0000] debug    : Monit HTTP server started
[2012-01-01T00:00:29+0000] info     : 'localhost' Monit 5.32.0 started
[2012-01-01T00:00:29+0000] debug    : Processing postponed events queue
[2012-01-01T00:00:29+0000] debug    : 'ubihealthd_bin' file exists
[2012-01-01T00:00:29+0000] debug    : 'ubihealthd_bin' is a regular file
[2012-01-01T00:00:29+0000] debug    : 'ubihealthd_bin' checksum 8d981c3dda989998b1ba9be0422ac30f
[2012-01-01T00:00:29+0000] debug    : 'ubihealthd_bin' checksum is valid
[2012-01-01T00:00:29+0000] debug    : 'ubihealthd_bin' permission test succeeded [current permission 0755]
[2012-01-01T00:00:29+0000] debug    : 'ubihealthd_bin' uid test succeeded [current uid = 0]
[2012-01-01T00:00:29+0000] debug    : 'ubihealthd_bin' gid test succeeded [current gid = 0]

Problem from my side: I have no good idea how to solve this in monit. Maybe someone else has? This is why I dropped all this information here (and because it took me some time to track it down). I can happily test any proposed patch for a solution however.

Our workaround is to put the state file to /run into a tmpfs now, so it is not persistent anymore.

Feel free to ask if the text above is not understandable enough or lacks details.

Comments (6)

  1. Alexander Dahl reporter

    are you interresting in doing some testing based on Monit 5.31.0

    Not sure if you addressed me here, but in general I can test things like patches or build from source.

  2. Lutz Mader

    Hello Alexander,
    suggested by issue #1089, Monit 5.31.0 should handle “onreboot” proper at a reboot without a RTC. I tried to do this also with no problems, but I have no system without a RTC.

    In short, I don't think there is a certain way to find out, if a Linux system has been booted/restarted.
    At least there won't be a general solution that always works, even on MacOS or AIX systems.
    The comparison with a saved restart/boot time only works reliably under normal conditions.

    Lutz

  3. Log in to comment