Monit not starting applications in manual mode after a forceful reboot

Issue #101 resolved
Jolf created an issue

To replicate:

  1. Start a Linux VM in VirtualBox.
  2. Install and run monit, configured with one service in manual mode.
  3. Run: monit start [service]
  4. Power off the machine from the VirtualBox GUI.
  5. Once monit is started after boot, it will not attempt to bring up the services that were previously running.
  • After a normal reboot it does attempt to bring up the services again.
  • Tested on monit 5.7, 5.8.1 and 5.9, with the same behaviour observed.
  • I was using CentOS 6.5 for this, but I'm assuming other distros will behave the same.
  • Same issue on physical machines, but it is easier to test in VirtualBox.

Comments (8)

  1. Tildeslash repo owner

    The service monitoring state is stored to the statefile at the end of each cycle. When you enable monitoring of a manual-mode service and monit is then stopped (regardless of whether gracefully or by SIGKILL), the monitoring will be enabled again when you reboot the machine.

    The problem arises if you store the statefile on a volatile filesystem (the path can be changed using the "set statefile" statement). By default the statefile is stored as ".monit.state" in the home directory of the user monit is running as; if you store it on a temporary filesystem, it will be lost on reboot.

    I tried to reproduce the problem with the following config file:

    set daemon 5
    set httpd port 2812 allow monit:monit
    
    check filesystem rootfs with path /
            mode manual
    

    Two tests which simulate what will happen after reboot with manual-mode service (when Monit is stopped and started again):

    Graceful monit stop

    1.) Started Monit and verified the service is not monitored:

    $ ./monit -Ic ~/.monitrc_statefile summary
    ...
    Filesystem 'rootfs'                 Not monitored
    

    2.) enabled monitoring:

    $ ./monit -c ~/.monitrc_statefile monitor rootfs
    

    3.) verified it is monitored:

    $ ./monit -Ic ~/.monitrc_statefile summary
    ...
    Filesystem 'rootfs'                 Accessible
    

    4.) stopped monit gracefully:

    $ ./monit -Ic ~/.monitrc_statefile quit  
    Monit daemon with pid [5149] killed
    

    5.) started monit and verified the manual-mode service is monitored after monit restart:

    $ ./monit -Ic ~/.monitrc_statefile summary
    ...
    Filesystem 'rootfs'                 Accessible
    

    Abnormal stop

    1.) killed monit:

    $ pkill -9 monit
    

    2.) removed the statefile before starting it again, to start with a clean slate:

    rm -f ~/.monit.state
    

    3.) started monit and verified the service is not monitored:

    $ ./monit -Ic ~/.monitrc_statefile summary
    ...
    Filesystem 'rootfs'                 Not monitored
    

    4.) enabled monitoring:

    $ ./monit -c ~/.monitrc_statefile monitor rootfs
    

    5.) verified it is monitored:

    $ ./monit -Ic ~/.monitrc_statefile summary
    ...
    Filesystem 'rootfs'                 Accessible
    

    6.) killed monit:

    $ pkill -9 monit
    

    7.) started monit again and verified the service is still monitored after monit restart:

    $ ./monit -Ic ~/.monitrc_statefile summary
    ...
    Filesystem 'rootfs'                 Accessible
    

    Please check your monit log (enabled with the "set logfile" statement). If you are able to reproduce the problem, please provide a short configuration with which we can reproduce it and a list of steps: when you start monit, when you enable monitoring of the given service, the service status in the meantime, and when you reboot the system. Please also give the location of the statefile and make sure it is on persistent storage.

  2. Jolf reporter

    The issue is not caused by an unclean shutdown of the service but rather an unclean shutdown of the operating system. It is replicable using your config above in a VM environment (I have not been able to test on physical machines).

    set daemon 5
    set httpd port 2812 allow monit:monit
    
    check filesystem rootfs with path /
            mode manual
    
    $ monit monitor rootfs
    $ monit summary
    The Monit daemon 5.9 uptime: 3m 
    Filesystem 'rootfs'                 Accessible
    System 'c65.vagrant'                Running
    

    Shut down the VirtualBox VM via right-click -> Close -> Power Off. It is important to do an unclean shutdown; it works as expected after a normal "shutdown -r now" reboot. After starting the OS again the item is not monitored.

    $ monit summary
    The Monit daemon 5.9 uptime: 0m 
    Filesystem 'rootfs'                 Not monitored
    System 'c65.vagrant'                Running
    

    If I disable autostart of monit, I also see that the statefile is empty (0 bytes) after an unclean shutdown of the OS, until I start the monit service, at which point it grows to 584 bytes.
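
    The empty-statefile observation above can be checked programmatically. A minimal sketch, assuming a hypothetical helper named statefile_is_empty (it is not part of Monit) that only inspects the file size reported by stat():

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/* Hypothetical helper, not part of Monit.
 * Returns 1 if `path` exists with size 0, 0 if it has content,
 * and -1 if it cannot be inspected (for example, it does not exist). */
int statefile_is_empty(const char *path) {
    assert(path != NULL);

    struct stat st;
    if (stat(path, &st) != 0)
        return -1;

    return st.st_size == 0 ? 1 : 0;
}
```

    Calling statefile_is_empty("/root/.monit.state") before starting monit would distinguish a truncated statefile from a missing one.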

  3. Jolf reporter

    Pasting logs from monit running in -vv but I realize they are not very helpful:

    ## After unclean shutdown
    
    $ ls -l /root/.monit.state
    -rw-------. 1 root root 0 Oct 13 11:38 /root/.monit.state
    
    $ service monit start
    Control file syntax OK
    Starting monit: OK
    
    $ ls -l /root/.monit.state
    -rw-------. 1 root root 584 Oct 13 11:40 /root/.monit.state
    
    [UTC Oct 13 11:40:17] debug    : pidfile '/var/run/monit.pid' does not exist
    [UTC Oct 13 11:40:17] info     : Starting Monit 5.9 daemon with http interface at [*:2812]
    [UTC Oct 13 11:40:17] info     : Starting Monit HTTP server at [*:2812]
    [UTC Oct 13 11:40:17] info     : Monit HTTP server started
    [UTC Oct 13 11:40:17] info     : 'c65.vagrant' Monit started
    
  4. Jolf reporter

    Seems like a simple fix in theory; with this patch the issue does not occur:

    diff --git a/src/state.c b/src/state.c
    index 38b27e1..e203f87 100644
    --- a/src/state.c
    +++ b/src/state.c
    @@ -229,6 +229,7 @@ void State_save() {
                             if (write(file, &state, sizeof(state)) != sizeof(state))
                                     THROW(IOException, "Unable to write service state");
                     }
    +                sync();
             }
             ELSE
             {
    
  5. Tildeslash repo owner

    Yes, if the system suddenly dies before the modified pages are flushed to disk, the monitoring state won't be saved. It's something of a corner case, but a real one. Using sync() would sync every filesystem, which is overkill; using fsync() on the statefile only has lower overhead. As State_save() is called at the end of the monitoring cycle (whose length varies with how long the tests take), the state should also be updated immediately when the monitoring state is changed.
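
    To illustrate the difference between sync() and fsync() described above, here is a minimal sketch. It is not the actual Monit change: the function name save_state_durably and its signature are assumptions made for this example. It writes a buffer and then flushes only that one file descriptor.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not Monit's code: write `len` bytes to `path`
 * and flush only that file to stable storage.
 * Returns 0 on success, -1 on any error. */
int save_state_durably(const char *path, const void *buf, size_t len) {
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* write() may write fewer bytes than requested, so loop. */
    const char *p = buf;
    size_t left = len;
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n <= 0) {
            close(fd);
            return -1;
        }
        p += n;
        left -= (size_t)n;
    }

    /* fsync() flushes this one file; sync() would schedule a flush
     * of every mounted filesystem. */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```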

    Thank you for the data. The problem is fixed in the development branch; you can get a snapshot here: https://bitbucket.org/tildeslash/monit/get/master.tar.gz

    To compile:

    tar -xzf master.tar.gz
    cd tildeslash*
    ./bootstrap
    ./configure
    make
    
  6. alan somers

    IMHO this change is incorrect. fsync is a powerful and expensive tool. It should not be used merely to prevent a few seconds' worth of data loss after a power failure; data loss is expected in that case. Rather, it should be used to prevent an application from misbehaving in the event of power loss. Most often, it's used as a kind of write barrier. For example, a database may use multiple files for its backing store. It may need to ensure that a write to one file is persisted before a write to another, or else the entire database may be corrupted after a power loss. That's when you use fsync.
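
    The write-barrier use of fsync described above could look roughly like the following sketch. The two-file layout, the file roles and the function names (append_text, update_with_barrier) are hypothetical and chosen only for illustration; no real database is being described.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Append `text` to `path`; call fsync() before closing when `durable` is nonzero. */
static int append_text(const char *path, const char *text, int durable) {
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (fd < 0)
        return -1;

    size_t len = strlen(text);
    if (write(fd, text, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    if (durable && fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}

/* Hypothetical two-file update: the journal record must reach stable
 * storage before the data file is touched, so fsync() acts as the
 * barrier between the two writes. Returns 0 on success, -1 on error. */
int update_with_barrier(const char *journal_path, const char *data_path,
                        const char *record) {
    assert(journal_path != NULL && data_path != NULL && record != NULL);

    if (append_text(journal_path, record, 1) != 0)  /* barrier here */
        return -1;
    return append_text(data_path, record, 0);       /* ordinary buffered write */
}
```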

    The problem with fsync is that it's expensive. On a laptop, it forces a spun-down disk to spin up, defeating power savings. On busy servers, it defeats write scheduling, reducing performance. With some filesystems, it causes huge write amplification. In my case I'm using monit on an embedded system and thanks to that fsync, monit is using more disk bandwidth than every other program combined.

    I believe that the fsync part of this commit should be reverted, or at least made conditional on the "mode manual" feature being in use and the state file having changed since it was last saved.

  7. Tildeslash repo owner

    It is critical that the service state is persistent and that monitoring resumes even when the machine crashes, so the fsync() shouldn't be reverted.

    We'll reduce the overhead though: fsync will be called only if the state has changed.
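
    One way to call fsync only when the state has changed is to compare against the last saved copy. The sketch below is an illustration under stated assumptions, not the change that went into Monit: service_state_t is a hypothetical stand-in for the real state record, and save_if_changed is an invented name.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for a per-service state record; not Monit's struct.
 * Callers should zero-initialize it so the memcmp() below is reliable. */
typedef struct {
    char name[32];
    int  monitored;
} service_state_t;

static service_state_t last_saved;   /* copy of the last state written */
static int have_last_saved = 0;      /* 0 until the first successful save */

/* Returns 1 if the state was written and fsynced, 0 if it was unchanged
 * and the write was skipped, -1 on error. */
int save_if_changed(const char *path, const service_state_t *cur) {
    assert(path != NULL && cur != NULL);

    if (have_last_saved && memcmp(&last_saved, cur, sizeof *cur) == 0)
        return 0;                     /* nothing changed: no write, no fsync */

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, cur, sizeof *cur) != (ssize_t)sizeof *cur || fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;

    last_saved = *cur;                /* remember what is now on disk */
    have_last_saved = 1;
    return 1;
}
```

    With this shape, a monitoring cycle that leaves the state untouched costs one memcmp() and no disk I/O.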
