msequencer aborting after a couple hours

Issue #263 wontfix
Thomas Lindner created an issue

I have been using msequencer for the last couple of days to run long sequences. The msequencer always fails after ~3-6 hours.

The error from the terminal when msequencer fails is:

[Sequencer,INFO] Run #333 started
[Sequencer,INFO] Run #333 stopped
[Sequencer,INFO] Run #334 started
[Sequencer,ERROR] [midas.cxx:5535:bm_validate_client_index,ERROR] My client index 2 in buffer 'SYSMSG' is invalid: client name '', pid 0 should be my pid 1137
[Sequencer,ERROR] [midas.cxx:5538:bm_validate_client_index,ERROR] Maybe this client was removed by a timeout. See midas.log. Cannot continue, aborting...
Aborted

These are the reports in the midas messages:

00:23:45.840 2020/08/05 [febrb,ERROR] [midas.cxx:14444:rpc_server_receive,ERROR] recv_net_command() returned 0, abort
00:23:45.837 2020/08/05 [febrb,ERROR] [midas.cxx:12824:recv_net_command,ERROR] rpc connection from 'Sequencer' on 'mpmt-daq.triumf.ca' unexpectedly closed
00:23:45.551 2020/08/05 [mserver,INFO] Client 'Sequencer' (PID 1137) on database 'ODB' removed by db_cleanup called by cm_periodic_tasks (idle 10.1s,TO 10s)
00:23:45.548 2020/08/05 [mserver,INFO] Client 'Sequencer' on buffer 'SYSMSG' removed by cm_periodic_tasks (idle 10.1s, timeout 10s)
00:23:45.407 2020/08/05 [jsroot_server,TALK] Program 'Sequencer' on host 'mpmt-daq' aborted
00:23:45.404 2020/08/05 [jsroot_server,ERROR] [midas.cxx:14444:rpc_server_receive,ERROR] recv_net_command() returned 0, abort
00:23:45.401 2020/08/05 [jsroot_server,ERROR] [midas.cxx:12824:recv_net_command,ERROR] rpc connection from 'Sequencer' on 'mpmt-daq.triumf.ca' unexpectedly closed
00:23:45.364 2020/08/05 [feudp,TALK] Program 'Sequencer' on host 'mpmt-daq' aborted
00:23:45.363 2020/08/05 [Logger,TALK] Program 'Sequencer' on host 'mpmt-daq' aborted
00:23:45.362 2020/08/05 [feudp,ERROR] [midas.cxx:14444:rpc_server_receive,ERROR] recv_net_command() returned 0, abort
00:23:45.361 2020/08/05 [Logger,ERROR] [midas.cxx:14444:rpc_server_receive,ERROR] recv_net_command() returned 0, abort
00:23:45.360 2020/08/05 [feudp,ERROR] [midas.cxx:12824:recv_net_command,ERROR] rpc connection from 'Sequencer' on 'mpmt-daq.triumf.ca' unexpectedly closed
00:23:45.358 2020/08/05 [Logger,ERROR] [midas.cxx:12824:recv_net_command,ERROR] rpc connection from 'Sequencer' on 'mpmt-daq.triumf.ca' unexpectedly closed

I attach the sequencer script, in case that matters. The failures usually seem to happen when the sequencer is doing a long wait (~2400s) while accumulating statistics.

Any ideas where I should look for debugging this further?

Comments (9)

  1. dd1

    Looks like a timeout on the SYSMSG buffer. You can monitor the timeouts on the “buffers” page; normally the “age” should never go beyond 1 second. You can also monitor the timeouts using “mdump -s -d”.

    Also, we should confirm that the “wait” command does update the timeouts. A plain sleep(100) is a no-go; instead, loop 100 times doing { cm_periodic_tasks(); sleep(1); }. The actual implementation should also keep track of the wall-clock time.
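
    Something along these lines, for example (just an untested sketch: the helper name do_long_wait() and the 1-second step are made up for illustration; cm_periodic_tasks(), ss_sleep() and ss_time() are the real MIDAS calls, assuming midas.h is included):

    /* sketch: keep this client's timeouts updated during a long wait,
       tracking wall-clock time instead of counting sleeps */
    void do_long_wait(int wait_seconds)
    {
       DWORD start = ss_time();               /* wall-clock start, in seconds */
       while ((int) (ss_time() - start) < wait_seconds) {
          cm_periodic_tasks();                /* refreshes this client's timeouts */
          ss_sleep(1000);                     /* sleep 1 second per pass */
       }
    }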

    K.O.

  2. Stefan Ritt

    Of course the sequencer does not do sleep(2400). For the wait loop, it sets

    /Sequencer/State/Wait limit = 2400

    Then on each call it sets

    /Sequencer/State/Wait value = current time - /Sequencer/State/Start time

    Once the value exceeds the limit, it advances.

    This evaluation itself takes only a few µs; then it does a sleep of 100 ms, so the condition above is evaluated ten times per second.
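
    In code, one evaluation pass boils down to roughly this (only a sketch of the logic just described, not the actual msequencer source; wait_done() is an illustrative name):

    /* sketch: one pass of the wait evaluation described above */
    bool wait_done(double start_time, double wait_limit)
    {
       double wait_value = ss_time() - start_time;   /* -> /Sequencer/State/Wait value */
       if (wait_value >= wait_limit)                 /* limit reached -> advance       */
          return true;
       ss_sleep(100);                                /* otherwise sleep 100 ms         */
       return false;
    }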

    The messages you show above indicate that cm_yield() is not called for 10 seconds. The main loop looks like this:

    /* main loop */
    do {
       sequencer();
       status = cm_yield(0);
    
       ch = 0;
       while (ss_kbhit()) {
          ch = ss_getchar(0);
          if (ch == -1)
             ch = getchar();
    
          if ((char) ch == '!')
             break;
       }
    
    } while (status != RPC_SHUTDOWN && ch != '!');
    

    So it’s hard not to call cm_yield(). The only possibility would be an endless loop inside sequencer(), but as I wrote above, it exits after 100 ms.

    So I’m a bit at a loss here right now.

  3. dd1

    I agree. The above code example looks correct. The next thing to try is to watch the actual timeout as it happens. K.O.

  4. Thomas Lindner reporter

    I checked the age of the timeout for msequencer (using the buffers webpage). The age is always below 1 second (as KO said it should be). The age for the sequencer is similar to that for the Logger, mhttpd and my frontends; there is no evidence that the sequencer age is consistently higher than for the other programs.

    So I guess the problem must be fairly rare. My home directory (including midas.log) is on a network-mounted drive, so it is possible that some rare network glitch is causing the sequencer to get stuck for a while and then time out. But it is odd that other programs (mhttpd, mserver) also have 10 s timeouts and yet don’t experience this problem.

    I increased the timeout for msequencer to 60 s; this ‘solved’ the problem in the sense that I could complete my 12-hour sequence without any errors. But it is still confusing why only msequencer is affected.
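
    (For reference, a MIDAS client can raise its own watchdog timeout with cm_set_watchdog_params(); the snippet below is just a sketch of one way to get a 60 s timeout, not necessarily the exact change made here:)

    /* sketch: raise this client's watchdog timeout to 60 s, so a short stall
       no longer gets it removed by cm_periodic_tasks() */
    cm_set_watchdog_params(TRUE, 60000);   /* call watchdog, timeout in milliseconds */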

  5. dd1

    Yes, I have seen such timeouts when the home directory is NFS-mounted and the network path is not all-gige, i.e. there is a 100-mbit switch in the middle or one of the gige links dropped from gige to 100mbit (this usually happens between two switches). This is why I monitor network connection speeds here (OK and OK-1000 are 1gige connections): http://ladd00.triumf.ca/~olchansk/pingmon/triumf.html

    K.O.

  6. dd1

    What happens is that all Linux programs are “demand paged”, which means the code (“T” symbols) and pre-initialized data (“B” and “D”) normally reside on disk and are only fetched into RAM as needed. So a running program that does not read or write any files will still generate disk accesses to fetch code pages that are not already in memory (e.g. a rarely called subroutine), and it can stall if there is a disk delay (e.g. an NFS timeout from a lost packet at the 1gige->100mbit network speed bump). K.O.
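
    (One possible way around such demand-paging stalls, if one really wanted to avoid them, would be to lock the program’s pages into RAM. The sketch below uses the standard Linux mlockall() call; it is only a suggestion, not something MIDAS does by default:)

    /* sketch: lock all current and future pages of this process into RAM,
       so code pages cannot be evicted and re-fetched over a slow NFS mount */
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
       if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
          perror("mlockall");   /* usually needs a raised memlock limit or CAP_IPC_LOCK */
       /* ... rest of the program ... */
       return 0;
    }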
