rpc timeout in bm_receive_event()

Issue #346 resolved
dd1 created an issue

in alpha-g we see rpc timeouts from bm_receive_event(). details in elog.

Comments (7)

  1. dd1 reporter

    timeout is in manalyzer running remotely. manalyzer calls tmfe TMBuffer::ReceiveEvent() calls bm_receive_event_vec() calls bm_receive_event_rpc(). with timeout value 100 milliseconds. RPC timeout is much longer, so we should not see any. unless mserver is getting stuck somehow. K.O.

  2. dd1 reporter

    duplicated on agmini@daq16:

    (gdb) bt

    #0  0x00007f19e52ee387 in raise () from /lib64/libc.so.6

    #1  0x00007f19e52efbb8 in abort () from /lib64/libc.so.6

    #2  0x0000000000499f91 in rpc_call (routine_id=routine_id@entry=11111) at /home/agmini/packages/midas/src/midas.cxx:13864

    #3  0x000000000049c053 in bm_receive_event_rpc (buffer_handle=buffer_handle@entry=2, buf=buf@entry=0x0, buf_size=buf_size@entry=0x0, ppevent=ppevent@entry=0x0, 

    pvec=pvec@entry=0x7ffdd19b90d0, timeout_msec=timeout_msec@entry=100) at /home/agmini/packages/midas/src/midas.cxx:10508

    #4  0x00000000004a7c43 in bm_receive_event_vec (buffer_handle=2, pvec=pvec@entry=0x7ffdd19b90d0, timeout_msec=timeout_msec@entry=100) at /home/agmini/packages/midas/src/midas.cxx:10792

    #5  0x0000000000457fde in TMEventBuffer::ReceiveEvent (this=this@entry=0x10f2200, e=e@entry=0x7ffdd19b90d0, timeout_msec=timeout_msec@entry=100)

    at /home/agmini/packages/midas/src/tmfe.cxx:312

    #6  0x0000000000442eb4 in ReceiveEvent (b=b@entry=0x10f2200, e=e@entry=0x7ffdd19b9090, timeout_msec=timeout_msec@entry=100) at /home/agmini/packages/midas/manalyzer/manalyzer.cxx:1411

    #7  0x0000000000446ab5 in ProcessMidasOnlineTmfe (args=std::vector of length 0, capacity 0, progname=<optimized out>, hostname=<optimized out>, exptname=<optimized out>, 

    bufname=<optimized out>, event_id=event_id@entry=-1, trigger_mask=trigger_mask@entry=-1, sampling_type_string=<optimized out>, num_analyze=num_analyze@entry=0, 

    writer=writer@entry=0x0, multithread=multithread@entry=false, profiler=profiler@entry=true, queue_interval_check=queue_interval_check@entry=100)

    at /home/agmini/packages/midas/manalyzer/manalyzer.cxx:1534

    #8  0x00000000004490f2 in manalyzer_main (argc=<optimized out>, argv=<optimized out>) at /home/agmini/packages/midas/manalyzer/manalyzer.cxx:2860

    #9  0x00007f19e52da555 in __libc_start_main () from /lib64/libc.so.6

    #10 0x000000000040f0aa in _start ()

    (gdb)

  3. dd1 reporter

    reproduced on agmini. see true rpc timeout, no rpc response for 10 second. on the mserver, see 90-100 msec delays in bm_receive_event(), but do not see any 10 second timeout. somehow rpc reply from mserver to agana is delayed by 10 seconds? (or lost?). strange. K.O.

  4. dd1 reporter

    Unpack Event 14000

    ss_recv_net_command timeout 10000 msec, elapsed 0.093940 and 0.000007 total 0.093947

    ss_recv_net_command status 1, timeout 10000 msec, elapsed 0.093973 sec

    [ana,ERROR] [system.cxx:5131:ss_recv_net_command,ERROR] timeout receiving network command header

    ss_recv_net_command status 412, timeout 10000 msec, elapsed 10.007886 sec

    [ana,ERROR] [midas.cxx:13870:rpc_call,ERROR] routine "bm_receive_event": timeout waiting for reply, program abort

    Aborted (core dumped)

  5. dd1 reporter

    at first it appeared that this fails inside ss_suspend()

    upd_socket ipc_socket;
    int ss_suspend(...) {
    select(read from ipc_socket)
    if (ipc_socket has data) {
       read(ipc_socket)
       // read sleeps forever because ipc_socket has not data
       // this suggests a linux kernel malfunction, inconsistency between select() and read()
    }
    }
    

    but the actual ss_suspend() code has more:

    upd_socket ipc_socket;
    int ss_suspend(...) {
    select(read from ipc_socket)
    READ_SOME_OTHER_SOCKETS();
    if (ipc_socket has data) {
       read(ipc_socket)
       // read sleeps forever because ipc_socket has not data
       // this suggests a linux kernel malfunction, inconsistency between select() and read()
    }
    }
    

    the READ_SOME_OTHER_SOCKETS() code can call ss_suspend() again recursively, it will select() and read() the ipc_socket() (ipc_socket is now empty), eventually return back to this level, where our “ipc_socket has data” is now out of date (ipc_socket is ow empty). we call read() on the empty socket and go to sleep forever. (this is the mserver, so the attached client program, manalyzer in this case, will eventually report an RPC timeout).

    below is the full stack trace of ss_suspend() that recursed and will later try to read from the empty ipc socket and go to sleep forever.

    (gdb) bt
    #0  0x00007f922c53f9e0 in __nanosleep_nocancel () from /lib64/libc.so.6
    #1  0x00007f922c53f894 in sleep () from /lib64/libc.so.6
    #2  0x0000000000451922 in ss_suspend (millisec=millisec@entry=100, msg=msg@entry=1) at /home/agmini/packages/midas/src/system.cxx:4433
    #3  0x0000000000411d53 in bm_wait_for_more_events_locked (pbuf_guard=..., pc=pc@entry=0x7f920639b93c, timeout_msec=timeout_msec@entry=100, 
        unlock_read_cache=unlock_read_cache@entry=1) at /home/agmini/packages/midas/src/midas.cxx:9429
    #4  0x00000000004238c3 in bm_fill_read_cache_locked (timeout_msec=100, pbuf_guard=...) at /home/agmini/packages/midas/src/midas.cxx:9003
    #5  bm_read_buffer (pbuf=pbuf@entry=0xdf8b50, buffer_handle=buffer_handle@entry=2, bufptr=bufptr@entry=0x0, buf=buf@entry=0x7f9203d75020, 
        buf_size=buf_size@entry=0x7f920639aa20, vecptr=vecptr@entry=0x0, timeout_msec=timeout_msec@entry=100, convert_flags=0, 
        dispatch=dispatch@entry=0) at /home/agmini/packages/midas/src/midas.cxx:10279
    #6  0x0000000000424161 in bm_receive_event (buffer_handle=2, destination=0x7f9203d75020, buf_size=0x7f920639aa20, timeout_msec=100)
        at /home/agmini/packages/midas/src/midas.cxx:10649
    #7  0x0000000000406ae4 in rpc_server_dispatch (index=11111, prpc_param=0x7ffcad70b7a0) at /home/agmini/packages/midas/progs/mserver.cxx:575
    #8  0x000000000041ce9c in rpc_execute (sock=10, buffer=buffer@entry=0xe11570 "g+", convert_flags=0)
        at /home/agmini/packages/midas/src/midas.cxx:15003
    #9  0x000000000041d7a5 in rpc_server_receive_rpc (idx=idx@entry=0, sa=0xde6ba0) at /home/agmini/packages/midas/src/midas.cxx:15958
    #10 0x0000000000451455 in ss_suspend (millisec=millisec@entry=1000, msg=msg@entry=0) at /home/agmini/packages/midas/src/system.cxx:4575
    #11 0x000000000041deb2 in rpc_server_loop () at /home/agmini/packages/midas/src/midas.cxx:15907
    #12 0x0000000000405266 in main (argc=9, argv=<optimized out>) at /home/agmini/packages/midas/progs/mserver.cxx:390
    (gdb) 
    

  6. dd1 reporter

    testing final solution successful: switch the IPC socket to non-blocking-read mode. this is safe: (a) the only place we read it is ss_suspend_process_ipc(), (b) the ipc socket delivers notifications of event buffer and odb changes, midas never has to wait for this. K.O.

  7. Log in to comment