Infinite recursion when /History/Links/xxx points to a sub-directory

Issue #325 resolved
Ben Smith created an issue

If you create a link in /History/Links that points to a sub-directory, then change one of the keys in the target directory, you get infinite recursion and the history file growing at > 1MB/s.

Steps to reproduce:

  • odbedit -c "create DOUBLE /ExampleInf/Data"
  • odbedit -c "ln /ExampleInf /History/Links/ExampleInf"
  • Restart mlogger
  • odbedit -c "set /ExampleInf/Data 3"
  • Watch mlogger CPU usage become 100% and the history grow very quickly

Note that creating a regular subdirectory with links in it does NOT exhibit the same behaviour:

  • odbedit -c "create KEY /History/Links/ExampleInf"
  • odbedit -c "ln /ExampleInf/Data /History/Links/ExampleInf/Data"

To me, it looks like the culprit is the call to db_notify_clients() in log_system_history(), as log_system_history() should already be being called due to a notification…

But the extra call has been there a long time, so there is presumably a reason for it. Can KO or Stefan comment on the history of this?

Comments (7)

  1. dd1

    can you capture and post a stack trace of this infinite recursion? (gdb mlogger, ctrl-c when it gets into it, bt). K.O.

  2. Ben Smith reporter

    I stopped, backtraced and continued a few times. Here’s some example traces.

    To me it looks like mlogger keeps sending itself the same message that the value has changed, as it does manage to escape the log_history_system() function (see 3rd trace), but then jumps straight back in…

    I also note that the “last modified” time of the ODB key in question does NOT keep incrementing - it shows only the time when I manually changed the value.

    * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
      * frame #0: 0x00007fff6e6c737e libsystem_kernel.dylib`__sendto + 10
        frame #1: 0x00000001000b46fa mlogger`ss_resume(port=55197, message="O 1 209296 209296 -1") at system.cxx:4658:17
        frame #2: 0x00000001000791e1 mlogger`db_notify_clients_locked(pheader=0x0000000104800000, hDB=1, hKeyMod=209296, index=-1, bWalk=NO, msg=0x00007ffeefbfe6b0) at odb.cxx:12368:19
        frame #3: 0x000000010008a638 mlogger`db_notify_clients(hDB=1, hKeyMod=209296, index=-1, bWalk=NO) at odb.cxx:12410:7
        frame #4: 0x0000000100010551 mlogger`log_system_history(hDB=1, hKey=0, info=0x0000000000000001) at mlogger.cxx:5389:7
        frame #5: 0x000000010008b860 mlogger`db_update_record_local(hDB=1, hKeyRoot=209296, hKey=209296, index=-1) at odb.cxx:13369:16
        frame #6: 0x000000010002fbda mlogger`cm_dispatch_ipc(message="O 1 209296 209296 -1", message_size=21, client_socket=0) at midas.cxx:5438:17
        frame #7: 0x00000001000b45dd mlogger`ss_suspend_process_ipc(millisec=100, msg=0, ipc_recv_socket=3) at system.cxx:4275:4
        frame #8: 0x00000001000b3c9f mlogger`ss_suspend(millisec=100, msg=0) at system.cxx:4606:19
        frame #9: 0x0000000100030b44 mlogger`cm_yield(millisec=100) at midas.cxx:5719:16
        frame #10: 0x0000000100013c63 mlogger`main(argc=1, argv=0x00007ffeefbff060) at mlogger.cxx:6341:13
        frame #11: 0x00007fff6e582cc9 libdyld.dylib`start + 1
    

    * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
      * frame #0: 0x00007fff6e6c6be6 libsystem_kernel.dylib`write + 10
        frame #1: 0x00000001000da677 mlogger`xwrite(fn="/Users/ben/DAQ/test_mpet/220331.hst", fh=11, buf=0x0000000101404620, count=8) at history.cxx:65:17
        frame #2: 0x00000001000dc38b mlogger`hs_write_event(event_id=101, data=0x0000000101404620, size=8) at history.cxx:810:9
        frame #3: 0x00000001000d1bfb mlogger`MidasHistory::hs_write_event(this=0x0000000101005e10, event_name="ExampleInf", timestamp=1648773013, data_size=8, data="") at history.cxx:2754:14
        frame #4: 0x0000000100010505 mlogger`log_system_history(hDB=1, hKey=0, info=0x0000000000000001) at mlogger.cxx:5385:14
        frame #5: 0x000000010008b860 mlogger`db_update_record_local(hDB=1, hKeyRoot=209296, hKey=209296, index=-1) at odb.cxx:13369:16
        frame #6: 0x000000010002fbda mlogger`cm_dispatch_ipc(message="O 1 209296 209296 -1", message_size=21, client_socket=0) at midas.cxx:5438:17
        frame #7: 0x00000001000b45dd mlogger`ss_suspend_process_ipc(millisec=100, msg=0, ipc_recv_socket=3) at system.cxx:4275:4
        frame #8: 0x00000001000b3c9f mlogger`ss_suspend(millisec=100, msg=0) at system.cxx:4606:19
        frame #9: 0x0000000100030b44 mlogger`cm_yield(millisec=100) at midas.cxx:5719:16
        frame #10: 0x0000000100013c63 mlogger`main(argc=1, argv=0x00007ffeefbff060) at mlogger.cxx:6341:13
        frame #11: 0x00007fff6e582cc9 libdyld.dylib`start + 1
    

    * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
      * frame #0: 0x00007fff6e6c7396 libsystem_kernel.dylib`__recvfrom + 10
        frame #1: 0x00000001000b407d mlogger`ss_suspend_process_ipc(millisec=100, msg=0, ipc_recv_socket=3) at system.cxx:4192:19
        frame #2: 0x00000001000b3c9f mlogger`ss_suspend(millisec=100, msg=0) at system.cxx:4606:19
        frame #3: 0x0000000100030b44 mlogger`cm_yield(millisec=100) at midas.cxx:5719:16
        frame #4: 0x0000000100013c63 mlogger`main(argc=1, argv=0x00007ffeefbff060) at mlogger.cxx:6341:13
        frame #5: 0x00007fff6e582cc9 libdyld.dylib`start + 1
    

  3. Stefan Ritt

    I confirm the problem.

    First let me explain the reason to reduce the min_period for ODB linked data to zero: Several experiments use run statistics, like /Equipment/Trigger/Statistics/... At the end of the run, all of them are updated. Previously, only after the first value the history was written to disk. When the other values were changed immediately afterwards, the original 10 seconds period was not over and none of the values was written to the history file except the first one. This led to inaccurate statistics (like the event per sec. was kept to the last nonzero value from the last run forever, and people were confused looking at the history and saw nonzero events per sec. even if the run was stopped. Therefore I set the period to zero to effectively “remove” the period check, so linked events are immediately written when any of the values change.

    I haven’t seen that problem before since we do not use directories under /History/Links, only simple links. So sorry for causing this bad side effect.

    Now the problem indeed is the db_notify_clients() leading to the infinite loop. I tried hard to recover histories from 10-20 years ago, but was not really successful in tracing the origins. Might be that before we had ->hs_write_event(), we did writing events on a different routine. To trigger that routine, I had to add db_notify_clients(), which caused a callback then effectively writing the event to history. Now as ->hs_write_event() is directly put into log_system_history, we do not need the db_notify_clients() any more, and it is much cleaner to call ->hs_write_event() directly and not through yet another hotlink. So I removed the db_notify_clients() now, and things still look ok for me here at PSI. Please do the same at your experiments and report back. Once everyone is happy we can close this issue.

  4. Stefan Ritt

    On a separate note: With the previous bug, tons of data were written to the history system, basically with always the same value. This might lead to long response times if you try to display your history. So better check your history data files and remove any excess data. Unfortunately we don’t have any tool to remove double history data or recover inconsistent history data, so all you can do is to delete whole files, which then might however also discard some good data. So please be careful.

  5. Stefan Ritt

    Removed unnecessary db_notify_clients() from log_system_history and restored removal of time check for linked history variables. See issue #325 for details.

    → <<cset 439a693e6ad7>>

  6. Log in to comment