ODB should be saved to disk periodically.

Issue #367 resolved
dd1 created an issue

See discussion on midas forum: https://daq00.triumf.ca/elog-midas/Midas/2539

Current saving of ODB on each db_close_database() slows down scripts that use odbedit (if the ODB is big and the disk is slow). It also stops all other ODB users for the time required to finish writing the ODB to disk (open/atomic write/close).

The originally intended scheme of saving ODB after shutdown of the last client results in never saving ODB to disk for most long-running experiments, so ODB contents and changes get lost on a crash, power loss or system reboot.

Ideally we should save ODB periodically (every 1 minute?) and do the writing to disk without holding the ODB lock. I.e. lock ODB, copy the ODB shared memory to a local memory buffer, unlock ODB, write the local memory buffer to disk.
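The lock/copy/unlock/write sequence above could look something like this minimal sketch (not MIDAS code: a pthread mutex stands in for the ODB semaphore and a plain byte array for the ODB shared memory):

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins: a mutex in place of the MIDAS ODB semaphore,
   and a plain byte array in place of the ODB shared memory. */
static pthread_mutex_t odb_lock = PTHREAD_MUTEX_INITIALIZER;
static char odb_shm[1024] = "odb contents";

/* Lock ODB, copy shared memory to a local buffer, unlock, then do the
   slow disk write without holding the lock. Returns 0 on success. */
int flush_odb(const char *path)
{
   char *copy = malloc(sizeof(odb_shm));
   if (!copy)
      return -1;              /* caller may fall back to a locked write */

   pthread_mutex_lock(&odb_lock);
   memcpy(copy, odb_shm, sizeof(odb_shm));
   pthread_mutex_unlock(&odb_lock);   /* other ODB clients proceed now */

   FILE *f = fopen(path, "wb");
   if (!f) {
      free(copy);
      return -1;
   }
   fwrite(copy, 1, sizeof(odb_shm), f);
   fclose(f);
   free(copy);
   return 0;
}
```

The point of the pattern is that the only work done under the lock is a memcpy(), which is fast; the slow open/write/close happens on the private copy.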

Challenges: two or more programs may try to do this at the same time (additional semaphore?), and memory allocation for the ODB copy buffer may fail (if the ODB is big and memory is low, i.e. a 100 Mbyte ODB on a 512 Mbyte RAM machine, typical for ARM SoCs, but even a 2 GB ODB on a 64 GB machine if 60 GBytes happen to be used by a simulation, a reconstruction job or just a memory leak). If writing ODB is done in a thread, it introduces multithreading into otherwise single-threaded programs, like odbedit, mserver and mlogger.

Perhaps instead of a thread one could use fork()/vfork(), but I am not sure how this would work on macos.
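For illustration, a fork()-based flush might look like the following sketch (hypothetical code, not from MIDAS). Two caveats worth noting: real SysV/POSIX shared memory is mapped shared and is NOT snapshotted by fork(), so the ODB copy would still have to be taken under the lock first; and vfork() is unsuitable here because the child touches memory before exiting.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical: a private buffer holding the ODB image, copied under
   the ODB lock before flush_in_child() is called. */
static char odb_copy[256] = "odb image";

/* Fork a child that performs the slow disk write; the parent can
   continue working. Returns the child's exit status (0 on success). */
int flush_in_child(const char *path)
{
   pid_t pid = fork();
   if (pid < 0)
      return -1;
   if (pid == 0) {                    /* child: do the disk write */
      FILE *f = fopen(path, "wb");
      if (f) {
         fwrite(odb_copy, 1, sizeof(odb_copy), f);
         fclose(f);
         _exit(0);
      }
      _exit(1);
   }
   /* parent: reap the child to avoid zombies (synchronously here for
      simplicity; a real implementation would handle SIGCHLD instead
      so the parent is not blocked during the write) */
   int status = 0;
   waitpid(pid, &status, 0);
   return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```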

K.O.

Comments (8)

  1. Stefan Ritt

    I’m tempted to put the flushing at the EOR of the logger. This way we have it in one well defined location, and we know when it happens (after the end of each run). This would solve the race condition between processes. Problem: If we don’t start/stop runs, no flush will happen. Concerning the memory allocation, I think it should be: first copy, then release the lock, then write. If the memory allocation fails, we write directly without the copy. Takes longer, but it is the best we can do (and the same as we have now).

    Stefan

  2. dd1 reporter

    concur on flush at EOR (and BOR?). concur on write-without-copy if memory allocation fails. also add flush on exit from odbedit (with a --no-flush-odb flag), a periodic flush in mhttpd and we have our bases covered. experiments that do not start/stop runs, do not run mhttpd and do not use odbedit form an empty set. manual changes to odb are done by odbedit (saved-to-disk on exit, as expected) and by mhttpd web interface (saved-to-disk periodically). K.O.

  3. Stefan Ritt

    Ok I implemented some periodic flushing. Here is what I did:

    • Created

      /System/Flush/Flush period : TID_UINT32
      /System/Flush/Last flush   : TID_UINT32

    which control the flushing to disk. The default value for “Flush period” is 60 seconds.

    • All clients call db_flush_database() through their cm_yield() function
    • db_flush_database() checks the “Last flush” and only flushes the ODB when the period has expired. This test is done inside the ODB semaphore so that we don’t get a race condition
    • If the period has expired, db_flush_database() calls ss_shm_flush()
    • ss_shm_flush() tries to allocate a buffer the size of the shared memory. If the allocation is not successful (out of memory), ss_shm_flush() writes directly to the binary file as before.
    • If the allocation is successful, ss_shm_flush() copies the shared memory to a buffer and passes this buffer to a dedicated thread which writes the buffer to the binary file. This causes ss_shm_flush() to return immediately and not block the calling program during the disk write operation.
    • Added back the “if (destroy_flag) ss_shm_flush()” so that the ODB is flushed for sure before the shared memory gets deleted.
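The period check and thread hand-off described above can be sketched roughly as follows (hypothetical code: flush_period and last_flush stand in for the "/System/Flush" ODB keys, a mutex for the ODB semaphore; these are not the real MIDAS internals):

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static pthread_mutex_t odb_lock = PTHREAD_MUTEX_INITIALIZER;
static uint32_t flush_period = 60;     /* "Flush period", seconds */
static uint32_t last_flush   = 0;      /* "Last flush", unix time */
static char odb_shm[256] = "odb image";

/* dedicated writer thread: runs after the caller has already returned */
static void *writer_thread(void *arg)
{
   char *buf = (char *)arg;
   FILE *f = fopen("odb_image.bin", "wb");
   if (f) {
      fwrite(buf, 1, sizeof(odb_shm), f);
      fclose(f);
   }
   free(buf);
   return NULL;
}

/* returns 1 if a flush was started, 0 if the period has not expired */
int flush_if_due(uint32_t now)
{
   pthread_mutex_lock(&odb_lock);      /* period test inside the lock */
   if (now - last_flush < flush_period) {
      pthread_mutex_unlock(&odb_lock);
      return 0;
   }
   last_flush = now;
   char *copy = malloc(sizeof(odb_shm));
   if (copy)
      memcpy(copy, odb_shm, sizeof(odb_shm));
   pthread_mutex_unlock(&odb_lock);

   if (!copy) {                        /* out of memory: write directly */
      FILE *f = fopen("odb_image.bin", "wb");
      if (f) {
         fwrite(odb_shm, 1, sizeof(odb_shm), f);
         fclose(f);
      }
      return 1;
   }
   /* hand the copy to a writer thread and return immediately */
   pthread_t t;
   pthread_create(&t, NULL, writer_thread, copy);
   pthread_detach(t);                  /* detached here for brevity */
   return 1;
}
```

Because "Last flush" is updated inside the lock before the lock is released, only one client starts a flush per period even if several call flush_if_due() at the same time.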

    This means that now, under normal circumstances, exiting programs like odbedit do NOT flush the ODB. This allows calling many “odbedit -c” commands in a row without the flush penalty. Nevertheless, the ODB then gets flushed by other clients at most 60 seconds (or whatever the flush period is) after odbedit exits.

    Please note that ODB flushing has two purposes:

    1. When all programs exit, we need a persistent storage for the ODB. In most experiments this happens only rarely, maybe at the end of a beam time period.
    2. If the computer crashes, a recent version of the ODB is kept on disk to simplify recovery after the crash.

    Since crashes are not so frequent (during production periods we have maybe one hardware failure every few years), flushing the ODB too often does not make sense and just consumes resources. Flushing also does not help against corrupted ODBs, since the binary image will get corrupted as well. So the only reason for periodic flushes is to ease recovery after a total crash. I put the default to 60 seconds, but if people are really paranoid they can decrease it to 10 seconds or so. Or increase it to 600 seconds if their system does not crash every week and disks are slow.

    I made a dedicated branch feature/periodic_odb_flush so people can test the new functionality. If there are no complaints within the next few days, I will merge that into develop.

    Stefan

  4. Stefan Ritt

    After quite some testing of the periodic flushing of the ODB shared memory, I merged this feature branch into develop today.

  5. dd1 reporter

    there was one bug - a race condition between the flush thread and the program (i.e. odbedit) exiting. when a program exits without waiting for/reaping/joining all its threads, they are silently killed. best I can tell, normally the flush thread will be killed while it is inside the main write(), so the odb contents do get written to disk, but this is not guaranteed. the correct way is to wait for the thread to finish before exiting the program. two small buglets - a thread leak (no join/reap) and a missing write lock when updating the “last flushed” timestamp in ODB. K.O.
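The join-before-exit fix described here can be sketched as follows (hypothetical names, not the actual MIDAS patch): keep the writer thread handle instead of detaching it, and join it from the shutdown path so an in-flight flush completes and the thread is reaped.

```c
#include <pthread.h>

static pthread_t flush_thread;
static int flush_thread_active = 0;

/* placeholder for the real disk-writing thread body */
static void *writer(void *arg)
{
   (void)arg;
   /* ... write the ODB buffer to disk ... */
   return NULL;
}

void start_flush(void)
{
   if (flush_thread_active)
      pthread_join(flush_thread, NULL);   /* reap the previous flush:
                                             fixes the thread leak */
   pthread_create(&flush_thread, NULL, writer, NULL);
   flush_thread_active = 1;
}

/* call from the program's shutdown path, before exit();
   returns 0 on success (pthread_join convention) */
int join_flush_thread(void)
{
   if (!flush_thread_active)
      return 0;
   flush_thread_active = 0;
   return pthread_join(flush_thread, NULL);
}
```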

  6. Stefan Ritt

    From your message it’s not clear if you really fixed the race condition. I see some code of yours there. Do you consider this fixed now? If so, can you close this issue?

    Thanks,
    Stefan
