stuck semaphore of SYSTEM buffer

Issue #331 resolved
Former user created an issue

In the ALPHA-2 system at CERN we ran into a strange failure of MIDAS: we start all the midas programs and frontends, see them start and run for a few minutes, then all of them die from ODB watchdog timeout and SYSMSG watchdog timeout. We cannot start runs and cannot take data.

Only when we noticed that mdump does not start and gets stuck waiting for the semaphore of the SYSTEM event buffer did we realize that this semaphore was stuck.

I removed SYSTEM.SHM. This caused the semaphore key to change (it is keyed to the SYSTEM.SHM file inode number via the SysV IPC ftok() magic), a new (non-stuck) semaphore was allocated, and midas started to work again.
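
For illustration only (this is generic SysV usage, not MIDAS source code), the keying works roughly like this: ftok() derives the key from the inode of the backing file, so deleting and recreating the file produces a different key and therefore a fresh, unlocked semaphore. The file name and project id below are placeholders.

    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/sem.h>
    #include <cstdio>

    int main()
    {
       // ftok() keys off the inode of the backing file; "SYSTEM.SHM" and the
       // project id 'M' are placeholders, not the actual MIDAS values
       key_t key = ftok("SYSTEM.SHM", 'M');
       if (key == (key_t) -1) {
          perror("ftok");
          return 1;
       }
       // get-or-create a one-semaphore set for this key
       int semid = semget(key, 1, IPC_CREAT | 0666);
       printf("key=0x%x semid=%d\n", (unsigned) key, semid);
       return 0;
    }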

Very confusing, definitely needs better diagnostics.

As a reminder: if some program locks the semaphore and dies, the semaphore remains locked forever until somebody else unlocks it (or until the universe ends, via a computer reboot).

MIDAS uses a special SysV semaphore feature called SEM_UNDO to tell the Linux kernel to unlock the semaphore automatically if a program dies. This works 99% of the time, but apparently there is a bug in the Linux kernel that causes it to fail rarely (rarely enough that we never implemented a fix, but often enough to cause trouble about once every couple of years).
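
For reference, this is how a SysV semaphore is acquired with SEM_UNDO (a generic sketch, not a MIDAS excerpt): the kernel records an undo entry and is supposed to reverse the operation automatically when the process exits, even after a crash.

    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/sem.h>

    // acquire semaphore 0 of the given set, asking the kernel to undo the
    // operation automatically when this process exits
    bool lock_with_undo(int semid)
    {
       struct sembuf sb;
       sb.sem_num = 0;        // first semaphore in the set
       sb.sem_op  = -1;       // decrement == acquire
       sb.sem_flg = SEM_UNDO; // kernel undoes this on process exit
       return semop(semid, &sb, 1) == 0;
    }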

There is a timeout on the event buffer semaphores, but it is set too long, about 5 minutes, to allow debugging of midas. Event buffers are never meant to be locked for longer than about 1 second, so perhaps there should be a quick timeout, say 5 seconds, at which point we complain "could not lock SYSTEM buffer for 5 seconds!", then the 5-minute timeout, then crash.
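
A minimal sketch of that two-stage behaviour, assuming a non-blocking "try lock" primitive is available (the callable below is a placeholder, not an actual MIDAS function):

    #include <chrono>
    #include <cstdio>
    #include <functional>
    #include <thread>

    // warn after ~5 seconds, give up after ~5 minutes
    bool lock_with_escalation(const std::function<bool()> &try_lock)
    {
       using namespace std::chrono;
       const auto warn_after  = seconds(5);
       const auto fatal_after = minutes(5);
       const auto start = steady_clock::now();
       bool warned = false;

       while (!try_lock()) {
          const auto waited = steady_clock::now() - start;
          if (!warned && waited > warn_after) {
             fprintf(stderr, "could not lock SYSTEM buffer for 5 seconds!\n");
             warned = true;
          }
          if (waited > fatal_after)
             return false;                                // caller aborts, as midas does today
          std::this_thread::sleep_for(milliseconds(100)); // retry period
       }
       return true;
    }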

K.O.

Comments (20)

  1. dd1

    Just hit the same problem in the alpha-g daq. The same solution worked: delete .SYSTEM.SHM. Why is this happening so often now? K.O.

  2. dd1

    There is definitely a problem with the SysV semaphore "SEM_UNDO" feature in Linux. We see it rarely, maybe once every 5 years, but we do see it: Linux somehow fails to apply SEM_UNDO on program crash/exit under unknown but rare conditions. (As a result ODB/SYSTEM/etc. is stuck locked, requiring manual recovery.) K.O.

  3. dd1

    I propose this scheme to make recovery (semi-)automatic (a sketch of the pid-probing step follows the list):

    • try to lock for 5-10 seconds
    • if timeout, peek inside the shared memory (without holding a lock)
    • look at every registered client, get their pid
    • probe all these pids, if they are all dead (not running anymore), it means there is no valid client holding the semaphore
    • unlock the semaphore
    • (maybe call bm_cleanup()/odb_cleanup() to remove dead/invalid clients)
    • recovery is complete
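
    A minimal sketch of the pid-probing step, assuming the list of client pids can be read from the shared memory header without holding the lock; kill(pid, 0) is the standard liveness probe:

    #include <errno.h>
    #include <signal.h>
    #include <sys/types.h>
    #include <vector>

    // true if none of the registered clients is alive, i.e. nobody can
    // legitimately be holding the semaphore and a forced unlock is safe
    bool all_clients_dead(const std::vector<pid_t> &client_pids)
    {
       for (pid_t pid : client_pids) {
          // kill(pid, 0) delivers no signal, it only checks that the pid exists
          if (kill(pid, 0) == 0)
             return false;        // process exists, a client may still be alive
          if (errno == EPERM)
             return false;        // process exists but belongs to another user
          // ESRCH: no such process, keep checking the rest
       }
       return true;
    }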

    K.O.

  4. dd1

    Normally, nobody should lock ODB for longer than 1 second or so, so the timeout could be short.

    Except when we are debugging MIDAS and set a breakpoint inside "locked" code. In that case we do not want everybody to die while we are debugging, so an infinite timeout is good.

    Right now we are half-way between these two cases: the timeout is set to around 5 minutes, long enough to allow debugging, but short enough to kill midas if the ODB/SYSTEM semaphore gets stuck for any reason. (This is an improvement over the previous situation, where midas just stops and there is no error, alarm or any other indication that something is wrong until the experiment shift person wakes up and notices that the event counters are not incrementing. Somehow this always happened at night on Saturday or Sunday.)

    K.O.

  5. dd1

    If we want to change this and always use the short 1-5-10 second timeout, we could have an ODB flag "disable timeouts" (we would want it to disable the ODB/SYSTEM and RPC timeouts). K.O.
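
    Purely hypothetical sketch: such a flag does not exist today, but reading it with the standard db_get_value() call would look roughly like this (the key name is made up).

    #include "midas.h"

    BOOL timeouts_disabled(HNDLE hDB)
    {
       BOOL flag = FALSE;
       INT  size = sizeof(flag);
       // create==TRUE makes the key with the default value (FALSE) on first use
       db_get_value(hDB, 0, "/Experiment/Disable timeouts", &flag, &size, TID_BOOL, TRUE);
       return flag;
    }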

  6. dd1

    The only remaining case is the "slow core dump": if a program crashes while holding an ODB/SYSTEM semaphore, the core dump can take an arbitrarily long time (the content of the SYSTEM buffer is dumped to the core file, so the core file can be pretty big). Because the semaphore stays locked, all the other midas programs will die, see bug 324.

    One solution is to limit the core dump size to (say) 1 GByte. A core dump of that size to an NFS file over a 1gige network should take about 10 seconds, well within a reasonable semaphore timeout (use a 10-20 second timeout).
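
    A sketch of one way to impose the limit, using the standard setrlimit() call at program startup (the same effect can be had with "ulimit -c" in the shell that starts midas):

    #include <sys/resource.h>
    #include <cstdio>

    // cap core dumps at 1 GByte so a crash cannot hold the semaphore for long
    bool limit_core_size()
    {
       struct rlimit rl;
       rl.rlim_cur = 1024UL * 1024UL * 1024UL;  // soft limit: 1 GByte
       rl.rlim_max = 1024UL * 1024UL * 1024UL;  // hard limit: 1 GByte
       if (setrlimit(RLIMIT_CORE, &rl) != 0) {
          perror("setrlimit(RLIMIT_CORE)");
          return false;
       }
       return true;
    }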

    K.O.

  7. dd1

    I looked at all the core dumps in alpha-g. I do not see a core dump with a crash inside "locked" code; all stack traces are timeout aborts from bm_lock_buffer() and bm_lock_buffer_mutex(). (By mistake, the mutex timeout is shorter than the semaphore timeout.) K.O.

  8. dd1

    I now suspect a locking bug was introduced. I will have to look at midas.cxx and confirm every bm_lock_buffer() has a corresponding unlock through all code paths. (there is no std::semaphore_guard, unfortunately!) K.O.

  9. dd1

    The first bm_lock_buffer() is called by bm_open_buffer() before the shared memory is mapped, so at that point we cannot look at the list of registered clients. Bummer! K.O.

  10. Stefan Ritt

    There is no std::semaphore_guard, but you can make one: create a "wrapper" object around the semaphore. The constructor locks a semaphore passed as a parameter, and the destructor of the object frees the semaphore. This way you ensure the semaphore gets released in all execution paths. I would have loved to have this functionality 30 years ago when I wrote the semaphore code, but now it's there and should be used.

  11. dd1

    for us it would be odb_unlock_guard and bm_unlock_guard, returned by db_lock() and bm_lock() respectively. Hmm… I will take a look at this. K.O.

  12. dd1

    Found the first locking bug: in bm_open_buffer(), an obvious typo causes a double unlock if MAX_CLIENTS is reached. (This is not the crasher in alpha-g.) K.O.

  13. Stefan Ritt

    Do we really need complicated code for that? Have a look at this simple thing:

    // RAII wrapper around a midas semaphore: the constructor acquires it,
    // the destructor releases it, so the lock cannot leak on any code path.
    #include "midas.h"
    #include "msystem.h"   // for ss_semaphore_create/wait_for/release

    class mlock {
    private:
       HNDLE msem;
       bool  mlocked;      // only release what we actually acquired
    public:
       mlock(HNDLE sem, INT timeout) {
          msem = sem;
          // remember whether the lock was obtained before the timeout
          mlocked = (ss_semaphore_wait_for(sem, timeout) == SS_SUCCESS);
       }

       ~mlock() {
          if (mlocked)
             ss_semaphore_release(msem);
       }
    };

    int main()
    {
       HNDLE sem;
       ss_semaphore_create("test", &sem);
       mlock testlock(sem, 1000);   // released automatically at end of scope

       return 0;
    }

    When the program leaves the main() routine, the destructor gets called and releases the semaphore.

  14. dd1

    In addition to the semaphore (a lock between different programs), there is a mutex (a lock between different threads), plus a read cache mutex and a write cache mutex. They all have to be locked in the correct order (see the comments in midas.h). As an extra complication, bm_wait_for_xxx_locked() has to drop and reacquire all the locks (in the correct order) while it is doing the waiting. I just finished writing bm_lock_buffer_guard() and the locking is much simplified now; next is to test it in alpha-g. I did not see any locking bugs while converting the code to a lock guard, so I hope the crasher bug was fixed by accident. In the worst case, something more subtle is going on… K.O.
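
    For illustration, a generic guard that supports the drop-and-reacquire pattern might look like this (the lock/unlock callables are placeholders, not the actual bm_lock_buffer_guard code in midas.cxx):

    #include <functional>
    #include <utility>

    class buffer_lock_guard {
    public:
       buffer_lock_guard(std::function<bool()> lock, std::function<void()> unlock)
          : m_lock(std::move(lock)), m_unlock(std::move(unlock)) {
          m_locked = m_lock();            // acquire on construction
       }

       ~buffer_lock_guard() {
          if (m_locked)
             m_unlock();                  // always released on scope exit
       }

       bool is_locked() const { return m_locked; }

       void unlock() {                    // drop the lock, e.g. before waiting
          if (m_locked) {
             m_unlock();
             m_locked = false;
          }
       }

       void relock() {                    // reacquire, in the correct lock order
          if (!m_locked)
             m_locked = m_lock();
       }

    private:
       std::function<bool()> m_lock;
       std::function<void()> m_unlock;
       bool m_locked = false;
    };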
