In the ALPHA-2 system at CERN we ran into a strange failure of MIDAS: we start all the midas programs and frontends, see them start and run for a few minutes, then all of them die from ODB watchdog timeout and SYSMSG watchdog timeout. Cannot start runs, cannot take data.
Only when we noticed that mdump does not start and instead gets stuck waiting for the semaphore of the SYSTEM event buffer did we realize that this semaphore was stuck.
I removed SYSTEM.SHM. This caused the semaphore handle to change (it is keyed to the inode number of the SYSTEM.SHM file via sysv ipc abracadabra), so a new (non-stuck) semaphore was allocated and midas started to work again.
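For illustration, here is a minimal sketch of how a sysv ipc key follows the file, assuming the key is made with ftok(). The function names and the 'M' project id are made up for this example and are not taken from the MIDAS code. On Linux, ftok() builds the key from the low bits of the inode number, the device number and the project id, so a deleted and recreated file usually gets a different key and therefore a different semaphore.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/stat.h>

/* Return the sysv ipc key derived from a file, or (key_t)-1 on error.
   The project id 'M' is an arbitrary choice for this sketch. */
key_t shm_file_key(const char *path)
{
    return ftok(path, 'M');
}

/* Print inode and key side by side to make the dependence visible.
   Returns 0 on success, -1 if the file cannot be examined. */
int print_key_info(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0) {
        perror("stat");
        return -1;
    }
    key_t key = shm_file_key(path);
    if (key == (key_t)-1) {
        perror("ftok");
        return -1;
    }
    printf("%s: inode=%llu key=0x%08x\n", path,
           (unsigned long long)st.st_ino, (unsigned int)key);
    return 0;
}
```

Calling print_key_info() on the .SHM file before and after removing and recreating it shows whether the key moved. The file system is free to reuse an inode number, so a changed key is likely but not guaranteed.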
Very confusing, definitely needs better diagnostics.
To remember: if some program locks the semaphore and dies, the semaphore remains locked forever, until somebody else unlocks it (or the universe ends, via a computer reboot).
MIDAS uses a special sysv semaphore feature called SEM_UNDO to tell the linux kernel to unlock the semaphore automatically if the program holding it dies. This works 99% of the time, but apparently there is a bug in the linux kernel that causes it to fail rarely (rarely enough for us to never implement a fix, but often enough to cause trouble about once every couple of years).
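To make the SEM_UNDO behaviour concrete, here is a small Linux-only sketch, not MIDAS code: a child process locks a private semaphore and exits without unlocking it, and the parent then reads the semaphore value. The function name value_after_holder_dies() is invented for this example.

```c
#include <assert.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/wait.h>

/* On Linux the caller has to define union semun itself. */
union semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* Fork a child that locks a fresh semaphore and exits without unlocking.
   Returns the semaphore value seen afterwards:
   1 = unlocked again, 0 = left locked, -1 = error. */
int value_after_holder_dies(int use_undo)
{
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (semid < 0)
        return -1;

    union semun arg;
    arg.val = 1;                      /* 1 means "unlocked" */
    if (semctl(semid, 0, SETVAL, arg) < 0) {
        semctl(semid, 0, IPC_RMID);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        semctl(semid, 0, IPC_RMID);
        return -1;
    }
    if (pid == 0) {
        /* Child: take the lock, then "die" while still holding it. */
        struct sembuf lock;
        lock.sem_num = 0;
        lock.sem_op  = -1;
        lock.sem_flg = use_undo ? SEM_UNDO : 0;
        semop(semid, &lock, 1);
        _exit(0);
    }

    int status;
    waitpid(pid, &status, 0);         /* wait until the holder is gone */
    int value = semctl(semid, 0, GETVAL);
    semctl(semid, 0, IPC_RMID);       /* clean up the test semaphore */
    return value;
}
```

With use_undo set, the kernel should revert the child's decrement when it exits and the value comes back as 1. Without it the value stays at 0, which is the stuck state described above.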
There is a timeout on the event buffer semaphores, but it is set very long, about 5 minutes, to allow debugging of midas. Event buffers are never meant to be locked for longer than about 1 second, so perhaps there should be a quick timeout first, say 5 seconds, after which we complain "could not lock SYSTEM buffer for 5 seconds!", then the 5 minute timeout, then crash.
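The two-stage timeout proposed above could look roughly like the sketch below. It is only an illustration under assumptions: it uses the Linux-specific semtimedop() call, the names timed_lock(), lock_with_warning() and create_private_sem() are hypothetical, and it returns an error code where a real implementation would probably abort. Timeouts are in milliseconds so the behaviour can be tried quickly.

```c
#define _GNU_SOURCE               /* semtimedop() is a Linux extension */
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <time.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

union semun {                     /* must be defined by the caller on Linux */
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* Helper for trying this out: private semaphore with a given start value. */
int create_private_sem(int initial_value)
{
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (semid < 0)
        return -1;
    union semun arg;
    arg.val = initial_value;
    if (semctl(semid, 0, SETVAL, arg) < 0) {
        semctl(semid, 0, IPC_RMID);
        return -1;
    }
    return semid;
}

/* One lock attempt. Returns 0 = locked, 1 = timed out, -1 = other error. */
static int timed_lock(int semid, long timeout_ms)
{
    struct sembuf op;
    op.sem_num = 0;
    op.sem_op  = -1;
    op.sem_flg = SEM_UNDO;

    struct timespec ts;
    ts.tv_sec  = timeout_ms / 1000;
    ts.tv_nsec = (timeout_ms % 1000) * 1000000L;

    for (;;) {
        if (semtimedop(semid, &op, 1, &ts) == 0)
            return 0;
        if (errno == EINTR)
            continue;             /* sketch: a signal restarts the full wait */
        return errno == EAGAIN ? 1 : -1;
    }
}

/* Stage 1: wait warn_ms, then complain and keep waiting.
   Stage 2: wait until fatal_ms in total, then give up.
   Returns 0 = locked quickly, 1 = locked after the warning,
   -2 = final timeout, -1 = other error. */
int lock_with_warning(int semid, const char *name, long warn_ms, long fatal_ms)
{
    int rc = timed_lock(semid, warn_ms);
    if (rc <= 0)
        return rc;
    fprintf(stderr, "could not lock %s buffer for %ld ms, still waiting\n",
            name, warn_ms);

    rc = timed_lock(semid, fatal_ms - warn_ms);
    if (rc == 0)
        return 1;
    if (rc == 1) {
        fprintf(stderr, "could not lock %s buffer for %ld ms, giving up\n",
                name, fatal_ms);
        return -2;
    }
    return -1;
}
```

With the numbers from the text this would be called as lock_with_warning(semid, "SYSTEM", 5000, 300000). The early message makes the stuck buffer visible after 5 seconds, while the long timeout that allows debugging stays in place.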