recursive db_lock_database() is broken
On MacOS, the DRAGON experiment is seeing crashes of mserver with frequency of 1 crash every 2 days when idle and about 1 crash per day when taking data. We captured a stack trace. It shows a crash due to attempt to lock an already locked mutex. This is not permitted for non-recursive mutexes. This simplest test to see if this is a real bug or not is to call db_lock_database() then call cm_watchdog() which will call db_lock_database() again. If there is a crash, then the odb mutex is not working correctly. Second test should be to lock the database, then call cm_watchdog() from a different thread, this time it should crash because now we are in a different thread and pthread_mutex_lock() should wait for the lock in the normal way.
Here is the stack trace.
K.O.
(lldb) bt
* thread #1: tid = 0x0000, 0x00007fff8dc74d46 libsystem_kernel.dylib__kill + 10, stop reason = signal SIGSTOP
frame #0: 0x00007fff8dc74d46 libsystem_kernel.dylib
__kill + 10
frame #1: 0x00007fff9604af83 libsystem_c.dylibabort + 177
frame #2: 0x000000010865d5ce mserver
ss_mutex_wait_for() + 206 at system.c:2796
frame #3: 0x000000010866175c mserverdb_lock_database() + 92 at odb.c:1752
frame #4: 0x00000001086457b1 mserver
cm_watchdog() + 769 at midas.c:6043
frame #5: 0x00007fff95ff390a libsystem_c.dylib_sigtramp + 26
frame #6: 0x00007fff9600b005 libsystem_c.dylib
pthread_mutex_trylock + 484
frame #7: 0x000000010865d538 mserverss_mutex_wait_for(mutex=0x00007fa018403a90, timeout=1000) + 56 at system.c:2781
frame #8: 0x000000010866175c mserver
db_lock_database() + 92 at odb.c:1752
frame #9: 0x00000001086646f6 mserverdb_find_key(hDB=1, hKey=0, key_name=0x00007fff575c1090, subhKey=0x00007fff575c108c) + 230 at odb.c:2482
frame #10: 0x0000000108667b41 mserver
db_get_value(hDB=1, hKeyRoot=0, data=0x00007fff575c143c, buf_size=0x00007fff575c17fc, type=7) + 497 at odb.c:3646
frame #11: 0x000000010867fe9b mserveral_check + 539 at alarm.c:562
frame #12: 0x000000010865051b mserver
cm_yield(millisec=5000) + 91 at midas.c:5120
frame #13: 0x000000010863e8aa mservermain() + 1578 at mserver.c:341
frame #14: 0x00007fff9876c7e1 libdyld.dylib
start + 1
(lldb) down
Comments (6)
-
reporter -
reporter Try again.
(lldb) bt * thread #1: tid = 0x0000, 0x00007fff8dc74d46 libsystem_kernel.dylib`__kill + 10, stop reason = signal SIGSTOP frame #0: 0x00007fff8dc74d46 libsystem_kernel.dylib`__kill + 10 frame #1: 0x00007fff9604af83 libsystem_c.dylib`abort + 177 frame #2: 0x000000010865d5ce mserver`ss_mutex_wait_for() + 206 at system.c:2796 frame #3: 0x000000010866175c mserver`db_lock_database() + 92 at odb.c:1752 frame #4: 0x00000001086457b1 mserver`cm_watchdog() + 769 at midas.c:6043 frame #5: 0x00007fff95ff390a libsystem_c.dylib`_sigtramp + 26 frame #6: 0x00007fff9600b005 libsystem_c.dylib`pthread_mutex_trylock + 484 frame #7: 0x000000010865d538 mserver`ss_mutex_wait_for(mutex=0x00007fa018403a90, timeout=1000) + 56 at system.c:2781 frame #8: 0x000000010866175c mserver`db_lock_database() + 92 at odb.c:1752 frame #9: 0x00000001086646f6 mserver`db_find_key(hDB=1, hKey=0, key_name=0x00007fff575c1090, subhKey=0x00007fff575c108c) + 230 at odb.c:2482 frame #10: 0x0000000108667b41 mserver`db_get_value(hDB=1, hKeyRoot=0, data=0x00007fff575c143c, buf_size=0x00007fff575c17fc, type=7) + 497 at odb.c:3646 frame #11: 0x000000010867fe9b mserver`al_check + 539 at alarm.c:562 frame #12: 0x000000010865051b mserver`cm_yield(millisec=5000) + 91 at midas.c:5120 frame #13: 0x000000010863e8aa mserver`main() + 1578 at mserver.c:341 frame #14: 0x00007fff9876c7e1 libdyld.dylib`start + 1 (lldb)
-
reporter This is on a Mac. That why we see pthread_mutex_trylock() instead of pthread_mutex_lock(). MULTI_THREAD_ENABLE is unconditionally defined in msystem.h
K.O.
-
reporter Reproduced on both MacOS and Linux: if cm_watchdog() fires at the wrong moment and issues db_lock_database(), we crash. The wrong moment is when we are in db_lock_database() holding the "am" lock, i.e. here:
#ifdef MULTI_THREAD_ENABLE /* obtain access mutex in multi-thread applications */ if (ss_mutex_wait_for(_database[hDB - 1].am, 1000) != SS_SUCCESS) { cm_msg(MERROR, "db_lock_database", "internal error: cannot obtain access mutex, aborting..."); abort(); } #endif printf("HERE!\n"); db_lock_database(hDB); // crash crash crash printf("OUT!\n");
The bottom line is cm_watchdog() runs in the context of our thread and if it were to take odb locks, one must use "recursive" mutexes. K.O.
-
reporter fixed on branch feature/midas-2017-10 as of commit 60090d3. K.O.
-
reporter - changed status to resolved
fixed on branch feature/midas-2017-10 as of commit 60090d3. K.O.
- Log in to comment
The stack trace was garbled by bitbucket, try again:
<pre> (lldb) bt * thread
#1: tid = 0x0000, 0x00007fff8dc74d46 libsystem_kernel.dylib__kill + 10, stop reason = signal SIGSTOP frame #0: 0x00007fff8dc74d46 libsystem_kernel.dylib
__kill + 10 frame#1: 0x00007fff9604af83 libsystem_c.dylibabort + 177 frame #2: 0x000000010865d5ce mserver
ss_mutex_wait_for() + 206 at system.c:2796 frame#3: 0x000000010866175c mserverdb_lock_database() + 92 at odb.c:1752 frame #4: 0x00000001086457b1 mserver
cm_watchdog() + 769 at midas.c:6043 frame#5: 0x00007fff95ff390a libsystem_c.dylib_sigtramp + 26 frame #6: 0x00007fff9600b005 libsystem_c.dylib
pthread_mutex_trylock + 484 frame#7: 0x000000010865d538 mserverss_mutex_wait_for(mutex=0x00007fa018403a90, timeout=1000) + 56 at system.c:2781 frame #8: 0x000000010866175c mserver
db_lock_database() + 92 at odb.c:1752 frame#9: 0x00000001086646f6 mserverdb_find_key(hDB=1, hKey=0, key_name=0x00007fff575c1090, subhKey=0x00007fff575c108c) + 230 at odb.c:2482 frame #10: 0x0000000108667b41 mserver
db_get_value(hDB=1, hKeyRoot=0, data=0x00007fff575c143c, buf_size=0x00007fff575c17fc, type=7) + 497 at odb.c:3646 frame#11: 0x000000010867fe9b mserveral_check + 539 at alarm.c:562 frame #12: 0x000000010865051b mserver
cm_yield(millisec=5000) + 91 at midas.c:5120 frame#13: 0x000000010863e8aa mservermain() + 1578 at mserver.c:341 frame #14: 0x00007fff9876c7e1 libdyld.dylib
start + 1 (lldb) </pre>