recursive db_lock_database() is broken

Issue #68 resolved

dd1 created an issue 2017-10-05

On MacOS, the DRAGON experiment is seeing crashes of mserver with frequency of 1 crash every 2 days when idle and about 1 crash per day when taking data. We captured a stack trace. It shows a crash due to attempt to lock an already locked mutex. This is not permitted for non-recursive mutexes. This simplest test to see if this is a real bug or not is to call db_lock_database() then call cm_watchdog() which will call db_lock_database() again. If there is a crash, then the odb mutex is not working correctly. Second test should be to lock the database, then call cm_watchdog() from a different thread, this time it should crash because now we are in a different thread and pthread_mutex_lock() should wait for the lock in the normal way.

Here is the stack trace.

K.O.

(lldb) bt * thread #1: tid = 0x0000, 0x00007fff8dc74d46 libsystem_kernel.dylib__kill + 10, stop reason = signal SIGSTOP frame #0: 0x00007fff8dc74d46 libsystem_kernel.dylib__kill + 10 frame #1: 0x00007fff9604af83 libsystem_c.dylibabort + 177 frame #2: 0x000000010865d5ce mserverss_mutex_wait_for() + 206 at system.c:2796 frame #3: 0x000000010866175c mserverdb_lock_database() + 92 at odb.c:1752 frame #4: 0x00000001086457b1 mservercm_watchdog() + 769 at midas.c:6043 frame #5: 0x00007fff95ff390a libsystem_c.dylib_sigtramp + 26 frame #6: 0x00007fff9600b005 libsystem_c.dylibpthread_mutex_trylock + 484 frame #7: 0x000000010865d538 mserverss_mutex_wait_for(mutex=0x00007fa018403a90, timeout=1000) + 56 at system.c:2781 frame #8: 0x000000010866175c mserverdb_lock_database() + 92 at odb.c:1752 frame #9: 0x00000001086646f6 mserverdb_find_key(hDB=1, hKey=0, key_name=0x00007fff575c1090, subhKey=0x00007fff575c108c) + 230 at odb.c:2482 frame #10: 0x0000000108667b41 mserverdb_get_value(hDB=1, hKeyRoot=0, data=0x00007fff575c143c, buf_size=0x00007fff575c17fc, type=7) + 497 at odb.c:3646 frame ~~#11~~: 0x000000010867fe9b mserveral_check + 539 at alarm.c:562 frame #12: 0x000000010865051b mservercm_yield(millisec=5000) + 91 at midas.c:5120 frame ~~#13~~: 0x000000010863e8aa mservermain() + 1578 at mserver.c:341 frame #14: 0x00007fff9876c7e1 libdyld.dylibstart + 1 (lldb) down

Comments (6)

dd1 reporter
The stack trace was garbled by bitbucket, try again:

<pre> (lldb) bt * thread #1: tid = 0x0000, 0x00007fff8dc74d46 libsystem_kernel.dylib__kill + 10, stop reason = signal SIGSTOP frame #0: 0x00007fff8dc74d46 libsystem_kernel.dylib__kill + 10 frame #1: 0x00007fff9604af83 libsystem_c.dylibabort + 177 frame #2: 0x000000010865d5ce mserverss_mutex_wait_for() + 206 at system.c:2796 frame #3: 0x000000010866175c mserverdb_lock_database() + 92 at odb.c:1752 frame #4: 0x00000001086457b1 mservercm_watchdog() + 769 at midas.c:6043 frame #5: 0x00007fff95ff390a libsystem_c.dylib_sigtramp + 26 frame #6: 0x00007fff9600b005 libsystem_c.dylibpthread_mutex_trylock + 484 frame #7: 0x000000010865d538 mserverss_mutex_wait_for(mutex=0x00007fa018403a90, timeout=1000) + 56 at system.c:2781 frame #8: 0x000000010866175c mserverdb_lock_database() + 92 at odb.c:1752 frame #9: 0x00000001086646f6 mserverdb_find_key(hDB=1, hKey=0, key_name=0x00007fff575c1090, subhKey=0x00007fff575c108c) + 230 at odb.c:2482 frame #10: 0x0000000108667b41 mserverdb_get_value(hDB=1, hKeyRoot=0, data=0x00007fff575c143c, buf_size=0x00007fff575c17fc, type=7) + 497 at odb.c:3646 frame ~~#11~~: 0x000000010867fe9b mserveral_check + 539 at alarm.c:562 frame #12: 0x000000010865051b mservercm_yield(millisec=5000) + 91 at midas.c:5120 frame ~~#13~~: 0x000000010863e8aa mservermain() + 1578 at mserver.c:341 frame #14: 0x00007fff9876c7e1 libdyld.dylibstart + 1 (lldb) </pre>
- 2017-10-05T00:50:04+00:00

dd1 reporter

Try again.

(lldb) bt
* thread #1: tid = 0x0000, 0x00007fff8dc74d46 libsystem_kernel.dylib`__kill + 10, stop reason = signal SIGSTOP
    frame #0: 0x00007fff8dc74d46 libsystem_kernel.dylib`__kill + 10
    frame #1: 0x00007fff9604af83 libsystem_c.dylib`abort + 177
    frame #2: 0x000000010865d5ce mserver`ss_mutex_wait_for() + 206 at system.c:2796
    frame #3: 0x000000010866175c mserver`db_lock_database() + 92 at odb.c:1752
    frame #4: 0x00000001086457b1 mserver`cm_watchdog() + 769 at midas.c:6043
    frame #5: 0x00007fff95ff390a libsystem_c.dylib`_sigtramp + 26
    frame #6: 0x00007fff9600b005 libsystem_c.dylib`pthread_mutex_trylock + 484
    frame #7: 0x000000010865d538 mserver`ss_mutex_wait_for(mutex=0x00007fa018403a90, timeout=1000) + 56 at system.c:2781
    frame #8: 0x000000010866175c mserver`db_lock_database() + 92 at odb.c:1752
    frame #9: 0x00000001086646f6 mserver`db_find_key(hDB=1, hKey=0, key_name=0x00007fff575c1090, subhKey=0x00007fff575c108c) + 230 at odb.c:2482
    frame #10: 0x0000000108667b41 mserver`db_get_value(hDB=1, hKeyRoot=0, data=0x00007fff575c143c, buf_size=0x00007fff575c17fc, type=7) + 497 at odb.c:3646
    frame #11: 0x000000010867fe9b mserver`al_check + 539 at alarm.c:562
    frame #12: 0x000000010865051b mserver`cm_yield(millisec=5000) + 91 at midas.c:5120
    frame #13: 0x000000010863e8aa mserver`main() + 1578 at mserver.c:341
    frame #14: 0x00007fff9876c7e1 libdyld.dylib`start + 1
(lldb)

2017-10-05T00:50:46+00:00

dd1 reporter
This is on a Mac. That why we see pthread_mutex_trylock() instead of pthread_mutex_lock(). MULTI_THREAD_ENABLE is unconditionally defined in msystem.h

K.O.
- 2017-10-05T00:51:55+00:00

dd1 reporter

Reproduced on both MacOS and Linux: if cm_watchdog() fires at the wrong moment and issues db_lock_database(), we crash. The wrong moment is when we are in db_lock_database() holding the "am" lock, i.e. here:

#ifdef MULTI_THREAD_ENABLE
   /* obtain access mutex in multi-thread applications */
   if (ss_mutex_wait_for(_database[hDB - 1].am, 1000) != SS_SUCCESS) {
      cm_msg(MERROR, "db_lock_database", "internal error: cannot obtain access mutex, aborting...");
      abort();
   }
#endif
   printf("HERE!\n");
   db_lock_database(hDB); // crash crash crash
   printf("OUT!\n");

The bottom line is cm_watchdog() runs in the context of our thread and if it were to take odb locks, one must use "recursive" mutexes. K.O.

2017-10-13T16:15:58+00:00

dd1 reporter
fixed on branch feature/midas-2017-10 as of commit 60090d3. K.O.
- 2017-10-13T17:50:47+00:00
dd1 reporter
- changed status to resolved
fixed on branch feature/midas-2017-10 as of commit 60090d3. K.O.
- 2017-10-13T17:51:05+00:00
Log in to comment

Assignee: –

Type: bug

Priority: major

Status: resolved

Votes: 0

Watchers: 1