recursive db_lock_database() is broken

Issue #68 resolved
dd1 created an issue

On MacOS, the DRAGON experiment is seeing crashes of mserver with frequency of 1 crash every 2 days when idle and about 1 crash per day when taking data. We captured a stack trace. It shows a crash due to attempt to lock an already locked mutex. This is not permitted for non-recursive mutexes. This simplest test to see if this is a real bug or not is to call db_lock_database() then call cm_watchdog() which will call db_lock_database() again. If there is a crash, then the odb mutex is not working correctly. Second test should be to lock the database, then call cm_watchdog() from a different thread, this time it should crash because now we are in a different thread and pthread_mutex_lock() should wait for the lock in the normal way.

Here is the stack trace.

K.O.

(lldb) bt * thread #1: tid = 0x0000, 0x00007fff8dc74d46 libsystem_kernel.dylib__kill + 10, stop reason = signal SIGSTOP frame #0: 0x00007fff8dc74d46 libsystem_kernel.dylib__kill + 10 frame #1: 0x00007fff9604af83 libsystem_c.dylibabort + 177 frame #2: 0x000000010865d5ce mserverss_mutex_wait_for() + 206 at system.c:2796 frame #3: 0x000000010866175c mserverdb_lock_database() + 92 at odb.c:1752 frame #4: 0x00000001086457b1 mservercm_watchdog() + 769 at midas.c:6043 frame #5: 0x00007fff95ff390a libsystem_c.dylib_sigtramp + 26 frame #6: 0x00007fff9600b005 libsystem_c.dylibpthread_mutex_trylock + 484 frame #7: 0x000000010865d538 mserverss_mutex_wait_for(mutex=0x00007fa018403a90, timeout=1000) + 56 at system.c:2781 frame #8: 0x000000010866175c mserverdb_lock_database() + 92 at odb.c:1752 frame #9: 0x00000001086646f6 mserverdb_find_key(hDB=1, hKey=0, key_name=0x00007fff575c1090, subhKey=0x00007fff575c108c) + 230 at odb.c:2482 frame #10: 0x0000000108667b41 mserverdb_get_value(hDB=1, hKeyRoot=0, data=0x00007fff575c143c, buf_size=0x00007fff575c17fc, type=7) + 497 at odb.c:3646 frame #11: 0x000000010867fe9b mserveral_check + 539 at alarm.c:562 frame #12: 0x000000010865051b mservercm_yield(millisec=5000) + 91 at midas.c:5120 frame #13: 0x000000010863e8aa mservermain() + 1578 at mserver.c:341 frame #14: 0x00007fff9876c7e1 libdyld.dylibstart + 1 (lldb) down

Comments (6)

  1. dd1 reporter

    The stack trace was garbled by bitbucket, try again:

    <pre> (lldb) bt * thread #1: tid = 0x0000, 0x00007fff8dc74d46 libsystem_kernel.dylib__kill + 10, stop reason = signal SIGSTOP frame #0: 0x00007fff8dc74d46 libsystem_kernel.dylib__kill + 10 frame #1: 0x00007fff9604af83 libsystem_c.dylibabort + 177 frame #2: 0x000000010865d5ce mserverss_mutex_wait_for() + 206 at system.c:2796 frame #3: 0x000000010866175c mserverdb_lock_database() + 92 at odb.c:1752 frame #4: 0x00000001086457b1 mservercm_watchdog() + 769 at midas.c:6043 frame #5: 0x00007fff95ff390a libsystem_c.dylib_sigtramp + 26 frame #6: 0x00007fff9600b005 libsystem_c.dylibpthread_mutex_trylock + 484 frame #7: 0x000000010865d538 mserverss_mutex_wait_for(mutex=0x00007fa018403a90, timeout=1000) + 56 at system.c:2781 frame #8: 0x000000010866175c mserverdb_lock_database() + 92 at odb.c:1752 frame #9: 0x00000001086646f6 mserverdb_find_key(hDB=1, hKey=0, key_name=0x00007fff575c1090, subhKey=0x00007fff575c108c) + 230 at odb.c:2482 frame #10: 0x0000000108667b41 mserverdb_get_value(hDB=1, hKeyRoot=0, data=0x00007fff575c143c, buf_size=0x00007fff575c17fc, type=7) + 497 at odb.c:3646 frame #11: 0x000000010867fe9b mserveral_check + 539 at alarm.c:562 frame #12: 0x000000010865051b mservercm_yield(millisec=5000) + 91 at midas.c:5120 frame #13: 0x000000010863e8aa mservermain() + 1578 at mserver.c:341 frame #14: 0x00007fff9876c7e1 libdyld.dylibstart + 1 (lldb) </pre>

  2. dd1 reporter

    Try again.

    (lldb) bt
    * thread #1: tid = 0x0000, 0x00007fff8dc74d46 libsystem_kernel.dylib`__kill + 10, stop reason = signal SIGSTOP
        frame #0: 0x00007fff8dc74d46 libsystem_kernel.dylib`__kill + 10
        frame #1: 0x00007fff9604af83 libsystem_c.dylib`abort + 177
        frame #2: 0x000000010865d5ce mserver`ss_mutex_wait_for() + 206 at system.c:2796
        frame #3: 0x000000010866175c mserver`db_lock_database() + 92 at odb.c:1752
        frame #4: 0x00000001086457b1 mserver`cm_watchdog() + 769 at midas.c:6043
        frame #5: 0x00007fff95ff390a libsystem_c.dylib`_sigtramp + 26
        frame #6: 0x00007fff9600b005 libsystem_c.dylib`pthread_mutex_trylock + 484
        frame #7: 0x000000010865d538 mserver`ss_mutex_wait_for(mutex=0x00007fa018403a90, timeout=1000) + 56 at system.c:2781
        frame #8: 0x000000010866175c mserver`db_lock_database() + 92 at odb.c:1752
        frame #9: 0x00000001086646f6 mserver`db_find_key(hDB=1, hKey=0, key_name=0x00007fff575c1090, subhKey=0x00007fff575c108c) + 230 at odb.c:2482
        frame #10: 0x0000000108667b41 mserver`db_get_value(hDB=1, hKeyRoot=0, data=0x00007fff575c143c, buf_size=0x00007fff575c17fc, type=7) + 497 at odb.c:3646
        frame #11: 0x000000010867fe9b mserver`al_check + 539 at alarm.c:562
        frame #12: 0x000000010865051b mserver`cm_yield(millisec=5000) + 91 at midas.c:5120
        frame #13: 0x000000010863e8aa mserver`main() + 1578 at mserver.c:341
        frame #14: 0x00007fff9876c7e1 libdyld.dylib`start + 1
    (lldb) 
    
  3. dd1 reporter

    This is on a Mac. That why we see pthread_mutex_trylock() instead of pthread_mutex_lock(). MULTI_THREAD_ENABLE is unconditionally defined in msystem.h

    K.O.

  4. dd1 reporter

    Reproduced on both MacOS and Linux: if cm_watchdog() fires at the wrong moment and issues db_lock_database(), we crash. The wrong moment is when we are in db_lock_database() holding the "am" lock, i.e. here:

    #ifdef MULTI_THREAD_ENABLE
       /* obtain access mutex in multi-thread applications */
       if (ss_mutex_wait_for(_database[hDB - 1].am, 1000) != SS_SUCCESS) {
          cm_msg(MERROR, "db_lock_database", "internal error: cannot obtain access mutex, aborting...");
          abort();
       }
    #endif
       printf("HERE!\n");
       db_lock_database(hDB); // crash crash crash
       printf("OUT!\n");
    

    The bottom line is cm_watchdog() runs in the context of our thread and if it were to take odb locks, one must use "recursive" mutexes. K.O.

  5. Log in to comment