make test crashes

Issue #19 closed
Pete Zaitcev created an issue

..............
ok 256 - test_verify_stripe_metadata_magic_mismatch: liberasurecode_rs_vand (idx=3)
ok 257 - test_verify_stripe_metadata_be_id_mismatch: liberasurecode_rs_vand (idx=3)
ok 258 - test_verify_stripe_metadata_be_ver_mismatch: liberasurecode_rs_vand (idx=3)
Makefile:908: recipe for target 'test' failed
make: *** [test] Segmentation fault (core dumped)
[zaitcev@guren liberasurecode-1.0.9]$

[zaitcev@guren liberasurecode-1.0.9]$ dmesg | tail
[106048.473301] show_signal_msg: 15 callbacks suppressed
[106048.473307] alg_sig_test[10484]: segfault at 0 ip (null) sp 00007ffe87765618 error 14 in alg_sig_test[400000+2000]

ulimit -c is 1000000, but no core file is created. abrt is not running. Very mysterious.

Comments (19)

  1. Tushar Gohad repo owner

    Hi Pete, how do we reproduce this? What platform, Fedora rev etc? Thanks for the report.

  2. Pete Zaitcev reporter

    All I know is that I run "make test".

    The core was not collected due to systemd. I reconfigured the kernel and got this:

    [New LWP 28606]
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib64/libthread_db.so.1".
    Core was generated by `./test/alg_sig_test'.
    Program terminated with signal SIGSEGV, Segmentation fault.
    #0  0x0000000000000000 in ?? ()
    Missing separate debuginfos, use: dnf debuginfo-install gf-complete-1.03-3.20150316git.fc21.x86_64 glibc-2.21-7.fc22.x86_64 jerasure-devel-2.0-1.fc20.x86_64
    (gdb) where
    #0  0x0000000000000000 in ?? ()
    #1  0x000000000040153b in destroy_alg_sig (alg_sig_handle=0xce0c10)
        at utils/chksum/alg_sig.c:241
    #2  0x0000000000400a17 in basic_xor_test_16_32 ()
        at utils/chksum/test_alg_sig.c:230
    #3  main (argc=<optimized out>, argv=<optimized out>)
        at utils/chksum/test_alg_sig.c:243
    (gdb) 
    
  3. Tushar Gohad repo owner

    @kmgreen2, this looks like a double-free issue. @zaitcev, I am still not able to reproduce this on Fedora 22. Below is the gcc version I am using. Would you be able to tell us more about your test environment? If it is a KVM instance, we can try to mimic it on this side.

    # gcc --version
    gcc (GCC) 5.1.1 20150618 (Red Hat 5.1.1-4)
    Copyright (C) 2015 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.  There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
    
  4. Kevin Greenan

    @tsg- I cannot repro on my RHEL 6 box or Mac... Let me see if I can find the issue by inspecting the code ;)

  5. Kevin Greenan

    @zaitcev What version of Jerasure do you have installed? This might be a Jerasure/gf-complete bug. Could you build Jerasure/gf-complete with debug symbols and re-run test/alg_sig_test? That will fill in the rest of that backtrace.

    To do that just add '-g' to AM_CFLAGS in src/Makefile.am in both libraries.

  6. Tushar Gohad repo owner

    @kmgreen2 good call - the original report showing the core dump during the "liberasurecode_rs_vand" test run confused me, and I went looking in the local alg_sig implementation.

  7. Kevin Greenan

    @zaitcev Just confirmed that the version of Jerasure you are using is pretty old... Between your version and the versions as of about a year ago, we added 'galois_uninit_field' to jerasure.

    From the backtrace of running 'gdb ./alg_sig_test':

    (gdb) bt
    #0  0x0000000000000000 in ?? ()
    #1  0x00000000004012bf in destroy_alg_sig (alg_sig_handle=0x603b90) at utils/chksum/alg_sig.c:241
    #2  0x0000000000400ba6 in basic_xor_test_16_32 (argc=<value optimized out>, argv=<value optimized out>) at utils/chksum/test_alg_sig.c:230
    #3  main (argc=<value optimized out>, argv=<value optimized out>) at utils/chksum/test_alg_sig.c:243
    #1  0x00000000004012bf in destroy_alg_sig (alg_sig_handle=0x603b90) at utils/chksum/alg_sig.c:241
    241   alg_sig_handle->mult_routines.galois_uninit_field(alg_sig_handle->gf_w);
    (gdb) p alg_sig_handle
    $1 = (alg_sig_t *) 0x603b90
    (gdb) p *alg_sig_handle
    $2 = {gf_w = 16, sig_len = 32, mult_routines = {galois_single_multiply = 0x7ffff71960d0 <galois_single_multiply>, galois_uninit_field = 0}, jerasure_sohandle = 0x603070, 
      tbl1_l = 0x603c40, tbl1_r = 0x604050, tbl2_l = 0x0, tbl2_r = 0x0, tbl3_l = 0x0, tbl3_r = 0x0}
    

    Notice that the function pointer galois_uninit_field is 0 (NULL), so the call at alg_sig.c:241 jumps to address 0.

    This is because that specific function does not exist in the version of jerasure you are using. That version technically does not exist anymore, as the repo has been taken down (old bitbucket repo). If you use the version at jerasure.org, this problem will go away.

  8. Kevin Greenan

    @tsg- It seems that there is a bug in liberasurecode as well. We do not check the existence of uninit when loading the function from jerasure:

    int load_gf_functions(void *sohandle, struct jerasure_mult_routines *routines)
    {
        routines->galois_single_multiply = get_galois_multi_func(sohandle);
        routines->galois_uninit_field = get_galois_uninit_func(sohandle);
        if (NULL == routines->galois_single_multiply) {
            return -1;
        }
        return 0;
    }
    
  9. Kevin Greenan

    Fixing Issue #19 by stubbing out Jerasure's uninit function.

    If the underlying jerasure implementation is old (pre-jerasure.org), then it will not contain an uninit function for the underlying GF object. Since this is only used in alg_sig, which is not used by anything else at the moment, we stub it out if it does not exist.

    Once we make the change to have alg_sig use the internal GF functions, this whole problem goes away.

    → <<cset b80d5b947b37>>

  10. Pete Zaitcev reporter

    I don't have any underlying jerasure/gf-complete at all on this system, neither an old one nor an up-to-date one. I presumed that the build-time checks did not need any.

  11. Tushar Gohad repo owner

    @zaitcev interesting .. can you please check with liberasurecode master if you can reproduce the issue?

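For readers following the thread, the stubbing approach described in comment 9 can be sketched roughly as below. This is an illustration only, not the code in the referenced changeset: the struct name `mult_routines_sketch`, the functions `load_gf_functions_sketch`, `stub_galois_uninit_field` and `fake_multiply`, and the function-pointer signatures are all assumptions made for the example. The two pointer arguments stand in for the results of the symbol lookups, with NULL meaning the symbol was not found.

```c
#include <stddef.h>

/* Assumed signatures, for illustration only. */
typedef int (*mult_fn)(int a, int b, int w);
typedef int (*uninit_fn)(int w);

/* Hypothetical mirror of the routines table printed in the gdb session. */
struct mult_routines_sketch {
    mult_fn   galois_single_multiply;
    uninit_fn galois_uninit_field;
};

/* No-op installed when the loaded library exports no uninit symbol. */
static int stub_galois_uninit_field(int w)
{
    (void)w;   /* nothing to release */
    return 0;
}

/* Placeholder standing in for a symbol that a real lookup would return.
 * It does NOT perform Galois-field multiplication. */
static int fake_multiply(int a, int b, int w)
{
    (void)a; (void)b; (void)w;
    return 0;
}

/* 'multiply' and 'uninit' are the looked-up symbols (NULL if missing).
 * A missing multiply is still a hard error; a missing uninit is replaced
 * by the stub so that later calls through the pointer are safe. */
int load_gf_functions_sketch(struct mult_routines_sketch *routines,
                             mult_fn multiply, uninit_fn uninit)
{
    if (multiply == NULL) {
        return -1;
    }
    routines->galois_single_multiply = multiply;
    routines->galois_uninit_field =
        (uninit != NULL) ? uninit : stub_galois_uninit_field;
    return 0;
}
```

With a table filled in this way, a teardown path such as the call at alg_sig.c:241 would go through the stub instead of a NULL pointer when the library lacks the uninit symbol.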