crash in cm_transition

Issue #321 resolved
Former user created an issue

We see trouble with run transitions in alpha-g, especially when runs fail to start because of mlogger timeouts (the history takes a long time to initialize its data from ZFS on HDD). I captured one core dump of mhttpd. All of that code needs to be converted from custom memory management to std::vector & co. and protected by std::mutex. K.O.
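A minimal sketch of what such a conversion could look like, not the actual MIDAS code. `TrClient` here is a cut-down, hypothetical stand-in for `TR_CLIENT`, and `TrState` with its methods is invented for illustration: the client list lives in a `std::vector` of `std::unique_ptr`, predecessor links are non-owning pointers into that list, and every access goes through one `std::mutex`.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, cut-down stand-in for TR_CLIENT.
struct TrClient {
    std::string client_name;
    int sequence_number = 0;
    int status = 0;               // 0 = not finished yet, as in the dump
    std::vector<TrClient*> pred;  // non-owning links to predecessors
};

// Hypothetical owner of all clients taking part in one transition.
class TrState {
public:
    // Adds a client and links it to every existing client with a lower
    // sequence number. The returned pointer stays valid while the
    // TrState object is alive, because the vector owns heap objects.
    TrClient* add(const std::string& name, int sequence_number) {
        std::lock_guard<std::mutex> lock(mutex_);
        std::unique_ptr<TrClient> c(new TrClient);
        c->client_name = name;
        c->sequence_number = sequence_number;
        for (const auto& other : clients_) {
            if (other->sequence_number < sequence_number)
                c->pred.push_back(other.get());
        }
        clients_.push_back(std::move(c));
        return clients_.back().get();
    }

    void set_status(TrClient* c, int status) {
        assert(c != nullptr);
        std::lock_guard<std::mutex> lock(mutex_);
        c->status = status;
    }

    // Returns the index of the first predecessor that is still running,
    // or -1 if all predecessors have finished.
    int first_unfinished_pred(const TrClient* c) {
        assert(c != nullptr);
        std::lock_guard<std::mutex> lock(mutex_);
        for (std::size_t i = 0; i < c->pred.size(); i++) {
            if (c->pred[i]->status == 0)
                return static_cast<int>(i);
        }
        return -1;
    }

private:
    std::mutex mutex_;
    std::vector<std::unique_ptr<TrClient>> clients_;
};
```

With this layout the predecessor count and the predecessor array cannot disagree, since both come from the same `std::vector`.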

Core was generated by `mhttpd -D'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000004a4994 in cm_transition_call (param=0x7f3b8807da50) at /home/agdaq/packages/midas/src/midas.cxx:4229
4229 if (tr_client->pred[i]->status == 0) {
Missing separate debuginfos, use: debuginfo-install cyrus-sasl-lib-2.1.26-23.el7.x86_64 glibc-2.17-317.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-50.el7.x86_64 libcom_err-1.42.9-19.el7.x86_64 libcurl-7.29.0-59.el7_9.1.x86_64 libgcc-4.8.5-44.el7.x86_64 libidn-1.28-4.el7.x86_64 libselinux-2.5-15.el7.x86_64 libssh2-1.8.0-4.el7.x86_64 libstdc++-4.8.5-44.el7.x86_64 libtool-ltdl-2.4.2-22.el7_3.x86_64 nspr-4.25.0-2.el7_9.x86_64 nss-3.53.1-3.el7_9.x86_64 nss-softokn-freebl-3.53.1-6.el7_9.x86_64 nss-util-3.53.1-1.el7_9.x86_64 openldap-2.4.44-22.el7.x86_64 openssl-libs-1.0.2k-21.el7_9.x86_64 pcre-8.32-17.el7.x86_64 sqlite-3.7.17-8.el7_7.1.x86_64 unixODBC-2.3.1-14.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0  0x00000000004a4994 in cm_transition_call (param=0x7f3b8807da50) at /home/agdaq/packages/midas/src/midas.cxx:4229
#1  0x00007f3c47edaea5 in start_thread () from /lib64/libpthread.so.0
#2  0x00007f3c468b796d in clone () from /lib64/libc.so.6
(gdb) l
4224 if (tr_client->async_flag & TR_MTHREAD && tr_client->pred) {
4225 while (1) {
4226 int wait_for = -1;
4227
4228 for (i = 0; i < tr_client->n_pred; i++) {
4229 if (tr_client->pred[i]->status == 0) {
4230 wait_for = i;
4231 break;
4232 }
4233
(gdb) p i
$1 = 0
(gdb) p tr_client
$2 = (TR_CLIENT *) 0x7f3b8807da50
(gdb) p *tr_client
$3 = {transition = 1, run_number = 5707, async_flag = 8, debug_flag = 0, sequence_number = 910, pred = 0x7f3b8800a430, n_pred = 26, host_name = "alphagdaq.cern.ch", '\000' <repeats 238 times>, client_name = "fectrl", '\000' <repeats 25 times>, port = 46741, key_name = "7144", '\000' <repeats 27 times>, status = 0, errorstr = '\000' <repeats 1023 times>, init_time = 4170940539, waiting_for_client = "feevb", '\000' <repeats 26 times>, connect_timeout = 0, connect_start_time = 0, connect_end_time = 0, rpc_timeout = 0, rpc_start_time = 0, rpc_end_time = 0, end_time = 0}
(gdb) p *tr_client->pred
$4 = (PTR_CLIENT) 0x0
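The last gdb command shows that the first predecessor slot is NULL although n_pred is 26, and with i = 0 that is exactly the entry dereferenced at line 4229. A hedged sketch of a defensive version of that loop follows. The struct is a cut-down, hypothetical stand-in for `TR_CLIENT`, `first_waiting_pred` is an invented helper, and skipping NULL slots is an assumption, not necessarily how the crash was fixed in MIDAS.

```cpp
#include <cassert>

// Hypothetical, cut-down stand-in for TR_CLIENT; only the fields used here.
struct TrClient {
    int status;       // 0 means the client has not finished yet
    TrClient** pred;  // array of predecessor pointers
    int n_pred;       // number of entries in pred
};

// Returns the index of the first predecessor that is still running,
// or -1 if there is none. NULL slots are skipped instead of being
// dereferenced; whether skipping is the right policy is an assumption.
int first_waiting_pred(const TrClient* c) {
    assert(c != nullptr);
    if (c->pred == nullptr)
        return -1;
    for (int i = 0; i < c->n_pred; i++) {
        const TrClient* p = c->pred[i];
        if (p == nullptr)
            continue;  // the kind of slot that was dereferenced in the dump
        if (p->status == 0)
            return i;
    }
    return -1;
}
```

A check like this only hides the symptom. It does not explain why the slot was empty in the first place.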

Comments (2)

  1. dd1

    This crash is fixed, but there could still be trouble if threads do not finish by the end of the transition and then try to access stale TrClient pointers. K.O.
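One possible way to make a late thread harmless, sketched under the assumption that clients are held in `std::shared_ptr` (nothing in this issue says MIDAS does this). The thread keeps a `std::weak_ptr` and must lock it before touching the client. `TrClient` is again a cut-down stand-in and `try_set_status` is a hypothetical helper.

```cpp
#include <atomic>
#include <cassert>
#include <memory>

// Hypothetical, cut-down stand-in for the transition client object.
struct TrClient {
    std::atomic<int> status{0};  // 0 = not finished yet
};

// What a late thread would call instead of writing through a raw pointer.
// Returns false if the client has already been destroyed, so the caller
// can simply give up.
bool try_set_status(const std::weak_ptr<TrClient>& weak, int status) {
    assert(status != 0);  // 0 is reserved for "not finished yet"
    std::shared_ptr<TrClient> c = weak.lock();
    if (!c)
        return false;     // transition already cleaned up
    c->status.store(status);
    return true;
}
```

While the transition is active the owner holds the `std::shared_ptr` and the call succeeds. Once the owner releases it, `lock()` returns an empty pointer and the late thread never touches freed memory.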
