history plot 100% cpu usage

Issue #236 resolved
dd1 created an issue

(This is a tracker issue for the problem I see in the agmini system with “midas-2020-03-a-98”. I need to confirm this problem still exists in midas-2020-07.)

Go to the history page for trigger rates https://daq16.triumf.ca/?cmd=History&group=Trigger&panel=SC32A and press the “-” button many times without waiting: the page goes unresponsive and the task manager shows 100% CPU usage. If you wait for the data to update between presses of the “-” button, the page remains responsive, but at a time scale of roughly 10-14 days, CPU usage is 14-25%. (This plot has 32 variables.)

A similar situation occurs with https://daq16.triumf.ca/?cmd=History&group=Trigger&panel=Clk625 (a plot with 1 variable).

K.O.

Comments (17)

  1. Stefan Ritt

    I tried the same and actually made mhttpd crash. Here is the stack trace:

    Program received signal SIGABRT, Aborted.
    [Switching to Thread 0x7fffeb7fe700 (LWP 24377)]
    0x00007ffff56ef377 in raise () from /lib64/libc.so.6
    Missing separate debuginfos, use: debuginfo-install cyrus-sasl-lib-2.1.26-23.el7.x86_64 glibc-2.17-292.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-37.el7_7.2.x86_64 libcom_err-1.42.9-16.el7.x86_64 libcurl-7.29.0-54.el7_7.1.x86_64 libgcc-4.8.5-39.el7.x86_64 libidn-1.28-4.el7.x86_64 libselinux-2.5-14.1.el7.x86_64 libssh2-1.8.0-3.el7.x86_64 libstdc++-4.8.5-39.el7.x86_64 mariadb-libs-5.5.64-1.el7.x86_64 nspr-4.21.0-1.el7.x86_64 nss-3.44.0-7.el7_7.x86_64 nss-softokn-freebl-3.44.0-8.el7_7.x86_64 nss-util-3.44.0-4.el7_7.x86_64 openldap-2.4.44-21.el7_6.x86_64 openssl-libs-1.0.2k-19.el7.x86_64 pcre-8.32-17.el7.x86_64 sqlite-3.7.17-8.el7.x86_64 zlib-1.2.7-18.el7.x86_64
    (gdb) where
    #0  0x00007ffff56ef377 in raise () from /lib64/libc.so.6
    #1  0x00007ffff56f0a68 in abort () from /lib64/libc.so.6
    #2  0x00000000004fbd45 in HsFileSchema::read_data (this=0x7fffac57b970, start_time=1578276575, 
        end_time=1586570975, num_var=5, var_schema_index=0x7fffd8001940, var_index=<optimized out>, 
        debug=0, buffer=0x7fffd8003220) at /home/meg/midas/src/history_schema.cxx:2206
    #3  0x0000000000504205 in SchemaHistoryBase::hs_read_buffer (this=0x7fffac0019f0, 
        start_time=1578276575, end_time=1586570975, num_var=5, event_name=<optimized out>, 
        var_name=<optimized out>, var_index=0x7fffd8001510, buffer=0x7fffd8003220, 
        hs_status=0x7fffd8002d80) at /home/meg/midas/src/history_schema.cxx:2983
    #4  0x00000000004e4c44 in js_hs_read_arraybuffer (params=<optimized out>)
        at /home/meg/midas/src/mjsonrpc.cxx:2359
    #5  0x00000000004e2b93 in mjsonrpc_handle_request (request=<optimized out>)
        at /home/meg/midas/src/mjsonrpc.cxx:4221
    #6  0x00000000004e30b5 in mjsonrpc_decode_post_data (post_data=<optimized out>)
        at /home/meg/midas/src/mjsonrpc.cxx:4314
    #7  0x000000000045caf6 in thread_mjsonrpc (w=0xc15cd0, nc=0xc16330, this=<optimized out>)
        at /home/meg/midas/progs/mhttpd.cxx:18096
    #8  thread_work_function (w=0xc15cd0, nc=0xc16330) at /home/meg/midas/progs/mhttpd.cxx:18172
    #9  mongoose_thread (to=0xc164c0) at /home/meg/midas/progs/mhttpd.cxx:18929
    #10 0x00007ffff6054070 in ?? () from /lib64/libstdc++.so.6
    #11 0x00007ffff7746ea5 in start_thread () from /lib64/libpthread.so.0
    #12 0x00007ffff57b78cd in clone () from /lib64/libc.so.6
    (gdb) 
    

    So it looks like there is a problem in js_hs_read_arraybuffer().

  2. Stefan Ritt

    I traced this problem further on my local Mac and found a severe memory leak. At history.cxx:1574 we jump to the label “nextday”, but the file handles “fh”, “fhd” and “fhi” are not closed first. If this happens too often, mhttpd ends up with hundreds of files open and crashes after running out of file descriptors. I fixed this and committed it to develop.
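
    To illustrate the pattern (a TypeScript/Node.js analog with hypothetical helper names, not the actual C++ code in history.cxx): three files are opened per day inside a loop, and the early jump to the next day used to skip the close calls, leaking three descriptors per skipped day. A try/finally-style cleanup fixes it:

    import * as fs from "fs";

    // Hypothetical stand-ins for the real per-day processing.
    function dayHasRequestedEvent(fhd: number): boolean { return fhd >= 0; }
    function readDayData(fh: number, fhi: number): void { /* ... */ }

    function scanDays(dayFiles: [string, string, string][]): void {
      for (const [hst, def, idx] of dayFiles) {
        const fh = fs.openSync(hst, "r");   // history data file
        const fhd = fs.openSync(def, "r");  // definition file
        const fhi = fs.openSync(idx, "r");  // index file
        try {
          if (!dayHasRequestedEvent(fhd)) {
            continue; // before the fix: fh/fhd/fhi leaked at this "nextday" jump
          }
          readDayData(fh, fhi);
        } finally {
          // the fix: always close, however we leave this iteration
          fs.closeSync(fh);
          fs.closeSync(fhd);
          fs.closeSync(fhi);
        }
      }
    }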

    This of course only happens with the “MIDAS” history, which I tried for this test, while the crash in the previous post happened on our megon linux system using the “FILE” history.

  3. Stefan Ritt

    I put code on the history page that prevents pressing the “-” button while a previous update has not yet finished. Please check whether that prevents the page from freezing up. But I would still like to understand the crash above.
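
    The idea is a simple in-flight guard; here is a minimal sketch in TypeScript with hypothetical names (not the actual mhistory.js code):

    let updatePending = false;

    function onZoomOutButton(): void {
      if (updatePending) return; // previous update still running: ignore the press
      updatePending = true;
      loadHistoryData()
        .then(() => redrawPlot())
        .finally(() => { updatePending = false; }); // re-arm the button
    }

    // Hypothetical stand-ins for the real data load and redraw.
    function loadHistoryData(): Promise<void> { return Promise.resolve(); }
    function redrawPlot(): void { /* ... */ }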

  4. dd1 reporter

    The protection of the “-” button works! But I cannot fully test it because of the DWORD crash, bug 239. K.O.

  5. dd1 reporter

    Ok, with the mhttpd bottleneck fixed, I now get back to the initial problem. If I zoom out far enough, the history display page goes unresponsive and shows 100% CPU usage. On the PWB p2v page this happens roughly at the 1-year time scale (Oct 2019 to today, July 2020). After grinding away for some time, the data eventually loads. https://daq16.triumf.ca/?cmd=History&group=PWB&panel=v_p2

    The old history plots did not have any trouble displaying history at 1-year, 2-year, etc. time scales, while the new history plot just slows down until it becomes unresponsive. I think I looked at this in the past and found the bottleneck in copying the newly received data into the front of the internal data arrays.

    One solution could be to add a “cancel zoom out” button: I zoom out too far, there is too much data, the browser gets bogged down, I press the “cancel” button, and all is good again.

    K.O.

  6. dd1 reporter

    But the bottleneck (the place with 100% CPU usage) is after the RPC request has finished: from then on, for a very long time, the history plot code moves the data around (shifts the array forward, inserts new data at the front, etc.). This is very easy to see with the JavaScript profiler.

    And that should be fixable. For example, there could be a timer: if copying the data takes longer than 10 seconds, give up.

    I still think a better solution is to load the data in smaller chunks; then the web page does not go unresponsive for as long, and the user can see “progress” as more and more data is loaded and becomes visible.
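
    A minimal sketch of the chunked-loading idea in TypeScript, with hypothetical names (not the actual mhistory.js code); it also shows where the give-up timer mentioned above could hook in:

    const CHUNK_SECONDS = 24 * 3600;   // e.g. one day of data per request

    async function loadRange(tStart: number, tEnd: number): Promise<void> {
      const deadline = Date.now() + 10000; // the "give up after 10 s" idea
      for (let t = tEnd; t > tStart; t -= CHUNK_SECONDS) {
        if (Date.now() > deadline) break;  // too slow: stop loading more data
        const chunkStart = Math.max(tStart, t - CHUNK_SECONDS);
        const chunk = await fetchHistoryChunk(chunkStart, t); // one RPC per chunk
        prependChunk(chunk);  // merge into the plot's internal arrays
        redrawPlot();         // the user sees progress after every chunk
        await new Promise(r => setTimeout(r, 0)); // yield to the event loop
      }
    }

    // Hypothetical stand-ins for the real RPC and plot code.
    function fetchHistoryChunk(t0: number, t1: number): Promise<Float64Array> {
      return Promise.resolve(new Float64Array(0));
    }
    function prependChunk(chunk: Float64Array): void { /* ... */ }
    function redrawPlot(): void { /* ... */ }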

    But I am not sure even that is quite scalable. When I zoom out to a time scale of 1-2 years, things like panning become slow (and for me, panning always happens by accident, when the mouse points at the plot and the mouse button is unintentionally pressed).

    We should reduce the amount of data handled by the browser by using binned data (which carries the minimum and maximum values for each bin, so spikes do not disappear).
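
    A minimal sketch of such min/max binning in TypeScript (an illustration of the idea, not MIDAS code):

    interface Bin { t: number; min: number; max: number; }

    // Reduce N raw points to at most nBins bins, keeping the minimum and
    // maximum of each bin so short spikes survive the reduction.
    function binData(t: Float64Array, v: Float64Array, nBins: number): Bin[] {
      const bins: Bin[] = [];
      const t0 = t[0];
      const dt = (t[t.length - 1] - t0) / nBins;
      let i = 0;
      for (let b = 0; b < nBins; b++) {
        const tHi = t0 + (b + 1) * dt;
        let lo = Infinity, hi = -Infinity;
        for (; i < t.length && t[i] <= tHi; i++) {
          if (v[i] < lo) lo = v[i];
          if (v[i] > hi) hi = v[i];
        }
        if (lo <= hi) bins.push({ t: tHi - dt / 2, min: lo, max: hi }); // skip empty bins
      }
      return bins;
    }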

    K.O.

  7. dd1 reporter

    I can zoom out to Apr 2019: https://daq16.triumf.ca/?cmd=History&group=PWB&panel=v_p2&A=1551991031&B=159622781

    (At home) loading takes forever; network transfers run between 2 and 20 Mbytes/sec. CPU load is high, but no longer pegged at 100%. Memory use is high and I see GC activity: it grows to 1.2 GB, drops to 800 Mbytes, etc. The next zoom-out crashes the browser tab.

    (Update) Nope, the linked page does not fully load. It loads all the data visible on the plot, “waiting for data” is off, the spinning wheel is off, memory use is 1.6 GB, and then the browser tab crashes. I guess the page continues to load more data at that point.

    K.O.

  8. Stefan Ritt

    Cannot confirm. It is easy to go to April 2019; CPU never goes above 50-60%, memory 1.9 GB.

    Is that the well-known “buy a bigger laptop” issue?

  9. dd1 reporter

    Same thing in the det fac.

    I can zoom out to December 2019; past that, the browser tab crashes (actually, there is an “out of memory” exception).

    Data load speed is 20-30 Mbytes/sec and memory use is 4 GBytes (per the Google Chrome “task manager”). The actual memory heap snapshot is difficult to attach here; it does not cut-and-paste cleanly.

    K.O.

  10. dd1 reporter

    A bigger laptop I cannot buy; 64 GB RAM is about as big as they come.

    This is what I have here: the main midas machine daq16 (32 GB RAM, quad-core 3.6 GHz CPU), the browser machine daq01 (64 GB RAM, quad-core 3.6 GHz CPU), and a 1gige network between them.

    This is what I have at home: a MacBook Air 2020 with 16 GB RAM and the fastest residential internet money can buy in my part of Vancouver (about 10 Mbytes/sec max).

    K.O.

  11. dd1 reporter

    Memory use reported by the Google Chrome task manager (NOT by “top” or by the macOS performance monitor):

    • MHistoryGraph: 3 GB = 1.3 GB “data” array, 4x 0.4 GB “x/y/t/v” arrays
    • unaccounted for under “arrays”: about 30x 80 Mbytes (too many to count by hand exactly)
    • total reported by the Google Chrome task manager: about 4 GBytes

    K.O.

  12. dd1 reporter

    After restoring the code to reuse t1 and v1 in receiveData() and zooming to Dec 2019, memory use is 2.4 GBytes:

    • MHistoryGraph: 2.4 GB = 1.3 GB “data” array, 4x 0.281 GB “x/y/t/v” arrays, 2x 4.6 Mbytes “t1/v1” arrays
    • unaccounted for under “arrays”: still some (smaller) number of 80-Mbyte and 35-Mbyte objects
    • total: 2.4 GBytes

    K.O.

  13. dd1 reporter

    Conclusion for now, after looking at the heap profiles and stack traces: the unaccounted arrays are the result of a.concat() being used to prepend newly loaded data to the front of the “data” array (see the sketch below).
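
    To illustrate why this shows up in the heap profile (a TypeScript sketch with hypothetical names, not the actual mhistory.js code): prepending each chunk with concat() copies the whole existing array every time, so loading K chunks costs O(K·N) copying and leaves the discarded copies behind for the garbage collector. One alternative is to collect the chunks and flatten them once at the end:

    // Current pattern: a full copy of (chunk + data) is allocated per call.
    function prependWithConcat(data: number[], chunk: number[]): number[] {
      return chunk.concat(data);
    }

    // Alternative: accumulate the chunks as they arrive, then build the final
    // array with a single copy when loading finishes.
    function assembleOnce(chunks: number[][]): number[] {
      chunks.reverse();     // chunks arrive newest-to-oldest while zooming out
      return chunks.flat(); // one copy total instead of one copy per chunk
    }

    K.O.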
