history plot 100% cpu usage
(this is a tracker issue for the problem I see in the agmini system with “midas-2020-03-a-98”. I need to confirm this problem still exists in midas-2020-07).
Go to the history page for trigger rates https://daq16.triumf.ca/?cmd=History&group=Trigger&panel=SC32A and press the “-” button many times without waiting; the page goes unresponsive and the task manager shows 100% CPU usage. If you wait for the data to update between presses of the “-” button, the page remains responsive, but at a time scale of roughly 10-14 days, CPU usage is 14-25%. (This plot has 32 variables.)
A similar situation occurs with https://daq16.triumf.ca/?cmd=History&group=Trigger&panel=Clk625 (a plot with 1 variable).
K.O.
-
I traced this problem further on my local Mac and found a severe memory leak. In history.cxx:1574 we jump to the label “nextday”, but the file handles “fh”, “fhd” and “fhi” are not closed. If we do this too often, mhttpd ends up with hundreds of files open and crashes after running out of file descriptors. I fixed this and committed to develop.
This of course only happens with the “MIDAS” history, which I used for this test, while the crash in the previous post happened on our megon linux system using the “FILE” history.
-
I put code on the history page that prevents pressing the ‘-’ button if a previous update has not yet finished. Please check whether that prevents the page from freezing up. But I would still like to understand the crash above.
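The guard described above might look roughly like this. This is a minimal sketch, assuming promise-based requests; all names here (updateInProgress, loadMoreHistory, onZoomOutClick) are hypothetical, not the actual history.js code:

```javascript
// Sketch of a re-entry guard for the "-" (zoom out) button.
// While one update is in flight, further clicks are ignored.
let updateInProgress = false;
let requestCount = 0; // for illustration: counts actual requests issued

// Stand-in for the asynchronous history data request.
function loadMoreHistory() {
  requestCount += 1;
  return new Promise((resolve) => setTimeout(resolve, 10));
}

function onZoomOutClick() {
  if (updateInProgress) return; // ignore clicks while a request is pending
  updateInProgress = true;
  loadMoreHistory().finally(() => {
    updateInProgress = false;   // re-arm the button once the update finishes
  });
}
```

Pressing the button repeatedly while a request is pending then issues only one request; the button becomes effective again after the promise settles.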
-
reporter The protection of the “-” button works! But I cannot fully test it because of the DWORD crash, bug 239. K.O.
-
reporter Ok, DWORD crash fixed, now I have problem with bug 243. K.O.
-
reporter Ok, mhttpd bottleneck fixed, now I get back to the initial problem. If I zoom out far enough, the history display page goes unresponsive and shows 100% CPU usage. On the PWB v_p2 page this happens at roughly the 1-year time scale (Oct 2019 to today, July 2020). After grinding away for some time, the data eventually loaded. https://daq16.triumf.ca/?cmd=History&group=PWB&panel=v_p2
The old history plots had no trouble displaying 1-year, 2-year, etc. time scales, while the new history plots slow down until they become unresponsive. I think I looked at this in the past and found the bottleneck in copying the newly received data into the front of the internal data arrays.
One solution could be to add a “cancel zoom out” button - I zoom out too far, there is too much data, the browser gets bogged down, I press the “cancel” button, and all is good again.
K.O.
-
Problem is that an AJAX request cannot be cancelled (aborting it on the client does not stop the work already started on the server). All I can offer is a dialog box: “This request will take really really long, do you really want that?”
But first please check that the history reading is the bottleneck: https://bitbucket.org/tmidas/midas/issues/241/history-display-stuck-in-updating-data
Why are there now two issues about the same problem???
-
reporter But the bottleneck (the place with 100% CPU usage) is after the RPC request finishes: after that, for a very long time, the history plot code moves the data around, shifting the array forward, inserting new data at the front, etc. This is very easy to see with the JavaScript profiler.
And that should be fixable. For example, there could be a timer: if copying the data takes longer than 10 seconds, give up.
I still think a better solution is to load the data in smaller chunks; then the web page does not go unresponsive for as long, and the user can see progress as more and more data gets loaded and becomes visible.
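The chunked-loading idea could be sketched like this. This is a hypothetical illustration, not the MIDAS implementation; `fetchChunk` stands in for the real RPC call, and the caller supplies `onChunk` to prepend and draw each piece:

```javascript
// Hypothetical sketch: load a long time range [tStart, tEnd] in smaller
// chunks so the page can repaint between requests instead of freezing.
async function loadRange(tStart, tEnd, chunkSeconds, fetchChunk, onChunk) {
  // Walk backwards from tEnd so the most recent (visible) data arrives first.
  for (let b = tEnd; b > tStart; b -= chunkSeconds) {
    const a = Math.max(tStart, b - chunkSeconds);
    const data = await fetchChunk(a, b); // one bounded request
    onChunk(data);                       // caller prepends and redraws
    // Yield to the event loop so the UI stays responsive between chunks.
    await new Promise((r) => setTimeout(r, 0));
  }
}
```

A "cancel" button then becomes easy to add: set a flag that the loop checks before requesting the next chunk.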
But I am not sure even that is quite scalable. When I zoom out to a time scale of 1-2 years, things like panning become slow (and for me panning always happens by accident, when the mouse points at the plot and the mouse button gets pressed unintentionally).
We should reduce the amount of data handled by the browser by using binned data (which keeps the minimum and maximum values for each bin, so spikes do not disappear).
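A min/max binning scheme along these lines could look as follows. This is a hypothetical client-side sketch for illustration (`binMinMax` is not a MIDAS function); in practice the binning would presumably happen on the server before the data is shipped to the browser:

```javascript
// Hypothetical sketch of min/max binning: reduce n samples (t[i], v[i])
// to nBins bins while keeping each bin's extremes, so that short spikes
// remain visible after decimation.
function binMinMax(t, v, nBins) {
  const t0 = t[0];
  const t1 = t[t.length - 1];
  const width = (t1 - t0) / nBins || 1; // avoid zero width for a single point
  const bins = Array.from({ length: nBins }, () => null);
  for (let i = 0; i < t.length; i++) {
    const k = Math.min(nBins - 1, Math.floor((t[i] - t0) / width));
    if (!bins[k]) {
      bins[k] = { t: t0 + (k + 0.5) * width, min: v[i], max: v[i] };
    } else {
      bins[k].min = Math.min(bins[k].min, v[i]);
      bins[k].max = Math.max(bins[k].max, v[i]);
    }
  }
  return bins.filter((b) => b !== null); // drop empty bins
}
```

Plotting each bin as a vertical min-max band keeps the display faithful while the browser only ever handles nBins points, independent of the time range.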
K.O.
-
reporter Can zoom out to Apr 2019: https://daq16.triumf.ca/?cmd=History&group=PWB&panel=v_p2&A=1551991031&B=159622781
(at home) Load time takes forever; network transfers between 2 and 20 Mbytes/sec. CPU load is high, but no longer pegged at 100%. Memory use is high and I see GC activity: it grows to 1.2 GB, drops to 800 Mbytes, etc. The next zoom-out crashes the browser tab.
(update) Nope, the linked page does not fully load. It loads all the data visible on the plot, “waiting for data” is off, the spinning wheel is off, memory use is 1.6 GB, then the browser tab crashes. I guess the page continues to load more data at that point.
K.O.
-
reporter Can zoom out to December 2019, past that, browser tab crashes.
https://daq16.triumf.ca/?cmd=History&group=PWB&panel=v_p2&A=1574453333&B=1596571733
CPU use is on the high side; memory use is much lower now (460 Mbytes). But I still see a lot of GC activity: memory use goes up to 1+ GBytes, then drops to 400-600 MBytes, etc. I think the big memory use happens when we prepend old data in front of the big data array. K.O.
-
Cannot confirm. Easy to go back to April 2019; CPU never goes above 50-60%. Memory 1.9 GB.
Is that the well-known “buy a bigger laptop” issue?
-
reporter Same thing in the det fac.
Can zoom out to December 2019; past that, the browser tab crashes (actually, there is an “out of memory” exception).
Data load speed is 20-30 Mbytes/sec; memory use is 4 GBytes (per the google chrome “task manager”). The actual memory heap snapshot is difficult to attach here; it does not cut-and-paste cleanly.
K.O.
-
reporter A bigger laptop I cannot buy; 64 GB RAM is about as big as they come.
This is what I have here: main midas machine daq16 - 32 GB RAM, quad core 3.6 GHz CPU, browser machine daq01 - 64 GB RAM, quad core 3.6 GHz CPU, 1gige network between them. K.O.
This is what I have at home: macbook air 2020, 16 GB RAM, fastest residential internet money can buy in my part of Vancouver (about 10 Mbytes/sec max).
K.O.
-
reporter memory use reported by the google chrome task manager (NOT by “top” or by macos performance monitor):
- MHistoryGraph: 3GB = 1.3 GB “data” array, 4x 0.4 GB “x/y/t/v” arrays
- unaccounted for under “arrays”: 30x 80 Mbytes (around 30, too many to count by hand).
- total reported by google chrome task manager: about 4 Gbytes
K.O.
-
reporter after restoring the code to reuse t1, v1 in receiveData(), zoom to Dec 2019, memory use 2.4 GBytes
- MHistoryGraph: 2.4 GB = 1.3 GB “data” array, 4x 0.281 GB “x/y/t/v” arrays, 2x 4.6 Mbytes “t1/v1” arrays
- unaccounted for under “arrays”: still some (smaller) number of 80 Mbyte and 35 Mbyte objects.
- total 2.4 GBytes.
K.O.
-
reporter conclusion for now after looking at the heap profiles and stack traces - unaccounted arrays are result of a.concat() used to prepend newly loaded data to the front of the “data” array. K.O.
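One concat-free alternative to `data = chunk.concat(data)` would be to preallocate a buffer once and fill it from the back, so prepending an older chunk copies only that chunk instead of reallocating and copying the whole array on every load. This is a sketch under the assumption of flat numeric sample data; `PrependBuffer` is a hypothetical name, not part of the MIDAS code:

```javascript
// Hypothetical alternative to repeated a.concat(): one preallocated
// typed-array buffer, filled from the back as older chunks arrive.
class PrependBuffer {
  constructor(capacity) {
    this.buf = new Float64Array(capacity);
    this.start = capacity; // valid data occupies buf[start .. capacity)
  }
  prepend(chunk) {
    // chunk holds older samples, oldest first; copy it just in front
    // of the existing data. One bounded copy, no reallocation, no GC churn.
    this.start -= chunk.length;
    this.buf.set(chunk, this.start);
  }
  values() {
    // View of the current contents, oldest first (no copy).
    return this.buf.subarray(this.start);
  }
}
```

The capacity would be sized from the requested time range; if the user zooms out past it, a single reallocation replaces the many intermediate copies that concat() produces.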
-
reporter - changed status to resolved
Closing this bug. If history plot page memory use becomes a problem again, open a new bug. K.O.
-
I tried the same, and actually made mhttpd crash. Here is the stack dump:
So it looks like there is a problem in js_hs_read_arraybuffer().