midas status page memory leak

Issue #158 resolved
dd1 created an issue

I have the midas status page (https://daq16.triumf.ca/) open in macos google chrome 71.0.3578.98 and I watch in the "task manager" how the memory use is 246 Mbytes and growing at around 1 Mbyte every 2-3 seconds. CPU use is around 3-5%, network use is 47 kBytes/sec. The slowly growing memory use indicates that we have a memory leak. (Note that javascript uses "automatic garbage collection" memory management, which does not eliminate memory leaks. Only capability to explicitly free unused memory is eliminated). K.O.

Comments (35)

  1. dd1 reporter

    Actual memory use goes up to around 250-something MBytes, then drops down to 240-something, them slowly grows back up, drops down, rinse, repeat. This is the javascript garbage collection in action. So there is no memory leak on the status page, but still why do we generate around 1 Mbyte/sec of javascript memory allocations? As comparison, the NYTimes front page consumes 270 Mbytes. One would expect the midas front page to be much more light weight... K.O.

  2. dd1 reporter

    Then there is a question of memory use by the "message" page. This page does grow infinitely large by design - as new messages are added to midas.log - as as the user keeps scrolling the messages back in time. Perhaps we should somehow limit the total memory use there... K.O.

  3. Stefan Ritt

    I see the same behaviour. The relatively large memory allocation by Chrome probably comes from some bitmap caching. The browser prints the page contents into some temporary bitmap and then flushes it to the screen. That can easily take a few MB. I monitor such behaviour since several years now (for other processes) and concluded that I don't need to worry about JavaScript memory consumption.

    Concerning the messages page: One line takes about 100 Bytes. If you scroll really fast, you can do maybe 30 lines per second, thus 3kB. If we allow the browser to consume another 100 MB (should be easily possible these days), you have to continuously scroll for 100000kb/3kb=30000 seconds or eight hours. Good luck!

    Closing this topic if no complaints.

  4. dd1 reporter

    See high memory use from long running (days-weeks) web pages:

    • status page of my test experiment - 953 MB - 155 MB after reload
    • odb editor - 661 MB - 80 MB after reload
    • programs page - 602 MB - 64 MB after reload
    • sequencer - 253 MB - 151 MB after reload
    • sequencer - ??? (very big) - reloaded before I wrote it down

    I think we are leaking memory somewhere. Or causing unnecessary allocations that the javascript garbage collector does not keep up with or does not cleanup correctly. K.O.

  5. dd1 reporter

    I am suspicious of memory use trouble from periodic-update code that keeps setting innerHTML to the same value as it was before, unnecessarily. (this also causes other problems - cannot cut-and-paste affected parts of the web page, high cpu use to redraw the (unchanged) page). K.O.

  6. Stefan Ritt

    For setting innerHTML we should always use

    if (text !== control.inner HTML)

    control.innerHTML = text;

    I thought I caught most of the cases, but I might have missed some. Please add as needed.

    Stefan

  7. dd1 reporter

    Strange things continue. Just say huge CPU usage from 3 midas web pages (odb editor, programs page and the new sequencer page). All 3 pages are tabs in an iconized browser window. Suddenly machine feels slow, and I see all 3 use 25% CPU each (by the chrome-browser task manager window). Opened the browser window, sent to the offending tabs, nothing looks amiss, CPU usage went back to 0%. WTH? (all 3 pages have 100 Mbyte memory use, all 3 pages update at 1 Hz). K.O.

  8. dd1 reporter

    looked at the “programs” page. learned how to use the google-chrome “performance” tool. I was definitely leaking html nodes. The leak was in an unexpected place - innerHTML with a link was miscomparing because of unexpected string transformation:

    xbad: "<a href='?cmd=odb&odb_path=System/Clients/" + key + "'>" + host + "</a>";
    good: "<a href=\"?cmd=odb&amp;odb_path=System/Clients/" + key + "\">" + host + "</a>";
    

    Now node leak from my periodic update went from 35 nodes to 2 nodes per update. The performance tool fails to identify where these last 2 nodes are coming from.

    K.O.

  9. dd1 reporter

    Forgot to add - the periodic update from mhttpd_init() is also leaking nodes. I will look at it some other time. K.O.

  10. dd1 reporter

    after improvement to the “programs” page, the tab is staying at 50-60 Mbytes. promising… K.O.

  11. dd1 reporter

    Fixed node leak in mhttpd_refresh(): the alarm display was setting e.innerHTML even if it did not change.

    There only remains an unavoidable node leak with “mheader_last_updated” where we set the current time every 1 second. If I comment this out, there is no node leak on the “programs” page.

    K.O.

  12. dd1 reporter

    implemented this for the timestamp update and (i.e.) the “programs” page now leaks 0 nodes. memory use for all pages sits around 40-60 Mbytes. K.O.

  13. dd1 reporter

    see problem of high cpu usage again, after google-chrome restarted after an update to latest version. for example, “program” page is 65 Mbytes, uses 20% CPU. (in an inactive tab). If I open this tab, for maybe 10 seconds, it goes to 100+ Mbytes with big CPU usage (>100%), then drops down to 90 Mbytes, 0% CPU usage. I do not see any other web pages or tabs doing this. Only our midas pages. WTH!?! K.O.

  14. dd1 reporter

    figured out high cpu usage reported as “rendering”. Open “devtools”, goto “performance”, press Command-Shift-P, start typing “rendering”, select “fps meter”. A black square will open in top-left, showing graphics activity (frame rate, GPU usage, etc).

    Now wait for new message to appear in the top status bar. It will be “yellow” at first, that it will fade to “gray”. During this fading, GPU use is 100% during about 1 second, FPS is about 50 frames/sec.

    K.O.

  15. dd1 reporter

    quick google search shows much discussion about css animations using “too much CPU”, i.e. google “css pulsing background”, but no clear way to tell the browser to slow down. It looks to me like the background-color animation tries to run at maximum possible frame-rate, as if electricity is free. (Since I am debugging high-cpu and high-memory use of inactive tabs, there is nobody looking at these animations). K.O.

  16. Stefan Ritt

    New messages are displayed with a yellow background and fade to grey after 5 seconds. This is handeled in mhttpd.js around line 2144. You can try to remove the lines

    d.style.setProperty("-webkit-transition", "background-color 3s", "");
    d.style.setProperty("transition", "background-color 3s", "");
    

    and see if the CPU load goes down.

  17. dd1 reporter

    captured another trace of midas page using 20% cpu in an inactive tab, iconized browser window. capturing is difficult, requires very fast mousing to: select the right tab, right-click to “inspect”, select “performance” tab, click on “start capture”, and hope that by this time the web page activity does not complete. this time I got the last 200 ms or so.

    what I see is again is “media activity” (only identified as “task”), GPU activity (only identified as “GPU activity”) and main thread activity (identified as an infinitely repeating sequence of “receive response 206 audio/mpeg”, “receive data 39287 bytes”, “finish loading”, then the same sequence again. 39287 is the file size of resources/beep.mp3. There is no corresponding network activity, so the loading of beep.mp3 must be coming from cache. On the javascript console, there are the usual “not allowed to play audio because user did not interact” messages repeating about every 1-2-3 minutes.

    I read this as: for reasons unknown, a huge number of audio requests becomes queued (the tab was inactive/iconized for many days) then they start trying to play (load beep.mp3, do not play it because “not allowed”, move on to the next audio object, load … etc). This is consistent with the cpu use, with the captured traces and with the quick growth in memory size (beep.mp3 objects are created, consume memory, cannot be free’d until garbage collector runs later. much later).

    The above scenario is impossible with how the current audio playing code is written (only one audio object can exist at a time, new audio object can only be created after the previous one finished playing).

    Two possible explanations: (a) the code running in the web page is not the same code as in mhttpd.js (running an old version from cache) or (b) the code “one audio object at time” is not working correctly if javascript code is throttled /delayed/stopped in inactive tabs.

    Following code will have this problem:

    var only_one = null;
    function foo() { /* runs from periodic timer */
     if (!only_one) {
       a = new Audio(“beep.mp3”);
       /* throttled/suspended/delayed here */
       /* multiple Audio objects created because "only_one" is still null */
       only_one = a;
     }
    }
    

    K.O.

  18. dd1 reporter

    fading background - yes, I found the code. pretty neat. I moved it around to remove the timer - I am suspicious of how the timers run in inactive tabs. but no time to study it.

    but the current problem is clearly with audio objects, and the only audio we have is the periodic playing of beep.mp3. who knew there will be so much trouble.

    there is still the unexplained use of GPU, but maybe playing/decoding mp3 files uses the GPU.

    I am also puzzled why the status page from midas-2019-03 does not show any of these problems. it just sits there using no memory (50 Mbytes) and no CPU. perhaps we changed something in the playing of audio files since last March (when midas-2019-03 was tagged).

    K.O.

  19. dd1 reporter

    For the first time I saw my message “mhttpd_alarm_play: Cannot play alarm sound: previous alarm sound did not finish playing yet” reported on the javascript console. This confirms my guess that playing of audio is actually delayed and indeed we need to check that the previous audio finished playing before creating new audio objects. But the check in the current code has a race condition. If the delay/stall is inside “new Audio()”, we will create multiple audio objects as “last_audio” is still in the “finished playing” state, we only change it after the return from “new Audio()”. K.O.

  20. dd1 reporter

    see big improvement. now inactive tabs grow from 50-ish Mbytes to 170-ish Mbytes, then when I open them, there is some cpu use (GC, I guess) and memory use drops back to 50-ish Mbytes. So we are not leaking any memory anymore. Looking at the console messages, I see that my fixes are helping - there is messages about attempts to create new Audio() when previous one did not finish yet. K.O.

  21. dd1 reporter

    I guess, inactive tabs are throttled by google-chrome so much that their GC (memory garbage collection) does not keep up with our 1/sec data updates. I do not think we need to keep updating inactive tabs at this high frequency, but I am not sure how to detect if we are active or inactive. Maybe I can detect the throttling instead. K.O.

  22. dd1 reporter

    see consistent behaviour from google-chrome:

    • have all these midas tabs open, inactive, window iconized, typical tab size is 50-ish Mbytes.
    • google-chrome update arrives
    • update is installed, all windows and tabs automatically closed, then reopened.
    • the midas tabs are still inactive, window is iconized
    • after a few days, see behaviour as described before:
    • midas tabs use 20-30% CPU, size is 100-ish Mbytes
    • if I open one of these tabs, it’s cpu usage goes up to 160%, size grows to 250-ish Mbytes, then within 5-10 seconds drops to 100 Mbytes, CPU usage goes from 160% to zero.
    • when looking at this, if I am quick enough, I can right-click “inspect”, go to the “performance” tab, and press the “start collecting data” button and I capture the very tail end of all this strange activity. This is the traces I have been describing so far.

    K.O.

  23. dd1 reporter

    see big blob of activity:

    • timer activation
    • mhttpd_message()
    • first call to mhttpd_fit_message()
    • a long cycle of (maybe 10-20) “recalculate size”, “layout”, “parse html”
    • second call to mhttpd_fit_message()
    • same long cycle of …

    The way I understand this, mhttpd_fit_message() changes the size of some html element that causes the whole window to be re-layed-out.

    K.O.

  24. dd1 reporter

    trying to figure out what triggers a long run of the “rasterizer” thread. see a very strange call sequence

    • timer fires
    • mhttpd_alarm_play()
    • mhttpdConfig()
    • “Layout”

    mhttpd_alarm_play() calls mhttpdConfig() (3 times) to find out if alarm sound is enabled, the period, the file name, etc. so far so good. but mhttpdConfig() does not touch any DOM objects, so why is it shown as calling “Layout”?!?

    other than this trace, I see nothing else that would trigger the rasterizer thread…

    (note) this time, mhttpd_alarm_play() does not call mhttpd_alarm_play_now(), so “new Audio” and stuff does not enter this picture.

    K.O.

  25. dd1 reporter

    in the early part of the trace, where I think the meat of “tab is using cpu and memory” is, I see the audio events firing in rapid sequence: loadeddata, canplay, canplaythrough, rinse, repeat.

    It turns out that promise rejection from audio.play() does not stop the loading of the sound file. This is easy to see by attaching the event handlers to these events and by observing these event handlers print something to the javascript console.

    If that is what is happening, it explains what I see: all my previous attempts to prevent the piling up of sound files are unsuccessful, and when I open the previously inactive tab, all the queued sound files start loading (and not playing per “user did not interact” policy).

    google docs suggest using audio.src=”” to cancel loading of sound files and it does seem to work. testing it now.

    K.O.

  26. dd1 reporter

    gotcha. came back home, found one tab using about 10% cpu. audio.src=”” is commented out, javascript console is full of this:

    Thu Nov 07 2019 22:17:30 GMT-0800 (Pacific Standard Time): mhttpd_audio_loadeddata: counter 234
    mhttpd.js:2763 Thu Nov 07 2019 22:17:30 GMT-0800 (Pacific Standard Time): mhttpd_audio_canplay: counter 234
    mhttpd.js:2767 Thu Nov 07 2019 22:17:30 GMT-0800 (Pacific Standard Time): mhttpd_audio_canplaythrough: counter 234
    mhttpd.js:2759 Thu Nov 07 2019 22:17:30 GMT-0800 (Pacific Standard Time): mhttpd_audio_loadeddata: counter 235
    mhttpd.js:2763 Thu Nov 07 2019 22:17:30 GMT-0800 (Pacific Standard Time): mhttpd_audio_canplay: counter 235
    mhttpd.js:2767 Thu Nov 07 2019 22:17:30 GMT-0800 (Pacific Standard Time): mhttpd_audio_canplaythrough: counter 235
    mhttpd.js:2759 Thu Nov 07 2019 22:17:30 GMT-0800 (Pacific Standard Time): mhttpd_audio_loadeddata: counter 236
    mhttpd.js:2763 Thu Nov 07 2019 22:17:30 GMT-0800 (Pacific Standard Time): mhttpd_audio_canplay: counter 236
    mhttpd.js:2767 Thu Nov 07 2019 22:17:30 GMT-0800 (Pacific Standard Time): mhttpd_audio_canplaythrough: counter 236
    mhttpd.js:2759 Thu Nov 07 2019 22:17:30 GMT-0800 (Pacific Standard Time): mhttpd_audio_loadeddata: counter 237
    mhttpd.js:2763 Thu Nov 07 2019 22:17:30 GMT-0800 (Pacific Standard Time): mhttpd_audio_canplay: counter 237
    mhttpd.js:2767 Thu Nov 07 2019 22:17:30 GMT-0800 (Pacific Standard Time): mhttpd_audio_canplaythrough: counter 237
    mhttpd.js:2759 Thu Nov 07 2019 22:17:30 GMT-0800 (Pacific Standard Time): mhttpd_audio_loadeddata: counter 238
    mhttpd.js:2763 Thu Nov 07 2019 22:17:30 GMT-0800 (Pacific Standard Time): mhttpd_audio_canplay: counter 238
    mhttpd.js:2767 Thu Nov 07 2019 22:17:30 GMT-0800 (Pacific Standard Time): mhttpd_audio_canplaythrough: counter 238
    mhttpd.js:2759 Thu Nov 07 2019 22:17:30 GMT-0800 (Pacific Standard Time): mhttpd_audio_loadeddata: counter 239
    mhttpd.js:2763 Thu Nov 07 2019 22:17:30 GMT-0800 (Pacific Standard Time): mhttpd_audio_canplay: counter 239
    mhttpd.js:2767 Thu Nov 07 2019 22:17:30 GMT-0800 (Pacific Standard Time): mhttpd_audio_canplaythrough: counter 239
    

    the timestamp is exactly when I opened this tab. so confirmed, a whole bunch of audio files got queued, when I open the tab they all try to play. (there is no actual sound, all tabs are muted).

    now I uncomment audio.src=”” and see what happens.

    K.O.

  27. dd1 reporter

    looks good. an update to google-chrome came in and after installing, I no longer see midas tabs show high cpu usage or high memory use. I think the audio.src=”” fix is it. I will be committing these fixes to midas. K.O.

  28. Stefan Ritt

    The loop in mhttpd_fit_message() is there for a good reason: I want to display the message in a single line. If it’s too long, I want first to cut the time stamp and then display it. If it’s still too long, I want to truncate the message and display “…” at the end. The problem is what “too long” is. Nobody can tell you how much pixel a message on your browser take, because this depends on the installed fonts, the exact character spacing of your browser and so on. So the only way I could make this happen is to add one char at a time, until we get close to the maximum allowed space. If course this requires a re-layout of the page for 10-20 times, but when your window is in the foreground this is not a problem, since a browser can do this with small CPU load. The “scope” application I use does 70 frames per second at 30% CPU load. So one could make the loop a bit smarter, like binary search, which would drop the 10-20 iterations to log2(10-20) ~ 4-5, but still there would be a loop.

    How that the update of the messages in the background is suppressed with the hidden API, do you still have that problem or can we consider it fixed?

    Stefan

  29. dd1 reporter

    see new behaviour - after many days, inactive page size is ~180 Mbytes. 0% CPU use (an improvement from before where there was large CPU use). activate the tab, nothing much happens, 0% CPU use (again an improvement from before). after about 30 seconds, memory use drops down to the normal 50-70-80 Mbytes. I think what we see is the garbage collector is throttled down and does not keep up with our allocations. Stefan’s new fix reducing polling in inactive pages from 1/sec to 1/10sec should help with this. K.O.

  30. dd1 reporter

    mhttpd_fit_message() - confirmed. I was confused about the function argument.

    I thought it is passed an array of messages. no, it is one message string and the loop is over the message string length. The loop is done twice (second time, with the time/date stamp removed). google-chrome debugger does show that this uses large amount of CPU, mainly to compute d.offsetWidth.

    I think I will refactor these loops - instead of growing the message, I will shrink it.

    K.O.

  31. dd1 reporter

    rewrote mhttpd_fit_message() to reduce CPU use: try to fit complete message, if too long, try to fit message without timestamp, if too long, guess the desired length assuming all chars have same width, then grow or shrink the message until the size is right. K.O.

  32. Log in to comment