mhttpd high cpu usage

Issue #80 resolved
dd1 created an issue

In the agdaq system, I see mhttpd CPU usage reach 50% as reported by top. It is not clear what is happening. I think I need to add a capability to log all RPC requests to see if somebody is requesting something silly or at too high a rate. K.O.
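
A minimal sketch of the kind of per-method request logging described above. The hook name and the way the method name is obtained are assumptions for illustration, not the actual mhttpd code; the idea is simply to count requests per method and report the rates periodically, so a client polling too aggressively stands out.

    #include <cstdio>
    #include <ctime>
    #include <map>
    #include <string>

    // Hypothetical hook: call this from the RPC dispatcher with the decoded
    // method name of every incoming request. Roughly once per second it
    // prints how often each method was requested since the last report.
    static void log_rpc_rate(const std::string& method)
    {
       static std::map<std::string, int> counts;
       static std::time_t last_report = std::time(nullptr);

       counts[method]++;

       std::time_t now = std::time(nullptr);
       std::time_t dt = now - last_report;
       if (dt >= 1) {
          for (const auto& it : counts)
             printf("rpc: %d requests in %ld sec, method [%s]\n",
                    it.second, (long)dt, it.first.c_str());
          counts.clear();
          last_report = now;
       }
    }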

Comments (13)

  1. Stefan Ritt

    I only see high CPU load on mhttpd when several people have history pages open. mhttpd then scans the history files continuously; some people set the auto-reload on those pages to a few seconds, which then causes some load on mhttpd. So check the accesses to your history files.

    Once I finish the rewrite of the history panels, this will be gone, since I will only ask for "updates since the last update", which will then be only a few numbers.

    Stefan

  2. dd1 reporter

    When looking at the status page, every second three requests go to mhttpd (I am looking at the Google Chrome network inspector; a sketch for reproducing such a batch by hand follows after the list):

    1 - {"jsonrpc":"2.0","method":"db_ls","params":{"paths":["Equipment"]},"id":1510596326945}
    2 - [{"jsonrpc":"2.0","method":"db_get_values","params":{"paths":[]},"id":1510596326959},{"jsonrpc":"2.0","method":"get_alarms","params":null,"id":1510596326959},{"jsonrpc":"2.0","method":"cm_msg_retrieve","params":{"facility":"midas","time":0,"min_messages":100},"id":1510596326959},{"jsonrpc":"2.0","method":"cm_msg_retrieve","params":{"facility":"chat","time":0,"min_messages":100},"id":1510596326959}]
    3 - [{"jsonrpc":"2.0","method":"db_get_values","params":{"paths":["/Experiment/Status items","/Logger/Channels","/System/Clients","/Runinfo","Sequencer/State","/Logger","/Lazy","/Experiment/Name"]},"id":1510596326976},{"jsonrpc":"2.0","method":"get_alarms","params":null,"id":1510596326976},{"jsonrpc":"2.0","method":"cm_exist","params":{"name":"Logger"},"id":1510596326976},{"jsonrpc":"2.0","method":"cm_msg_retrieve","params":{"min_messages":1},"id":1510596326976},{"jsonrpc":"2.0","method":"db_get_values","params":{"paths":["/Equipment/CDM/statistics","/Equipment/CTRL/statistics","/Equipment/ALPHA16/statistics","/Equipment/UDP/statistics","/Equipment/EVB/statistics","/Equipment/WienerVME/statistics","/Equipment/FEAM_tpc01/statistics","/Equipment/FEAM_tpc02/statistics","/Equipment/CAEN_hvps01/statistics","/Equipment/LVDB/statistics","/Equipment/TpcCooling/statistics","/Equipment/WIENER_vmeps01/statistics","/Equipment/WIENER_lvps01/statistics"]},"id":1510596326976},{"jsonrpc":"2.0","method":"db_get_values","params":{"paths":["/Equipment/CDM/common","/Equipment/CTRL/common","/Equipment/ALPHA16/common","/Equipment/UDP/common","/Equipment/EVB/common","/Equipment/WienerVME/common","/Equipment/FEAM_tpc01/common","/Equipment/FEAM_tpc02/common","/Equipment/CAEN_hvps01/common","/Equipment/LVDB/common","/Equipment/TpcCooling/common","/Equipment/WIENER_vmeps01/common","/Equipment/WIENER_lvps01/common"]},"id":1510596326976}]
    

    K.O.
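
    For reference, a rough sketch of how one of these batched requests could be reproduced and timed outside the browser. The host, port and the "?mjsonrpc" URL are assumptions about the local setup, and the batch content is a shortened version of the requests above; this is not part of mhttpd itself.

        #include <chrono>
        #include <cstdio>
        #include <string>
        #include <curl/curl.h>

        // Discard the response body; only the round-trip time is of interest.
        static size_t discard(char*, size_t size, size_t nmemb, void*)
        {
           return size * nmemb;
        }

        int main()
        {
           // Assumed URL of the local mhttpd JSON-RPC endpoint.
           const char* url = "http://localhost:8080/?mjsonrpc";

           // A small batch modelled on what the status page sends every second.
           const std::string batch =
              "[{\"jsonrpc\":\"2.0\",\"method\":\"get_alarms\",\"params\":null,\"id\":1},"
              "{\"jsonrpc\":\"2.0\",\"method\":\"cm_exist\",\"params\":{\"name\":\"Logger\"},\"id\":1}]";

           curl_global_init(CURL_GLOBAL_DEFAULT);
           CURL* curl = curl_easy_init();
           struct curl_slist* hdr = curl_slist_append(NULL, "Content-Type: application/json");

           curl_easy_setopt(curl, CURLOPT_URL, url);
           curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdr);
           curl_easy_setopt(curl, CURLOPT_POSTFIELDS, batch.c_str());
           curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, discard);

           auto t0 = std::chrono::steady_clock::now();
           CURLcode rc = curl_easy_perform(curl);
           auto t1 = std::chrono::steady_clock::now();

           double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
           printf("curl result %d, batch took %.1f ms\n", (int)rc, ms);

           curl_slist_free_all(hdr);
           curl_easy_cleanup(curl);
           curl_global_cleanup();
           return 0;
        }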

  3. dd1 reporter

    strace of mhttpd shows a call to select with a zero timeout:

    select(1024, [3 5], NULL, NULL, {0, 0}) = 0 (Timeout)
    

    K.O.

  4. dd1 reporter

    OK, this select() is in poll_mg(); there is a 10 ms sleep there, so this is not the culprit (see the sketch below)... K.O.
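
    To illustrate why the zero-timeout select() above is not by itself evidence of a busy loop, here is a simplified poll loop in the spirit of what is described (not the actual mongoose/poll_mg code): the explicit 10 ms sleep, not the select() timeout, is what keeps the CPU usage low.

        #include <cstdio>
        #include <sys/select.h>
        #include <sys/time.h>
        #include <unistd.h>

        // Simplified polling loop: check a file descriptor for activity without
        // blocking, then sleep 10 ms. strace of such a loop shows
        // select(..., {0, 0}) about 100 times per second, yet CPU usage stays
        // near zero because of the usleep(), not the select() timeout.
        static void poll_loop(int fd)
        {
           for (;;) {
              fd_set readfds;
              FD_ZERO(&readfds);
              FD_SET(fd, &readfds);

              struct timeval timeout = {0, 0};   // poll, do not block
              int n = select(fd + 1, &readfds, NULL, NULL, &timeout);
              if (n > 0) {
                 // ... handle the pending data here ...
              }

              usleep(10000);   // 10 ms pause between polls
           }
        }

        int main()
        {
           poll_loop(0);   // demo: poll standard input
           return 0;
        }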

  5. dd1 reporter

    Added tracing of HTTP requests; I see that the big batch request from the status page takes 30 msec to process (on my Mac at home). I do not know which request in the batch is taking all this time. K.O.

  6. dd1 reporter

    Added timing to the RPC requests; the big time consumer (and CPU eater) is db_get_values(), taking 70-100 ms to compute (a sketch of the timing wrapper follows after the list):

    request took 0.099 seconds, method [db_get_values]
    request took 0.003 seconds, method [get_alarms]
    request took 0.000 seconds, method [cm_exist]
    request took 0.001 seconds, method [cm_msg_retrieve]
    request took 0.014 seconds, method [db_get_values]
    request took 0.073 seconds, method [db_get_values]
    request took 0.012 seconds, method [db_ls]
    request took 0.108 seconds, method [db_get_values]
    request took 0.003 seconds, method [get_alarms]
    request took 0.000 seconds, method [cm_exist]
    request took 0.001 seconds, method [cm_msg_retrieve]
    request took 0.014 seconds, method [db_get_values]
    request took 0.070 seconds, method [db_get_values]
    request took 0.011 seconds, method [db_ls]
    request took 0.074 seconds, method [db_get_values]
    request took 0.004 seconds, method [get_alarms]
    request took 0.000 seconds, method [cm_exist]
    request took 0.001 seconds, method [cm_msg_retrieve]
    request took 0.014 seconds, method [db_get_values]
    request took 0.070 seconds, method [db_get_values]
    

    K.O.
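
    A sketch of the kind of per-request timing used to produce the list above. The dispatcher function is a placeholder, not the actual mhttpd code; the point is simply to wrap each RPC call with a wall-clock measurement and log the method name.

        #include <chrono>
        #include <cstdio>
        #include <string>

        // Placeholder for the real JSON-RPC dispatcher inside mhttpd.
        std::string dispatch_rpc(const std::string& method, const std::string& params);

        // Time each RPC call and log it in the same format as the list above.
        std::string timed_dispatch(const std::string& method, const std::string& params)
        {
           auto t0 = std::chrono::steady_clock::now();
           std::string reply = dispatch_rpc(method, params);
           auto t1 = std::chrono::steady_clock::now();

           double sec = std::chrono::duration<double>(t1 - t0).count();
           printf("request took %.3f seconds, method [%s]\n", sec, method.c_str());
           return reply;
        }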

  7. dd1 reporter

    There is a performance bug in the JSON decoder; I have a fix, but it is not committed yet. Maybe the same bug is in the encoder... K.O.

  8. dd1 reporter

    OK, I figured it out. I was puzzled for a while because the output of gprof was making no sense, but it was correct: the culprit was the high overhead of locking and unlocking the ODB in the JSON encoder. If I lock the ODB at the top level of the encoder, the low-level ODB functions do not have to repeatedly lock and unlock the ODB, and the locking overhead is greatly reduced (sketch below). CPU usage of mhttpd is now down to 3-5%. K.O.
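
    A rough sketch of the locking pattern described above (the traversal is illustrative, not the actual JSON encoder): take the ODB lock once around the whole tree walk, so that the low-level ODB calls inside do not each have to lock and unlock the ODB.

        #include "midas.h"

        // Recursively count the keys of an ODB subtree. Each db_enum_key() and
        // db_get_key() call would normally lock and unlock the ODB on its own;
        // the caller below takes the lock once around the whole walk, which is
        // the pattern that brought the mhttpd CPU usage down.
        static void walk(HNDLE hDB, HNDLE hKey, int* nkeys)
        {
           for (int i = 0; ; i++) {
              HNDLE hSubkey;
              if (db_enum_key(hDB, hKey, i, &hSubkey) == DB_NO_MORE_SUBKEYS)
                 break;

              KEY key;
              db_get_key(hDB, hSubkey, &key);
              (*nkeys)++;

              if (key.type == TID_KEY)
                 walk(hDB, hSubkey, nkeys);   // descend into subdirectories
           }
        }

        int count_subtree(HNDLE hDB, HNDLE hKey)
        {
           int nkeys = 0;
           db_lock_database(hDB);     // single lock for the whole traversal
           walk(hDB, hKey, &nkeys);
           db_unlock_database(hDB);
           return nkeys;
        }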

  9. dd1 reporter

    New timing profile of the RPC calls. The most expensive call is now "get_alarms", probably for the same reason - it makes too many ODB calls, and each ODB call has to lock and unlock the ODB.

    request took 0.001 seconds, method [db_ls]
    request took 0.003 seconds, method [db_get_values]
    request took 0.007 seconds, method [get_alarms]
    request took 0.000 seconds, method [cm_exist]
    request took 0.002 seconds, method [cm_msg_retrieve]
    request took 0.004 seconds, method [db_get_values]
    request took 0.004 seconds, method [db_get_values]
    request took 0.001 seconds, method [db_ls]
    request took 0.007 seconds, method [db_get_values]
    request took 0.020 seconds, method [get_alarms]
    request took 0.000 seconds, method [cm_exist]
    request took 0.004 seconds, method [cm_msg_retrieve]
    request took 0.008 seconds, method [db_get_values]
    request took 0.006 seconds, method [db_get_values]
    request took 0.001 seconds, method [db_ls]
    request took 0.007 seconds, method [db_get_values]
    request took 0.016 seconds, method [get_alarms]
    request took 0.000 seconds, method [cm_exist]
    request took 0.004 seconds, method [cm_msg_retrieve]
    request took 0.008 seconds, method [db_get_values]
    request took 0.006 seconds, method [db_get_values]
    request took 0.000 seconds, method [db_ls]
    

    K.O.

  10. Stefan Ritt

    I'm puzzled why ODB locking causes any overhead. If you enter "t1" in odbedit, you start a test which measures the number of ODB accesses per second. On my Mac I get 1.2 million accesses per second, so less than one microsecond per access (a standalone sketch of such a benchmark follows below). If you look at al_get_alarms(), you see five db_get_value() calls per alarm, so if you have, let's say, 100 alarms, that is 500 accesses to the ODB, or about half a millisecond. That does not explain your 20 milliseconds above.

    Another remark: too many individual accesses can indeed cause overhead. That was especially true in the 1990s, when CPUs were much slower. That's why I invented db_get_record(), which does everything in a single access. Now you are converting everything back to db_get_value() calls, which of course will be slower. I understand that there is a good reason for not using db_get_record(), but using individual db_get_values() is maybe not the best solution. Maybe we should implement a new function which retrieves a subdirectory of the ODB in a single call, but without the requirement that the subdirectory exactly matches a C structure.

    Stefan
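
    A rough sketch of a standalone benchmark in the spirit of the "t1" test mentioned above (the experiment name, the key path and the loop count are assumptions): time a large number of db_get_value() calls and report the per-access cost. At roughly one microsecond per access, 100 alarms times 5 accesses each is about half a millisecond, matching the estimate above.

        #include <chrono>
        #include <cstdio>
        #include "midas.h"

        // Measure the rate of individual ODB accesses, similar in spirit to
        // the "t1" test in odbedit.
        int main()
        {
           HNDLE hDB;
           // Connect to the local, default experiment.
           if (cm_connect_experiment("", "", "odb_bench", NULL) != CM_SUCCESS)
              return 1;
           cm_get_experiment_database(&hDB, NULL);

           const int N = 100000;
           int run_number = 0;
           INT size = sizeof(run_number);

           auto t0 = std::chrono::steady_clock::now();
           for (int i = 0; i < N; i++) {
              size = sizeof(run_number);
              db_get_value(hDB, 0, "/Runinfo/Run number", &run_number, &size,
                           TID_INT, FALSE);
           }
           auto t1 = std::chrono::steady_clock::now();

           double sec = std::chrono::duration<double>(t1 - t0).count();
           printf("%d accesses in %.3f s: %.0f accesses/sec, %.2f us/access\n",
                  N, sec, N / sec, 1e6 * sec / N);

           cm_disconnect_experiment();
           return 0;
        }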
