Performance regression got introduced in 5.15.0 or 5.16.0

Issue #266 resolved
Jóhann B. Guðmundsson created an issue

I finally had an an chance to inspect what had caused noticeable consisted increased in cpu usage on our jira servers in april and have traced that back to updates that where performed on the TTS plugin when it got updated from 5.14.0 to 5.16.0 so something you introduced or fixed in those version of the plugin is causing increased cpu load on jira servers.

The attached screenshots of cpu load from our test servers shows when the cpu load started as well as well as how it dropped when we disabled the tts plugin on the test server.

Comments (28)

  1. Tuncay Senturk repo owner

    Hi Johann

    Thanks for keeping us notified about the performance problem.
    The charts say there is huge difference between 5.14.0 and 5.16.0 versions. However there is no code changes that will cause such performance issue.
    However the caching mechanism of the add on was broken in prior versions and we have a fix for that in 5.20.0 version.

    I will look into it deeply, but is there any chance to upgrade to the latest version and see the results?

  2. Jóhann B. Guðmundsson reporter

    In 5.15 and 5.16 you did bug fixes and introduced new features some of which caused the spike which is quite obvious by the load graph.

    It was TTS Version 5.20.0 that got disabled on the test server and we cant enable the plugin again on the test server ( it does not load the plugin menu to enable the plugin again o_O ) and seems to have deleted the tts custom fields it uses in that process ( something that should only happen when plugins get uninstalled not disabled ) and the log is getting spammed with " Field layout contains non-orderable field with id 'customfield_XXXX'."

    I'm currently re-indexing the site ( with the indexing alone taking one and an half hour, this instance has over 500k issues and over thousand projects ) then i'm going to run some integrity checks to see if it fixes it and if not I will have to restore the db from backup which is about 3 hour process.

  3. Jóhann B. Guðmundsson reporter

    So I found out what was causing the plugin not to being able to be enabled and it was EFF's privacy badger plugin for browsers in anycase I update the plugin to latest version both on jira 6 and jira 7, enabled it and the cpu load went up again so this is not fixed with latest versions.load-tts.png

  4. Tuncay Senturk repo owner

    Hi Johann,

    Thanks again for the comprehensive inputs.
    Your test server's CPU increases immediately after the add-on installation, and the load remains same (high) until disabling add-on. Right?
    If there is not any background process or any manual user test it should be because of a scheduled task (in TTS).

    I am just asking to understand the root cause. Please give me as much info as you can.

    Thanks in advance
    Tuncay

  5. Jóhann B. Guðmundsson reporter

    There are no scheduled tasks in TTS on the servers and the load of the cpu increased after the update from 5.14.0 to 5.16.0 and stops immediately once you disable the TTS plugin.

    I have no problem installing an debug/beta plugin on the test server.which might shed some light what's going on there.

  6. Tuncay Senturk repo owner

    Great,

    So, could you please open DEBUG log for "com.tuncaysenturk.jira.plugins.timetosla" package and send me the logs after 5 minutes?

    In the meantime I'll be generating a special binary.
    Thanks
    Tuncay

  7. Tuncay Senturk repo owner

    Hi Johann,

    As I mentioned, I commented out some codes and see that performance problem gone. Those codes were not introduced with either 5.15.0 or 5.16.0, they were a part of 5.17.0 version.

    Are you sure about the versions?

    Those commented out codes are the scheduled tasks of TTS. They run every two minutes. There are two scheduled tasks: One is for SLA notifiers, and the latter is to update SLA data (for issues waiting in Paused status to postpone SLA target date)

    Here is from the logs

    2016-08-22 11:02:00,002 Starting Time to SLA Plugin Scheduled Tasks scheduled task...
    2016-08-22 11:02:00,003     Starting Time to SLA Plugin SLA Notifier scheduled task...
    2016-08-22 11:02:00,005     Finished Time to SLA Plugin SLA Notifier scheduled task successfully
    
    2016-08-22 11:02:00,005     Starting Time to SLA Plugin SLA Data Updater scheduled task...
    2016-08-22 11:02:31,843     Finished Time to SLA Plugin SLA Data Updater scheduled task successfully
    2016-08-22 11:02:31,843 Finished Time to SLA Plugin Scheduled Tasks scheduled task successfully
    

    As seen above notifier starts at 11:02:00,003 and finishes at 11:02:00,005
    SLA Updater starts at 11:02:00,005 and finishes at 11:02:31,843 updating around 75 - 80 issues.

    Below is just one issue that was updated through this scheduled task.

    2016-08-22 11:02:31,674 grabbed issueSla with ID 514851 to update SLA data
    2016-08-22 11:02:31,676 calculating durations for issueSla with ID 514851
    2016-08-22 11:02:31,695 updated expected target date of issue LOGSK-461 from 2016-04-11 14:26:57.302 to Mon Apr 11 14:28:56 GMT 2016
    2016-08-22 11:02:31,696 reindexing issue LOGSK-461
    2016-08-22 11:02:31,696 reindexing issue: LOGSK-461
    2016-08-22 11:02:31,764 reindexing issue: LOGSK-461 finished
    2016-08-22 11:02:31,764 updated SLA data for issue LOGSK-461
    

    It takes around 100 ms to update a single issue.

    The next scheduled tasks starts at 11:04:00,002 and finishes at 11:04:31,596
    It takes around 30 seconds for each scheduled tasks suite.

    Actually I was not expecting those CPU loads. One option is to change interval to 5 minutes rather than 2 minutes, or even more.

    I just wanted to update the issue, and just wanted to make a brain-storming. Your ideas may shed a light as well.

    Cheers
    Tuncay

  8. Jóhann B. Guðmundsson reporter

    This started when we upgraded to 5.16.0 from 5.14.0 which leaves just 5.15.0 or 5.16.0 as affected version and since you mentioned notifier based on the release notes you where hacking on it in 5.15.0 ( Multiple SLA Notification support: ... ). We dont use the sla notifier et all so it should be disabled and hence the scheduled tasks for it never be run but can you create two releases.

    One where the schedule task for SLA notifiers is enabled and the scheduled task for updating the SLA data is disabled/commented out and another one where the schedule task for SLA notifiers is disabled/commented out and sheduled task for updating the SLA data is enabled.

    That way we can isolate which one of those schedule task is the root cause for this since since I would expect a much higher cpu spike on the production machine if updating the number of issues was a factor here since on the production machine we have a rotation ( created/paused/closed ) of 1500 - 2000 issues per week while on the test machine it's just whatever issues we create + which one where open/paused from the db backup we are testing against yet the cpu spikes seems somewhat similar on both the test machine and the production machine

  9. Tuncay Senturk repo owner

    Yeah, sure I can generate those binaries.
    Actually I do not expect notifier because it takes 2 miliseconds. That's why 5.15.0 is not a suspect for me.
    So I'll produce the latter first.

  10. Jóhann B. Guðmundsson reporter

    The SLA Notifier code is not the cause for this load and yes we run both the postgresql database and jira are running on the same instance unfortunately we dont measure the database on the test instance ( since there is no point in doing that )and there is no difference in the db activity on the production server when we updated it to 5.16.0 but what we are seeing here is heavier disk io activity after the plugin upgrade and weirdly enough higher firewall throughput as well.

    You can see how things drop when I installed the 6.15.3 test plugin yesterday

    tts-cpu.png
    tts-disk-io.png
    tts-disk-throughput.png
    tts-firewall.png

  11. Tuncay Senturk repo owner

    All good things should have some minor side effects; )

    The difference is reindexing issue. I only removed indexing issue code.
    The next step may be the interval, increasing it from 2 minutes to 4-5 minutes will lower the load.

    Thanks

  12. Log in to comment