JIRA hang when resolving an issue with workflow

Issue #423 resolved
Simon Ferrett created an issue

Ref: JIRA ticket JSP-275896

I opened a ticket with JIRA support due to our jira-server hanging when resolving tickets. They determined the issue was with some lookup/searches being done as part of the workflow.

I'm looking to see what info we can provide, or what next steps should be taken to determine what can be done to optimise or fix whatever the issue is making these calls take >60s to complete when resolving a ticket.

Here's what the JIRA support rep had to say: {quote} Hey Simon, Thanks for providing those thread dumps! In reviewing and comparing the sets you provided, I can indeed see a difference in the thread dumps gathered when nothing was happening in JIRA, vs when you were processing a hanging ticket resolve. From the thread dumps captured while the ticket was resolving, we can see just the one thread which was consistently taking a large amount of CPU%: "http-bio-10.11.1.36-8443-exec-1660" #24855 daemon prio=5 tid=0x00007f5b74a76800 nid=0x4a22 runnable [0x00007f5e5e6d6000] java.lang.Thread.State: RUNNABLE at org.apache.lucene.search.FieldComparator.binarySearch(FieldComparator.java:982) at org.apache.lucene.search.FieldComparator.binarySearch(FieldComparator.java:961) at com.atlassian.jira.issue.search.parameters.lucene.sort.StringSortComparator$StringOrdValComparator.setBottom(StringSortComparator.java:169) at org.apache.lucene.search.TopFieldCollector$OutOfOrderMultiComparatorNonScoringCollector.collect(TopFieldCollector.java:506) at org.apache.lucene.search.IndexSearcher.searchWithFilter(IndexSearcher.java:572) at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:532) at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:463) at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:433) at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:356) at com.atlassian.jira.index.DelegateSearcher.search(DelegateSearcher.java:165) at com.atlassian.jira.issue.search.providers.LuceneSearchProvider.runSearch(LuceneSearchProvider.java:574) at com.atlassian.jira.issue.search.providers.LuceneSearchProvider.getHits(LuceneSearchProvider.java:294) at com.atlassian.jira.issue.search.providers.LuceneSearchProvider.search(LuceneSearchProvider.java:420) at com.atlassian.jira.issue.search.providers.LuceneSearchProvider.search(LuceneSearchProvider.java:98) at com.atlassian.jira.issue.search.providers.LuceneSearchProvider.search(LuceneSearchProvider.java:87) at com.atlassian.jira.bc.issue.search.DefaultSearchService.search(DefaultSearchService.java:117) ... at com.fca.jira.plugins.workflowToolbox.shared.ReflectionHelper.invokeMethod(ReflectionHelper.java:179) at com.fca.jira.plugins.workflowToolbox.shared.JIRACompatibilityHelper.search(JIRACompatibilityHelper.java:71) at com.fca.jira.plugins.workflowToolbox.JQLCondition.jqlQueryResultContainsIssue(JQLCondition.java:173) at com.fca.jira.plugins.workflowToolbox.JQLCondition.jqlQueryConditionEvaluation(JQLCondition.java:85) at com.fca.jira.plugins.workflowToolbox.JQLValidator.validate(JQLValidator.java:43) at com.atlassian.jira.workflow.SkippableValidator.validate(SkippableValidator.java:52) at com.opensymphony.workflow.AbstractWorkflow.verifyInputs(AbstractWorkflow.java:1512) at com.opensymphony.workflow.AbstractWorkflow.transitionWorkflow(AbstractWorkflow.java:1203) at com.opensymphony.workflow.AbstractWorkflow.doAction(AbstractWorkflow.java:564) at com.atlassian.jira.workflow.OSWorkflowManager.doWorkflowActionInsideTxn(OSWorkflowManager.java:998) at com.atlassian.jira.workflow.OSWorkflowManager.doWorkflowAction(OSWorkflowManager.java:950) at com.atlassian.jira.bc.issue.DefaultIssueService.transition(DefaultIssueService.java:596) at com.atlassian.jira.bc.issue.DefaultIssueService.transition(DefaultIssueService.java:572) at com.atlassian.jira.web.action.issue.CommentAssignIssue.doExecute(CommentAssignIssue.java:193) at webwork.action.ActionSupport.execute(ActionSupport.java:165) at com.atlassian.jira.action.JiraActionSupport.execute(JiraActionSupport.java:88) ... at java.lang.Thread.run(Thread.java:745) I imagine that this is a mess to read, but what I can identify from this thread is that it appears that upon resolving an issue, you call a fairly expensive validation implemented by the 3rd party add-on JIRA Workflow Toolbox: at com.fca.jira.plugins.workflowToolbox.JQLCondition.jqlQueryResultContainsIssue(JQLCondition.java:173) at com.fca.jira.plugins.workflowToolbox.JQLCondition.jqlQueryConditionEvaluation(JQLCondition.java:85) at com.fca.jira.plugins.workflowToolbox.JQLValidator.validate(JQLValidator.java:43) I couldn't be certain what exactly this is doing, as we aren't inherently experts on how 3rd party functionality behaves, but it appears that the validation involves running a JQL query and verifying that the results contain a certain issue (perhaps the one being resolved?). You can see that the rest of the stack trace involves searching through the indexes, presumably evaluating the validation before confirming that the transition may happen.


Given what we know, we can assume that the JIRA Workflow Toolbox is the origin of the problem. If you look at the workflows which this project use, can you confirm the existence of such a validator? Is this validator fairly new, or has it recently been modified, so that you're only seeing this problem now? Things to keep into consideration improving this behavior: 1. Searching the indexes is a fairly disk and database intensive process. You can improve the general health and speed of this task if you have fast disks (SSDs). 2. Based on how the validator is written and implemented, you might very well be encountering the scenario that the JQL query being run is just so complex that it warrants a delay this long in order to complete the query, and thus, the validation. Can we simplify the validation to improve the query performance? This might be better evaluated by the vendors of the add-on; you can reach their support contact here: https://bitbucket.org/fcarmario/jira-workflow-toolbox/issues?status=new&status=open


I hope this information helps fully diagnose the issue! It's not obvious what can be improved on the JIRA side, or if the validator can be optimized. Let us know how things look considering this feedback.

Comments (13)

  1. Simon Ferrett Account Deactivated reporter

    Here's the original request submitted to Atlassian:

    Starting earlier this week we started getting reports of users who had operations hanging when resolving issues. If they came back to look at the issue later, after the timeout, it would have completed. I see very slow responses to the POST call to CommentAssignIssue.jspa which is related to the hang: 10.10.1.113 700x64002x3 sferrett [06/Oct/2016:11:42:06 -0700] "POST /secure/CommentAssignIssue.jspa?atl_token=A6W7-4U8G-EVUP-Y8PX%7C4917e9b8d7ffa78d4cd3ff77de35bad4c86c2e07%7Clin HTTP/1.1" 200 86 69703 "https://jira.sdslacker.com/browse/CWS-2702" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36" "1k1jrkm" 10.10.1.113 701x64051x3 sferrett [06/Oct/2016:11:42:32 -0700] "POST /secure/CommentAssignIssue.jspa?atl_token=A6W7-4U8G-EVUP-Y8PX%7C4917e9b8d7ffa78d4cd3ff77de35bad4c86c2e07%7Clin HTTP/1.1" 200 7576 68567 "https://jira.sdslacker.com/browse/CWS-2702" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36" "1k1jrkm" Can you please give me some pointers as to where to start diagnosing this issue? Thanks- Simon.

    And:

    This issue looks to have started following an update to the Xray pluging. During the update, the jira java process started to do back-to-back full GC events (seen in the gc.log file) and the server went unresponsive. It was stopped and restarted at that time with a larger heap size setting. After coming back online, there was no option to update Xray anymore, so I am not sure if that may be related to this isssue or not. Also, during startup, it takes quite a while to come online (>10 minutes) although I am uncertain if this was the case proir to the Xray update (I think it was faster than that prior). The logs during startup show the time where the delay occurs: 2016-10-05 18:27:01,040 localhost-startStop-1 INFO [quisapps.jira.jss.PluginInitializer] Disabled QuickEdit support: com/atlassian/jira/quickedit/action/QuickCreateIssue 2016-10-05 18:27:01,045 localhost-startStop-1 INFO [quisapps.jira.jss.PluginInitializer] Disabled IssueNav support: com/atlassian/jira/plugin/issuenav/action/IssueAction 2016-10-05 18:37:01,992 localhost-startStop-1 INFO [atlassian.plugin.manager.DefaultPluginManager] Plugin system earlyStartup finished in 0:10:41.102 2016-10-05 18:37:01,995 localhost-startStop-1 INFO [atlassian.plugin.manager.DefaultPluginManager] Plugin system lateStartup begun 2016-10-05 18:37:01,995 localhost-startStop-1 INFO [atlassian.plugin.manager.PluginEnabler] Resolving 0 plugins Note the 10 minute of time elapsed between those messages.

  2. Fidel Castro Armario repo owner

    Hi Simon,

    Your issue might be related with #388, which was solved in version 2.2.16. Can you confirm the version of JIRA Workflow Toolbox you have installed?

    On the other hand, can you please, attach a screenshot of the configuration of the "Validation based on JQL query" you are using. I would like to see the JQL query you are running.

    Thanks

  3. Simon Ferrett Account Deactivated reporter

    From the JIRA startup logs: JIRA Workflow Toolbox : com.fca.jira.plugins.workflowToolbox.workflow-toolbox Version : 2.2.21 Status : enabled

    However, there's a curveball I just discovered which is that plugins were updated recently (coinciding with when this slow/hang behavior started) and during the upgrade process the jira JVM went into endless garbage collection loops and was stopped via kill. I am not sure if something may have not completed during the upgrade (indexes, etc?) - I dont see any way to do a 're-install' of plugins though. The prior version before update was 2.2.9.

    I was not able to find any transitions in the workflow that I think is being used, so I must be looking in the wrong place. I will get the jql and update when I have it.

    Thanks for your quick response.

  4. Fidel Castro Armario repo owner

    Issue #388 was due to the fact that the plugin was creating an object for each issue returned by the JQL query. When a JQL query returned a very big number of issues the RAM used might cause problems in the server.

    The fix in version 2.2.16 was to implement paginated access.

    I think that it's very likely that problem will not happen after upgrading to version 2.2.21. I suggest you to simply wait for the problem to reappear. If it doesn't, I would consider it has been solved with the plugin update.

    Anyway, if you want, you can attach the XML export of your workflow, and I will look for JQL validators, and will exam the configuration.

  5. Simon Ferrett Account Deactivated reporter

    The hang/slowness only started after the update to 2.2.21 which is what we are currently using and what was running when the stack trace included in the start of this ticket was generated.

    I attached the xml export of our workflow here also.

  6. Fidel Castro Armario repo owner

    Simon, can you please tell me how many issues is returning the following JQL query in your JIRA instance?

    ("PdM Rep" is not EMPTY AND "PdM Accepted?" = Yes) OR "PdM Rep" is EMPTY
    

    As a temporary workaround while we fix the problem, can we be more restrictive in the query, for example looking only for issues in current issue's project? We can do it with the following JQL query:

    project = %{00018} AND (("PdM Rep" is not EMPTY AND "PdM Accepted?" = Yes) OR "PdM Rep" is EMPTY)
    

    where %{00018} is field code for Project key.

  7. Simon Ferrett Account Deactivated reporter

    If I punch that jql into the JIRA search it returns a large number of issues (over 65k issues) so this could be related for sure.

    I looked in the workflow but I cannot find where this query is defined. Can you give me a step-by-step on how I would navigate to the configuration screen that would show where the JQL is being set, so that I can adjust it as you suggest?

    Thank you!

  8. Fidel Castro Armario repo owner

    It's the 3th validator in transition Resolved with ID = 801. Look at the screenshot:

    transition-801.png

    Edit the validator and enter the new JQL query.

    By the way, has the problem only happended once? Could you please try to reproduce the problem again?

  9. Simon Ferrett Account Deactivated reporter

    Prior to me adjusting the validator just now, the issue is consistently reproducible by resolving a ticket from any state.

    After changing the JQL in the validator the issue appears to be resolved, I cannot reproduce the hang following the change.

    So my takeaway from this is that the (probably poorly coded) validator JQL somehow became more inefficient with the update to the latest version and was causing it to hang for long enough to result in a timeout. Adjusting the JQL to limit to the current project looks to have made the query effecient enough to work. We should therefore examine all JQL based validators and ensure they dont select more tickets than needed for the validation. Does this sound accurate to you?

  10. Simon Ferrett Account Deactivated reporter

    OK thanks very much for the help - fixing the bogus JQL has resovled the issue. I'm not sure why it only started causing noticeable performance issues until now, but the query was clearly badly formed. We've refactored that workflow and all is good.

    Cheers!

  11. Fidel Castro Armario repo owner

    Simon, thank you for reporting the problem. I will investigate how to improve the performance of the validator, in order to cope with this kind of JQL queries.

    Anyway, each time you have to implement validations on field values, I recommend you to use "Boolean validator with math, date-time or text-string terms". It's a much more powerful validador. You have to learn a new syntax, but if you master it the effort definitely pays off, and it's also much more efficient than using JQL queries for this kind of validations.

    These are some boolean expressions fully logically equivalent to your JQL validation:

    1) Literal translation:

    %{aaaaa} != null AND %{bbbbb} = "Yes" OR %{aaaaa} = null
    

    2) Simplified logic:

    %{aaaaa} = null OR %{bbbbb} = "Yes" 
    

    3) A maybe more intuitive syntax:

    %{aaaaa} != null IMPLIES %{bbbbb} = "Yes" 
    

    where aaaaa is field code for PdM Rep custom field, and bbbbb is field code for PdM Accepted?. You will find field code injectors in the validator's configuration screen.

    Let me know if you have doubts about it.

  12. Log in to comment