A temporary database outage may cause clustered jobs to be permanently ignored in single node operation

Issue #13 resolved
Chris Fuller created an issue

CaesiumSchedulerService.executeClusteredJob(QueuedJob) is called when it is time to run the job. However, if clusteredJobDao.find(JobId) throws an exception, then the job does not run and it is not re-enqueued.

If there is a cluster refresh job (see CaesiumSchedulerConfiguration.refreshClusteredJobsIntervalInMinutes()), as there is in JIRA Data Center, then this will eventually self-heal when the refresh poll re-discovers the job entry. However, this does not normally happen in single node operation, so the job may not run again unless it is rescheduled or the scheduler is restarted.
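
To make the failure mode concrete, here is a minimal, self-contained sketch. It is not Caesium's code: `JobLookup`, `Scheduler`, `runNext`, and `failingFirst` are hypothetical stand-ins, with `JobLookup.find` playing the role of `clusteredJobDao.find(JobId)`. It only illustrates the difference between dropping a job when the lookup throws and putting it back on the queue; it does not claim to show how the issue was actually fixed.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-ins throughout; none of these are Caesium's real classes.
class ClusteredJobRetrySketch {

    /** Stand-in for the DAO lookup, which can throw during a database outage. */
    interface JobLookup {
        String find(String jobId) throws Exception;
    }

    /** Minimal in-memory scheduler holding the IDs of jobs that are due to run. */
    static final class Scheduler {
        final Queue<String> queue = new ArrayDeque<>();
        final List<String> executed = new ArrayList<>();
        private final JobLookup lookup;
        private final boolean reEnqueueOnFailure;

        Scheduler(JobLookup lookup, boolean reEnqueueOnFailure) {
            this.lookup = lookup;
            this.reEnqueueOnFailure = reEnqueueOnFailure;
        }

        /** Takes the next due job off the queue and tries to run it. */
        void runNext() {
            String jobId = queue.poll();
            if (jobId == null) {
                return; // nothing is due
            }
            try {
                lookup.find(jobId);  // may throw while the database is down
                executed.add(jobId); // "running" the job
            } catch (Exception e) {
                if (reEnqueueOnFailure) {
                    queue.add(jobId); // keep the job so a later attempt can succeed
                }
                // Otherwise the job is already off the queue and is now forgotten.
            }
        }
    }

    /** Builds a lookup that throws for the first {@code failures} calls, then succeeds. */
    static JobLookup failingFirst(int failures) {
        int[] remaining = {failures};
        return jobId -> {
            if (remaining[0]-- > 0) {
                throw new Exception("database unavailable");
            }
            return "details-for-" + jobId;
        };
    }

    public static void main(String[] args) {
        Scheduler lossy = new Scheduler(failingFirst(1), false);
        lossy.queue.add("job-1");
        lossy.runNext(); // lookup fails, job is dropped
        lossy.runNext(); // queue is empty, nothing happens
        System.out.println("without re-enqueue: executed=" + lossy.executed);

        Scheduler retrying = new Scheduler(failingFirst(1), true);
        retrying.queue.add("job-1");
        retrying.runNext(); // lookup fails, job goes back on the queue
        retrying.runNext(); // lookup succeeds, job runs
        System.out.println("with re-enqueue: executed=" + retrying.executed);
    }
}
```

A real scheduler would presumably re-enqueue with a delay rather than immediately, so that a prolonged outage does not turn into a tight retry loop; the sketch leaves that out for brevity.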

Comments (11)

  1. Andy Brook [Plugin People]

    Hi Chris, thanks for deftly locating the source of this, was going a little crazy :)

  2. Ngoc Dao

    Isn't it a misuse of Caesium in the first place, when an app may schedule clustered jobs, but it doesn't configure Caesium to refresh regularly from DB?

  3. Chris Fuller reporter

    "Isn't it a misuse of Caesium in the first place, when an app may schedule clustered jobs, but it doesn't configure Caesium to refresh regularly from DB?"

    No. The refresh is intended to provide robustness for the cluster when a node schedules a new job then goes down before the time comes to run it. Until another node "notices" the job during a refresh, the original node is the only one that knows it is in the queue and therefore also the only one that can pick it up. When you only actually have one node, there is no point in checking for this because there are no other nodes to worry about. It would just be wasteful, so we don't do it.

    Another point is that even in a cluster, you may have some other mechanism that you prefer to use instead of polling the database. For example, a cluster-wide broadcast message that informs them of what new job was added.

  4. Andy Brook [Plugin People]

    Hi Chris, great to hear that has been fixed. How/when will it filter into JIRA? (This goes back down the chain to 6.4.x, which is where I have customers with issues, so hopefully this warrants a point release?)

  5. Chris Fuller reporter

    @javahollic : that is... Not consistent with what I've found here.

    The short version is that Caesium is a library I wrote to replace the generic Quartz library that Atlassian products have historically used. JIRA first switched from Quartz to Caesium in version 7.0, so if you have seen this occur on earlier versions of JIRA, then this particular oversight couldn't explain that and we would need to keep looking.

  6. Andy Brook [Plugin People]

    OK fine, I'll dig more on 6.4.x, but will this feed into a 7.0.x and 7.1.x release, and how will I know when/where?

  7. Chris Fuller reporter

    I have some follow-up to do, as JIRA needs to bump the version of the library it consumes, and I'm not sure whether there will be a 7.1.10, or it hits 7.2.0, or whether I'm too late even for that and 7.2.1 is my earliest chance. I'll have a chat with the JIRA bug master on Monday to find out what the options are. The main JSD issue has already been ping-ponged back to the main JIRA team to deal with, so while I can't tell you exactly when it will hit just yet, it definitely shouldn't get dropped. I'll let you know more as soon as I know more.

  8. Chris Fuller reporter

    No worries. You should probably watch https://jira.atlassian.com/browse/JSD-3653 for updates. I don't think it will stay there since JIRA itself is where this currently sits, but that issue is the main one that I know of for this problem and therefore where I would expect progress to be coordinated (or delegated to a JRA, where it really belongs, now).
