Jira Data Center / JRASERVER-66204

Bulk Operation can get stuck in JIRA Data Center


Details

    • Severity 2 - Major

      Atlassian Update – 06 Feb 2020

      Hi everyone,

      I’m glad to announce that Jira 8.5.4, 8.7.1, 8.8.0, and later contain a remedy for this issue.

      First, when a node starts up, it will remove any tasks that were assigned to it before the restart. It's a generalized version of the same mechanism Jira uses for cleaning up stuck re-index and user anonymization jobs.

      Secondly, there's a new scheduled service that detects offline nodes and clears tasks assigned to them. For this mechanism, a node is considered offline if it didn't write a heartbeat to the database in the past 30 minutes.
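
      As a rough illustration of that rule (a sketch only; the class and method below are hypothetical and are not Jira's internal code):

      import java.time.Duration;
      import java.time.Instant;

      // Hypothetical sketch of the offline-node check described above.
      class OfflineNodeCheck {
          // Default mentioned above: a node is offline after 30 minutes without a heartbeat.
          static final Duration OFFLINE_THRESHOLD = Duration.ofMinutes(30);

          // A node counts as offline if its last database heartbeat is older than the
          // threshold; tasks assigned to such a node become candidates for cleanup.
          static boolean isOffline(Instant lastHeartbeat, Instant now) {
              return Duration.between(lastHeartbeat, now).compareTo(OFFLINE_THRESHOLD) > 0;
          }
      }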

      You can disable the second mechanism by setting the "jira.dc.cleanup.cluser.tasks.disabled" feature flag. You can follow our documentation to learn how to set it.

      You can also customize its behavior (how often the service is run and how long it takes for a node to be considered offline) by modifying two properties:

      cluster.task.cleanup.run.interval
      
      cluster.task.cleanup.offline.node.threshold
      

      You should refer to the jpm.xml file for up-to-date documentation. Currently, the first property defaults to 60 seconds and the second one to 30 minutes (and disallows values below 10 minutes).
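
      For example, you could override these values by adding them to the jira-config.properties file in the Jira home directory (a sketch only; the values are placeholders, and jpm.xml is the authority on the exact units and allowed ranges):

      # jira-config.properties (Jira home directory) - hypothetical overrides;
      # check jpm.xml for the exact units and allowed ranges before setting these.
      cluster.task.cleanup.run.interval = <run interval>
      cluster.task.cleanup.offline.node.threshold = <offline threshold, no less than 10 minutes>

      A restart of each node is typically needed for changes to jira-config.properties to take effect.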

      We also created a product improvement suggestion to notify users (e.g. via email) that a task they submitted has been removed: JRASERVER-70584. Right now, only a log message like the one below is written:

      2020-02-06 02:58:35,292+1100 Caesium-1-4 ERROR ServiceRunner     [c.a.jira.cluster.ClusterTaskCleanupService] Removing stale 'Jira Indexing' task '10500' started on node 'node-id'.
      

      If you’re interested in the ability to remove stuck global tasks manually, please see the suggestion JRASERVER-66722.

      Thank you,
      Daniel Rauf
      Jira Server Developer


    Description

      Summary

      Assume some node in JIRA Data Center is executing a long-running task that has cluster-wide status. If at some point progress stops abnormally before the job is complete, the job becomes stuck for the whole cluster.
      The stuck job is stored in an in-memory cache and is replicated to other nodes when they start. All nodes must be shut down at the same time for this job to be removed from the cache.

      Environment

      • JIRA Data Center with 2+ nodes

      Steps to Reproduce

      1. Perform a change that triggers a "Bulk Operation" action.
      2. Monitor the "Bulk Operation Progress" bar.
      3. Restart the node executing the job (or cause a database connection failure).

      Actual Results

      • The progress bar appears stuck on every node.
        • Restarting a single node has no effect; the progress bar continues to appear.
      • The stuck job also appears when trying to make other changes, and those changes cannot be made while it is stuck.
      • All nodes must be shut down at the same time for this job to be removed from the cache.

      Expected Results

      Either:

      • The JIRA cluster detects that the job is no longer progressing, throws an error, and no longer shows the stuck "Bulk Operation Progress" bar.
      • Or the JIRA cluster detects that the job is no longer progressing and continues the job.

      In either case, a stuck job on one node should not require a restart of the entire cluster.

      Notes

      • A stuck bulk edit can be reproduced by setting a breakpoint that stops the thread at BulkEditOperation.java line 184.
      • The bulk edit task is held in memory and is communicated to all nodes.
      • Nodes keep this task in memory until it is deleted, or until all nodes are down (which clears the in-memory tasks).
        • The task is deleted when the operation completes. The operation runs on the node the task started on.

      Workaround

      In some cases, it may be possible to manually delete the stuck job. This should only be done after being absolutely certain that the job is no longer running.

      Example for project migration, workflow migration, and bulk edits:

      • This needs to be run by the user who executed the task. In addition, that user must be a JIRA Administrator in order to delete the task.
        • If necessary, grant temporary admin permission, delete the task, then remove the admin permission.
        • If you are looking for the task id, open the Workflow page where it says "Migration in progress", then click on the progress link shown in blue. The task id is in that page's URL. Substitute it for #id in the curl command below.
        • The URL below applies to bulk edits as well as project changes.

          There is an internal REST endpoint to stop tasks:

          • DELETE /rest/projectconfig/1/migrationStatus/#id
          • "#id" can be taken from the progress page's URL.
