Type: Bug
Resolution: Fixed
Priority: Low
Fix Version/s: 8.5.4, 8.8.0, 8.7.1
Affects Version/s: 6.4.12, 7.5.1, 7.2.12, 7.6.7, 7.13.3, 8.7.0, 8.5.3
Component/s: Data Center - Other
Labels:

Fixed in Long Term Support Release/s: 8.5
Introduced in Version: 6.04
Support reference count: 70
Symptom Severity: Severity 2 - Major
UIS: 335
Bug Fix Policy: View Atlassian Server bug fix policy
Current Status:

Atlassian Update – 06 Feb 2020

Hi everyone,

I’m glad to announce that Jira 8.5.4, 8.7.1, 8.8.0, and later contain a remedy for this issue.

First, when a node starts up, it will remove any tasks that were assigned to it before the restart. It's a generalized version of the same mechanism Jira uses for cleaning up stuck re-index and user anonymization jobs.

Secondly, there's a new scheduled service that detects offline nodes and clears tasks assigned to them. For this mechanism, a node is considered offline if it didn't write a heartbeat to the database in the past 30 minutes.
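The offline check described above can be sketched roughly as follows. This is only an illustration in Python, not Jira's actual implementation: the `Task` record, the `heartbeats` mapping, and both function names are hypothetical, and only the 30-minute threshold is taken from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Default from the update above: no heartbeat for 30 minutes means offline.
OFFLINE_THRESHOLD = timedelta(minutes=30)


@dataclass(frozen=True)
class Task:
    """Hypothetical stand-in for a cluster-wide task record."""
    task_id: int
    node_id: str


def offline_nodes(heartbeats, now, threshold=OFFLINE_THRESHOLD):
    """Return ids of nodes whose last heartbeat is older than the threshold.

    `heartbeats` maps node id -> datetime of the last recorded heartbeat.
    Nodes with no heartbeat record at all are ignored in this sketch.
    """
    return {node for node, last_seen in heartbeats.items()
            if now - last_seen > threshold}


def stale_tasks(tasks, heartbeats, now, threshold=OFFLINE_THRESHOLD):
    """Return the tasks assigned to nodes that are considered offline."""
    offline = offline_nodes(heartbeats, now, threshold)
    return [task for task in tasks if task.node_id in offline]
```

A scheduled job would call something like `stale_tasks` on each run and remove whatever it returns; tasks on nodes that are still writing heartbeats are left alone.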

You can disable the second mechanism by setting the "jira.dc.cleanup.cluser.tasks.disabled" feature flag. You can follow our documentation to learn how to set it.

You can also customize its behavior (how often the service is run and how long it takes for a node to be considered offline) by modifying two properties:

cluster.task.cleanup.run.interval

cluster.task.cleanup.offline.node.threshold

Refer to the jpm.xml file for up-to-date documentation. Currently, the first property defaults to 60 seconds and the second one to 30 minutes (values below 10 minutes are not allowed).

We also created a product improvement suggestion to notify users (e.g. via email) that a task they submitted has been removed: JRASERVER-70584. Right now, only a log message like the one below is added:

2020-02-06 02:58:35,292+1100 Caesium-1-4 ERROR ServiceRunner [c.a.jira.cluster.ClusterTaskCleanupService] Removing stale 'Jira Indexing' task '10500' started on node 'node-id'.
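Since the log is currently the only notification, one option is to scan the Jira log for these entries. The following is a minimal sketch, assuming the message format matches the single sample line above; the pattern and the function name are illustrative, and other Jira versions may word the message differently.

```python
import re

# Pattern modelled on the one sample line shown above (an assumption,
# not a documented log format).
STALE_TASK_RE = re.compile(
    r"^(?P<timestamp>\S+ \S+) .*"
    r"\[c\.a\.jira\.cluster\.ClusterTaskCleanupService\] "
    r"Removing stale '(?P<description>[^']*)' task '(?P<task_id>[^']*)' "
    r"started on node '(?P<node_id>[^']*)'"
)


def parse_stale_task_line(line):
    """Return the fields of a stale-task log line, or None if it doesn't match."""
    match = STALE_TASK_RE.search(line)
    return match.groupdict() if match else None
```

Applied to each line of the log file, this yields the task description, task id, and node id of every task the cleanup service removed.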

If you’re interested in the ability to remove stuck global tasks manually, please see the suggestion ~~JRASERVER-66722~~.

Thank you,
Daniel Rauf
Jira Server Developer

Summary

Assume a node in JIRA Data Center is executing a long-running task that has cluster-wide status. If progress stops abnormally before the job is complete, the job will be stuck for the whole cluster.
The stuck job is stored in an in-memory cache and is replicated to other nodes when they start. All nodes must be shut down at the same time in order for this job to be removed from the cache.

Environment

JIRA Data Center with 2+ nodes

Steps to Reproduce

Perform a change that triggers a "Bulk Operation" action.
Monitor the "Bulk Operation Progress" bar.
Restart the node executing the job (or cause a database connection failure).

Actual Results

The progress bar appears stuck on every node.
- Restarting a single node has no effect; the progress bar continues to appear.
The stuck job appears when trying to make other changes, and those changes cannot be made while it is stuck.
All nodes must be shut down at the same time in order for this job to be removed from the cache.

Expected Results

Either:

JIRA cluster detects the job is no longer progressing, throws an error, and no longer shows the stuck "Bulk Operation Progress" bar.
Or JIRA cluster detects the job is no longer progressing and continues the job.

In either case, a stuck job on one node does not require a restart of the entire cluster.

Notes

A stuck bulk edit can be reproduced by setting a breakpoint that stops the thread at BulkEditOperation.java line 184.
The bulk edit task is held in memory and is communicated to all nodes.
Nodes keep this task in memory until it is deleted, or until all nodes are down (which clears tasks in memory).
- The task is deleted when the operation is complete. The operation runs on the node the task started on.

Workaround

In some cases, it may be possible to manually delete the stuck job. This should only be done after making absolutely certain that the job is no longer running.

Example for project migration, workflow migration, and bulk edits:

The request must be run by the user who executed the task. In addition, that user must be a JIRA Administrator in order to delete the task.
- If necessary, grant temporary admin permission, delete the task, then remove the admin permission.
- To find the task id, open the Workflow page, click where it says "Migration in progress", then click the progress link shown in blue. The task id should appear in the page URL. Substitute it for #id in the request below.
- The URL below applies to bulk edits as well as project changes.
  There is an internal REST endpoint to stop tasks.
  - DELETE /rest/projectconfig/1/migrationStatus/#id
  - "#id" can be taken from the progress page's URL.
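As an illustration, the DELETE request above could be built like this in Python. This is a hedged sketch, not an official client: the base URL, the credentials, the helper name, and the use of HTTP Basic authentication are all assumptions, and only the endpoint path comes from this workaround. Because the endpoint is internal, it may change or disappear between Jira versions.

```python
import base64
import urllib.request


def build_delete_request(base_url, task_id, username, password):
    """Build (but do not send) the DELETE request for a stuck task.

    Basic authentication is an assumption here; use whatever scheme
    your instance accepts for the user who started the task.
    """
    url = f"{base_url.rstrip('/')}/rest/projectconfig/1/migrationStatus/{task_id}"
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="DELETE")
    request.add_header("Authorization", f"Basic {token}")
    return request


# Sending is a separate, deliberate step, for example:
#   with urllib.request.urlopen(build_delete_request(...)) as response:
#       print(response.status)
```

Keeping construction separate from sending makes it easy to inspect the final URL before deleting anything.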

Screenshot 2022-07-12 at 2.18.17 PM.png
149 kB
12/Jul/2022 12:22 PM

causes

JRASERVER-70663 Tasks can be cleaned too early in Jira DC 8.7.1

Closed

JRASERVER-66722 As an JIRA Datacenter Administrator I want to delete stuck global tasks

Closed

JRASERVER-68616 As an JIRA Datacenter Administrator I want to delete reindexing task from offline node

Closed

JRASERVER-68885 Remove the stale indexing Job associated with current node on startup

Closed

depends on

JRASERVER-70585 Periodically clean-up cluster tasks from offline nodes

Closed

is related to

JRASERVER-47045 When a re-indexing node is abruptly stopped indexing still shows as in progress on other nodes, preventing future reindexing

Closed

JRASERVER-70584 When stuck operations are removed in DC, users should be notified

Gathering Interest

relates to

JRASERVER-72055 Abruptly stopping node while project reindex is in progress causes inability to reindex Jira

Gathering Impact

JRASERVER-71269 Ability delete Orphaned Background and Project Re-index Tasks

Gathering Interest

Mentioned in

CSS Top Asks - Jira On Prem: JRASERVER-66204 - Bulk Operation can get stuck in JDC

Assignee: Daniel Rauf
Reporter: Tim Evans (Inactive)
Votes: 63
Watchers: 90
Created: 24/Oct/2017 9:18 PM
Updated: 18/Jan/2024 7:32 PM
Resolved: 10/Feb/2020 5:19 PM
