- Bug
- Resolution: Fixed
- High
- 7.13.0, 8.5.0, 8.16.1, 8.17.1, 8.20.0, 8.13.13, 8.20.1
- 7.13
- 184
- Severity 2 - Major
- 1,889
Issue Summary
Jira pauses the cluster index replication service when requesting an index snapshot from another node. If the sending node fails to provide a snapshot for any reason, the cluster index replication service remains paused indefinitely.
Steps to Reproduce
- Deploy a two-node Jira Data Center cluster. Ensure that one of the nodes does not have a valid index and that the other cannot provide a valid snapshot, for example as described in JRASERVER-62669.
- Start one of the nodes while the other is already up.
Expected Results
- The starting node will obtain a valid index snapshot from another node.
- If that does not happen over a certain period of time, the starting node will either request another index snapshot, or at least unpause the index replication service.
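The expected behaviour above (wait a bounded time, retry, and unpause in any case) can be sketched as follows. This is only an illustration under stated assumptions, not Jira's implementation: `ReplicationService`, `wait_for_snapshot`, `request_snapshot`, `timeout_s` and `max_attempts` are all hypothetical names introduced here.

```python
import threading
from typing import Callable


class ReplicationService:
    """Hypothetical stand-in for the node's index replication service."""

    def __init__(self) -> None:
        self.paused = False

    def pause(self) -> None:
        self.paused = True

    def resume(self) -> None:
        self.paused = False


def wait_for_snapshot(
    service: ReplicationService,
    request_snapshot: Callable[[], threading.Event],
    timeout_s: float,
    max_attempts: int,
) -> bool:
    """Pause replication, request a snapshot, and never stay paused forever.

    request_snapshot (hypothetical) sends one request and returns an Event
    that is set once a snapshot has arrived and been restored.
    Returns True if a snapshot arrived within the allowed attempts.
    """
    service.pause()
    try:
        for _ in range(max_attempts):
            arrived = request_snapshot()
            # Event.wait returns False when the timeout expires first.
            if arrived.wait(timeout_s):
                return True
        return False
    finally:
        # Unpause whether or not a snapshot was obtained.
        service.resume()
```

The `finally` clause is the point of the sketch: replication is resumed on success, on timeout, and on any exception raised while requesting the snapshot.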
Actual Results
- Starting node requests an index snapshot from any other node in the cluster:
Starting node requests an index snapshot
2021-02-14 04:20:41,530+0000 localhost-startStop-1 INFO [c.a.jira.startup.ClusteringLauncher] Checking local index on node start
2021-02-14 04:20:41,534+0000 localhost-startStop-1 INFO [c.a.jira.cluster.DefaultClusterManager] Current node: 10.0.80.208 index can't be rebuilt. Requesting an index from any other node. Current list of other nodes: [10.0.12.156, 10.0.66.220, 10.0.174.242]
(...)
2021-02-14 04:20:41,540+0000 localhost-startStop-1 INFO [c.a.jira.cluster.DefaultClusterManager] Sending message: "Backup Index" - request to create index snapshot from node: ANY on current node: 10.0.80.208
- The starting node then pauses index replication while waiting for an index snapshot to be provided by another node:
Starting node pauses index replication
2021-02-14 04:20:41,534+0000 localhost-startStop-1 INFO [c.a.j.index.ha.DefaultNodeReindexService] [INDEX-REPLAY] Pausing node re-index service
java.lang.Exception
    at com.atlassian.jira.index.ha.DefaultNodeReindexService.pause(DefaultNodeReindexService.java:213)
    at com.atlassian.jira.cluster.DefaultClusterManager.requestCurrentIndexFromNode(DefaultClusterManager.java:138)
    at com.atlassian.jira.cluster.DefaultClusterManager.checkIndex(DefaultClusterManager.java:131)
    at com.atlassian.jira.startup.ClusteringLauncher.start(ClusteringLauncher.java:37)
    at com.atlassian.jira.startup.DefaultJiraLauncher.postDBActivated(DefaultJiraLauncher.java:168)
    at com.atlassian.jira.startup.DefaultJiraLauncher.lambda$postDbLaunch$2(DefaultJiraLauncher.java:146)
    at com.atlassian.jira.config.database.DatabaseConfigurationManagerImpl.doNowOrEnqueue(DatabaseConfigurationManagerImpl.java:301)
    at com.atlassian.jira.config.database.DatabaseConfigurationManagerImpl.doNowOrWhenDatabaseActivated(DatabaseConfigurationManagerImpl.java:196)
    at com.atlassian.jira.startup.DefaultJiraLauncher.postDbLaunch(DefaultJiraLauncher.java:137)
    at com.atlassian.jira.startup.DefaultJiraLauncher.lambda$start$0(DefaultJiraLauncher.java:104)
    at com.atlassian.jira.util.devspeed.JiraDevSpeedTimer.run(JiraDevSpeedTimer.java:31)
    at com.atlassian.jira.startup.DefaultJiraLauncher.start(DefaultJiraLauncher.java:102)
    at com.atlassian.jira.startup.LauncherContextListener.initSlowStuff(LauncherContextListener.java:154)
    at com.atlassian.jira.startup.LauncherContextListener.initSlowStuffInBackground(LauncherContextListener.java:139)
    at com.atlassian.jira.startup.LauncherContextListener.contextInitialized(LauncherContextListener.java:101)
    ... 5 filtered
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
- The sending node fails to provide an index snapshot for any reason (e.g. due to JRASERVER-62669):
Sending node fails to provide an index snapshot
2021-02-14 04:20:42,118+0000 ClusterMessageHandlerServiceThread:thread-1 INFO [c.a.j.index.ha.DefaultIndexCopyService] Received message: "Backup Index" - request to create index snapshot from node: 10.0.80.208 on current node: 10.0.12.156
2021-02-14 04:20:42,118+0000 ClusterMessageHandlerServiceThread:thread-1 INFO [c.a.j.index.ha.DefaultIndexCopyService] Index backup started. Requesting node: 10.0.80.208, currentNode: 10.0.12.156
2021-02-14 04:20:42,120+0000 ClusterMessageHandlerServiceThread:thread-1 WARN [c.a.j.index.ha.DefaultIndexCopyService] Index backup failed - latest index operation not found. Requesting node: 10.0.80.208, currentNode: 10.0.12.156
This can happen even when the requested node is able to provide a good copy.
- Starting node will keep waiting for an index snapshot indefinitely, while also keeping the replication service paused. Note the timestamps in this snippet, more than a day later:
2021-02-15 10:04:19,691+0000 NodeReindexServiceThread:thread-0 INFO [c.a.j.index.ha.DefaultNodeReindexService] [INDEX-REPLAY] Node re-index service is not running: currentNode.isClustered=true, notRunningCounter=21396, paused=true, lastPausedStacktrace=java.lang.Throwable
    at com.atlassian.jira.index.ha.DefaultNodeReindexService.pause(DefaultNodeReindexService.java:215)
    at com.atlassian.jira.util.index.CompositeIndexLifecycleManager.reIndexAll(CompositeIndexLifecycleManager.java:62)
    at com.atlassian.jira.util.index.CompositeIndexLifecycleManager.reIndexAll(CompositeIndexLifecycleManager.java:51)
    at com.atlassian.jira.web.action.admin.index.ReIndexAsyncIndexerCommand.doReindex(ReIndexAsyncIndexerCommand.java:27)
    at com.atlassian.jira.web.action.admin.index.AbstractAsyncIndexerCommand.call(AbstractAsyncIndexerCommand.java:63)
    at com.atlassian.jira.web.action.admin.index.ReIndexAsyncIndexerCommand.call(ReIndexAsyncIndexerCommand.java:18)
    at com.atlassian.jira.web.action.admin.index.AbstractAsyncIndexerCommand.call(AbstractAsyncIndexerCommand.java:26)
    at com.atlassian.jira.task.TaskManagerImpl$TaskCallableDecorator.call(TaskManagerImpl.java:533)
    at com.atlassian.jira.task.TaskManagerImpl$TaskCallableDecorator.call(TaskManagerImpl.java:491)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at com.atlassian.jira.task.ForkedThreadExecutor$ForkedRunnableDecorator.run(ForkedThreadExecutor.java:216)
    at java.lang.Thread.run(Thread.java:748)
Note that "Node re-index service is not running" messages alone do not necessarily indicate that the instance is affected by this bug, especially when a newly added cluster node appears to hang on startup. To confirm that the node is affected, verify that the other symptoms listed above are also present. If they are not, and you are starting a new node in a Data Center cluster, the node is probably still recovering its indexes from an existing snapshot and catching up on the updates made between the snapshot time and the time the node started. Depending on the size of the Jira instance, it is not unusual for a cluster node to take an hour to start up.
- Cluster index replication will fall behind on the starting node, eventually leading to a failed Cluster Index Replication health check and symptoms such as:
  - Missing issues in agile boards.
  - Searches returning incomplete or inconsistent results.
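As a rough aid for checking the symptoms above together, the sketch below scans log lines for the three messages quoted in this report and estimates how long the re-index service has been reported as paused. It is a minimal illustration, not an Atlassian tool: the function names are hypothetical, the message fragments are copied from the excerpts above, and the lines are assumed to be merged from the starting node's and the sending node's logs. Because this report shows no log message for the service being resumed, the sketch cannot detect a resume; it only reports the span between the first pause and the last "not running ... paused=true" entry.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, Optional

# Message fragments copied from the log excerpts in this report.
PAUSED = "[INDEX-REPLAY] Pausing node re-index service"
NOT_RUNNING = "[INDEX-REPLAY] Node re-index service is not running"
BACKUP_FAILED = "Index backup failed"

TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}[+-]\d{4}) ")


def parse_timestamp(line: str) -> Optional[datetime]:
    """Return the leading timestamp, or None for stack-trace/continuation lines."""
    match = TIMESTAMP.match(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f%z")


def summarize(lines: Iterable[str]) -> dict:
    """Collect the indicators described in this report from log lines."""
    first_pause: Optional[datetime] = None
    last_still_paused: Optional[datetime] = None
    backup_failed = False
    for line in lines:
        ts = parse_timestamp(line)
        if ts is None:
            continue
        if PAUSED in line:
            if first_pause is None:
                first_pause = ts
        elif NOT_RUNNING in line and "paused=true" in line:
            last_still_paused = ts
        elif BACKUP_FAILED in line:
            backup_failed = True
    paused_for: Optional[timedelta] = None
    if (first_pause is not None and last_still_paused is not None
            and last_still_paused > first_pause):
        paused_for = last_still_paused - first_pause
    return {
        "first_pause": first_pause,
        "backup_failed": backup_failed,
        "paused_for": paused_for,
    }
```

A long `paused_for` combined with `backup_failed` being true matches the sequence described in this report; a long `paused_for` on its own may simply be a node that is still catching up.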
Workaround
- Request an index from another node via the admin panel (Copy the Search Index from another node).
- Restore the index from an index backup.
- Restart the node.
- The load balancer should not redirect users to a node with no index: JRASERVER-66970
Notes
- As of 8.19.0 we introduced fetching an index snapshot from the shared home on startup, which prevents this issue from happening. For this feature to work, an index snapshot must be available in the `export/indexsnapshots` directory of the shared home. In that version, a service that creates index snapshots is enabled by default and runs on a 24-hour cycle. See details in JRASERVER-66649.
- In Jira 9.0 we've ensured that a Jira instance creates an index snapshot and saves it to the shared home directory only when the index on that instance is consistent. More details on how to handle situations where an index is not consistent can be found here: Indexing inconsistency troubleshooting
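Since the startup fetch described above depends on a snapshot being present in `export/indexsnapshots`, a quick check of that directory can be sketched as below. This is an assumption-laden illustration: `newest_snapshot_age_hours` is a hypothetical helper, the shared home path is supplied by the caller, and every regular file in the directory is treated as a snapshot because the snapshot file naming is not stated in this report.

```python
import time
from pathlib import Path
from typing import Optional


def newest_snapshot_age_hours(shared_home: str,
                              now: Optional[float] = None) -> Optional[float]:
    """Return the age in hours of the newest file in export/indexsnapshots.

    Returns None when the directory is missing or contains no files.
    Assumption: any regular file in the directory counts as a snapshot.
    """
    snapshot_dir = Path(shared_home) / "export" / "indexsnapshots"
    if not snapshot_dir.is_dir():
        return None
    mtimes = [p.stat().st_mtime for p in snapshot_dir.iterdir() if p.is_file()]
    if not mtimes:
        return None
    current = time.time() if now is None else now
    return (current - max(mtimes)) / 3600.0
```

With the default 24-hour snapshot cycle mentioned above, a result of `None` or an age well beyond 24 hours suggests the snapshot service is not producing snapshots and the startup fetch may have nothing recent to restore.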
- is related to
  - JRASERVER-70443 NodeReindexServiceThread can stop checking messages (Closed)
  - JRASERVER-74244 Exception thrown during full reindex on node startup result in non-dismissible Johnson page (Closed)
  - JRASERVER-74248 Jira shows unnecessarily alarming stack trace when reindexing thread is expectedly disabled (Closed)
  - JRASERVER-74329 Starting a node while other node is performing full reindex may lead to inconsisten index. (Gathering Impact)
  - JRASERVER-66970 /status should indicate when indexes are broken on a node (Closed)
  - JRASERVER-74232 Make index catch up during startup multi-threaded (Closed)
  - JRASERVER-74233 After inactivity node should catch up with index changes before it serves traffic. (Closed)
  - JRASERVER-74328 Make the threshold to allow rebuilding local index configurable (Closed)
- relates to
  - JRASERVER-62669 Automatic restore of indexes will fail if the node that registered the latest index operation is unavailable (Closed)
  - ASCI-8
  - PSR-707