[JRASERVER-72125] Index replication service is paused indefinitely after failing to obtain an index snapshot from another node

Type: Bug
Resolution: Fixed
Priority: High (View bug fix roadmap)
Fix Version/s: 9.1.0
Affects Version/s: 7.13.0, 8.5.0, 8.16.1, 8.17.1, 8.20.0, 8.13.13, 8.20.1
Component/s: Data Center - Index, Data Center - Node replication, Indexing
Labels:

Introduced in Version:
7.13
Support reference count:
184
Symptom Severity:
Severity 2 - Major
UIS:
1,889
Bug Fix Policy:
View Atlassian Server bug fix policy
Current Status:

Hi Team,

we're happy to announce that this issue is fixed in the 9.1.0 release, available → https://www.atlassian.com/software/jira/update.

Before Jira 9.1, it was the admin's responsibility to ensure that only one node started at a time and that all nodes in the cluster were healthy, so that each of them could provide a healthy index snapshot.

Starting from Jira 9.1, synchronous node start-up is enforced by the application. The start-up procedure ensures the local index is healthy before Jira can continue start-up. Jira will not start without a healthy index.

The start-up procedure is performed under a cluster lock, guaranteeing that only one node at a time executes it.

The index start-up procedure:

- Re-index missing data if the local issue index is less than 10% behind the database.
- Load a recent index snapshot from the shared-home directory if one is available.
- Otherwise, trigger a full re-index.
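
The procedure above can be read as a three-way decision. The following is a minimal sketch of that decision, not Jira's actual implementation: the function name, the enum, and the way "behind the database" is measured (missing issues as a fraction of the database issue count) are all assumptions made for illustration. Only the 10% threshold and the order of the steps come from the description.

```python
from enum import Enum


class StartupAction(Enum):
    CATCH_UP = "re-index missing data"
    LOAD_SNAPSHOT = "load snapshot from shared home"
    FULL_REINDEX = "full re-index"


# Threshold from the description: the local index must be less than
# 10% behind the database to be repaired in place.
CATCH_UP_THRESHOLD = 0.10


def choose_startup_action(db_issue_count: int,
                          indexed_issue_count: int,
                          snapshot_available: bool) -> StartupAction:
    """Pick the index recovery step (hypothetical helper, illustrative only)."""
    if db_issue_count == 0:
        behind = 0.0  # nothing to index, so the index cannot be behind
    else:
        missing = max(db_issue_count - indexed_issue_count, 0)
        # Assumption: lag is measured as the share of issues missing locally.
        behind = missing / db_issue_count
    if behind < CATCH_UP_THRESHOLD:
        return StartupAction.CATCH_UP
    if snapshot_available:
        return StartupAction.LOAD_SNAPSHOT
    return StartupAction.FULL_REINDEX
```

For example, a node with 950 of 1000 issues indexed would catch up in place under this sketch, while a node with an empty index and no snapshot in the shared home would fall through to a full re-index.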

Please refer to this Knowledge Base article for more information.

With that, we continue our Scale & performance roadmap work around index management. Stay tuned for more coming later this year.

Cheers

Andrzej Kotas
Product Manager - Jira DC

Issue Summary

Jira pauses the cluster index replication service when requesting an index snapshot from another node. If the sending node fails to provide a snapshot for any reason, the cluster index replication service will remain paused indefinitely.

Steps to Reproduce

Deploy a two-node Jira Data Center cluster. Ensure one of the nodes does not have a valid index, and the other will not be able to provide a valid snapshot. For example, as described in ~~JRASERVER-62669~~.
Start one of the nodes while the other is already up.

Expected Results

The starting node will obtain a valid index snapshot from another node.
If that does not happen over a certain period of time, the starting node will either request another index snapshot, or at least unpause the index replication service.
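
The expected behaviour can be sketched as a bounded wait: pause replication, request a snapshot, wait up to a timeout, retry a limited number of times, and always unpause at the end. This is only an illustration of the expectation stated above, not Jira's code; `ReplicationService`, `wait_for_snapshot`, and all parameters are hypothetical names.

```python
import threading
from typing import Callable


class ReplicationService:
    """Stand-in for the index replication service (hypothetical)."""

    def __init__(self) -> None:
        self.paused = False

    def pause(self) -> None:
        self.paused = True

    def resume(self) -> None:
        self.paused = False


def wait_for_snapshot(service: ReplicationService,
                      snapshot_arrived: threading.Event,
                      request_snapshot: Callable[[], None],
                      timeout_s: float,
                      max_attempts: int = 2) -> bool:
    """Return True if a snapshot arrived; replication is resumed either way."""
    service.pause()
    try:
        for _ in range(max_attempts):
            request_snapshot()
            # Event.wait returns False when the timeout expires first.
            if snapshot_arrived.wait(timeout_s):
                return True
        return False
    finally:
        # The key point: never leave replication paused indefinitely.
        service.resume()
```

The `finally` block is the design choice that matters here: whatever happens to the snapshot request, the replication service does not stay paused.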

Actual Results

Starting node requests an index snapshot from any other node in the cluster:

Starting node requests an index snapshot

2021-02-14 04:20:41,530+0000 localhost-startStop-1 INFO      [c.a.jira.startup.ClusteringLauncher] Checking local index on node start
2021-02-14 04:20:41,534+0000 localhost-startStop-1 INFO      [c.a.jira.cluster.DefaultClusterManager] Current node: 10.0.80.208 index can't be rebuilt. Requesting an index from any other node. Current list of other nodes: [10.0.12.156, 10.0.66.220, 10.0.174.242]
(...)
2021-02-14 04:20:41,540+0000 localhost-startStop-1 INFO      [c.a.jira.cluster.DefaultClusterManager] Sending message: "Backup Index" - request to create index snapshot from node: ANY on current node: 10.0.80.208

The starting node then pauses index replication while waiting for an index snapshot to be provided by another node:

Starting node pauses index replication

2021-02-14 04:20:41,534+0000 localhost-startStop-1 INFO      [c.a.j.index.ha.DefaultNodeReindexService] [INDEX-REPLAY] Pausing node re-index service
java.lang.Exception
    at com.atlassian.jira.index.ha.DefaultNodeReindexService.pause(DefaultNodeReindexService.java:213)
    at com.atlassian.jira.cluster.DefaultClusterManager.requestCurrentIndexFromNode(DefaultClusterManager.java:138)
    at com.atlassian.jira.cluster.DefaultClusterManager.checkIndex(DefaultClusterManager.java:131)
    at com.atlassian.jira.startup.ClusteringLauncher.start(ClusteringLauncher.java:37)
    at com.atlassian.jira.startup.DefaultJiraLauncher.postDBActivated(DefaultJiraLauncher.java:168)
    at com.atlassian.jira.startup.DefaultJiraLauncher.lambda$postDbLaunch$2(DefaultJiraLauncher.java:146)
    at com.atlassian.jira.config.database.DatabaseConfigurationManagerImpl.doNowOrEnqueue(DatabaseConfigurationManagerImpl.java:301)
    at com.atlassian.jira.config.database.DatabaseConfigurationManagerImpl.doNowOrWhenDatabaseActivated(DatabaseConfigurationManagerImpl.java:196)
    at com.atlassian.jira.startup.DefaultJiraLauncher.postDbLaunch(DefaultJiraLauncher.java:137)
    at com.atlassian.jira.startup.DefaultJiraLauncher.lambda$start$0(DefaultJiraLauncher.java:104)
    at com.atlassian.jira.util.devspeed.JiraDevSpeedTimer.run(JiraDevSpeedTimer.java:31)
    at com.atlassian.jira.startup.DefaultJiraLauncher.start(DefaultJiraLauncher.java:102)
    at com.atlassian.jira.startup.LauncherContextListener.initSlowStuff(LauncherContextListener.java:154)
    at com.atlassian.jira.startup.LauncherContextListener.initSlowStuffInBackground(LauncherContextListener.java:139)
    at com.atlassian.jira.startup.LauncherContextListener.contextInitialized(LauncherContextListener.java:101)
    ... 5 filtered
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

The sending node fails to provide an index snapshot for any reason (e.g. due to ~~JRASERVER-62669~~):

Sending node fails to provide an index snapshot

2021-02-14 04:20:42,118+0000 ClusterMessageHandlerServiceThread:thread-1 INFO      [c.a.j.index.ha.DefaultIndexCopyService] Received message: "Backup Index" - request to create index snapshot from node: 10.0.80.208 on current node: 10.0.12.156
2021-02-14 04:20:42,118+0000 ClusterMessageHandlerServiceThread:thread-1 INFO      [c.a.j.index.ha.DefaultIndexCopyService] Index backup started. Requesting node: 10.0.80.208, currentNode: 10.0.12.156
2021-02-14 04:20:42,120+0000 ClusterMessageHandlerServiceThread:thread-1 WARN      [c.a.j.index.ha.DefaultIndexCopyService] Index backup failed - latest index operation not found. Requesting node: 10.0.80.208, currentNode: 10.0.12.156

This can happen even when the requested node is able to provide a good copy.

The starting node will keep waiting for an index snapshot indefinitely, while also keeping the replication service paused. Note the timestamp in this snippet, more than a day later:

2021-02-15 10:04:19,691+0000 NodeReindexServiceThread:thread-0 INFO      [c.a.j.index.ha.DefaultNodeReindexService] [INDEX-REPLAY] Node re-index service is not running: currentNode.isClustered=true, notRunningCounter=21396, paused=true, lastPausedStacktrace=java.lang.Throwable
        at com.atlassian.jira.index.ha.DefaultNodeReindexService.pause(DefaultNodeReindexService.java:215)
        at com.atlassian.jira.util.index.CompositeIndexLifecycleManager.reIndexAll(CompositeIndexLifecycleManager.java:62)
        at com.atlassian.jira.util.index.CompositeIndexLifecycleManager.reIndexAll(CompositeIndexLifecycleManager.java:51)
        at com.atlassian.jira.web.action.admin.index.ReIndexAsyncIndexerCommand.doReindex(ReIndexAsyncIndexerCommand.java:27)
        at com.atlassian.jira.web.action.admin.index.AbstractAsyncIndexerCommand.call(AbstractAsyncIndexerCommand.java:63)
        at com.atlassian.jira.web.action.admin.index.ReIndexAsyncIndexerCommand.call(ReIndexAsyncIndexerCommand.java:18)
        at com.atlassian.jira.web.action.admin.index.AbstractAsyncIndexerCommand.call(AbstractAsyncIndexerCommand.java:26)
        at com.atlassian.jira.task.TaskManagerImpl$TaskCallableDecorator.call(TaskManagerImpl.java:533)
        at com.atlassian.jira.task.TaskManagerImpl$TaskCallableDecorator.call(TaskManagerImpl.java:491)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at com.atlassian.jira.task.ForkedThreadExecutor$ForkedRunnableDecorator.run(ForkedThreadExecutor.java:216)
        at java.lang.Thread.run(Thread.java:748)

Note that the "Node re-index service is not running" log entries don't necessarily indicate that the instance is affected by this bug, especially when a newly added cluster node appears to hang on startup. To confirm that the node is affected, verify that the other symptoms listed above are present. If they are not, and you're starting a new node in a Data Center cluster, it's probably just a matter of waiting: the node may simply be restoring the indexes from an existing snapshot and catching up on the updates made between the snapshot time and the time the new node started. Depending on the size of the Jira instance, it's not unusual for a cluster node to take an hour to start up.
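
To judge how long a node has been reporting a paused replication service, the log entries shown above can be compared by timestamp. The following is a rough sketch under stated assumptions: the message strings are copied from the excerpts above, the function names are hypothetical, all lines are assumed to use the same time zone, and the script does not look for any "resumed" message because none is shown in this report.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, Optional

TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})")
PAUSE_MSG = "[INDEX-REPLAY] Pausing node re-index service"
NOT_RUNNING_MSG = "[INDEX-REPLAY] Node re-index service is not running"


def parse_ts(line: str) -> Optional[datetime]:
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' timestamp, if any."""
    m = TS_RE.match(line)
    if not m:
        return None  # e.g. a stack-trace line
    base = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    return base + timedelta(milliseconds=int(m.group(2)))


def paused_duration(lines: Iterable[str]) -> Optional[timedelta]:
    """Time between the last pause and the latest 'not running' entry."""
    paused_since = None
    last_seen = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue
        if PAUSE_MSG in line:
            paused_since, last_seen = ts, None
        elif NOT_RUNNING_MSG in line and "paused=true" in line:
            last_seen = ts
    if paused_since is None or last_seen is None:
        return None
    return last_seen - paused_since
```

A long duration alone is not proof of this bug, as noted above; it is only a prompt to check for the other symptoms, such as the "Index backup failed" warning on the sending node.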

Cluster index replication will fall behind in the starting node, eventually leading to a failed Cluster Index Replication health check, and symptoms such as:
- Missing issues in agile boards.
- Searches bringing incomplete or inconsistent results.

Workaround

- Request an index from another node via the admin panel (Copy the Search Index from another node).
- Restore the index from an index backup.
- Restart the node.
- The load balancer should not redirect users to a node with no index: ~~JRASERVER-66970~~

Notes

As of 8.19.0, we introduced fetching the index snapshot from the shared home on startup, which prevents this issue from happening. For this feature to work, an index snapshot must be available in the `export/indexsnapshots` directory of the shared home. A service that creates index snapshots is enabled by default on a 24-hour cycle in that version. See details in ~~JRASERVER-66649~~.
In Jira 9.0 we've ensured that a Jira instance creates an index snapshot and saves it to the shared home directory only when the index on that instance is consistent. More details on how to handle situations where an index is not consistent can be found here: Indexing inconsistency troubleshooting
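
Since the 8.19.0 behaviour depends on a snapshot being present in `export/indexsnapshots`, it can be useful to check that directory before starting a node. The sketch below is an assumption-laden helper, not an Atlassian tool: the function name is hypothetical, the shared home path must be supplied by the caller, no particular snapshot file name is assumed, and the 24-hour default simply mirrors the snapshot cycle mentioned above.

```python
import time
from pathlib import Path
from typing import Optional


def newest_snapshot(shared_home: Path,
                    max_age_hours: float = 24.0,
                    now: Optional[float] = None) -> Optional[Path]:
    """Return the newest file in export/indexsnapshots if it is recent enough."""
    snapshot_dir = shared_home / "export" / "indexsnapshots"
    if not snapshot_dir.is_dir():
        return None
    # Any regular file counts; the snapshot naming scheme is not assumed.
    files = [p for p in snapshot_dir.iterdir() if p.is_file()]
    if not files:
        return None
    newest = max(files, key=lambda p: p.stat().st_mtime)
    now = time.time() if now is None else now
    age_hours = (now - newest.stat().st_mtime) / 3600
    return newest if age_hours <= max_age_hours else None
```

A result of `None` means the directory is missing, empty, or holds only files older than the chosen age, in which case a starting node may not find a usable snapshot there.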

Screen Shot 2021-08-05 at 12.18.32 PM.png
265 kB
05/Aug/2021 5:20 PM

is related to

JRASERVER-70443 NodeReindexServiceThread can stop checking messages

Closed

JRASERVER-74244 Exception thrown during full reindex on node startup result in non-dismissible Johnson page

Closed

JRASERVER-74248 Jira shows unnecessarily alarming stack trace when reindexing thread is expectedly disabled

Closed

JRASERVER-74329 Starting a node while other node is performing full reindex may lead to inconsisten index.

Gathering Impact

JRASERVER-66970 /status should indicate when indexes are broken on a node

Closed

JRASERVER-74232 Make index catch up during startup multi-threaded

Closed

JRASERVER-74233 After inactivity node should catch up with index changes before it serves traffic.

Closed

JRASERVER-74328 Make the threshold to allow rebuilding local index configurable

Closed

relates to

JRASERVER-62669 Automatic restore of indexes will fail if the node that registered the latest index operation is unavailable

Closed

ASCI-8

PSR-707

Mentioned in

CSS Top Asks - Jira On Prem: JRASERVER-72125 | Index replication service is paused indefinitely after failing ...

Conny Postma made changes - 17/Jan/2025 10:45 AM

Remote Link

Original: This issue links to "Page (Atlassian Documentation)" [ 626031 ]

Thiago Masutti made changes - 04/Jan/2024 7:48 PM

Remote Link

Original: This issue links to "ASCI-8 (Bulldog)" [ 556435 ]

New: This issue links to "ASCI-8 (JIRA Server (Bulldog))" [ 556435 ]

Thiago Masutti made changes - 04/Jan/2024 7:48 PM

Remote Link

Original: This issue links to "PSR-707 (Bulldog)" [ 610728 ]

New: This issue links to "PSR-707 (JIRA Server (Bulldog))" [ 610728 ]

Azfar Masut added a comment - 20/Aug/2023 8:03 AM

Removing the index folders from your Jira node ($JIRA-HOME/caches/indexesV1) helps to remediate the problem as a workaround. We also added -Dcom.atlassian.jira.status.index.check=false to the startup parameters.

Maciej Swinarski (Inactive) added a comment - 12/Jun/2023 9:04 AM

Hi a0ef4d4784b7, this is expected.

When a node performs a full reindex it is:

not accepting user traffic (out of LB)
not updating the index (outside of the full-reindex)

During this time, such a node does not replicate any internal index changes to other nodes and does not consume any index changes from other nodes. You may see entries like this in the log:

2023-06-05 14:33:19,745+0000 NodeReindexServiceThread:thread-0 INFO      [c.a.j.index.ha.DefaultNodeReindexService] [INDEX-REPLAY] Node re-index service is not running: currentNode.isClustered=true, notRunningCounter=768, paused=true, lastPausedStacktrace=com.atlassian.jira.index.ha.DefaultNodeReindexService$StackCollector

After the node has finished the full re-index, created a snapshot and sent the snapshot to the shared home, index replication should start working and there should be no such log entries. If such entries still appear after that, please create a support request with the support.zip(s) so this can be investigated.

Best regards,

mac

Saurabh Gupta added a comment - 05/Jun/2023 2:40 PM - edited

Found this issue in Jira 9.4.6 while re-indexing the testing instance created via the data center performance script.

2023-06-05 14:33:19,745+0000 NodeReindexServiceThread:thread-0 INFO      [c.a.j.index.ha.DefaultNodeReindexService] [INDEX-REPLAY] Node re-index service is not running: currentNode.isClustered=true, notRunningCounter=768, paused=true, lastPausedStacktrace=com.atlassian.jira.index.ha.DefaultNodeReindexService$StackCollector

This was logged even though the re-indexing completed successfully.

Neel made changes - 02/Jun/2023 12:35 PM

Remote Link

New: This issue links to "Page (Confluence)" [ 771959 ]

Michał Gozdera made changes - 10/May/2023 11:02 AM

Remote Link

New: This issue links to "Page (Confluence)" [ 760645 ]

Rinish made changes - 26/Apr/2023 6:17 AM

Remote Link

New: This issue links to "Page (Confluence)" [ 755034 ]

Maciej Swinarski (Inactive) made changes - 28/Feb/2023 3:03 PM

Assignee

New: Maciej Swinarski [ mswinarski ]

Assignee: Maciej Swinarski (Inactive)

Reporter: Vinicius Fontes

Affected customers: 83

Watchers: 121

Created: 18/Feb/2021 8:23 PM

Updated: 17/Jan/2025 10:45 AM

Resolved: 21/Jul/2022 8:39 AM
