Type: Bug
Resolution: Fixed
Priority: Low
Affects Version/s: 8.20.10, 9.12.11
Symptom Severity: Severity 3 - Minor
Issue Summary
By default, Jira creates a snapshot of its indexes daily at 2 AM. However, if this coincides with a Lock and Re-index that has just completed on a different node in the cluster, it can break the snapshot task and leave the node's index in a deadlocked state, requiring a hard restart of the node and a restore of a healthy index from a different node.
This is reproducible on Data Center: yes
Steps to Reproduce
- Execute a Lock and Re-index task in a node in the data center cluster
- When the re-index completes and a different node (the one that will run the snapshot backup job) detects that it has finished, have the snapshot backup job kick in at that moment
Starting a full reindex while the snapshot job is running also causes this same issue.
Expected Results
Jira prevents the local index from being replaced with a new index available in the cluster while the snapshot backup job is running.
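In other words, the snapshot backup and the index restore on a node should be mutually exclusive. A minimal sketch of that idea in plain Java (IndexTaskGuard and both Runnables are hypothetical names, not Jira's actual classes):
{code:java}
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical guard: snapshot creation and index restore share one lock, so a
// restore cannot start while a backup of the same index is still being written.
public class IndexTaskGuard {
    private static final ReentrantLock INDEX_TASK_LOCK = new ReentrantLock();

    public static void runSnapshotBackup(Runnable backup) {
        INDEX_TASK_LOCK.lock();          // held for the whole backup
        try {
            backup.run();
        } finally {
            INDEX_TASK_LOCK.unlock();
        }
    }

    public static boolean tryRunRestore(Runnable restore) {
        // Skip (or reschedule) the restore instead of replacing index files
        // that the backup job is still reading.
        if (!INDEX_TASK_LOCK.tryLock()) {
            return false;
        }
        try {
            restore.run();
            return true;
        } finally {
            INDEX_TASK_LOCK.unlock();
        }
    }
}
{code}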
Actual Results
Jira allows both tasks to run simultaneously, leading to a "deadlock" scenario where both tasks fail and the node's index is corrupted.
Timeline
The affected node starts creating a backup of its own index:
2022-11-09 01:00:00,433+0100 Caesium-1-1 INFO anonymous JIRA Index Snapshot Service [c.a.j.index.ha.IndexUtils] Creating index snapshot of type: SNAPPY in local temporary directory: /appl/jira/inst/temp/snapshot1667952000433/IndexSnapshot_2022-Nov-09--0100.tar.sz
2022-11-09 01:00:00,439+0100 Caesium-1-1 INFO anonymous JIRA Index Snapshot Service [c.a.j.util.compression.ArchiveUtils] Creating archive of type: SNAPPY in path: /appl/jira/inst/temp/snapshot1667952000433/IndexSnapshot_2022-Nov-09--0100.tar.sz
However, a few seconds later the DefaultNodeReindexService job identifies that a different node in the cluster has just completed a new re-index and that a new index snapshot is available, so it starts restoring that file:
2022-11-09 01:00:37,542+0100 NodeReindexServiceThread:thread-0 INFO [c.a.j.index.ha.DefaultNodeReindexService] [INDEX-REPLAY] Re-index after last sync detected: operationName=FULL_REINDEX_END, sourceNodeId=jira-0, indexTime=2022-11-09 01:00:22.723412. Resetting index count and restoring index from backupFilename='IndexSnapshot_16800.tar.sz'
2022-11-09 01:00:37,739+0100 NodeReindexServiceThread:thread-0 INFO [c.a.j.index.ha.DefaultIndexCopyService] Index restore started. Total 4745089 issues on instance before loading Snapshot file: IndexSnapshot_16800.tar.sz
Simultaneously, the index snapshot job is still backing up index files:
2022-11-09 01:02:37,518+0100 Caesium-1-1 INFO anonymous JIRA Index Snapshot Service [c.a.j.index.ha.IndexUtils] Progress of creating index snapshot: IndexSnapshot_2022-Nov-09--0100.tar.sz, number of index files processed: 232
...
2022-11-09 01:03:13,671+0100 Caesium-1-1 INFO anonymous JIRA Index Snapshot Service [c.a.j.index.ha.IndexUtils] Progress of creating index snapshot: IndexSnapshot_2022-Nov-09--0100.tar.sz, number of index files processed: 250
...
2022-11-09 01:03:52,192+0100 Caesium-1-1 INFO anonymous JIRA Index Snapshot Service [c.a.j.index.ha.IndexUtils] Progress of creating index snapshot: IndexSnapshot_2022-Nov-09--0100.tar.sz, number of index files processed: 264
...
2022-11-09 01:04:38,108+0100 Caesium-1-1 INFO anonymous JIRA Index Snapshot Service [c.a.j.index.ha.IndexUtils] Progress of creating index snapshot: IndexSnapshot_2022-Nov-09--0100.tar.sz, number of index files processed: 299
...
However, while the snapshot backup job is still running, the Index Fixer starts replacing the index files with the new index it received from the other node. For this, it holds the "stop-the-world" index lock, which prevents any writes to the index until that task releases it:
2022-11-09 01:06:00,054+0100 NodeReindexServiceThread:thread-0 INFO [c.a.j.index.ha.DefaultIndexRecoveryManager] [INDEX-FIXER] Restoring index with 4743369 issues
2022-11-09 01:06:00,119+0100 NodeReindexServiceThread:thread-0 INFO [c.a.j.index.ha.DefaultIndexRecoveryManager] [INDEX-FIXER] Restoring search indexes - 1% complete... [INDEX-FIXER] Replacing indexes
2022-11-09 01:06:00,119+0100 NodeReindexServiceThread:thread-0 INFO [c.a.j.index.ha.DefaultIndexRecoveryManager] [INDEX-FIXER] Starting the index restore process by replacing existing indexes. For this, we're holding the 'stop-the-world' index lock.
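Conceptually, the "stop-the-world" index lock behaves like the write side of a read-write lock: ordinary index updates can run concurrently under the read lock, while the restore takes the write lock and blocks every other index operation until it finishes. A rough illustration in plain Java (StopTheWorldIndexLock is a made-up name, not Jira's implementation):
{code:java}
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Conceptual sketch of a "stop-the-world" index lock: normal index updates take
// the read lock and may run concurrently, while replacing the index directory
// takes the write lock and blocks every other index operation.
public class StopTheWorldIndexLock {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    public void updateIndex(Runnable indexWrite) {
        lock.readLock().lock();
        try {
            indexWrite.run();           // ordinary issue/DBR index updates
        } finally {
            lock.readLock().unlock();
        }
    }

    public void replaceIndex(Runnable restore) {
        lock.writeLock().lock();        // "stop the world": nothing else may touch the index
        try {
            restore.run();              // swap in the restored index files
        } finally {
            lock.writeLock().unlock();
        }
    }
}
{code}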
Because the Index Fixer has removed physical files that the snapshot backup job had already mapped when generating its list of files to back up, the snapshot backup job fails to complete:
2022-11-09 01:06:01,422+0100 Caesium-1-1 ERROR anonymous JIRA Index Snapshot Service [c.a.jira.service.ServiceRunner] An error occurred while trying to run service 'JIRA Index Snapshot Service'. java.io.FileNotFoundException: /appl/jira/home/caches/indexesV1/comments/_24jm2_Lucene50_0.tip (No such file or directory)
com.atlassian.jira.util.RuntimeIOException: java.io.FileNotFoundException: /appl/jira/home/caches/indexesV1/comments/_24jm2_Lucene50_0.tip (No such file or directory)
Caused by: java.io.FileNotFoundException: /appl/jira/home/caches/indexesV1/comments/_24jm2_Lucene50_0.tip (No such file or directory)
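This failure mode is easy to reproduce outside Jira: the backup builds its list of index files first and reads them later, so any file deleted in between is gone by the time it is archived. A self-contained sketch (Java 11+, single-threaded for clarity; the file name is borrowed from the log above and nothing here is Jira code):
{code:java}
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

// Illustration of the race between the snapshot backup and the index restore:
// the backup collects its file list first and reads the files later, so a file
// deleted in between no longer exists when it is archived.
public class SnapshotRace {
    public static void main(String[] args) throws Exception {
        Path indexDir = Files.createTempDirectory("indexesV1");
        Path segment = Files.writeString(indexDir.resolve("_24jm2_Lucene50_0.tip"), "segment data");

        // 1. The backup job builds its file list while the segment file still exists.
        List<Path> toArchive;
        try (var files = Files.list(indexDir)) {
            toArchive = files.collect(Collectors.toList());
        }

        // 2. Meanwhile the restore replaces the index, deleting the old files.
        Files.delete(segment);

        // 3. The backup job now tries to read a file that no longer exists.
        try {
            for (Path p : toArchive) {
                // NoSuchFileException is the NIO equivalent of the
                // java.io.FileNotFoundException shown in the log above.
                Files.readAllBytes(p);
            }
        } catch (NoSuchFileException e) {
            System.out.println("Backup failed: " + e.getFile() + " (No such file or directory)");
        }
    }
}
{code}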
The Index Fixer process dies at that point, along with the index snapshot backup job. Finally, the Index Replay fails as well: it cannot sync the changes the node has been receiving via DBR, because it cannot delete files from the index folder due to the write lock put in place by the index restore process:
2022-11-09 01:06:03,169+0100 NodeReindexServiceThread:thread-0 ERROR [c.a.j.index.ha.DefaultNodeReindexService] [INDEX-REPLAY] Error re-indexing node changes
java.lang.RuntimeException: Failed to update index for ([ReplicatedIndexOperation{indexTime=2022-11-09 01:00:22.723412, id=169485404, nodeId='jira-0', affectedIdToVersion={}, operation=FULL_REINDEX_END, affectedIndex=ALL, entityType=NONE, backupFilename='IndexSnapshot_16800.tar.sz'}, [...]
Caused by: com.atlassian.jira.util.RuntimeIOException: org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /appl/jira/home/caches/indexesV1/issues/write.lock
    at com.atlassian.jira.index.DefaultIndexEngine.clean(DefaultIndexEngine.java:143)
    at com.atlassian.jira.index.DefaultManager.deleteIndexDirectory(DefaultManager.java:40)
    at com.atlassian.jira.issue.index.DefaultIssueIndexer.deleteIndexes(DefaultIssueIndexer.java:425)
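The LockObtainFailedException at the bottom of the stack is standard Lucene behaviour: while one IndexWriter holds write.lock for an index directory, any further attempt to open a writer on that directory from the same JVM fails with "Lock held by this virtual machine". A self-contained example against the plain Lucene API (using a temporary directory instead of Jira's real index path):
{code:java}
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;

// While one IndexWriter holds write.lock for an index directory, a second
// attempt to open a writer on the same directory in the same JVM fails.
public class WriteLockContention {
    public static void main(String[] args) throws Exception {
        Path issuesIndex = Files.createTempDirectory("issues");   // stands in for caches/indexesV1/issues
        try (FSDirectory dir = FSDirectory.open(issuesIndex);
             IndexWriter holder = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            try (FSDirectory dir2 = FSDirectory.open(issuesIndex)) {
                new IndexWriter(dir2, new IndexWriterConfig(new StandardAnalyzer()));
            } catch (LockObtainFailedException e) {
                // The same exception type that surfaces in the stack trace above
                // when the index is deleted while the lock is still held.
                System.out.println("Could not acquire write.lock: " + e.getMessage());
            }
        }
    }
}
{code}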
The node ends up in an unusable state (the <$node-ip>/status endpoint returns MAINTENANCE) and its index needs to be manually replaced.
Workaround
Delete the following index directories from the node's caches/indexesV1 directory, then restart Jira to induce a snapshot restore (a sketch of this clean-up follows the list):
- changes
- comments
- entities
- issues
- plugins
- worklogs
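A small sketch of the clean-up in plain Java is below. It assumes Jira is already stopped on the affected node; the /appl/jira/home path is taken from the logs above and is only an example that must be adjusted for the actual Jira home:
{code:java}
import java.io.IOException;
import java.nio.file.*;
import java.util.Comparator;
import java.util.List;

// With Jira stopped on the affected node, remove the index directories under
// <jira-home>/caches/indexesV1 so the node pulls a fresh snapshot from the
// cluster on the next start.
public class ClearLocalIndexes {
    public static void main(String[] args) throws IOException {
        Path indexesV1 = Paths.get("/appl/jira/home/caches/indexesV1");   // example path
        List<String> indexDirs = List.of("changes", "comments", "entities", "issues", "plugins", "worklogs");

        for (String name : indexDirs) {
            Path dir = indexesV1.resolve(name);
            if (!Files.exists(dir)) {
                continue;
            }
            // Delete children before their parent directories.
            try (var paths = Files.walk(dir)) {
                paths.sorted(Comparator.reverseOrder())
                     .forEach(ClearLocalIndexes::deleteQuietly);
            }
        }
    }

    private static void deleteQuietly(Path p) {
        try {
            Files.delete(p);
        } catch (IOException e) {
            System.err.println("Could not delete " + p + ": " + e.getMessage());
        }
    }
}
{code}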
is related to:
- JRASERVER-74514 Jira will experience index replication issue when the write.lock file could not be acquired (Closed)