Type: Bug
Resolution: Fixed
Priority: Low
Affects Version/s: 8.20.10, 9.12.11
Symptom Severity: Severity 3 - Minor
Issue Summary
By default, Jira creates a snapshot of its indexes daily at 2 AM. However, if this coincides with a Lock and Re-index that has just completed on a different node in the cluster, it can break the snapshot task and leave the node's index in a deadlocked state, requiring a hard restart of the node and a restore of a healthy index from a different node.
This is reproducible on Data Center: yes
Steps to Reproduce
- Execute a Lock and Re-index task in a node in the data center cluster
- When the re-index completes and a different node (the one that will run the snapshot backup job) detects that it has finished, have the snapshot backup job kick in at that moment
Starting a full reindex while the snapshot job is running also causes this same issue.
Expected Results
Jira prevents the local index from being replaced with a new index available in the cluster while the snapshot backup job is running.
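In other words, the snapshot backup and the index restore on a node should be mutually exclusive. A minimal sketch of that idea in plain Java (IndexTaskGuard and both Runnables are hypothetical names, not Jira's actual classes):
{code:java}
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical guard: snapshot creation and index restore share one lock, so a
// restore cannot start while a backup of the same index is still being written.
public class IndexTaskGuard {
    private static final ReentrantLock INDEX_TASK_LOCK = new ReentrantLock();

    public static void runSnapshotBackup(Runnable backup) {
        INDEX_TASK_LOCK.lock();          // held for the whole backup
        try {
            backup.run();
        } finally {
            INDEX_TASK_LOCK.unlock();
        }
    }

    public static boolean tryRunRestore(Runnable restore) {
        // Skip (or reschedule) the restore instead of replacing index files
        // that the backup job is still reading.
        if (!INDEX_TASK_LOCK.tryLock()) {
            return false;
        }
        try {
            restore.run();
            return true;
        } finally {
            INDEX_TASK_LOCK.unlock();
        }
    }
}
{code}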
Actual Results
Jira allows both tasks to run simultaneously, leading to a "deadlock" scenario where both tasks fail and the node's index is corrupted.
Timeline
The affected node starts creating a backup of its own index:
2022-11-09 01:00:00,433+0100 Caesium-1-1 INFO anonymous JIRA Index Snapshot Service [c.a.j.index.ha.IndexUtils] Creating index snapshot of type: SNAPPY in local temporary directory: /appl/jira/inst/temp/snapshot1667952000433/IndexSnapshot_2022-Nov-09--0100.tar.sz
2022-11-09 01:00:00,439+0100 Caesium-1-1 INFO anonymous JIRA Index Snapshot Service [c.a.j.util.compression.ArchiveUtils] Creating archive of type: SNAPPY in path: /appl/jira/inst/temp/snapshot1667952000433/IndexSnapshot_2022-Nov-09--0100.tar.sz
However, a few seconds later the DefaultNodeReindexService job identifies that a different node in the cluster has just completed a new re-index and that a new index snapshot is available, so it starts restoring that file:
2022-11-09 01:00:37,542+0100 NodeReindexServiceThread:thread-0 INFO [c.a.j.index.ha.DefaultNodeReindexService] [INDEX-REPLAY] Re-index after last sync detected: operationName=FULL_REINDEX_END, sourceNodeId=jira-0, indexTime=2022-11-09 01:00:22.723412. Resetting index count and restoring index from backupFilename='IndexSnapshot_16800.tar.sz'
2022-11-09 01:00:37,739+0100 NodeReindexServiceThread:thread-0 INFO [c.a.j.index.ha.DefaultIndexCopyService] Index restore started. Total 4745089 issues on instance before loading Snapshot file: IndexSnapshot_16800.tar.sz
Simultaneously, the index snapshot job is still backing up index files:
2022-11-09 01:02:37,518+0100 Caesium-1-1 INFO anonymous JIRA Index Snapshot Service [c.a.j.index.ha.IndexUtils] Progress of creating index snapshot: IndexSnapshot_2022-Nov-09--0100.tar.sz, number of index files processed: 232
...
2022-11-09 01:03:13,671+0100 Caesium-1-1 INFO anonymous JIRA Index Snapshot Service [c.a.j.index.ha.IndexUtils] Progress of creating index snapshot: IndexSnapshot_2022-Nov-09--0100.tar.sz, number of index files processed: 250
...
2022-11-09 01:03:52,192+0100 Caesium-1-1 INFO anonymous JIRA Index Snapshot Service [c.a.j.index.ha.IndexUtils] Progress of creating index snapshot: IndexSnapshot_2022-Nov-09--0100.tar.sz, number of index files processed: 264
...
2022-11-09 01:04:38,108+0100 Caesium-1-1 INFO anonymous JIRA Index Snapshot Service [c.a.j.index.ha.IndexUtils] Progress of creating index snapshot: IndexSnapshot_2022-Nov-09--0100.tar.sz, number of index files processed: 299
...
However, while the snapshot backup job is still running, the Index Fixer starts replacing the index files with the new index it received from the other node. For this, it holds the "stop-the-world" index lock, which prevents any writes to the index until that task releases it:
2022-11-09 01:06:00,054+0100 NodeReindexServiceThread:thread-0 INFO [c.a.j.index.ha.DefaultIndexRecoveryManager] [INDEX-FIXER] Restoring index with 4743369 issues
2022-11-09 01:06:00,119+0100 NodeReindexServiceThread:thread-0 INFO [c.a.j.index.ha.DefaultIndexRecoveryManager] [INDEX-FIXER] Restoring search indexes - 1% complete... [INDEX-FIXER] Replacing indexes
2022-11-09 01:06:00,119+0100 NodeReindexServiceThread:thread-0 INFO [c.a.j.index.ha.DefaultIndexRecoveryManager] [INDEX-FIXER] Starting the index restore process by replacing existing indexes. For this, we're holding the 'stop-the-world' index lock.
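Conceptually, the "stop-the-world" index lock behaves like the write side of a read-write lock: ordinary index updates can run concurrently under the read lock, while the restore takes the write lock and blocks every other index operation until it finishes. A rough illustration in plain Java (StopTheWorldIndexLock is a made-up name, not Jira's implementation):
{code:java}
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Conceptual sketch of a "stop-the-world" index lock: normal index updates take
// the read lock and may run concurrently, while replacing the index directory
// takes the write lock and blocks every other index operation.
public class StopTheWorldIndexLock {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    public void updateIndex(Runnable indexWrite) {
        lock.readLock().lock();
        try {
            indexWrite.run();           // ordinary issue/DBR index updates
        } finally {
            lock.readLock().unlock();
        }
    }

    public void replaceIndex(Runnable restore) {
        lock.writeLock().lock();        // "stop the world": nothing else may touch the index
        try {
            restore.run();              // swap in the restored index files
        } finally {
            lock.writeLock().unlock();
        }
    }
}
{code}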
Because the Index Fixer has removed physical files that the snapshot backup job had already mapped when generating its list of files to back up, the snapshot backup job fails to complete:
2022-11-09 01:06:01,422+0100 Caesium-1-1 ERROR anonymous JIRA Index Snapshot Service [c.a.jira.service.ServiceRunner] An error occurred while trying to run service 'JIRA Index Snapshot Service'. java.io.FileNotFoundException: /appl/jira/home/caches/indexesV1/comments/_24jm2_Lucene50_0.tip (No such file or directory)
com.atlassian.jira.util.RuntimeIOException: java.io.FileNotFoundException: /appl/jira/home/caches/indexesV1/comments/_24jm2_Lucene50_0.tip (No such file or directory)
Caused by: java.io.FileNotFoundException: /appl/jira/home/caches/indexesV1/comments/_24jm2_Lucene50_0.tip (No such file or directory)
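This failure mode is easy to reproduce outside Jira: the backup builds its list of index files first and reads them later, so any file deleted in between is gone by the time it is archived. A self-contained sketch (Java 11+, single-threaded for clarity; the file name is borrowed from the log above and nothing here is Jira code):
{code:java}
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

// Illustration of the race between the snapshot backup and the index restore:
// the backup collects its file list first and reads the files later, so a file
// deleted in between no longer exists when it is archived.
public class SnapshotRace {
    public static void main(String[] args) throws Exception {
        Path indexDir = Files.createTempDirectory("indexesV1");
        Path segment = Files.writeString(indexDir.resolve("_24jm2_Lucene50_0.tip"), "segment data");

        // 1. The backup job builds its file list while the segment file still exists.
        List<Path> toArchive;
        try (var files = Files.list(indexDir)) {
            toArchive = files.collect(Collectors.toList());
        }

        // 2. Meanwhile the restore replaces the index, deleting the old files.
        Files.delete(segment);

        // 3. The backup job now tries to read a file that no longer exists.
        try {
            for (Path p : toArchive) {
                // NoSuchFileException is the NIO equivalent of the
                // java.io.FileNotFoundException shown in the log above.
                Files.readAllBytes(p);
            }
        } catch (NoSuchFileException e) {
            System.out.println("Backup failed: " + e.getFile() + " (No such file or directory)");
        }
    }
}
{code}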
The Index Fixer process dies at that point, along with the index snapshot backup job. Finally, the Index Replay fails as well: it cannot sync the changes the node has been receiving via DBR, because it cannot delete files from the index folder due to the write lock put in place by the index restore process:
2022-11-09 01:06:03,169+0100 NodeReindexServiceThread:thread-0 ERROR [c.a.j.index.ha.DefaultNodeReindexService] [INDEX-REPLAY] Error re-indexing node changes
java.lang.RuntimeException: Failed to update index for ([ReplicatedIndexOperation{indexTime=2022-11-09 01:00:22.723412, id=169485404, nodeId='jira-0', affectedIdToVersion={}, operation=FULL_REINDEX_END, affectedIndex=ALL, entityType=NONE, backupFilename='IndexSnapshot_16800.tar.sz'}, [...]
Caused by: com.atlassian.jira.util.RuntimeIOException: org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /appl/jira/home/caches/indexesV1/issues/write.lock
    at com.atlassian.jira.index.DefaultIndexEngine.clean(DefaultIndexEngine.java:143)
    at com.atlassian.jira.index.DefaultManager.deleteIndexDirectory(DefaultManager.java:40)
    at com.atlassian.jira.issue.index.DefaultIssueIndexer.deleteIndexes(DefaultIssueIndexer.java:425)
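The LockObtainFailedException at the bottom of the stack is standard Lucene behaviour: while one IndexWriter holds write.lock for an index directory, any further attempt to open a writer on that directory from the same JVM fails with "Lock held by this virtual machine". A self-contained example against the plain Lucene API (using a temporary directory instead of Jira's real index path):
{code:java}
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;

// While one IndexWriter holds write.lock for an index directory, a second
// attempt to open a writer on the same directory in the same JVM fails.
public class WriteLockContention {
    public static void main(String[] args) throws Exception {
        Path issuesIndex = Files.createTempDirectory("issues");   // stands in for caches/indexesV1/issues
        try (FSDirectory dir = FSDirectory.open(issuesIndex);
             IndexWriter holder = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            try (FSDirectory dir2 = FSDirectory.open(issuesIndex)) {
                new IndexWriter(dir2, new IndexWriterConfig(new StandardAnalyzer()));
            } catch (LockObtainFailedException e) {
                // The same exception type that surfaces in the stack trace above
                // when the index is deleted while the lock is still held.
                System.out.println("Could not acquire write.lock: " + e.getMessage());
            }
        }
    }
}
{code}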
The node ends up in an unusable state (the <$node-ip>/status endpoint returns MAINTENANCE) and its index needs to be manually replaced.
Workaround
Delete the following index directories from the node's caches/indexesV1 directory, then restart Jira to induce a snapshot restore (a sketch of this clean-up follows the list):
- changes
- comments
- entities
- issues
- plugins
- worklogs
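A small sketch of the clean-up in plain Java is below. It assumes Jira is already stopped on the affected node; the /appl/jira/home path is taken from the logs above and is only an example that must be adjusted for the actual Jira home:
{code:java}
import java.io.IOException;
import java.nio.file.*;
import java.util.Comparator;
import java.util.List;

// With Jira stopped on the affected node, remove the index directories under
// <jira-home>/caches/indexesV1 so the node pulls a fresh snapshot from the
// cluster on the next start.
public class ClearLocalIndexes {
    public static void main(String[] args) throws IOException {
        Path indexesV1 = Paths.get("/appl/jira/home/caches/indexesV1");   // example path
        List<String> indexDirs = List.of("changes", "comments", "entities", "issues", "plugins", "worklogs");

        for (String name : indexDirs) {
            Path dir = indexesV1.resolve(name);
            if (!Files.exists(dir)) {
                continue;
            }
            // Delete children before their parent directories.
            try (var paths = Files.walk(dir)) {
                paths.sorted(Comparator.reverseOrder())
                     .forEach(ClearLocalIndexes::deleteQuietly);
            }
        }
    }

    private static void deleteQuietly(Path p) {
        try {
            Files.delete(p);
        } catch (IOException e) {
            System.err.println("Could not delete " + p + ": " + e.getMessage());
        }
    }
}
{code}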
is related to:
- JRASERVER-74514 Jira will experience index replication issue when the write.lock file could not be acquired (Closed)