JRASERVER-74514

Jira will experience an index replication issue when the write.lock file cannot be acquired


      Issue Summary

      When the Lucene write.lock file for an index goes missing during an index restore, Jira never releases its in-JVM lock registration. Every later attempt to open that index fails with LockObtainFailedException ("Lock held by this virtual machine"), and index replication on that node falls behind until the node is restarted.

      This is reproducible on Data Center: yes

      Steps to Reproduce

      Reproduced in a customer environment; we were not able to replicate it locally.

      An index copy is requested:

      2022-10-25 10:51:55,405-0400 ClusterMessageHandlerServiceThread:thread-1 INFO anonymous     [c.a.j.index.ha.DefaultIndexCopyService] Received message: "Index Backed Up" - notification that node: node2 created an index snapshot which can be restored on current node: node1
      2022-10-25 10:51:55,406-0400 ClusterMessageHandlerServiceThread:thread-1 INFO anonymous     [c.a.j.index.ha.DefaultIndexCopyService] Index restore started. Total 6882913 issues on instance before loading Snapshot file: IndexSnapshot_118402.tar.sz
      

      The index restoration starts on node node1. A write.lock file is created and its absolute path is added to Lucene's JVM-wide LOCK_HELD set (see the sketch after the log excerpt below).

      2022-10-25 10:53:43,583-0400 ClusterMessageHandlerServiceThread:thread-1 INFO anonymous     [c.a.j.index.ha.DefaultIndexRecoveryManager] [INDEX-FIXER] Restoring index with 6887494 issues
      2022-10-25 10:53:43,586-0400 ClusterMessageHandlerServiceThread:thread-1 INFO anonymous     [c.a.j.index.ha.DefaultIndexRecoveryManager] [INDEX-FIXER] Restoring search indexes - 1% complete... [INDEX-FIXER] Replacing indexes
      2022-10-25 10:53:43,586-0400 ClusterMessageHandlerServiceThread:thread-1 INFO anonymous     [c.a.j.index.ha.DefaultIndexRecoveryManager] [INDEX-FIXER] Starting the index restore process by replacing existing indexes. For this, we're holding the 'stop-the-world' index lock.
      
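      To illustrate the mechanism: NativeFSLockFactory, the default lock factory behind FSDirectory, records the absolute path of every write.lock it obtains in a static, JVM-wide set (LOCK_HELD), and the entry is removed only when the corresponding Lock object is closed. The sketch below is plain Lucene, not Jira code; the path /tmp/lock-demo is just an example.

      import java.nio.file.Paths;
      import org.apache.lucene.index.IndexWriter;
      import org.apache.lucene.store.Directory;
      import org.apache.lucene.store.FSDirectory;
      import org.apache.lucene.store.Lock;
      import org.apache.lucene.store.LockObtainFailedException;

      public class WriteLockDemo {
          public static void main(String[] args) throws Exception {
              // FSDirectory uses NativeFSLockFactory by default; held lock paths
              // are tracked in a static (JVM-wide) set.
              try (Directory dir = FSDirectory.open(Paths.get("/tmp/lock-demo"))) {
                  Lock held = dir.obtainLock(IndexWriter.WRITE_LOCK_NAME); // registers the path
                  try {
                      dir.obtainLock(IndexWriter.WRITE_LOCK_NAME);         // same path, same JVM
                  } catch (LockObtainFailedException e) {
                      // "Lock held by this virtual machine: /tmp/lock-demo/write.lock"
                      System.out.println(e.getMessage());
                  }
                  held.close(); // only closing the Lock removes the entry; a JVM restart also clears it
              }
          }
      }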

      If something then happens to the write.lock file (for example, the file goes missing), a NoSuchFileException shows up in the logs (a sketch of this failure mode follows the log excerpt below):

      2022-10-25 10:53:52,109-0400 ClusterMessageHandlerServiceThread:thread-1 INFO anonymous     [c.a.jira.index.WriterWrapper] Closing index writer with non-existing file. Ignoring: /jirahome/caches/indexesV1/entities/searchrequest/write.lock
      java.nio.file.NoSuchFileException: /jirahome/caches/indexesV1/entities/searchrequest/write.lock
      
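      A vanished lock file turns into a NoSuchFileException because NativeFSLockFactory validates the lock against the file on disk: Lock.ensureValid() reads the lock file's attributes and therefore fails when the file has been deleted underneath it. The sketch below is plain Lucene (not the Jira code path) and assumes a POSIX filesystem, where the delete succeeds even while the lock is held; the exact exception can vary by Lucene version.

      import java.nio.file.Files;
      import java.nio.file.Path;
      import java.nio.file.Paths;
      import org.apache.lucene.index.IndexWriter;
      import org.apache.lucene.store.Directory;
      import org.apache.lucene.store.FSDirectory;
      import org.apache.lucene.store.Lock;

      public class MissingLockFileDemo {
          public static void main(String[] args) throws Exception {
              Path indexDir = Paths.get("/tmp/lock-demo");
              try (Directory dir = FSDirectory.open(indexDir)) {
                  Lock lock = dir.obtainLock(IndexWriter.WRITE_LOCK_NAME);

                  // Simulate "something happens to the write.lock file":
                  Files.delete(indexDir.resolve(IndexWriter.WRITE_LOCK_NAME));

                  try {
                      // Validation reads the lock file's attributes, so a deleted
                      // lock file surfaces as java.nio.file.NoSuchFileException here.
                      lock.ensureValid();
                  } catch (Exception e) {
                      System.out.println(e);
                  } finally {
                      lock.close(); // closing still removes the in-JVM registration
                  }
              }
          }
      }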

      The index restore process cannot complete successfully because of this error. Jira never releases the in-JVM lock registration, so every subsequent attempt to obtain the same lock fails.

      Next, a LockObtainFailedException with the message "Lock held by this virtual machine" appears in the logs (a standalone reproduction of this failure follows the stack trace below):

      2022-10-25 10:53:52,144-0400 ClusterMessageHandlerServiceThread:thread-1 ERROR anonymous     [c.a.jira.cluster.OfBizMessageHandlerService] There was a problem handling a cluster message
      com.atlassian.jira.util.RuntimeIOException: org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /jirahome/caches/indexesV1/entities/searchrequest/write.lock
      	at com.atlassian.jira.index.DefaultIndexEngine.clean(DefaultIndexEngine.java:143)
      	at com.atlassian.jira.index.DefaultManager.deleteIndexDirectory(DefaultManager.java:40)
      	at com.atlassian.jira.sharing.index.DefaultSharedEntityIndexer.recreate(DefaultSharedEntityIndexer.java:243)
      	at com.atlassian.jira.sharing.index.DefaultSharedEntityIndexManager.reIndexAll(DefaultSharedEntityIndexManager.java:92)
      	at com.atlassian.jira.sharing.index.DefaultSharedEntityIndexManager.reIndexAll(DefaultSharedEntityIndexManager.java:83)
      	at com.atlassian.jira.index.ha.DefaultIndexRecoveryManager.recoverIndexFromBackup(DefaultIndexRecoveryManager.java:194)
      	at com.atlassian.jira.index.ha.DefaultIndexCopyService$MessageConsumer.restoreIndex(DefaultIndexCopyService.java:165)
      	at com.atlassian.jira.index.ha.DefaultIndexCopyService$MessageConsumer.receive(DefaultIndexCopyService.java:191)
      	at com.atlassian.jira.cluster.OfBizMessageHandlerService.sendLocalFromNode(OfBizMessageHandlerService.java:214)
      	at com.atlassian.jira.cluster.OfBizMessageHandlerService.tryHandleMessage(OfBizMessageHandlerService.java:140)
      	at com.atlassian.jira.cluster.OfBizMessageHandlerService.handleReceivedMessages(OfBizMessageHandlerService.java:124)
      	at com.atlassian.jira.util.thread.JiraThreadLocalUtils.lambda$wrap$1(JiraThreadLocalUtils.java:156)
      	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
      	at java.base/java.util.concurrent.FutureTask.runAndReset(Unknown Source)
      	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
      	at java.base/java.lang.Thread.run(Unknown Source)
      Caused by: org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /jirahome/caches/indexesV1/entities/searchrequest/write.lock
      	at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:139)
      	at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41)
      	at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45)
      	at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:948)
      	at com.atlassian.jira.index.MonitoringIndexWriter.<init>(MonitoringIndexWriter.java:43)
      	at com.atlassian.jira.index.DefaultIndexEngine.clean(DefaultIndexEngine.java:141)
      
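      The bottom frames of the stack trace show why the restore cannot recover: IndexWriter's constructor obtains write.lock itself (IndexWriter.<init> -> BaseDirectory.obtainLock above), so as long as the earlier registration was never cleared, every new writer on that directory is refused. A minimal standalone reproduction (plain Lucene, not Jira's DefaultIndexEngine; assumes Lucene 6.2+ for the no-argument IndexWriterConfig):

      import java.nio.file.Paths;
      import org.apache.lucene.index.IndexWriter;
      import org.apache.lucene.index.IndexWriterConfig;
      import org.apache.lucene.store.Directory;
      import org.apache.lucene.store.FSDirectory;
      import org.apache.lucene.store.Lock;
      import org.apache.lucene.store.LockObtainFailedException;

      public class StaleLockDemo {
          public static void main(String[] args) throws Exception {
              try (Directory dir = FSDirectory.open(Paths.get("/tmp/lock-demo"))) {
                  // A lock that is obtained but never closed -- the state Jira is
                  // left in after the failed restore.
                  Lock leaked = dir.obtainLock(IndexWriter.WRITE_LOCK_NAME);
                  try {
                      new IndexWriter(dir, new IndexWriterConfig()); // tries to obtain write.lock again
                  } catch (LockObtainFailedException e) {
                      System.out.println(e.getMessage()); // "Lock held by this virtual machine: ..."
                  } finally {
                      leaked.close(); // in the bug, this release never happens until restart
                  }
              }
          }
      }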

      Expected Results

      Jira should handle this situation gracefully by releasing the lock, without requiring a node restart.
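
      One hypothetical shape of such handling, assuming the restore code owns the IndexWriter (this is only a sketch of the defensive pattern, not Atlassian's fix, and the actual change may belong elsewhere, for example around the WriterWrapper that logs and ignores the NoSuchFileException): if a normal close() fails part-way, fall back to rollback(), which discards pending changes and releases the write lock, so later writers on the same directory are not blocked until a node restart.

      import java.io.IOException;
      import org.apache.lucene.index.IndexWriter;

      // Hypothetical helper, not Jira code.
      final class IndexWriterCloser {
          static void closeReleasingLock(IndexWriter writer) throws IOException {
              try {
                  writer.close();
              } catch (IOException | RuntimeException closeFailure) {
                  try {
                      // rollback() also closes the writer and releases write.lock.
                      writer.rollback();
                  } catch (IOException | RuntimeException rollbackFailure) {
                      closeFailure.addSuppressed(rollbackFailure);
                  }
                  throw closeFailure;
              }
          }
      }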

      Actual Results

      Index replication falls behind on the affected node, and the Cluster Index Replication health check reports it as unhealthy:

      Name: Cluster Index Replication
      NodeId: null
      Is healthy: false
      Failure reason: Index replication for cluster node 'node1' is behind by 4,300 seconds.
      Severity: CRITICAL
      

      Workaround

      Restart the affected node to clear the stale lock. The lock registration is held only in the JVM's memory, so a restart releases it.
