Uploaded image for project: 'Jira Data Center'
  1. Jira Data Center
  2. JRASERVER-76391

Index recovery unhelpfully triggers a full lock reindex if an issue in the index catchup phase fails to reindex

    XMLWordPrintable

Details

    Description

      Issue Summary

      Index recovery unhelpfully triggers a full lock reindex if an issue in the index catchup phase fails to index.

      For example, the implementation of JSWSERVER-20133 results in the issue containing the immense value to be marked as failed.

      During a full reindex, the reindex will "complete" to 100%, despite individual failures. During index recovery during node startup, the catch up process will also sustain individual failures, however, the catch-up process will be deemed failed and the node will then proceed to performing a full lock reindex.

      This is unhelpful because the node startup will be delayed, potentially for a very long time, and, the full reindex will also fail to index these issues - the underlying problem is not fixed.

      Enterprise Jira instance could take 24 hours or longer to perform full lock reindex. Startup is often a sensitive time for Jira Administrators. While other nodes may be typically online to mitigate the impact, this is not guaranteed, for example, environment failover, environment restore, etc.

      Steps to Reproduce

      1. Configure a 2 node Jira v9.4.8 cluster. Have 1 node started, the other stopped.
      2. Add an issue with some sample data
      3. Perform a scheduled Jira snapshot to create a snapshot in shared_home
      4. Insert a customfield value over the UTF8 encoding limit of 32766 (we inserted a checklist item via Checklist for Jira v6.2.5 to do this)
      5. Delete Node 2's local index, then start the node

      Expected Results

      • Node 2 starts and initiates the index recovery procedure, picking the snapshot in shared_home
      • The index catch up phase executes, but the issue with the immense term fails to index.
      • An warning is logged for the admin, but the startup continues

      Actual Results

      • Node 2 starts and initiates the index recovery procedure, picking the snapshot in shared_home
        [c.a.jira.cluster.DefaultClusterManager] Current node: <NODE_NAME>. Rebuilding local index will be skipped, because replicated index operation is missing necessary data. This is likely to happen after system restore.
        2023-10-16 18:01:37,736-0400 main INFO      [c.a.jira.index.DefaultIndexFetcher] [INDEX_FETCHER]  Jira will try to recover the most recent snapshot: <SNAPSHOT>, chosen based on "last modified" file property from <N> existing snapshots: <SNAPSHOTS>, timestamp: <TIMESTAMP>). Only snapshots younger than 192 hours are considered usable.
        2023-10-16 18:03:26,544-0400 main INFO      [c.a.j.index.ha.DefaultIndexRecoveryManager] [INDEX-FIXER] Restoring search indexes from snapshot and catching up with changes - 1% complete... Starting
        2023-10-16 18:03:26,734-0400 main INFO      [c.a.j.index.ha.DefaultIndexRecoveryManager] [INDEX-FIXER] Restoring search indexes - 30% complete... Restored index backup
        
      • The index catch up phase executes, but the issue with the immense term fails to index.
        2023-10-16 18:03:27,646-0400 main INFO      [c.a.j.index.ha.DefaultIndexRecoveryManager] [INDEX-FIXER] Re-indexing issues updated in the last {1 days, 18 hours, 18 minutes, and 27 seconds}. (Note: it's an intentionally wider range)
        2023-10-16 18:04:13,666-0400 IssueIndexer:thread-71 INFO      [c.a.j.index.ha.DefaultIndexRecoveryManager] [INDEX-FIXER] Updating index with recent changes - 31% complete... Processing issue changes is 1% complete
        ...
        2023-10-16 18:06:40,136-0400 IssueIndexer:thread-48 ERROR      [c.a.j.issue.index.DocumentScrubber] A document contained a potential immense term in field customfield_12000. The field has been removed from the document.
        2023-10-16 18:06:40,136-0400 IssueIndexer:thread-48 ERROR      [c.a.j.issue.index.DocumentScrubber] A document contained a potential immense term in field customfield_12000_item. The field has been removed from the document.
        ...
        2023-10-16 18:06:41,265-0400 IssueIndexer:thread-48 WARN      [c.a.jira.index.AccumulatingResultBuilder] Indexing failed for Issue - '10002'
        2023-10-16 18:06:41,265-0400 IssueIndexer:thread-48 WARN      [c.a.jira.index.AccumulatingResultBuilder] Document contains at least one immense term in field="customfield_10002_item" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[73, 91, 116, 99, 114, 100, 95, 99, 104, 101, 99, 107, 109, 97, 114, 120, 95, 101, 120, 116, 101, 114, 110, 97, 108, 95, 102, 105, 110, 100]...', original message: bytes can be at most 32766 in length; got 68089
        java.lang.IllegalArgumentException: Document contains at least one immense term in field="customfield_10002_item" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[73, 91, 116, 99, 114, 100, 95, 99, 104, 101, 99, 107, 109, 97, 114, 120, 95, 101, 120, 116, 101, 114, 110, 97, 108, 95, 102, 105, 110, 100]...', original message: bytes can be at most 32766 in length; got 68089
        	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:796)
        2023-10-16 18:06:42,760-0400 main WARN      [c.a.jira.index.AccumulatingResultBuilder] Indexing completed with 1 errors
        com.atlassian.jira.index.IndexingFailureException: Indexing completed with 1 errors
        2023-10-16 18:07:11,479-0400 IssueIndexer:thread-57 INFO      [c.a.j.index.ha.DefaultIndexRecoveryManager] [INDEX-FIXER] Updating index with recent changes - 79% complete... Processing issue changes is 97% complete
        2023-10-16 18:07:16,720-0400 IssueIndexer:thread-14 INFO      [c.a.j.index.ha.DefaultIndexRecoveryManager] [INDEX-FIXER] Updating index with recent changes - 80% complete... Processing issue changes is 99% complete
        
      • The failure is recognised by the snapshot restore process, and a full-lock reindex is commenced:
        2023-10-16 18:07:47,800-0400 main ERROR      [c.a.jira.cluster.DefaultClusterManager] Current node: <NODE_NAME>. Couldn't recover index even though it had been found in shared. Current list of other nodes: [<NODE_LIST>]
        com.atlassian.jira.index.IndexingFailureException: Indexing completed with 1 errors
        	at com.atlassian.jira.index.AccumulatingResultBuilder$CompositeResult.await(AccumulatingResultBuilder.java:151)
        	at com.atlassian.jira.issue.index.DefaultIssueIndexer.doIssuesOperationBatchMode(DefaultIssueIndexer.java:263)
        	at com.atlassian.jira.issue.index.DefaultIssueIndexer.reindexIssuesBatchMode(DefaultIssueIndexer.java:250)
        	at com.atlassian.jira.issue.index.DefaultIndexManager.lambda$reindexIssuesBatchMode$15(DefaultIndexManager.java:1024)
        	at com.atlassian.jira.util.thread.JiraThreadLocalUtils.lambda$wrap$1(JiraThreadLocalUtils.java:156)
        	at com.atlassian.jira.issue.index.DefaultIndexManager.withReindexLock(DefaultIndexManager.java:423)
        	at com.atlassian.jira.issue.index.DefaultIndexManager.reindexIssuesBatchMode(DefaultIndexManager.java:1024)
        	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        	at com.atlassian.jira.config.component.SwitchingInvocationHandler.invoke(SwitchingInvocationHandler.java:38)
        	at com.sun.proxy.$Proxy86.reindexIssuesBatchMode(Unknown Source)
        	at com.atlassian.jira.index.ha.DefaultIndexRecoveryManager.reindexIssueByIds(DefaultIndexRecoveryManager.java:326)
        	at com.atlassian.jira.index.ha.DefaultIndexRecoveryManager.reindexOutdatedEntities(DefaultIndexRecoveryManager.java:306)
        	at com.atlassian.jira.index.ha.DefaultIndexRecoveryManager.reindexWithVersionCheckEntitiesUpdatedInTheLast(DefaultIndexRecoveryManager.java:245)
        	at com.atlassian.jira.index.ha.DefaultIndexRecoveryManager.access$300(DefaultIndexRecoveryManager.java:74)
        	at com.atlassian.jira.index.ha.DefaultIndexRecoveryManager$ReplaceIndexRunner.catchUp(DefaultIndexRecoveryManager.java:529)
        	at com.atlassian.jira.index.ha.DefaultIndexRecoveryManager$ReplaceIndexRunner.run(DefaultIndexRecoveryManager.java:487)
        <SNIP>
        	at com.sun.proxy.$Proxy68.withReindexLock(Unknown Source)
        	at com.atlassian.jira.index.ha.DefaultIndexRecoveryManager.recoverIndexFromBackup(DefaultIndexRecoveryManager.java:171)
        	at com.atlassian.jira.index.ha.DefaultIndexRecoveryManager.safeRecoverIndexFromBackup(DefaultIndexRecoveryManager.java:138)
        	at com.atlassian.jira.index.DefaultIndexFetcher.recoverIndexFromMostRecentSnapshot(DefaultIndexFetcher.java:108)
        	at com.atlassian.jira.cluster.DefaultClusterManager.pickIndexSnapshotFromSharedHome(DefaultClusterManager.java:428)
        	at com.atlassian.jira.cluster.DefaultClusterManager.checkIndexOnStart(DefaultClusterManager.java:221)
        	at com.atlassian.jira.startup.ClusteringLauncher.clusterSynchronizedCheckIndex(ClusteringLauncher.java:99)
        	at com.atlassian.jira.startup.ClusteringLauncher.start(ClusteringLauncher.java:130)
        	at com.atlassian.jira.startup.DefaultJiraLauncher.postDBActivated(DefaultJiraLauncher.java:166)
        	at com.atlassian.jira.startup.DefaultJiraLauncher.lambda$postDbLaunch$2(DefaultJiraLauncher.java:144)
        	at com.atlassian.jira.config.database.DatabaseConfigurationManagerImpl.doNowOrEnqueue(DatabaseConfigurationManagerImpl.java:307)
        	at com.atlassian.jira.config.database.DatabaseConfigurationManagerImpl.doNowOrWhenDatabaseActivated(DatabaseConfigurationManagerImpl.java:202)
        	at com.atlassian.jira.startup.DefaultJiraLauncher.postDbLaunch(DefaultJiraLauncher.java:135)
        <SNIP>
        
        2023-10-16 18:07:47,801-0400 main INFO      [c.a.jira.cluster.DefaultClusterManager] Current node: <NODE_NAME> Will trigger full foreground reindex. System property com.atlassian.jira.startup.allow.full.reindex is not defined, and it will default to true. You can disable triggering full foreground reindex by setting this property to "false".
        2023-10-16 18:07:47,801-0400 main INFO      [c.a.jira.cluster.DefaultClusterManager] Current node: <NODE_NAME>. Triggering full-reindex. Current list of other nodes: [<NODE_LIST>]
        2023-10-16 18:07:47,857-0400 main INFO      [c.a.jira.cluster.DefaultClusterManager] Current node: <NODE_NAME>. Triggered full-reindex, requestId: 10500. Waiting for reindex to finish...
        2023-10-16 18:07:49,992-0400 JiraTaskExecutionThread-1 INFO anonymous     [c.a.j.index.request.DefaultReindexRequestManager] Re-indexing started
        2023-10-16 18:07:49,993-0400 JiraTaskExecutionThread-1 INFO anonymous     [c.a.j.util.index.CompositeIndexLifecycleManager] Reindex All starting...
        
        

        Workaround

        Currently there is no known workaround for this behavior. A workaround will be added here when available

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              allewellyn@atlassian.com Alex [Atlassian,PSE]
              Votes:
              5 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

                Created:
                Updated: