Jira Data Center / JRASERVER-72125

Index replication service is paused indefinitely after failing to obtain an index snapshot from another node

    • Affects versions: 7.13
    • Symptom severity: Severity 2 - Major

      Hi Team,

      we're happy to announce that this issue is fixed in the 9.1.0 release, available at https://www.atlassian.com/software/jira/update

      Before Jira 9.1 it was the admin's responsibility to ensure that only one node starts at a time and that all nodes in the cluster are healthy, so that each of them can provide a healthy index snapshot.

      Starting from Jira 9.1, synchronous node start-up is enforced by the application. The start-up procedure ensures the local index is healthy before Jira continues starting up. Jira will not start without a healthy index.

      The start-up procedure is performed under a cluster lock, guaranteeing that only one node at a time executes it.

      The index start-up procedure:

      1. Re-index missing data if the local issue index is less than 10% behind the database.
      2. Load a recent index snapshot from the shared-home directory if one is available.
      3. Otherwise, trigger a full re-index.

      Please refer to this Knowledge Base article for more information.

      With that, we continue our Scale & Performance roadmap work around index management; stay tuned for more coming later this year.

      Cheers

      Andrzej Kotas
      Product Manager - Jira DC


      Issue Summary

      Jira pauses the cluster index replication service when requesting an index snapshot from another node. If the sending node fails to provide a snapshot for any reason, the cluster index replication service remains paused indefinitely.

      Steps to Reproduce

      1. Deploy a two-node Jira Data Center cluster. Ensure one of the nodes does not have a valid index and the other will not be able to provide a valid snapshot (for example, as described in JRASERVER-62669).
      2. Start one of the nodes while the other is already up.

      Expected Results

      • The starting node will obtain a valid index snapshot from another node.
      • If that does not happen within a certain period of time, the starting node will either request another index snapshot or at least unpause the index replication service.

      Actual Results

      • Starting node requests an index snapshot from any other node in the cluster:
        Starting node requests an index snapshot
        2021-02-14 04:20:41,530+0000 localhost-startStop-1 INFO      [c.a.jira.startup.ClusteringLauncher] Checking local index on node start
        2021-02-14 04:20:41,534+0000 localhost-startStop-1 INFO      [c.a.jira.cluster.DefaultClusterManager] Current node: 10.0.80.208 index can't be rebuilt. Requesting an index from any other node. Current list of other nodes: [10.0.12.156, 10.0.66.220, 10.0.174.242]
        (...)
        2021-02-14 04:20:41,540+0000 localhost-startStop-1 INFO      [c.a.jira.cluster.DefaultClusterManager] Sending message: "Backup Index" - request to create index snapshot from node: ANY on current node: 10.0.80.208
        
      • The starting node then pauses index replication while waiting for an index snapshot to be provided by another node:
        Starting node pauses index replication
        2021-02-14 04:20:41,534+0000 localhost-startStop-1 INFO      [c.a.j.index.ha.DefaultNodeReindexService] [INDEX-REPLAY] Pausing node re-index service
        java.lang.Exception
            at com.atlassian.jira.index.ha.DefaultNodeReindexService.pause(DefaultNodeReindexService.java:213)
            at com.atlassian.jira.cluster.DefaultClusterManager.requestCurrentIndexFromNode(DefaultClusterManager.java:138)
            at com.atlassian.jira.cluster.DefaultClusterManager.checkIndex(DefaultClusterManager.java:131)
            at com.atlassian.jira.startup.ClusteringLauncher.start(ClusteringLauncher.java:37)
            at com.atlassian.jira.startup.DefaultJiraLauncher.postDBActivated(DefaultJiraLauncher.java:168)
            at com.atlassian.jira.startup.DefaultJiraLauncher.lambda$postDbLaunch$2(DefaultJiraLauncher.java:146)
            at com.atlassian.jira.config.database.DatabaseConfigurationManagerImpl.doNowOrEnqueue(DatabaseConfigurationManagerImpl.java:301)
            at com.atlassian.jira.config.database.DatabaseConfigurationManagerImpl.doNowOrWhenDatabaseActivated(DatabaseConfigurationManagerImpl.java:196)
            at com.atlassian.jira.startup.DefaultJiraLauncher.postDbLaunch(DefaultJiraLauncher.java:137)
            at com.atlassian.jira.startup.DefaultJiraLauncher.lambda$start$0(DefaultJiraLauncher.java:104)
            at com.atlassian.jira.util.devspeed.JiraDevSpeedTimer.run(JiraDevSpeedTimer.java:31)
            at com.atlassian.jira.startup.DefaultJiraLauncher.start(DefaultJiraLauncher.java:102)
            at com.atlassian.jira.startup.LauncherContextListener.initSlowStuff(LauncherContextListener.java:154)
            at com.atlassian.jira.startup.LauncherContextListener.initSlowStuffInBackground(LauncherContextListener.java:139)
            at com.atlassian.jira.startup.LauncherContextListener.contextInitialized(LauncherContextListener.java:101)
            ... 5 filtered
            at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
            at java.lang.Thread.run(Thread.java:748)
        
      • The sending node fails to provide an index snapshot for any reason (for example, due to JRASERVER-62669):
        Sending node fails to provide an index snapshot
        2021-02-14 04:20:42,118+0000 ClusterMessageHandlerServiceThread:thread-1 INFO      [c.a.j.index.ha.DefaultIndexCopyService] Received message: "Backup Index" - request to create index snapshot from node: 10.0.80.208 on current node: 10.0.12.156
        2021-02-14 04:20:42,118+0000 ClusterMessageHandlerServiceThread:thread-1 INFO      [c.a.j.index.ha.DefaultIndexCopyService] Index backup started. Requesting node: 10.0.80.208, currentNode: 10.0.12.156
        2021-02-14 04:20:42,120+0000 ClusterMessageHandlerServiceThread:thread-1 WARN      [c.a.j.index.ha.DefaultIndexCopyService] Index backup failed - latest index operation not found. Requesting node: 10.0.80.208, currentNode: 10.0.12.156
        

        This can happen even when the requested node is able to provide a good copy.

      • The starting node keeps waiting for an index snapshot indefinitely, while also keeping the replication service paused. Note the timestamps in this snippet, more than a day later:
        2021-02-15 10:04:19,691+0000 NodeReindexServiceThread:thread-0 INFO      [c.a.j.index.ha.DefaultNodeReindexService] [INDEX-REPLAY] Node re-index service is not running: currentNode.isClustered=true, notRunningCounter=21396, paused=true, lastPausedStacktrace=java.lang.Throwable
                at com.atlassian.jira.index.ha.DefaultNodeReindexService.pause(DefaultNodeReindexService.java:215)
                at com.atlassian.jira.util.index.CompositeIndexLifecycleManager.reIndexAll(CompositeIndexLifecycleManager.java:62)
                at com.atlassian.jira.util.index.CompositeIndexLifecycleManager.reIndexAll(CompositeIndexLifecycleManager.java:51)
                at com.atlassian.jira.web.action.admin.index.ReIndexAsyncIndexerCommand.doReindex(ReIndexAsyncIndexerCommand.java:27)
                at com.atlassian.jira.web.action.admin.index.AbstractAsyncIndexerCommand.call(AbstractAsyncIndexerCommand.java:63)
                at com.atlassian.jira.web.action.admin.index.ReIndexAsyncIndexerCommand.call(ReIndexAsyncIndexerCommand.java:18)
                at com.atlassian.jira.web.action.admin.index.AbstractAsyncIndexerCommand.call(AbstractAsyncIndexerCommand.java:26)
                at com.atlassian.jira.task.TaskManagerImpl$TaskCallableDecorator.call(TaskManagerImpl.java:533)
                at com.atlassian.jira.task.TaskManagerImpl$TaskCallableDecorator.call(TaskManagerImpl.java:491)
                at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                at com.atlassian.jira.task.ForkedThreadExecutor$ForkedRunnableDecorator.run(ForkedThreadExecutor.java:216)
                at java.lang.Thread.run(Thread.java:748)
        

      Please note that the "Node re-index service is not running" messages don't necessarily indicate that the instance is affected by this bug, especially when a newly added cluster node appears to hang on startup. To confirm that a node is affected, verify whether the other symptoms listed above are present (a grep sketch at the end of this section illustrates this check). If they are not, and you are starting a new node in a Data Center cluster, it is probably just a matter of waiting: the node may simply be recovering the index from an existing snapshot and catching up on the updates made between the snapshot time and the time the new node started up. Depending on the size of the Jira instance, it is not unusual for a cluster node to take an hour to start up.

      • Cluster index replication will fall behind on the starting node, eventually leading to a failed Cluster Index Replication health check and symptoms such as:
        • Missing issues in agile boards.
        • Searches bringing incomplete or inconsistent results.
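
      To check whether a node shows the symptom sequence above, here is a minimal grep sketch over the application log; the log messages are taken from the snippets in this report, and the log path is the default local-home location (an assumption, adjust to your installation):

        LOG=$JIRA_HOME/log/atlassian-jira.log    # assumption: default local-home log location
        # 1. The starting node requested a snapshot and paused replication
        grep -n 'request to create index snapshot\|Pausing node re-index service' "$LOG"
        # 2. A sending node failed to provide the snapshot (check the other nodes' logs as well)
        grep -n 'Index backup failed' "$LOG"
        # 3. Replication is still reported as paused long after startup
        grep -n 'Node re-index service is not running.*paused=true' "$LOG" | tail -n 5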

      Workaround

      • Request an index from another node via the admin panel (Copy the Search Index from another node).
      • Restore the index from an index backup.
      • Restart the node (on 8.19+ it may help to first confirm a recent snapshot exists in the shared home; see the sketch below).
      • Make sure the LB does not redirect users to a node with no index: JRASERVER-66970
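
      A minimal shell sketch of that check, assuming the shared-home path that appears later in the comments on this issue (adjust paths to your environment):

        SHARED_HOME=/apps/atlassian/jira-shared              # assumption: your shared-home path
        # List the index snapshots the 8.19+ startup restore can use
        ls -lh "$SHARED_HOME/export/indexsnapshots/"
        # Flag snapshots created within the last 24 hours (the default snapshot cycle, see Notes)
        find "$SHARED_HOME/export/indexsnapshots/" -name 'IndexSnapshot_*' -mmin -1440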

      Notes

      • As of 8.19.0, Jira fetches an index snapshot from the shared home on startup, which prevents this issue from happening. For this feature to work, an index snapshot must be available in the `export/indexsnapshots` directory of the shared home. A service creating index snapshots is enabled by default on a 24-hour cycle in that version. See details in JRASERVER-66649.
      • In Jira 9.0 we've ensured that a Jira instance will create an index snapshot and save it to the shared home directory only when the index on that instance is consistent. More details on how to handle situations where an index is not consistent can be found here: Indexing inconsistency troubleshooting

      Comments

            Azfar Masut added a comment -

            Removing the index folders from the Jira node ($JIRA_HOME/caches/indexesV1) helps to remediate the problem as a workaround; we also added -Dcom.atlassian.jira.status.index.check=false to the startup parameters.
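
            For illustration, a minimal shell sketch of the workaround described in the comment above; the directory and the JVM flag come from that comment, the rest (a Linux install with Tomcat's setenv.sh) is an assumption:

              # Stop Jira on the affected node first.
              mv "$JIRA_HOME/caches/indexesV1" "$JIRA_HOME/caches/indexesV1.bak"   # force the index to be restored/rebuilt on next start
              # Optionally skip the startup index check (flag from the comment above), e.g. in <jira-install>/bin/setenv.sh:
              #   CATALINA_OPTS="$CATALINA_OPTS -Dcom.atlassian.jira.status.index.check=false"
              # Start Jira again.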

            Maciej Swinarski (Inactive) added a comment -

            Hi a0ef4d4784b7, this is expected.

            When a node performs a full reindex it is:

            • not accepting user traffic (out of the LB)
            • not updating the index (outside of the full reindex)

            During this time the node does not replicate any internal index changes to other nodes and does not consume any index changes from other nodes. During this time you may see entries like this in the log:

            2023-06-05 14:33:19,745+0000 NodeReindexServiceThread:thread-0 INFO      [c.a.j.index.ha.DefaultNodeReindexService] [INDEX-REPLAY] Node re-index service is not running: currentNode.isClustered=true, notRunningCounter=768, paused=true, lastPausedStacktrace=com.atlassian.jira.index.ha.DefaultNodeReindexService$StackCollector

            After the node has finished the full re-index, created a snapshot, and sent the snapshot to the shared home, index replication should start working and there should be no such log entries. If the entries persist after that, please create a support request with the support.zip(s) so this can be investigated.

            Best regards,

            mac

            Saurabh Gupta added a comment - - edited

            Found this issue in Jira 9.4.6 while re-indexing the testing instance created via the data center performance script.

            2023-06-05 14:33:19,745+0000 NodeReindexServiceThread:thread-0 INFO      [c.a.j.index.ha.DefaultNodeReindexService] [INDEX-REPLAY] Node re-index service is not running: currentNode.isClustered=true, notRunningCounter=768, paused=true, lastPausedStacktrace=com.atlassian.jira.index.ha.DefaultNodeReindexService$StackCollector 
            

            Although the re-indexing completed successfully. 


            Cristina Sanz García added a comment -

            We need this backported to 8.20.x LTS, too.

            Maciej Swinarski (Inactive) added a comment -

            Hi a09d61b5de46,

            In order to fix this problem we have made a couple of changes which cannot be backported to 8.20.x: an upgrade task, a required full re-index, and an index snapshot in a new directory.

            More details can be found in:

            • https://confluence.atlassian.com/jirakb/index-management-on-jira-start-up-1141500654.html
            • https://confluence.atlassian.com/jirakb/failed-getting-index-on-start-1141970837.html

            Regards,
            Mac

            Jonathan Franconi added a comment -

            We need this backported to 8.20.x LTS. Please consider.


            Maciej Swinarski (Inactive) added a comment -

            Hi 390913d0deff,

            Thank you for the detailed description of the problem.

            > I can see index snapshots exists on shared drive and it is very recent.

            It would be interesting to see the logs from the starting node related to index management.

            If you run this grep on the log covering the start of the node which failed to get the index, you should be able to understand what went wrong:

            • this was a new node, so it could not rebuild the index
            • it should have tried getting the index from the shared home - why did this fail?
            • if that failed, it should have tried getting the index by asking the existing nodes - why did this fail as well?

            grep 'IndexUtils\|ArchiveUtils\|DefaultIssueIndexer\|DefaultClusterManager\|DefaultIndexCopyService\|DefaultNodeReindexService\|SnapshotDeletionPolicyContributionStrategy\|DefaultIndexManager' atlassian-jira.log

            Vasant Chourasia added a comment - - edited

            Hi Andrzej,

            I am running Jira version 8.20.8 with two nodes in an AWS environment. The cluster status was good for both nodes. Due to a health check issue, one of the nodes got terminated and auto scaling triggered a new node. But the new node is having an issue -

            "2022-06-02 23:34:56,096+0000 NodeReindexServiceThread:thread-0 INFO      [c.a.j.index.ha.DefaultNodeReindexService] [INDEX-REPLAY] Node re-index service is not running: currentNode.isClustered=true, notRunningCounter=1680, paused=true, lastPausedStacktrace=n/a"

            I can see index snapshots exists on shared drive and it is very recent.

            [root@ip-10-133-80-149 ec2-user]# ls -l /apps/atlassian/jira-shared/export/indexsnapshots/

            total 24

            -rw-r----- 1 jira jira 4602 Jun  1 02:00 IndexSnapshot_2022-Jun-01--0200.tar.sz

            -rw-r----- 1 jira jira 6575 Jun  2 02:00 IndexSnapshot_2022-Jun-02--0200.tar.sz

            -rw-r----- 1 jira jira 4582 May 31 02:00 IndexSnapshot_2022-May-31--0200.tar.sz

            As per the troubleshooting section, I ran the grep on the failed node and here is the output:

            ------

            [root@ip-10-133-80-140 bin]# cd /apps/atlassian/jira-home/log

            [root@ip-10-133-80-140 log]# grep 'IndexUtils\|ArchiveUtils\|DefaultIssueIndexer\|DefaultClusterManager\|DefaultIndexCopyService\|DefaultNodeReindexService\|SnapshotDeletionPolicyContributionStrategy\|DefaultIndexManager' atlassian-jira.log

            2022-06-02 21:14:42,906+0000 localhost-startStop-1 INFO      [c.a.j.issue.index.DefaultIndexManager] Legacy mode for reIndex(issue): com.atlassian.jira.issue.reindex.legacy.mode=false

            2022-06-02 21:14:42,906+0000 localhost-startStop-1 INFO      [c.a.j.issue.index.DefaultIndexManager] Legacy mode for reIndex(issue): com.atlassian.jira.issue.reindex.legacy.mode=false

            2022-06-02 21:14:45,948+0000 localhost-startStop-1 INFO      [c.a.j.index.ha.DefaultNodeReindexService] [INDEX-REPLAY] Created node re-index service, paused=true, running period=5sec, delay=10sec

            2022-06-02 21:14:55,948+0000 NodeReindexServiceThread:thread-0 INFO      [c.a.j.index.ha.DefaultNodeReindexService] [INDEX-REPLAY] Node re-index service is not running: currentNode.isClustered=true, notRunningCounter=0, paused=true, lastPausedStacktrace=n/a


            This is running in a test environment, but we are soon migrating to Prod, and this could be an issue when any of the nodes goes down and a new one comes up.

            Please suggest how to resolve this issue without reindexing, as it takes a good chunk of time to reindex.

            Thanks,

            Vasant

             

            Post Edit: I restarted the Jira service on the troubled node and it was able to sync up the index. Cluster health checks pass for both nodes.


            Maciej Swinarski (Inactive) added a comment - - edited

            b0e65afe5d42,

            > In the past it was simple - all you needed to do was update the version parameter in the cloudformation stack, then terminate the active node/s.
            > The autoscaling rule will spawn in replacement nodes on the newer version and Jira would then run through the upgrade process and nodes would become healthy again in the load balancer and you're done.

            Before Jira 8.19 the only way for a new node to get an index was getting it from another node (or running a full reindex). So you could run an in-place upgrade (on the existing nodes) and this would work as you described.
            If you terminated all active nodes and created new replacement nodes (fresh nodes with no state), they would not have the previous index, so you would have to manually restore the index from an index snapshot or run a full reindex on one of the nodes.

            This changed in 8.19 and made your scenario possible. After terminating all existing nodes, you can create new/clean nodes which, on start, will get the index snapshot from the shared home (if it is there and not older than 24 hours).

            Note we are talking only about minor version upgrades. For major version upgrades we require the first node to run a full reindex, or the index snapshot to be prepared manually and placed in the shared home (we have provided guidelines on how to achieve this for 7.x->8.x and 8.x->9.x).

            > This is going to be a problem if the index snapshot is directly tied to the Jira version and your replacement nodes suddenly jump to a newer version.

            The Jira index can only change its format (which requires a full reindex) in a major version. We assume `export/indexsnapshots` contains a Jira 8.x snapshot. The snapshot location changed in 9.x and now follows the same directory naming as the local home:
            Jira 9.x
            index: <local_home_directory>/caches/indexesV2
            index snapshot: <shared_home_directory>/caches/indexesV2/snapshots

            To investigate your case we would need to see the logs from the node startup to find out why the index from the shared home was not picked up.

            Please check the KB mentioned in this issue: https://confluence.atlassian.com/jirakb/indexing-inconsistency-troubleshooting-1114800953.html - see the Troubleshooting section. You will find the grep that extracts all the log lines relevant to index management.

            Regards,
            Mac
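
            For illustration, a minimal sketch for checking both snapshot locations mentioned in the comment above; the directory names come from this thread, the shared-home path is an assumption to adjust:

              SHARED_HOME=/apps/atlassian/jira-shared          # assumption: your shared-home path
              # Jira 8.x location (written by the snapshot service, see the Notes section)
              ls -lh "$SHARED_HOME/export/indexsnapshots/" 2>/dev/null
              # Jira 9.x location (mirrors the local-home layout, as described above)
              ls -lh "$SHARED_HOME/caches/indexesV2/snapshots/" 2>/dev/null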

            Mark Benson added a comment - - edited

            Hi Andrzej,

            I've just run into this problem while upgrading an autoscaling cluster (this was just a test run before we do the same in Prod to mitigate the recent CVE that was disclosed).

            In the past it was simple - all you needed to do was update the version parameter in the cloudformation stack, then terminate the active node/s.
            The autoscaling rule will spawn in replacement nodes on the newer version and Jira would then run through the upgrade process and nodes would become healthy again in the load balancer and you're done.

            I have experienced this on a single-node test; I have yet to test it in a multi-node environment.

            If I'm understanding this correctly: for this feature to work, an index snapshot must be available in the `export/indexsnapshots` directory of the shared home, and a service creating index snapshots is enabled by default on a 24-hour cycle in that version.
            This is going to be a problem if the index snapshot is directly tied to the Jira version and your replacement nodes suddenly jump to a newer version.

             

            Can you confirm whether this is the case or not? Either way, something went wrong with my test and the node was stuck in perma-maintenance mode waiting to pull the index, so I needed to use the CATALINA_OPTS workaround.
            You can see the snapshot index results from my test cluster below, which don't reveal much about the Jira version number:

            [root@jiradev indexsnapshots]# ls -lh
            total 456K
            -rw-r----- 1 jira jira 3.0K Apr 21 02:00 IndexSnapshot_2022-Apr-21--0200.tar.sz
            -rw-r----- 1 jira jira 182K Apr 21 23:58 IndexSnapshot_2022-Apr-21--2358.tar.sz
            -rw-r----- 1 jira jira 266K Apr 22 09:00 IndexSnapshot_2022-Apr-22--0900.tar.sz

             

            Cheers,
            Mark.


            Andrzej Kotas added a comment -

            Hi all,
            Updating the status accordingly as this work is under way.

            Improvements already implemented:

            • As of 8.19.0 we introduced fetching the index snapshot from the shared home on startup, which will prevent this issue from happening.
              • For this feature to work, an index snapshot must be available in the `export/indexsnapshots` directory of the shared home. A service creating index snapshots is enabled by default on a 24-hour cycle in that version. See details in JRASERVER-66649.

            Improvements coming soon:

            • In Jira 9.0 we've ensured that a Jira instance will create an index snapshot and save it to the shared home directory only when the index on that instance is consistent. More details on how to handle situations where an index is not consistent can be found here: Indexing inconsistency troubleshooting

            Full ticket resolution:

            • We are targeting to resolve the indexing distribution problem on node start-up by the next 9.x LTS version, coming later this year.

            Andrzej Kotas
            Product Manager - Jira DC

            Hossam.Khalil added a comment -

            Hi Everyone,

            The workaround recommended by @Maciej worked. But please hurry and come up with a solution, as this is a recurring issue and is quite critical.

            BR

            Hossam

            Maciej Swinarski (Inactive) added a comment -

            bfb3093d2c47 - new nodes should only be allowed to join a cluster when all running nodes have an up-to-date index.
            In your case:

            • shut down all nodes except one
            • trigger a full foreground reindex on this node
            • start the new nodes (one by one) - only start new nodes when all existing nodes have a proper index, as they may be the source of the index for any new node

            The node re-index service should be running by default since 8.19.0.
            Some more details about the changes in 8.19 and 9.0 related to this problem can be found here: https://confluence.atlassian.com/jirakb/indexing-inconsistency-troubleshooting-1114800953.html

            Peter Cselotei - Lupus Consulting Zrt. added a comment -

            Dear Support,

            The problem with the workarounds:

            • Request index from another node via admin panel / Copy the Search Index from another node
              • The node changes within minutes, but the sync is still broken, so I have to copy it again...
            • Restore index from index backup
              • After the restore the caches become inconsistent with each other, so I have to start over again.
            • Restart node
              • Completely not working, as it does not recover or copy the indexes.
            • LB should not redirect users to node with no index: JRASERVER-66970
              • All of the nodes have indexing problems. We are currently stuck with differently indexed nodes!

            Please fix it urgently. Thanks.

            Deniz Oğuz - The Starware added a comment -

            Hi,

            All 4 nodes in my cluster give the same error (Node re-index service is not running). I also don't have an index snapshot. How can I force one of the nodes to start indexing?

            Omid H. added a comment -

            +1


            Alyson Whitaker added a comment -

            I am getting a similar issue:

            INFO anonymous     [c.a.j.cluster.service.NonAliveNodesScannerService] [CLUSTER-STATE] Service is starting to check the cluster state with retention period PT48H
            INFO anonymous     [c.a.j.cluster.service.NonAliveNodesScannerService] [CLUSTER-STATE] Service did not find any stale ACTIVE without heartbeat nodes. Current cluster state: {numberOfNodes=2, numberOfActiveNodes=2, numberOfActiveNotAliveNodes=0, numberOfOfflineNodes=0}
            INFO anonymous     [c.a.j.cluster.service.OfflineNodesScannerService] [CLUSTER-STATE] Service is starting to check the cluster state with retention period PT48H
            INFO anonymous     [c.a.j.cluster.service.OfflineNodesScannerService] [CLUSTER-STATE] Service did not find any stale OFFLINE nodes. Current cluster state: {numberOfNodes=2, numberOfActiveNodes=2, numberOfActiveNotAliveNodes=0, numberOfOfflineNodes=0}
            INFO      [c.a.j.index.ha.DefaultNodeReindexService] [INDEX-REPLAY] Node re-index service is not running: currentNode.isClustered=true, notRunningCounter=2328, paused=true, lastPausedStacktrace=java.lang.Throwable

            I have verified I only have 2 valid and active nodes in the cluster. No stale nodes. I don't get what's going on or how to fix it.

            John Hayes added a comment -

            I encountered this issue when I had scaled down the cluster to 1 node for an upgrade to 8.20.1 - after the upgrade, startup was hanging and the node would not come up.

            A previous node was still marked as "ACTIVE" in the cluster, though it had already been terminated at that point.

            When Jira started up, it was waiting for this node and would never progress with the index update. Eventually I stopped the node, cleared the incorrect ACTIVE node out of the clusternode table, and everything started as normal.

            2021-11-07 13:50:27,069+0000 NodeReindexServiceThread:thread-0 INFO      [c.a.j.index.ha.DefaultNodeReindexService] [INDEX-REPLAY] Node re-index service is not running: currentNode.isClustered=true, notRunningCounter=36, paused=true, lastPausedStacktrace=java.lang.Throwable
                at com.atlassian.jira.index.ha.DefaultNodeReindexService.pause(DefaultNodeReindexService.java:218)
                at com.atlassian.jira.cluster.DefaultClusterManager.checkIndex(DefaultClusterManager.java:155)
                at com.atlassian.jira.startup.ClusteringLauncher.start(ClusteringLauncher.java:37)
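
            For illustration, a minimal sketch of how the stale ACTIVE row described in the comment above could be inspected and removed, assuming a PostgreSQL database and Jira's standard clusternode table; the database name and node id are placeholders, and the affected node should be stopped first:

              # Check which nodes the cluster still considers ACTIVE (connection details are assumptions)
              psql -d jiradb -c "SELECT node_id, node_state FROM clusternode;"
              # Remove the row for the node that was already terminated (replace the placeholder id)
              psql -d jiradb -c "DELETE FROM clusternode WHERE node_id = 'terminated-node-id';"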

            Divya TV added a comment -

            This was during a full re-index on one node taken off the load balancer while the other nodes were running, but with the load balancer shut down. I copied the index from the node that ran the full re-index to the others for the cluster index replication warnings to go away.

            Maciej Swinarski (Inactive) added a comment -

            5d8a2d43a040

            > The behavior is on 8.13.12 as well. It happens even on a node which is performing a full re-index.

            Not sure if I understand you correctly, but this is a valid case: when a node is doing a full foreground reindex, waiting for an index snapshot from other nodes, restoring an index snapshot, etc., it will pause index replication, and you may see this in the logs (INFO) as described in the issue.

            The problem (this bug) is when this state never ends. Most often it will happen when a node waits for an index snapshot from another node but fails (for many possible reasons) to get this index snapshot. Such a node will never recover on its own from such an event.

            In the case of a full re-index you may see this log message during re-indexing, but replication should start again after the full reindex finishes. Can you confirm that after the re-index was done the replication was still paused?

            Divya TV added a comment -

            The behavior is on 8.13.12 as well. It happens even on  a node which is performing a full re-index.


            Stéphane Veraart added a comment -

            @Karen: you can always copy it via CLI access on Linux, or just copy+paste on a Windows server. Otherwise, ask your networking team to set up a specific IP address for that server so you can reach the node directly without going through the load balancer.

            Another option is to just rename the /var/.../caches/indexV1 directory to /caches/indexV1backup and restart the server. Normally the server will then automatically rebuild the index.

            Karen Mixon added a comment - - edited

            So, the trick is you have to be ON THE BAD NODE to copy the index from a good node. I had to log in to a different incognito window in three different browsers to roll my dice and finally hit the bad node. C'mon y'all.


            Shirley Tsai added a comment -

            This also happened to us when upgrading to 8.13.6 and 8.13.7.

            Gonchik Tsymzhitov added a comment -

            That situation persists in the 8.13.3, 8.13.6, and 8.13.7 releases.

              Assignee: Maciej Swinarski (Inactive)
              Reporter: Vinicius Fontes
              Affected customers: 83
              Watchers: 121