Issue Summary

      When a new clustered node starts up, it restores the latest index snapshot from the Shared Home, but then inadvertently restores an older index snapshot left over from a previous index rebuild/propagation.

      • If the restored index snapshot is more than 48 hours old (e.g. 1 month old), the node's index is now out of date
        • Because the journalentry table only retains the past 48 hours' worth of index entries, the node is missing the majority of the past month's indexed objects
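
      To gauge exposure, the snapshot sets in the Shared Home can be listed in journal-id order; a minimal shell sketch, assuming a typical Linux install (the shared-home path below is an assumption, adjust it to your deployment):

        # List main_index snapshot zips in ascending journal-id order. If a
        # node's startup log later shows it restoring an id far below the
        # newest one listed here, it has hit this issue.
        ls /var/atlassian/application-data/confluence/shared-home/index-snapshots/IndexSnapshot_main_index_*.zip | sort -t_ -k4 -n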

      Steps to Reproduce

      1. Deploy a clustered Confluence Data Center (start with one node only)
      2. Once the node is up and running, scale the cluster to two nodes
      3. Create a few brand new pages so there are items to index
      4. Navigate to Confluence Administration » General Configuration » Content Indexing
        • Initiate a Site reindex rebuild
        • Upon completion:
          • The <Shared-Home>/index-snapshots folder will have a copy of the newly saved index snapshot zip files:
            e.g. Oct 14, 12:58pm
            index-snapshots% ls -l
            -rw-------. 1 confluence confluence 322M Sep 16 10:26 IndexSnapshot_change_index_18934700.zip
            -rw-------. 1 confluence confluence 332M Oct 14 12:59 IndexSnapshot_change_index_19385800.zip
            
            -rw-------. 1 confluence confluence 2.4G Sep 16 10:25 IndexSnapshot_main_index_18933300.zip
            -rw-------. 1 confluence confluence 2.5G Oct 14 12:58 IndexSnapshot_main_index_19385400.zip
            
          • The second node will restore the propagated index
      5. Now create some more new content so there are items to index
      6. Navigate to Confluence Administration » General Configuration » Scheduled Jobs
        • Manually run the Clean Journal Entries job (normally run at 2am daily)
          • This will create a new index snapshot set, along with _journal_id marker files, in the <Shared-Home>/index-snapshots folder
            e.g. Nov 14 02:04am
            index-snapshots% ls -l
            -rw-------. 1 confluence confluence 322M Sep 16 10:26 IndexSnapshot_change_index_18934700.zip
            -rw-------. 1 confluence confluence 332M Oct 14 12:59 IndexSnapshot_change_index_19385800.zip
            -rw-------. 1 confluence confluence 334M Nov 14 02:05 IndexSnapshot_change_index_20020900.zip
            -rw-------. 1 confluence confluence 8    Nov 14 02:05 IndexSnapshot_change_index_journal_id
            
            -rw-------. 1 confluence confluence 1.3M Nov 14 02:05 IndexSnapshot_edge_index_30019973.zip
            -rw-------. 1 confluence confluence 8    Nov 14 02:05 IndexSnapshot_edge_index_journal_id
            
            -rw-------. 1 confluence confluence 2.4G Sep 16 10:25 IndexSnapshot_main_index_18933300.zip
            -rw-------. 1 confluence confluence 2.5G Oct 14 12:58 IndexSnapshot_main_index_19385400.zip
            -rw-------. 1 confluence confluence 2.5G Nov 14 02:04 IndexSnapshot_main_index_20019500.zip
            -rw-------. 1 confluence confluence 8    Nov 14 02:04 IndexSnapshot_main_index_journal_id
            
          • All rows in journalentry older than 48 hours are deleted, except the largest-ID row of each type. Specifically, the latest RESTORE_INDEX_SNAPSHOT row will be retained for system_maintenance (see the query sketch after these steps):
            e.g.
            entry_id,journal_name,creationdate,type,message,triedtimes
            215300,system_maintenance,2023-10-24 18:58:01.123,RESTORE_INDEX_SNAPSHOT,"{""sourceNodeId"":""1020ab4c"",""indexSnapshots"":[{""index"":""MAIN_INDEX"",""journalId"":19385400},{""index"":""CHANGE_INDEX"",""journalId"":19385800}]}",0 
            
        • Manually run the Clean Index Snapshots job (normally run at 3am daily)
          • By default, this will retain up to 3 copies of index-snapshot zip files (as you can see above)
      7. Shut down Node 2
        • Clear out Node 2's local home directory
      8. Start up Node 2
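
      For reference, the retention rule described in step 6 can be observed with a read-only query. This is a sketch only, not the job's actual SQL; it assumes a PostgreSQL backend and a database named confluence:

        # Rows older than 48 hours that the Clean Journal Entries job would
        # still keep: the highest-id row per (journal_name, type). This is how
        # a stale RESTORE_INDEX_SNAPSHOT row survives indefinitely.
        psql -d confluence -c "
          SELECT entry_id, journal_name, type, creationdate
            FROM journalentry
           WHERE creationdate < NOW() - INTERVAL '48 hours'
             AND entry_id IN (SELECT MAX(entry_id)
                                FROM journalentry
                               GROUP BY journal_name, type);"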

      Expected Results

      Node 2 should initiate an index recovery from the latest index snapshot in the Shared Home (typically the 2am snapshot created by the 'Clean Journal Entries' scheduled job) and then catch up on any index entries created since that snapshot.

      Actual Results

      Node 2:

      1. Initiates an index recovery from the latest index snapshot in the Shared Home (typically the 2am snapshot created by the 'Clean Journal Entries' scheduled job)
      2. Shortly after that recovery completes, restores the older index snapshot referenced by the retained RESTORE_INDEX_SNAPSHOT row in the journalentry table:
        Nov 23 06:11am Node startup
        2023-11-23 06:11:20,838 ERROR [Caesium-1-2] [impl.system.runner.ReIndexMaintenanceTaskRunner] shouldReIndex The job id abcfd123-d122-77fa-9585-6241dd0211bb is in COMPLETE stage, which is not REBUILDING
        2023-11-23 06:11:20,840 INFO [Caesium-1-2] [impl.system.runner.RestoreIndexSnapshotMaintenanceTaskRunner] doRestore Restoring index snapshots
        2023-11-23 06:11:20,854 INFO [Caesium-1-2] [impl.system.runner.RestoreIndexSnapshotMaintenanceTaskRunner] doRestore Index snapshot IndexSnapshot[JournalId=main_index, JournalEntryId=19385400] has been restored
        2023-11-23 06:11:20,864 INFO [Caesium-1-2] [impl.system.runner.RestoreIndexSnapshotMaintenanceTaskRunner] doRestore Index snapshot IndexSnapshot[JournalId=change_index, JournalEntryId=19385800] has been restored
        2023-11-23 06:11:20,864 INFO [Caesium-1-2] [impl.system.runner.RestoreIndexSnapshotMaintenanceTaskRunner] doRestore All index snapshots have been restored successfully
        
      3. In the above example, the node now needs to catch up on a month's worth of indexed objects
        • However, with only the past 48 hours of index entries retained in the journalentry table, the node is missing almost all of the past month's indexed items
        • The older the retained propagated index snapshot, the more data the node's index will be missing (a diagnostic sketch follows this list)
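
      To confirm the mismatch on an affected cluster, compare the journalId pinned by the retained RESTORE_INDEX_SNAPSHOT row with the newest snapshot on disk. A diagnostic sketch, assuming PostgreSQL and a typical shared-home path (both are assumptions):

        # The journalId(s) the retained restore row points at ...
        psql -d confluence -t -c \
          "SELECT message FROM journalentry
            WHERE journal_name = 'system_maintenance'
              AND type = 'RESTORE_INDEX_SNAPSHOT'
            ORDER BY entry_id DESC LIMIT 1;"
        # ... versus the newest main_index snapshot actually present on disk.
        ls -t /var/atlassian/application-data/confluence/shared-home/index-snapshots/IndexSnapshot_main_index_*.zip | head -1
        # If the journalId in the message (e.g. 19385400) is lower than the id
        # embedded in the newest filename (e.g. 20019500), a restarting node
        # will re-restore the stale snapshot after its initial recovery.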

      Workaround

      1. Follow the steps in Configuring system properties and add the following JVM flag on each node (see the setenv.sh sketch after these steps):
        -Dindex.snapshot.retain.count=1
        
      2. Restart each node (one at a time) for the change to take effect.
      3. Once all nodes have the above flag configured, manually run the Clean Index Snapshots job (Confluence Administration » General Configuration » Scheduled Jobs); this will retain just the latest index file set
      4. The next time a node starts up from scratch, it will only restore the latest index set
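
      On a standard Linux installation, the flag is typically added via CATALINA_OPTS in <install-directory>/bin/setenv.sh; a sketch of the relevant lines (your existing setenv.sh contents may differ, and Windows/service installs use a different mechanism, as per Configuring system properties):

        # <install-directory>/bin/setenv.sh
        CATALINA_OPTS="-Dindex.snapshot.retain.count=1 ${CATALINA_OPTS}"
        export CATALINA_OPTS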

            [CONFSERVER-94641] Old index snapshot restored on new node startup
