ClusterSafetyJob panics on UpdateTimestampsCache update failure causes Cluster Panic

XMLWordPrintable

    • 1
    • Severity 2 - Major
    • 1

      Summary

      Confluence is shipped with Cluster Safety mechanism in which is designed to ensure that its data cannot become inconsistent.
      This mechanism runs as a scheduled task - ClusterSafetyJob, which runs every 30 seconds by default. The scheduled task operates on a safety number – a randomly generated number that is stored both in the database and in the distributed cache used across a cluster. These random numbers were then compared to ensure data consistency. However, should the number differs, Confluence will then trigger a cluster panic event.

      In a specific event where the safety number update (UpdateTimestampsCache) on either one of the sources failed, the following issue will then occurred:

      1. HazelcastClusterSafetyManager updates numbers in database and cache
      2. HazelcastClusterSafetyManager runs sanity check
      3. During sanity check, Hibernate fails to update UpdateTimestampsCache_v5
      4. Hibernate rolls back transaction
      5. Safety number in the database is rolled back to previous value, but number in the cache remains updated.
      6. Next run of cluster safety job forces cluster panic, as numbers in the cache and database are different

      Expected Results

      Instead of triggering cluster panic event, Confluence should handle this error more gracefully, e.g. Confluence should evict the bad node(s) from its cluster instead.

      Actual Results

      The below exception is thrown in the atlassian-confluence.log file, which then led to cluster panic:

      2018-04-06 12:17:05,038 ERROR [Caesium-1-2] [org.hibernate.internal.ExceptionMapperStandardImpl] mapManagedFlushFailure HHH000346: Error during managed flush [No response for 120000 ms. Aborting invocation! Invocation{ serviceName='hz:impl:mapService', op=PutOperation{atlassian-cache.Cache.net.sf.hibernate.cache.UpdateTimestampsCache_v5}, partitionId=129, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[10.60.2.82]:5801, backupsExpected=0, backupsCompleted=0} No response has been received!  backups-expected:0 backups-completed: 0]
      2018-04-06 12:17:05,838 WARN [http-nio-8090-exec-19] [impl.operationservice.impl.Invocation] log [10.60.2.85]:5801 [dev] [3.5.2-atlassian-36.1] While asking 'is-executing': Invocation{ serviceName='hz:impl:mapService', op=GetOperation{atlassian-cache.Cache.com.atlassian.confluence.impl.themes.persistence.PersistentDecoratorCache}, partitionId=94, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[172.24.204.104]:5801, backupsExpected=0, backupsCompleted=0}
       -- url: /display/SpaceKey/PageTitle | traceId: 6de36904b595f849 | userName: lrura | action: viewpage | page: 179489791
      java.util.concurrent.TimeoutException: Call Invocation{ serviceName='hz:impl:mapService', op=com.hazelcast.spi.impl.operationservice.impl.operations.IsStillExecutingOperation{serviceName='hz:impl:mapService', partitionId=-1, callId=15474398, invocationTime=1523035020831, waitTimeout=-1, callTimeout=5000}, partitionId=-1, replicaIndex=0, tryCount=0, tryPauseMillis=0, invokeCount=1, callTimeout=5000, target=Address[10.60.2.82]:5801, backupsExpected=0, backupsCompleted=0} encountered a timeout
              at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveApplicationResponse(InvocationFuture.java:366)
      ...
              at com.atlassian.confluence.cluster.safety.AbstractClusterSafetyManager.updateSafetyNumber(AbstractClusterSafetyManager.java:163)
      

            Assignee:
            Efim (Inactive)
            Reporter:
            Lauretha Rura
            Votes:
            0 Vote for this issue
            Watchers:
            13 Start watching this issue

              Created:
              Updated:
              Resolved: