Loading...

XML

Word

Printable

Type: Bug
Resolution: Fixed
Priority: Low
Fix Version/s: 6.6.5, 6.7.3, 6.9.0
Affects Version/s: 6.6.3
Component/s: Data Center - Core
Labels:

Fixed in Long Term Support Release/s:

Download 6.6
Support reference count:
1
Symptom Severity:
Severity 2 - Major
UIS:
1

Summary

Confluence is shipped with Cluster Safety mechanism in which is designed to ensure that its data cannot become inconsistent.
This mechanism runs as a scheduled task - ClusterSafetyJob, which runs every 30 seconds by default. The scheduled task operates on a safety number – a randomly generated number that is stored both in the database and in the distributed cache used across a cluster. These random numbers were then compared to ensure data consistency. However, should the number differs, Confluence will then trigger a cluster panic event.

In a specific event where the safety number update (UpdateTimestampsCache) on either one of the sources failed, the following issue will then occurred:

HazelcastClusterSafetyManager updates numbers in database and cache
HazelcastClusterSafetyManager runs sanity check
During sanity check, Hibernate fails to update UpdateTimestampsCache_v5
Hibernate rolls back transaction
Safety number in the database is rolled back to previous value, but number in the cache remains updated.
Next run of cluster safety job forces cluster panic, as numbers in the cache and database are different

Expected Results

Instead of triggering cluster panic event, Confluence should handle this error more gracefully, e.g. Confluence should evict the bad node(s) from its cluster instead.

Actual Results

The below exception is thrown in the atlassian-confluence.log file, which then led to cluster panic:

2018-04-06 12:17:05,038 ERROR [Caesium-1-2] [org.hibernate.internal.ExceptionMapperStandardImpl] mapManagedFlushFailure HHH000346: Error during managed flush [No response for 120000 ms. Aborting invocation! Invocation{ serviceName='hz:impl:mapService', op=PutOperation{atlassian-cache.Cache.net.sf.hibernate.cache.UpdateTimestampsCache_v5}, partitionId=129, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[10.60.2.82]:5801, backupsExpected=0, backupsCompleted=0} No response has been received!  backups-expected:0 backups-completed: 0]
2018-04-06 12:17:05,838 WARN [http-nio-8090-exec-19] [impl.operationservice.impl.Invocation] log [10.60.2.85]:5801 [dev] [3.5.2-atlassian-36.1] While asking 'is-executing': Invocation{ serviceName='hz:impl:mapService', op=GetOperation{atlassian-cache.Cache.com.atlassian.confluence.impl.themes.persistence.PersistentDecoratorCache}, partitionId=94, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[172.24.204.104]:5801, backupsExpected=0, backupsCompleted=0}
 -- url: /display/SpaceKey/PageTitle | traceId: 6de36904b595f849 | userName: lrura | action: viewpage | page: 179489791
java.util.concurrent.TimeoutException: Call Invocation{ serviceName='hz:impl:mapService', op=com.hazelcast.spi.impl.operationservice.impl.operations.IsStillExecutingOperation{serviceName='hz:impl:mapService', partitionId=-1, callId=15474398, invocationTime=1523035020831, waitTimeout=-1, callTimeout=5000}, partitionId=-1, replicaIndex=0, tryCount=0, tryPauseMillis=0, invokeCount=1, callTimeout=5000, target=Address[10.60.2.82]:5801, backupsExpected=0, backupsCompleted=0} encountered a timeout
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveApplicationResponse(InvocationFuture.java:366)
...
        at com.atlassian.confluence.cluster.safety.AbstractClusterSafetyManager.updateSafetyNumber(AbstractClusterSafetyManager.java:163)

is blocked by: PSR-76 Loading...

mentioned in: Page Loading...; Page Loading...; Page Loading...; Page Loading...

Assignee:: Efim (Inactive)
Reporter:: Lauretha Rura
Votes:: 0 Vote for this issue
Watchers:: 13 Start watching this issue

Created:: 09/Apr/2018 8:41 AM
Updated:: 20/Sep/2023 6:48 AM
Resolved:: 01/May/2018 5:20 AM

Details

Description

Summary

Expected Results

Actual Results

Attachments

Issue Links

Activity

People

Dates