CONFSERVER-41014

Allow cluster safety and hazelcast heartbeat intervals to be configurable


Details

    • We collect Confluence feedback from various sources, and we evaluate what we've collected when planning our product roadmap. To understand how this piece of feedback will be reviewed, see our Implementation of New Features Policy.

    Description

      NOTE: This suggestion is for Confluence Server. Using Confluence Cloud? See the corresponding suggestion.

      Changing these values wouldn't often be recommended, but they can be altered to trade off the potential for data integrity issues against uptime, or vice versa.

      To set the Hazelcast heartbeat to 60s (for example), add the JVM parameter -Dconfluence.cluster.hazelcast.max.no.heartbeat.seconds=60
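
      As an aside (and only as an illustration, not Confluence's actual code), a -D parameter like this surfaces as a JVM system property that the application can read with a fallback to the 30s default, along these lines:

          // Illustrative sketch only: shows how the -D flag surfaces as a system property.
          // How Confluence actually reads and validates the value may differ.
          public class HeartbeatPropertySketch {
              public static void main(String[] args) {
                  int heartbeatSeconds = Integer.getInteger(
                          "confluence.cluster.hazelcast.max.no.heartbeat.seconds", 30); // 30s default
                  System.out.println("Hazelcast heartbeat timeout: " + heartbeatSeconds + "s");
              }
          }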

      To modify the Cluster Safety interval or disable the job entirely, go to Admin > Scheduled jobs.

      How these two settings operate

      Hazelcast is Confluence's distributed caching mechanism. Data will be stored in the cache on one node only, and when other nodes want that data, they contact the node that has it. If no node has it, they query the database. Hazelcast controls the cluster membership. If a node has not communicated with the other nodes in the cluster for the heartbeat period (30s by default), the other nodes will remove it from the cluster. It will then not be able to contact the shared cache until it has successfully rejoined the cluster.
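
      For illustration, in plain Hazelcast the equivalent setting is the hazelcast.max.no.heartbeat.seconds property, which the Confluence system property above presumably maps onto. A minimal standalone sketch (not Confluence's own wiring) looks like this:

          import com.hazelcast.config.Config;
          import com.hazelcast.core.Hazelcast;
          import com.hazelcast.core.HazelcastInstance;

          public class HeartbeatConfigSketch {
              public static void main(String[] args) {
                  Config config = new Config();
                  // Members that send no heartbeat for 60 seconds are removed from the cluster.
                  config.setProperty("hazelcast.max.no.heartbeat.seconds", "60");
                  HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
                  System.out.println("Cluster members: " + hz.getCluster().getMembers());
              }
          }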

      Cluster safety is a scheduled job that runs on each node in a cluster, every 30s by default. It compares a number in Hazelcast's cache to a number in the database. If the numbers differ, the node will throw a panic event and stop processing. It is intended to ensure that all nodes that are able to communicate with the database are also communicating with Hazelcast's shared cache, to avoid writing out-of-date data to the database and overwriting more recent edits.
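
      A simplified sketch of the idea behind the check (class and method names here are hypothetical, not Confluence's actual implementation):

          // Illustrative only: compare a "safety number" held in the distributed cache
          // with the one in the database, and panic if they disagree.
          interface SafetyNumberStore {
              long getSafetyNumber();
              void setSafetyNumber(long value);
          }

          public class ClusterSafetySketch {
              private final SafetyNumberStore cache;    // backed by Hazelcast in a real cluster
              private final SafetyNumberStore database; // backed by the shared database

              public ClusterSafetySketch(SafetyNumberStore cache, SafetyNumberStore database) {
                  this.cache = cache;
                  this.database = database;
              }

              // Runs on each node on the Cluster Safety schedule (every 30s by default).
              public void runCheck() {
                  long cached = cache.getSafetyNumber();
                  long stored = database.getSafetyNumber();
                  if (cached != stored) {
                      // The node can reach the database but disagrees with the shared cache:
                      // assume a split brain and stop processing (a cluster panic).
                      throw new IllegalStateException("Cluster panic: safety numbers differ");
                  }
                  // Advance the number in both places so the next run re-verifies communication
                  // (the real job's bookkeeping may differ).
                  long next = stored + 1;
                  database.setSafetyNumber(next);
                  cache.setSafetyNumber(next);
              }
          }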

      Some example configurations

      Uptime over data integrity
      • Change both to 1 minute - this would mean you could have network interruptions or GC pauses up to 1 minute without triggering cluster panics. However, in the rare case that two nodes are no longer communicating, there would be a maximum of a 1 minute interval where the data on each node may be out of sync, and this could cause conflicting data to be written to the database.
      • Leaving the Hazelcast timeout at 30 seconds and setting cluster safety to 10 minutes would allow nodes to be evicted from and rejoin the cluster (e.g. after a 30s GC pause or network blip) without a cluster panic occurring. This would mean that the cluster could potentially operate in a 'split brain' scenario, where the nodes do not rejoin, for up to 10 minutes before throwing a panic. However, it would give evicted nodes a lot more time to rejoin the cluster gracefully after eviction.
      Data integrity over uptime
      • Change both to 15 seconds - this sacrifices uptime for more confidence around data integrity. It means a maximum of 15 seconds where nodes could be out of communication with each other and still writing to the database. It could mean that GCs and network blips over 15 seconds will trigger a cluster panic, so you would run the risk of having more outages.
      • Change the Hazelcast timeout to 1 minute and the cluster safety job to 15 seconds - this would mean GC pauses and the like could run for up to 1 minute without nodes being evicted from the cluster. However, once they are evicted, the time during which they could potentially write to the database is lowered to 15 seconds.

People

Assignee: dunterwurzacher Denise Unterwurzacher [Atlassian] (Inactive)
Reporter: dunterwurzacher Denise Unterwurzacher [Atlassian] (Inactive)
Votes: 0
Watchers: 7