CONFSERVER-41014

Allow cluster safety and hazelcast heartbeat intervals to be configurable


Details

    • We collect Confluence feedback from various sources, and we evaluate what we've collected when planning our product roadmap. To understand how this piece of feedback will be reviewed, see our Implementation of New Features Policy.

    Description

      NOTE: This suggestion is for Confluence Server. Using Confluence Cloud? See the corresponding suggestion.

      Changing these values wouldn't often be recommended, but they can be altered to trade off the potential for data integrity issues against uptime, or vice versa.

      To set the Hazelcast heartbeat to 60s (for example), add the JVM parameter -Dconfluence.cluster.hazelcast.max.no.heartbeat.seconds=60
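
      As an aside (and only as an illustration, not Confluence's actual code), a -D parameter like this surfaces as a JVM system property that the application can read with a fallback to the 30s default, along these lines:

          // Illustrative sketch only: shows how the -D flag surfaces as a system property.
          // How Confluence actually reads and validates the value may differ.
          public class HeartbeatPropertySketch {
              public static void main(String[] args) {
                  int heartbeatSeconds = Integer.getInteger(
                          "confluence.cluster.hazelcast.max.no.heartbeat.seconds", 30); // 30s default
                  System.out.println("Hazelcast heartbeat timeout: " + heartbeatSeconds + "s");
              }
          }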

      To modify the Cluster Safety interval or disable the job entirely, go to Admin > Scheduled jobs.

      How these two settings operate

      Hazelcast is Confluence's distributed caching mechanism. Data will be stored in the cache on one node only, and when other nodes want that data, they contact the node that has it. If no node has it, they query the database. Hazelcast controls the cluster membership. If a node has not communicated with the other nodes in the cluster for the heartbeat period (30s by default), the other nodes will remove it from the cluster. It will then not be able to contact the shared cache until it has successfully rejoined the cluster.
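
      For illustration, in plain Hazelcast the equivalent setting is the hazelcast.max.no.heartbeat.seconds property, which the Confluence system property above presumably maps onto. A minimal standalone sketch (not Confluence's own wiring) looks like this:

          import com.hazelcast.config.Config;
          import com.hazelcast.core.Hazelcast;
          import com.hazelcast.core.HazelcastInstance;

          public class HeartbeatConfigSketch {
              public static void main(String[] args) {
                  Config config = new Config();
                  // Members that send no heartbeat for 60 seconds are removed from the cluster.
                  config.setProperty("hazelcast.max.no.heartbeat.seconds", "60");
                  HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
                  System.out.println("Cluster members: " + hz.getCluster().getMembers());
              }
          }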

      Cluster safety is a scheduled job that runs on each node in a cluster, every 30s by default. It compares a number in Hazelcast's cache to a number in the database. If the numbers differ, the node will throw a panic event and stop processing. It is intended to ensure that all nodes that are able to communicate with the database are also communicating with Hazelcast's shared cache, to avoid writing out-of-date data to the database and overwriting more recent edits.
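
      A simplified sketch of the idea behind the check (class and method names here are hypothetical, not Confluence's actual implementation):

          // Illustrative only: compare a "safety number" held in the distributed cache
          // with the one in the database, and panic if they disagree.
          interface SafetyNumberStore {
              long getSafetyNumber();
              void setSafetyNumber(long value);
          }

          public class ClusterSafetySketch {
              private final SafetyNumberStore cache;    // backed by Hazelcast in a real cluster
              private final SafetyNumberStore database; // backed by the shared database

              public ClusterSafetySketch(SafetyNumberStore cache, SafetyNumberStore database) {
                  this.cache = cache;
                  this.database = database;
              }

              // Runs on each node on the Cluster Safety schedule (every 30s by default).
              public void runCheck() {
                  long cached = cache.getSafetyNumber();
                  long stored = database.getSafetyNumber();
                  if (cached != stored) {
                      // The node can reach the database but disagrees with the shared cache:
                      // assume a split brain and stop processing (a cluster panic).
                      throw new IllegalStateException("Cluster panic: safety numbers differ");
                  }
                  // Advance the number in both places so the next run re-verifies communication
                  // (the real job's bookkeeping may differ).
                  long next = stored + 1;
                  database.setSafetyNumber(next);
                  cache.setSafetyNumber(next);
              }
          }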

      Some example configurations

      Uptime over data integrity
      • Change both to 1 minute - this would mean you could have network interruptions or GC pauses up to 1 minute without triggering cluster panics. However, in the rare case that two nodes are no longer communicating, there would be a maximum of a 1 minute interval where the data on each node may be out of sync, and this could cause conflicting data to be written to the database.
      • Leaving the Hazelcast timeout at 30 seconds and setting cluster safety to 10 minutes would allow nodes to be evicted from and rejoin the cluster (e.g. after a 30s GC pause or network blip) without a cluster panic occurring. This would mean that the cluster could potentially operate in a 'split brain' scenario, where the nodes do not rejoin, for up to 10 minutes before throwing a panic. However, it would give evicted nodes a lot more time to rejoin the cluster gracefully after eviction.
      Data integrity over uptime
      • Change both to 15 seconds - this sacrifices uptime for more confidence around data integrity. It means a maximum of 15 seconds where nodes could be out of communication with each other and still writing to the database. It could mean that GCs and network blips over 15 seconds will trigger a cluster panic, so you would run the risk of having more outages.
      • Change the Hazelcast timeout to 1 minute and the cluster safety job to 15 seconds - this would mean GC pauses and the like could run for up to 1 minute without nodes being evicted from the cluster. However, once they are evicted, the time during which they could potentially write to the database is lowered to 15 seconds.

People

Assignee: dunterwurzacher Denise Unterwurzacher [Atlassian] (Inactive)
Reporter: dunterwurzacher Denise Unterwurzacher [Atlassian] (Inactive)
Votes: 0
Watchers: 7