Confluence Server / CONFSERVER-41263

Logic bug in split brain scenario causes cluster panics

    Details

    • Symptom Severity:
      Critical
    • Support reference count:
      3
    • Sprint:
      Enterprise is back, Enterprise is awesome, Enterprise is cool, Enterprise is dynamic, Enterprise is electric, Enterprise is fantastic
    • Testing Notes:

      1. Standalone mode should work as expected (panic as soon as another instance starts up).
      2. In a split-brain scenario, the smaller part of the cluster should always panic.
      3. Time-to-survive should also work in a clustered environment.

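Testing note 2 implies the intended behaviour is partition-size aware: only the minority side of a split should panic. A minimal sketch of such a rule, with hypothetical names (this is not the actual Confluence implementation):

```java
// Hypothetical partition-size rule: a node panics only when its visible
// partition holds a strict minority of the expected cluster members.
public class PartitionRule {
    public static boolean shouldPanic(int membersVisible, int expectedClusterSize) {
        // Strict minority check: 1 of 3 panics, 2 of 3 survives.
        return membersVisible * 2 < expectedClusterSize;
    }

    public static void main(String[] args) {
        System.out.println(shouldPanic(1, 3)); // true  - split-off node panics
        System.out.println(shouldPanic(2, 3)); // false - majority survives
    }
}
```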

      Description

      The verify() method in HazelcastClusterSafetyManager contains a logic bug that can cause cluster panics.

          @Override
          public void verify() {
              final int nextValue = getNextValue();
              final Optional<String> lastCacheModifier = getLastCacheModifier();
              final Optional<Integer> dbSafetyNumber = getDbSafetyNumber();
              final Optional<Integer> cacheSafetyNumber = getCacheSafetyNumber();
      
              if (dbSafetyNumber.isPresent() && cacheSafetyNumber.isPresent()) {
                  if (!dbSafetyNumber.equals(cacheSafetyNumber)) {
                      log.warn("detected different number in database [ {} ] and cache [ {} ]. Cache number last set by [ {} ]. Triggering panic on current node", dbSafetyNumber.get(), cacheSafetyNumber.get(), lastCacheModifier.get());
      
                      logDetails(nextValue);
      
                      panic();
                      return;
                  }
              } else if (dbSafetyNumber.isPresent())
                  log.debug("found cluster safety number in database [ {} ] but not in cache", dbSafetyNumber.get());
              else if (cacheSafetyNumber.isPresent())
                  log.debug("found cluster safety number in cache [ {} ] but not in database", getCacheSafetyNumber());
      
              logDetails(nextValue);
              clusterSafetyDao.setSafetyNumber(nextValue);
              storeCacheNumber(nextValue);
      
              sanityCheck(getDbSafetyNumber(), getCacheSafetyNumber(), nextValue);
          }

      This condition evaluates to false if the cluster safety number is missing from either the database or the Hazelcast cache: if (dbSafetyNumber.isPresent() && cacheSafetyNumber.isPresent())
      In that case, the node writes a fresh cluster safety number to both and proceeds as normal. This is the correct behaviour for the first run of the job after a cluster is started.
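      As an aside, the mismatch check compares the Optionals directly (!dbSafetyNumber.equals(cacheSafetyNumber)); this works because java.util.Optional.equals compares the contained values. A quick standalone illustration:

```java
import java.util.Optional;

public class OptionalEqualsDemo {
    public static void main(String[] args) {
        // Optional.equals delegates to the wrapped values, so comparing
        // the Optionals is equivalent to comparing their contents.
        System.out.println(Optional.of(7).equals(Optional.of(7)));   // true
        System.out.println(Optional.of(7).equals(Optional.of(8)));   // false
        // An empty Optional never equals a present one, which is why
        // the outer isPresent() guards matter before this comparison.
        System.out.println(Optional.empty().equals(Optional.of(7))); // false
    }
}
```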

      However, this can cause problems in a split-brain scenario. A split brain occurs when one node (e.g. node1) is no longer part of the original cluster but is still operating with its own portion of the distributed Hazelcast cache. If node1's portion of the cache did not contain the cluster safety number, it writes a new one to the database and to its own cache (but not to the caches of nodes 2 and 3).

      When node 2 or 3 next runs the cluster safety job, the numbers exist in both the database and its own Hazelcast cache, but they no longer match. This triggers a cluster panic, taking down the larger portion of the cluster and leaving a single node in operation. If that single node cannot handle the load it has to pick up, as is sometimes the case, the result is a complete outage.
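      The sequence above can be sketched as a small simulation of the decision logic. Class, method, and value names here are illustrative only, not the actual HazelcastClusterSafetyManager code:

```java
import java.util.Optional;

// Hypothetical simulation of verify()'s branch structure: both numbers
// present and unequal -> panic; otherwise the node writes a fresh number.
public class ClusterSafetySim {
    public enum Outcome { PANIC, WRITE_NEW_NUMBER }

    public static Outcome decide(Optional<Integer> db, Optional<Integer> cache) {
        if (db.isPresent() && cache.isPresent() && !db.equals(cache)) {
            return Outcome.PANIC;
        }
        return Outcome.WRITE_NEW_NUMBER;
    }

    public static void main(String[] args) {
        // Split brain: node1's cache partition lost the safety number,
        // so it overwrites the shared database value with a new one.
        System.out.println("node1: " + decide(Optional.of(42), Optional.empty()));

        // Nodes 2 and 3 still hold the old number (42) in their cache,
        // but the database now holds node1's new value (77) -> they panic.
        System.out.println("node2: " + decide(Optional.of(77), Optional.of(42)));
    }
}
```

Running this prints WRITE_NEW_NUMBER for the split-off node and PANIC for the majority nodes, which is exactly the inverted outcome the report describes: the minority survives while the larger partition goes down.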
