  Confluence Data Center
  CONFSERVER-41263

Logic bug in split brain scenario causes cluster panics


Details

    Description

      The verify() method in HazelcastClusterSafetyManager contains a logic bug that can cause cluster panics.

          @Override
          public void verify() {
              final int nextValue = getNextValue();
              final Optional<String> lastCacheModifier = getLastCacheModifier();
              final Optional<Integer> dbSafetyNumber = getDbSafetyNumber();
              final Optional<Integer> cacheSafetyNumber = getCacheSafetyNumber();
      
              if (dbSafetyNumber.isPresent() && cacheSafetyNumber.isPresent()) {
                  if (!dbSafetyNumber.equals(cacheSafetyNumber)) {
                      log.warn("detected different number in database [ {} ] and cache [ {} ]. Cache number last set by [ {} ]. Triggering panic on current node", dbSafetyNumber.get(), cacheSafetyNumber.get(), lastCacheModifier.get());
      
                      logDetails(nextValue);
      
                      panic();
                      return;
                  }
              } else if (dbSafetyNumber.isPresent())
                  log.debug("found cluster safety number in database [ {} ] but not in cache", dbSafetyNumber.get());
              else if (cacheSafetyNumber.isPresent())
                  log.debug("found cluster safety number in cache [ {} ] but not in database", getCacheSafetyNumber());
      
              logDetails(nextValue);
              clusterSafetyDao.setSafetyNumber(nextValue);
              storeCacheNumber(nextValue);
      
              sanityCheck(getDbSafetyNumber(), getCacheSafetyNumber(), nextValue);
          }

      The following condition evaluates to false if there is no cluster safety number in the database or in the Hazelcast cache: if (dbSafetyNumber.isPresent() && cacheSafetyNumber.isPresent())
      If that is the case, the node writes a cluster safety number to both and proceeds as normal. This is the correct behaviour the first time the job runs after a cluster is started.
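
      For reference, here is a minimal sketch (not the Confluence source; the class and enum names are illustrative) of which branch verify() takes for each combination of safety-number presence:

          import java.util.Optional;

          class SafetyNumberBranchSketch {
              enum Branch { COMPARE_AND_PANIC_ON_MISMATCH, WRITE_NEW_NUMBER }

              static Branch branchFor(Optional<Integer> dbSafetyNumber, Optional<Integer> cacheSafetyNumber) {
                  if (dbSafetyNumber.isPresent() && cacheSafetyNumber.isPresent()) {
                      // Both present: the values are compared and a mismatch triggers panic().
                      return Branch.COMPARE_AND_PANIC_ON_MISMATCH;
                  }
                  // Db only, cache only, or neither: the method falls through and the node
                  // writes a fresh safety number to the database and to its own cache.
                  return Branch.WRITE_NEW_NUMBER;
              }

              public static void main(String[] args) {
                  System.out.println(branchFor(Optional.of(1), Optional.of(2)));       // COMPARE_AND_PANIC_ON_MISMATCH
                  System.out.println(branchFor(Optional.of(1), Optional.empty()));     // WRITE_NEW_NUMBER
                  System.out.println(branchFor(Optional.empty(), Optional.empty()));   // WRITE_NEW_NUMBER
              }
          }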

      However, this can cause problems in a split brain scenario. A split brain occurs when one node (e.g. node1) is no longer part of the original cluster but is still operating with its own partition of the distributed Hazelcast cache. If node1's partition of the cache does not contain the cluster safety number, node1 writes a new one to the database and to its own cache (but not to the caches of nodes 2 and 3). When node 2 or node 3 next runs the cluster safety job, the numbers exist in both the database and its own Hazelcast cache, but they do not match. This triggers a cluster panic, taking down the larger portion of the cluster and leaving a single node in operation. If that single node cannot handle the load it has to pick up, as is sometimes the case, the result is a complete outage.
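
      To make the sequence concrete, here is a rough simulation of the scenario (plain maps stand in for the shared database and for each side of the split Hazelcast cache; all names are hypothetical):

          import java.util.HashMap;
          import java.util.Map;

          // Rough simulation of the split brain sequence described above.
          class SplitBrainSketch {
              static final String KEY = "clusterSafetyNumber";

              public static void main(String[] args) {
                  Map<String, Integer> database = new HashMap<>();
                  Map<String, Integer> node1Cache = new HashMap<>();     // node1's partition: safety number missing
                  Map<String, Integer> nodes23Cache = new HashMap<>();   // the larger partition
                  database.put(KEY, 100);
                  nodes23Cache.put(KEY, 100);                            // nodes 2 and 3 agree with the database

                  // node1 runs the safety job: its cache value is absent, so verify() falls
                  // through and writes a fresh number to the database and its own cache only.
                  if (node1Cache.get(KEY) == null) {
                      int fresh = 101;
                      database.put(KEY, fresh);
                      node1Cache.put(KEY, fresh);
                  }

                  // nodes 2 and 3 run the job next: both values are present but no longer
                  // match, so the larger portion of the cluster panics.
                  int db = database.get(KEY);
                  int cache = nodes23Cache.get(KEY);
                  if (db != cache) {
                      System.out.printf("mismatch: db=%d, cache=%d -> panic on nodes 2 and 3%n", db, cache);
                  }
              }
          }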


            People

              Assignee: mfedoryshyn Maksym Fedoryshyn
              Reporter: dunterwurzacher Denise Unterwurzacher [Atlassian] (Inactive)
              Votes: 3
              Watchers: 15
