  Confluence Data Center
  CONFSERVER-41263

Logic bug in split brain scenario causes cluster panics


Details

    Description

      The verify() method in HazelcastClusterSafetyManager contains a logic bug that can cause cluster panics.

          @Override
          public void verify() {
              final int nextValue = getNextValue();
              final Optional<String> lastCacheModifier = getLastCacheModifier();
              final Optional<Integer> dbSafetyNumber = getDbSafetyNumber();
              final Optional<Integer> cacheSafetyNumber = getCacheSafetyNumber();
      
              if (dbSafetyNumber.isPresent() && cacheSafetyNumber.isPresent()) {
                  if (!dbSafetyNumber.equals(cacheSafetyNumber)) {
                      log.warn("detected different number in database [ {} ] and cache [ {} ]. Cache number last set by [ {} ]. Triggering panic on current node", dbSafetyNumber.get(), cacheSafetyNumber.get(), lastCacheModifier.get());
      
                      logDetails(nextValue);
      
                      panic();
                      return;
                  }
              } else if (dbSafetyNumber.isPresent())
                  log.debug("found cluster safety number in database [ {} ] but not in cache", dbSafetyNumber.get());
              else if (cacheSafetyNumber.isPresent())
                  log.debug("found cluster safety number in cache [ {} ] but not in database", getCacheSafetyNumber());
      
              logDetails(nextValue);
              clusterSafetyDao.setSafetyNumber(nextValue);
              storeCacheNumber(nextValue);
      
              sanityCheck(getDbSafetyNumber(), getCacheSafetyNumber(), nextValue);
          }

      The following condition evaluates to false if there is no cluster safety number in the database or in the Hazelcast cache: if (dbSafetyNumber.isPresent() && cacheSafetyNumber.isPresent())
      If that is the case, the node writes a cluster safety number to both and proceeds as normal. This is the correct behaviour the first time the job runs after a cluster is started.
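
      For reference, here is a minimal sketch (not the Confluence source; the class and enum names are illustrative) of which branch verify() takes for each combination of safety-number presence:

          import java.util.Optional;

          class SafetyNumberBranchSketch {
              enum Branch { COMPARE_AND_PANIC_ON_MISMATCH, WRITE_NEW_NUMBER }

              static Branch branchFor(Optional<Integer> dbSafetyNumber, Optional<Integer> cacheSafetyNumber) {
                  if (dbSafetyNumber.isPresent() && cacheSafetyNumber.isPresent()) {
                      // Both present: the values are compared and a mismatch triggers panic().
                      return Branch.COMPARE_AND_PANIC_ON_MISMATCH;
                  }
                  // Db only, cache only, or neither: the method falls through and the node
                  // writes a fresh safety number to the database and to its own cache.
                  return Branch.WRITE_NEW_NUMBER;
              }

              public static void main(String[] args) {
                  System.out.println(branchFor(Optional.of(1), Optional.of(2)));       // COMPARE_AND_PANIC_ON_MISMATCH
                  System.out.println(branchFor(Optional.of(1), Optional.empty()));     // WRITE_NEW_NUMBER
                  System.out.println(branchFor(Optional.empty(), Optional.empty()));   // WRITE_NEW_NUMBER
              }
          }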

      However, this can cause problems in a split brain scenario. A split brain occurs when one node (e.g. node1) is no longer part of the original cluster but is still operating with its own partition of the distributed Hazelcast cache. If node1's partition of the cache does not contain the cluster safety number, node1 writes a new one to the database and to its own cache (but not to the caches of nodes 2 and 3). When node 2 or node 3 next runs the cluster safety job, the numbers exist in both the database and its own Hazelcast cache, but they do not match. This triggers a cluster panic, taking down the larger portion of the cluster and leaving a single node in operation. If that single node cannot handle the load it has to pick up, as is sometimes the case, the result is a complete outage.
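
      To make the sequence concrete, here is a rough simulation of the scenario (plain maps stand in for the shared database and for each side of the split Hazelcast cache; all names are hypothetical):

          import java.util.HashMap;
          import java.util.Map;

          // Rough simulation of the split brain sequence described above.
          class SplitBrainSketch {
              static final String KEY = "clusterSafetyNumber";

              public static void main(String[] args) {
                  Map<String, Integer> database = new HashMap<>();
                  Map<String, Integer> node1Cache = new HashMap<>();     // node1's partition: safety number missing
                  Map<String, Integer> nodes23Cache = new HashMap<>();   // the larger partition
                  database.put(KEY, 100);
                  nodes23Cache.put(KEY, 100);                            // nodes 2 and 3 agree with the database

                  // node1 runs the safety job: its cache value is absent, so verify() falls
                  // through and writes a fresh number to the database and its own cache only.
                  if (node1Cache.get(KEY) == null) {
                      int fresh = 101;
                      database.put(KEY, fresh);
                      node1Cache.put(KEY, fresh);
                  }

                  // nodes 2 and 3 run the job next: both values are present but no longer
                  // match, so the larger portion of the cluster panics.
                  int db = database.get(KEY);
                  int cache = nodes23Cache.get(KEY);
                  if (db != cache) {
                      System.out.printf("mismatch: db=%d, cache=%d -> panic on nodes 2 and 3%n", db, cache);
                  }
              }
          }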


            People

              Assignee: mfedoryshyn Maksym Fedoryshyn
              Reporter: dunterwurzacher Denise Unterwurzacher [Atlassian] (Inactive)
              Votes: 3
              Watchers: 15
