Loading...

Type: Bug
Resolution: Fixed
Priority: High
Fix Version/s: 6.5.1
Affects Version/s: 5.8.18, 5.10.8
Component/s: Data Center - Core
Labels:

Support reference count:
10
Symptom Severity:
Severity 3 - Minor
UIS:
11

Problem description

In a Data Center configuration, if Hazelcast is shut down on a node (e.g. in a cluster panic) the logs are quickly flooded with stacktraces like the below:

2017-04-13 08:41:55,001 ERROR [scheduler_Worker-8] [org.quartz.core.ErrorLogger] schedulerError Unable to notify JobListener(s) of Job to be executed: (Job will NOT be executed!). trigger= DEFAULT.IndexQueueFlusher job= DEFAULT.IndexQueueFlusher
org.quartz.SchedulerException: JobListener 'ScheduledJobListener' threw exception: Problem retrieving a value from cache com.atlassian.bandana.BandanaPersister [See nested exception: com.atlassian.cache.CacheException: Problem retrieving a value from cache com.atlassian.bandana.BandanaPersister]
        at org.quartz.core.QuartzScheduler.notifyJobListenersToBeExecuted(QuartzScheduler.java:1951)
        at org.quartz.core.JobRunShell.notifyListenersBeginning(JobRunShell.java:364)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:190)
        at com.atlassian.confluence.schedule.quartz.ConfluenceQuartzThreadPool.lambda$runInThread$0(ConfluenceQuartzThreadPool.java:16)
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:549)
Caused by: com.atlassian.cache.CacheException: Problem retrieving a value from cache com.atlassian.bandana.BandanaPersister
        at com.atlassian.cache.hazelcast.HazelcastCache.getOrLoad(HazelcastCache.java:268)
        at com.atlassian.cache.hazelcast.HazelcastCache.get(HazelcastCache.java:97)
        at com.atlassian.confluence.cache.DefaultConfluenceCache.get(DefaultConfluenceCache.java:46)
        at com.atlassian.confluence.cache.ConfluenceMonitoringCache.get(ConfluenceMonitoringCache.java:44)
        at com.atlassian.confluence.setup.bandana.ConfluenceCachingBandanaPersister.retrieve(ConfluenceCachingBandanaPersister.java:52)
        at com.atlassian.bandana.DefaultBandanaManager.getValue(DefaultBandanaManager.java:32)
        at com.atlassian.bandana.DefaultBandanaManager.getValue(DefaultBandanaManager.java:24)
        at com.atlassian.confluence.schedule.persistence.dao.ClusterSafeScheduledJobDao.getScheduledJobConfiguration(ClusterSafeScheduledJobDao.java:90)
        at com.atlassian.confluence.schedule.managers.DefaultScheduledJobStatusManager.getScheduledJobStatus(DefaultScheduledJobStatusManager.java:44)
        at com.atlassian.confluence.schedule.listeners.ScheduledJobsListener.lambda$executeWithJobAndJobStatus$2(ScheduledJobsListener.java:107)
        at com.atlassian.confluence.schedule.listeners.ScheduledJobsListener$1.doInTransactionWithoutResult(ScheduledJobsListener.java:131)
        at org.springframework.transaction.support.TransactionCallbackWithoutResult.doInTransaction(TransactionCallbackWithoutResult.java:33)
        at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:128)
        at com.atlassian.confluence.schedule.listeners.ScheduledJobsListener.executeInTransaction(ScheduledJobsListener.java:128)
        at com.atlassian.confluence.schedule.listeners.ScheduledJobsListener.executeWithJobAndJobStatus(ScheduledJobsListener.java:102)
        at com.atlassian.confluence.schedule.listeners.ScheduledJobsListener.jobToBeExecuted(ScheduledJobsListener.java:59)
        at org.quartz.core.QuartzScheduler.notifyJobListenersToBeExecuted(QuartzScheduler.java:1949)
        ... 4 more
Caused by: com.hazelcast.core.HazelcastInstanceNotActiveException: Hazelcast instance is not active!
        at com.hazelcast.spi.AbstractDistributedObject.getService(AbstractDistributedObject.java:93)
        at com.hazelcast.map.impl.proxy.MapProxySupport.toData(MapProxySupport.java:1046)
        at com.hazelcast.map.impl.proxy.MapProxyImpl.get(MapProxyImpl.java:83)
        at com.atlassian.cache.hazelcast.HazelcastCache.getOrLoad(HazelcastCache.java:244)
        ... 20 more

These stacktraces are recorded at the rate of tens/hundreds of times per minute and can quickly fill up the default log configuration of ~120 MB of atlassian-confluence.log files with nothing but these entries. The logs can roll in a matter of minutes.

The reason this is a problem is that it can cause the root cause of the node shutting down (e.g. the cluster panic event) to be rolled off the logs way before administrators have a chance to react/intervene to recover the node or manually pull the logs before rolling off. In other words, this flood of logging can hamper efforts for root cause analysis for large clients.

Ideally, we can log this type of stacktrace once and minimize the verbosity future entries of a similar type instead of repeatedly logging the stacktrace.

Workaround

The current workaround is to extend the default logging so that more logs are kept in archive, in order to give administrators a greater window to capture the relevant logs. This can be done my modifying the following two parameters in <confluence_install>/confluence/WEB-INF/classes/log4j.properties, which influences size of log files and number of logs kept in archive, respectively:

log4j.appender.confluencelog.MaxFileSize=20480KB
log4j.appender.confluencelog.MaxBackupIndex=5

Note however that increasing the logs kept will impact the disk footprint and size of Support Zips generated.

mentioned in: Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...

Details

Description

Problem description

Workaround

Attachments

Issue Links

Activity

People

Dates