Uploaded image for project: 'Confluence Data Center'
  1. Confluence Data Center
  2. CONFSERVER-53203

Logs flooded excessively with "Hazelcast instance is not active!" once Hazelcast has shut down

    XMLWordPrintable

Details

    Description

      Problem description

      In a Data Center configuration, if Hazelcast is shut down on a node (e.g. in a cluster panic) the logs are quickly flooded with stacktraces like the below:

      2017-04-13 08:41:55,001 ERROR [scheduler_Worker-8] [org.quartz.core.ErrorLogger] schedulerError Unable to notify JobListener(s) of Job to be executed: (Job will NOT be executed!). trigger= DEFAULT.IndexQueueFlusher job= DEFAULT.IndexQueueFlusher
      org.quartz.SchedulerException: JobListener 'ScheduledJobListener' threw exception: Problem retrieving a value from cache com.atlassian.bandana.BandanaPersister [See nested exception: com.atlassian.cache.CacheException: Problem retrieving a value from cache com.atlassian.bandana.BandanaPersister]
              at org.quartz.core.QuartzScheduler.notifyJobListenersToBeExecuted(QuartzScheduler.java:1951)
              at org.quartz.core.JobRunShell.notifyListenersBeginning(JobRunShell.java:364)
              at org.quartz.core.JobRunShell.run(JobRunShell.java:190)
              at com.atlassian.confluence.schedule.quartz.ConfluenceQuartzThreadPool.lambda$runInThread$0(ConfluenceQuartzThreadPool.java:16)
              at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:549)
      Caused by: com.atlassian.cache.CacheException: Problem retrieving a value from cache com.atlassian.bandana.BandanaPersister
              at com.atlassian.cache.hazelcast.HazelcastCache.getOrLoad(HazelcastCache.java:268)
              at com.atlassian.cache.hazelcast.HazelcastCache.get(HazelcastCache.java:97)
              at com.atlassian.confluence.cache.DefaultConfluenceCache.get(DefaultConfluenceCache.java:46)
              at com.atlassian.confluence.cache.ConfluenceMonitoringCache.get(ConfluenceMonitoringCache.java:44)
              at com.atlassian.confluence.setup.bandana.ConfluenceCachingBandanaPersister.retrieve(ConfluenceCachingBandanaPersister.java:52)
              at com.atlassian.bandana.DefaultBandanaManager.getValue(DefaultBandanaManager.java:32)
              at com.atlassian.bandana.DefaultBandanaManager.getValue(DefaultBandanaManager.java:24)
              at com.atlassian.confluence.schedule.persistence.dao.ClusterSafeScheduledJobDao.getScheduledJobConfiguration(ClusterSafeScheduledJobDao.java:90)
              at com.atlassian.confluence.schedule.managers.DefaultScheduledJobStatusManager.getScheduledJobStatus(DefaultScheduledJobStatusManager.java:44)
              at com.atlassian.confluence.schedule.listeners.ScheduledJobsListener.lambda$executeWithJobAndJobStatus$2(ScheduledJobsListener.java:107)
              at com.atlassian.confluence.schedule.listeners.ScheduledJobsListener$1.doInTransactionWithoutResult(ScheduledJobsListener.java:131)
              at org.springframework.transaction.support.TransactionCallbackWithoutResult.doInTransaction(TransactionCallbackWithoutResult.java:33)
              at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:128)
              at com.atlassian.confluence.schedule.listeners.ScheduledJobsListener.executeInTransaction(ScheduledJobsListener.java:128)
              at com.atlassian.confluence.schedule.listeners.ScheduledJobsListener.executeWithJobAndJobStatus(ScheduledJobsListener.java:102)
              at com.atlassian.confluence.schedule.listeners.ScheduledJobsListener.jobToBeExecuted(ScheduledJobsListener.java:59)
              at org.quartz.core.QuartzScheduler.notifyJobListenersToBeExecuted(QuartzScheduler.java:1949)
              ... 4 more
      Caused by: com.hazelcast.core.HazelcastInstanceNotActiveException: Hazelcast instance is not active!
              at com.hazelcast.spi.AbstractDistributedObject.getService(AbstractDistributedObject.java:93)
              at com.hazelcast.map.impl.proxy.MapProxySupport.toData(MapProxySupport.java:1046)
              at com.hazelcast.map.impl.proxy.MapProxyImpl.get(MapProxyImpl.java:83)
              at com.atlassian.cache.hazelcast.HazelcastCache.getOrLoad(HazelcastCache.java:244)
              ... 20 more
      

      These stacktraces are recorded at the rate of tens/hundreds of times per minute and can quickly fill up the default log configuration of ~120 MB of atlassian-confluence.log files with nothing but these entries. The logs can roll in a matter of minutes.

      The reason this is a problem is that it can cause the root cause of the node shutting down (e.g. the cluster panic event) to be rolled off the logs way before administrators have a chance to react/intervene to recover the node or manually pull the logs before rolling off. In other words, this flood of logging can hamper efforts for root cause analysis for large clients.

      Ideally, we can log this type of stacktrace once and minimize the verbosity future entries of a similar type instead of repeatedly logging the stacktrace.

      Workaround

      The current workaround is to extend the default logging so that more logs are kept in archive, in order to give administrators a greater window to capture the relevant logs. This can be done my modifying the following two parameters in <confluence_install>/confluence/WEB-INF/classes/log4j.properties, which influences size of log files and number of logs kept in archive, respectively:

      log4j.appender.confluencelog.MaxFileSize=20480KB
      log4j.appender.confluencelog.MaxBackupIndex=5
      

      Note however that increasing the logs kept will impact the disk footprint and size of Support Zips generated.

      Attachments

        Issue Links

          Activity

            People

              nhdang Nhan Dang
              rchang Robert Chang
              Votes:
              1 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: