JRASERVER-65197

JIRA Data Center Cache replication stops due to Heartbeat jobs failure

Details

    Description

      Summary

      JIRA Data Center's heartbeat jobs can be delayed, and eventually fail, when all Caesium threads (which run timed operations) are busy with other work. Cluster nodes can then stay inconsistent for longer than normal, because sync jobs cannot proceed until the Caesium queue is cleared.

      Instance Health Checks show two failed checks:

      • Cluster Cache Replication
      • Shared Home

      In the observed cases, the Caesium threads were busy running incoming mail handlers. Any other long-running scheduled task (directory sync, email sending, etc.) could cause the same problem.
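
      The starvation described above can be illustrated with a small sketch. It does not use Caesium; it assumes a plain `java.util.concurrent` scheduled pool as a stand-in for the shared scheduler, and the class and method names (`SharedPoolStarvation`, `heartbeatRanInTime`) are hypothetical. Once every worker is occupied by a long-running job, a newly queued heartbeat task cannot start until one of them finishes.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative stand-in for a shared scheduler pool; not Atlassian code.
final class SharedPoolStarvation {

    /**
     * Occupies up to poolSize workers with busyJobs blocking jobs, then submits
     * a "heartbeat" task and reports whether it ran within waitMillis.
     */
    static boolean heartbeatRanInTime(int poolSize, int busyJobs, long waitMillis) {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(poolSize);
        CountDownLatch release = new CountDownLatch(1); // keeps the long jobs blocked
        CountDownLatch started = new CountDownLatch(Math.min(busyJobs, poolSize));
        CountDownLatch heartbeat = new CountDownLatch(1);
        try {
            for (int i = 0; i < busyJobs; i++) {
                pool.execute(() -> {
                    started.countDown();
                    try {
                        release.await(); // simulates a slow mail handler run
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            started.await(); // make sure the workers are really occupied
            pool.execute(heartbeat::countDown); // the heartbeat shares the same queue
            return heartbeat.await(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            release.countDown();
            pool.shutdownNow();
        }
    }
}
```

      With a pool of 4 and 4 blocked jobs, the heartbeat does not run inside the wait window; with only 3 blocked jobs, one worker is free and it runs promptly.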

      Environment

      • JIRA Data Center

      Steps to Reproduce

      1. Configure Mail handler to process an inbox with a very large number of emails (thousands).
      2. Leave JIRA Data Center running for 10-15 minutes after a restart.

      Expected Results

      Heartbeat jobs are not blocked or delayed by other long-running scheduled jobs.

      Actual Results

      The heartbeat job doesn't run, which causes the cluster keep-alive to time out and the nodes to be marked offline. As a result, cache replication stops working. Health checks report errors for cache replication between nodes and for communication with the Shared Home directory.

      Thread dumps show all 4 Caesium threads are busy handling emails. Example thread:

      "Caesium-1-1" daemon prio=5 tid=0x00000000000000a3 nid=0 runnable 
         java.lang.Thread.State: RUNNABLE
      	at java.net.SocketInputStream.socketRead0(Native Method)
      	- locked <0x0000000032c2276e> (a java.lang.Object)
      	at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:930)
      ...
      	at com.sun.mail.iap.ResponseInputStream.readResponse(ResponseInputStream.java:103)
      	at com.sun.mail.iap.Response.<init>(Response.java:114)
      	at com.sun.mail.imap.protocol.IMAPResponse.<init>(IMAPResponse.java:60)
      	at com.sun.mail.imap.protocol.IMAPProtocol.readResponse(IMAPProtocol.java:390)
      	at com.sun.mail.iap.Protocol.command(Protocol.java:354)
      	- locked <0x000000005994ebc6> (a com.sun.mail.imap.protocol.IMAPProtocol)
      	at com.sun.mail.imap.protocol.IMAPProtocol.fetch(IMAPProtocol.java:2113)
      ...
      	at com.sun.mail.imap.protocol.IMAPProtocol.peekBody(IMAPProtocol.java:1705)
      	at com.sun.mail.imap.IMAPMessage.getHeader(IMAPMessage.java:878)
      	- locked <0x000000007ace4ecd> (a java.lang.Object)
      	at com.atlassian.jira.plugins.mail.handlers.AbstractMessageHandler.getPrecedenceHeader(AbstractMessageHandler.java:1409)
      	at com.atlassian.jira.plugins.mail.handlers.AbstractMessageHandler.checkBulk(AbstractMessageHandler.java:473)
      	at com.atlassian.jira.plugins.mail.handlers.AbstractMessageHandler.canHandleMessage(AbstractMessageHandler.java:415)
      	at com.atlassian.jira.plugins.mail.handlers.CreateOrCommentHandler.handleMessage(CreateOrCommentHandler.java:54)
      	at com.atlassian.jira.service.services.mail.MailFetcherService$1.process(MailFetcherService.java:376)
      	at com.atlassian.jira.service.services.mail.MailFetcherService$MessageProviderImpl.getAndProcessMail(MailFetcherService.java:255)
      	at com.atlassian.jira.service.services.mail.MailFetcherService.runImpl(MailFetcherService.java:366)
      	at com.atlassian.jira.service.services.file.AbstractMessageHandlingService.run(AbstractMessageHandlingService.java:229)
      	at com.atlassian.jira.service.JiraServiceContainerImpl.run(JiraServiceContainerImpl.java:61)
      	at com.atlassian.jira.service.ServiceRunner.runService(ServiceRunner.java:62)
      	at com.atlassian.jira.service.ServiceRunner.runServiceId(ServiceRunner.java:44)
      	at com.atlassian.jira.service.ServiceRunner.runJob(ServiceRunner.java:32)
      	at com.atlassian.scheduler.core.JobLauncher.runJob(JobLauncher.java:153)
      	at com.atlassian.scheduler.core.JobLauncher.launchAndBuildResponse(JobLauncher.java:118)
      	at com.atlassian.scheduler.core.JobLauncher.launch(JobLauncher.java:97)
      	at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService.launchJob(CaesiumSchedulerService.java:443)
      	at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService.executeClusteredJob(CaesiumSchedulerService.java:438)
      	at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService.executeClusteredJobWithRecoveryGuard(CaesiumSchedulerService.java:462)
      	at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService.executeQueuedJob(CaesiumSchedulerService.java:390)
      	at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService$1.consume(CaesiumSchedulerService.java:285)
      	at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService$1.consume(CaesiumSchedulerService.java:282)
      	at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.executeJob(SchedulerQueueWorker.java:65)
      	at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.executeNextJob(SchedulerQueueWorker.java:59)
      	at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.run(SchedulerQueueWorker.java:34)
      	at java.lang.Thread.run(Thread.java:745)
         Locked ownable synchronizers:
      	- None
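
      To check a full thread dump for this pattern, a small helper along the following lines could count how many scheduler threads are currently inside the mail fetcher. This is only a sketch: `ThreadDumpScan` and `countThreadsInFrame` are hypothetical names, and it assumes the usual text layout shown above (a quoted thread name on the header line, followed by `at ...` frames).

```java
// Hypothetical helper for scanning a plain-text JVM thread dump.
final class ThreadDumpScan {

    /**
     * Counts threads whose name starts with threadNamePrefix and whose stack
     * contains at least one frame mentioning frameMarker.
     */
    static int countThreadsInFrame(String dump, String threadNamePrefix, String frameMarker) {
        int count = 0;
        boolean inMatchingThread = false;
        boolean alreadyCounted = false;
        for (String raw : dump.split("\\R")) {
            String line = raw.trim();
            if (line.startsWith("\"")) {
                // A quoted name marks the start of a new thread entry.
                inMatchingThread = line.startsWith("\"" + threadNamePrefix);
                alreadyCounted = false;
            } else if (inMatchingThread && !alreadyCounted
                    && line.startsWith("at ") && line.contains(frameMarker)) {
                count++;
                alreadyCounted = true; // count each thread once
            }
        }
        return count;
    }
}
```

      For the dump described here, calling it with the prefix `Caesium-` and the marker `MailFetcherService` would be expected to report all 4 Caesium threads.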
      

      Suggested Fixes

      • Implemented - Separate heartbeat service from shared Caesium service.
      • Process only unread messages?
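
      The first suggestion can be sketched as follows. This is not the actual JIRA fix, only an illustration of the idea using `java.util.concurrent`: the heartbeat gets its own single-thread executor, so it keeps ticking even when every worker in the shared pool is blocked. All names here (`DedicatedHeartbeat`, `heartbeatsWhileSharedPoolBusy`) are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustration only: a heartbeat isolated from a saturated shared pool.
final class DedicatedHeartbeat {

    /**
     * Blocks every worker of a shared pool, runs the heartbeat on a separate
     * executor, and returns how many heartbeats fired during observeMillis.
     */
    static int heartbeatsWhileSharedPoolBusy(int sharedPoolSize, long periodMillis, long observeMillis) {
        ScheduledExecutorService shared = Executors.newScheduledThreadPool(sharedPoolSize);
        ScheduledExecutorService heartbeatOnly = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch release = new CountDownLatch(1);
        AtomicInteger beats = new AtomicInteger();
        try {
            for (int i = 0; i < sharedPoolSize; i++) {
                shared.execute(() -> awaitQuietly(release)); // long-running jobs
            }
            // The heartbeat never competes with the jobs above for a thread.
            heartbeatOnly.scheduleAtFixedRate(
                    beats::incrementAndGet, 0, periodMillis, TimeUnit.MILLISECONDS);
            Thread.sleep(observeMillis);
            return beats.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            release.countDown();
            shared.shutdownNow();
            heartbeatOnly.shutdownNow();
        }
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

      The trade-off is one extra thread per node in exchange for a heartbeat whose timing no longer depends on how long other scheduled jobs take.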

      Workaround

      • If the heartbeat job is blocked or delayed by mail handler jobs, delete older, previously read messages from the inbox or move them elsewhere. This speeds up email processing, which frees the Caesium threads sooner and avoids the problem.

            People

              lwlodarczyk Lukasz Wlodarczyk
              znoorsazali Zul NS [Atlassian]
              Votes: 6
              Watchers: 19