  1. Jira Server and Data Center
  2. JRASERVER-65197

JIRA Data Center Cache replication stops due to Heartbeat jobs failure

    Details

      Description

      Summary

      JIRA Data Center's heartbeat jobs run on the shared Caesium scheduler threads (timed operations). When those threads are busy, the heartbeat jobs are delayed and can fail. Cluster nodes can then remain inconsistent for longer than normal, because sync jobs cannot proceed until the Caesium queue is cleared.

      Instance Health Checks show two failed checks:

      • Cluster Cache Replication
      • Shared Home

      In the observed cases, the Caesium threads were busy processing incoming mail handlers, but any other long-running scheduled task (directory sync, email sending, etc.) could have the same effect.
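The starvation described in the summary can be reproduced outside Jira with a plain fixed-size scheduler pool. The following is a minimal sketch, not Jira or Caesium code: the class name `SharedPoolStarvationDemo`, the method `heartbeatDelayMillis`, and the use of `Thread.sleep` to stand in for a slow mail fetch are all assumptions made for illustration. It occupies every worker with a long task and measures how long a short "heartbeat" task waits in the queue.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a shared scheduler pool; not Jira or Caesium code. */
final class SharedPoolStarvationDemo {

    /**
     * Occupies every worker of a fixed-size pool with a long task, then submits
     * a short "heartbeat" and returns how long it waited before it started.
     */
    static long heartbeatDelayMillis(int workers, long longTaskMillis) {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(workers);
        try {
            CountDownLatch allBusy = new CountDownLatch(workers);
            for (int i = 0; i < workers; i++) {
                pool.execute(() -> {
                    allBusy.countDown();
                    sleepQuietly(longTaskMillis); // simulates a slow mail fetch
                });
            }
            allBusy.await(); // every worker is now occupied

            long submittedAt = System.nanoTime();
            long[] startedAt = new long[1];
            CountDownLatch heartbeatRan = new CountDownLatch(1);
            pool.execute(() -> {
                startedAt[0] = System.nanoTime();
                heartbeatRan.countDown();
            });
            heartbeatRan.await(); // returns only once a worker became free
            return TimeUnit.NANOSECONDS.toMillis(startedAt[0] - submittedAt);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            pool.shutdownNow();
        }
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // 4 workers, matching the 4 Caesium threads seen in the thread dumps.
        System.out.println("heartbeat waited ~" + heartbeatDelayMillis(4, 500) + " ms");
    }
}
```

In this sketch the heartbeat cannot start until one of the long tasks finishes, so its delay grows with the duration of the long tasks rather than with its own schedule.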

      Environment

      • JIRA Data Center

      Steps to Reproduce

      1. Configure a mail handler to process an inbox containing a very large number of emails (thousands).
      2. Leave JIRA Data Center running for 10-15 minutes after a restart.

      Expected Results

      Heartbeat jobs are not blocked or delayed by other long-running scheduled jobs.

      Actual Results

      The heartbeat job doesn't run, which causes the cluster keep-alive to time out and the nodes to be marked offline. Cache replication then stops working. Health checks report errors for cache replication between nodes and for communication with the Shared Home directory.

      Thread dumps show all 4 Caesium threads are busy handling emails. Example thread:

      "Caesium-1-1" daemon prio=5 tid=0x00000000000000a3 nid=0 runnable 
         java.lang.Thread.State: RUNNABLE
      	at java.net.SocketInputStream.socketRead0(Native Method)
      	- locked <0x0000000032c2276e> (a java.lang.Object)
      	at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:930)
      ...
      	at com.sun.mail.iap.ResponseInputStream.readResponse(ResponseInputStream.java:103)
      	at com.sun.mail.iap.Response.<init>(Response.java:114)
      	at com.sun.mail.imap.protocol.IMAPResponse.<init>(IMAPResponse.java:60)
      	at com.sun.mail.imap.protocol.IMAPProtocol.readResponse(IMAPProtocol.java:390)
      	at com.sun.mail.iap.Protocol.command(Protocol.java:354)
      	- locked <0x000000005994ebc6> (a com.sun.mail.imap.protocol.IMAPProtocol)
      	at com.sun.mail.imap.protocol.IMAPProtocol.fetch(IMAPProtocol.java:2113)
      ...
      	at com.sun.mail.imap.protocol.IMAPProtocol.peekBody(IMAPProtocol.java:1705)
      	at com.sun.mail.imap.IMAPMessage.getHeader(IMAPMessage.java:878)
      	- locked <0x000000007ace4ecd> (a java.lang.Object)
      	at com.atlassian.jira.plugins.mail.handlers.AbstractMessageHandler.getPrecedenceHeader(AbstractMessageHandler.java:1409)
      	at com.atlassian.jira.plugins.mail.handlers.AbstractMessageHandler.checkBulk(AbstractMessageHandler.java:473)
      	at com.atlassian.jira.plugins.mail.handlers.AbstractMessageHandler.canHandleMessage(AbstractMessageHandler.java:415)
      	at com.atlassian.jira.plugins.mail.handlers.CreateOrCommentHandler.handleMessage(CreateOrCommentHandler.java:54)
      	at com.atlassian.jira.service.services.mail.MailFetcherService$1.process(MailFetcherService.java:376)
      	at com.atlassian.jira.service.services.mail.MailFetcherService$MessageProviderImpl.getAndProcessMail(MailFetcherService.java:255)
      	at com.atlassian.jira.service.services.mail.MailFetcherService.runImpl(MailFetcherService.java:366)
      	at com.atlassian.jira.service.services.file.AbstractMessageHandlingService.run(AbstractMessageHandlingService.java:229)
      	at com.atlassian.jira.service.JiraServiceContainerImpl.run(JiraServiceContainerImpl.java:61)
      	at com.atlassian.jira.service.ServiceRunner.runService(ServiceRunner.java:62)
      	at com.atlassian.jira.service.ServiceRunner.runServiceId(ServiceRunner.java:44)
      	at com.atlassian.jira.service.ServiceRunner.runJob(ServiceRunner.java:32)
      	at com.atlassian.scheduler.core.JobLauncher.runJob(JobLauncher.java:153)
      	at com.atlassian.scheduler.core.JobLauncher.launchAndBuildResponse(JobLauncher.java:118)
      	at com.atlassian.scheduler.core.JobLauncher.launch(JobLauncher.java:97)
      	at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService.launchJob(CaesiumSchedulerService.java:443)
      	at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService.executeClusteredJob(CaesiumSchedulerService.java:438)
      	at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService.executeClusteredJobWithRecoveryGuard(CaesiumSchedulerService.java:462)
      	at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService.executeQueuedJob(CaesiumSchedulerService.java:390)
      	at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService$1.consume(CaesiumSchedulerService.java:285)
      	at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService$1.consume(CaesiumSchedulerService.java:282)
      	at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.executeJob(SchedulerQueueWorker.java:65)
      	at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.executeNextJob(SchedulerQueueWorker.java:59)
      	at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.run(SchedulerQueueWorker.java:34)
      	at java.lang.Thread.run(Thread.java:745)
         Locked ownable synchronizers:
      	- None
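To check whether every Caesium worker is tied up in the same kind of job, a saved thread dump can be scanned for thread entries like the one above. This is a rough sketch under stated assumptions: the class `CaesiumDumpScan` and the method `countBusyIn` are hypothetical helpers, and the dump is assumed to use the layout shown above, where each thread entry begins with a line starting with the quoted thread name. The `Caesium-` name prefix and the `MailFetcherService` frame are taken from the example trace.

```java
/** Hypothetical helper for scanning a saved thread dump; not part of Jira. */
final class CaesiumDumpScan {

    /**
     * Counts threads named "Caesium-..." whose stack contains frameText.
     * Each thread is counted at most once, however often the frame appears.
     */
    static int countBusyIn(String threadDump, String frameText) {
        int count = 0;
        boolean inCaesiumThread = false;
        boolean alreadyCounted = false;
        for (String rawLine : threadDump.split("\\R")) {
            String line = rawLine.trim();
            if (line.startsWith("\"")) {
                // A quoted name at the start of a line opens a new thread entry.
                inCaesiumThread = line.startsWith("\"Caesium-");
                alreadyCounted = false;
            } else if (inCaesiumThread && !alreadyCounted && line.contains(frameText)) {
                count++;
                alreadyCounted = true;
            }
        }
        return count;
    }
}
```

For example, `CaesiumDumpScan.countBusyIn(dumpText, "MailFetcherService")` returning the same number as the size of the Caesium pool would be consistent with the situation described in this report, where all 4 threads were handling emails.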
      

      Suggested Fixes

      • Implemented - Separate heartbeat service from shared Caesium service.
      • Process only unread messages?
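The first suggestion, separating the heartbeat from the shared scheduler, can be illustrated with standard `java.util.concurrent` executors. This is only a sketch of the general idea and does not show how the fix was actually implemented in Jira: the class `DedicatedHeartbeatDemo`, its method name, and the sleep-based "long job" are assumptions. The heartbeat gets its own single-thread scheduler, so it keeps firing while every worker of the shared pool is occupied.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration of a heartbeat on its own scheduler thread. */
final class DedicatedHeartbeatDemo {

    /**
     * Saturates a shared pool for busyMillis and returns how many heartbeats
     * a separate single-thread scheduler managed to fire in that time.
     */
    static int heartbeatsWhileSharedPoolBusy(int sharedWorkers, long busyMillis, long periodMillis) {
        ExecutorService shared = Executors.newFixedThreadPool(sharedWorkers);
        ScheduledExecutorService heartbeat = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger beats = new AtomicInteger();
        try {
            // Occupy every shared worker, as long-running mail handlers would.
            for (int i = 0; i < sharedWorkers; i++) {
                shared.execute(() -> sleepQuietly(busyMillis));
            }
            // The heartbeat never competes with the shared workers for a thread.
            heartbeat.scheduleAtFixedRate(beats::incrementAndGet, 0, periodMillis, TimeUnit.MILLISECONDS);
            sleepQuietly(busyMillis);
            return beats.get();
        } finally {
            heartbeat.shutdownNow();
            shared.shutdownNow();
        }
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("heartbeats fired: " + heartbeatsWhileSharedPoolBusy(4, 500, 50));
    }
}
```

The trade-off in this design is one extra thread reserved for a job that is idle most of the time, in exchange for a heartbeat whose timing no longer depends on how long other scheduled jobs run.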

      Workaround

      • If the heartbeat job is blocked or delayed by mail handler jobs, delete or move the older, previously read messages out of the inbox. This speeds up email processing, which frees the Caesium threads sooner and avoids the problem.

              People

              Assignee:
              lwlodarczyk Lukasz Wlodarczyk
              Reporter:
              znoorsazali Zul NS [Atlassian]
              Votes:
              6
              Watchers:
              19
