Jira Data Center / JRASERVER-69652

Asynchronous cache replication can cause extra overhead with a large number of cache updates and many stale nodes


    • Severity 2 - Major
      Atlassian Update – 22 September 2025

      Dear Customers,

      We have investigated the bug again after implementing automated stale node removal in https://jira.atlassian.com/browse/JRASERVER-42916. From the information we collected, the cause of the issue was always linked to an unusual number of stale nodes, and the automated removal should help resolve most of those cases. To address situations where the default configuration isn’t sufficient, we also created a Knowledge Base article (https://support.atlassian.com/jira/kb/stale-no-heartbeat-nodes-negatively-affect-performance-in-jira-data-center/) explaining how to tweak the automation or the node removal process to better fit your needs and ensure your instance’s performance remains unaffected.

      Starting with Jira 11.2.0, the existing logged message identifying the detected stale nodes will also include a link to this article to facilitate the discovery of the issue and potential solutions.

      Based on this, we will close the bug. If the problem persists on your instance, please comment on this ticket or contact Atlassian Support so we can better understand the real impact and investigate other potential root causes beyond stale nodes.

      Best regards

      Jacek Foremski

      Principal Software Engineer, Jira DC


      Summary

      Asynchronous cache replication can cause extra overhead with a large number of cache updates and many stale nodes.

      Environment

      • Jira DC
      • A large number of stale nodes (see JRASERVER-42916)
      • Plugin (code) generating a large number of cache update events, e.g. reaching 2000 messages/min (a sketch of such code follows this list)
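
      For illustration, a minimal sketch of plugin code that would generate this volume of cache update events, assuming the standard Atlassian Cache API (com.atlassian.cache); the cache name and request handler are hypothetical:

        import com.atlassian.cache.Cache;
        import com.atlassian.cache.CacheManager;

        // Hypothetical illustration: every put() raises a cache update event
        // that Jira DC replicates asynchronously via a per-node local queue.
        public class NoisyCacheClient {
            private final Cache<String, String> cache;

            public NoisyCacheClient(CacheManager cacheManager) {
                // "my.plugin.noisy-cache" is an illustrative cache name.
                this.cache = cacheManager.getCache("my.plugin.noisy-cache");
            }

            public void handleRequest(String key) {
                // A put on every request easily reaches ~2000 replication
                // messages/min under moderate load.
                cache.put(key, String.valueOf(System.currentTimeMillis()));
            }
        }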

      Steps to Reproduce

      1. Open a URL that produces cache update events while computing its business logic
        • e.g. /rest/servicedesk/1/<PRJ>/webfragments/sections/sd-queues-nav,servicedesk.agent.queues,servicedesk.agent.queues.ungrouped
      2. Measure the response time and the number of replication events (a minimal timing harness is sketched after this list)
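
      A minimal timing harness for step 2, assuming Java 11+; the base URL, project key, and credentials are illustrative. It measures wall-clock response time only, so replication events still need to be counted on the server side.

        import java.net.URI;
        import java.net.http.HttpClient;
        import java.net.http.HttpRequest;
        import java.net.http.HttpResponse;

        public class ResponseTimeProbe {
            public static void main(String[] args) throws Exception {
                // Illustrative URL; substitute your base URL and project key.
                String url = "https://jira.example.com/rest/servicedesk/1/PRJ/webfragments/sections/"
                        + "sd-queues-nav,servicedesk.agent.queues,servicedesk.agent.queues.ungrouped";
                HttpClient client = HttpClient.newHttpClient();
                HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                        .header("Authorization", "Basic ...") // illustrative credentials
                        .GET()
                        .build();
                for (int i = 0; i < 10; i++) {
                    long start = System.nanoTime();
                    HttpResponse<String> response =
                            client.send(request, HttpResponse.BodyHandlers.ofString());
                    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                    System.out.printf("status=%d time=%dms%n", response.statusCode(), elapsedMs);
                }
            }
        }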

      Expected Results

      Performance does not degrade as the number of stale nodes grows.

      Actual Results

      Performance degrades as the number of stale nodes grows.

      • Thread dumps show many threads busy in the following stack:
          java.lang.Thread.State: RUNNABLE
        	at java.io.RandomAccessFile.writeBytes(Native Method)
        	at java.io.RandomAccessFile.write(RandomAccessFile.java:512)
        	at com.squareup.tape.QueueFile.writeHeader(QueueFile.java:184)
        	at com.squareup.tape.QueueFile.add(QueueFile.java:321)
        	- locked <0x00000003ce9b43e0> (a com.squareup.tape.QueueFile)
        	at com.squareup.tape.FileObjectQueue.add(FileObjectQueue.java:46)
        	at com.atlassian.jira.cluster.distribution.localq.tape.TapeLocalQCacheOpQueue.add(TapeLocalQCacheOpQueue.java:151)
        	at com.atlassian.jira.cluster.distribution.localq.LocalQCacheOpQueueWithStats.add(LocalQCacheOpQueueWithStats.java:115)
        	at com.atlassian.jira.cluster.distribution.localq.LocalQCacheManager.addToQueue(LocalQCacheManager.java:370)
        	at com.atlassian.jira.cluster.distribution.localq.LocalQCacheManager.addToAllQueues(LocalQCacheManager.java:354)
        	at com.atlassian.jira.cluster.distribution.localq.LocalQCacheReplicator.replicateToQueue(LocalQCacheReplicator.java:85)
        	at com.atlassian.jira.cluster.distribution.localq.LocalQCacheReplicator.replicatePutNotification(LocalQCacheReplicator.java:65)
        	at com.atlassian.jira.cluster.cache.ehcache.AbstractJiraCacheReplicator.notifyElementUpdated(AbstractJiraCacheReplicator.java:123)
        	at net.sf.ehcache.event.RegisteredEventListeners.internalNotifyElementUpdated(RegisteredEventListeners.java:228)
        	at net.sf.ehcache.event.RegisteredEventListeners.notifyElementUpdated(RegisteredEventListeners.java:206)
        ...
        
      • In one customer case, 15-20% of all threads were busy in replicateToQueue (a simplified model of the fan-out follows)
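
      The stack above explains the scaling: addToAllQueues appends every cache operation to one durable, file-backed queue per known cluster node, including stale nodes that never drain theirs, so each cache update pays a synchronous disk write per node. A simplified model of that fan-out (not the actual Jira code):

        import java.util.List;
        import java.util.Queue;

        // Simplified model: one durable queue per peer node means every cache
        // update costs O(number of nodes) synchronous writes, and queues for
        // stale nodes only ever grow.
        class CacheOpFanOut {
            private final List<Queue<byte[]>> perNodeQueues; // one per node, stale or not

            CacheOpFanOut(List<Queue<byte[]>> perNodeQueues) {
                this.perNodeQueues = perNodeQueues;
            }

            void addToAllQueues(byte[] serializedOp) {
                for (Queue<byte[]> queue : perNodeQueues) {
                    // In Jira this is a file-backed tape queue whose add()
                    // writes the entry and rewrites the header on disk
                    // while holding a lock.
                    queue.add(serializedOp);
                }
            }
        }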

      Notes

      None

      Workaround

      Clean up stale node data manually; see JRASERVER-42916. A sketch for finding stale nodes follows.
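
      To identify cleanup candidates, a sketch that lists nodes whose last heartbeat is older than two days, assuming JDBC access to the Jira database and the standard clusternodeheartbeat table (heartbeat_time assumed stored as epoch milliseconds):

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.PreparedStatement;
        import java.sql.ResultSet;

        public class StaleNodeReport {
            public static void main(String[] args) throws Exception {
                // Illustrative JDBC URL and credentials; point at the Jira database.
                try (Connection conn = DriverManager.getConnection(
                        "jdbc:postgresql://db.example.com/jiradb", "jira", "secret");
                     PreparedStatement ps = conn.prepareStatement(
                             "SELECT node_id, heartbeat_time FROM clusternodeheartbeat"
                                     + " WHERE heartbeat_time < ?")) {
                    long twoDaysAgo = System.currentTimeMillis() - 2L * 24 * 60 * 60 * 1000;
                    ps.setLong(1, twoDaysAgo);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            System.out.printf("stale node: %s (last heartbeat %d)%n",
                                    rs.getString("node_id"), rs.getLong("heartbeat_time"));
                        }
                    }
                }
            }
        }

      Removing the stale rows themselves is covered in JRASERVER-42916 and the Knowledge Base article linked above.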

              Assignee: Jacek Foremski
              Reporter: Andriy Yakovlev [Atlassian]
              Votes: 6
              Watchers: 22